Logging Policy

Introduction

This Logging Policy was born out of an internal effort at Blueground to rethink observability as our tech footprint kept expanding. With more services, more data, and more engineers involved, we needed a consistent way to capture logs that were actually useful, not just noise.

What started as a set of internal guidelines and reference snippets grew into a full policy we now use across our teams. We decided to open-source it in case others find themselves facing similar challenges.

We hope you’ll find it useful whether you adopt it as-is, adapt parts of it, or simply take inspiration for your own approach to logging.

Purpose

This article outlines standardized logging practices that engineering teams can adopt to ensure consistency and operational excellence across applications.

Log analysis is an essential tool for monitoring, troubleshooting, and auditing software systems. The effectiveness of these processes hinges on the standardization of log formats and practices. This guide establishes the logging standards for applications at our organization, ensuring uniformity and efficacy across our technology stack.

Logs are the stream of aggregated, time-ordered events collected from the output streams of all running processes and backing services. Logs in their raw form are typically a text or structured format with one event per line. Logs have no fixed beginning or end, but flow continuously as long as the app is operating.

Terms of Use

This Logging Policy was originally created for internal use within Blueground. We are open-sourcing it in the hope that it’s useful to others, whether as a ready-to-use baseline or as inspiration for your own practices.
By using this Logging Policy (including the documentation and reference implementations), you agree to the following:

  1. Free to Use and Adapt

    • You may copy, modify, and use this Logging Policy in your own projects.

    • You may not redistribute it as a product, package, or service under your own name. Sharing links to this repository or referencing it in your work is always welcome.

  2. No Warranty

    • This Logging Policy is provided "as is", without warranty of any kind.

    • It was designed around our own systems and may not fit every use case.

  3. No Liability

    • We are not responsible for any issues, damages, or losses that may result from your use of this Logging Policy.

    • You assume all responsibility for adapting and applying it in your own environment.

  4. Attribution

    • If you use or adapt significant portions of this policy, a simple attribution back to this repository or mention of its source is appreciated (but not required).

The 10 commandments of logging

This policy fully endorses the 10 commandments of logging - which also happens to be a fun read 😆

https://www.masterzen.fr/2013/01/13/the-10-commandments-of-logging

What to log

Keep a balance between too much and too little logging.

Make sure you always log:

  • Application initialization

  • Logging initialization

  • Incoming requests

  • Outgoing requests

  • "Use case" initiation/completion

  • Application errors

  • Input and output validation failures

  • Authentication successes and failures

  • Authorization failures

  • Session management failures

  • Privilege elevation successes and failures

  • Other higher-risk events, like data import and export or bulk updates

Consider logging other events that address these use cases:

  • Troubleshooting

  • Monitoring and performance improvement

  • Testing

  • Understanding user behavior

  • Security and auditing

Log Format: JSON

JSON by default

All first-party applications should output their logs in JSON format, without exception, while running with their default configuration (typically in production and staging deploys).

{
  "timestamp": "1714131374032",
  "logger": "foo.bar",
  "msg": "This is a log message",
  "host": "i-f823e12ac",
  "service": "example-application",
  "source": "k8s/spring-boot",
  "status": "info",
}

Example: log message in JSON

Why a structured Log Format?

βœ… Allows for a custom log data model

βœ… Easy parsing

Semi-structured log formats like Syslog and the Common Log Format, once favored for their compact size and human readability, present challenges due to error-prone parsing, often handled through Grok expressions. Although their smaller byte size was advantageous when storage costs were high, this benefit has diminished. Modern log management providers, such as Datadog or Loggly, charge based on the number of indexed log events (not their size*). Moreover, they address readability issues by rendering JSON logs in easily interpretable formats, making structured formats more appealing despite their larger size. Finally, semi-structured log formats typically make assumptions about the data being logged, while JSON allows us to customize our log data model.


* Actually, they do charge for bytes of ingested data, but these costs are an order of magnitude lower than those of indexing.

TEXT in development

While JSON may be perfect for tools and log management providers, it makes it hard for humans to pinpoint and read the actual log message. In development mode, applications should take a hybrid approach and display each log event as follows:

 <TIMESTAMP> <STATUS> [<THREAD>] [<LOGGER>] <MSG>
 <attributes in JSON or key-value pairs>
  • TIMESTAMP: the log timestamp in the user/system date format/timezone

  • STATUS: the log level/status as a string, e.g. INFO, WARN

  • THREAD: the name of the thread that emitted the log (where applicable)

  • LOGGER: the code component that emits the log

  • MSG: the log message

[Fri Apr 26 2024 14:36:17] | INFO | This is a log message
{
  "host": "i-f823e12ac",
  "service": "example-application",
  "source": "k8s/spring-boot"
}

Example: log message in TEXT + JSON

[Fri Apr 26 2024 14:36:17] | INFO | This is a log message
  host=i-f823e12ac
  service=example-application
  source=k8s/spring-boot

Example: log message in TEXT + Key-Value pairs

Use delimiters and/or colors.

Use a delimiter to separate between the different parts of the log message.

[Fri Apr 26 2024 14:36:17] | INFO | This is a log message

Or color code them using ANSI color escape sequences.

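For illustration, here is a minimal Kotlin sketch of status color-coding with ANSI escape sequences (the colorize helper and its color choices are our own illustration, not part of any logging library):

fun colorize(status: String): String {
    val color = when (status.uppercase()) {
        "ERROR", "FATAL" -> "\u001B[31m" // red
        "WARN" -> "\u001B[33m"           // yellow
        "INFO" -> "\u001B[32m"           // green
        else -> "\u001B[37m"             // white
    }
    return "$color$status\u001B[0m"      // reset color at the end
}

For example, colorize("INFO") renders INFO in green on any ANSI-capable terminal.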

Log Routing: STDOUT

A cloud-native app NEVER concerns itself with routing or storage of its output stream.

It should not attempt to write to or manage log files. Instead, each running process writes its event stream, unbuffered, to stdout. During local development, the developer will view this stream in the foreground of their terminal to observe the app's behavior.

In staging or production deploys, each process stream will be captured by the execution environment (Datadog agent for apps running in K8S), collated together with all other streams from the app, and routed to one or more final destinations for viewing and long-term archival. These archival destinations are not visible to or configurable by the app and instead are completely managed by the execution environment.

The stream can be sent to a log indexing and analysis system - we currently use Datadog - or a general-purpose data warehousing system such as Hadoop/Hive. These systems allow for great power and flexibility for introspecting an app's behavior over time, including:

  • Finding specific events in the past.

  • Large-scale graphing of trends (such as requests per minute).

  • Active alerting according to user-defined heuristics (such as an alert when the quantity of errors per minute exceeds a certain threshold).

What if I want to collect logs in a file during development?

Inside a terminal? Use tee to view logs on stdout but also store to a file

$ ./run-my-app | tee .logs/log.out

Inside an IDE?

Then, you should have the option to save the output to a file. For example, IntelliJ IDEA lets you save the console output to a file through its run configuration settings.

Log Levels

Applications should emit log events mapping to one or more of the following log levels.

Name  | Severity | Description
------|----------|------------
FATAL | High     | The service/app is going to stop or become unusable now. An engineer should definitely look into this soon.
ERROR | Mid      | Fatal for a particular request, but the service/app continues servicing other requests. An engineer should look at this soon(ish).
WARN  | Low      | A note on something that should probably be looked at by an engineer eventually.
INFO  | None     | Detail on regular operation.
DEBUG | None     | Anything else, i.e., it is too verbose to be included at the "info" level.
TRACE | None     | Logging from external libraries used by your app or very detailed application logging.

What do TRACE log messages look like?

The following types of messages are probably appropriate at the TRACE level:


Entering <class name>.<method name>, <argument name>: <argument value>, [<argument name>: <argument value>]
Method <class name>.<method name> completed [, returning: <return value>]
<class name>.<method name>: <description of some action being taken, complete with context information>
<class name>.<method name>: <description of some calculated value, or decision made, complete with context information>
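As a hypothetical Kotlin/slf4j illustration of these patterns (class, method, and value names are made up):

import org.slf4j.LoggerFactory

private val logger = LoggerFactory.getLogger("BookingService")

fun createBooking(request: String): String {
    logger.trace("Entering BookingService.createBooking, request: {}", request)
    val bookingId = "BKG-1" // hypothetical computed value
    logger.trace("BookingService.createBooking: resolved booking id {}", bookingId)
    logger.trace("Method BookingService.createBooking completed, returning: {}", bookingId)
    return bookingId
}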

Log Level Mapping

Once deployed, the application logs will be sent to our log management provider (Datadog).

For each log event to be mapped to the correct status on Datadog, you can do one of the following:

  1. Configure a log status remapper to map your custom log level to Datadog's log status

  2. Use the following codification (recommended)

{
  // Note that the name of the property 
  // is "status" not "level"
  "status": "TRACE|DEBUG|INFO|WARN|ERROR|FATAL"
}

For reference, here's how Datadog remaps each incoming status value:

  • Integers from 0 to 7 map to the Syslog severity standards

  • Strings beginning with emerg or f (case-insensitive) map to emerg (fatal)

  • Strings beginning with a (case-insensitive) map to alert

  • Strings beginning with c (case-insensitive) map to critical

  • Strings beginning with e (case-insensitive) that do not match emerg map to error

  • Strings beginning with w (case-insensitive) map to warning

  • Strings beginning with n (case-insensitive) map to notice

  • Strings beginning with i (case-insensitive) map to info

  • Strings beginning with d, trace or verbose (case-insensitive) map to debug

  • Strings beginning with o or s, or matching OK or Success (case-insensitive) map to OK

  • All others map to info

Log Data Model

Adopting a standardized data model for logging offers several key benefits:

  • Unified Understanding: Establishes a common framework for what constitutes a log event, ensuring clarity across all teams.

  • Clear Semantics: Provides log attributes with well-defined meanings, removing ambiguity and promoting consistent interpretation.

  • Enhanced Troubleshooting: Facilitates more effective and efficient problem-solving across applications in a microservices-based architecture.

Embracing a "Shift Left" approach, we aim for our log events to conform to this model right from their origin. In cases where direct adoption is impractical or complex, a Datadog log pipeline should be utilized to transform the log events into this data model.

Model definition

Consider a log event as a collection of attributes, each akin to a field in a flattened JSON object.

LOG EVENT = ATTRIBUTE1 + ATTRIBUTE2 + ...

Accordingly, the data model is described as a series of attribute specifications.

To maximize efficiency and leverage established practices, we build upon Datadog's default set of standard log attributes and split those attributes into functional domains like Network, HTTP, and more.

Reserved attributes

A "fixed" set of attributes that bear special semantics on Datadog. Reserved attributes are automatically identified and parsed.

Path | Type | Required | Description
-----|------|----------|------------
host | string | Automatically added by the Datadog Agent | The name of the originating host as defined in metrics. Datadog automatically retrieves corresponding host tags from the matching host in Datadog and applies them to your logs. The Agent sets this value automatically. Example: myapp-5c74d5d5d4-trqg8
source | string | Automatically added by the Datadog Agent | This corresponds to the integration name: the technology from which the log originated. When it matches an integration name, Datadog automatically installs the corresponding parsers and facets (for example, nginx or PostgreSQL). Example: nodejs, java
status | string | Always | This corresponds to the level of a log. It is used to define patterns and has a dedicated layout in the Datadog Log UI. Example: warn, info
service | string | Automatically added by the Datadog Agent | The name of the application or service generating the log events. It is used to switch from Logs to APM, so make sure you define the same value when you use both products. Example: myapp
trace_id | string | Automatically added by the Datadog Agent | This corresponds to the Trace ID used for traces. It is used to correlate your log with its trace. Example: 6631532b000121232131241412412
message | string | Always | By default, Datadog ingests the value of the message attribute as the body of the log entry. That value is then highlighted and displayed in Live Tail, where it is indexed for full-text search.
date | string or number | Always | The log event creation timestamp. Example: 2024-04-30T11:06:20.812538642Z

Source code attributes

This attribute set helps identify the origin of the log event in the source code.

Path | Type | Required | Description
-----|------|----------|------------
logger.name, name | string | Always | The name of the logger. Example: TaskHandler
logger.thread_name | string | Multithreaded apps | The name of the current thread when the log is fired. Example: task_consumer_1
logger.method_name | string | Optional | The class method name. Example: handleTask
logger.version | string | Optional | The version of the logger. Example: 1.2.3

Network attributes

These attributes are related to the data used in network communication.
All fields and metrics are prefixed by network

Path | Type | Required | Description
-----|------|----------|------------
network.bytes_read | number | Network requests | Total number of bytes transmitted from the client to the server when the log is emitted. Example: 2048
network.bytes_written | number | Network requests | Total number of bytes transmitted from the server to the client when the log is emitted. Example: 2048
network.client.external_ip, network.client.ip | string | Network requests | The IP address of the original client that initiated the inbound connection. Example: 192.11.22.02
network.client.external_port, network.client.port | string | Network requests | The port of the original client that initiated the connection. Example: 1903
network.client.internal_ip | string | Network requests | The IP address of the internal host proxying the connection. Typically, a load balancer or another pod in the same k8s cluster. Example: 10.244.0.22
network.destination.ip | string | Network requests | For outbound connections, the destination IP. Example: 192.11.22.02
network.destination.port | number | Network requests | The remote port number of the outbound connection. Example: 443

Applications should try to include those attributes in their network-related requests, both incoming and outgoing. See the full list of supported network and geolocation attributes.

Error attributes

These attributes are related to error-specific data and are required for all error events.

Path | Type | Required | Description
-----|------|----------|------------
error.kind | string | Mapped by Datadog | The error type or kind (or code in some cases). E.g. BadRequest or CardError
error.message | string | Mapped by Datadog | A concise, human-readable, one-line message explaining the event. E.g. Booking duration violates minimum stay rules
error.stack | string | Errors | The stack trace or complementary information about the error.

HTTP attributes

Required for all events that log HTTP requests. As they describe the HTTP request itself, they should not be propagated to downstream log events (MDC) or logged out of context.

Path | Type | Required | Description
-----|------|----------|------------
http.url | string | HTTP requests | The URL of the HTTP request.
http.referer | string | HTTP requests | HTTP header field that identifies the address of the webpage that linked to the resource being requested.
http.method | string | HTTP requests | The HTTP method of the request, e.g. GET, POST.
http.status_code | string | HTTP requests | The HTTP response status code.
http.useragent | string | HTTP requests | The User-Agent header received with the request.
http.version | string | HTTP requests | The version of HTTP used for the request.
http.url_details.host | string | HTTP requests | The HTTP host part of the URL.
http.url_details.port | number | HTTP requests | The HTTP port part of the URL.
http.url_details.path | string | HTTP requests | The HTTP path part of the URL.
http.url_details.queryString | object | HTTP requests | The HTTP query string parts of the URL decomposed as query params key/value attributes.
http.url_details.scheme | string | HTTP requests | The protocol name of the URL (HTTP or HTTPS).
http.useragent_details.os.family | string | HTTP requests | The OS family reported by the User-Agent.
http.useragent_details.browser.family | string | HTTP requests | The Browser family reported by the User-Agent.
http.useragent_details.device.family | string | HTTP requests | The Device family reported by the User-Agent.

Performance attributes

Required in any log that describes a task whose performance we are interested in. E.g., an HTTP request, a DB operation or any I/O operation, a CPU crunching computation, etc.

Path | Type | Required | Description
-----|------|----------|------------
duration | number (nanoseconds) | When applicable | A duration of any kind in nanoseconds: HTTP response time, database query time, latency, and so on. ⭐ Remap any durations within logs to this attribute because Datadog displays and uses it as a default measure for trace search.

User attributes

All execution flows starting from a user action should capture the user information in their log events.

Path | Type | Logged | Description
-----|------|--------|------------
usr.id | string | When part of a user request | The user identifier, e.g. 12752; 0 for unknown users
usr.name | string | When part of a user request | The user's name, e.g. John Doe, Unknown
usr.email | string | When part of a user request | The user's email, e.g. john@foo.bar

Domain attributes

Domain attributes capture COMMON domain-specific information. By common we mean that those attributes are organization-wide and can be found across services and apps. Domain attributes are typically provided via a context propagation mechanism like MDC.

Path | Type | Logged | Description
-----|------|--------|------------
domain.market | string | When applicable | The organization market as a 3-letter city code, e.g. NYC, IST, DXB
domain.property.id | number | When applicable | The property id, e.g. 46245
domain.property.code | string | When applicable | The property code, e.g. NYC-345
domain.booking.id | number | When applicable | The booking id, e.g. 663145
domain.booking.code | string | When applicable | The booking code, e.g. NYC-3405
domain.booking.version | number | When applicable | The booking version, e.g. 1
domain.task.id | number | When applicable | The task id, e.g. 663145
domain.* | string | When applicable | Any domain-specific attribute related to the log event

Service attributes

Service attributes are service-specific domain attributes. To avoid name "collisions" that may prevent us from using features such as Log Facets, the service attributes are placed under a [service] key instead of domain.

Why both domain.* and [service].* attributes?


If service-specific domain attributes were placed under domain, we would get values under the same key with totally different semantics.

Example:


For Hermes, channel may be slack, while for PCM it may be apartments.com. By placing channel under hermes.channel and pcm.channel respectively, we get the isolation required to build effective filters and facets.

Path | Type | Logged | Description
-----|------|--------|------------
[service].* | any | When applicable | A service-specific domain attribute. E.g. app1_name.channel, app2_name.paymentGateway, app3_name.channel, app4_name.priceStrategy

Other attributes

Path | Type | Logged | Description
-----|------|--------|------------
correlation_id | string | Always | A trace ID that is decoupled from APM. As such, it should be present in all log events regardless of whether the event is sampled by APM. For traces starting from an external incoming HTTP request, it should default to X-Amzn-Trace-Id. See ALB request tracing.
entrypoint | string | Always | Identifies the "port" through which the request reached the application, e.g. http/api, kafka/consumer, background_job
clientinfo.id | string | service-to-service | The ID of the client service. Typically, the pod UID when referring to a k8s service.
clientinfo.name | string | service-to-service | The name of the client service, e.g. hermes
team | string | Always | The team responsible for interpreting/monitoring this log event, e.g. pricing or pms/bookings

How to generate the correlation_id?

Format: version-timestamp-uniqueid where:

  • version: 1

  • timestamp: Epoch in seconds

  • uniqueid: KSUID
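Putting the format together, here is a minimal Kotlin sketch. Note that the policy calls for a KSUID as the unique id; java.util.UUID is used below only as a stand-in, since no specific KSUID library is mandated here:

import java.util.UUID

fun newCorrelationId(): String {
    val version = 1
    val timestamp = System.currentTimeMillis() / 1000 // epoch in seconds
    // Stand-in for a KSUID; swap in your KSUID generator of choice
    val uniqueId = UUID.randomUUID().toString().replace("-", "")
    return "$version-$timestamp-$uniqueId"
}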

How to propagate the correlation_id?

JVM

Use Mapped Diagnostic Context

Node.js

Use AsyncLocalStorage - example with Express
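For the JVM case, a minimal sketch of reading the x-correlation-id header into the MDC, assuming the jakarta.servlet API and the newCorrelationId helper sketched above:

import jakarta.servlet.Filter
import jakarta.servlet.FilterChain
import jakarta.servlet.ServletRequest
import jakarta.servlet.ServletResponse
import jakarta.servlet.http.HttpServletRequest
import org.slf4j.MDC

class CorrelationIdFilter : Filter {
    override fun doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain) {
        val incoming = (req as? HttpServletRequest)?.getHeader("x-correlation-id")
        MDC.put("correlation_id", incoming ?: newCorrelationId()) // accept or mint a new id
        try {
            chain.doFilter(req, res)
        } finally {
            MDC.remove("correlation_id") // avoid leaking the id across pooled threads
        }
    }
}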

The log message

Generic

Standard log events should use the following generic message format in the default configuration.

[SCOPE] TEXT (key=value)*
  • SCOPE: Optional context in square brackets. Typically, the name of a use case, command, or edge case being handled. E.g. [CreateBooking], [UpdateUser], [SendEmailJob]. The context is redundant when it's semantically equivalent to the logger.name

  • TEXT: The human-readable description of the log event. See Writing style for details.

  • (key=value)*: Key/Value pairs go at the end and are enclosed in parentheses.
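For example (all names and values below are illustrative):

[CreateBooking] Booking created (bookingId=663145) (market=NYC)
[SendEmailJob] Email dispatched (template=booking_confirmation) (recipient=12752)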

Writing style

To maintain consistency and readability, adhere to the following guidelines for the tone, style, and tense of log messages:

  1. Tone: Use a neutral and professional tone. Avoid slang, jargon, or overly technical language that may not be easily understood by all team members.

  2. Style: Be concise and clear. Ensure that each log message provides sufficient context to understand the issue without being verbose. Use structured formats to facilitate easy parsing and analysis.

  3. Tense: Use appropriate tenses based on the action being logged. Present tense can be used for ongoing actions, past tense for completed actions, and present perfect for recent completions relevant to the current context.

    • Ongoing Actions (Present Tense): Use the present tense for actions that are currently happening.

      • Example: Starting the payment processing

    • Completed Actions (Past Tense): Use past tense for actions that have already been completed.

      • Example: Payment processed successfully

    • Recent Completions (Present Perfect): Use present perfect tense for actions that happened in the past but are still relevant (with a result in the present):

      • Example: File has been deleted

HTTP

HTTP log events follow a particular message format that allows for very fast eye scanning among hundreds of events. Since applications typically both accept and issue HTTP requests, we employ a slightly different format between the two.

INCOMING REQUEST

[req] METHOD PATH 

Should we log incoming requests or only responses?


Recommendation: No by default, but make it available under a feature flag.

  • Logging incoming requests == increased logging management costs.

  • Logging incoming requests == increased log verbosity

  • Logging incoming requests == ability to identify poisonous requests that may hang the server before it gets a chance to log a response. Without logging incoming requests, poisonous requests could go "stealth".

OUTGOING RESPONSE

[res] METHOD PATH STATUS_CODE STATUS_TEXT (duration)

OUTGOING REQUEST (e.g., Axios, fetch, RestClient)

-> [req] [host] METHOD PATH

INCOMING RESPONSE

<- [res] [host] METHOD PATH STATUS_CODE STATUS_TEXT (duration)

Examples:

# INCOMING REQUEST
[req] GET /foo/bar?qux=baz
[req] POST /foo/bar

# OUTGOING RESPONSE
[res] GET /foo/bar?qux=baz 200 OK (0.17s)
[res] POST /foo/bar 201 CREATED (0.52s)

# OUTGOING REQUEST
-> [req] [foo-svc] GET /foo/bar?qux=baz
-> [req] [exchangerate.com] GET /api/rates

# INCOMING RESPONSE
<- [res] [foo-svc] GET /foo/bar?qux=baz 200 OK (0.2s)
<- [res] [exchangerate.com] GET /api/rates 200 OK (0.3s)

GraphQL

GraphQL requests are somewhat special HTTP requests:

  • They cannot be differentiated by the HTTP path, which is the same for all, typically /graphql

  • Are highly polymorphic

    • The same query may list different attributes

    • The same request may contain multiple queries

For the reasons above, to fully capture GraphQL requests in our log events, we add extra, specific metadata under the graphql JSON property.

  • operationType: Whether it is a query, mutation, or subscription.

  • operationName: The name of the query or mutation being executed.

  • operationBody: The actual GraphQL query or mutation string.

  • variables: Any variables passed along with the query.

  • responseTimeMs: The time taken to execute the query, in milliseconds.

  • responseStatus: Success or error status of the query execution.

  • errors: Detailed error messages if the query fails.

  • performanceMetrics: Optional, but can include detailed timing information for various parts of the query execution.


import { z } from "zod"

export const schema = z.object({
  graphql: z.object({
    operations: z.array(
      z.object({
        operationType: z.union([
          z.literal("query"), 
          z.literal("mutation"), 
          z.literal("subscription")
        ]),
        operationName: z.string(),
        operationBody: z.string(),
        variables: z.object({}).passthrough(),
        responseTimeMs: z.number(),
        responseStatus: z.string(),
        errors: z.array(z.string()).nullable(),
        performanceMetrics: z.object({
          parsingTimeMs: z.number(),
          validationTimeMs: z.number(),
          executionTimeMs: z.number()
        }).optional()
      })
    )
  })
})

graphql zod schema

Example:

{
  // ...
  "msg": "POST /graphql 200 OK",
  "graphql": {
    "operations": [
      {
        "operationType": "query",
        "operationName": "getUser",
        "operationBody": "query getUser($id: ID!) { user(id: $id) { id, name, email } }",
        "variables": {
          "id": "user_456"
        },
        "responseTimeMs": 125,
        "responseStatus": "success",
        "errors": null,
        "performanceMetrics": {
          "parsingTimeMs": 10,
          "validationTimeMs": 5,
          "executionTimeMs": 110
        }
      },
      {
        "operationType": "mutation",
        "operationName": "createUser",
        "operationBody": "mutation createUser($name: String!, $email: String!) { id }",
        "variables": {
          "name": "Joe",
          "email": "joe@test.com"
        },
        "responseTimeMs": 125,
        "responseStatus": "success",
        "errors": null,
        "performanceMetrics": {
          "parsingTimeMs": 10,
          "validationTimeMs": 5,
          "executionTimeMs": 110
        }
      }
    ]
  }
}

Best Practices

Correlation IDs

All apps should accept and propagate correlation IDs from all entry points to all exit points.

HTTP

  1. Inbound HTTP requests should set the MDC from the x-correlation-id HTTP header

  2. Outbound HTTP requests should set the x-correlation-id HTTP header from the MDC

Kafka

  1. Message consumers should set the MDC from the x-correlation-id Message header

  2. Message producers should set the Message x-correlation-id header from the MDC

RabbitMQ / AMQP

  1. Message consumers should set the MDC from the correlation-id Message property

  2. Message producers should set the Message correlation-id property from the MDC
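As a minimal Kotlin sketch of the Kafka case, assuming the standard Kafka client API and the correlation_id MDC key used throughout this policy:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.clients.producer.ProducerRecord
import org.slf4j.MDC

// Producer side: copy the MDC value into the message header
fun <K, V> withCorrelationId(record: ProducerRecord<K, V>): ProducerRecord<K, V> {
    MDC.get("correlation_id")?.let { id ->
        record.headers().add("x-correlation-id", id.toByteArray(Charsets.UTF_8))
    }
    return record
}

// Consumer side: restore the MDC from the message header
fun setMdcFrom(record: ConsumerRecord<*, *>) {
    val header = record.headers().lastHeader("x-correlation-id")
    header?.value()?.let { MDC.put("correlation_id", String(it, Charsets.UTF_8)) }
}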

Writing a good log message

  1. Follow a consistent format: See the log message format.

  2. Be clear and concise: Ensure that your log messages are easy to understand. Use simple and clear language to describe what is happening in the system.

  3. Include Relevant Context: Provide enough context to make the log message useful for debugging. Include information such as user IDs, request IDs, or relevant state information. See logging attributes.

Some examples:

❌ Error in module X
βœ… Failed to connect to database in user authentication module. (userID=12345) (action=login)
❌ User not found
βœ… User not found during password reset request. (userID=98765) (requestID=abc123)
❌ Order failed
βœ… Order processing failed due to invalid payment method. (orderID=54321) (paymentMethod=Bitcoin)
❌ Payment error
βœ… Payment failed. (userID=12345) (orderID=98765) (errorCode=PAYMENT_DECLINED)

Don't log at the wrong layer

Log where you have the right context for the type of log event you want to create.

This is best described through an example:

πŸ˜• Bad


fun validateAddress(address: Address): Boolean {
    if (address.street.isNullOrBlank()) {
        // We don't have enough context here since validateAddress 
        // may be used in a dozen places.
        logger.info("Street can't be blank")
        return false
    }

    if (address.zipCode.length < 5) {
        // We don't have enough context here since validateAddress 
        // may be used in a dozen places.
        logger.info("Zip code should be at least 5 chars")
        return false
    }

    return true
}

fun createProperty(property: PropertyDetails) {
    val isAddressValid = validateAddress(property.address)
    if (!isAddressValid) {
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val isAddressValid = validateAddress(profile.address)
    if (!isAddressValid) {
        throw InvalidAddressError()
    }
}

😐 Better


fun validateAddress(address: Address): Boolean {
    if (address.street.isNullOrBlank()) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Street can't be blank. (street=${address.street})")
        return false
    }

    if (address.zipCode.length < 5) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Zip code should be at least 5 chars. (zipCode=${address.zipCode})")
        return false
    }

    return true
}

fun createProperty(property: PropertyDetails) {
    val isAddressValid = validateAddress(property.address)
    if (!isAddressValid) {
        logger.info("[CreateProperty] Validation failed. Invalid address ({})", property.address)
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val isAddressValid = validateAddress(profile.address)
    if (!isAddressValid) {
        logger.info("[CreateUserProfile] Validation failed. Invalid address ({})", profile.address)
        throw InvalidAddressError()
    }
}

πŸ‘Œ Best


fun validateAddress(address: Address): AddressValidation {
    if (address.street.isNullOrBlank()) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Street can't be blank. (street=${address.street})")
        return AddressValidation.INVALID_STREET
    }

    if (address.zipCode.length < 5) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Zip code should be at least 5 chars. (zipCode=${address.zipCode})")
        return AddressValidation.INVALID_ZIP
    }

    return AddressValidation.OK
}

fun createProperty(property: PropertyDetails) {
    val addressValidation = validateAddress(property.address)
    if (addressValidation != AddressValidation.OK) {
        logger.info("[CreateProperty] Address validation failed (type={})", addressValidation)
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val addressValidation = validateAddress(profile.address)
    if (addressValidation != AddressValidation.OK) {
        logger.info("[CreateUserProfile] Address validation failed (type={})", addressValidation)
        throw InvalidAddressError()
    }
}

Logging errors

When to log errors

  • Log all global exceptions e.g. via an UncaughtExceptionHandler on the JVM

  • Log at the layer where you are handling the error

  • Use log and rethrow with caution
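For the first point, a minimal JVM sketch using a default uncaught exception handler (the logger name is illustrative):

import org.slf4j.LoggerFactory

fun installGlobalExceptionLogger() {
    val logger = LoggerFactory.getLogger("GlobalExceptionHandler")
    Thread.setDefaultUncaughtExceptionHandler { thread, throwable ->
        // The thread is terminating, so treat this as a fatal-level event
        logger.error("Uncaught exception on thread ${thread.name}", throwable)
    }
}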

Log and rethrow


try {
  // Some code that might throw an exception
  throw KaboomError("Simulated error")
} catch (e: RuntimeException) {
  // Transforming and rethrowing the exception
  val transformedException = BangError("Kaboom", e)
  logger.error("Kaboom: ${e.message}", e)
  throw transformedException
}


Why you should avoid

  • Duplicate logging

  • Loss of context if not done properly

  • Performance/cost (if on a hot path)


Why you may want it

  • If you need to log the error at different log levels

  • If you need to make sure you add contextual info at a lower level

  • If you want to transform an exception into a more meaningful one that fits the abstraction of the current layer.

What to log as an error

  • Unhandled exceptions

  • Unexpected Errors

    • Database Errors

    • Failed calls to external APIs or microservices.

    • Critical business operations that fail, such as order processing or payment transactions.

  • Resource Failures

    • Insufficient resources (e.g., memory, disk space).

    • Failures to read or write critical files.

    • Network connectivity issues affecting application performance.

  • HTTP 500s should also be logged as errors

How to log errors with slf4j


logger.error("Failed to do something: ${e.message}", e)

How to log errors with pino


logger.error({ err }, `Failed to do something: ${err.message}`)

Data Privacy & Log Redaction

When it comes to PII (SPII), we can generally split data into two groups.

  • Known: Where you know exactly the type and origin. E.g. User.email

  • Unknown: Where you know it may be personal information but don't know the exact type or key. E.g. metadata or request.headers

Out of the box, our logs are redacted on Datadog: the following types of personal or sensitive information are detected and redacted automatically:

  • Phone number

  • Email

  • Address

  • DoB

  • Passport/ID number

  • Social Security Number

  • Card numbers

  • JSON web tokens

  • API keys (AWS, Slack, etc)

  • Access tokens and secrets

IP and MAC addresses are deliberately not redacted, for defensive security purposes.

With that in mind, all applications should adhere to the following best practices:

  • Don't include non-useful data in your logs (just because it's there).

  • Mask known data when you want complete control over how it is redacted from within the application, e.g. mask 60% of the email address but leave the domain intact (see the sketch after this list).

  • Leave unknown data to be redacted by Datadog's security scanner.

  • Use RBAC for logs to limit access to known PII that you must include in your logs for effective troubleshooting/security.
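As referenced above, a minimal Kotlin sketch of in-application email masking (the maskEmail helper and the 60/40 split are our own illustration, not a library function):

fun maskEmail(email: String): String {
    val at = email.indexOf('@')
    if (at <= 0) return "***" // not a well-formed address; mask everything
    val local = email.substring(0, at)
    val keep = maxOf(1, (local.length * 0.4).toInt()) // keep ~40%, mask ~60%
    return local.take(keep) + "*".repeat(local.length - keep) + email.substring(at)
}

For example, maskEmail("john.doe@foo.bar") returns "joh*****@foo.bar", leaving the domain intact.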

Loggers & Reference Implementations