Logging Policy
Introduction
This Logging Policy was born out of an internal effort at Blueground to rethink observability as our tech footprint kept expanding. With more services, more data, and more engineers involved, we needed a consistent way to capture logs that were actually useful, not just noise.
What started as a set of internal guidelines and reference snippets grew into a full policy we now use across our teams. We decided to open-source it in case others find themselves facing similar challenges.
We hope you'll find it useful whether you adopt it as-is, adapt parts of it, or simply take inspiration for your own approach to logging.
Purpose
This article outlines standardized logging practices that engineering teams can adopt to ensure consistency and operational excellence across applications.
Log analysis is one of the most essential tools for monitoring, troubleshooting, and auditing software systems. The effectiveness of these processes hinges on the standardization of log formats and practices. This guide establishes the logging standards for applications at our organization, ensuring uniformity and efficacy across our technology stack.
Logs are the stream of aggregated, time-ordered events collected from the output streams of all running processes and backing services. Logs in their raw form are typically a text or structured format with one event per line. Logs have no fixed beginning or end, but flow continuously as long as the app is operating.
Terms of Use
This Logging Policy was originally created for internal use within Blueground. We are open-sourcing it in the hope that it's useful to others, whether as a ready-to-use baseline or as inspiration for your own practices.
By using this Logging Policy (including the documentation and reference implementations), you agree to the following:
Free to Use and Adapt
You may copy, modify, and use this Logging Policy in your own projects.
You may not redistribute it as a product, package, or service under your own name. Sharing links to this repository or referencing it in your work is always welcome.
No Warranty
This Logging Policy is provided "as is", without warranty of any kind.
It was designed around our own systems and may not fit every use case.
No Liability
We are not responsible for any issues, damages, or losses that may result from your use of this Logging Policy.
You assume all responsibility for adapting and applying it in your own environment.
Attribution
If you use or adapt significant portions of this policy, a simple attribution back to this repository or mention of its source is appreciated (but not required).
The 10 commandments of logging
This policy fully endorses the 10 commandments of logging, which also happen to be a fun read:
https://www.masterzen.fr/2013/01/13/the-10-commandments-of-logging
What to log
Keep a balance between too much and too little logging.
Make sure you always log:
Application initialization
Logging initialization
Incoming requests
Outgoing requests
"Use case" initiation/completion
Application errors
Input and output validation failures
Authentication successes and failures
Authorization failures
Session management failures
Privilege elevation successes and failures
Other higher-risk events, like data import and export or bulk updates
Consider logging other events that address these use cases:
Troubleshooting
Monitoring and performance improvement
Testing
Understanding user behavior
Security and auditing
Log Format: JSON
JSON by default
All first-party applications should output their logs in JSON format without exceptions while running with default configuration (typically in production and staging deploys).
```json
{
  "timestamp": "1714131374032",
  "logger": "foo.bar",
  "msg": "This is a log message",
  "host": "i-f823e12ac",
  "service": "example-application",
  "source": "k8s/spring-boot",
  "status": "info"
}
```
Example: log message in JSON
Why a structured Log Format?
✅ Allows for a custom log data model
✅ Easy parsing
Semi-structured log formats like Syslog and the Common Log Format, once favored for their compact size and human readability, present challenges due to error-prone parsing, often handled through Grok expressions. Although their smaller byte size was advantageous when storage costs were high, this benefit has diminished. Modern log management providers, such as Datadog or Loggly, charge by the number of indexed log events (not their size*). Moreover, they address readability issues by converting JSON logs into easily interpretable formats, making structured formats more appealing despite their larger size. Finally, semi-structured log formats typically make assumptions about the data being logged, while JSON allows us to customize our log data model.
* They do charge for bytes of ingested data, but these costs are an order of magnitude lower than those of indexing.
TEXT in development
While JSON may be perfect for tools and log management providers, it makes it hard for humans to pinpoint and read the actual log message. In development mode, applications should take a hybrid approach and display each log event as follows:
```
<TIMESTAMP> <STATUS> [<THREAD>] [<LOGGER>] <MSG>
<attributes in JSON or key-value pairs>
```
TIMESTAMP: the log timestamp in the user/system date format/timezone
STATUS: the log level/status as a string, e.g. INFO, WARN
THREAD: the name of the thread that emits the log
LOGGER: the code component that emits the log
MSG: the log message
```
[Fri Apr 26 2024 14:36:17] | INFO | This is a log message
{
  "host": "i-f823e12ac",
  "service": "example-application",
  "source": "k8s/spring-boot"
}
```
Example: log message in TEXT + JSON
```
[Fri Apr 26 2024 14:36:17] | INFO | This is a log message
host=i-f823e12ac
service=example-application
source=k8s/spring-boot
```
Example: log message in TEXT + Key-Value pairs
Use delimiters and/or colors.
Use a delimiter to separate the different parts of the log message:
```
[Fri Apr 26 2024 14:36:17] | INFO | This is a log message
```
Or color-code them using ANSI color escape sequences.
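For illustration, a minimal sketch of such a development formatter in Kotlin; the LogEvent holder and the color choices are hypothetical, not part of the policy:
```kotlin
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter

// Hypothetical, illustration-only event holder.
data class LogEvent(
    val timestamp: ZonedDateTime,
    val status: String,
    val thread: String,
    val logger: String,
    val msg: String,
    val attributes: Map<String, Any?> = emptyMap()
)

// ANSI colors per status; fall back to plain text when colors are disabled.
private val colors = mapOf("ERROR" to "\u001B[31m", "WARN" to "\u001B[33m", "INFO" to "\u001B[32m")
private const val RESET = "\u001B[0m"

fun formatForDevelopment(event: LogEvent, useColors: Boolean = true): String {
    val color = if (useColors) colors[event.status] ?: "" else ""
    val reset = if (color.isNotEmpty()) RESET else ""
    val header = "${event.timestamp.format(DateTimeFormatter.RFC_1123_DATE_TIME)} " +
        "| $color${event.status}$reset | [${event.thread}] [${event.logger}] ${event.msg}"
    // Attributes go on the following line as key-value pairs, per the format above.
    val attrs = event.attributes.entries.joinToString("\n") { (k, v) -> "$k=$v" }
    return if (attrs.isEmpty()) header else "$header\n$attrs"
}
```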
Log Routing: STDOUT
A cloud-native app NEVER concerns itself with routing or storage of its output stream.
It should not attempt to write to or manage log files. Instead, each running process writes its event stream, unbuffered, to stdout. During local development, the developer will view this stream in the foreground of their terminal to observe the app's behavior.
In staging or production deploys, each process stream will be captured by the execution environment (Datadog agent for apps running in K8S), collated together with all other streams from the app, and routed to one or more final destinations for viewing and long-term archival. These archival destinations are not visible to or configurable by the app and instead are completely managed by the execution environment.
The stream can be sent to a log indexing and analysis system - we currently use Datadog - or a general-purpose data warehousing system such as Hadoop/Hive. These systems allow for great power and flexibility for introspecting an app's behavior over time, including:
Finding specific events in the past.
Large-scale graphing of trends (such as requests per minute).
Active alerting according to user-defined heuristics (such as an alert when the quantity of errors per minute exceeds a certain threshold).
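To make the "JSON events, unbuffered, to stdout" contract concrete, here is a minimal, illustrative Kotlin sketch (assuming Jackson is on the classpath; real applications should rely on their logging framework's JSON encoder instead):
```kotlin
import com.fasterxml.jackson.databind.ObjectMapper

private val mapper = ObjectMapper()

// Emits one JSON log event per line to stdout and flushes immediately;
// routing and storage are left entirely to the execution environment.
fun logJson(status: String, msg: String, attributes: Map<String, Any?> = emptyMap()) {
    val event = mapOf(
        "timestamp" to System.currentTimeMillis().toString(),
        "status" to status,
        "msg" to msg
    ) + attributes
    println(mapper.writeValueAsString(event))
    System.out.flush()
}

fun main() {
    logJson("info", "This is a log message", mapOf("service" to "example-application"))
}
```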
What if I want to collect logs in a file during development?
Inside a terminal? Use tee to view logs on stdout while also storing them to a file:
```
$ ./run-my-app | tee .logs/log.out
```
Inside an IDE?
Then you should have the option to save the output to a file. For example, this is how to save the console output to a file using IntelliJ.
Log Levels
Applications should emit log events mapping to one of the following log levels.
| Name | Severity | Description |
|---|---|---|
| FATAL | High | The service/app is going to stop or become unusable now. An engineer should definitely look into this soon. |
| ERROR | Mid | Fatal for a particular request, but the service/app continues servicing other requests. An engineer should look at this soon(ish). |
| WARN | Low | A note on something that should probably be looked at by an engineer eventually. |
| INFO | None | Detail on regular operation. |
| DEBUG | None | Anything else, i.e., anything too verbose to be included at the "info" level. |
| TRACE | None | Logging from external libraries used by your app or very detailed application logging. |
What do TRACE log messages look like?
The following types of messages are probably appropriate at the TRACE level:
```
Entering <class name>.<method name>, <argument name>: <argument value>, [<argument name>: <argument value>]
Method <class name>.<method name> completed [, returning: <return value>]
<class name>.<method name>: <description of some action being taken, complete with context information>
<class name>.<method name>: <description of some calculated value, or decision made, complete with context information>
```
Log Level Mapping
Once deployed, the application logs will be sent to our log management provider (Datadog).
For each log event to be mapped to the correct status on Datadog, you can do one of the following:
Configure a log status remapper to map your custom log level to Datadog's log status
Use the following codification (recommended)
```
{
  // Note that the name of the property
  // is "status", not "level"
  "status": "TRACE|DEBUG|INFO|WARN|ERROR|FATAL"
}
```
For reference, here's how Datadog remaps each incoming status value:
Integers from 0 to 7 map to the Syslog severity standards
Strings beginning with emerg or f (case-insensitive) map to emerg (fatal)
Strings beginning with a (case-insensitive) map to alert
Strings beginning with c (case-insensitive) map to critical
Strings beginning with e (case-insensitive), that do not match emerg, map to error
Strings beginning with w (case-insensitive) map to warning
Strings beginning with n (case-insensitive) map to notice
Strings beginning with i (case-insensitive) map to info
Strings beginning with d, trace or verbose (case-insensitive) map to debug
Strings beginning with o or s, or matching OK or Success (case-insensitive), map to OK
All others map to info
Log Data Model
Adopting a standardized data model for logging offers several key benefits:
Unified Understanding: Establishes a common framework for what constitutes a log event, ensuring clarity across all teams.
Clear Semantics: Provides log attributes with well-defined meanings, removing ambiguity and promoting consistent interpretation.
Enhanced Troubleshooting: Facilitates more effective and efficient problem-solving across applications in a microservices-based architecture.
Embracing a "Shift Left" approach, we aim for our log events to conform to this model right from their origin. In cases where direct adoption is impractical or complex, a Datadog log pipeline should be utilized to transform the log events into this data model.
Model definition
Consider a log event as a collection of attributes, each akin to a field in a flattened JSON object.
LOG EVENT = ATTRIBUTE1 + ATTRIBUTE2 + ...
Accordingly, the data model is described as a series of attribute specifications.
To maximize efficiency and leverage established practices, we build upon Datadog's default set of standard log attributes and split those attributes into functional domains like Network, HTTP, and more.
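Purely as an illustration of this "event = sum of attributes" view (the helper and the attribute values below are made up, not something applications should implement):
```kotlin
// Illustration only: a log event as a flat bag of attributes,
// composed by "adding" attributes together.
typealias Attributes = Map<String, Any?>

fun event(vararg attributes: Pair<String, Any?>): Attributes = mapOf(*attributes)

val logEvent: Attributes =
    event("status" to "info", "message" to "Booking created") +
        ("service" to "example-application") +
        ("domain.booking.code" to "BK-123")
```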
Reserved attributes
A "fixed" set of attributes that bear special semantics on Datadog. Reserved attributes are automatically identified and parsed.
| Path | Type | Required | Description |
|---|---|---|---|
| host | string | Automatically added by the Datadog Agent | The name of the originating host as defined in metrics. Datadog automatically retrieves corresponding host tags from the matching host in Datadog and applies them to your logs. The Agent sets this value automatically. Example: i-f823e12ac |
| source | string | Automatically added by the Datadog Agent | This corresponds to the integration name, the technology from which the log originated. When it matches an integration name, Datadog automatically installs the corresponding parsers and facets. For example: nginx, postgresql. Example: k8s/spring-boot |
| status | string | Always | This corresponds to the level of a log. It is used to define patterns and has a dedicated layout in the Datadog Log UI. Example: info |
| service | string | Automatically added by the Datadog Agent | The name of the application or service generating the log events. It is used to switch from Logs to APM, so make sure you define the same value when you use both products. Example: example-application |
| trace_id | string | Automatically added by the Datadog Agent | This corresponds to the Trace ID used for traces. It is used to correlate your log with its trace. |
| message | string | Always | By default, Datadog ingests the value of the message attribute as the body of the log entry. That value is then highlighted and displayed in Live Tail, where it is indexed for full-text search. |
| date | string or number | Always | The log event creation timestamp. |
Source code attributes
This attribute set helps identify the origin of the log event in the source code.
| Path | Type | Required | Description |
|---|---|---|---|
| logger.name, name | string | Always | The name of the logger. Example: foo.bar |
| logger.thread_name | string | Multithreaded apps | The name of the current thread when the log is fired. |
| logger.method_name | string | Optional | The class method name. |
| logger.version | string | Optional | The version of the logger. |
Network attributes
These attributes are related to the data used in network communication.
All fields and metrics are prefixed by network.
| Path | Type | Required | Description |
|---|---|---|---|
| network.bytes_read | number | Network requests | Total number of bytes transmitted from the client to the server when the log is emitted. |
| network.bytes_written | number | Network requests | Total number of bytes transmitted from the server to the client when the log is emitted. |
| network.client.external_ip, network.client.ip | string | Network requests | The IP address of the original client that initiated the inbound connection. |
| network.client.external_port, network.client.port | string | Network requests | The port of the original client that initiated the connection. |
| network.client.internal_ip | string | Network requests | The IP address of the internal host proxying the connection. Typically, a load balancer or another pod in the same k8s cluster. |
| network.destination.ip | string | Network requests | For outbound connections, the destination IP. |
| network.destination.port | number | Network requests | The remote port number of the outbound connection. |
Applications should try to include those attributes in their network-related requests, both incoming and outgoing. See the full list of supported network and geolocation attributes.
Error attributes
These attributes are related to error-specific data and are required for all error events.
| Path | Type | Required | Description |
|---|---|---|---|
| error.kind | string | Mapped by Datadog | The error type or kind (or code in some cases). |
| error.message | string | Mapped by Datadog | A concise, human-readable, one-line message explaining the event. |
| error.stack | string | Errors | The stack trace or complementary information about the error. |
HTTP attributes
Required for all events that log HTTP requests. As they describe the HTTP request itself, they should not be propagated to downstream log events (MDC) or logged out of context.
| Path | Type | Required | Description |
|---|---|---|---|
| http.url | string | HTTP requests | The URL of the HTTP request. |
| http.referer | string | HTTP requests | HTTP header field that identifies the address of the webpage that linked to the resource being requested. |
| http.method | string | HTTP requests | The HTTP method of the request, e.g. GET, POST. |
| http.status_code | string | HTTP requests | The HTTP response status code. |
| http.useragent | string | HTTP requests | The User-Agent header as it is sent (raw format). |
| http.version | string | HTTP requests | The version of HTTP used for the request. |
| http.url_details.host | string | HTTP requests | The HTTP host part of the URL. |
| http.url_details.port | number | HTTP requests | The HTTP port part of the URL. |
| http.url_details.path | string | HTTP requests | The HTTP path part of the URL. |
| http.url_details.queryString | object | HTTP requests | The HTTP query string parts of the URL decomposed as query params key/value attributes. |
| http.url_details.scheme | string | HTTP requests | The protocol name of the URL (HTTP or HTTPS). |
| http.useragent_details.os.family | string | HTTP requests | The OS family reported by the User-Agent. |
| http.useragent_details.browser.family | string | HTTP requests | The browser family reported by the User-Agent. |
| http.useragent_details.device.family | string | HTTP requests | The device family reported by the User-Agent. |
Performance attributes
Required in any log that describes a task whose performance we are interested in. E.g., an HTTP request, a DB operation or any I/O operation, a CPU crunching computation, etc.
| Path | Type | Required | Description |
|---|---|---|---|
| duration | number (nanoseconds) | When applicable | A duration of any kind in nanoseconds: HTTP response time, database query time, latency, and so on. Remap any durations within logs to this attribute, because Datadog displays and uses it as a default measure for trace search. |
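For instance, a small Kotlin sketch of capturing such a duration (the helper and logger names are illustrative, not mandated):
```kotlin
import org.slf4j.LoggerFactory

private val logger = LoggerFactory.getLogger("performance")

// Runs a task and logs its duration in nanoseconds, as required above.
fun <T> withLoggedDuration(taskName: String, block: () -> T): T {
    val start = System.nanoTime()
    try {
        return block()
    } finally {
        // The key is "duration" so it maps to the default Datadog measure.
        logger.info("[{}] completed (duration={})", taskName, System.nanoTime() - start)
    }
}
```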
User attributes
All execution flows starting from a user action should capture the user information in their log events.
| Path | Type | Logged | Description |
|---|---|---|---|
| usr.id | string | When part of a user request | The user identifier. |
| usr.name | string | When part of a user request | The user's name. |
| usr.email | string | When part of a user request | The user's email. |
Domain attributes
Domain attributes capture COMMON domain-specific information. By common we mean that those attributes are organization-wide and can be found across services and apps. Domain attributes are typically provided via a context propagation mechanism like MDC.
| Path | Type | Logged | Description |
|---|---|---|---|
| domain.market | string | When applicable | The organization's market as a 3-letter city code. |
| domain.property.id | number | When applicable | The property id. |
| domain.property.code | string | When applicable | The property code. |
| domain.booking.id | number | When applicable | The booking id. |
| domain.booking.code | string | When applicable | The booking code. |
| domain.booking.version | number | When applicable | The booking version. |
| domain.task.id | number | When applicable | The task id. |
| domain.* | string | When applicable | Any domain-specific attribute related to the log event. |
Service attributes
Service attributes are service-specific domain attributes. To avoid name "collisions" that may prevent us from using features such as Log Facets, the service attributes are placed under a [service] key instead of domain.
Why both domain.* and [service].* attributes?
If service-specific domain attributes were placed under domain, we would get values under the same key with totally different semantics.
Example:
For Hermes, channel may be slack, while for PCM it may be apartments.com. By placing channel under hermes.channel and pcm.channel respectively, we get the isolation required to build effective filters and facets.
| Path | Type | Logged | Description |
|---|---|---|---|
| [service].* | any | When applicable | A service-specific domain attribute, e.g. hermes.channel. |
Other attributes
| Path | Type | Logged | Description |
|---|---|---|---|
| correlation_id | string | Always | A trace ID that is decoupled from APM. As such, it should be present in all log events regardless of whether the trace is sampled by APM or not. For traces starting from an external incoming HTTP request, it should default to |
| entrypoint | string | Always | Identifies the "port" through which the request reached the application. |
| | string | service-to-service | The ID of the client service. |
| | string | service-to-service | The name of the client service. |
| team | string | Always | The team responsible for interpreting/monitoring this log event. |
How to generate the correlation_id?
Format: version-timestamp-uniqueid, where:
version: 1
timestamp: Epoch in seconds
uniqueid: KSUID
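A minimal Kotlin sketch of such a generator, assuming a JVM KSUID library is available (here the com.github.ksuid artifact; any KSUID implementation would do):
```kotlin
import com.github.ksuid.Ksuid
import java.time.Instant

// Builds a correlation id in the version-timestamp-uniqueid format.
fun newCorrelationId(): String {
    val version = 1
    val timestamp = Instant.now().epochSecond // epoch in seconds
    val uniqueId = Ksuid.newKsuid().toString() // KSUID
    return "$version-$timestamp-$uniqueId"
}
```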
How to propagate the correlation_id?
JVM: propagate via the logging context (MDC), as described in Best Practices below.
Node.js: propagate via an async context mechanism such as AsyncLocalStorage.
The log message
Generic
Standard log events should use the following generic message format in the default configuration:
```
[SCOPE] TEXT (key=value)*
```
SCOPE: Optional context in square brackets. Typically, the name of a use case, command, or edge case being handled, e.g. [CreateBooking], [UpdateUser], [SendEmailJob]. The scope is redundant when it's semantically equivalent to the logger.name.
TEXT: The human-readable description of the log event. See Writing style for details.
(key=value): Key/Value pairs go at the end and are enclosed in parentheses.
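A tiny helper along these lines (illustrative, not mandated by this policy) makes it easy to keep messages consistent:
```kotlin
// Builds "[SCOPE] TEXT (key=value)*" messages per the generic format above.
fun logMessage(scope: String?, text: String, vararg pairs: Pair<String, Any?>): String {
    val prefix = scope?.let { "[$it] " } ?: ""
    val suffix = pairs.joinToString(" ") { (k, v) -> "($k=$v)" }
    return listOf("$prefix$text", suffix).filter { it.isNotEmpty() }.joinToString(" ")
}

// logMessage("CreateBooking", "Validation failed", "reason" to "invalid address")
// => "[CreateBooking] Validation failed (reason=invalid address)"
```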
Writing style
To maintain consistency and readability, adhere to the following guidelines for the tone, style, and tense of log messages:
Tone: Use a neutral and professional tone. Avoid slang, jargon, or overly technical language that may not be easily understood by all team members.
Style: Be concise and clear. Ensure that each log message provides sufficient context to understand the issue without being verbose. Use structured formats to facilitate easy parsing and analysis.
Tense: Use appropriate tenses based on the action being logged. Present tense can be used for ongoing actions, past tense for completed actions, and present perfect for recent completions relevant to the current context.
Ongoing Actions (Present Tense): Use the present tense for actions that are currently happening.
Example:
Starting the payment processing
Completed Actions (Past Tense): Use past tense for actions that have already been completed.
Example:
Payment processed successfully
Recent Completions (Present Perfect): Use present perfect tense for actions that happened in the past but are still relevant (with a result in the present):
Example:
File has been deleted
HTTP
HTTP log events follow a particular message format that allows for very fast eye scanning among hundreds of events. Since applications typically both accept and issue HTTP requests, we employ a slightly different format between the two.
INCOMING REQUEST
```
[req] METHOD PATH
```
Should we log incoming requests or only responses?
**Recommendation**: No by default, but make it available under a feature flag.
Logging incoming requests == increased log management costs
Logging incoming requests == increased log verbosity
Logging incoming requests == ability to identify poisonous requests that may hang the server before it gets a chance to log a response. Without logging incoming requests, poisonous requests could go "stealth".
OUTGOING RESPONSE
```
[res] METHOD PATH STATUS_CODE STATUS_TEXT (duration)
```
OUTGOING REQUEST (e.g., Axios, fetch, RestClient)
```
-> [req] [host] METHOD PATH
```
INCOMING RESPONSE
```
<- [res] [host] METHOD PATH STATUS_CODE STATUS_TEXT (duration)
```
Examples:
```
# INCOMING REQUEST
[req] GET /foo/bar?qux=baz
[req] POST /foo/bar

# OUTGOING RESPONSE
[res] GET /foo/bar?qux=baz 200 OK (0.17s)
[res] POST /foo/bar 201 CREATED (0.52s)

# OUTGOING REQUEST
-> [req] [foo-svc] GET /foo/bar?qux=baz
-> [req] [exchangerate.com] GET /api/rates

# INCOMING RESPONSE
<- [res] [foo-svc] GET /foo/bar?qux=baz 200 OK (0.2s)
<- [res] [exchangerate.com] GET /api/rates 200 OK (0.3s)
```
GraphQL
GraphQL requests are somewhat special HTTP requests:
They cannot be differentiated by the HTTP path, which is typically the same for all: /graphql
They are highly polymorphic
The same query may list different attributes
The same request may contain multiple queries
For the reasons above, to fully capture GraphQL requests in our log events, we add extra, specific metadata under the graphql JSON property:
operationType: Whether it is a query, mutation, or subscription.
operationName: The name of the query or mutation being executed.
operationBody: The actual GraphQL query or mutation string.
variables: Any variables passed along with the query.
responseTime: The time taken to execute the query.
responseStatus: Success or error status of the query execution.
errors: Detailed error messages if the query fails.
performanceMetrics: Optional, but can include detailed timing information for various parts of the query execution.
```typescript
import { z } from "zod";

export const schema = z.object({
  graphql: z.object({
    operations: z.array(
      z.object({
        operationType: z.union([
          z.literal("query"),
          z.literal("mutation"),
          z.literal("subscription")
        ]),
        operationName: z.string(),
        operationBody: z.string(),
        variables: z.object({}).passthrough(),
        responseTimeMs: z.number(),
        responseStatus: z.string(),
        // Nullable so successful operations can report "errors": null
        errors: z.array(z.string()).nullable(),
        // Optional, per the attribute list above
        performanceMetrics: z.object({
          parsingTimeMs: z.number(),
          validationTimeMs: z.number(),
          executionTimeMs: z.number()
        }).optional()
      })
    )
  })
});
```
graphql zod schema
Example:
```
{
// ...
"msg": "POST /graphql 200 OK",
"graphql": {
"operations": [
{
"operationType": "query",
"operationName": "getUser",
"operationBody": "query getUser($id: ID!) { user(id: $id) { id, name, email } }",
"variables": {
"id": "user_456"
},
"responseTimeMs": 125,
"responseStatus": "success",
"errors": null,
"performanceMetrics": {
"parsingTimeMs": 10,
"validationTimeMs": 5,
"executionTimeMs": 110
}
},
{
"operationType": "mutation",
"operationName": "createUser",
"operationBody": "mutation createUser($name: String!, $email: String!) { id }",
"variables": {
"name": "Joe",
"email": "joe@test.com"
},
"responseTimeMs": 125,
"responseStatus": "success",
"errors": null,
"performanceMetrics": {
"parsingTimeMs": 10,
"validationTimeMs": 5,
"executionTimeMs": 110
}
}
]
}
}
```
Best Practices
Correlation IDs
All apps should accept and propagate correlation IDs from all entry points to all exit points.
HTTP (see the sketch after this list)
Inbound HTTP requests should set the MDC from the x-correlation-id HTTP header
Outbound HTTP requests should set the x-correlation-id HTTP header from the MDC
Kafka
Message consumers should set the MDC from the x-correlation-id message header
Message producers should set the x-correlation-id message header from the MDC
RabbitMQ / AMQP
Message consumers should set the MDC from the correlation-id message property
Message producers should set the correlation-id message property from the MDC
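For HTTP on the JVM, a minimal sketch of the inbound half, assuming a servlet stack and the newCorrelationId() helper sketched earlier (the filter itself is illustrative, not prescribed):
```kotlin
import jakarta.servlet.Filter
import jakarta.servlet.FilterChain
import jakarta.servlet.ServletRequest
import jakarta.servlet.ServletResponse
import jakarta.servlet.http.HttpServletRequest
import org.slf4j.MDC

// Sets the MDC from the x-correlation-id header, generating one if absent,
// so every log event in the request's flow carries the correlation id.
class CorrelationIdFilter : Filter {
    override fun doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain) {
        val incoming = (req as? HttpServletRequest)?.getHeader("x-correlation-id")
        MDC.put("correlation_id", incoming ?: newCorrelationId())
        try {
            chain.doFilter(req, res)
        } finally {
            MDC.remove("correlation_id") // avoid leaking across pooled threads
        }
    }
}
```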
Writing a good log message
Follow a consistent format: See the log message format.
Be clear and concise: Ensure that your log messages are easy to understand. Use simple and clear language to describe what is happening in the system.
Include Relevant Context: Provide enough context to make the log message useful for debugging. Include information such as user IDs, request IDs, or relevant state information. See logging attributes.
Some examples:
❌ Error in module X
✅ Failed to connect to database in user authentication module. (userID=12345) (action=login)
❌ User not found
✅ User not found during password reset request. (userID=98765) (requestID=abc123)
❌ Order failed
✅ Order processing failed due to invalid payment method. (orderID=54321) (paymentMethod=Bitcoin)
❌ Payment error
✅ Payment failed. (userID=12345) (orderID=98765) (errorCode=PAYMENT_DECLINED)
Don't log at the wrong layer
Log where you have the right context for the type of log event you want to create.
This is best described through an example:
Bad
```kotlin
fun validateAddress(address: Address): Boolean {
    if (address.street.isNullOrBlank()) {
        // We don't have enough context here since validateAddress
        // may be used in a dozen places.
        logger.info("Street can't be blank")
        return false
    }
    if (address.zipCode.length < 5) {
        // We don't have enough context here since validateAddress
        // may be used in a dozen places.
        logger.info("Zip code should be at least 5 chars")
        return false
    }
    return true
}

fun createProperty(property: PropertyDetails) {
    val isAddressValid = validateAddress(property.address)
    if (!isAddressValid) {
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val isAddressValid = validateAddress(profile.address)
    if (!isAddressValid) {
        throw InvalidAddressError()
    }
}
```
Better
```kotlin
fun validateAddress(address: Address): Boolean {
    if (address.street.isNullOrBlank()) {
        // Only a trace log makes sense here, to troubleshoot
        // validateAddress itself if needed
        logger.trace("Street can't be blank. (street=${address.street})")
        return false
    }
    if (address.zipCode.length < 5) {
        // Only a trace log makes sense here, to troubleshoot
        // validateAddress itself if needed
        logger.trace("Zip code should be at least 5 chars. (zipCode=${address.zipCode})")
        return false
    }
    return true
}

fun createProperty(property: PropertyDetails) {
    val isAddressValid = validateAddress(property.address)
    if (!isAddressValid) {
        logger.info("[CreateProperty] Validation failed. Invalid address ({})", property.address)
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val isAddressValid = validateAddress(profile.address)
    if (!isAddressValid) {
        logger.info("[CreateUserProfile] Validation failed. Invalid address ({})", profile.address)
        throw InvalidAddressError()
    }
}
```
Best
```kotlin
enum class AddressValidation {
    OK, INVALID_STREET, INVALID_ZIP;

    fun isOK() = this == OK
}

fun validateAddress(address: Address): AddressValidation {
    if (address.street.isNullOrBlank()) {
        // Only a trace log makes sense here, to troubleshoot
        // validateAddress itself if needed
        logger.trace("Street can't be blank. (street=${address.street})")
        return AddressValidation.INVALID_STREET
    }
    if (address.zipCode.length < 5) {
        // Only a trace log makes sense here, to troubleshoot
        // validateAddress itself if needed
        logger.trace("Zip code should be at least 5 chars. (zipCode=${address.zipCode})")
        return AddressValidation.INVALID_ZIP
    }
    return AddressValidation.OK
}

fun createProperty(property: PropertyDetails) {
    val addressValidation = validateAddress(property.address)
    if (!addressValidation.isOK()) {
        logger.info("[CreateProperty] Address validation failed (type={})", addressValidation)
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val addressValidation = validateAddress(profile.address)
    if (!addressValidation.isOK()) {
        logger.info("[CreateUserProfile] Address validation failed (type={})", addressValidation)
        throw InvalidAddressError()
    }
}
```
Logging errors
When to log errors
Log all global exceptions, e.g. via an UncaughtExceptionHandler on the JVM
Log at the layer where you are handling the error
Use log and rethrow with caution
Log and rethrow
```kotlin
try {
    // Some code that might throw an exception
    throw KaboomError("Simulated error")
} catch (e: RuntimeException) {
    logger.error("Kaboom: ${e.message}", e)
    // Transform and rethrow, keeping the original exception as the cause
    throw BangError("Kaboom", e)
}
```
Why you should avoid it
Duplicate logging
Loss of context if not done properly
Performance/cost (if on a hot path)
Why you may want it
If you need to log the error at different log levels
If you need to make sure you add contextual info at a lower level
If you want to transform an exception into a more meaningful one that fits the abstraction of the current layer.
What to log as an error
Unhandled exceptions
Unexpected Errors
Database Errors
Failed calls to external APIs or microservices.
Critical business operations that fail, such as order processing or payment transactions.
Resource Failures
Insufficient resources (e.g., memory, disk space).
Failures to read or write critical files.
Network connectivity issues affecting application performance.
HTTP 500s should also be logged as errors
How to log errors with slf4j
logger.error("Failed to do something: ${e.message}", e)How to log errors with pino
logger.error({err}, "Failed to do something: ${err.message}")Data Privacy & Log Redaction
When it comes to PII (and SPII), we can generally split data into two groups:
Known: Where you know exactly the type and origin, e.g. User.email
Unknown: Where you know it may be personal information but don't know the exact type or key, e.g. metadata or request.headers
Out of the box, our logs are redacted on Datadog in the following ways:
The following types of personal or sensitive information are detected and redacted automatically:
Phone number
Email
Address
DoB
Passport/ID number
Social Security Number
Card numbers
JSON web tokens
API keys (AWS, Slack, etc)
Access tokens and secrets
IP and MAC addresses are deliberately not redacted, for defensive security purposes
With that in mind, all applications should adhere to the following best practices:
Don't include non-useful data in your logs (just because it's there).
Mask known data when you want complete control over how it is redacted from within the application, e.g. mask 60% of the email address but leave the domain intact (see the sketch below).
Leave unknown data to be redacted by Datadog's security scanner.
Use RBAC for logs to limit access to known PII that you must include in your logs for effective troubleshooting/security.
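A minimal sketch of application-side masking along those lines; the helper name and the 60% ratio are illustrative, not mandated:
```kotlin
// Masks the local part of an email while leaving the domain intact,
// e.g. "joe.bloggs@example.com" -> "joe.******@example.com"
fun maskEmail(email: String, maskRatio: Double = 0.6): String {
    val at = email.indexOf('@')
    if (at <= 0) return "***" // not a well-formed email; mask everything
    val local = email.substring(0, at)
    val keep = (local.length * (1 - maskRatio)).toInt().coerceAtLeast(1)
    return local.take(keep) + "*".repeat(local.length - keep) + email.substring(at)
}
```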