API observability for Cloudflare Workers

Capturing API logs and metrics for operations monitoring, alerting and troubleshooting by integrating AWS CloudWatch as a third party observability endpoint.

November 1st, 2020

TL;DR

This is the final post in a series focussing on the use of Cloudflare Workers as APIs. In this post, observability is implemented in Cloudflare Workers to capture logs and generate metrics that reflect API concepts and enable operations use cases such as monitoring, alerting and troubleshooting. The observability endpoint is created using AWS CloudWatch, but the same concept can be applied to any third party service. A video demo shows observability in action after running some load against the API. I conclude the series with a perspective on the future of cloud, edge and serverless computing.

The challenge

The default option for logging in Cloudflare Workers is to use the logging methods available on the global console object:

console.log('this is a log message')

Those log messages are ephemeral and simply sent to the standard output/error streams. While this is useful for local development and debugging in the workers.dev environment, for production use it would be useful to persist these logs to support potential troubleshooting scenarios.

Having logs would be a good thing, but it would be even better if we had a mechanism to observe key metrics of the API to know if/when we need to review the logs. For example, useful metrics might be:

  • A count of requests that can be grouped by API id, API resource, HTTP method, path and response status code
  • The duration of requests that can be grouped by API id, API resource, HTTP method, path and response status code

Having metrics like that allows you to observe the health of your APIs and configure alerts based on criteria such as increased rates of error responses or increasing duration of requests. Having the API id, resource, method and path allows you to quickly identify unhealthy components of the solution, and to quickly identify the logs and timeframe you need to augment any investigations.

Cloudflare has a GraphQL API available to query various data, including some Worker metrics. However, the metrics are Worker-centric rather than API-centric (because I've implemented custom API concepts). So you can get high-level counts of requests and errors, and also percentiles of Worker CPU usage, but you don't get a view of the overall request duration or the lower-level API concepts.

I'm going to implement an observability framework that emits logs and metrics reflecting the API concepts to a persistent store. In this case I'm going to use AWS CloudWatch, but you could apply a similar approach to other services.

Considerations

To create logs and metrics with AWS CloudWatch there are 4 things you need to do:

  1. Create a log group
  2. Create a log stream in the log group
  3. Write the log messages to the log stream
  4. Write the metrics/dimensions to a namespace (the namespace is created implicitly)
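
To make the entity model concrete, here's a sketch of that sequence using the AWS SDK for JavaScript (which, as we'll see shortly, can't actually run in a Worker); the my-api names and the stream name are placeholder values:

const AWS = require('aws-sdk')

const example = async () => {
  const logs = new AWS.CloudWatchLogs()
  const cloudwatch = new AWS.CloudWatch()
  const logStreamName = '2020/11/01/abc123' // placeholder stream name

  // 1. Create a log group (a one-off, done at provisioning time in this post)
  await logs.createLogGroup({ logGroupName: 'my-api' }).promise()

  // 2. Create a log stream in the log group
  await logs.createLogStream({ logGroupName: 'my-api', logStreamName }).promise()

  // 3. Write the log messages to the log stream
  await logs
    .putLogEvents({
      logGroupName: 'my-api',
      logStreamName,
      logEvents: [{ timestamp: Date.now(), message: '[info] hello' }],
    })
    .promise()

  // 4. Write the metrics/dimensions to a namespace (the namespace is created implicitly)
  await cloudwatch
    .putMetricData({
      Namespace: 'my-api',
      MetricData: [{ MetricName: 'Count', Unit: 'Count', Value: 1 }],
    })
    .promise()
}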

This is a representation of the entity model: Illustration of AWS CloudWatch entity model

I want to create a log group per API, which I can do as part of the service provisioning. I want a log stream per API request, similar to how AWS Lambda uses them, and I'll create a metrics namespace per API. That means for each API request I need to do 3 things:

  1. Create a log stream in the API log group
  2. Write the log messages to the new log stream
  3. Write the metrics/dimensions to the API namespace

Let's look at the approaches I can take.

Directly to CloudWatch

This approach would see the Worker integrate with CloudWatch directly. You can't use the JavaScript AWS SDK in a Worker environment because it doesn't use the standard fetch HTTP APIs, so it would require you to go directly to the CloudWatch Logs and Metrics HTTP APIs using a Worker environment compatible signature v4 signing process, with the associated crypto requirements:

Illustration of direct integration from a worker to AWS CloudWatch

The trouble with this approach is the complexity of implementing the v4 signing process and appropriately abstracting the crypto functions so it works locally and in the Worker environment, as per the substitutable dependencies and execution context approach from a previous post. Since I need to make 3 API calls, each requiring v4 signing and therefore crypto, I'm concerned about the CPU time I'd be burning, and potentially exceeding the limits. So I think this approach is off the table.
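
To give a sense of what's involved, here's a minimal sketch of just the signing-key derivation step of the v4 process (per the AWS spec), using the Web Crypto API available in Workers; a full implementation also needs canonical request construction, payload hashing and the final signature, for every one of the 3 calls:

const encoder = new TextEncoder()

// HMAC-SHA256 via the Web Crypto API (available in the Worker runtime)
const hmac = async (key, data) => {
  const cryptoKey = await crypto.subtle.importKey('raw', key, { name: 'HMAC', hash: 'SHA-256' }, false, ['sign'])
  return new Uint8Array(await crypto.subtle.sign('HMAC', cryptoKey, encoder.encode(data)))
}

// Derive the SigV4 signing key: date -> region -> service -> 'aws4_request'
const signingKey = async ({ secretAccessKey, date, region, service }) => {
  const kDate = await hmac(encoder.encode(`AWS4${secretAccessKey}`), date) // e.g. '20201101'
  const kRegion = await hmac(kDate, region) // e.g. 'eu-west-1'
  const kService = await hmac(kRegion, service) // e.g. 'logs' or 'monitoring'
  return hmac(kService, 'aws4_request')
}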

Via API Gateway and Lambda

I want to think about moving the 3 API calls to the AWS side of the solution so that the Worker only has to make one API call. This approach sees the introduction of a backing API on the AWS side that the Worker can hit using an API key. The API will be created in API Gateway and integrated with a Lambda function that can use the AWS SDK to create the CloudWatch resources:

An illustration showing the integration from a worker to AWS CloudWatch via API Gateway and Lambda

The trouble with this approach is the estimated cost of running the Lambda function to persist the logs and metrics is orders of magnitude greater than the cost of running the entire API in Cloudflare. I can't justify spending that much money on persisting logs and metrics.

Via API Gateway service integration

There's a middle ground between those approaches that balances complexity and cost. API Gateway has an integration option that enables you to map requests through to other AWS service HTTP APIs:

An illustration showing the integration from a worker to AWS CloudWatch direct via API Gateway

This is an interesting approach because it removes the prohibitive cost of running Lambda functions, plus it allows us to keep the Worker to AWS integration simple via an API key, i.e. no crypto/CPU concerns. However, it does mean that we need to go back to making 3 API calls from the Worker, and there is some complexity in crafting the AWS CloudWatch integrations in API Gateway. I'm not too concerned about the 3 API calls since they're network bound, and as crypto is not required in this solution, this won't impact the CPU usage too much. On balance, I think this is the most viable approach.

Let's implement it.

Implementation

There's a repo with the code created for this post. I use snippets of the code in this post for brevity and to convey the concept, but I'll link to the original file when I do this.

AWS resources

Firstly I need to create the AWS infrastructure resources. One of the reasons I'm using the Serverless framework is because I knew I wanted to create resources in Cloudflare and AWS, and it provides a common and extensible framework for doing that.

For the AWS provider, the Serverless framework is an abstraction layer over CloudFormation and other service SDKs. To provision AWS resources via the SDKs I need an IAM user with appropriate permissions, and to save the user's access details in my .env file using the well-documented environment variable names the SDKs look for. I also specify my default region:

AWS_REGION=eu-west-1
AWS_ACCESS_KEY_ID=*****************
AWS_SECRET_ACCESS_KEY=*****************

Now I can create a service definition template for the resources. It's worth saying that if I had a Lambda function to deploy, this would be less effort than it currently is, as the Serverless framework does a lot of the work creating the API Gateway resources for you. But for this post I don't; I just have API mappings to CloudWatch, and since the Serverless framework is function-centric, there's no entry point to have the resources created for me. Therefore, for this post, I'll create them explicitly. There's quite a lot of YAML, so I'll just link to the template in the repo and pick out the interesting bits.

The first gotcha comes when you define the API methods as CloudFormation resources (as opposed to via an OpenAPI spec, presumably): you need to ensure the deployment depends on one of those methods (line 59), and you need a dummy stage for some reason (line 63):

restApiDeployment:
  Type: AWS::ApiGateway::Deployment
  DependsOn: observabilityLogsStreamMethod
  Properties:
    RestApiId:
      Ref: restApi
    StageName: 'Dummy'

The above is kind of documented (examples section), although the dummy stage name bit isn't mentioned anymore.

Let's look at the definition of the API method we are depending on above as it has some interesting bits on the integration with CloudWatch:

observabilityLogsStreamMethod:
  Type: AWS::ApiGateway::Method
  Properties:
    AuthorizationType: NONE
    ApiKeyRequired: true
    HttpMethod: 'PUT'
    Integration:
      Credentials:
        Fn::GetAtt:
          - observabilityRole
          - Arn
      IntegrationHttpMethod: 'POST'
      Type: 'AWS'
      Uri: arn:aws:apigateway:#{AWS::Region}:logs:action/CreateLogStream
      RequestParameters:
        integration.request.header.X-Amz-Target: "'Logs_20140328.CreateLogStream'"
        integration.request.header.Content-Type: "'application/x-amz-json-1.1'"
      PassthroughBehavior: NEVER
      RequestTemplates:
        'application/json': "{\n \"logGroupName\": \"${self:provider.resourceContext}-${self:provider.logGroupName}\",\n \"logStreamName\": \"$input.path('$.stream')\"\n}"
      IntegrationResponses:
        - StatusCode: 200
        - StatusCode: 400
          SelectionPattern: '400'
        - StatusCode: 500
          SelectionPattern: '500'
    MethodResponses:
      - StatusCode: 200
        ResponseParameters:
          method.response.header.Content-Type: true
      - StatusCode: 400
        ResponseParameters:
          method.response.header.Content-Type: true
      - StatusCode: 500
        ResponseParameters:
          method.response.header.Content-Type: true
    ResourceId:
      Ref: observabilityLogsStreamResource
    RestApiId:
      Ref: restApi

The first thing to notice is on lines 146 and 152. In my API I want to use the PUT method to create a log stream, but when you're integrating with AWS service APIs you usually need to use the POST method, so this is telling API Gateway to map my PUT requests to POST requests.

The integration type is defined on line 154 with the ARN for the action you want to use, in this case the CreateLogStream action for CloudWatch Logs.

It's not enough to specify the ARN though; you have to pass a couple of headers along with the requests too, shown on lines 156-7. I couldn't find explicit mention of these in the documentation, but eventually you'll find them in the sample requests. The quoting is important too: the single quotes within the double quotes tell API Gateway these are static values.
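
For reference, the underlying CloudWatch Logs API call that API Gateway ends up making is shaped like this (the body values here are illustrative):

POST / HTTP/1.1
Host: logs.eu-west-1.amazonaws.com
X-Amz-Target: Logs_20140328.CreateLogStream
Content-Type: application/x-amz-json-1.1

{"logGroupName": "my-api-log-group", "logStreamName": "2020/11/01/abc123"}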

The last interesting bit is the request template mapping on line 160. The create log stream method takes two parameters: the log group name and the log stream name. My API method will only take one parameter, the stream name, as I use a log group per API, and the log group is created when the service is provisioned (line 74 in the template). In this template I'm defining the JSON body payload for the HTTP POST request: I'm pre-configuring the logGroupName with the log group I've created for this API, and specifying that the logStreamName should be mapped from the stream parameter on the incoming request.
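
So, to illustrate with a hypothetical log group name, an incoming request body of {"stream": "2020/10/28/abc123"} would be mapped to an integration request body of:

{
  "logGroupName": "prod-my-api-logs",
  "logStreamName": "2020/10/28/abc123"
}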

The rest of the template is fairly standard stuff for defining the resources, methods, responses and configuring it to use an API key. There are similar definitions for writing the logs and metrics.

Now I can deploy the resources:

$ npm run deploy:aws

> cloudflare-workers-observability@1.0.0 deploy:aws /src/cloudflare-workers-observability
> dotenv -- serverless deploy --config infrastructure/aws.yml --verbose

Serverless: Packaging service...
...
Stack Outputs
ApiEndpoint: https://xxxx.execute-api.eu-west-1.amazonaws.com/prod
ServerlessDeploymentBucketName: backing-api-prod-serverlessdeploymentbucket-xxxx
ApiHost: xxxx.execute-api.eu-west-1.amazonaws.com

Serverless: Removing old service artifacts from S3...

Done. Time to integrate it into my Workers.

Worker integration

As per my previous post on enhancing the development experience, I'm going to follow the principles of having substitutable dependencies and execution context.

For logs, I want the logger to be easily accessible to all code and for log messages to go to standard output regardless of the execution context. When running in Cloudflare, I also want the messages to be persisted so they can be flushed to my observability endpoint at the end of the request. I'll create a logger factory abstraction that changes the logging behaviour when running in Cloudflare to persist the logs in-memory.

For metrics, I'm only interested in creating/seeing them when running in Cloudflare. Most of the metrics I want can be derived from data in the original request or the response; the exception is duration, for which I'll need to track the start and end time of the request. I'll create an observability abstraction that encapsulates the steps to create the stream, log messages and metrics.

Finally, I'll modify the execution context to dispatch the observability data at the end of the request:

Illustration of the observability worker integration architecture

Logging

When code running in a Worker needs to log messages, it will create a logger from the current execution context in the global scope. This enables the logger to be scoped so that any log messages it produces can be contextualised with an appropriate prefix. Let's look at the logger implementation for the Cloudflare context:

const log = ({ stdout, level, prefix, message, err }) => {
  /* eslint-disable no-empty */
  try {
    let formattedMessage = `[${prefix}] ${message}`

    if (err) {
      formattedMessage = `${formattedMessage}\n${JSON.stringify(err, replaceErrors)}`
    }

    context.log.push({
      timestamp: Date.now(),
      level,
      message: formattedMessage,
    })

    if (err) {
      stdout(`[${prefix}] ${message}`, err)
    } else {
      stdout(`[${prefix}] ${message}`)
    }
  } catch (ignore) {}
}

const logDebug = (prefix) => (message) => {
  /* eslint-disable no-empty */
  try {
    const debug = context.environment.variable('DEBUG')
    if (debug) {
      log({ stdout: console.info, level: 'debug', prefix, message })
    }
  } catch (ignore) {}
}

const logInfo = (prefix) => (message) =>
  log({ stdout: console.info, level: 'info', prefix, message })

const logError = (prefix) => (message, err) =>
  log({ stdout: console.error, level: 'error', prefix, message, err })

const createLogger = (prefix) => {
  return {
    debug: logDebug(prefix),
    info: logInfo(prefix),
    error: logError(prefix),
  }
}

The factory method for creating the logger is highlighted from line 51. That returns an object with specialised implementations of the various log methods that have been configured to use the prefix. Ultimately everything is a partially applied version of the generic log method defined on line 12. The persistence of the log messages is highlighted from line 21, which is just appending to a global list of log entries.
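
As a usage sketch (assuming the module exports createLogger and is reachable via the global context, shown next):

const logger = context.logger.createLogger('middleware/auth')
logger.debug('only emitted when the DEBUG environment variable is set')
logger.info('token accepted') // stdout: [middleware/auth] token accepted
logger.error('token rejected', new Error('expired')) // also captured in context.log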

Let's look at how the execution context has been modified:

module.exports.init = () => {
  global.context = {
    environment,
    log: [],
    logger: require('./logger'),
    async respond({ request, route, statusCode, statusText = '', body, headers }) {
      await observability.send({ request, route, statusCode })
      return new Response(body, {
        status: statusCode,
        statusText,
        headers,
      })
    },
    startTime: Date.now(),
  }
}

I've updated the context so that it needs to be initialised with a call to init (line 1), which creates the global context. The logger factory is made available in the execution context on line 23 and the global in-memory log persistence is on the line above.

The reason I changed the global context to require initialisation after being required is to ensure the start time (line 32) is captured correctly on each request (otherwise it's cached in the require cache).

The start time is used to calculate the request duration as a metric. The metrics are created dynamically in the observability abstraction, which happens just before sending the response to the client on line 25.

As I've changed the context to require initialisation, there's a corresponding change required in the Worker entry points:

- require('../context')
+ const context = require('../context')
addEventListener('fetch', (event) => {
+ context.init()
  const resource = require('../../../resources/account')
  event.respondWith(resource(event))
})

Note the context.init() call has to happen inside the event scope, as it turns out that calls to Date.now() executed in global scope return zero.

Abstraction layers

Observability is just one of the backing services that will be available to my API, so I want to create some abstraction layers that factor out the concerns appropriately:

Illustration of layered backing service integration

Going from bottom to top, the bottom-most layer represents the remote APIs, for observability it's in AWS, but it could be anywhere, or there could be APIs in multiple places.

The next layer up is the API access layer. This layer abstracts the API access so that integration specific things like the endpoint, protocol, host, API keys, required headers etc. are dealt with here, so calling code doesn't need to worry about them. This layer binds you to a specific underlying service, but you can swap it out accordingly if you have APIs in multiple locations, or indeed if you want to change where they are.

The next layer up deals with specifics relating to the type of service, so in this case it's observability, but there could be other types. This layer presents an abstraction of the service concepts and internally deals with details such as transforming the data into integration specific formats and sequencing the API calls, i.e. create stream -> write logs -> write metrics. It doesn't assume anything about the execution context, so data relating to things like the HTTP request and response have to be passed in, abstracted from any execution context specific implementation. Essentially this layer decouples you from both the underlying service and the execution context.
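
A minimal sketch of how this layer's send operation might sequence those calls (the api-access module and its function names here are assumptions for illustration, not the repo's actual exports):

const api = require('./api-access')

module.exports.send = async ({ apiId, resource, httpMethod, httpStatusCode, startTime, correlationId, log }) => {
  // transform the execution-context-agnostic inputs into integration-shaped data
  const { streamId, log: logEntries, metrics } = prepareData({
    apiId,
    resource,
    httpMethod,
    httpStatusCode,
    startTime,
    correlationId,
    log,
  })

  // sequence the API calls: create stream -> write logs -> write metrics
  await api.createLogStream({ streamId })
  await api.writeLogs({ streamId, logEntries })
  await api.writeMetrics({ metrics })
}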

The topmost layer binds you to the execution context and deals with execution context specific details, i.e. the Worker integration. This is where individual data elements get extracted from execution context specific concepts such as environment variables and the HTTP request and response.

Let's look at the interesting bits of the top layer, the Worker specific layer:

const observability = require('../backing-services/observability')
const response = await observability.send({
  apiId: apiFromRequest(request),
  resource: route,
  httpMethod: request.method,
  httpStatusCode: statusCode,
  startTime: context.startTime,
  // cf-request-id will be null when debugging via the Cloudflare dashboard, so create a unique id
  correlationId: request.headers.get('cf-request-id') || `cloudflare-worker-dashboard-${uuid()}`,
  log: context.log,
})

On line 14 the observability service integration layer is introduced as a dependency and is configured with details extracted from the Worker context:

  • apiId: a function that resolves the API id from the request; this looks at the environment to detect if this is a production or logical environment, as discussed in my post on enhancing the development experience
  • resource: the RESTful concept of a resource is mapped to the resolved route, as discussed in my post on middleware
  • the HTTP method and status code are mapped from the Worker request and response
  • the start time required for the duration metric is extracted from the execution context, and so is the log

The correlation id on line 22 is linked to the Cloudflare request id found in the request headers, so the request can be traced and contextualised from Cloudflare right back to the origin - to support troubleshooting use cases. However, this header doesn't exist when you're running in the workers.dev dashboard, so an alternative unique id is provided to distinguish actual requests from dashboard requests.

Let's move on to the observability service integration layer:

const prepareData = ({
  apiId,
  resource,
  httpMethod,
  httpStatusCode,
  startTime,
  correlationId,
  log,
}) => {
  const now = Date.now()
  const duration = now - startTime
  const streamId = `${datePrefix(now)}/${correlationId}`
  const earliestTimestamp = log[0].timestamp

  const metaLogEntry = {
    timestamp: earliestTimestamp,
    level: 'START',
    message: `Duration: ${duration}ms, API: ${apiId}, resource: ${resource}, httpMethod: ${httpMethod}, httpStatusCode: ${httpStatusCode}`,
  }
  const logEntries = [metaLogEntry, ...log].map(format(correlationId))

  const metrics = createMetrics({
    apiId,
    resource,
    httpMethod,
    httpStatusCode,
    duration,
  })

  return {
    streamId,
    log: logEntries,
    metrics,
  }
}

The majority of the work in this layer is preparing/transforming the data. The duration metric is calculated from line 78 using the current time and the start time.

A unique log stream id is generated on line 80 using the date and the correlation id. This is similar to how the logs from AWS Lambda functions are named, i.e. 2020/10/28/061187936e00009c158c396000000001, and it helps narrow down the window of logs when troubleshooting with a timeframe and/or a Cloudflare request id.

Again, similar to Lambda, a meta log entry is inserted into the beginning of the log stream summarising the request from line 83.

Next, the metrics are created (line 90), let's see what that does:

const createMetrics = ({ apiId, resource, httpMethod, httpStatusCode, duration }) => {
  let metrics = []

  const countMetric = {
    MetricName: 'Count',
    Unit: 'Count',
    StorageResolution: 1,
    Value: 1,
  }
  metrics = metrics.concat(
    createMetricsWithDimensions({
      apiId,
      resource,
      httpMethod,
      httpStatusCode,
      metric: countMetric,
    }),
  )

  const durationMetric = {
    MetricName: 'Duration',
    Unit: 'Milliseconds',
    StorageResolution: 1,
    Value: duration,
  }
  metrics = metrics.concat(
    createMetricsWithDimensions({
      apiId,
      resource,
      httpMethod,
      httpStatusCode,
      metric: durationMetric,
    }),
  )

  return metrics
}

I'm creating metrics to track the count of requests starting on line 108, and to track the duration of requests from line 125. These metrics are used as templates to create metrics augmented with dimensions (from lines 116 and 133). Let's have a look at the dimensions:

const createMetricsWithDimensions = ({ apiId, resource, httpMethod, httpStatusCode, metric }) => {
  const apiDimension = [
    {
      Name: 'Api',
      Value: apiId,
    },
  ]

  const resourceDimension = [
    {
      Name: 'Resource',
      Value: resource,
    },
  ]

  const methodDimension = [
    {
      Name: 'Method',
      Value: httpMethod.toUpperCase(),
    },
  ]

  const statusCodeDimension = [
    {
      Name: 'StatusCode',
      Value: httpStatusCode.toString(),
    },
  ]

  const all = [...apiDimension, ...resourceDimension, ...methodDimension, ...statusCodeDimension]
  return [
    { ...metric, Dimensions: apiDimension },
    { ...metric, Dimensions: resourceDimension },
    { ...metric, Dimensions: methodDimension },
    { ...metric, Dimensions: statusCodeDimension },
    { ...metric, Dimensions: all },
  ]
}

I'm creating 5 metrics from each template metric, one each for the API id, resource, HTTP method and status code, and then one that combines all of those dimensions. This will allow me to observe metrics for a specific dimension, i.e. how many 500 status codes in the last 5 minutes for API X, or on a combination of those dimensions, i.e. 401 status codes for API X on the account resource for the POST method in the last 5 minutes.
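
For example, the combined-dimension Count metric for a successful GET on the account resource would look like this in CloudWatch terms (the Api value is a placeholder):

{
  MetricName: 'Count',
  Unit: 'Count',
  StorageResolution: 1,
  Value: 1,
  Dimensions: [
    { Name: 'Api', Value: 'my-api' },
    { Name: 'Resource', Value: 'account' },
    { Name: 'Method', Value: 'GET' },
    { Name: 'StatusCode', Value: '200' },
  ],
}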

That about covers the implementation; I think it needs a demo to tie it all together.

Demo

To demo observability in action, I modified the auth middleware to introduce a random sleep period between 1 and 2 seconds so there's some deviation in duration across requests.
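
The sleep itself is the usual promise-over-setTimeout trick; a sketch, not the repo's exact code:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// in the auth middleware, before resolving the request:
await sleep(1000 + Math.floor(Math.random() * 1000)) // between 1 and 2 seconds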

I also have a little artillery load test script to run some load against my API for a couple of minutes that will vary the response to fail authentication (401), fail validation (400) or succeed (200) based on headers that are present, as shown in my post on middleware.

Watch the demo:

There you go, that's how you can add observability to your Cloudflare Workers API.

The repo for this post works out of the box as long as you configure it with your Cloudflare and AWS account details, as per the README.

Wrapping up

That's the final post in this series. I hope you've enjoyed it and the journey to deliver APIs at the edge with Cloudflare Workers. While I was writing this series Cloudflare announced the beta of "Durable Objects"; I was hoping to get access and finish with a post on that, but sadly I didn't. I think global data persistence is the final frontier for edge-based computing and serverless architectures, and I keep hearing about Fauna and their efforts to create a global data store. I think the ultimate experience of cloud computing should be to simply deploy code to a cloud service and have it run at scale, near your end users, without having to concern yourself with choosing regions and the trade-offs of multi-region architectures. That's the dream, and I don't think we're very far away. I'm excited about that future, and I'll be keeping close tabs on progress, maybe writing a post or two along the way.

Check out the other posts in this series:

  1. Delivering APIs at the edge with Cloudflare Workers
  2. Blue / Green deployments for Cloudflare Workers
  3. Enhancing the development experience for Cloudflare Workers
  4. A middleware architecture for Cloudflare Workers
  5. API observability for Cloudflare Workers