TL;DR

On a recent project I was looking for a simple way to run a serverless API in front of a large-scale, high-throughput service. The services I'd usually turn to have become frustrating: there are too many options and trade-offs to weigh, and too many hoops to jump through to get the features and availability you want. It felt like there should be a simpler way to deploy highly available, scalable APIs using serverless technology.

Someone mentioned Cloudflare Workers as a way of running code at scale, so I took a look. In this post I share that journey, the use cases for Workers, and some of the constraints and challenges I faced, with anecdotes from the experience. This is a multi-part series that culminates in showing you how to deliver your next API with Cloudflare Workers.

The idea

Cloudflare Workers is a serverless cloud service that runs at "the edge" and can process all of the HTTP requests and responses going to and from your web estate. Workers are typically used to run small snippets of code that inspect and modify HTTP requests and responses, for example inspecting headers or performing redirects. There are loads of things you can do.

Since a Worker is just code we can run on an arbitrary HTTP request, we could in theory just use them as our API resources.
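To make that concrete, here is a minimal sketch of a Worker acting as a single API resource. The /hello path, the name query parameter and the jsonResponse helper are all made up for illustration; only Request, Response and URL come from the Worker runtime.

```javascript
// Hypothetical resource handler: answers GET /hello and rejects everything else.
async function handleRequest(request) {
  const url = new URL(request.url);

  if (url.pathname !== '/hello') {
    return jsonResponse({ error: 'Not found' }, 404);
  }
  if (request.method !== 'GET') {
    return jsonResponse({ error: 'Method not allowed' }, 405);
  }

  const name = url.searchParams.get('name') || 'world';
  return jsonResponse({ message: `Hello, ${name}!` }, 200);
}

// Small helper (an assumption of this sketch) to build JSON responses consistently.
function jsonResponse(body, status) {
  return new Response(JSON.stringify(body), {
    status,
    headers: { 'Content-Type': 'application/json' },
  });
}

// In a Worker using the service worker syntax, the handler would be registered like this:
// addEventListener('fetch', (event) => event.respondWith(handleRequest(event.request)));
```

Keeping the handler as a plain function of a Request that returns a Response also means it can be exercised outside of Cloudflare, which helps with local feedback.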

The global scale, resilience and performance achieved by running an API on such a massively globally distributed network without having to worry about multi-region architectures and other deployment complexity is very compelling.

This feels more like how running services in the cloud should be. Easily deploy some code and run it globally. That's the future.

What are the use cases?

Cloudflare Workers suit API requests that need a bit of CPU processing, can rely on eventually consistent data, and/or back on to other highly available and scalable services.

Ultimately your API is only as reliable as the least reliable thing in your architecture. If your Workers back on to services that are less likely to be available, then this approach might not be for you unless you can deal with that through eventual consistency, caching, Worker-based failure detection and load distribution, disabling features, or something similar.

On the matter of state: Workers KV is an eventually consistent, read-optimised key/value store you can use to manage state in your Workers. If it suits your use case you could manage your state there; otherwise you'll need to integrate with an external state service that is suitably available and scalable.
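As a rough sketch of what read-optimised state in KV can look like, the function below reads a value from KV and only goes back to the system of record on a miss. The profile key scheme, the loadFromOrigin function and the five minute TTL are assumptions for illustration; kv stands for a KV namespace binding, and the get/put calls follow the KV API.

```javascript
// `kv` is a Workers KV namespace binding (or any object with the same get/put shape).
// `loadFromOrigin` is a hypothetical function that reads from the system of record.
async function getProfile(kv, userId, loadFromOrigin) {
  const key = `profile:${userId}`;

  // KV parses the stored value for us when asked for the 'json' type.
  const cached = await kv.get(key, 'json');
  if (cached !== null) {
    return cached;
  }

  // Cache miss: load from the origin and keep a copy in KV for later reads.
  const fresh = await loadFromOrigin(userId);
  await kv.put(key, JSON.stringify(fresh), { expirationTtl: 300 }); // TTL in seconds (assumed value)
  return fresh;
}
```

Because KV is eventually consistent, a read shortly after a write may still return the old value, so this pattern only fits data that can tolerate being slightly stale.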

Workers as filters

This use case is about providing a utility filter for other services in your architecture. For example, suppose that in a microservice architecture each service handles its own authentication verification and authorisation.

The overhead of verifying the authentication tokens is duplicated many times. While each service may want to perform its own specific authorisation, you could improve the performance of your solution by verifying authentication once at the edge and removing that responsibility from the other services.

Of course, this solution requires that traffic can only reach your services via Cloudflare (for example, by using IP restrictions), and therefore through your filters.
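A filter like this could be sketched as follows. The verifyToken function, the X-Authenticated-Subject header and the sub claim are assumptions for illustration; how tokens are actually verified depends on your identity provider.

```javascript
// Builds a request handler that authenticates once at the edge and then
// forwards the request. `verifyToken` is a hypothetical async function that
// resolves to the token claims or throws if the token is invalid.
function createAuthFilter(verifyToken, forward = fetch) {
  return async function handleRequest(request) {
    const header = request.headers.get('Authorization') || '';
    const match = /^Bearer\s+(.+)$/i.exec(header);
    if (!match) {
      return new Response('Missing bearer token', { status: 401 });
    }

    let claims;
    try {
      claims = await verifyToken(match[1]);
    } catch (err) {
      return new Response('Invalid token', { status: 401 });
    }

    // Pass the verified identity downstream. Using set() overwrites any value
    // a caller tried to supply for this (assumed) header.
    const headers = new Headers(request.headers);
    headers.set('X-Authenticated-Subject', String(claims.sub));
    return forward(new Request(request, { headers }));
  };
}
```

The downstream services can then trust the header and concentrate on their own authorisation rules, provided they are only reachable through the filter.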

Workers as abstraction layers

This use case is about creating an abstraction layer around another service, also known as a "wrapper API" or proxy. Take a typical service integration.

As in software engineering generally, we might want to apply SOLID principles to this kind of integration, especially if it's a third-party integration. The abstraction layer enables some of these principles and de-risks the integration, while allowing you to augment the solution to plug any feature gaps.

This might also be useful if you want to mitigate vendor lock-in and/or provide a migration path from a short-term solution to a longer-term one. You could also perform Worker-based failure detection and load distribution across services.
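As one possible shape for Worker-based failure detection, the sketch below tries a list of backends in order and moves on when one fails. The backend origins, the rule of treating any 5xx status as a failure, and the 502 fallback are assumptions of this sketch rather than anything Cloudflare prescribes.

```javascript
// Try each backend origin in order; fall through on network errors or 5xx responses.
// `backends` is an array of origins such as 'https://primary.example.com' (hypothetical).
async function fetchWithFailover(request, backends, doFetch = fetch) {
  const incoming = new URL(request.url);

  // Buffer the body once so it can be replayed against more than one backend.
  const hasBody = !['GET', 'HEAD'].includes(request.method);
  const body = hasBody ? await request.arrayBuffer() : undefined;

  for (const origin of backends) {
    const target = new URL(incoming.pathname + incoming.search, origin);
    try {
      const response = await doFetch(
        new Request(target.toString(), {
          method: request.method,
          headers: request.headers,
          body,
        })
      );
      if (response.status < 500) {
        return response;
      }
    } catch (err) {
      // Network-level failure: try the next backend.
    }
  }
  return new Response('All backends unavailable', { status: 502 });
}
```

Retrying writes against a second backend is only safe if the operation is idempotent, so in practice you might restrict failover to read requests.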

Creating abstraction layers around large-scale, high-throughput services is sometimes difficult to justify in terms of cost, operational overhead and complexity, but with Cloudflare Workers you could do this with relatively little overhead or cost.

What are the constraints?

Cloudflare have published the Worker platform limits and they're worth a peruse if you're thinking about using it. Some of the limits can be significantly improved by moving to the "bundled" version for a reasonable $5 a month. Some limits of note are:

CPU runtime

At first glance 10-50ms seems low, but it's important to note that this is the CPU time you use on the edge servers per request, not your request duration. While your Worker is waiting for asynchronous I/O to complete, it isn't counting towards your CPU usage. For example, I know a Worker can comfortably perform the following tasks within the 50ms limit:

Decode an OAuth token
Download the signing keys from the token issuer
Cache the signing keys in KV
Verify the token
Validate the request payload
Wait for a backing service in AWS to persist some data
Cache a copy of the persisted data in KV
Dispatch logs and metrics to another backing service
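To give a feel for how little CPU work some of these steps involve, here is a sketch of the first one, decoding a token, assuming the OAuth token is a JWT. It only decodes; it does not verify the signature, which is a separate step. The function names are made up for illustration.

```javascript
// Decode (but do NOT verify) a JWT. Uses atob and TextDecoder, which are
// available in the Worker runtime.
function decodeJwt(token) {
  const parts = token.split('.');
  if (parts.length !== 3) {
    throw new Error('Expected a three-part JWT');
  }

  const decodePart = (part) => {
    // Convert base64url to standard base64 and restore the padding.
    const base64 = part.replace(/-/g, '+').replace(/_/g, '/');
    const padded = base64 + '='.repeat((4 - (base64.length % 4)) % 4);
    const bytes = Uint8Array.from(atob(padded), (c) => c.charCodeAt(0));
    return JSON.parse(new TextDecoder().decode(bytes));
  };

  return {
    header: decodePart(parts[0]), // typically names the algorithm and key id
    payload: decodePart(parts[1]), // the claims
    signature: parts[2], // left encoded; needed later for verification
  };
}

// True when the token carries an `exp` claim that is in the past.
function isExpired(payload, nowSeconds = Math.floor(Date.now() / 1000)) {
  return typeof payload.exp === 'number' && payload.exp <= nowSeconds;
}
```

The decoded header is what tells you which signing key to download from the issuer before verifying the token.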

Cloudflare also recently announced Workers Unbound, an offering with much higher CPU limits, for cases where this limit is a problem.

Programming environment

You have two options for programming Workers: JavaScript or any WebAssembly-compatible language. A quick look at both showed that the JavaScript approach seemed more mature and benefited from better community engagement and tooling support. The Worker JavaScript environment is aligned to the Web Workers API (specifically Service Workers), so writing JavaScript for Workers is more akin to writing a worker in a browser than to writing for a server-side environment like Node.js. When adding dependencies you'll need to check which APIs they depend on, as they may not run in a Worker context. I found issues with some dependencies that used Node crypto APIs and HTTP APIs other than fetch. It's worth reviewing the runtime APIs to understand what's available in a Worker context.
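Where a dependency only exists to make HTTP calls through Node's own HTTP modules, one option is to replace it with a thin wrapper around fetch. The sketch below is a hypothetical postJson helper, not part of any library; it relies only on fetch and the Response it returns.

```javascript
// Minimal JSON-over-HTTP helper built on fetch, so it runs in a Worker
// without any Node-specific HTTP modules.
async function postJson(url, payload, doFetch = fetch) {
  const response = await doFetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });

  if (!response.ok) {
    throw new Error(`POST ${url} failed with status ${response.status}`);
  }
  return response.json();
}
```

Accepting the fetch implementation as a parameter is a design choice of this sketch; it keeps the helper easy to test without a network.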

Worker script size

At the time of writing, the maximum size for a Worker script was 1MB. This shouldn't be an issue if you use webpack to bundle your JavaScript and use a (smaller) script per Worker rather than sharing a (large) script across all Workers. We did hit this limit when we added the moment package to perform some date processing: the default package size is very large due to the locale files, but you can optimise it (or replace it).
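If the date processing is simple, replacing the package can be as small as a couple of functions over the built-in Date object. The helpers below are hypothetical examples of that route; they work in UTC with ISO 8601 timestamps and are not a general substitute for a date library.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Add a number of whole days to an ISO 8601 timestamp, returning an ISO string in UTC.
function addDays(isoTimestamp, days) {
  const date = new Date(isoTimestamp);
  if (Number.isNaN(date.getTime())) {
    throw new RangeError(`Invalid timestamp: ${isoTimestamp}`);
  }
  return new Date(date.getTime() + days * MS_PER_DAY).toISOString();
}

// Number of complete days between two ISO 8601 timestamps.
function daysBetween(fromIso, toIso) {
  return Math.floor((new Date(toIso) - new Date(fromIso)) / MS_PER_DAY);
}
```

Anything involving time zones or localised formatting is where a library starts to earn its bundle size.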

Note: the script size limit is no longer 1MB; it was recently raised to 25MB.

What are the challenges?

All platforms have challenges, especially with more complex requirements such as being available all of the time. While implementing an API in Cloudflare Workers you might face these challenges:

Delivery experience

Deliverability of the API is crucial. From the outset we want to know how we're going to deploy to production and how we can roll back/forward/sideways with zero downtime. We'll also have services deployed on other platforms that we need to integrate with, things like databases, so we want a consistent tooling experience and process around that. We'll also have CI/CD pipelines delivering all this stuff using blue/green deployments. All of this will inform the local development process too.

Update: see my post on Blue / Green deployments for Cloudflare Workers

Development experience

When building solutions, engineers want a fast local feedback cycle so they can quickly iterate on their work and deliver efficiently. Working with cloud services can significantly slow down that cycle while you're waiting for code to deploy and execute.

Cloudflare provides the wrangler CLI to support local development and publishing of Workers. Its dev mode aims to enable a faster local feedback cycle by listening for requests on a local server.

However, the ability to easily debug the code using local development tools such as VS Code is key to effective and efficient development.

It’s also worth considering the consistency of tooling between local development and CI/CD processes.

Update: see my post on Enhancing the development experience for Cloudflare Workers

API architecture and routing

When building APIs, your service/framework typically allows you to define API routes based on properties of the HTTP request. For RESTful APIs, the HTTP method and path are typically used to map requests to resource handlers. Popular API frameworks such as Express and ASP.NET Core also let you define middleware, which enables you to factor out common tasks into pipelines that can be applied in sequence to multiple API routes.
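Workers don't come with that routing and middleware model built in, but the idea is small enough to sketch. The createRouter, compose and withRequestId names below are made up for illustration, and the router only matches exact paths; it is a sketch of the concept, not the architecture from the follow-up post.

```javascript
// Wrap a handler in middleware so the first middleware in the list runs outermost.
// Each middleware has the shape (request, next) => Response.
function compose(middleware, handler) {
  return middleware.reduceRight(
    (next, mw) => (request) => mw(request, next),
    handler
  );
}

// Minimal router keyed on HTTP method and exact path.
function createRouter() {
  const routes = [];
  return {
    add(method, path, handler, middleware = []) {
      routes.push({ method, path, handler: compose(middleware, handler) });
      return this;
    },
    async handle(request) {
      const { pathname } = new URL(request.url);
      const matches = routes.filter((route) => route.path === pathname);
      if (matches.length === 0) {
        return new Response('Not found', { status: 404 });
      }
      const route = matches.find((r) => r.method === request.method);
      if (!route) {
        return new Response('Method not allowed', { status: 405 });
      }
      return route.handler(request);
    },
  };
}

// Example middleware: copies a (hypothetical) request id header onto the response.
const withRequestId = async (request, next) => {
  const response = await next(request);
  const headers = new Headers(response.headers);
  headers.set('X-Request-Id', request.headers.get('X-Request-Id') || 'unknown');
  return new Response(response.body, { status: response.status, headers });
};
```

A route would then be registered with something like router.add('GET', '/items', listItems, [withRequestId]), where listItems is your own handler, and the Worker's fetch listener would delegate to router.handle.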

Update: see my post on A middleware architecture for Cloudflare Workers

Operations experience

Once the API is deployed, we want to keep an eye on it and make sure we can react to any issues.

Cloudflare offers some basic Worker metrics that you can periodically query via their GraphQL API, but they won't give you an API-centric view or the ability to easily trigger alerts, so some custom metrics will be required to monitor the API effectively.

By default, log messages in Workers are ephemeral and simply sent to the standard output/error streams. This is ok to support local development and debugging in the Cloudflare workers.dev dashboard, but it would be useful to persist these logs from production workloads to support potential troubleshooting scenarios.
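One way to persist logs is to buffer entries during a request and ship them to a backing service in a single call. The sketch below assumes a hypothetical log collection endpoint that accepts a JSON array; createLogger and its field names are made up for illustration.

```javascript
// Buffers structured log entries and sends them as one batch.
// `endpoint` is the URL of a hypothetical log collection service.
function createLogger(endpoint, doFetch = fetch) {
  const entries = [];

  const log = (level, message, fields = {}) => {
    entries.push({ timestamp: new Date().toISOString(), level, message, ...fields });
  };

  return {
    info: (message, fields) => log('info', message, fields),
    error: (message, fields) => log('error', message, fields),

    // Send everything buffered so far; does nothing if the buffer is empty.
    async flush() {
      if (entries.length === 0) return;
      const batch = entries.splice(0, entries.length);
      await doFetch(endpoint, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(batch),
      });
    },
  };
}

// In a Worker's fetch listener, the flush can run after the response is sent:
// event.waitUntil(logger.flush());
```

Including fields such as the route, response status and duration in each entry is one way to derive API-centric metrics from the same data. Note that this sketch drops the batch if the POST fails.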

Update: see my post on API observability for Cloudflare Workers

Coming next

In coming posts I'll address the challenges above and demonstrate how you can deliver your next API using Cloudflare Workers.

Make sure you check out the other posts in this series:

Delivering APIs at the edge with Cloudflare Workers
Blue / Green deployments for Cloudflare Workers
Enhancing the development experience for Cloudflare Workers
A middleware architecture for Cloudflare Workers
API observability for Cloudflare Workers