How We Transfer (Software) Risk Using Serverless
By: Jay Zeschin on Mar 22, 2021 10:00:25 PM
This post is adapted from a talk I gave at a 2020 CTO Summit event on emerging trends in technology.
Highwing is a data platform for commercial insurance, providing digital tools to connect brokers and carriers more efficiently. The problems we’re solving tend to involve lots of custom integrations, data flow, what I'd call "medium data" (it tends to be complex but not particularly high-volume), interfacing with large enterprise clients, and navigating compliance and security concerns.
Risk and challenging problem spaces are not particularly unusual for an early-stage company. We have taken some hints from the broader insurance industry and use serverless technologies to delegate risks in order to focus more acutely on creating value for our users and customers. This post will dive into our journey with serverless application architecture, with a particular eye towards what we expected to get out of this approach when we first started and how those expectations fared as we got further along.
Highwing began almost three years ago. We incubated inside of an 800-person insurance brokerage and built our initial product offerings 100% from scratch, a total greenfield. So when we started to think about how we would develop our product, we saw some interesting parallels between insurance and technology strategy.
Insurance is fundamentally the act of contractually transferring liability (or risk) to a third party, usually in exchange for money, which allows us to reduce the financial downside from some future catastrophe (e.g., fire, flood, car crash). Risk transfer enables risk-taking, which is critical for innovation to occur. It's hard to try something new when potentially exposed to an unlimited downside!
Minimizing vulnerability to unnecessary risk while maximizing the benefit from areas of exposure is a combination of art and science, and it's also what good technologists do implicitly every day. As a startup, we're heavily exposed to risk across three areas:
- Execution (e.g., can we build it on time, before we run out of money, with the team & tech we have at hand?)
- Operation (e.g., can we run it effectively, securely, and at a reasonable cost?)
- Customer (e.g., will anyone use our product, does it solve a real problem, and will they be willing to pay for it?)
What drove us towards serverless was the concept that we could meaningfully reduce those exposures. We hoped to do that in three ways. First, we wanted to make the product easy to operate, both in terms of the individual skills and time required. Second, we wanted to tie our infrastructure and service costs to actual usage. Finally, we wanted to increase the speed at which we could take ideas from concepts to delivering value to our customers.
There are lots of definitions of serverless, so let's narrow in on what I mean here:
No managing server instances
This point might be obvious, but we don't provision/manage server instances at the OS tier. We don't staff for it, we don't want that responsibility, and we want our engineers focused on our competitive differentiators, which are primarily at the application tier. This focus relieves us from the need to address a whole class of operational and compliance challenges at which others (specifically hyper-scale cloud providers) excel.
Elastic scalability & usage-based pricing
We want infrastructure that can scale up/down automatically and bill us appropriately for our usage. Our usage patterns tend to be spiky, and many of our workloads are batch, so the benefits we see from this approach are highly impactful.
Infrastructure coupled to code, driven by application needs
In what (I think) has become common practice, our application architecture and the infrastructure that runs it move in lockstep, with our infrastructure defined in code and delivered continuously. Application decisions drive our infrastructure decisions directly.
By these criteria, managed services (e.g., Cognito, Sendgrid), functions-as-a-service (Lambda, Cloud Functions), and managed containers (GKE, AKS, ECS) all count as serverless. I'm going to focus on FaaS since that's our largest investment (and I believe what most people consider when they think "serverless"); however, many of the points also apply to managed services and managed containers. We're a big AWS user and utilize many of the core AWS serverless primitives.
It's been about two years since we first started down this road. Let's take a look at those three goals and see how we're doing against each of them.
Takeaway 1: A novel deployment model is not a substitute for a well-thought-out architecture
We had lots of early wins in terms of our delivery speed, though if we look at the overall path, I think the results are a bit more nuanced. Putting together a basic app and getting it deployed is a snap. Building on it is a snap. Deployments are (mostly) a snap. Compared to the complexity usually associated with getting a traditional app built and deployed, the fact that we've been focused solely on our own application's functionality has been a tremendous time-saver. Plus, the serverless tooling (Serverless Framework, AWS Amplify) has generally received lots of attention over the last few years and is quite good.
The deployment model will affect the associated architecture; however, a novel deployment model is not a substitute for a well-thought-out architecture. Much of the opinion we've seen focuses on "staying so small that complex architectures are not needed," which I don't believe is a sustainable strategy. It works for one-off utility functions, but not when the goal is exposing a broadly consistent interface. Most of the serverless frameworks that do exist are focused on deployment mechanics, not architecture. That said, we've found that there are architectural patterns that are a natural fit for serverless. We've seen significant value from event-driven architecture (which is a very natural fit for the serverless deployment model and many cloud provider primitives). We've also achieved greater modularity and testability when we apply hexagonal/clean architecture principles to isolate ourselves from the specifics of cloud SDKs, data stores, and configuration. Mileage may vary, but what matters most is selecting an architecture pattern that can be applied consistently while scaling in complexity and volume.
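To make the hexagonal/clean architecture point concrete, here is a minimal sketch of the shape we mean, not our production code. Every name in it (`Submission`, `SubmissionStore`, `record_submission`, `make_handler`) is hypothetical and chosen for illustration. The core use case depends only on a small port, and the function entry point is a thin wrapper around it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


@dataclass(frozen=True)
class Submission:
    """Illustrative domain object; the fields are assumptions."""
    submission_id: str
    broker: str
    premium_cents: int


class SubmissionStore(Protocol):
    """Port: the only thing the core logic needs from storage."""

    def save(self, submission: Submission) -> None: ...


class InMemoryStore:
    """Adapter for tests. A cloud-backed adapter would implement the
    same `save` method and live at the edge of the application."""

    def __init__(self) -> None:
        self.saved: Dict[str, Submission] = {}

    def save(self, submission: Submission) -> None:
        self.saved[submission.submission_id] = submission


def record_submission(event: dict, store: SubmissionStore) -> dict:
    """Core use case: validate the event and persist it via the port.
    It knows nothing about the FaaS runtime or any cloud SDK."""
    premium = int(event["premium_cents"])
    if premium < 0:
        raise ValueError("premium_cents must be non-negative")
    submission = Submission(event["submission_id"], event["broker"], premium)
    store.save(submission)
    return {"status": "recorded", "submission_id": submission.submission_id}


def make_handler(store: SubmissionStore) -> Callable[[dict, object], dict]:
    """Build the thin function entry point around the core use case."""

    def handler(event: dict, context: object) -> dict:
        return record_submission(event, store)

    return handler
```

With this shape, a unit test can call `make_handler(InMemoryStore())` and exercise the whole path without touching a cloud account, while the deployed function wires in a real adapter instead.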
Takeaway 2: Optimize for the demand paths of your engineering team
Sometimes the phrase "paving the cowpaths" is used in a derogatory fashion. Still, I think cowpaths (or demand paths) can help us understand emerging patterns and make them highly productive for engineers. The benefit of something like Rails, which was extracted from production code, is that many common patterns are baked in, so there's no need to remake trivial decisions for every new feature. As with larger architectures, we haven't found many of these baked-in conventions in the FaaS realm just yet, but we found it essential to make the decisions early to reduce our engineers' cognitive load. Examples include: How do we do development environments (global, per-feature, per-engineer, something else)? Where/how do we test? What's our deployment pipeline? How do we roll out and roll back safely? How do we monitor our apps (when they're short-lived and our standard tracing tools don't work)? Setting these early helps drive productivity by reducing the number of questions that we must answer to build a feature.
Takeaway 3: Fine-grained on-demand billing can dramatically reduce costs
Tying costs to actual usage has probably been the most tangible success for us. We've found that switching to on-demand workloads is just so cheap for so many use cases, and the cost savings are very noticeable. It's quick and inexpensive, in both machine and human terms, to fire up something event-driven to respond to a trigger or run a scheduled task. Development and testing environments are great examples: we can finally afford to have parity across all environments, and we don't have to pay through the nose for it!
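A rough back-of-the-envelope model shows why spiky and mostly idle workloads benefit so much. The sketch below assumes the common FaaS billing shape (a charge per request plus a charge per GB-second of compute) and compares it with an always-on instance billed by the hour. The function names are hypothetical, and the rates in the usage note are placeholders for illustration, not any provider's actual pricing.

```python
def on_demand_monthly_cost(
    invocations: int,
    avg_seconds: float,
    memory_gb: float,
    price_per_gb_second: float,
    price_per_million_requests: float,
) -> float:
    """Usage-based cost: pay only for compute time and request count."""
    compute = invocations * avg_seconds * memory_gb * price_per_gb_second
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


def always_on_monthly_cost(
    instances: int, hourly_price: float, hours_per_month: float = 730.0
) -> float:
    """Fixed cost: pay for every hour, whether or not work arrives."""
    return instances * hourly_price * hours_per_month


def break_even_invocations(
    fixed_monthly_cost: float,
    avg_seconds: float,
    memory_gb: float,
    price_per_gb_second: float,
    price_per_million_requests: float,
) -> float:
    """Monthly invocation count at which both models cost the same."""
    per_invocation = (
        avg_seconds * memory_gb * price_per_gb_second
        + price_per_million_requests / 1_000_000
    )
    return fixed_monthly_cost / per_invocation
```

For example, plugging in placeholder rates of 0.00002 per GB-second, 0.20 per million requests, and 0.05 per instance-hour, a job that runs 200,000 times a month for 2 seconds at 0.5 GB costs far less on demand than one always-on instance. An idle development environment, with zero invocations, costs nothing under the first model and the full monthly amount under the second.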
We've also avoided most fixed overhead in terms of long-term contracts, usage minimums, and high flat-fee subscriptions. I thought there would be more here that would require us to make one-way-door-style decisions, but that hasn't been the case. One resource that's on-demand but not elastic is network infrastructure: it's not easy to spin up/down on demand, so making it economical may require some backflips. Datastores used to be the same way, but Aurora Serverless and DynamoDB on-demand have relatively recently broken that mold a bit. When organizations continually pull on the "don't run it yourself" thread, they will likely run up against some critical services that are inelastic and require longer-term commitments. These will seem extra strange when they're the only resources in their ecosystem billed that way. We've tried to think critically through these contracts and do a more in-depth analysis before proceeding, ensuring that they meet both the strategic bar and an urgent need before committing.
Takeaway 4: Complex systems are still complex, and code is still code
Finally, the ease-of-operations picture has turned out differently than we anticipated. There are many examples of serverless functions as a kind of "ops glue" or trivial one-off triggers. There's much less to be found about complex systems, how to manage testing distributed systems that cannot run locally, or how to deal with monorepos in multiple languages, among other challenges. All that to say, patterns do exist, but there's a fair bit of assembly required to translate them to this kind of deployment model and use them to build cohesive services with well-defined interfaces. It's easy to end up with a system that resembles one of those display cars in malls, constructed entirely out of tiny Legos, where the whole thing is an interface. We've had to be aggressive about defining service boundaries, setting up contracts, and building the abstractions the team needs to reason easily about complex systems. These abstractions from the runtime model also enable escape hatches when something doesn't fit into a FaaS workflow: when a function must run longer than 15 minutes or needs resources that aren't available in a Lambda runtime, for example.
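One way to picture the escape hatch is to keep the job logic free of runtime types and give it more than one entry point. This is a minimal sketch under that assumption. The job (`reconcile_batch`), its record shape, and the CLI wrapper are all hypothetical stand-ins rather than our actual services.

```python
import argparse
import json
from typing import List, Optional, Sequence


def reconcile_batch(records: Sequence[dict]) -> dict:
    """Core job logic (hypothetical). It takes plain data and returns
    plain data, so it does not care which runtime invoked it."""
    total = sum(int(record["amount_cents"]) for record in records)
    return {"count": len(records), "total_cents": total}


def lambda_handler(event: dict, context: object) -> dict:
    """Function entry point, for batches that fit within the
    platform's time and resource limits."""
    return reconcile_batch(event["records"])


def main(argv: Optional[List[str]] = None) -> int:
    """Command-line entry point, e.g. for a container task, used when
    a batch outgrows the function limits. Reads records from a file."""
    parser = argparse.ArgumentParser(description="Run reconcile_batch outside FaaS")
    parser.add_argument("path", help="JSON file containing a list of records")
    args = parser.parse_args(argv)
    with open(args.path, encoding="utf-8") as fh:
        records = json.load(fh)
    print(json.dumps(reconcile_batch(records)))
    return 0
```

Because both entry points delegate to the same function, moving a workload from one runtime to the other is a packaging and deployment change rather than a rewrite.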
Relatedly, we've found that doing infrastructure-as-code means creating a lot more code. One of our most significant language investments by lines of code has been HCL (Terraform's configuration language). While moving functionality from imperative application code to declarative resources is a net positive, the declarative codebase is still a liability. It needs to be tested, maintained, and refactored the same as any other codebase. Some abstractions help for sure, but there's still an irreducible amount of heavy lifting required.
The increasing reality is also that an organization's engineers will have to be proficient in several language ecosystems to implement anything meaningfully complicated. I think this has been the trend for a while, but it's even more true in a serverless world. This requirement has implications for hiring, training, and other organizational matters beyond this article's scope.
Concluding Thoughts: Serverless will eventually be boring (and that’s a good thing)
So the natural question from here is, "Has it been worth it?" Let's look at that through three lenses:
Has it enabled us to get where we want to be? Yes, though not without hiccups. Serverless is not a panacea; it won't magically solve all technological and organizational challenges, but no technology will.
Would we retake the same path if given a choice? Yes. My background is heavily in the Ruby/Rails world, and I spent a lot of time at the beginning of Highwing wondering if our problem space was a "Rails-shaped" problem that could work with a simpler architecture. For many startups, I think something boring is still very much the right choice. For startups where that decision is less clear (Highwing included), I think it always makes sense to avoid the bleeding edge. With the establishment of patterns and architectures, serverless is rapidly moving away from being on that edge. It's also possible to have it both ways, since this is not an all-or-nothing proposition. A Rails app with serverless event handlers, for example, is very similar to what we use today.
Do we believe this is the direction the world is headed? Yes, absolutely. The upsides of this kind of approach are substantial. As the tooling continues to mature, it will become a superpower for teams who can relentlessly transfer risk that isn't central to their value proposition and focus on their competitive differentiators.