I recently shipped a service that simplifies the implementation of custom business rules for a lot of my applications. This post covers the original problem I had, some similar solutions in other applications, and how I created a more general solution that most of my applications can use.

Introduction

I recently shipped a new version of an existing feature at work that leverages custom business rules to determine how we perform some action with our product. The details of the action aren’t terribly interesting, as it’s just standard business logic.

When we first shipped this feature a year ago, we had a concept of “attaching” a “condition” to some piece of configuration, which enabled conditional application of some process, where all the remaining parameters were bundled with the configuration of that process. At runtime, we enumerated the process configuration, the related conditions (“constraints”), and if any constraint was satisfied, we would immediately apply the changes specified in the parent configuration. This had a few interesting properties:

No recursion. Recursion would have been a nightmare in this feature, so we explicitly declared it unsupported
Because of the previous point, the design included the concept of a “priority”, where we would “accept” the first entry that was “matched”
A condition could easily be “defined” once, and “attached” many times
The configurations could be “scheduled” with start and end dates

I’ve seen so many things built this way that I consider it to be a system design pattern. The overall logic for the feature is extremely simple, which is great, but the business rules were extremely limited.

The concept of rules that I originally implemented was laughable, but it was also done that way so we could ship the first version quickly. A user could essentially specify a sequence of conditions that were executed in a logical OR evaluation. The individual conditions were also limited. A user could specify a named value that was well-known within the system, along with a numeric lower- and upper-limit on the value that was being “resolved” (looked up). If we failed to find the value, or could not parse it correctly (as a number), then the entire condition failed, and we would fall through to the next one. A hypothetical example of this is shown below, using a syntax that is similar to Python.

from app_system import get_well_known_value  # hypothetical lookup helper

def between(well_known_val, lower_limit, upper_limit):
    """
    Takes `well_known_val` as a string, attempts to convert to a float,
    and compares to the floats `lower_limit` and `upper_limit`.
    Returns `True` if `well_known_val` can be converted to a float, and
    is bounded by the range [lower_limit, upper_limit]; otherwise
    `False`.
    """
    try:
        num = float(well_known_val)
    except (TypeError, ValueError):
        # A missing (None) or non-numeric value fails the condition.
        return False
    return lower_limit <= num <= upper_limit

def match_rule(unit_id, ruleset):
    """
    Takes a `unit_id`, which indexes from a value to its well-known
    member list, and an active ruleset. If a rule contained within
    the ruleset has conditions, the rule is returned if and only if
    one of the conditions is satisfied. If there are no conditions,
    the rule is considered to be an "unconditional" rule, and is always
    matched. If no rules are matched, the result is `None`.
    """
    for rule in ruleset:
        if rule.conditions:
            for cond in rule.conditions:
                # Each condition names the value it checks (e.g. "Price")
                # and carries its own limits; these attribute names are assumed.
                well_known = get_well_known_value(unit_id, cond.name)
                if between(well_known, cond.lower_limit, cond.upper_limit):
                    return rule
        else:
            return rule

    return None

This is an incredibly naive implementation, but it turns out to be fairly useful for a wide variety of cases, assuming you have:

A basic concept of well-known values in your system
Exclusively numeric data

I shipped the original version of this feature in a period where my team really needed to get something in place for an important business initiative. The trouble was that it was a terribly specific implementation that didn’t scale well to other business needs, for a variety of reasons.

This was the first time I shipped a feature like this in the organization, and I basically followed the same pattern that many of our other applications followed for achieving the same thing. This was also when I started thinking about how we might go about solving the problem differently. At that time, I still didn’t realize how often we needed to ship features like this, and also didn’t know very much about the rest of the application where I implemented these changes. I wanted to see how prevalent this “dynamic rule” problem was before I started building something more robust.

Other Implementations

It turns out, this was not the only place where we’d implemented something like the logic above. Far from it, in fact. In other parts of the system, we had much more sophisticated versions of more or less the same thing. Even within the same application, there was a much smarter rule engine that perhaps could have been employed, but the architecture of that rule engine was again far too specific to be useful outside the context where it existed. We did have support for more sophisticated comparisons in other places, and even support for chaining conditions with logical AND instead of logical OR. Those more sophisticated versions enabled our collective system to create some fairly comprehensive rules, especially considering how primitive they were in their designs.

But they weren’t enough. It turns out, once you give your users a version of a rule evaluation engine or pipeline, they get upset with its limitations, and every version we had came with limitations. Many of them had been around for quite some time, with varying degrees of capability. As I looked inward at the application that I primarily worked on, and then across other parts of the system, I saw a common theme: we had several parts of the system that were all trying to achieve more or less the same goal, each doing it marginally differently from the others. This may seem innocuous at first, but it’s actually a big problem: because we re-implemented some version of the same core idea every time we needed to do something new, we were spending a lot of time (read: effort) on problems that really ought to be solved generically. Additionally, since we created a new version every time, it led to 1) a larger maintenance problem, and 2) inconsistency for our users. I knew we could do better.

Surrounding Context

When I looked at the application I maintained, as well as other parts of the system that contained similar features/functions, what I saw was “SQL WHERE clauses everywhere.” Our system had a pseudo-concept of a global data store, which was basically a global variable list that we could reference at various points in a process. It was not the best hallmark of Domain-Driven Design (DDD), because we had a cross-cutting concern that was hard to control, but damn did it give us the ability to respond efficiently when a feature needed to go out. Just figure out what state needs to be shared, identify any processes that are antecedent to the dependent process, and find a way to scribble some shared state that can be consumed in the place you need it. I’d be lying if I said I didn’t leverage this for all manner of terrible uses…

Through working on the system, particularly in my application, I realized that global state would be an impediment for a long time. At the time, it really was essential and valuable, but the fact that we had so many features depending on it was rather concerning (for obvious reasons). At the same time, we also had an entire system we needed to remain compatible with as we made progress towards our future goals. Some of those were very clear, but others would be driven very heavily by the needs of our business.

In a nutshell, I found a few primary problems that I felt we needed to solve:

Create a pathway for eliminating globally shared state in our system.
Provide a system-wide, general solution that could replace the 90% of solutions that all perform essentially the same task.
Testability. However we decided to replace existing features, we should be able to easily test and verify compatibility for existing components. The first part of this means we should also be able to replay data from a subset of our production system and get identical results from any solution we propose to deliver.

Some related goals were:

Enable incremental adoption. Whether it takes 2 weeks, 2 months, or 2 years, we should not break any existing workloads
Performance, performance, performance… There weren’t many parts of the system that are performance sensitive, but there were some; we needed to be as capable of serving those parts as we are for the rest of the system
The solution should be simple for our development team members and users alike, or we risk having minimal-to-no adoption

Moving Forward

Amazingly, my company was full of people that knew a certain amount of Structured Query Language (SQL). They were generally able to write SELECT statements, complete with joins and where-clauses. Some of them were even good at using Common Table Expressions (CTEs) as part of their queries, along with more advanced features in SQL Server. It really amazed me every time I saw some of the stuff that folks managed to create to serve their teams. As I looked across the organization, I felt that I could capitalize on this knowledge in a number of ways:

Most of our business rules are in principle just predicates (as most are), so knowing how to write a WHERE clause could be the foundation for an interface between a rule system and the users
Even if users don’t know how to write SQL, the most basic queries are so similar to English that a 5th grader could probably explain them
If we constrained the grammar, it would probably be fairly trivial to implement either a translator, or an interpreter that was not coupled to the data in our databases at all

I decided this was a very compelling idea on a number of fronts. After some initial conversations with stakeholders, I thought there was good traction for building essentially what’s described in the final point. I started crafting a scanner, a parser, and finally, an interpreter. This required defining a grammar for the language I’d be scanning, parsing, and ultimately, interpreting. The basic concept was to allow users to write their own predicates, which would look like a WHERE clause in SQL. Again, most of the stakeholders that would have any knowledge of this whatsoever already knew SQL, so it was a good starting point. The null-handling semantics would be identical to versions of SQL we used in-house, which I felt would help facilitate a smooth adoption curve.
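
To make the three stages concrete, here is a minimal sketch of a scanner, parser, and interpreter for WHERE-style predicates. It is written in Python for brevity (the service itself was written in C#), and the token set, grammar, tree shape, and function names are all assumptions made for illustration rather than the actual implementation. It handles comparisons joined by AND, OR, NOT, and parentheses, and it deliberately leaves out SQL-style NULL handling.

```python
import operator
import re

# Hypothetical token set: numbers, quoted strings, names, operators, parens.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<num>\d+(?:\.\d+)?)|'(?P<str>[^']*)'|(?P<name>[A-Za-z_]\w*)"
    r"|(?P<op><=|>=|<>|=|<|>)|(?P<paren>[()]))"
)
OPS = {"=": operator.eq, "<>": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def scan(text):
    """Scanner: split the expression into (kind, value) tokens."""
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        kind, value = match.lastgroup, match.group(match.lastgroup)
        if kind == "name" and value.upper() in ("AND", "OR", "NOT"):
            kind, value = "kw", value.upper()
        tokens.append((kind, value))
        pos = match.end()
    return tokens

def parse(tokens):
    """Parser: recursive descent; OR binds loosest, then AND, then NOT."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def binary(keyword, tag, operand_rule):
        node = operand_rule()
        while peek() == ("kw", keyword):
            take()
            node = (tag, node, operand_rule())
        return node

    def or_expr():
        return binary("OR", "or", and_expr)

    def and_expr():
        return binary("AND", "and", unary)

    def unary():
        if peek() == ("kw", "NOT"):
            take()
            return ("not", unary())
        if peek() == ("paren", "("):
            take()
            node = or_expr()
            if take() != ("paren", ")"):
                raise SyntaxError("expected ')'")
            return node
        left = operand()
        kind, op = take()
        if kind != "op":
            raise SyntaxError("expected a comparison operator")
        return ("cmp", op, left, operand())

    def operand():
        kind, value = take()
        if kind == "num":
            return float(value)
        if kind == "str":
            return value
        if kind == "name":
            return ("var", value)  # resolved from the context at evaluation time
        raise SyntaxError(f"unexpected token {value!r}")

    tree = or_expr()
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return tree

def evaluate(node, context):
    """Interpreter: walk the tree against a string-to-string context."""
    tag = node[0]
    if tag == "or":
        return evaluate(node[1], context) or evaluate(node[2], context)
    if tag == "and":
        return evaluate(node[1], context) and evaluate(node[2], context)
    if tag == "not":
        return not evaluate(node[1], context)
    _, op, left, right = node  # tag == "cmp"
    lhs = context[left[1]] if isinstance(left, tuple) else left
    rhs = context[right[1]] if isinstance(right, tuple) else right
    # Context values are strings, so convert when the other side is numeric.
    if isinstance(lhs, float) and isinstance(rhs, str):
        rhs = float(rhs)
    elif isinstance(rhs, float) and isinstance(lhs, str):
        lhs = float(lhs)
    return OPS[op](lhs, rhs)
```

With this sketch, `evaluate(parse(scan("Price >= 10 AND Region = 'EU'")), {"Price": "12.5", "Region": "EU"})` walks all three stages. Nothing in it touches a database: the caller supplies every value through the context dictionary.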

In order to decouple this new thing I was building from our database, and in particular our global data store, I decided that I just wouldn’t have any concept of data storage in the new system. By “no concept of data storage,” I really do mean no concept of data storage. The only things it would be familiar with were sockets (for HTTP), and how to write log data to a log store. The interpreter interface would be very minimal:

using System.Collections.Generic;

public interface Interpreter {
    bool Evaluate(string expr, IDictionary<string, string> context);
}

Very minimal indeed… It would need to know about some common “types” and “conversions,” but if we could scan and parse something that looked like a WHERE clause, parsing numbers, booleans, and other basic “primitives,” ought to be fairly simple (particularly considering the fact that most languages provide functions for those things in the standard library).
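
As a rough illustration of those conversions, the sketch below turns a raw context string into a boolean, a number, or a plain string using only standard library parsing. It is in Python rather than C#, and the function name `coerce` and the order in which conversions are tried are assumptions, not a description of the real service.

```python
def coerce(raw):
    """Convert a raw context string to bool, float, or str (None stays None)."""
    if raw is None:
        return None
    text = raw.strip()
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return float(text)
    except ValueError:
        # Not a number, so treat it as an ordinary string.
        return text
```

One caveat with leaning on the standard library: Python's `float()` also accepts inputs such as "nan" and "inf", which a rule engine would probably want to reject explicitly.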

I thought this was a great concept: instead of having a bunch of features that were basically serving as interpreters, why not just implement one and be done with it? That seems fairly logical to me… But then comes the question of how we expose that? I know a lot of folks are burned out on HTTP, but I thought it may actually make a good platform for hosting. HTTP is far from the best protocol in the world, but I wasn’t especially interested in designing one, and I also wasn’t interested in having to make applications update anytime we had an update to the interpreter either. I’d rather just have applications see it as a black box where they send an expression, a context, and get a result that indicates to them:

Whether the evaluation was successful, and if so, what the result was
If it failed, anything to log about the failure (to support our operations teams)
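
One way to picture that black-box result is a small envelope type that carries either an answer or a loggable error. This is a hedged sketch in Python; the class name `EvaluationResult`, its field names, and the `safe_evaluate` wrapper are hypothetical and only illustrate the two outcomes listed above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EvaluationResult:
    succeeded: bool
    value: Optional[bool] = None   # set only when evaluation succeeded
    error: Optional[str] = None    # set only when evaluation failed

def safe_evaluate(evaluate, expr, context):
    """Run `evaluate(expr, context)` and wrap the outcome in a result object."""
    try:
        return EvaluationResult(succeeded=True, value=bool(evaluate(expr, context)))
    except Exception as exc:
        # Report the failure to the caller instead of raising.
        return EvaluationResult(succeeded=False, error=f"{type(exc).__name__}: {exc}")
```

Keeping "false" and "failed" as separate states matters here, because a predicate that could not be evaluated is different from one that evaluated to false.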

That became the genesis of an idea…

Building a Service

I churned out the work on the Interpreter in what I felt was a reasonable timeline. I was only able to devote some of my time to it at work, but I got it built, complete with a bunch of unit tests to help prove that it was in some sense working. There were some minor bugs that I didn’t catch immediately, but I knew I’d circle back and catch them once I had more than just a concept implemented.

With the interpreter finished, I proceeded to build an HTTP server for exposing it to other systems. I provided some very simple methods in the final interface:

ValidateExpression took an expression string, attempted to parse it, and returned a “formatted” result with a unique list of identifiers. There was also error reporting for failures. The overall goal of this API was to enable us to verify that strings were valid at various times in application lifecycles. It also provided a good extensibility point for user interfaces in the future (if we so desired).
EvaluateExpression took an expression string, and the magic “context” parameter (variable list) mentioned in the previous section
EvaluateBatch was essentially a wrapper over EvaluateExpression, taking a list of expressions instead of a single expression. This was an easy optimization for us to ship very early on: allow clients to submit multiple expressions as part of a single request. The goal was to reduce the number of transfers between clients and servers.
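
The shape of two of these operations can be sketched in a few lines of Python. Everything here is an assumption for illustration: the real ValidateExpression used the actual parser, whereas this toy version only checks parentheses and collects identifiers with a regular expression, and the response field names are invented. The batch function also assumes that all expressions share one context and that one failing expression does not fail the whole batch, which the description above does not specify.

```python
import re

KEYWORDS = {"AND", "OR", "NOT"}

def validate_expression(expr):
    """Toy validation: report the unique identifiers an expression references."""
    # Blank out string literals so their contents are not seen as names.
    without_strings = re.sub(r"'[^']*'", "''", expr)
    if without_strings.count("(") != without_strings.count(")"):
        return {"valid": False, "error": "unbalanced parentheses", "identifiers": []}
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", without_strings)
    unique = sorted({name for name in names if name.upper() not in KEYWORDS})
    return {"valid": True, "error": None, "identifiers": unique}

def evaluate_batch(evaluate_one, expressions, context):
    """Evaluate each expression in order, collecting one entry per expression."""
    results = []
    for expr in expressions:
        try:
            results.append({"expression": expr, "ok": True,
                            "result": evaluate_one(expr, context)})
        except Exception as exc:
            results.append({"expression": expr, "ok": False, "error": str(exc)})
    return results
```

The identifier list is what lets a caller know which context values it has to supply before it ever sends an evaluation request.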

I introduced logging during each stage, along with a server-generated CorrelationId that was returned to the client so the client could log it at runtime. Operators could then use the CorrelationId to find the related events in the server-side log data.
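
The correlation idea can be sketched as follows, using Python's standard `logging` and `uuid` modules. The logger name, the log message formats, and the response field names are assumptions; only the pattern (generate an ID on the server, stamp it on every log line, hand it back to the client) comes from the description above.

```python
import logging
import uuid

logger = logging.getLogger("rule_service")  # hypothetical logger name

def handle_request(expr, context, evaluate):
    """Evaluate one request, tagging every log line with a fresh correlation ID."""
    correlation_id = str(uuid.uuid4())
    logger.info("evaluate start correlation_id=%s expr=%r", correlation_id, expr)
    try:
        result = evaluate(expr, context)
    except Exception:
        # logger.exception records the traceback alongside the ID.
        logger.exception("evaluate failed correlation_id=%s", correlation_id)
        return {"correlationId": correlation_id, "succeeded": False, "result": None}
    logger.info("evaluate done correlation_id=%s result=%s", correlation_id, result)
    return {"correlationId": correlation_id, "succeeded": True, "result": result}
```

Because the same ID appears in the response and in the server log, a client-side log entry is enough for an operator to locate the matching server-side events.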

With all that infrastructure in place, I needed a way to deliver the server to production. Since May of 2020, I’d adopted a policy of never deploying any new applications manually. I began work on crafting a full DevOps pipeline to target dev, test, and production environments, so that we could easily ship the whole thing once we were comfortable with it.

The last bit of work was to generate a sort of “client SDK” for use within existing applications. I didn’t want a bunch of users or devs going and looking at a Swagger doc to figure out how to call the server. Instead, I wanted to provide them the tools to integrate with the server to ease their pathway into using it. I built the first client in C# (which is what the server was written in, along with most of our applications), and published it to a private NuGet feed for consumption. My favorite thing about this entire leg of the project was that I wrote a bunch of integration tests that called the HTTP server, and by writing integration tests, I identified a bunch of things that were broken in the client. It sure feels good to validate your own effort on writing tests when you realize something is broken. It turned out that I used an incorrect type as part of the client interface in one of the methods, and I’d also constructed the type-hierarchy in such a way that the JSON serializer was silently failing to re-construct my objects. It was really easy to fix those bugs, but without tests of any kind, they would have been very hard to catch.

Integrating the First Client

I started pushing versions of my new service to my development and test environments very early. The idea was to make code available, and if anyone was interested in taking a peek at it from my team, it would be very easy for them to do so. It also helped me tremendously: by getting the foundations of a proper service built (including a CI/CD pipeline, automated tests, test coverage reporting, etc), it would also be easier to identify whether or not I broke something along the way. Trip wires for the win.

With the server in my dev environment, and the client package in my NuGet feed, I was able to start re-writing my existing code to use this new service. I say “re-write” because I was really replacing an existing interface implementation, and while the boundaries of the system would remain compatible (it’s an interface, after all), the internals were going to see a fair number of changes. Of course, when I say re-write, I also mean “write new,” because I generally like to keep old and new versions of features parallel to one another guarded by a feature flag for at least one release (the one where I introduce the new version of a feature). We operate distributed systems, and feature flags make it much, much easier to stage all of our infrastructure, run any migration code that’s required, and then make an entire change “go live” all at once across our entire machine cluster. We essentially just make sure that the old thing still passes all its old tests in the test environment, ensure the new component is passing the tests we’ve defined for it (which usually includes tests for the previous version when it makes sense), and then push to production. The entire loop is very simple, and leads to extremely smooth deployments.
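
The parallel old-and-new arrangement might be sketched like this. It is Python rather than C#, and every name in it (the two matcher classes, the `use_rule_service` flag, the hard-coded price range) is hypothetical; the point is only that both implementations sit behind the same interface and a flag decides which one callers receive.

```python
class LegacyConstraintMatcher:
    """Stand-in for the old range-based check (hypothetical limits)."""

    def is_match(self, context):
        try:
            return 10 <= float(context.get("Price", "")) <= 20
        except ValueError:
            return False

class ServiceConstraintMatcher:
    """Stand-in for the new path; `evaluate` would call the rule service."""

    def __init__(self, evaluate, expression):
        self._evaluate = evaluate
        self._expression = expression

    def is_match(self, context):
        return self._evaluate(self._expression, context)

def choose_matcher(flags, legacy, replacement):
    """Return the new implementation only when the flag is switched on."""
    return replacement if flags.get("use_rule_service", False) else legacy
```

With the flag off by default, deploying the new code changes nothing at runtime, and turning the flag on switches every caller over without another deployment.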

And so it went. I worked on integrating the new service into my existing code base, writing new tests, and preparing to ship the next version. I was happy with the performance of both the new code in my existing application (where I had the primary need), and in the new service I’d built. I was also very pleased with how much easier it was to test this new version using reproductions of data from the actual production system. One of my big frustrations in previous versions was that it was difficult to get configuration data for rules defined in the system. This new version made it trivially simple, which made the development process a real joy. I could see how much easier it would be to operate the new version as I went, which made me happy.
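
Replaying recorded data against both versions can be as simple as the sketch below. The function name `replay` and the shape of a recorded case are assumptions; any callable that takes a case and returns a result would fit.

```python
def replay(recorded_cases, old_impl, new_impl):
    """Return the recorded cases where the two implementations disagree."""
    mismatches = []
    for case in recorded_cases:
        expected = old_impl(case)   # behavior we must stay compatible with
        actual = new_impl(case)     # behavior of the replacement
        if expected != actual:
            mismatches.append({"case": case, "old": expected, "new": actual})
    return mismatches
```

An empty list means the new version reproduced the old results for every recorded case, which is the compatibility property the testability goal asks for.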

Goal Check

When I started out on this project, I had a few primary goals in mind. You’re probably curious about how the work I did lines up with the goals I had, right? Let’s revisit those.

1: Create a pathway for eliminating globally shared state in our system.

I’m calling this one a success. While the first version is still going to share a bunch of global state, it’s not core to the system design. Callers of the service are free to source data however they want, which they provide through the Context parameter. This also means that we can determine what data will be commonly served from applications, and instead of figuring out how to get that data into a global data store, the consuming application is free to define data in any way it wishes to when calling the interpreter server.

2: Provide a system-wide, general solution that could replace the 90% of solutions that all perform essentially the same task.

I built a service that is capable of evaluating predicates by design. That covers 90% of use-cases I’ve encountered. There are still some things that my new service isn’t capable of doing, but the fact that it can’t perform those functions yet doesn’t mean that it never will. We are free to make choices in the future about whether or not adding functionality to our interpreter makes sense.

3: Testability…

This was, again, so fundamental to the design that it would have been hard to miss. Some existing applications may not record enough information about how their rules were evaluated to give them a clear path for verifying that the new service meets their needs. Still, the work I did on integrating the first workload should serve as a good case study of what that path looks like.

For the related goals:

Enable incremental adoption. Whether it takes 2 weeks, 2 months, or 2 years, we should not break any existing workloads

Using a new service is something that you opt in to. By definition, the delivery of a new service should not break existing services. Any time a new application is onboarded, it will require regression testing of its primary value areas related to the functionality being changed, but as long as the fundamental use-case is one of the ones covered by the design, this goal should be met. Because we have the service installed in so many environments before production, it should also be easy to spot anything that fails to work for existing applications.

Performance, performance, performance… There weren’t many parts of the system that are performance sensitive, but there were some; we needed to be as capable of serving those parts as we are for the rest of the system

The early tests showed decent performance. We could churn out answers to queries in less than 250ms in many cases, which we thought was good. Could it be faster? Sure. Skipping round-trips across a network would have helped with that. However, 250ms was “good enough” for most of our scenarios. I’d say it met our expectations in that area.

The solution should be simple for our development team members and users alike, or we risk having minimal-to-no adoption

It’s still early for us to evaluate this one, but I think it’s promising… I was able to reduce a lot of my configuration complexity, and re-implement an entire feature of the application I work on, with a fairly minimal amount of effort.

Closing Thoughts

It’s really easy to look at a project like the one I’ve described and think I’m just another developer building things for the sake of building them. I asked myself repeatedly throughout the process of building this new service if it was really necessary, or if I just wanted to build it. Why not just choose an existing language, like Python, or C#, or F#? The conclusion I came to was that I’d be happy with any of those, but I also felt a strong tug to avoid them. Python would be a fantastic choice, but it comes with questions of how you host it, how you expose it as a service, and more. If you don’t have any experience with managing Python in production, then it may not be a great first step. That pretty much summarizes how I feel about it. I had similar concerns about C# and F# as far as scripting goes. There’s also that pesky problem of adoption: if users don’t feel comfortable using a particular thing, then it doesn’t matter how you implemented the backend.

There’s more work to be done yet. The first version of this service isn’t terribly feature-rich in the language itself. In particular, I want to add support for “fuzzy comparisons” à la LIKE comparisons in SQL (wellKnown LIKE 'abc%'), and also add support for lists (to enable IN expressions, such as wellKnown IN (1, 2, 3)). These don’t exist yet, and they may never exist; we’ll see how useful those features would actually be. I’d like to eventually replace this service altogether with something like Azure Functions, but that’s not immediately possible for a number of reasons.
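
For a sense of what those two additions would involve, here is a hypothetical Python sketch of the comparison semantics only (not the grammar changes). It translates a LIKE pattern into a regular expression, treating `%` as any run of characters and `_` as exactly one character. The function names are invented, matching is case-sensitive here, and SQL Server extras such as bracket wildcards and the ESCAPE clause are not handled.

```python
import re

def sql_like(value, pattern):
    """Match `value` against a LIKE pattern ('%' = any run, '_' = one char)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

def sql_in(value, candidates):
    """IN is plain membership once the list has been parsed."""
    return value in candidates
```

Most of the remaining work for these features would sit in the scanner and parser, which need new tokens for LIKE, IN, and comma-separated lists.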

Finally, I hope to at some point provide something like the solution I described as an open source project. Any future OSS version will be a complete re-write from the ground up, but I hope that in providing something as open-source, others will contribute and help make this sort of technology available to a larger audience of developers.

I hope you enjoyed this post. I certainly enjoyed reflecting on the work I did, and I’m very excited to see where it goes from here.

Cheers,

- Brian