by skunxicat

When You Can’t Trust the Supplier: Designing for Uncooperative Systems

What happens when your supplier is a black box, has no API agreement with you, and can change behavior at any time?

Uncooperative Systems Design

Most system design assumes a cooperative environment. You call an API, it responds predictably, you handle errors, you move on. The contract is implicit: the supplier behaves consistently, errors are well-defined, and success means what it says.

This assumption is so deeply embedded in how we think about systems that it rarely gets named. It’s just the water we swim in.

We built a booking system where none of that is true.

No commercial agreement with the airline. No stable API contract. No guaranteed behavior. No escalation path when things break. Just a system that needs to book real flights, handle real money, and be honest about what it can and cannot guarantee.

The interesting part isn’t the technical challenges — those are solvable. The interesting part is what happens to your mental model of a system when you remove the assumption of cooperation entirely. You have to rethink what words like success, failure, and truth actually mean.

This is what we learned.


The Root Problem Is Relational, Not Technical

The first instinct when building against an unstable external system is to engineer around it — add retries, add timeouts, add circuit breakers. That’s necessary, but it misses the point.

The real problem isn’t volatility. Prices change, sessions expire, inventory disappears — that’s manageable. The deeper problem is the absence of a relationship.

Without a contract:

  • There’s no agreed interface stability
  • Errors don’t have stable semantics
  • Success signals can’t be fully trusted
  • Behavior can change without notice

Volatility is a technical property. The absence of a relationship is a structural one. You can buffer against volatility. You cannot engineer a relationship into existence.

This distinction matters because it changes what you’re designing for. You’re not designing a system that handles errors well. You’re designing a system that operates responsibly inside a boundary it cannot control and cannot move.

In some cases, retrying the same operation does not increase the chance of success — it simply repeats the same outcome under a slightly different surface.

That’s a different problem. And it requires a different kind of honesty — not just in the code, but in what you tell the people who depend on it.


The Naive Model and Why It Fails

The standard booking flow looks like this:

Search → Confirm Price → Place Order → Booking Confirmed

In a cooperative environment, each step has a deterministic outcome. Confirmed means confirmed. Failed means failed. The system’s internal state and external reality stay in sync because the supplier guarantees it.

When you remove the contract, this model breaks in a specific way: you can’t trust the outcome signals.

A booking attempt might:

  • Appear to succeed but not produce a PNR
  • Appear to fail but have already charged the card
  • Return an ambiguous response that could be either

In practice, this shows up in very concrete ways:

  • A payment step returns successfully, but no PNR is ever issued
  • The flow redirects to the homepage instead of completing the booking
  • A 3DS challenge is presented but never completed by the user

Each of these cases produces signals, but none of them can be trusted as definitive proof of success or failure.

If your system maps execution outcomes directly to business truth, you’ll misclassify these cases. You’ll tell a customer their booking failed when money has already left their account. Or you’ll tell them it succeeded when it hasn’t.

The naive model conflates execution with truth. That’s the mistake.

And it’s an easy mistake to make, because in most systems execution is truth. The API told you it worked, so it worked. The API told you it failed, so it failed. We’ve built entire industries on that assumption. Removing it feels wrong, like the ground shifting underfoot.

But the ground was always shifting. We just had a contract that said it wasn’t.
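The mistake the naive model makes can be stated in a few lines. This is an illustrative sketch, not the system's actual code; every name here is hypothetical:

```python
# Naive model: the supplier's success flag is taken as business truth.
def naive_booking_state(api_response: dict) -> str:
    # Trusts the execution signal unconditionally.
    return "CONFIRMED" if api_response.get("success") else "FAILED"

# A payment that went through without ever producing a PNR is
# misclassified as a clean failure, even though money left the account.
ambiguous = {"success": False, "card_charged": True, "pnr": None}
state = naive_booking_state(ambiguous)  # "FAILED", but the card was charged
```

The function has no way to say "I don't know", which is exactly the capability the rest of this article argues for.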


Separating Three Things That Are Usually One

The fix is to explicitly model three separate layers:

Quote — the business obligation. A conditional promise derived from live airline state. It represents what might be possible, not what will happen. A Quote does not fail. It persists until it is fulfilled, cancelled, or expired. Failed attempts do not invalidate it.

Job — an execution attempt. A controlled effort to convert the quote into a ticket. It can fail. It can be retried. Multiple jobs may exist for a single quote. A Job is ephemeral. It does not define business truth.

Truth — the confirmed outcome. The only acceptable proof is a PNR (Passenger Name Record). Until a PNR exists, no booking has occurred — regardless of what the execution layer reports.

Quote (obligation)
  └── Job (attempt)
        └── Evidence (signals)
              └── Truth (PNR or confirmed failure)

Evidence is not a single signal, but a collection of observations — page transitions, responses, side effects — that must be interpreted together.
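One way to express the separation in code. This is a minimal sketch under assumed names, not the authors' implementation; the point is that a Quote carries the obligation, a Job is disposable, and truth is recorded only when a PNR exists:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class QuoteState(Enum):
    OPEN = "open"            # obligation persists across failed attempts
    FULFILLED = "fulfilled"
    CANCELLED = "cancelled"
    EXPIRED = "expired"

@dataclass
class Evidence:
    # A collection of observations (page transitions, responses,
    # side effects) interpreted together, never a single signal.
    signals: dict = field(default_factory=dict)

@dataclass
class Job:
    # An ephemeral execution attempt. It can fail and be retried;
    # it does not define business truth.
    quote_id: str
    evidence: Evidence = field(default_factory=Evidence)

@dataclass
class Quote:
    # A conditional promise. A Quote does not fail; failed Jobs
    # leave it OPEN until it is fulfilled, cancelled, or expired.
    quote_id: str
    state: QuoteState = QuoteState.OPEN
    pnr: Optional[str] = None  # truth exists only once a PNR exists

    def record_truth(self, pnr: str) -> None:
        self.pnr = pnr
        self.state = QuoteState.FULFILLED
```

Note that nothing in `Job` can mutate `Quote.state`; only independently verified evidence (a PNR) moves the obligation forward.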

This separation sounds obvious in hindsight. In practice, most systems collapse these layers because in cooperative environments they’re always in sync. When the supplier is a black box, they diverge constantly.

The philosophical shift here is subtle but important. A Quote is not a transaction — it’s a promise under uncertainty. A Job is not a result — it’s an expression of effort. Truth is not what the system reports — it’s what can be independently verified.

Once you internalize that separation, a lot of design decisions become clearer. You stop asking “did the booking succeed?” and start asking “what evidence do we have, and is it sufficient to establish truth?”


UNCONFIRMED Is Not an Error

The most important design decision was treating UNCONFIRMED as a valid, stable state — not a temporary condition to be resolved immediately.

When a job ends without sufficient evidence to determine success or failure, the correct response is to say so explicitly:

Job outcomes:
  SUCCESS     → confirmed fulfillment (PNR exists)
  DENIED      → rejected before commitment (availability, validation)
  FAILED      → attempt completed with negative outcome
  UNCONFIRMED → insufficient evidence to conclude

UNCONFIRMED triggers a reconciliation process — a separate, bounded effort that re-examines the system from different angles (account state, email confirmations, external records) to establish truth over time.
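A classification step along these lines makes UNCONFIRMED a first-class result rather than a fallthrough error. The evidence keys below are assumptions for illustration, not the system's actual signal names:

```python
from enum import Enum

class Outcome(Enum):
    SUCCESS = "success"
    DENIED = "denied"
    FAILED = "failed"
    UNCONFIRMED = "unconfirmed"

def classify(evidence: dict) -> Outcome:
    """Map collected signals to a job outcome (hypothetical keys)."""
    if evidence.get("pnr"):
        return Outcome.SUCCESS          # only a PNR proves success
    if evidence.get("rejected_before_commitment"):
        return Outcome.DENIED           # e.g. availability or validation
    if evidence.get("definitive_failure") and not evidence.get("card_charged"):
        return Outcome.FAILED           # negative outcome, no financial exposure
    return Outcome.UNCONFIRMED          # insufficient evidence: say so
```

Note the asymmetry: a charged card blocks the FAILED branch, because a "failure" with money in flight is precisely the case that must go to reconciliation rather than be reported as resolved.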

The key insight: forcing a binary outcome when evidence is insufficient produces wrong answers. UNCONFIRMED is the honest answer. It’s not a bug in the system — it’s the system working correctly under uncertainty.

There’s something deeper here about what it means to build honest software. Most systems are designed to always produce an answer. Uncertainty is treated as a failure mode to be eliminated, not a condition to be modeled. We add timeouts, fallbacks, and defaults — all in service of returning something rather than admitting we don’t know.

But in a system handling real money, a confident wrong answer is worse than an honest uncertain one. UNCONFIRMED is the system saying: I observed what I could. I don’t have enough to tell you what happened. I’m not going to guess.

That takes a certain discipline to build and a certain courage to ship.


The Primary Risk Is Not Failed Bookings

A failed booking is recoverable. The customer doesn’t get a ticket, you don’t charge them, you try again or refund.

The primary risk is unacknowledged spend: money leaving the client’s account without a corresponding PNR being issued.

Because the system has no contractual guarantee over payment signals, it may miss changes in upstream behavior. It cannot independently assert financial truth.

This means the only reliable reconciliation mechanism is external:

  • Client’s sales ledger (money received from customers)
  • Client’s account balance and payment outflows
  • The difference between expected and observed spend

Any discrepancy beyond acceptable bounds indicates a failure outside the system’s control. The system’s job is to make that discrepancy visible, not to pretend it can’t happen.
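The reconciliation check itself is simple arithmetic; what matters is that it runs against external records. A sketch, with hypothetical names and a configurable tolerance:

```python
def unacknowledged_spend(observed_outflow: float,
                         attributed_spend: float,
                         tolerance: float = 0.0) -> float:
    """Discrepancy between money that left the account (from the
    client's ledger) and spend the system can attribute to issued
    PNRs. A positive result beyond the tolerance is unacknowledged
    spend, to be surfaced rather than silently absorbed."""
    delta = observed_outflow - attributed_spend
    return delta if delta > tolerance else 0.0
```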


Admission Control: Refusing New Obligations

One of the less obvious design decisions was building explicit admission control into the system.

The system exposes a global status that governs whether it accepts new fulfillment requests:

Status:
  AVAILABLE   → All requests accepted
  UNSTABLE    → Requests accepted with warning
  MAINTENANCE → New requests rejected (503), read-only available
  UNAVAILABLE → New requests rejected (503), existing jobs still queryable

UNAVAILABLE doesn’t mean the system is offline. It means the system is closed to new obligations while remaining available for status queries.

This matters because operating in a degraded state without telling clients is worse than refusing requests. If the system can’t fulfill responsibly, it shouldn’t accept the responsibility.

Currently this is controlled manually. The intent is to evolve toward automated transitions based on error rates and latency thresholds — a self-protecting system that adapts to changing conditions.
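The admission gate can be sketched as follows. Assumed names throughout; the key property is that status queries stay available even when new obligations are refused:

```python
from enum import Enum

class SystemStatus(Enum):
    AVAILABLE = "available"
    UNSTABLE = "unstable"
    MAINTENANCE = "maintenance"
    UNAVAILABLE = "unavailable"

def admit(status: SystemStatus, request_kind: str) -> tuple:
    """Return (accepted, http_status) for an incoming request.
    Existing jobs remain queryable in every status."""
    if request_kind == "status_query":
        return (True, 200)
    if status in (SystemStatus.MAINTENANCE, SystemStatus.UNAVAILABLE):
        return (False, 503)  # closed to new obligations
    return (True, 200)       # AVAILABLE, or UNSTABLE with a warning attached
```

Automating the transition between statuses would mean feeding error rates and latency into whatever sets `status`; the gate itself does not change.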


What the System Guarantees

Being explicit about guarantees is as important as the guarantees themselves.

The system guarantees:

  • Explicit correlation between quotes, sales, and attempts
  • Transparent success and failure semantics
  • Bounded retries and cancellation paths
  • Full auditability of decisions and outcomes
  • Clear operational limits

The system does not guarantee:

  • That every sale becomes a ticket
  • That effort is predictable
  • That external behavior will remain stable
  • That financial signals are authoritative

This isn’t a disclaimer. It’s an accurate description of the problem space. A system that claims more than this is lying — and in a system handling real money, that’s a serious problem.

There’s a tendency in software to treat the list of guarantees as a marketing exercise — you want it to be long, impressive, reassuring. But a guarantee you can’t keep is worse than no guarantee at all. It creates false confidence, defers accountability, and makes failures harder to reason about when they happen.

The honest list of guarantees is shorter. But it’s real. And clients can build on real.


The Broader Pattern

This design applies anywhere you’re integrating with a non-cooperative external system:

  • Fraud/risk systems: Outcomes depend on hidden rules, behavior varies by context, decisions aren’t fully observable
  • Distributed systems: Immediate outcomes aren’t always final, truth requires reconciliation
  • Any black-box integration: Where you observe signals but can’t verify semantics

In practice, this often means introducing controlled variation in how requests are executed so that you can observe how the external system reacts under different conditions.

The common thread is: don’t infer truth from execution signals alone. Collect evidence, interpret cautiously, defer truth when necessary, resolve through reconciliation.

The system that pretends uncertainty doesn’t exist will eventually produce wrong answers at the worst possible moment. The system that models uncertainty explicitly will surface it early, handle it gracefully, and remain honest under pressure.

But there’s a broader point beyond system design. The way we build software reflects assumptions about the world it operates in. Most software assumes a world of contracts, stable interfaces, and cooperative counterparts. That world exists — but it’s not the only world.

Some systems operate at the edge of that world, where the contracts run out. Where you’re interacting with something that doesn’t know you exist, doesn’t care about your correctness guarantees, and will change without warning. Building in that space requires a different posture: less certainty, more observation. Less assertion, more inference. Less confidence in individual outcomes, more investment in the process of establishing truth over time.

It’s closer to science than engineering. You form hypotheses, collect evidence, revise your model. You don’t get to declare truth — you get to converge toward it.


Key Takeaways

  • Separate execution (Job) from obligation (Quote) from truth (PNR). They diverge in non-cooperative environments.
  • UNCONFIRMED is a valid state. Forcing binary outcomes when evidence is insufficient produces wrong answers.
  • The primary risk is unacknowledged spend, not failed bookings. Design reconciliation around financial exposure, not just execution outcomes.
  • Admission control is a feature. A system that refuses new obligations when it can’t operate predictably is more trustworthy than one that accepts everything and fails silently.
  • Be explicit about what the system guarantees and what it doesn’t. Clients can work with honest uncertainty. They can’t work with false confidence.

You don’t get to declare truth — you build systems that converge toward it.