
Payment infrastructure was not built for autonomous decisions. Every API, every auth flow, every compliance checkpoint assumes a human being is present at the moment of transaction: someone who clicked a button, entered a PIN, or tapped a screen. That assumption is now under pressure.
Anath Bandhu Chatterjee has spent 15 years building the distributed systems that move money at scale, working across major financial institutions on real-time payment infrastructure, microservices architecture, and the observability frameworks that keep high-volume transactions from failing silently. He has seen firsthand what happens when those systems hit their limits — and why adding AI agents into that stack is not simply an engineering challenge but a fundamental rethinking of how trust, authorization, and accountability get encoded into financial infrastructure.
In this interview, Anath walks through the architectural gaps that make today’s payment APIs poorly suited for autonomous agents, what delegation and audit trails would actually need to look like, and why the industry is further from solving this than most people realize.
You’ve spent 15 years building distributed payment systems at companies like PayPal and JPMorgan Chase, where a single minute of downtime can cost millions. What’s the most critical lesson you’ve learned about building infrastructure that cannot fail?
The most critical lesson I have learned while working at these payment industry giants is that infrastructure that cannot fail is, in essence, infrastructure that does fail but recovers automatically, degrades predictably, and never loses money in the process.
In these critical payment processing systems, redundancy is not just a feature; it is the foundation of everything. In payments, you can’t rely on a single point of failure, whether it’s hardware, software, or even human processes.
Below are some of the key tenets that helped contain the blast radius and maintain business continuity in the event of failures.
Idempotency at its core: Every payment request carried a client-generated idempotency key that lived for hours. Because every API was 100% idempotent, retries ensured exactly-once semantics. Zero double-charges, zero lost payments.
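As a minimal sketch of that idea (the class and field names here are illustrative, not PayPal's actual implementation): the result of a charge is cached under the client-generated key for a TTL, so a network retry replays the original outcome instead of moving money twice.

```python
import time

# Minimal idempotency sketch: results are cached under a client-generated key
# for a TTL, so a retried request returns the original outcome instead of
# charging twice. All names here are illustrative.
class IdempotentPayments:
    def __init__(self, ttl_seconds=4 * 3600):
        self.ttl = ttl_seconds
        self._results = {}  # idempotency_key -> (expires_at, result)
        self._charges = []  # side effect: actual money movement

    def charge(self, idempotency_key, amount_cents):
        now = time.time()
        cached = self._results.get(idempotency_key)
        if cached and cached[0] > now:
            return cached[1]  # replayed request: return the original result
        self._charges.append((idempotency_key, amount_cents))  # move money once
        result = {"status": "captured", "amount": amount_cents}
        self._results[idempotency_key] = (now + self.ttl, result)
        return result

gateway = IdempotentPayments()
first = gateway.charge("key-123", 50_00)
retry = gateway.charge("key-123", 50_00)  # network retry of the same request
assert retry == first and len(gateway._charges) == 1  # exactly one charge
```

A production version would persist the key store durably and scope keys per merchant, but the contract is the same: same key, same response, one charge.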
Compensating actions in distributed transactions: We used saga patterns extensively because two-phase commit doesn’t scale to our volumes. When a payment authorization succeeds, but settlement fails, we have automated compensation flows that reverse the authorization rather than leaving the system in an inconsistent state.
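The shape of that compensation flow can be sketched as follows (function names are hypothetical): if settlement fails after a successful authorization, the saga runs the compensating action rather than leaving money half-moved.

```python
# Saga sketch with a compensating action (illustrative names): if settlement
# fails after authorization succeeded, reverse the authorization instead of
# leaving the system in an inconsistent state.
def run_payment_saga(authorize, settle, reverse_authorization):
    auth = authorize()
    try:
        return settle(auth)
    except Exception:
        reverse_authorization(auth)  # compensate: undo the completed step
        return {"status": "reversed", "auth_id": auth["auth_id"]}

events = []

def authorize():
    events.append("authorized")
    return {"auth_id": "A1"}

def settle(auth):
    events.append("settle_attempt")
    raise RuntimeError("settlement rail unavailable")

def reverse_authorization(auth):
    events.append("auth_reversed")

outcome = run_payment_saga(authorize, settle, reverse_authorization)
# The system ends consistent: the authorization was reversed, nothing half-done.
assert outcome["status"] == "reversed"
assert events == ["authorized", "settle_attempt", "auth_reversed"]
```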
Active-Active multi-region architecture: We had multiple live, load-balanced regions. Traffic was already spread across these regions. When one region failed, the load balancers detected TCP health-check failures in under two seconds and rerouted traffic to the remaining regions. No human touched anything.
Automated canary rollback and feature flags: We had rolled out a new version of a microservice just a few minutes before an outage. When we saw a tiny spike in decline rates (unrelated but coincident), the system auto-rolled back the change within seconds because the canary metrics crossed the error budget threshold. Again, no human was involved.
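The core of that decision logic is simple; a toy version (the thresholds here are assumptions, not the real pipeline's values) compares the canary's decline rate against the baseline plus an error budget:

```python
# Toy canary check (assumed thresholds, not a real pipeline): if the canary's
# decline rate exceeds the baseline by more than the error budget, roll back.
def canary_decision(baseline_decline_rate, canary_decline_rate, error_budget=0.005):
    if canary_decline_rate > baseline_decline_rate + error_budget:
        return "rollback"
    return "promote"

assert canary_decision(0.010, 0.025) == "rollback"  # spike beyond the budget
assert canary_decision(0.010, 0.012) == "promote"   # within the budget
```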
Most people think of PayPal and JPMorgan as financial companies, but you’re essentially running massive distributed computing systems that happen to move money. How does building payment infrastructure differ from other types of distributed systems, and why do those differences matter?
Absolutely right: these companies run massive, distributed computing systems that happen to move money, but the moving-money part changes everything in the following fundamental ways.
State transitions are atomic and irreversible: In most distributed systems, if you lose a message or a request times out, you retry with an idempotency key to ensure at-most-once semantics. In payments, the challenge is more nuanced; every state transition that represents money movement must be durable, observable, and reconcilable, while the underlying infrastructure is inherently distributed. Eventual consistency is not an option where money is on the line, and double charges must be impossible.
Latency has direct financial consequences. If your social media feed is slow, users are annoyed. If payment authorization takes too long, merchants lose sales. At PayPal, we measured that every 100ms of additional latency in checkout flows corresponds to a measurable drop-off in conversion rates. But we also can’t sacrifice correctness for speed—we need both sub-100ms authorization response times and perfect accuracy. That naturally drives architectural decisions that wouldn’t make sense elsewhere.
Regulatory compliance dictates architecture in ways most engineers never experience. Processing a cross-border payment from the US to the EU means navigating PCI DSS, GDPR, PSD2, FinCEN regulations, plus banking regulations in both jurisdictions. And these aren’t just checkboxes—they fundamentally constrain how we build systems.
The operational impact is real. Want to spin up a new AWS region to handle a traffic spike? That’s not a one-line Terraform change—it’s legal approval, compliance reviews, data protection impact assessments, and regulatory sign-offs that can easily take 6-9 months. I’ve seen architecture decisions delayed for a year because the compliance team needed to verify that our proposed deployment model satisfied contradictory requirements from different regulators.
You recently attended the Agentic AI Conference, which suggests you’re thinking about where AI and payments intersect. From your vantage point building today’s payment infrastructure, what’s the biggest gap between current systems and what AI agents will need to transact autonomously?
The biggest gap that I can see is that the current payment APIs assume a human in the loop for every decision, with session-based authentication and explicit user consent at transaction time. AI agents will need something fundamentally different: delegated authority with fine-grained, revocable permissions and real-time risk assessment.
Here’s why current systems break down in my opinion:
Session-based auth doesn’t work. Today’s payment flows assume a user logs in, authorizes a transaction, and logs out within minutes. An AI agent might need to make payment decisions over days or weeks based on a changing context. OAuth tokens expire. Payment provider sessions time out. There’s no good model for “this agent can spend up to $500/day on my behalf for cloud resources, but needs approval for anything over $100 in a single transaction.”
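A delegated-spend rule like the one quoted could be expressed as a small policy object. This is a hypothetical sketch (no payment provider exposes this today, which is exactly the gap): a $500 daily cap, with anything over $100 in a single transaction escalated to a human.

```python
from datetime import date

# Hypothetical delegated-spend policy, matching the rule quoted above:
# up to $500/day autonomously, single transactions over $100 need approval.
# Amounts are in cents.
class DelegatedSpendPolicy:
    def __init__(self, daily_cap=500_00, approval_over=100_00):
        self.daily_cap = daily_cap
        self.approval_over = approval_over
        self.spent_today = 0
        self.day = date.today()

    def evaluate(self, amount_cents):
        if date.today() != self.day:  # reset the rolling daily total
            self.day, self.spent_today = date.today(), 0
        if self.spent_today + amount_cents > self.daily_cap:
            return "deny"
        if amount_cents > self.approval_over:
            return "needs_human_approval"
        self.spent_today += amount_cents
        return "allow"

policy = DelegatedSpendPolicy()
assert policy.evaluate(40_00) == "allow"                  # $40: fine
assert policy.evaluate(250_00) == "needs_human_approval"  # > $100 single txn
assert policy.evaluate(480_00) == "deny"                  # would blow the daily cap
```

The hard part isn't this logic; it's that nothing in today's payment APIs gives an issuer a place to attach, enforce, and revoke such a policy per agent.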
Intent verification is human-centric. Payment systems use 3D Secure, SMS verification, and biometric auth—all designed to confirm a human intends to make this specific payment right now. How do you verify an AI agent’s intent? How do you prove the agent hasn’t been compromised? How do you distinguish between “the agent is operating within its mandate” and “the agent has been prompt-injected to drain the account”?
Audit trails aren’t designed for autonomous decisions. Right now, payment logs capture “user X authorized payment Y at timestamp Z”—that’s enough when a human clicked a button. But what happens when an AI agent makes the payment? We need to log the agent’s reasoning process, what inputs it weighed, and what alternatives it considered. And this isn’t just nice-to-have for debugging—it’s going to be legally mandated.
Think about liability: if an agent makes an unauthorized payment, who’s on the hook? The person who owns the agent? The developer who built it? The payment processor that facilitated it? Without a proper decision audit trail, these disputes become impossible to resolve.
The reality is the industry isn’t close to ready. Most payment APIs still fundamentally assume someone’s sitting at a browser or tapping a phone screen, looking at a payment confirmation dialog, and clicking “yes.” That mental model is baked into everything from API design to regulatory frameworks. Changing it will take years, not months.
There’s a lot of hype around microservices, but you’ve built them in production at scale for companies processing trillions of dollars. What’s the reality that most companies get wrong when they try to adopt microservices architecture?
Many companies overlook the extra work that comes with microservices and see them as a quick fix for modularity, missing the bigger issues that distributed systems bring. At PayPal, we grew to thousands of services. But without strong service discovery, circuit breakers, and idempotency, you can quickly end up with a mess of dependencies and cascading failures. People often talk about how microservices help ‘decouple teams,’ but in reality, this can fragment knowledge. Teams focus on their own goals, which can lead to problems such as inconsistent APIs and data silos across the company.
A common mistake is skipping the basics, such as contract testing and observability. At JPMorgan, we required OpenTelemetry for tracing across all services, but many organizations start using microservices without these tools, which makes debugging very difficult. Microservices can be great for resilience, like letting you scale parts of your system independently, but this only works if you invest in automation for deployments and monitoring. If you don’t, you just end up spreading the same problems you had with a monolith, making everything more expensive and complicated.
You’ve worked across three continents and for companies ranging from consulting firms like Cognizant to tech giants like Cisco to elite financial institutions. How do different engineering cultures approach the same distributed systems problems, and what have you learned from that diversity of perspective?
Engineering cultures vary particularly by risk tolerance and pace of innovation.
While I was at Cognizant (consulting in India and the United Kingdom), it was pragmatic and client-driven. The focus was on cost-effectiveness and reusable patterns like containerization for quick wins, but there was less emphasis on bleeding-edge tech due to diverse client needs.
At Cisco (US tech giant), I saw a prioritization of hardware-software integration and an approach to distributed problems through a security-first, network-first lens, e.g., zero trust and latency optimization, fostering a “build to last” mentality.
At JPMorgan, I saw that things were conservative. Every change to production went through security, data privacy, and localization checks. Deployments took weeks, with rigorous change management and approval cycles. While this felt slow, I learned the importance of following processes.
At PayPal, I saw a blend of the two: process maturity where it matters (data, security, consistency), combined with fast innovation for customer-facing features and pragmatic risk management.
Authentication and authorization are relatively solved problems for human users, but when AI agents start making autonomous payment decisions, the entire model breaks down. Can you explain why current payment APIs weren’t designed for this, and what would need to change?
The thing about current payment APIs is that they’re built around humans clicking buttons. Everything assumes someone’s sitting there, actively deciding to make a payment right now, and explicitly saying “yes, charge me.” AI agents don’t work like that at all, and honestly, the mismatch is pretty fundamental.
The auth model just doesn’t fit. Payment APIs use OAuth 2.0 or session tokens that expire after a few minutes or hours. The whole flow assumes you log in through a browser, authorize one specific payment, then log out. But AI agents don’t really have “sessions” in that sense. They need credentials that stick around for days or weeks, with permissions that are way more granular than “yes you can make payments” or “no you can’t.” I’ve never seen a payment provider that lets you say “this API key can spend up to $500 a day, but individual transactions over $100 need approval.” That’s just not how these systems were built.
Authorization is too black and white. Right now, when you make a payment, the API basically asks “is this user allowed to do this?” and the answer is yes or no. Done. With agents, it needs to be way more nuanced. Like, “can this agent make this payment given what it’s been spending lately, current risk factors, and the rules it’s supposed to follow?” You’d need to evaluate policy in real-time for every transaction, not just check if someone has valid credentials.
Nobody knows who’s liable when things go wrong. If I get tricked into authorizing a fraudulent payment, there are clear rules about who’s on the hook—consumer protection laws, merchant agreements, chargeback processes, all that. But what happens when an AI agent makes an unauthorized payment? Maybe it got hit with a prompt injection attack. Maybe there’s a bug in how it reasons about things. Who pays for that? The person who owns the agent? The developer who built it? The payment company that processed it? Payment APIs don’t even have fields for “agent identifier” or any kind of decision log, because this scenario literally wasn’t on anyone’s radar when they designed these systems.
What actually needs to change:
We need capability-based auth instead of the identity stuff we have now. Not “this token represents Bob,” but “this token lets you spend up to $X under these specific conditions, Bob issued it, and Bob can revoke it whenever.”
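One way to sketch a capability token (entirely illustrative, not any provider's API): the signed payload encodes what the token permits rather than who the bearer is, and the issuer can kill it at any time via a revocation list.

```python
import hmac, hashlib, json

# Illustrative capability-based auth sketch: the token encodes what it
# permits, not who the bearer is, and the issuer can revoke it by id.
SECRET = b"issuer-signing-key"  # held by the issuer ("Bob's" wallet provider)
REVOKED = set()                  # issuer-side revocation list

def mint_capability(cap_id, max_amount_cents, purpose):
    payload = json.dumps({"id": cap_id, "max": max_amount_cents,
                          "purpose": purpose}, sort_keys=True)
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "sig": sig}

def check_capability(token, amount_cents, purpose):
    expected = hmac.new(SECRET, token["payload"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token["sig"]):
        return False             # forged or tampered token
    claims = json.loads(token["payload"])
    if claims["id"] in REVOKED:
        return False             # the issuer revoked it
    return amount_cents <= claims["max"] and purpose == claims["purpose"]

cap = mint_capability("cap-1", 500_00, "cloud-resources")
assert check_capability(cap, 120_00, "cloud-resources")
assert not check_capability(cap, 120_00, "gift-cards")      # wrong purpose
REVOKED.add("cap-1")
assert not check_capability(cap, 10_00, "cloud-resources")  # revoked
```

A real design would add expiry, key rotation, and attenuation (deriving narrower capabilities from broader ones), but the inversion is the point: authorization travels with the capability, not the identity.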
We need real-time policy evaluation. The payment system has to look at how the agent’s been behaving, what it’s been spending, current risk signals—all of that at the moment of transaction. Can’t just rely on “well, this credential was valid yesterday.”
We need structured decision context. The API should capture why the agent made this transaction. What did it consider? What other options did it look at? What risk factors were in play? That all goes into the audit trail.
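A decision record like that might look something like the following; every field name here is a hypothetical schema, since no payment API defines one today.

```python
import json
from datetime import datetime, timezone

# Hypothetical shape for an agent decision record attached to a payment:
# not just "user X authorized payment Y", but why the agent acted.
decision_record = {
    "agent_id": "procurement-agent-7",  # illustrative identifier
    "payment_id": "pay_123",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "intent": "renew cloud storage before quota exhaustion",
    "inputs_considered": ["current usage at 92%", "renewal price $80/mo"],
    "alternatives_rejected": [
        {"option": "defer renewal", "reason": "risk of data loss"},
    ],
    "risk_factors": ["first purchase from this merchant"],
    "policy_checks": {"daily_cap": "pass", "single_txn_limit": "pass"},
}
# Serialized deterministically so it can go into an append-only audit log.
audit_line = json.dumps(decision_record, sort_keys=True)
assert "alternatives_rejected" in audit_line
```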
We need fraud detection built for agents. Current fraud systems look for weird user behavior. We’ll need different models that catch weird agent behavior—like if it’s being manipulated through prompt injection, or if there’s some logic flaw, or if it’s getting adversarial inputs.
The reality is, the industry isn’t anywhere close to solving this. Payment infrastructure moves at a glacial pace, and we’re still in the “figuring out what this even looks like” phase. I’d be shocked if we have real solutions in less than five years.
In your experience building at both PayPal and JPMorgan, what are the biggest misconceptions companies have about what it takes to scale payment infrastructure? What looks simple from the outside but is much harder in practice?
The biggest misconception is that it’s just a throughput problem. Everyone thinks, “we’re growing fast, we need to handle more transactions, let’s add servers and optimize our code.” But that’s maybe 20% of the actual problem.
Things are way more complex in reality. Data consistency, for one. Everyone wants to use eventual consistency because they read about it in some tech blog. But in payments, you absolutely cannot have eventual consistency on account balances.
At PayPal, we needed real-time consistency for anything that touches money. Which means actual database transactions, distributed locks, all that heavy coordination stuff that doesn’t scale as nicely as people want. You can’t just throw more servers at it past a certain point. We hit this wall around 150k transactions per second, where adding more capacity actually slowed things down due to lock contention. Took us weeks to figure out what was going on.
Oh, and latency: in payments, tail latency is what kills you. If 99% of your authorizations are fast but 1% take more than five seconds, you’re losing tons of sales. People don’t wait. They bounce. We spent half a year – I’m talking like six months of engineering time – just trying to cut 50 milliseconds off our p99 latency. Sounds ridiculous, right? But it translated into about $15 million in additional annual revenue because fewer people abandoned checkout.
JPMorgan had different constraints but the same fundamental issues. Even there, people underestimated the operational complexity. Payment systems aren’t these clean stateless APIs. We’re holding fraud scores in memory, tracking transaction IDs we’ve seen, managing circuit breakers and rate limits. When a server crashes, you can’t just boot a fresh one. What happens to the payments that were in flight? We built an entire failover system just to handle restarts without dropping transactions.
Looking ahead, we’re seeing AI agents that can write code, analyze data, and make decisions, but payments remain a largely human-mediated process. What trends in distributed systems or payment infrastructure do you think will define the next decade as autonomous agents become more prevalent?
AI agents are getting smarter by the day and doing real work, like writing code or crunching data, on their own, while payments are still stuck: a human has to click “approve” every time. From what I’ve seen building these massive systems at PayPal and JPMorgan, the next ten years are going to completely turn things around, but slowly, because we are dealing with money.
The big shift I am hoping for is “agentic commerce”: autonomous agents actually handling transactions end-to-end without you babysitting them. PayPal has already jumped in with its own merchant integrations.
On the distributed systems side, the trend that will matter most is programmable payments and delegated authority baked into the infrastructure.
You’ve progressed from developer to staff engineer at some of the most demanding engineering organizations in the world. What advice would you give to engineers who want to work on systems where the stakes are this high, where your code moves money and downtime has immediate financial consequences?
Here’s what I wish someone had advised me early in my career:
Get really, really good at understanding failure modes. When I was starting out, I focused on making code work. That’s what they teach you in school, right? Write code that works. But in payment systems, that’s maybe 30% of the job. The other 70% is understanding every possible way your code can fail and what happens when it does. What if the database is slow? What if this API call times out? What if two requests come in at the exact same millisecond? What if the network drops halfway through?
I started reading postmortems obsessively. AWS publishes them when they have outages. Google does too. GitHub, Stripe, everyone. Read those religiously. You start seeing patterns – cascading failures, thundering herds, this thing called split-brain where two parts of your system disagree about who’s in charge. Once you’ve seen it happen to someone else, you start recognizing it in your own code before it bites you.
The other thing is that observability isn’t something you add later. You build it in from day one. You need distributed tracing. You need metrics that actually matter. At PayPal, we tracked authorization success rates alongside technical metrics because sometimes a backend issue doesn’t show up in infrastructure numbers but immediately appears in failed transactions.
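One way to build that in from day one, sketched with stdlib Python (real systems would use something like OpenTelemetry; the names here are illustrative): every call path records both a technical metric (latency) and a business metric (authorization success rate) so a backend issue that hides in infrastructure numbers still shows up in failed transactions.

```python
import time
from collections import defaultdict

# Minimal observability sketch: a decorator that records both latency (a
# technical metric) and success/failure (a business metric) for every call.
metrics = defaultdict(list)

def observed(name):
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                metrics[f"{name}.success"].append(1)
                return result
            except Exception:
                metrics[f"{name}.success"].append(0)
                raise
            finally:
                metrics[f"{name}.latency_ms"].append(
                    (time.perf_counter() - start) * 1000)
        return inner
    return wrap

@observed("authorize")
def authorize_payment(amount_cents):
    return {"approved": True, "amount": amount_cents}

authorize_payment(20_00)
successes = metrics["authorize.success"]
success_rate = sum(successes) / len(successes)  # the business metric
assert success_rate == 1.0
assert len(metrics["authorize.latency_ms"]) == 1
```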
Understand the business context, not just the technical requirements. This is huge. I’ve seen so many engineers who can explain every technical detail but have no idea why they’re building what they’re building. Payment systems exist to solve business problems. Enable commerce. Reduce fraud. Stay compliant. When you understand the why, you make better technical decisions.
Never stop learning. Payment infrastructure touches distributed systems, database internals, network protocols, security, cryptography, and financial regulations. You’ll never master all of it. I haven’t. But you can keep expanding your understanding.
Just go in with your eyes open. It’s not glamorous. It’s stressful. You’ll make mistakes – I still do. But if you’re careful, methodical, always learning, and you have that healthy paranoia about failure modes, you can build systems that are actually reliable.


