Vadym, tell us how work on the new SaaS billing solution began. What problem was it meant to solve for telecom operators?
Every telecom operator has a system that generates reports, payment records, and other documents. For example, once a month an operator sends each user a billing report showing the number of minutes and SMS messages used, broken down by date. But in many companies these solutions were developed many years ago and have barely changed since, which creates significant challenges today.
For instance, our client’s system was very difficult to scale because of its monolithic architecture, and it had numerous dependencies and complex configurations. As a result, each system launch turned into a lengthy process where something could go wrong at any moment—not to mention the difficulty of making changes to improve performance or introduce new templates.
My task was to make the solution highly scalable and easy to configure—so that even a marketing specialist without IT skills could customize report templates. From the start, it was clear this would be a complex project: there was no system documentation, and the requirements we were given were vague. Much of the work involved communicating with stakeholders to clarify what exactly needed to be done.
At the initial stage, we invested a lot of time in clarifying details and gradually shaping the scope. We concluded that it was not enough to just improve the backend; we also needed to create a simple, user-friendly interface with flexible editing features—so that every specialist could easily work with the system.
How critical was the task in terms of system load? What was the biggest challenge at the start?
The main challenge was the task itself: to build a fault-tolerant, scalable, and easily extensible system. The previous system contained a large amount of legacy code that had degraded over time, layered with newer code. In the end, modifying such a system turned out to be more difficult and expensive than building a new one.
We needed to handle load balancing effectively, because the system worked in spikes—billing reports were sent to all users at the same time on certain days, and during those periods the load peaked. Marketing campaigns, which brought in new customers, also triggered increased activity. This meant the system architecture always had to be ready for high loads.
How did you approach the system architecture: why did you choose an asynchronous model, and what tools did you use for load management?
A synchronous system usually handles peak loads poorly: you have to scale very quickly to keep up with the growing number of operations. An asynchronous model reduces the stress on the main system by buffering incoming work and spreading the load gradually across downstream services.
We split processes into several stages and introduced the Apache Kafka message broker to relieve pressure from the main service. At the first stage, requests were accepted with basic validation and then sent to Kafka—detailed processing was deliberately skipped to speed things up. Then we consumed requests from Kafka and began enrichment—collecting user data, selecting the right template and version, filling it, and so on.
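In simplified form, the two stages looked roughly like the sketch below. This is only an illustration: the topic name `report-requests`, the class and method names, and the skeletal validation are my assumptions here, not the production code, and the consumer properties are expected to carry the usual bootstrap servers and deserializers.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReportRequestPipeline {

    // Stage 1: accept the request with only basic validation and hand it to Kafka.
    // Detailed processing is deliberately deferred so the entry point stays fast.
    static void acceptRequest(KafkaProducer<String, String> producer,
                              String userId, String payload) {
        if (userId == null || userId.isBlank()) {
            throw new IllegalArgumentException("userId is required"); // basic validation only
        }
        producer.send(new ProducerRecord<>("report-requests", userId, payload));
    }

    // Stage 2: a separate consumer picks the request up later and enriches it:
    // collects user data, selects the template and version, fills it in, and so on.
    static void runEnrichmentLoop(Properties consumerProps) {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("report-requests"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    enrichAndRender(record.key(), record.value());
                }
            }
        }
    }

    // Placeholder for the real enrichment and rendering pipeline.
    static void enrichAndRender(String userId, String payload) { }
}
```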
This way, the initial load on the system was minimal, and requests were processed much faster than before. The subsequent stages could be executed with a slight delay—this wasn’t critical, since the system had already registered the request to prepare a report.
For implementation, in addition to Apache Kafka, we used Java, with communication between services over REST and gRPC, a high-performance remote procedure call framework developed at Google. Both approaches enable interaction between services written in different languages.
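As a rough illustration of the REST side, here is how one service might fetch user data from another using Java's standard HttpClient. The host, endpoint path, and class name are hypothetical, not the project's actual contract.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UserProfileClient {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Fetch user data from a neighboring service over REST.
    // The URL and response handling are illustrative only.
    static String fetchUserProfile(String userId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://user-service.internal/api/v1/users/" + userId))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("user-service returned " + response.statusCode());
        }
        return response.body();
    }
}
```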
What challenges did you face when aggregating data from dozens of external sources, and how did you ensure correctness and fault tolerance?
Each data source has its own specifics—protocols, APIs, data exchange formats. Adding a new source meant introducing potential instability into the system: it might be unavailable or malfunction. All of this complicated the system, since each failure had to be resolved manually. To address this, we built a modular architecture to ensure correctness and resilience.
Each module was connected via its own adapter and had several layers: retry (re-executing failed operations), deduplication (removing duplicate data), fallback strategies (executing backup plans), logging of critical steps, and more.
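A heavily simplified sketch of such an adapter could look like the code below. The class and method names are illustrative; a production version would persist deduplication state and rely on a resilience library rather than a hand-rolled retry loop.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class ResilientSourceAdapter<T> {

    private final Supplier<T> source;    // the source-specific fetch logic behind the adapter
    private final Supplier<T> fallback;  // backup plan used when all retries are exhausted
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet(); // deduplication layer
    private final int maxAttempts;

    public ResilientSourceAdapter(Supplier<T> source, Supplier<T> fallback, int maxAttempts) {
        this.source = source;
        this.fallback = fallback;
        this.maxAttempts = maxAttempts;
    }

    // Deduplication: report whether a record with this key was already processed.
    public boolean isDuplicate(String recordKey) {
        return !seenKeys.add(recordKey);
    }

    // Retry layer: re-execute the failed call a few times, then fall back.
    public T fetch() {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return source.get();
            } catch (RuntimeException e) {
                // Logging of critical steps; a real system would use a structured logger.
                System.err.printf("attempt %d of %d failed: %s%n", attempt, maxAttempts, e.getMessage());
            }
        }
        return fallback.get(); // fallback strategy: execute the backup plan
    }
}
```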
We didn’t just log operations; we also logged critical checkpoints in the process flow. If an error forced reprocessing, the completed checkpoints were already on record, so we could deduplicate work and run additional checks instead of starting from scratch. This allowed us to restore operations and maintain stability even under partial system failures.
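The checkpoint idea can be pictured with the sketch below. In the real system the checkpoint store was persistent; here it is kept in memory, with a hypothetical step name, purely to stay self-contained.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CheckpointLog {

    // requestId -> completed steps; persisted to a database in a real system,
    // held in memory here only to keep the sketch self-contained.
    private final Map<String, Set<String>> completed = new ConcurrentHashMap<>();

    // Record a critical checkpoint once a step finishes.
    public void markDone(String requestId, String step) {
        completed.computeIfAbsent(requestId, id -> ConcurrentHashMap.newKeySet()).add(step);
    }

    // When reprocessing after a failure, skip steps that already completed.
    public boolean alreadyDone(String requestId, String step) {
        return completed.getOrDefault(requestId, Set.of()).contains(step);
    }
}
```

During reprocessing, each stage of the pipeline would then be guarded by something like `if (!log.alreadyDone(requestId, "ENRICHMENT")) { enrich(requestId); log.markDone(requestId, "ENRICHMENT"); }`, re-running only the work that never finished.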
Was there a moment in the project when everything could have gone wrong? How did you and the team handle it?
The first difficulty was the vague requirements; at that point, everything could have gone completely off track. We didn’t understand most of what the system had to guarantee: how many requests it had to withstand, what the response time should be, what the data retention policies were, and other parameters. But I managed to negotiate with stakeholders and develop clear requirements.
The second issue arose during development. We conducted research and tested different engines for rendering templates into documents. Out of 15 potentially suitable options, we chose one that initially seemed the easiest to integrate and produced decent rendering quality.
However, once we started more complex tests, some services using this engine failed to process large volumes of requests—so we had to replace it. Fortunately, we conducted testing properly, identified the issue early, and ultimately saved the project, giving it the chance to grow further.
We also encountered challenges with templates. During testing, we noticed that certain types of templates took much longer to process. To address this, we added several layers of data buffering and caching to simplify rendering. We also built a separate metrics system to track each stage and monitor performance trends. This allowed us to respond proactively to delays.
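To give a rough idea of the caching and metrics layers together, here is a toy version of both; all names are illustrative, and a production setup would use a proper metrics library and a cache with eviction rather than a plain map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class RenderingMetrics {

    // Buffer/cache of prepared template data, keyed by template id and version.
    private final Map<String, String> preparedCache = new ConcurrentHashMap<>();

    // Running totals per pipeline stage, so slow stages show up as trends.
    private final Map<String, Long> totalNanosByStage = new ConcurrentHashMap<>();
    private final Map<String, Long> callsByStage = new ConcurrentHashMap<>();

    // Caching layer: run the expensive preparation once per template key.
    public String prepareTemplate(String templateKey, Supplier<String> expensivePreparation) {
        return preparedCache.computeIfAbsent(templateKey, k -> expensivePreparation.get());
    }

    // Metrics layer: time one stage of the pipeline and record the result.
    public <T> T timed(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            totalNanosByStage.merge(stage, elapsed, Long::sum);
            callsByStage.merge(stage, 1L, Long::sum);
        }
    }

    // Average stage duration in milliseconds, useful for spotting slow template types.
    public double averageMillis(String stage) {
        long calls = callsByStage.getOrDefault(stage, 0L);
        return calls == 0 ? 0.0 : totalNanosByStage.get(stage) / 1_000_000.0 / calls;
    }
}
```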
I was directly involved in all major technical decisions. I also had a strong influence on architectural choices, distributed tasks among mid-level developers, and mentored them—for example, explaining how to approach problems and which architectures to prefer.
What did this project give you personally in terms of professional growth? Have your approaches to high-load systems changed?
This project gave me significant growth in terms of soft skills. The client’s requirements were very fluid, and we had to clarify a lot on the go.
I had to listen to stakeholders, translate business requirements into architectural solutions, and defend those solutions in front of company leadership. Essentially, I acted as a developer, architect, and lead all at once.
Finally, in your opinion, what technology trends will shape the future of high-load systems?
I believe one of the main trends is the shift toward event-driven architectures. In fact, we also developed an event-driven system—requests were received at the entry point, and everything else was processed through events. This allowed us to scale, reduce load on the main system, and recover after failures.
I also think observability approaches will play a bigger role—various metrics, alerts, and real-time analytics of system performance. Most likely, these practices will increasingly be embedded at the system design stage.