
We are sitting down with Artem Sentsov, Co-Founder and CTO of ClearPic AI, to discuss the technical and strategic challenges of building a knowledge graph for entity resolution and risk detection across Central Asia’s fragmented data landscape.
With a background spanning international relations, counter-terrorism financing research, and machine learning systems, Artem brings a unique perspective to solving one of the compliance industry’s hardest problems: connecting the dots when bad actors don’t want to be found.
First, the “Wild West” problem. What made Central Asian data so uniquely challenging compared to other regions you’ve worked with?
It’s not just that the data is messy; the environment itself is what I would call institutionally hostile. For instance, in the UK, Companies House, the country’s registrar of companies, actually wants you to have the data, either via API access or bulk downloads.
But in Central Asia, the mindset is the exact opposite. Government bodies might technically publish data to tick a box, but they do everything in their power to make accessing it as painful as possible. We deal with fragmentation where three different government bodies (Justice, Tax and Statistics) keep three different versions of the same company record, and none of them talk to each other.
And then there’s the tyranny of the exact match. If you don’t have the precise tax ID or if you make a single typo in a company name, you get nothing. No search suggestions, no fuzzy matching. You can’t just “browse” to see who owns what, or visualize a holding structure because the underlying bulk data required to build those connections simply isn’t public.
You mentioned spending 30% of development time on encoding normalization alone. Can you walk us through a specific example where Cyrillic edge cases broke your system?
This is what I call the classic “alphabet soup” problem. You wouldn’t believe how many ways there are to write the letters “C” or “X” in local variations of Cyrillic that standard encoders simply choke on.
But the real headache is historical drift. Take a businessman operating in Uzbekistan, Tajikistan, and Azerbaijan. In one country, his name would be recorded in Cyrillic. In another, it’s in Latin. In a third, it’s a mix. Plus, you have naming conventions shifting from Soviet styles (ending in “-ov”) to traditional local endings.
To a standard algorithm, these look like three different people. If you rely on a standard 80% phonetic match, you’ll miss them. We had to build custom normalization just to say: “Yes, this person bidding on a state contract here is the same person running a holding company there.”
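To make that concrete, here is a minimal sketch of the idea, with a deliberately tiny transliteration table and made-up names; the production pipeline Artem describes is far more extensive, but the shape is the same:

```python
# A minimal sketch of cross-script name normalization (illustrative only, not
# ClearPic's production code): transliterate Cyrillic to Latin, collapse the
# "y" glide, then strip the drifting suffix, so that "Алиев", "Aliyev" and
# "Aliev" all resolve to the same key.
CYR_TO_LAT = {  # deliberately tiny; a real table covers every regional variant
    "а": "a", "л": "l", "и": "i", "е": "e", "в": "v",
}

def normalize_name(raw: str) -> str:
    lowered = raw.lower().strip()
    latin = "".join(CYR_TO_LAT.get(ch, ch) for ch in lowered)
    latin = latin.replace("y", "")   # Aliyev vs Aliev
    for suffix in ("ov", "ev"):      # Soviet-style endings; a real system also handles -uly, -ogly, etc.
        if latin.endswith(suffix):
            latin = latin[: -len(suffix)]
            break
    return latin

assert normalize_name("Алиев") == normalize_name("Aliyev") == normalize_name("Aliev")
```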
Why did you choose a Knowledge Graph over a pure LLM approach, especially given the hype around RAG systems?
We treat the Knowledge Graph as our deterministic source of truth, whereas an LLM is probabilistic. The problem with a pure LLM approach is hallucination: models think in terms of semantic similarity, which can lead to dangerous false positives in compliance. An LLM might confidently invent a connection between two people because they share a patronymic and a region, or miss a critical connection because of a spelling variation.
Retrieval-Augmented Generation systems are only as good as the data they retrieve. If you feed raw data into a RAG, you get confident errors. We built the Knowledge Graph to clean, normalize, and link the data with 95% confidence first. Only then do we layer the LLM on top to act as an interface, to write queries or explain the data. The Graph provides the facts; the LLM just writes the report.
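As an illustration of that division of labour (hypothetical entities, and the LLM layer reduced to a stub), the graph alone decides whether a connection exists; the language model is only handed facts that survived that check:

```python
# Sketch of "the Graph provides the facts; the LLM just writes the report".
from collections import deque

# Deterministic layer: edges the knowledge graph has already verified.
EDGES = {
    "Person:A. Karimov": [("owns", "Company:Alfa Invest")],
    "Company:Alfa Invest": [("supplies", "Agency:State Procurement Dept")],
}

def find_path(start: str, end: str):
    """Breadth-first search over verified edges; no probabilities involved."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        for relation, target in EDGES.get(path[-1], []):
            if target == end:
                return path + [relation, target]
            if target not in seen:
                seen.add(target)
                queue.append(path + [relation, target])
    return None

def write_report(facts):
    # Placeholder for the LLM layer: it phrases facts, it never invents them.
    return "Verified chain: " + " -> ".join(facts)

facts = find_path("Person:A. Karimov", "Agency:State Procurement Dept")
if facts:
    print(write_report(facts))
```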
Risk management and compliance teams love certainty. How did you sell them on probabilistic scores?
We didn’t try to sell them a black box. We gave them the controls. Compliance isn’t one-size-fits-all. Sometimes you need a precise match to verify a director. But other times, you’re hunting for a hidden network, and you actually want to see the fuzzy connections.
We tell them: “You can lower the threshold. You’ll see more noise, but you won’t miss the signal.” It’s better for an analyst to manually dismiss a few false positives than to miss a critical asset because the algorithm played it too safe.
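A toy example of what that dial does in practice (made-up candidates, and a plain string ratio standing in for the real scoring): the same screening run at two thresholds, where the stricter one silently drops a near-miss that mixes Latin and Cyrillic letters.

```python
# Same candidate list screened at two thresholds: lowering the cutoff adds
# noise, but keeps the near-miss that a strict screen would silently drop.
from difflib import SequenceMatcher

def score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

target = "Alfa Invest Holding"
candidates = [
    "Alfa Invest Holding",
    "Alfa lnvest Нolding",   # Latin "l" for "I", Cyrillic "Н" for "H"
    "Beta Trading",
]

for threshold in (0.95, 0.80):
    hits = [c for c in candidates if score(target, c) >= threshold]
    print(f"threshold={threshold}: {hits}")
```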
What are the most sophisticated techniques you’ve seen bad actors use to fragment their records?
I would reframe “bad actors” a little: many of these are common business practices for asset protection, even though they look suspicious. The most pervasive technique is the “Nominal Director” pattern: using a proxy with no digital footprint to hold the shares. We spot it by looking for discrepancies: a company might list a random individual in the corporate registry, while the “Beneficial Owner” field in a government procurement database (where transparency rules are stricter) reveals the real owner.
But my favorite is what I call “Name Spoofing.” We see shell companies registered with known multinational brand names. They want investigators to see that name on a bank statement and think, “Oh, it’s a major, reputable company,” and stop digging. In reality, it’s a firm with five layers of circular ownership and zero assets. It’s simple, but it works surprisingly well against tired humans.
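To illustrate the Nominal Director check described above, here is a toy version of the cross-source comparison, with hypothetical records standing in for the corporate registry and the procurement database (a real pipeline would normalize the names first, as discussed earlier):

```python
# Flag companies whose registry director differs from the beneficial owner
# disclosed under the stricter procurement transparency rules.
corporate_registry = {
    "Alfa Invest": {"director": "R. Saidov"},           # the nominee on paper
}
procurement_db = {
    "Alfa Invest": {"beneficial_owner": "A. Karimov"},  # the disclosed owner
}

def nominee_flags(registry: dict, procurement: dict) -> list:
    flags = []
    for company, record in registry.items():
        owner = procurement.get(company, {}).get("beneficial_owner")
        if owner and owner != record["director"]:
            flags.append({
                "company": company,
                "registry_director": record["director"],
                "declared_owner": owner,
            })
    return flags

print(nominee_flags(corporate_registry, procurement_db))
```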
How do you handle the schema drift problem when new entity types like crypto wallets emerge?
Schema drift is a massive challenge, and we are increasingly using AI to solve it. We use LLMs to analyze incoming datasets and propose an optimal schema structure automatically, identifying which fields map to our “Person” or “Company” entities and which are novel, like “Wallet Address.”
As crypto becomes a standard payment and holding vehicle, our schema has to be living and breathing. But we always keep a human in the loop. We can’t let an AI blindly decide how our data structure evolves. The AI proposes the schema change; the engineer approves it.
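A rough sketch of that propose-then-approve loop (the class names and the hard-coded “LLM answer” are illustrative, not ClearPic’s internal API):

```python
# The AI proposes a schema change; an engineer has to approve it explicitly.
from dataclasses import dataclass

@dataclass
class SchemaProposal:
    new_entity_type: str
    fields: dict
    rationale: str
    approved: bool = False

def propose_schema(sample_records: list) -> SchemaProposal:
    # In production an LLM inspects the incoming records; here we hard-code
    # the kind of answer it might return for a crypto-wallet dataset.
    return SchemaProposal(
        new_entity_type="Wallet",
        fields={"address": "string", "chain": "string", "holder": "ref:Person"},
        rationale="Records contain blockchain addresses not covered by Person or Company.",
    )

def review(proposal: SchemaProposal, engineer_ok: bool) -> SchemaProposal:
    # The human gate: nothing reaches the live schema without sign-off.
    proposal.approved = engineer_ok
    return proposal

sample = [{"address": "0xAbC123...", "chain": "ethereum"}]
print(review(propose_schema(sample), engineer_ok=True))
```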
Your background is in international relations and researching counter-terrorism financing, not computer science. How did that unconventional path actually help you design a better compliance system?
It taught me to think in systems. In geopolitics, you learn that nothing is linear and everything is connected. You can’t reduce geopolitics to a simple equation. That’s exactly how dirty data works. I don’t just look at a database row; I understand the macro-environment that created it.
Since 9/11, the world has become obsessed with transparency, but that obsession has created a lot of friction for legitimate business. I see my role as a translator. I understand why the data is messy (politics, culture) and I understand what the investor needs (safety, clarity). I’m not just parsing code; I’m trying to bridge two very different worlds.
How do you explain a graph algorithm to a compliance officer?
Compliance officers understand tree structures just fine, but the hard part is explaining the “why”. Data lineage is the challenge: it’s less about “how” the nodes are connected and more about “why” we trust the data behind the node.
We focus on behavior: we have to show that, for instance, the company has changed owners, but it’s still trading with the same bad actors and has the same debts. The pattern didn’t change, only the name did. We have to build a glass box where they can see the provenance of every data point.
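One way to picture that glass box (illustrative structures only): every edge in the graph carries the provenance an analyst needs in order to trust it, not just the fact that two nodes are linked.

```python
# Every relationship stores where it came from, when it was retrieved and how
# confident the resolution was, so "why do we trust this?" is always answerable.
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source_node: str
    relation: str
    target_node: str
    provenance: str      # which registry, filing or dataset asserted the link
    retrieved: str       # when it was pulled
    confidence: float    # resolution confidence shown to the analyst

edge = Edge(
    source_node="Company:Alfa Invest 2020",
    relation="successor_of",
    target_node="Company:Alfa Invest",
    provenance="Tax arrears list, bulk file 2024-03",
    retrieved="2024-03-14",
    confidence=0.95,
)
print(f"{edge.relation}: trusted because of '{edge.provenance}' ({edge.confidence:.0%})")
```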
What keeps you up at night? Where do you think the field needs to go next?
That would be the pollution of truth via generative AI output that’s plaguing the internet. As we move forward, we face a fundamental problem: LLMs are generating vast amounts of synthetic information, such as fake news and entirely hallucinated documents. How do we distinguish between a verifiable fact from an official registry and a sophisticated, AI-generated narrative planted by a state actor or a corporation?
We are seeing warning signs of a “splinternet”: a fragmented internet where regions become more isolated and data access becomes harder. The next frontier for us is establishing a mechanism to verify the origin of information. We need to ensure that when our graph asserts a fact, it is immune to the flood of synthetic noise. Trusting the source is becoming more difficult than processing the data.
If someone wanted to build a similar system for a different “hostile data environment,” what’s the one lesson that would save them six months of pain?
To paraphrase Tolstoy: “All clean data is alike; every dirty dataset is dirty in its own way.”
There is no silver bullet. My only advice is aggressive experimenting. Don’t try to design the perfect system on paper. The data will break your assumptions on day one. You have to test, fail, fix, and repeat. It’s painful, but it’s the only way to build something that actually works.




