
The scale of modern marketplaces makes manual content processing virtually impossible, especially in two critical areas: product title generation and content moderation. Mistakes here are costly, ranging from poor search visibility to an immediate loss of user trust. That’s why more and more platforms are turning to machine learning: it enables clear, localized titles and catches violations before product listings even go live.
Research confirms the effectiveness of this approach: hybrid ML models improve title quality by up to 15% as measured by the chrF metric, while automated moderation can detect up to five times more violations than manual filters.
We spoke with Artem Bochkarev, Director of Machine Learning at AliExpress and an expert in deep-tech integration at e-commerce scale. Under his leadership, AliExpress localized and replaced Alibaba’s original search and recommendation algorithms and launched automated systems for content moderation and product title generation. Artem’s approach is not experimental: it is robust, production-ready ML infrastructure embedded directly into business-critical processes, from title generation to real-time moderation, across billions of SKUs.
When did machine learning become a truly viable alternative for title generation and content moderation? Was there a specific moment or challenge that triggered the shift?
The real shift happened with the emergence of modern large language models (LLMs). Before that, automatic title generation was difficult to implement — there was no clear way to define quality metrics, data requirements were high, and the output was often too generic or inaccurate.
LLMs changed that. They can understand context, local language, and product categories, and generate clear, relevant, and readable titles. This made them truly useful for business, and we saw a measurable increase in conversions after implementing them.
As for content moderation, the technology has existed for a while, but it wasn’t widely adopted. Smaller marketplaces could manage with manual review. But on large-scale platforms like AliExpress, where millions of items are uploaded, automation became a necessity. We can’t tightly control what and how sellers upload, so machine learning models handle most of the moderation: detecting violations before listings go live, adapting to new policies, and learning from labelled data.
Why automate title generation at all — what value does it bring to the business and the end user?
Here, I would talk more broadly about content enhancement: title or description generation, attribute extraction and correction, or image enhancement. The value for the business is twofold: making the item more appealing and understandable to the user, and improving the user’s ability to find it.
First, many studies show that conversion from product view to order depends heavily on the quality of the product page. This includes a short, clear, descriptive title; an image that truly shows what the product is; and a full description. When we rolled out improved titles, we saw an increase in conversion rates.
The other way good content benefits the business is that search works better. In clothing, for example, color matters a great deal in product selection. If we can correctly assign colors to products, we can leverage this information in search: people can filter by the attribute, or we can even detect it in a query (“black shirt”) and return only relevant products. None of this is possible if our items lack correctly parsed attributes.
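To make the idea concrete, here is a minimal sketch of attribute-driven search filtering. All names and the keyword list are illustrative assumptions; a production system would use an ML extractor rather than a hand-written vocabulary.

```python
# Minimal sketch of attribute-driven search filtering (illustrative only).
KNOWN_COLORS = {"black", "white", "red", "blue", "green"}

def extract_color(text: str) -> str | None:
    """Naively pull a color attribute out of a title or query."""
    for token in text.lower().split():
        if token in KNOWN_COLORS:
            return token
    return None

def search(query: str, catalog: list[dict]) -> list[dict]:
    """If the query mentions a color, keep only items whose parsed
    color attribute matches; otherwise return the full result set."""
    wanted = extract_color(query)
    if wanted is None:
        return catalog
    return [item for item in catalog if item.get("color") == wanted]

catalog = [
    {"title": "Classic Cotton Shirt", "color": "black"},
    {"title": "Summer Linen Shirt", "color": "white"},
]
print(search("black shirt", catalog))  # only the black shirt survives the filter
```

The same parsed attribute powers both the filter UI and query understanding, which is why incorrect attributes hurt search in two places at once.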
How is the content moderation system architected? Where is the line between machine learning and manual processes?
The content moderation system follows a hybrid structure that combines machine learning with human oversight to ensure both scalability and quality.
Under my leadership, the team trains models for content generation and moderation using data labelled by human annotators. In day-to-day operations, the majority of items are moderated automatically by the ML model. Only new or updated listings are reviewed, and if the model is uncertain about a prediction, it defers the decision to a human reviewer. These human decisions are then fed back into the training pipeline to continuously improve the model.
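As a minimal sketch of that deferral logic (the 0.9 threshold, the verdict schema, and the model interface are all assumptions for illustration, not production values):

```python
# Sketch of confidence-based routing between the model and human review.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str         # e.g. "approve" or "reject"
    confidence: float  # model's probability for that label

def moderate(listing: dict, model, review_queue: list) -> str:
    verdict: Verdict = model.predict(listing)
    if verdict.confidence >= 0.9:
        return verdict.label             # confident: decide automatically
    review_queue.append(listing)         # uncertain: defer to a human
    return "pending_human_review"        # the human label later re-enters training

class StubModel:
    """Stand-in for the real moderation model."""
    def predict(self, listing: dict) -> Verdict:
        return Verdict(label="approve", confidence=0.97)

queue: list = []
print(moderate({"title": "USB cable"}, StubModel(), queue))  # -> "approve"
```

The human-review queue is what closes the loop: those decisions become fresh labelled data for the next training run.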
When a new rule or regulation is introduced — something the model hasn’t been trained to recognize — a small sample (typically a few dozen examples) is manually labelled, and the model is retrained to handle the new requirement.
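In code terms, onboarding a new rule can be as simple as folding the freshly labelled sample into the existing training data and refitting. The sketch below uses scikit-learn purely for illustration; the actual models and pipeline differ.

```python
# Illustrative only: extend the training set with a few dozen manually
# labelled examples of a new violation type, then retrain.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def retrain(base_texts, base_labels, new_texts, new_labels):
    texts = list(base_texts) + list(new_texts)    # existing corpus + new rule's examples
    labels = list(base_labels) + list(new_labels)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model
```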
Additionally, regular sampling of model outputs is conducted for manual review — a critical step in monitoring moderation quality, identifying potential blind spots, and ensuring alignment with evolving platform policies and standards.
This approach allows us to combine the speed and scale of automation with the nuance and judgment of human review, keeping the system both efficient and reliable.
How do you determine whether a generated title performs better — what does a “good result” mean in this context?
The team under my leadership developed several offline and online evaluations to help determine model success. In the offline setup, an ML-generated title is created for a sample of products and shown alongside the original to a human labeller, who answers two simple questions: how accurate each title is, and which one they prefer. By aggregating many such responses, we can estimate the average difference in appeal between title models, as well as each model’s error rate. This serves as the baseline offline evaluation: without good performance on these metrics, there is no point in releasing the models.
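As an illustration of how such judgments aggregate into a preference win rate and an error rate (the labelling schema here is assumed, not the team’s actual one):

```python
# Toy aggregation of side-by-side judgments. Each record captures which
# title the labeller preferred and whether the ML title had a factual error.
judgments = [
    {"preferred": "ml", "ml_has_error": False},
    {"preferred": "original", "ml_has_error": False},
    {"preferred": "ml", "ml_has_error": True},
    {"preferred": "ml", "ml_has_error": False},
]

n = len(judgments)
win_rate = sum(j["preferred"] == "ml" for j in judgments) / n
error_rate = sum(j["ml_has_error"] for j in judgments) / n
print(f"ML title preferred in {win_rate:.0%} of cases; error rate {error_rate:.0%}")
```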
Another offline evaluation involves measuring how search performance is affected after replacing titles — specifically, how relevant the results are (since search is driven by item titles, any changes directly impact it). Ultimately, success is defined as achieving a statistically significant improvement in one of the key business metrics in an A/B test. The main metrics here are click-through and order conversion rates on the affected items.
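A standard way to check that final gate is a significance test on conversion rates between the two A/B arms. Below is a self-contained two-proportion z-test with made-up counts, purely as a sketch of the technique:

```python
# Two-proportion z-test on order conversion in an A/B test.
# All counts are invented for illustration.
import math

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for rates x1/n1 vs x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value

z, p = two_proportion_ztest(10_450, 500_000, 10_020, 500_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # ship only if the lift is significant (e.g. p < 0.05)
```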
What do automated moderation and ML-driven content generation contribute to user trust in the platform? Which changes are noticeable to users, and how do you measure that impact?
Moderation is usually invisible to users, except when someone is specifically looking for banned goods. Occasionally, forbidden content can slip into search results or recommendations, but this is rare, and we don’t think it affects the experience much. On the other hand, the company has to deal with legal and government complaints when forbidden content appears on the platform, so we need an efficient process to find and ban such content and avoid litigation. Content enhancement, by contrast, is very noticeable to users and changes their image of the platform. In our internal testing, we found that in three out of four cases users prefer ML-generated titles and report higher satisfaction with the platform.
How do you balance automation and human oversight — where does trust in the model end and manual review begin?
The model is not trusted by default: any model deployed to production must be monitored against a clearly defined set of metrics to keep it from drifting or producing incorrect predictions. This is achieved through periodic human reviews of small random samples, as well as continuous calculation and monitoring of data statistics. For instance, if an average value becomes skewed or the model begins to output a particular label more frequently, that is a signal for an ML engineer to investigate and identify the source of the issue. In short, while it is impossible to eliminate every mistake an AI model makes, it is entirely feasible to keep the error rate within a predefined range (e.g., below 5%).
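A sketch of the distribution-monitoring idea, with the baseline mix and tolerance chosen purely for illustration:

```python
# Alert when the model's output label mix drifts from a historical baseline.
from collections import Counter

def label_drift(recent_labels: list[str], baseline: dict[str, float],
                tolerance: float = 0.05) -> dict[str, tuple[float, float]]:
    """Return {label: (baseline_freq, recent_freq)} for every label whose
    frequency moved more than `tolerance`; a signal to investigate."""
    counts = Counter(recent_labels)
    total = len(recent_labels)
    return {
        label: (base, counts.get(label, 0) / total)
        for label, base in baseline.items()
        if abs(counts.get(label, 0) / total - base) > tolerance
    }

baseline = {"approve": 0.93, "reject": 0.07}
recent = ["approve"] * 850 + ["reject"] * 150      # reject rate jumped to 15%
print(label_drift(recent, baseline))               # flags the shift in both labels
```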
How do you see the future of content generation and moderation in e-commerce — what’s likely to become standard, and what will remain experimental?
I think we will see more and more companies moving away from strictly manual processes for item creation and validation toward AI-based approaches, which would save significant human resources. We have yet to see large-scale adoption of image generation for products. That could make product pages more homogeneous and visually appealing, and you might even personalize the generated images. But the approach raises concerns about the authenticity of the items being sold, so it remains unclear whether the industry will move in that direction.