Taller, faster, bigger, smarter… it’s human nature to want to be the best, and the purpose of competition is to prove who – or what – is the crème de la crème in any given challenge or activity.
For some time now, the spirit of competition has been applied to many ‘challenges’ between humans and machines. Aside from their entertainment and curiosity value, one benefit of such challenges is that they help to measure the evolution of new technologies, bringing them to a level of maturity at which they can eventually be put to use in everyday applications to make life easier or more productive for us all.
By way of example, many of us now use speech as the primary interface to the personal assistants built into the devices we use day to day, ranging from our mobile phones to our car infotainment systems and smart speakers at home. While speech technology has a long history, it reached a landmark in public perception in 2011, when IBM’s Watson – a computer system capable of answering questions posed in natural language – ‘appeared’ on the quiz show Jeopardy! It was a very public display of the progress that had been achieved in natural language processing, which is what makes it so easy to ‘talk to’ and ‘command’ the speech-based devices and services we all use daily.
One of the most exciting recent examples of technological progress came when Alibaba secured first place on the latest global VQA (Visual Question Answering) Leaderboard. The annual challenge, which has been organised since 2015 by CVPR, a leading international computer vision conference, attracts the greatest minds from global players including Facebook, Microsoft and Stanford University. In the task, a system is shown an image and asked a related question in natural language, and it must give an accurate natural language answer.
This was a significant event in itself, and it came with an even more significant milestone: Alibaba’s system marked the first time that a machine outperformed humans in understanding images to answer text questions. The algorithm recorded an accuracy rate of 81.26% in answering questions related to images, compared with human performance of 80.83% (on the test-standard portion of the challenge). To give you an idea of the scale of this achievement, this year the challenge contained more than 250,000 images and 1.1 million questions.
This breakthrough in machine intelligence for answering image-related questions was made possible by innovative algorithm design. By leveraging proprietary technologies – including diverse visual representations, multimodal pre-trained language models, adaptive cross-modal semantic fusion, and alignment technology – the Alibaba team was able to make significant progress not only in analysing the images and understanding the intent of the questions, but also in answering them with proper reasoning and expressing the answers in a human-like conversational style.
What makes this especially exciting is that the technology is already widely used in a number of applications to make them even more user-friendly. For example, it already features in Alime Shop Assistant, an intelligent chatbot used by tens of thousands of merchants across retail platforms, adding extra convenience and value to the user experience.
The ‘win’ is another significant milestone in machine intelligence, and it underscores the continuous efforts being made to drive research and development in related AI fields. It also gives us an opportunity to celebrate the advantages advanced AI brings to humans: when machines are ‘smart’, they can assist us in our daily work and life, enabling people to focus on the creative tasks that they are best at, while the machines take on the less interesting, more repetitive tasks.
To that end, VQA can be used across a wide range of areas, such as searching for products on e-commerce sites, supporting the analysis of medical images for initial disease diagnosis, and ‘smart’ driving, where an in-car AI assistant can offer basic analysis of photos captured by the in-car camera. All of these are scenarios where VQA is working to make life better for people, personally and professionally.
The desire to be taller, faster, bigger, smarter will never leave us. But along the way, we should all reflect on the technological progress that has made our day-to-day lives easier… and on how much of it has been forged through the spirit of man vs machine competition.