
The dangers of AI model drift: lessons to be learned from the case of Zillow Offers

How much faith should companies place in AI making business decisions? And how often should organisations ask this question of any particular model? The recent experience of Zillow Offers, a US online property marketplace, provides a cautionary tale about AI model risk management. At Zillow, a misfiring property valuation algorithm led the company to overestimate the value of the houses it purchased in Q3 and Q4 2021 by more than $500m. As a result, Zillow has announced Q3 losses of $304m, expects additional losses of roughly $200m, and is planning to reduce its workforce by 25%.

How can companies best manage model risk? Zillow likely tested its model rigorously during development, and after putting it into production the company initially rolled it out slowly to see how it performed in the real world. After this initial live period, it aggressively expanded the purchasing programme, buying more homes in a six-month period than it had in the previous two years. On one view, this seems prudent: develop, test, prove out in live conditions, and then expand.

However, this approach clearly wasn't enough to manage future risk. Market conditions in real estate can shift rapidly, deviating from the conditions in which a model was initially trained. There is a strong argument that the fallout at Zillow would not have occurred had the models been aggressively monitored, with adjustments made when market conditions changed and housing prices fell. Instead, models assuming a continuing buoyant market were apparently left alone and overestimated prices accordingly. The impact of this 'concept drift' could have been avoided by risk management processes informed by better tools for model monitoring, so that the data science teams could better manage and maintain the quality of the company's AI models.

Ongoing model risk management requires both technology and people

Had Zillow deployed tools to measure 'drift', that is, changes in its model accuracy, model outputs, and model inputs on an ongoing basis, potential model issues would likely have been detected: such tools automatically alert data science teams when there is drift or performance degradation, support root cause analysis, and inform model updates. (In a separate article, "Drift in Machine Learning", my colleagues explore the various types of drift and what can be done to account for them.)
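As a minimal sketch of what such input monitoring might look like, the snippet below computes a Population Stability Index (PSI) between a feature's training-time distribution and its recent production distribution. The synthetic data, the variable names, and the 0.2 alert threshold are illustrative assumptions, not anything Zillow is known to have used.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare a reference sample (e.g. training data) with a recent production
    sample of the same feature; a higher score indicates more drift."""
    # Bin edges come from the reference (training-time) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero and log(0) for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Hypothetical example: sale prices seen at training time vs. last month's market.
training_prices = np.random.normal(350_000, 60_000, 5_000)
recent_prices = np.random.normal(320_000, 70_000, 1_000)   # cooling market
psi = population_stability_index(training_prices, recent_prices)
if psi > 0.2:  # a commonly used rule-of-thumb threshold
    print(f"Significant drift detected (PSI={psi:.2f}); alert the data science team")
```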

In addition to monitoring tools, human intervention would also have been worthwhile to help understand the root causes of why estimated house prices (the model outputs) were trending higher, particularly when viewed alongside declining model accuracy. Similarly, an examination of model input distributions could have raised an alert: a model input tracking average property prices over time would have shown that the market was cooling. On the basis of this information, the model could then have been adjusted to reflect changing market conditions by placing greater weight on the most recent data.
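A sketch of that kind of adjustment, down-weighting older transactions so that the fitted price/feature relationship follows the most recent market data, might look like the following. The synthetic data, the 90-day half-life, and the use of a plain linear model are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Hypothetical feature matrix (e.g. standardised bedrooms, bathrooms, square footage)
# and observed sale prices, plus the age in days of each transaction.
X = rng.normal(size=(1_000, 3))
y = X @ np.array([15_000.0, 8_000.0, 120_000.0]) + 300_000 + rng.normal(0, 20_000, 1_000)
age_days = rng.integers(0, 365, size=1_000)

# Exponentially down-weight older sales: a transaction ~90 days old counts half as much.
half_life_days = 90
weights = 0.5 ** (age_days / half_life_days)

model = LinearRegression()
model.fit(X, y, sample_weight=weights)  # recent market conditions dominate the fit
print(model.coef_, model.intercept_)
```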

The complexity of real-world risks makes a strong case for model monitoring

The exact cause of the failure of Zillow's algorithms to estimate house prices accurately is unknown, but there are a number of reasons why property prices might fall and why certain zip codes might decline in popularity.

Significant market conditions that could well have contributed to the models' skewed results include COVID-induced shortages of labour for property renovation and the reduced desirability of certain zip codes owing to population density and/or the shift to working from home without needing to commute to a city office. All of these factors would mean that property prices declined, or at least stabilised, while the models continued to assume that they were rising. It is worth noting that the global pandemic disrupted many different markets, of which housing was just one, yet for the most part AI models did not fail. Zillow's models, however, incorrectly assumed a continuing upward trend.

It is clear that the algorithms did not accurately account for the relationship between the target variable (the price of the house) and the input variables (number of bedrooms, number of bathrooms, square footage, property condition, etc.). House prices went down even for the same values of the input variables, and the models were not updated to reflect the new relationships.
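One simple way to keep those relationships current is to refit the valuation model on a rolling window of recent sales on a regular schedule. The sketch below assumes a hypothetical `transactions` table with the listed columns and uses a ridge regression purely for illustration; it is not a description of Zillow's actual pipeline.

```python
import pandas as pd
from sklearn.linear_model import Ridge

def retrain_on_recent_sales(transactions: pd.DataFrame, window_days: int = 90) -> Ridge:
    """Refit the valuation model only on sales closed within the last window_days,
    so the learned price/feature relationship tracks current market conditions."""
    recent = transactions[transactions["age_days"] <= window_days]
    X = recent[["bedrooms", "bathrooms", "sqft"]]
    y = recent["sale_price"]
    model = Ridge(alpha=1.0)
    model.fit(X, y)
    return model

# Hypothetical usage: the transactions log is refreshed daily and the model refit weekly.
transactions = pd.DataFrame({
    "bedrooms": [3, 4, 2, 3], "bathrooms": [2, 3, 1, 2],
    "sqft": [1500, 2200, 900, 1600],
    "sale_price": [310_000, 450_000, 190_000, 330_000],
    "age_days": [10, 45, 200, 80],
})
model = retrain_on_recent_sales(transactions)
```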

Machine learning (ML) models tend to assume that what was true in the past will be true in the future. But they fail to take into account real-world unknowns such as changing market conditions or the knock-on effect of a global pandemic. Any model may face a wide array of real-world risks that are not adequately captured in the development or early live testing phases. While this may sound overwhelming, it simply makes the case for ongoing model monitoring in production. The impact of all of the unforeseen complexities becomes evident as you measure the continuing accuracy of the model. In the case of Zillow, had model accuracy been measured and the downward trend in prices captured, it is likely that the algorithms could have been updated in a timely manner and valuations adjusted accordingly.
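As an illustration of such ongoing accuracy measurement, a production monitor might compare predicted valuations against realised resale prices over a rolling window and flag a sustained upward bias. The data below is synthetic and the 5% threshold is an assumed policy, not a documented Zillow practice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical log of closed deals: the valuation the model predicted at purchase
# time versus the price actually realised when the home was resold.
dates = pd.date_range("2021-04-01", periods=180, freq="D")
actual = rng.normal(350_000, 40_000, len(dates))
# Simulate a cooling market: predictions drift steadily above realised prices.
predicted = actual * (1 + np.linspace(0.0, 0.08, len(dates))) + rng.normal(0, 5_000, len(dates))

log = pd.DataFrame({"date": dates, "predicted": predicted, "actual": actual})
log["pct_error"] = (log["predicted"] - log["actual"]) / log["actual"]

# 30-day rolling mean error: a sustained positive bias means systematic overvaluation.
rolling_bias = log.set_index("date")["pct_error"].rolling("30D").mean()
if rolling_bias.iloc[-1] > 0.05:
    print(f"Valuations running {rolling_bias.iloc[-1]:.1%} above realised prices; trigger a model review")
```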

Governance is key

In this scenario of companies purchasing housing based on machine learning models, there is some evidence that better governance minimises risk. Zillow's main competitors, Opendoor and Offerpad, did not suffer the same consequences from the sudden changes in property prices and market conditions, which suggests that they had implemented adequate tools and processes to monitor and detect potential issues with their AI models.

The big takeaway from this for any organisation developing ML models is that model risk is best mitigated by robust AI governance supported by monitoring and debugging tools. The key to managing and accounting for market fluctuations is not simply to leave AI to its own devices, but to ensure that humans are also part of the supervision process. That way, algorithms can be adjusted to take into account any market shifts that may result from local or global events.

Author

  • Anupam Datta

    Anupam Datta is Co-Founder, President, and Chief Scientist of Truera. He is also Professor of Electrical and Computer Engineering and (by courtesy) Computer Science at Carnegie Mellon University. His research focuses on enabling real-world complex systems to be accountable for their behavior, especially as they pertain to privacy, fairness, and security. His work has helped create foundations and tools for accountable data-driven systems. Specific results include an accountability tool chain for privacy compliance deployed in industry, automated discovery of gender bias in the targeting of job-related online ads, principled tools for explaining decisions of artificial intelligence systems, and monitoring audit logs to ensure privacy compliance. Datta has also made significant contributions to rigorously accounting for security of cryptographic protocols. Specifically, his work led to new principles for securely composing cryptographic protocols and their application to several protocol standards, most notably to the IEEE 802.11i standard for wireless authentication and to attestation protocols for trusted computing. Datta serves as lead PI of a large NSF project on Accountable Decision Systems, on the Steering Committees of the Conference on Fairness, Accountability, and Transparency in socio-technical systems and the IEEE Computer Security Foundations Symposium, and as an Editor-in-Chief of Foundations and Trends in Privacy and Security. He obtained Ph.D. and M.S. degrees from Stanford University and a B.Tech. from IIT Kharagpur, all in Computer Science.

