How much faith should companies place in AI to make business decisions? And how often should organisations revisit this question for any particular model? The recent experience of Zillow Offers, a US online property marketplace, provides a cautionary tale in AI model risk management. At Zillow, a misfiring property valuation algorithm led the company to overestimate the value of the houses it purchased in Q3 and Q4 2021 by more than $500m. As a result, Zillow announced Q3 losses of $304m, expects additional losses of roughly $200m, and plans to reduce its workforce by 25%.
How can companies best manage model risk? Zillow appears to have tested the model rigorously in development and, after putting it into production, rolled it out slowly to see how it performed in the real world. After this initial live period, it aggressively expanded the purchasing programme, buying more homes in six months than it had in the previous two years. On one view, this seems prudent: develop, test, prove out in live conditions, and then expand.
However, this approach clearly wasn't enough to manage future risk. Market conditions in real estate can shift rapidly, deviating from the conditions in which a model was initially trained. There is a strong argument that the fallout at Zillow would not have occurred had the models been aggressively monitored, with adjustments made when market conditions changed and housing prices fell. Instead, models assuming a continuing buoyant market were apparently left alone and overestimated prices accordingly. The impact of this "concept drift" could have been contained by risk management processes informed by better tools for model monitoring, enabling the data science teams to manage and maintain the quality of the company's AI models.
Ongoing model risk management requires both technology and people
Had Zillow deployed tools to measure "drift" (changes in its model accuracy, model outputs, and model inputs) on an ongoing basis, potential issues would likely have been detected: such tools automatically alert data science teams to drift or performance degradation, support root cause analysis, and inform model updates. (In a separate article, "Drift in Machine Learning", my colleagues explore the various types of drift and what can be done to account for them.)
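To make this concrete, here is a minimal sketch of what automated input-drift detection can look like, using a two-sample Kolmogorov-Smirnov test to compare a feature's training-time distribution against a recent live window. The feature name, threshold, and data are illustrative assumptions, not Zillow's actual setup.

```python
# Minimal input-drift check: compare a training-time reference sample of
# each feature against a recent window of live data. All names and numbers
# here are hypothetical.
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # alert threshold; tune to your tolerance for false alarms

def check_feature_drift(reference: dict, live: dict) -> list:
    """Run a two-sample Kolmogorov-Smirnov test per feature and return
    (name, statistic) pairs for features whose live distribution has shifted."""
    drifted = []
    for name, ref_values in reference.items():
        stat, p_value = ks_2samp(ref_values, live[name])
        if p_value < DRIFT_P_VALUE:
            drifted.append((name, stat))
    return drifted

# Synthetic example: average local sale prices cool after training time.
rng = np.random.default_rng(0)
reference = {"avg_local_price": rng.normal(400_000, 50_000, 5_000)}
live      = {"avg_local_price": rng.normal(370_000, 55_000, 1_000)}

for name, stat in check_feature_drift(reference, live):
    print(f"ALERT: drift detected in '{name}' (KS statistic {stat:.3f})")
```

In practice a check like this would run on a schedule against each model input and route its alerts to the data science team, rather than printing to a console.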
In addition to monitoring tools, human intervention would also have been needed to understand why estimated house prices (the model outputs) were trending higher, particularly when viewed alongside declining model accuracy. Similarly, an examination of model input distributions could have raised an alert: an input tracking average property prices over time would have shown that the market was cooling. On the basis of this information, the model could then have been adjusted to reflect changing market conditions by placing greater weight on the most recent data.
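As one illustration of "placing greater weight on the most recent data", a model can be refit with exponentially decaying sample weights so that newer sales dominate the fit. The half-life, model choice, and synthetic data below are assumptions made for the sketch, not a description of Zillow's method.

```python
# Recency-weighted retraining sketch: observations are weighted by age so
# that recent sales count more. Features and half-life are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

def recency_weights(ages_in_days: np.ndarray, half_life_days: float = 90.0) -> np.ndarray:
    """Weight of 1.0 for today's observations, halving every half_life_days."""
    return 0.5 ** (ages_in_days / half_life_days)

# X: features (beds, baths, sqft, ...); y: sale prices; ages: days since sale.
rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 4))
y = X @ np.array([20_000.0, 15_000.0, 150.0, 5_000.0]) + rng.normal(0, 10_000, 1_000)
ages = rng.uniform(0, 365, 1_000)

model = Ridge()
model.fit(X, y, sample_weight=recency_weights(ages))
```

A 90-day half-life means a sale from three months ago counts half as much as one closed today; the right decay rate depends on how fast the local market actually moves.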
The complexity of real-world risks makes a strong case for model monitoring
The exact cause of the failure of Zillow's algorithms to estimate house prices accurately is unknown, but there are a number of reasons why property prices might fall and why certain zip codes might decline in popularity.
Market conditions that could well have contributed to the models' skewed results include COVID-induced shortages of labour for property renovation, and the reduced desirability of certain zip codes owing to population density or the shift to working from home without a commute to a city office. All of these factors would mean that property prices declined, or at least stabilised, while the models continued to assume they were rising. Notably, the global pandemic disrupted many markets, of which housing was just one, yet for the most part AI models did not fail. Zillow's models, however, incorrectly assumed a continuing upward trend.
It is clear that the algorithms did not accurately account for the relationship between the target variable (the price of the house) and the input variables (number of bedrooms, number of bathrooms, square footage, property condition, and so on). House prices fell even for the same values of the input variables, and the models were not updated to reflect the new relationships.
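A small synthetic example makes this failure mode visible: if the price per square foot falls after training, a model fitted on old sales systematically overestimates, even though the inputs themselves look unchanged. The numbers below are invented purely for illustration.

```python
# Illustrative concept drift: the feature-target relationship changes
# (price per sqft falls) while the feature distribution stays the same,
# so a stale model consistently overestimates.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
sqft = rng.uniform(800, 3_000, (2_000, 1))

# Train when homes sell at roughly $250/sqft...
old_prices = 250 * sqft.ravel() + rng.normal(0, 20_000, 2_000)
model = LinearRegression().fit(sqft, old_prices)

# ...then the market cools to roughly $220/sqft for the very same inputs.
new_prices = 220 * sqft.ravel() + rng.normal(0, 20_000, 2_000)
bias = np.mean(model.predict(sqft) - new_prices)
print(f"Average overestimate on post-shift sales: ${bias:,.0f}")
```

The drifted model's inputs pass any sanity check; only comparing predictions against realised prices reveals the systematic bias.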
Machine learning (ML) models tend to assume that what was true in the past will remain true in the future. They fail to account for real-world unknowns such as changing market conditions or the knock-on effects of a global pandemic. Any model may face a wide array of real-world risks that are not adequately captured in development or early live testing. While this may sound overwhelming, it simply strengthens the case for ongoing model monitoring in production: the impact of unforeseen complexities becomes evident as you measure the model's continuing accuracy. In Zillow's case, had model accuracy been measured and the downward trend in prices captured, the algorithms could likely have been updated in a timely manner and valuations adjusted accordingly.
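One sketch of that kind of ongoing accuracy measurement: as purchased homes resell or comparable sales close, feed realised prices back to a monitor that tracks a rolling error metric and alerts when it breaches a threshold. The window size and alert level below are illustrative assumptions.

```python
# Rolling-accuracy monitor sketch: record each (predicted, realised) pair
# and alert when the rolling mean absolute percentage error degrades.
from collections import deque

class AccuracyMonitor:
    def __init__(self, window: int = 200, mape_alert: float = 0.10):
        self.errors = deque(maxlen=window)  # most recent |error| / actual values
        self.mape_alert = mape_alert

    def rolling_mape(self) -> float:
        return sum(self.errors) / len(self.errors)

    def record(self, predicted: float, actual: float) -> None:
        self.errors.append(abs(predicted - actual) / actual)
        # Only alert once the window is full, to avoid noisy early readings.
        if len(self.errors) == self.errors.maxlen and self.rolling_mape() > self.mape_alert:
            print(f"ALERT: rolling MAPE {self.rolling_mape():.1%} exceeds "
                  f"{self.mape_alert:.1%}; investigate and consider retraining")

# Usage: feed each closed sale back to the monitor as it is realised.
monitor = AccuracyMonitor()
monitor.record(predicted=520_000, actual=480_000)
```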
Governance is key
In this scenario of companies purchasing housing based on machine learning models, there is some evidence that better governance minimises risk. Zillow's main competitors, Opendoor and Offerpad, did not suffer the same consequences from the sudden changes in property prices and market conditions, which suggests they had implemented adequate tools and processes to monitor and detect potential issues with their AI models.
The big takeaway for any organisation developing ML models is that model risk is best mitigated by robust AI governance supported by monitoring and debugging tools. The key to managing and accounting for market fluctuations is not to leave AI to its own devices, but to ensure that humans are part of the supervision process. That way, algorithms can be adjusted to take account of any market shifts resulting from local or global events.