
Large Language Models’ strong predictive ability brings both power and risk: AI’s “simulated reasoning” introduces trust and reliability concerns into high-stakes decisions and applications.
When people think of AI, they often picture “chat models,” that is, generative AI (Gen-AI). But AI is far broader: Gen-AI is a subset of deep learning, deep learning is a subset of machine learning, and machine learning is a subset of AI. Gen-AI models are trained on massive datasets drawn from diverse sources and typically cost tens of millions of dollars to develop. Their conversational and reasoning capabilities have captured the public imagination, but their design introduces challenges when these models are used in decision-making contexts.
An AI can change its top decision recommendation without new evidence — simply by introducing a less relevant option.
This counterintuitive instability occurs in both modern large language models (LLMs) and well-known decision science phenomena such as rank reversal. While the underlying mechanisms differ, the shared effect — shifting top-ranked outcomes unexpectedly — raises serious trust and reliability concerns for high-stakes applications. This work is the first to draw a structured analogy between probabilistic output variability in LLMs and rank reversal in multi-criteria decision-making (MCDM), and to outline a framework for mitigating both.
One such challenge is non-determinism. By default, most Gen-AI models are configured to generate varied responses to the same prompt. If you give the same input to the same model moments apart, you are unlikely to receive an identical answer — it will be close, but not the same. This behavior arises because LLMs generate outputs based on probabilities rather than fixed, algebraic rules. While parameters such as “temperature” can be tuned to approximate deterministic behavior, most end users (and many practitioners) lack the expertise to set these optimally. In production deployments, especially in consumer contexts, non-determinism is therefore the norm rather than the exception.
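The sketch below makes this concrete by sampling a single “next token” from an invented four-word vocabulary at two temperature settings; the words and scores are placeholders, not output from any real model.

```python
# Toy illustration of temperature-controlled sampling, the mechanism behind
# non-deterministic answers. The vocabulary and logits are invented for this
# sketch; a real LLM scores tens of thousands of candidate tokens at each step.
import math
import random

vocab = ["treatment", "therapy", "placebo", "surgery"]
logits = [2.1, 1.9, 0.3, -0.5]  # hypothetical model scores for the next token

def sample_next_token(logits, temperature):
    if temperature == 0:
        # Temperature 0: always take the highest-scoring token (greedy, deterministic).
        return vocab[max(range(len(logits)), key=lambda i: logits[i])]
    # Otherwise, convert scores to probabilities (softmax) and sample.
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return random.choices(vocab, weights=weights, k=1)[0]

print([sample_next_token(logits, 1.0) for _ in range(5)])  # varies from run to run
print([sample_next_token(logits, 0.0) for _ in range(5)])  # identical every time
```

At temperature 1.0 the sampled tokens typically disagree across runs; at temperature 0 the greedy choice repeats exactly, which is why lowering the temperature approximates deterministic behavior.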
A large language model functions like a supercharged autocomplete system. Having processed billions of words from books, articles, and websites, it learns patterns in how words, phrases, and ideas connect. When prompted, it predicts the most likely next word, then the next, forming coherent text. This predictive strength makes LLMs powerful — but also risky — because they simulate reasoning rather than executing exact logical or mathematical computation.
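To see the shape of that loop, here is a deliberately tiny sketch built on a hypothetical bigram table; a real LLM conditions on the entire preceding context and a vastly larger vocabulary, but the autoregressive structure (predict one word, append it, repeat) is the same.

```python
# A deliberately tiny "autocomplete": a hypothetical bigram table stands in for
# the patterns a real LLM learns from billions of words of text.
next_word = {
    "the": {"patient": 0.6, "treatment": 0.4},
    "patient": {"responded": 0.7, "declined": 0.3},
    "responded": {"well": 0.9, "poorly": 0.1},
}

def autocomplete(start, steps=3):
    words = [start]
    for _ in range(steps):
        options = next_word.get(words[-1])
        if not options:
            break
        # Predict the most likely next word, append it, and repeat.
        words.append(max(options, key=options.get))
    return " ".join(words)

print(autocomplete("the"))  # -> "the patient responded well"
```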
The consequences are clearest in domains requiring precision and strict logical structure. In exact computation, LLMs may make small but consequential errors because they are not actually calculating. In deep multi-step reasoning, their predictions can drift, breaking a chain of logic. And with large-scale numerical data, they are better at recognizing textual patterns than performing raw numerical analysis, where deterministic algorithms remain more reliable.
This instability parallels rank reversal in multi-criteria decision-making (MCDM). In rank reversal, adding a new, clearly inferior option can change the order of previously ranked alternatives. For example, imagine four medical treatments, A, B, C, and D, ranked A > B > C > D. Introducing a fifth treatment, E, worse than D, might unexpectedly produce B > A > C > D > E. Although no evidence about A or B has changed, their relative ranking flips. In a clinical decision-support scenario, this could mean an AI’s top treatment recommendation changes solely because a less relevant option was added, even in the absence of new medical data.
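The sketch below reproduces the effect with a pared-down version of the scenario: two treatments scored on two criteria, ranked by a simple weighted sum over min-max-normalized values. The treatments, scores, and weights are invented purely for illustration.

```python
# Rank reversal with a simple MCDM method: weighted sum over min-max-normalized
# criteria. The treatments, scores, and weights below are hypothetical and chosen
# only to reproduce the effect described above; they are not clinical data.

def rank(alternatives, weights):
    """Return alternative names sorted best-first by normalized weighted sum."""
    columns = list(zip(*alternatives.values()))         # values grouped per criterion
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    def score(values):
        return sum(
            w * ((v - lo) / (hi - lo) if hi != lo else 0.0)
            for v, w, lo, hi in zip(values, weights, lows, highs)
        )

    return sorted(alternatives, key=lambda name: score(alternatives[name]), reverse=True)

weights = [0.55, 0.45]                       # e.g., efficacy weighted slightly above tolerability
treatments = {"A": (10, 1), "B": (9, 10)}

print(rank(treatments, weights))             # ['A', 'B']  -> A is the top recommendation

treatments["E"] = (0, 0)                     # E is worse than A and B on every criterion
print(rank(treatments, weights))             # ['B', 'A', 'E'] -> A and B flip with no new evidence about either
```

The flip occurs because min-max normalization rescales each criterion against whatever alternatives are currently in the set, so the dominated option changes the denominators even though it carries no new information about A or B.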
Bring Trust to Artificial Intelligence with Decision Science
The novel insight of this work is recognizing that while LLM variability and rank reversal arise from different causes — stochastic text generation versus mathematical properties of MCDM algorithms — they have a common outcome: unstable recommendations that can erode decision-maker trust. Both phenomena highlight the need for AI systems in high-stakes contexts to be designed for stability, reproducibility, and transparency.
To address these risks, AI must be integrated with trusted decision-making frameworks. This requires combining the adaptive intelligence of LLMs with the structured reliability of established decision science methods. Such frameworks should ensure that recommendations remain consistent when non-essential inputs change, and should provide decision candidates that are stable, explainable, and reproducible. Without this integration, we risk deploying AI systems that are intelligent yet volatile — a dangerous combination in healthcare, finance, national security, and other critical domains.
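One concrete ingredient of such a framework is an automated stability check: re-rank the alternatives after injecting a clearly dominated decoy and flag any recommendation whose leader changes. The sketch below assumes a generic rank_fn supplied by the caller; the function name and decoy mechanism are illustrative, not an established standard.

```python
# Sketch of one stability check such a framework could run: re-rank after adding a
# clearly dominated "decoy" option and flag the recommendation if the leader changes.
# rank_fn is any ranking routine (an MCDM method or an LLM-backed ranker) that
# returns alternative names best-first; the names and values here are illustrative.

def is_stable_under_irrelevant_option(rank_fn, alternatives, decoy_name, decoy_values):
    baseline = rank_fn(alternatives)
    extended = dict(alternatives, **{decoy_name: decoy_values})
    perturbed = [name for name in rank_fn(extended) if name != decoy_name]
    return baseline[0] == perturbed[0], baseline, perturbed

# Example, reusing the rank() sketch from the previous section:
#   stable, before, after = is_stable_under_irrelevant_option(
#       lambda alts: rank(alts, weights), {"A": (10, 1), "B": (9, 10)}, "E", (0, 0))
#   stable -> False: the top recommendation changed, so it should be flagged for review.
```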
Large language models hold transformative potential, but their probabilistic nature demands careful design, configuration, and integration. By understanding and mitigating the parallels between LLM output variability and rank reversal, we can move toward AI decision-support systems that are not only powerful, but also trustworthy.



