LLM Token Burn Rates as a Performance Metric: What IBM's AI Dashboard Reveals About the Measurement Problem

Two Stories, One Structural Problem

Two news items crossed my feed this week that, taken separately, seem unrelated. First, reports emerged that some tech companies are evaluating employees by how quickly they burn through large language model tokens - treating raw consumption as a proxy for productive AI engagement. Second, IBM's head of consulting, Mohamad Ali, announced that the company has built an internal AI dashboard that monitors the work of AI agents in real time, and has now made that dashboard available to external clients. IBM is explicitly framing the future of consulting as humans supervising AI outputs rather than producing them directly. These two developments represent opposite ends of the same measurement failure, and understanding why requires taking organizational theory seriously.

The Token Burn Problem Is a Competence Inversion in Disguise

Measuring employee performance by LLM token consumption is a category error dressed up as management science. The implicit assumption is that heavier AI usage signals greater capability or greater contribution. But this confuses input volume with output quality. A worker who generates 10,000 tokens of mediocre, poorly prompted output is not more competent than a worker who generates 2,000 tokens of precisely structured, high-yield output. What this metric actually captures is activity, not expertise.
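
To see how the incentive inverts, consider a minimal sketch of the two metrics side by side. This is illustrative only: the log fields, the quality signals, and the 0.5 rework penalty are assumptions I am introducing for the example, not a description of any company's actual scoring.

    from dataclasses import dataclass

    @dataclass
    class InteractionLog:
        tokens_consumed: int     # raw tokens burned in the session
        task_completed: bool     # did the output survive downstream review?
        revisions_required: int  # rounds of rework the output triggered

    def token_burn_score(logs):
        # The metric as reported: pure volume, blind to outcomes.
        return sum(log.tokens_consumed for log in logs)

    def yield_per_token(logs):
        # A hypothetical alternative: completed work per token, with a
        # penalty for rework rather than a reward for the extra tokens
        # that rework consumes.
        tokens = sum(log.tokens_consumed for log in logs)
        if tokens == 0:
            return 0.0
        completed = sum(log.task_completed for log in logs)
        rework = sum(log.revisions_required for log in logs)
        return max(completed - 0.5 * rework, 0) / tokens

    # The 10,000-token worker outscores the 2,000-token worker on burn
    # rate; the ranking flips once output quality enters the metric.
    verbose = [InteractionLog(10_000, False, 3)]
    precise = [InteractionLog(2_000, True, 0)]
    assert token_burn_score(verbose) > token_burn_score(precise)
    assert yield_per_token(precise) > yield_per_token(verbose)

The perverse incentive is visible in the first function: every round of rework consumes more tokens, so the burn-rate metric rewards exactly the failure mode the second metric penalizes.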

This connects directly to what I have been developing in my dissertation research on algorithmic literacy coordination. Kellogg, Valentine, and Christin (2020) documented how algorithmic systems at work create new forms of visibility that are legible to managers without necessarily being meaningful. Token consumption metrics fall into exactly this category. They are legible, they are quantifiable, and they create the appearance of rigor. But they measure a surface behavior rather than the structural competence that actually drives performance variance. Hatano and Inagaki (1986) drew a foundational distinction between routine expertise, the efficient execution of familiar procedures, and adaptive expertise, the flexible application of underlying principles to novel problems. Token burn rate is a measure of routine execution. It says nothing about whether a worker understands the structural features of LLM interaction that would allow them to adapt effectively across novel tasks.

IBM's Dashboard and the Topology-Topography Confusion

IBM's consulting dashboard is a more sophisticated intervention, but it contains its own theoretical problem. The framing that humans will "monitor the work of AI agents" presupposes that meaningful oversight is achievable through observational access. This is the topology-topography confusion I have written about in other contexts. Knowing that an AI agent produced a particular output, and even seeing that output in real time, is not equivalent to understanding the structural constraints that shaped it. Supervisors using IBM's dashboard will develop rich topographic knowledge - they will know what outputs look like and when they seem wrong. But topological knowledge, an understanding of how the agent's decision processes are structured, requires a different kind of access than observation provides.
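
A sketch can make the two kinds of visibility concrete. The record types below are hypothetical; I am not describing IBM's actual dashboard schema, only the structural difference between logging what an agent produced and logging how it got there.

    from dataclasses import dataclass

    @dataclass
    class OutputEvent:
        # Topographic visibility: what a real-time dashboard surfaces.
        agent_id: str
        timestamp: float
        output_text: str  # the artifact a supervisor sees and judges

    @dataclass
    class DecisionTrace:
        # Topological visibility: the structure that produced the output.
        agent_id: str
        retrieved_context: list[str]    # what the agent conditioned on
        tool_calls: list[str]           # capabilities invoked, in order
        constraints_applied: list[str]  # guardrails that shaped the result
        alternatives_rejected: int      # branches pruned before the answer

A supervisor with only OutputEvent data can flag outputs that look wrong. Explaining why they are wrong, or predicting where the agent will fail next, requires DecisionTrace-level access that a surface dashboard does not provide.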

This is not a minor technical gap. Rahman (2021) demonstrated in his study of algorithmic control systems that workers who are positioned as monitors of algorithmic outputs frequently develop what he called "folk theories" of system behavior. These folk theories are individually plausible but structurally inaccurate. The result is that workers believe they are exercising meaningful oversight when they are actually responding to surface patterns. For IBM's clients, this is a governance risk. A consulting engagement where human experts are monitoring AI agent outputs but lack structural schemas for understanding agent behavior is not actually supervised AI deployment. It is supervised AI theater.

The Metric Design Problem Is an Organizational Design Problem

What connects the token burn story and the IBM dashboard story is a shared organizational design failure. In both cases, organizations are reaching for legible proxies because direct measurement of AI-mediated competence is genuinely difficult. This is understandable, but the choice of proxy matters enormously. Schor et al. (2020) argued that platform-mediated work creates new forms of dependence precisely because workers lack the structural understanding to evaluate their own position within the system. Managerial metrics that prioritize surface legibility over structural validity reproduce this dynamic at the organizational level. Managers become dependent on metrics they cannot critically evaluate.

The practical implication is not that token metrics or monitoring dashboards are useless. It is that they require organizational support structures that build structural schemas, not just procedural familiarity. Workers and managers who understand why LLMs produce particular outputs, not just what those outputs look like, are better positioned to use these tools critically. IBM is selling a dashboard. It should be selling the interpretive framework that makes the dashboard meaningful. Those are different products, and right now only one of them is on the market.

References

Hatano, G., and Inagaki, K. (1986). Two courses of expertise. In H. Stevenson, H. Azuma, and K. Hakuta (Eds.), Child development and education in Japan (pp. 262-272). W. H. Freeman.

Kellogg, K. C., Valentine, M. A., and Christin, A. (2020). Algorithms at work: The new contested terrain of control. Academy of Management Annals, 14(1), 366-410.

Rahman, H. A. (2021). The invisible cage: Workers' reactivity to opaque algorithmic evaluations. Administrative Science Quarterly, 66(4), 945-988.

Schor, J. B., Attwood-Charles, W., Cansoy, M., Ladegaard, I., and Wengronowitz, R. (2020). Dependence and precarity in the platform economy. Theory and Society, 49(5), 833-861.