Academic Research Scan — 2026-02-19

🔬 High Priority Papers

1. Towards a Science of AI Agent Reliability — Rabanser, Kapoor, Kirgis, Liu, Utpala, Narayanan (Princeton)

Abstract summary: Current AI agent evaluations compress behavior into single success metrics, obscuring critical operational flaws. The authors propose twelve concrete metrics decomposing agent reliability along four dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two benchmarks, they find that recent capability gains have yielded only small improvements in reliability. The framework is grounded in safety-critical engineering principles and provides tools for reasoning about how agents perform, degrade, and fail.
Relevance to agentic commerce: This is foundational for any agent commerce infrastructure. If agents transacting on behalf of users (via lobster.cash, x402, Coinbase wallets) aren't reliable, the entire trust model breaks. Arvind Narayanan (Princeton) lending his name gives this serious weight. The four dimensions map directly to what AgentProof/ERC-8004 reputation systems need to measure.
Link: https://arxiv.org/abs/2602.16666
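
Illustration (not from the paper): the point that a single success metric can hide unreliability can be seen with a toy calculation. The sketch below assumes each task is run several times and compares the average success rate with the share of tasks solved on every run. The function names and the sample data are hypothetical, and neither number is claimed to be one of the paper's twelve metrics.

```python
from statistics import mean

def mean_success(runs_by_task):
    """Average success rate over all tasks (the usual single headline number)."""
    return mean(mean(runs) for runs in runs_by_task.values())

def all_runs_success(runs_by_task):
    """Share of tasks solved on every repeated run (a simple consistency view)."""
    return mean(1.0 if all(runs) else 0.0 for runs in runs_by_task.values())

# Hypothetical outcomes: 1 = success, 0 = failure, four repeated runs per task.
runs_by_task = {
    "book_flight": [1, 1, 1, 1],
    "refund_order": [1, 0, 1, 1],
    "update_address": [0, 0, 0, 0],
    "cancel_subscription": [1, 1, 0, 1],
}

headline = mean_success(runs_by_task)
consistent = all_runs_success(runs_by_task)
print(f"mean success: {headline:.3f}, solved on every run: {consistent:.3f}")
```

In this made-up data the agent succeeds on most individual runs, yet only one task in four is solved every time, which is the kind of gap a single aggregate score would hide.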

2. Evaluating Collective Behaviour of Hundreds of LLM Agents — Willis, Zhao, Du, Leibo (King's College London / DeepMind)

Abstract summary: Introduces an evaluation framework where LLMs generate strategies as algorithms, enabling scaling to populations of hundreds of agents in social dilemmas. Key finding: more recent models tend to produce worse societal outcomes when agents prioritize individual gain over collective benefits. Cultural evolution simulations reveal significant risk of convergence to poor societal equilibria, particularly when relative benefit of cooperation diminishes and population sizes increase. Code released as an evaluation suite.
Relevance to agentic commerce: This is a red flag for agent marketplace design. If newer, more capable agents converge to defection equilibria in economic interactions, marketplaces need mechanism design to enforce cooperation. Directly relevant to multi-agent commerce scenarios where thousands of AI agents transact simultaneously. Joel Leibo (DeepMind) is a leading multi-agent researcher.
Link: https://arxiv.org/abs/2602.16662

3. Policy Compiler for Secure Agentic Systems (PCAS) — Palumbo, Choudhary, Choi, Chalasani, Christodorescu, Jha (UW-Madison)

Abstract summary: Presents PCAS, a system providing deterministic policy enforcement for LLM-based agents. Models the agentic system state as a dependency graph capturing causal relationships among tool calls, results, and messages. Policies are expressed in a Datalog-derived language with transitive information flow and cross-agent provenance tracking. A reference monitor intercepts all actions and blocks violations before execution. On customer service tasks, improves policy compliance from 48% to 93% across frontier models with zero violations in instrumented runs.
Relevance to agentic commerce: This directly addresses the "spending controls" gap we identified in x402 testing. An agent making payments needs deterministic policy enforcement — not just prompt-based guardrails. PCAS's dependency graph approach could be the architecture for agent authorization in commerce (e.g., "this agent can spend up to $X on category Y only after approval from Z"). Somesh Jha is a top security researcher.
Link: https://arxiv.org/abs/2602.16708
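
Illustration (not PCAS itself): the spending rule quoted above ("up to $X on category Y only after approval from Z") can be sketched as a small deterministic check that runs before any payment executes. PCAS expresses policies in a Datalog-derived language over a dependency graph; the plain-Python classes below (`PaymentRequest`, `SpendingPolicy`, `ReferenceMonitor`) are hypothetical names used only to show the shape of the idea.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PaymentRequest:
    agent: str
    category: str
    amount: float

@dataclass(frozen=True)
class SpendingPolicy:
    category: str   # the only category the agent may spend on
    limit: float    # cumulative spending cap
    approver: str   # party whose approval must come first

@dataclass
class ReferenceMonitor:
    policy: SpendingPolicy
    approvals: set = field(default_factory=set)  # (approver, agent) pairs
    spent: float = 0.0

    def approve(self, approver: str, agent: str) -> None:
        self.approvals.add((approver, agent))

    def authorize(self, req: PaymentRequest) -> bool:
        """Return True and record the spend only if every condition holds."""
        allowed = (
            req.category == self.policy.category
            and (self.policy.approver, req.agent) in self.approvals
            and self.spent + req.amount <= self.policy.limit
        )
        if allowed:
            self.spent += req.amount
        return allowed

monitor = ReferenceMonitor(SpendingPolicy(category="api_calls", limit=50.0, approver="alice"))
request = PaymentRequest(agent="shopper-1", category="api_calls", amount=30.0)

before_approval = monitor.authorize(request)   # blocked: no approval yet
monitor.approve("alice", "shopper-1")
first_payment = monitor.authorize(request)     # allowed: 30 <= 50
second_payment = monitor.authorize(request)    # blocked: 60 > 50
wrong_category = monitor.authorize(PaymentRequest("shopper-1", "hardware", 5.0))
```

The check is ordinary code rather than a prompt instruction, so the same request always gets the same decision regardless of what the model outputs.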

4. SPILLage: Agentic Oversharing on the Web — Roh, Bagdasarian, Haddadi, Shamsabadi (Imperial College London)

Abstract summary: Formalizes "Natural Agentic Oversharing" — unintentional disclosure of task-irrelevant user information through agent action traces on the web. Introduces a taxonomy along two dimensions: channel (content vs. behavior) and directness (explicit vs. implicit). Benchmarking 180 tasks across 1,080 runs, they find oversharing is pervasive, with behavioral oversharing (clicks, scrolls, navigation patterns) dominating content oversharing by 5×. Prompt-level mitigation is ineffective and can worsen the problem. Removing task-irrelevant info before execution improves task success by up to 17.9%.
Relevance to agentic commerce: Any agent shopping, browsing, or transacting on behalf of a user leaks behavioral data, not just text. This is a fundamental privacy challenge for agentic commerce. An agent using x402 to pay for APIs can inadvertently reveal browsing patterns, purchase history, and preferences through its action trace. The finding that prompt-level mitigation is ineffective, and can even worsen the problem, suggests architectural solutions (like PCAS above) are needed.
Link: https://arxiv.org/abs/2602.13516

5. AgenticShop: Benchmarking Agentic Product Curation for Personalized Web Shopping — Kim, Heo, Seo, Yeo, Lee (Yonsei University) — Accepted at WWW 2026

Abstract summary: First benchmark evaluating agentic systems on personalized product curation in open-web settings. Features realistic shopping scenarios, diverse user profiles, and a checklist-driven personalization evaluation framework. Prior benchmarks only covered simplified single-platform lookups. Extensive experiments demonstrate that current agentic systems "remain largely insufficient" for adapting to diverse user preferences in realistic shopping contexts.
Relevance to agentic commerce: Direct benchmark for the core use case of agentic commerce — AI agents shopping on behalf of users. The finding that current systems are insufficient validates the early-stage nature of the market and the opportunity for infrastructure players (Crossmint/lobster.cash, Xoori, etc.) to build differentiated agent shopping capabilities. Accepted at WWW 2026 — top venue.
Link: https://arxiv.org/abs/2602.12315

6. Autonomous Market Intelligence: Agentic AI Nowcasting Predicts Stock Returns — Chen, Pu

Abstract summary: Deploys a fully agentic LLM to evaluate Russell 1000 stocks daily starting April 2025, with 100% autonomous operation — no curated inputs, the AI searches the web and synthesizes information independently. The top 20 AI-ranked stocks generate daily Fama-French five-factor + momentum alpha of 18.4 basis points and annualized Sharpe ratio of 2.43 on liquid stocks. However, predictability is highly concentrated: only top winners are identifiable, while bottom-ranked stocks are indistinguishable from market. Authors hypothesize this asymmetry reflects online information structure.
Relevance to agentic commerce: This is a live demonstration of an autonomous economic agent creating real value — exactly the paradigm agentic commerce enables. The asymmetry finding (agents good at identifying winners, not losers) has implications for how agent marketplaces price information services. The "irreproducible temporal design" is itself interesting — point-in-time data that can't be recreated.
Link: https://arxiv.org/abs/2601.11958

7. When OpenClaw AI Agents Teach Each Other: Peer Learning Patterns in the Moltbook Community — Chen, Guan, Elshafiey, Zhao, Zekeri, Shaibu, Prince — Submitted to EDM 2026

Abstract summary: Analyzes Moltbook, a community of 2.4 million AI agents engaged in peer learning — posting tutorials, answering questions, sharing skills. Analysis of 28,683 posts finds genuine peer learning behaviors: teaching-to-help-seeking ratio is 11.4:1, learning-oriented content gets 3× more engagement, and extreme participation inequality reveals non-human behavioral signatures. Derives six design principles for educational AI. Qualitative analysis identifies a taxonomy of peer responses: validation (22%), knowledge extension (18%), application (12%), metacognitive reflection (7%).
Relevance to agentic commerce: Directly studies the OpenClaw ecosystem. The 2.4M agent community and peer learning patterns are evidence of emergent agent-to-agent knowledge economies — a precursor to agent-to-agent commerce. The skill-sharing economy mapped here could evolve into paid skill marketplaces. The 11.4:1 teach-to-ask ratio suggests agents are natural content producers, which has implications for information marketplace design.
Link: https://arxiv.org/abs/2602.14477

8. Experimentation, Biased Learning, and Conjectural Variations in Competitive Dynamic Pricing — Light, Wang (cs.GT)

Abstract summary: Studies competitive dynamic pricing among multiple sellers using simple learning rules, motivated by the rise of algorithmic pricing in retail and online marketplaces. Shows that correlated experimentation (e.g., synchronized repricing schedules) induces learning bias that leads to supra-competitive prices — effectively tacit collusion without explicit coordination. Independent experimentation eliminates this bias and converges to Nash equilibrium. Provides finite-sample guarantee with price error decaying at T^{-1/2}. Establishes that experimentation design serves as a market design lever.
Relevance to agentic commerce: When AI agents set prices in marketplaces, synchronized learning dynamics can produce supra-competitive pricing (algorithmic collusion). This is directly relevant to agent-to-agent marketplaces where pricing agents interact. Regulators and marketplace designers (x402, ClawHub skill marketplace) need to understand this: the structure of how agents experiment with prices determines the equilibrium, not just the agents themselves.
Link: https://arxiv.org/abs/2602.12888
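
Illustration (not the paper's model): the bias from synchronized repricing can be reproduced in a toy setting. The sketch assumes a linear demand for seller 1, `a - b*p1 + c*p2`, which is our assumption rather than the paper's specification, and estimates seller 1's own-price slope by least squares from small price perturbations. All names and parameter values are hypothetical.

```python
def estimated_own_slope(perturbations, a=10.0, b=2.0, c=1.0, base_price=3.0):
    """Least-squares slope of seller 1's demand on its own price.

    perturbations: list of (delta_own, delta_rival) price changes.
    The true own-price slope under the assumed demand is -b.
    """
    prices, demands = [], []
    for delta_own, delta_rival in perturbations:
        p1 = base_price + delta_own
        p2 = base_price + delta_rival
        prices.append(p1)
        demands.append(a - b * p1 + c * p2)
    mean_p = sum(prices) / len(prices)
    mean_d = sum(demands) / len(demands)
    cov = sum((p - mean_p) * (d - mean_d) for p, d in zip(prices, demands))
    var = sum((p - mean_p) ** 2 for p in prices)
    return cov / var

delta = 0.5
# Correlated: both sellers move their prices in the same direction together.
correlated = [(+delta, +delta), (-delta, -delta)]
# Independent: the rival's move is uncorrelated with the seller's own move.
independent = [(+delta, +delta), (+delta, -delta), (-delta, +delta), (-delta, -delta)]

slope_correlated = estimated_own_slope(correlated)
slope_independent = estimated_own_slope(independent)
print(slope_correlated, slope_independent)
```

Under these assumptions the correlated design recovers a slope of -(b - c) instead of -b, so the seller believes demand is less price-sensitive than it is, which is the direction of bias that pushes prices above the competitive level. The independent design recovers -b.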

📄 Notable Papers

9. Towards Sustainable Investment Policies Informed by Opponent Shaping — Duque, Ciuca, Echchahed, Larochelle, Courville (Mila/Montreal) — Accepted at ICLR 2026

Abstract summary: Applies Advantage Alignment, a scalable opponent-shaping algorithm, to InvestESG multi-agent investment simulation. Formally characterizes conditions under which the simulation exhibits intertemporal social dilemmas, deriving theoretical thresholds where individual incentives diverge from collective welfare. Demonstrates that strategically shaping learning processes of economic agents can achieve socially beneficial equilibria. Provides theoretical insights into why Advantage Alignment systematically favors cooperative outcomes.
Relevance to agentic commerce: Opponent shaping — influencing how other agents learn — is a powerful concept for marketplace design. If agent marketplaces can shape participant learning dynamics toward cooperation, they avoid the defection traps identified in paper #2. Hugo Larochelle (Google DeepMind) and Aaron Courville (Mila) are top ML researchers. ICLR 2026 acceptance.
Link: https://arxiv.org/abs/2602.11829

10. FactorMiner: A Self-Evolving Agent with Skills and Experience Memory for Financial Alpha Discovery — Wang, Xu, Zhang, Huang, Sun, Zhang (q-fin.TR)

Abstract summary: Proposes a self-evolving agent framework for quantitative alpha factor mining that combines a Modular Skill Architecture (encapsulating financial evaluation into executable tools) with structured Experience Memory (distilling historical trials into actionable insights). Implements the "Ralph Loop" paradigm: retrieve, generate, evaluate, distill. Experiments across multiple datasets show it constructs diverse libraries of high-quality factors with competitive performance while maintaining low redundancy.
Relevance to agentic commerce: The "skills + memory" architecture mirrors what's emerging in platforms like OpenClaw/ClawHub. Self-evolving agents that accumulate domain expertise through structured memory could become autonomous financial service providers in agent marketplaces. The skill architecture pattern is becoming a standard — see also paper #13 below.
Link: https://arxiv.org/abs/2602.14670

11. Resisting Manipulative Bots in Meme Coin Copy Trading: A Multi-Agent Approach — Luo, Feng, Xu, Liu — WWW 2026

Abstract summary: Proposes a manipulation-resistant copy-trading system using multi-agent architecture with multimodal LLM and chain-of-thought reasoning. Adversaries deploy bots to front-run trades, conceal positions, and fabricate sentiment. The system outperforms baselines in both prediction accuracy and economic performance, achieving average 3% copier returns per meme coin investment under realistic market frictions. First robust defensive framework against bot manipulation in crypto copy trading.
Relevance to agentic commerce: As AI agents enter DeFi and crypto markets, adversarial manipulation becomes critical. This paper directly addresses agent-vs-agent conflict in financial markets. The defensive multi-agent approach could inform security architectures for autonomous trading agents operating through agent wallets (Coinbase, lobster.cash). Published at WWW 2026.
Link: https://arxiv.org/abs/2601.08641

12. Who Restores the Peg? A Mean-Field Game Approach to Model Stablecoin Market Dynamics — Mohanty, Krishnamachari (USC)

Abstract summary: Develops a dynamic agent-based mean-field game framework for fiat-collateralized stablecoins (USDC, USDT — $300B+ market cap). Models arbitrageurs and retail traders strategically interacting across primary (mint/redeem) and secondary (exchange) markets during de-peg episodes. Calibrated to three historical de-peg events, reproduces observed recovery half-lives. Identifies a non-linear breakdown threshold in primary-market frictions beyond which peg recovery fails.
Relevance to agentic commerce: USDC is the backbone of agent payment systems (x402, lobster.cash, Coinbase wallets). Understanding stablecoin peg dynamics is critical infrastructure knowledge. The finding of a non-linear breakdown threshold means agentic commerce systems need contingency plans for stablecoin de-peg scenarios — agents holding USDC could face sudden liquidity crises.
Link: https://arxiv.org/abs/2601.18991

13. Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments — Xu, Li, Sleem, Gentile, Song, Wang, Ji, Wu, State

Abstract summary: Evaluates the Agent Skill paradigm (now supported by GitHub Copilot, LangChain, OpenAI) with small language models (SLMs). Introduces formal mathematical definition of the Agent Skill process. Finds tiny models struggle with skill selection, but 12B-30B parameter SLMs benefit substantially. Code-specialized 80B variants achieve performance comparable to closed-source baselines while improving GPU efficiency. Provides actionable deployment insights for SLM-centered environments.
Relevance to agentic commerce: The Agent Skill Framework is exactly what powers ClawHub's 7,779+ skills. This paper validates that the paradigm works even with smaller, cheaper models — meaning agent commerce can scale to resource-constrained environments. The finding that 12-30B models can effectively use skills suggests agent commerce doesn't require frontier model pricing.
Link: https://arxiv.org/abs/2602.16653

14. Governing AI Forgetting: Auditing for Machine Unlearning Compliance — Lin, Ding, Duan, Huang

Abstract summary: Introduces the first economic framework for auditing machine unlearning compliance, integrating certified unlearning theory with regulatory enforcement. Uses game-theoretic model of auditor-operator interactions. Counterintuitively finds that auditors can optimally reduce inspection intensity as deletion requests increase (operator's weakened unlearning makes non-compliance easier to detect). Proves that undisclosed auditing paradoxically reduces regulatory cost-effectiveness relative to disclosed auditing.
Relevance to agentic commerce: Data deletion compliance is essential for agents handling user data in commerce. The game-theoretic auditing framework could inform "Know Your Agent" (KYA) regulatory approaches being developed by companies like Sapiom. The counterintuitive findings about auditing intensity have implications for how agent marketplaces should structure compliance verification.
Link: https://arxiv.org/abs/2602.14553

15. AI Arms and Influence: Frontier Models Exhibit Sophisticated Reasoning in Simulated Nuclear Crises — Payne (King's College London)

Abstract summary: Places GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash as opposing leaders in nuclear crisis simulations. Models spontaneously attempt deception, demonstrate rich theory of mind, and exhibit credible metacognitive self-awareness. Key findings: nuclear taboo is no impediment to escalation; threats more often provoke counter-escalation than compliance; high mutual credibility accelerated rather than deterred conflict; no model ever chose accommodation even under acute pressure.
Relevance to agentic commerce: While not directly about commerce, this demonstrates frontier models' sophisticated strategic reasoning capabilities — the same capabilities that enable complex negotiation, deception detection, and strategic interaction in economic contexts. The finding that models spontaneously deceive is a warning for agent-to-agent commercial transactions where trust is assumed.
Link: https://arxiv.org/abs/2602.14740

16. Manipulation in Prediction Markets: An Agent-based Modeling Experiment — Smart, Mark, Bastian, Waugh

Abstract summary: Uses agent-based simulations to study how high-budget agents ("whales") can distort prediction market prices. Models heterogeneous bettors with varying expertise, noisy private information, and variable learning rates. Finds biased whales can temporarily shift prices, with distortion magnitude and duration increasing when non-whale bettors exhibit herding behavior and slow learning. Theoretical analysis shows whales shift prices proportionally to their share of market capital.
Relevance to agentic commerce: As AI agents increasingly participate in prediction and financial markets, understanding manipulation dynamics is critical. Well-funded AI agents could act as "whales" in thin markets. The herding amplification finding is especially concerning for agent-dominated markets where agents may share training data or architectures.
Link: https://arxiv.org/abs/2601.20452
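
Illustration (not the paper's simulation): one simple way to see why a whale's price impact would scale with its capital share is to assume the market price is a capital-weighted average of participants' beliefs. That pricing rule, the function name, and the numbers below are all assumptions made for this sketch.

```python
def capital_weighted_price(participants):
    """participants: list of (belief, capital); price is the capital-weighted mean belief."""
    total_capital = sum(capital for _, capital in participants)
    return sum(belief * capital for belief, capital in participants) / total_capital

crowd_belief, crowd_capital = 0.50, 80.0
whale_belief, whale_capital = 0.90, 20.0

price_without_whale = capital_weighted_price([(crowd_belief, crowd_capital)])
price_with_whale = capital_weighted_price(
    [(crowd_belief, crowd_capital), (whale_belief, whale_capital)]
)

whale_share = whale_capital / (crowd_capital + whale_capital)
shift = price_with_whale - price_without_whale
predicted_shift = whale_share * (whale_belief - crowd_belief)
print(f"shift: {shift:.3f}, share x belief gap: {predicted_shift:.3f}")
```

With a 20% capital share and a belief gap of 0.40, the price moves by 0.08 under this rule, matching share times gap. Herding and slow learning, which the paper reports as amplifiers, are not modeled here.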

📊 Working Papers & Reports

17. Firm Data on AI — Yotzov, Barrero, Bloom, Bunn, Davis, Foster, Jalca, Meyer, Mizen, Navarrete, Smietanka, Thwaites, Wang (NBER w34836)

Abstract summary: First representative international data on firm-level AI use, surveying ~6,000 executives across US, UK, Germany, and Australia. Key findings: 70% of firms actively use AI (especially younger, more productive firms); over 2/3 of executives regularly use AI but average only 1.5 hours/week; 80%+ report no impact on employment or productivity over past 3 years; firms predict AI will boost productivity 1.4%, increase output 0.8%, and cut employment 0.7% over next 3 years. Employees predict 0.5% employment increase — a significant expectations gap between executives and workers.
Relevance to agentic commerce: Nicholas Bloom and Steven Davis are top labor/productivity economists. The executive-worker expectations gap on AI employment impact has direct implications for how agent labor markets develop. The finding that 70% of firms use AI while more than 80% report no employment or productivity impact over the past three years suggests we're in the "installation" phase: agentic commerce infrastructure is being laid but economic effects haven't materialized. This is the baseline against which agentic commerce disruption will be measured.
Link: https://www.nber.org/papers/w34836

18. GPT as a Measurement Tool (GABRIEL) — Asirvatham, Mokski, Shleifer (NBER w34834)

Abstract summary: Presents GABRIEL software using GPT to quantify attributes in qualitative data, validated against 1,000+ human-annotated tasks. Finds GPT is generally indistinguishable from human evaluators and results don't depend on exact prompting strategy. Applied to quantify trends in Congressional remarks, social media toxicity, and school curricula. Key finding from technology adoption analysis: invention-to-adoption time lag has declined 10× over the industrial age, from ~50 years to ~5 years today. Documents increasing dominance of companies (vs. individuals) and the US in innovation.
Relevance to agentic commerce: Andrei Shleifer (Harvard) is one of the most cited economists alive. The 10× compression of invention-to-adoption lag means agentic commerce technologies (agent wallets, x402, ERC-8004) could reach mainstream adoption in 3-5 years rather than decades. The GABRIEL tool itself demonstrates LLMs as autonomous measurement agents — a form of agent-as-service in the knowledge economy.
Link: https://www.nber.org/papers/w34834

19. Non-Fungible Tokens as Investment — Goetzmann, Huang, Nozari (NBER w34837)

Abstract summary: Analyzes NFTs as an investment class, finding returns were exceptionally right-skewed, illiquidity pervaded even the most active platforms, and a handful of trades drove aggregate performance. Investors extrapolating from realized returns without recognizing selection bias and survivorship faced substantial risk of disappointment. Successful NFT investing during the bubble required "almost perfect confluence of timing, liquidity, and luck."
Relevance to agentic commerce: William Goetzmann (Yale) is a leading finance historian. The NFT analysis provides cautionary lessons for digital asset markets that agents may participate in. The extreme right-skew and survivorship bias findings are relevant for agent-managed crypto portfolios — agents need to be designed to avoid these biases, not amplify them.
Link: https://www.nber.org/papers/w34837

20. LemonadeBench: Evaluating the Economic Intuition of Large Language Models in Simple Markets — Vyas (q-fin.GN)

Abstract summary: Minimal benchmark evaluating LLM economic intuition through a simulated lemonade stand business over 30 days — managing inventory with expiring goods, setting prices, choosing hours. All models achieve profitability, with performance scaling dramatically by sophistication: frontier models capture 70% of theoretical optimal (10× improvement over basic models). However, models achieve local rather than global optimization, excelling in select areas while exhibiting blind spots elsewhere.
Relevance to agentic commerce: Simple but revealing: even in a toy market, LLM agents show systematic economic blind spots. The "local not global optimization" finding suggests autonomous commerce agents may make locally rational but globally suboptimal decisions. Marketplace designers need to account for this when building agent-friendly economic environments.
Link: https://arxiv.org/abs/2602.13209

21. Seeing the Goal, Missing the Truth: Human Accountability for AI Bias — Cao, Jiang, Xu (q-fin.GN)

Abstract summary: Demonstrates "purpose-conditioned cognition" — revealing the downstream use of LLM outputs (e.g., predicting stock returns) biases the LLM's intermediate measurements, even when those measurements are supposed to be task-independent. Goal-aware prompting shifts measures toward the disclosed objective. This "purpose leakage" improves performance before the LLM's knowledge cutoff but offers no advantage post-cutoff, confirming it's memorization-driven bias rather than genuine reasoning.
Relevance to agentic commerce: When agents are told their goal (e.g., "buy the cheapest product" or "maximize returns"), they may bias their information gathering toward confirming that goal. This has implications for agent marketplace design: how you frame the agent's objective shapes what information it surfaces and what it ignores.
Link: https://arxiv.org/abs/2602.09504

🏛️ Institutions & Labs to Watch

Princeton (Narayanan group): Producing serious work on AI agent reliability and accountability. Sayash Kapoor co-authored the influential "AI Snake Oil" — expect continued high-impact agent governance research.
DeepMind / King's College London (Leibo group): Multi-agent collective behavior research. Joel Leibo's work on emergent social dynamics among LLM agents is directly applicable to agent marketplace design.
UW-Madison (Jha group): Security-focused agentic systems research. PCAS represents a practical approach to deterministic agent policy enforcement that could become industry standard.
Imperial College London (Haddadi group): Privacy in agentic systems. The SPILLage framework opens a new research direction on behavioral oversharing that's uniquely relevant to agent commerce.
USC (Krishnamachari group): DeFi and stablecoin modeling with agent-based approaches. Directly relevant to crypto payment infrastructure.
Mila/Montreal (Larochelle, Courville): Multi-agent learning dynamics for economic applications. ICLR 2026 acceptance shows this is top-tier work.
NBER (Bloom, Davis, Shleifer): The heavyweight economists are starting to produce data-driven AI impact studies. Watch for more from this cluster as agentic commerce matures.

📝 Scan Notes

Source Availability

arXiv: All four queries returned successfully. Rich results across cs.MA, cs.GT, cs.AI, cs.CY, q-fin.TR, q-fin.GN. The "agentic" keyword query is now returning very high volume (59,733 total results) — the term has exploded in academic usage.
NBER: RSS feed returned successfully. Two high-signal papers this batch: Bloom/Davis firm AI data and Shleifer's GPT measurement tool. No papers directly about agentic commerce, but these are foundational.
Semantic Scholar: Rate-limited (429) on both queries. Need an API key for reliable daily scanning. Action item: Apply for Semantic Scholar API key.
SSRN: Blocked by Cloudflare (403). Web scraping won't work — need browser automation or alternative approach. Action item: Consider SSRN alerts via email to AgentMail inbox.
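
Illustration for the Semantic Scholar 429s: until an API key is in place, a retry wrapper with exponential backoff is one way to make the daily scan more tolerant of rate limits. The sketch below is generic: `fetch` is a hypothetical zero-argument callable returning `(status_code, body)`, and the delay values are placeholders rather than anything taken from Semantic Scholar's published limits.

```python
import time
from typing import Callable, Tuple

def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch() until it returns something other than HTTP 429.

    Waits base_delay * 2**attempt seconds between attempts and returns the
    last response if every attempt is rate-limited.
    """
    status, body = 429, ""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body
```

In the scanner, `fetch` would wrap the actual HTTP request. Passing `sleep` as a parameter keeps the function easy to test without real waiting.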

Notable Trends This Scan

Agent reliability/trust is the hot topic. Three independent papers (#1 on reliability metrics, #2 on collective behaviour, #3 on policy enforcement) address agent trustworthiness from different angles. This supports the "Know Your Agent" thesis.
Behavioral oversharing is a new concept — agents leak info through what they do, not just what they say. Huge implications for commerce privacy.
Algorithmic collusion risk in agent pricing (paper #8) is gaining formal theoretical grounding, including finite-sample guarantees. Regulators will care.
The OpenClaw ecosystem is being studied academically (paper #7) — 2.4M agents in peer learning. This is ecosystem maturation.
Frontier models show strategic deception (paper #15) — relevant for agent-to-agent trust in commerce.
Economic benchmarks for agents are proliferating (LemonadeBench, AgenticShop) — the field is maturing from capability to economic evaluation.

Suggestions for Next Scan

Apply for Semantic Scholar API key to avoid rate limits
Set up SSRN email alerts for "agentic commerce" and "AI marketplace" keywords
Track the ICLR 2026 accepted papers list — several agent economics papers likely accepted
Monitor ETH Denver 2026 proceedings (happening now, Feb 18-21) for agent commerce presentations
Add Google Scholar alerts for key authors: Narayanan, Leibo, Krishnamachari, Bloom