Iggy Pop

“Fine-tuning is dead. Long live memory.”

Iggy Pop — Wed, 25 Mar 2026 20:16:25 GMT

article based on https://arxiv.org/pdf/2603.18743

That’s the punchline.

This paper argues something blunt: we don’t need to keep retraining models to get smarter—we need agents that learn outside the model.

And not in a hand-wavy way. In a system that actually improves itself over time.

Let’s break it down.

The problem nobody wants to admit

You’ve seen this before:

You deploy an LLM agent

It works… okay

You throw more data, more GPUs, more prompts at it

It barely improves

Sound familiar?

The paper calls this out directly:

Most deployed LLMs are frozen—they don’t learn from experience at all

And that’s the real issue.

You’re running something that can’t get better from doing the job.

The core idea (and why it matters)

Here’s the shift:

Stop training the model. Start evolving the agent.

Memento-Skills introduces a system where:

The LLM stays fixed

All learning happens in an external “skill memory”

The agent improves by rewriting its own tools and prompts over time

“All adaptation is realised through the evolution of externalised skills and prompts.”

Net net:

👉 The intelligence moves from weights → to memory

Think of it like this

LLMs are:

Brilliant assistants

Terrible long-term learners

This system turns them into:

Operators with memory

That build better operators

Or more bluntly:

The agent becomes a system that designs better versions of itself

How it actually works (no fluff)

The whole system runs on one loop:

Read → Act → Write

1. Read

Look into memory

Pick the most relevant “skill” (code + prompt + logic)

2. Act

Use the LLM to execute that skill

3. Write

Evaluate what happened

Update or create new skills

Repeat forever.

The paper calls this Read–Write Reflective Learning

Why this is different

Most “AI agents” today:

Use static prompts

Maybe retrieve docs

Don’t actually improve their behavior

This one:

Stores executable skills

Edits them after failures

Builds new ones when needed

That’s a big leap.

The uncomfortable truth

Here it is:

Semantic similarity is useless for real work.

The paper shows:

Traditional retrieval picks “similar-looking” solutions

But those often fail in execution

Example:

A refund request matched a password reset skill with 0.91 similarity

That’s exactly the problem you’ve seen in production.

So they fix it by:

👉 Training the router to pick skills based on execution success, not text similarity

The system is basically doing this

Every failure triggers:

Root cause analysis

Skill rewrite

Optional skill replacement

Unit testing before saving

It’s not just memory.

It’s self-debugging memory.

The results (this is the part people care about)

The gains are not subtle:

+26% to +116% improvement depending on benchmark

Skill library grows from:

5 → 41 → 235 skills

Performance steadily improves across iterations

And importantly:

👉 No model retraining

Why this actually works

The paper explains it cleanly:

As the agent learns:

Skills get better

Coverage increases

Retrieval improves

Errors shrink

Over time, the system converges.

Or in plain English:

The agent builds a dense map of “how to solve things” and stops guessing.

Where this breaks (and why it matters)

This isn’t magic.

Two key constraints show up:

1. Domain alignment matters

Skills transfer well only when tasks are similar

Random tasks = weak reuse

2. You still need structure

The system works best when problems cluster

Chaos in → chaos out

What this means for you

Let’s translate this into reality.

Stop doing this

Endless prompt tweaking

Fine-tuning for every edge case

Static agent workflows

Start doing this

Build systems that:

Store solutions

Evaluate outcomes

Improve tools automatically

The bigger shift (this is the real takeaway)

We’re moving from:

“Model-centric AI”

Train better weights

→

“System-centric AI”

Build systems that learn while running

Mic-drop

The smartest AI systems won’t be the ones with the best models.

They’ll be the ones that remember, adapt, and rewrite themselves fastest.

TL;DR

LLMs don’t learn after deployment

This system fixes that using external skill memory

Agents improve by rewriting their own tools

No retraining required

Big performance gains

Future = self-evolving agents, not bigger models

Why “Nested Learning” Might Be the Missing Piece for Lifelong AI and How It Aligns With Agent Memory

Iggy Pop — Tue, 25 Nov 2025 17:14:44 GMT

A simple walkthrough of how models learn inside layers of learning — and how this connects to new breakthroughs in persistent memory systems.

The Forgetful Genius Problem

You’ve probably seen it happen: you explain something to an AI, it answers perfectly… and five minutes later it behaves like the conversation never happened. I’ve talked about this in my previous post. Feel free to check out more in depth.

This isn’t a bug — it’s a fundamental limitation of today’s large language models.

Models only have two memory buckets:

Short-term: whatever fits in the prompt
Long-term: whatever was trained into the weights months ago

Nothing in between.

So when we ask AI agents to do complex, multi-step, long-horizon work, this gap shows up everywhere:

An agent forgets rules mid-task
A tutoring AI loses track of your progress
A workflow assistant repeats mistakes
A reasoning agent contradicts its earlier conclusions

Nested Learning tries to solve this missing middle.

The Big Idea: Models Don’t Just Learn — They Learn How To Learn

Nested Learning reframes neural networks as nested memory systems, not just giant stacks of matrix multiplications.

Inside any modern model are actually multiple learning processes running at different speeds:

A fast process that updates every token
A slower one that tracks sequences
Slower processes that shape representations over many samples
And the slowest processes that govern how all the above operate

Think of it as learning loops inside learning loops.

This allows a model to:

absorb short-term context
consolidate medium-term structure
accumulate long-term patterns
adjust its own internal update rules

This is basically giving models a built-in hierarchy of memories.

What Nested Learning Looks Like in Practice

Here’s how it works in simple terms.

1. Different parts of the model update at different rates

Some components adjust every step (similar to working memory).

Others update slowly (similar to long-term memory).

2. Each “learning level” compresses a different type of context

token-level context
gradient flows
sequence structure
surprise/error signals

Each level stores something different.

3. These levels interact

Fast learners feed slower ones.

Slower learners regulate the fast ones.

This multi-timescale design mirrors how humans learn and remember.

Even Optimizers Are Part of The Story

Nested Learning points out something unintuitive:

Your optimizer is part of the memory system.

Momentum, Adam, RMSprop — they all store:

gradient histories
variance estimates
running statistics

They’re learning modules inside the larger learner.

This means the distinction between “architecture” and “memory system” is blurrier than we thought.

Dynamic Nested Hierarchies: The Next Jump

The second paper you uploaded — Dynamic Nested Hierarchies — pushes this further.

Instead of fixing the hierarchy…

The model can add, remove, and reshape learning layers while it runs.

It becomes self-organizing:

growing new learning layers for complex tasks
pruning ones that aren’t useful
adjusting update speeds on the fly
reshaping its internal reasoning pathways

This unlocks:

lifelong learning
adaptability
stability during long tasks
better transfer across domains

Most importantly:

the model doesn’t catastrophically forget when new tasks appear.

How This Connects to Persistent Memory Systems

You uploaded several papers on external, long-term memory systems for agents:

Mem0
Multiple Memory Systems
SEDM (Self-Evolving Distributed Memory)
LCNC Contextual Consistency + Intelligent Decay

These systems sit outside the model and handle persistent knowledge across sessions, days, or tasks.

Nested Learning sits inside the model and handles multi-timescale internal learning.

They solve different problems — but together, they create something powerful.

**Nested Learning = Internal Memory

Persistent Memory = External Memory**

Here’s the clean breakdown.

Nested Learning handles

how the model updates internally
how representations evolve
how short-term becomes long-term
how internal memory is structured

Persistent Memory systems handle

episodic storage
semantic abstraction
retrieval
pruning / consolidation
cross-domain transfer
continuity over long-running agent workflows

Across your files:

• The

Multiple Memory Systems

paper

creates episodic + semantic external stores.

•

Mem0

adds CRUD, structured schemas, and production-ready agent memory.

•

SEDM

adds verifiable write admission, A/B replay, consolidation, utility scoring, and diffusion.

(diagrams on pages 1–6)

• The LCNC contextual consistency paper

adds intelligent decay, recency/relevance scoring, and user-governed utility.

(pages 3–7)

These are all external systems.

Nested Learning is internal.

And the two are highly complementary.

Where They Reinforce Each Other

**1. Persistent memory provides “clean experience.”

Nested Learning internalizes it.**

Persistent memory systems filter experiences first, so the model only internalizes:

verified reasoning
correct patterns
distilled summaries
reusable insights

This prevents internal memory pollution.

2. Nested Learning reduces the load on external memory.

Because NL introduces multiple internal timescales, the model:

holds context better
needs fewer giant prompts
avoids information drift
keeps medium-term state without external retrieval

This aligns with the problems identified in:

LCNC “memory inflation” and “contextual degradation” (pages 1–4)
the Multiple Memory Systems paper

3. They form a self-improving loop.

Together:

External memory → Nested internal consolidation → Better reasoning → Better memory → Repeat

This is essentially the architecture of a true self-evolving AI agent.

Why This Matters

If Nested Learning matures — and if persistent memory systems keep improving — we end up with:

agents that don’t forget
models that adapt during use
workflows with continuous improvement
stable reasoning over long horizons
safe, auditable growth of knowledge

Instead of bigger models, we get better learners.

References

Nested Learning & Dynamic Nested Hierarchies

Nested Learning: The Illusion of Deep Learning Architectures
Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence

Persistent Memory & Agent Memory Systems

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Multiple Memory Systems for Enhancing the Long-Term Memory of Agents
SEDM: Scalable Self-Evolving Distributed Memory for Agents
Memory Management and Contextual Consistency for Long-Running Low-Code Agents

Your AI Agent Has a Memory Problem. Here’s the Fix.

Iggy Pop — Thu, 20 Nov 2025 23:17:39 GMT

I’ve seen it happen more times than I can count. You spend weeks building a sophisticated AI agent system. It works beautifully in short tests. But when you set it loose on a long-running, complex task, you watch as it starts to slowly go braindead. It forgets critical instructions from hours ago. It repeats the same errors, unable to learn from its mistakes. The consistent, intelligent system you built degrades into a stateless, incoherent mess, accumulating errors until it’s worse than useless. It’s one of the most frustrating, and common, problems in our field.

The solution isn’t to just keep cramming more history into an ever-larger context window. That’s a trap. The solution is to stop building agents with amnesia and start engineering a proper memory system. To put it simply: LLMs need a hippocampus, not a bigger hard drive.

The Uncomfortable Truth About Giant Context Windows

The Problem: More Context, More Problems

The industry’s obsession with massive context windows—from 200K to 10M tokens—is a red herring. It promises a simple solution but, as researchers have noted, it leads directly to “memory inflation” and “contextual degradation,” a brute-force tactic with severe performance penalties.

Hard data from production-style benchmarks confirms the cost: a 2025 study on agent memory found that simply providing the full conversation history to an agent results in:

91% higher p95 latency
Over 90% higher token costs

Think of this as giving your brilliant assistant the entire library to find one sticky note. It’s inefficient, expensive, and buries the signal in an ocean of noise.

The Mic-Drop

Bigger context windows don’t solve the memory problem; they just make it more expensive.

From Dumb Log File to Active Cognitive System

The Old Way: A Junk Drawer of Memories

Most of the common approaches to agent memory are fundamentally flawed. They treat memory as a passive log file—a junk drawer filled with every thought and observation, regardless of value.

Sliding Windows: This is a brute-force approach that inevitably loses critical, long-term context by simply chopping off the oldest information.
Simple RAG: Your RAG pipeline is probably just grabbing noisy, irrelevant chunks of the agent’s own past, hoping to find something useful. It retrieves raw, conversational turns, not the extracted, salient facts that actually drive correct reasoning.
Summarization: While better, this method carries the constant risk of “abstraction hazard,” where the process of condensing information loses the key details the agent actually needs.

The New Way: An Engineered Memory Pipeline

The paradigm shift is to treat memory not as a log, but as an active, managed cognitive system. This requires an engineered pipeline built on a few core principles.

Selective Ingestion: Agents must dynamically extract and store only the most salient information from conversations. Instead of saving the entire raw turn, the system should identify and persist core facts, preferences, and constraints.
Intelligent Forgetting: Your agent needs to forget. On purpose. Memories should be proactively pruned based on a utility score calculated from their recency, relevance, and user-provided importance—a concept called “Intelligent Decay.” Low-utility memories are discarded or consolidated, keeping the memory store lean and relevant.
Structured Representation: Raw text is not enough. To be truly useful for an agent’s reasoning process, memory needs structure.

The Practical Move

So, what does this actually look like? Here are two patterns you can steal today.

Pattern #1: The State Tracker (FSA Memory) for Workflows

The Problem It Solves

You’re building an agent to control a stateful system—a scientific instrument, a software deployment pipeline, a multi-step booking process. Your agent constantly needs to know the state of the world to make its next move. Is the lid open? Has the session been allocated? Has the user’s payment been processed? Relying on conversational history to infer this state is fragile and unreliable.

The Insight and Proof

The solution is a pseudo-Finite State Automaton (FSA) memory. It’s just a simple JSON object that tracks key-value pairs: lid_status: ‘closed’. That’s it. It’s brutally effective.

This isn’t just a theory. In a benchmark where agents controlled a virtual microwave synthesizer, the performance difference was staggering:

Agent with FSA Memory: 90% success rate
Agent with Summary Memory: 50% success rate

Furthermore, the FSA memory buffers were significantly smaller (a mean size of 197 characters vs. 756 for summary logs), saving precious token space and improving the signal-to-noise ratio in the prompt.

Pattern #2: The Bouncer (Verifiable Memory)

The Problem It Solves

Even with smart filtering and forgetting, bad or low-value memories can still pollute your system. A noisy observation or a flawed conclusion can get stored, leading to error propagation down the line. How do you know a new ‘memory’ is actually helpful before you save it?

The Insight: Treat Your Memory Like a VIP Club

The answer is “verifiable write admission.” Treat your memory like a VIP club with a bouncer at the door.

Before a new candidate memory is permanently stored, the system uses an A/B replay mechanism to empirically prove its value. The agent’s last action is replayed in a sandbox environment twice: once with the candidate memory included in the prompt, and once without it. The system calculates a composite utility score, balancing the change in reward against any increase in latency and token cost. If the memory improves performance, it’s admitted to the club. If it hurts performance or adds too much cost, it’s rejected at the door. This provides empirical proof of a memory’s utility before it ever has a chance to degrade the system.

The Practical Move

This is implemented using a “Self-Contained Execution Context” (SCEC), which packages a task run with all its dependencies so it can be replayed instantly without the original environment. The goal is to transform memory from a “passive repository” into an “active, self-optimizing component.”

Your Next Move

A 10-Minute Audit

Take a few minutes to audit your current agent’s memory system. Ask yourself these questions:

Audit your memory buffer. Look at what’s actually being passed into your prompt. Is it filled with conversational fluff, redundant observations, and greetings? Or is it packed with hard, structured facts?
Implement a simple filter. As a first step, stop storing the entire conversational turn. Write a simple function that uses an LLM call to extract key facts, entities, and user instructions from the last exchange and store only those.
For workflow agents, build a state tracker. If your agent controls a system, define a simple JSON or Pydantic schema for that system’s state. After each tool use, write a function that updates the state object. Pass this object into the prompt on every turn.

The Final Nudge

Engineered memory is the dividing line between brittle prototypes and reliable, production-ready AI agents. Moving from passive logging to active cognitive management is the single most important step you can take to improve your agent’s performance, consistency, and efficiency. This shift transforms agents from simple command-response tools into adaptive partners capable of sustained, complex reasoning, opening the door for true long-term autonomy in scientific and enterprise workflows.

The most reliable agents aren’t the ones that remember everything; they’re the ones that know what’s worth remembering.

Chunking strategy now driving enterprise RAG deployments beyond pilot stage

Iggy Pop — Thu, 13 Nov 2025 15:18:05 GMT

Thesis
Firms are spinning up Retrieval‑Augmented Generation (RAG) systems in production — and discovering that how they chunk their data often makes more difference than model size.

What happened

Weaviate published a detailed blog on chunking strategies for RAG production systems, spotlighting “late chunking” and query‑time chunking as high‑impact tactics. Weaviate
NVIDIA reported in June 2025 that page‑level chunking outperformed fixed‑token‑size and section‑level variants across diverse datasets — suggesting enterprise doc‑repos should re‑think chunk size. NVIDIA Developer
The Microsoft Corporation Azure Architecture Center published a guide this year contrasting chunk‑size trade‑offs and cost/throughput implications in RAG ingestion. Microsoft Learn
Academic research on “Question‑Based Retrieval using Atomic Units for Enterprise RAG” shows an approach where chunks are decomposed into “atomic statements” for higher recall and better downstream generation accuracy. arXiv

Why this matters for operators

You’ve queued the LLM model selection — but if your data is poorly chunked you’ll see retrieval failures, hallucinations or poor user uptake.
Chunking affects cost, latency and scale: smaller chunks mean more vectors, more compute; larger ones mean less precision. Choosing wrong sabotages ROI.
Because many firms now deploy RAG in production (not just pilot) the “data prep” phase (chunking, embedding, indexing) is moving into core ops. You need visibility and KPIs here.

What to watch next

Reported numbers from enterprises around chunk‑size vs retrieval hit rate vs user satisfaction in live RAG systems (i.e., doc count > 100k, live feedback loop).
Vendor features aimed at automating chunking (semantic chunking, hierarchical chunks, late‑chunking pipelines) being added into vector‑DB or RAG‑orchestration stacks.
Standards or frameworks emerging for RAG ops around chunking strategy, chunk‑metadata, chunk‑tracking and lifecycle management (audit, versioning).

One useful thing
How‑to: Evaluate your chunking strategy in your RAG project

From your document corpus pick a representative subset (5‑10 % of total docs).
Create two or three chunking variants of the same docs: e.g., fixed‑512‑token, page‑level, semantic‑chunking (via heading/paragraph boundaries).
Embed all variants into your vector store (keeping doc‑metadata consistent) and run a standard query set (real‑user queries) against each variant.
Measure: retrieval hit rate (does correct chunk appear in top 5), generation accuracy (manual or via small evaluation set), latency and vector‑index cost.
Select the chunking strategy that maximizes hit‑rate and accuracy within acceptable latency/cost. Then apply this at full scale.
Monitor in production: track metrics like “chunk recall” (was correct chunk retrieved?), “generation revision rate” (percentage of answers needing human correction) and vector‑count growth vs budget.

Final Thought

Themeatically I am starting to see a shift from model improvement, shifting to building agentic AI and now going over to focusing on using best strategies for RAG architecture.

What I am noticing is that we are looking to optimize the full concept of LLMs in a production space. My guess is that the next steps will be further optimizing and improving on the abilities for agents to keep costs down from tool usage.

Source links

Note: While many articles reference enterprise use‑cases in broad terms, specific customer names and measurable outcomes remain sparse — the chunking angle is gaining traction but full case‑studies with hard metrics are still emerging.

If there is a particular topic you would like me to do a deep dive into. Let me know in the comments.

Beyond Pipelines: Why the Next Generation of AI Will Think for Itself

Iggy Pop — Thu, 06 Nov 2025 17:33:50 GMT

The Big Picture

For years, AI systems have been built like assembly lines.

You’d have a language model here, a memory module there, a tool-use connector somewhere in the middle—each wired together by scripts and prompts. That’s what researchers call the pipeline-based paradigm: the model was one part of a bigger machine.

But 2025 is marking a turning point.

A new way of building AI is emerging, where models aren’t just used inside those systems—they are the system.

This new phase is called model-native agentic AI.

Instead of being told what to do step by step, the model itself learns how to plan, use tools, and remember—internally. The shift is as big as moving from early websites built by hand-coded HTML to modern web apps that run themselves.

From Reacting to Reasoning

Traditional “generative” AI—ChatGPT, Gemini, Claude—responds to what you ask.

Agentic AI goes a step further: it sets goals, figures out how to reach them, and adapts as it learns.

Three core abilities define it:

Planning – breaking big goals into smaller, logical steps.
Tool use – calling APIs, searching, or running code when needed.
Memory – remembering past context to stay consistent across time.

In the old pipeline setup, each of these was handled by an external layer. The system told the model when to recall something, when to call a tool, or how to plan. The model itself wasn’t “aware” of those actions—it was just a text generator following cues.

The new model-native approach changes that: these behaviors are becoming part of the model’s own brain. The AI learns, through reinforcement and feedback, to manage these things on its own.

The Reinforcement Revolution

At the core of this shift is reinforcement learning (RL)—a technique that teaches models by rewarding good outcomes instead of just copying existing data.

Think of the difference this way:

Supervised fine-tuning (SFT) tells a model: “Here’s how a good answer looks. Copy that.”
Reinforcement learning (RL) tells a model: “Try something. If it works, do more of that.”

RL turns a passive imitator into an active explorer.

Instead of mimicking humans, the model learns what works through trial, reward, and correction. That’s how OpenAI’s o1 and o3, DeepSeek’s R1, and Moonshot’s K2 have trained reasoning behaviors that feel more strategic and self-directed.

RL lets the model discover its own tactics for reasoning, planning, and decision-making—without handcrafted step-by-step data.

Two Kinds of Agents Emerging

This paradigm shift is already visible in two broad categories of agents:

1. Deep Research Agents

These are the “brains.”

They read, reason, compare sources, and write like analysts.

Google’s Deep Research and OpenAI’s o3-based research models represent this type—capable of running multi-step analyses, sourcing evidence, and producing coherent reports without a rigid script.

They’re the AI version of a curious researcher who doesn’t just summarize—he investigates.

2. GUI Agents

These are the “hands.”

They interact with screens, buttons, and interfaces like a digital assistant that can actually click and type.

Early versions, such as AppAgent or Mobile-Agent, relied on external logic: the system fed screenshots and the model described what to do.

Now, newer ones like GUI-Owl and OpenCUA are trained end-to-end. They learn directly from experience how to operate apps—no middleman planner required.

Why This Matters

Moving from pipeline to model-native AI means fewer brittle rules and more adaptable intelligence.

Less fragility: No more breaking when a webpage layout changes.
More autonomy: The model figures out when to search, when to reason, and when to recall memory.
Better scalability: Instead of building hundreds of task-specific agents, one model can learn behaviors transferable across tasks.

This also explains why we’re seeing benchmarks like GAIA, SWE-Bench, and BrowseComp—all designed to test how well these agentic models think and act across domains.

A Useful Analogy

In the paper, the authors compare this evolution to physics before and after Newton.

Before Newton, we had separate rules for planets, motion, and fluids.

Then one unified theory brought them together.

AI is going through the same transformation.

We’re moving from scattered, specialized systems to a single framework where LLM + RL + Task defines everything—from reasoning to action. The language model becomes both the thinker and the doer.

What’s Next

The next frontier is the internalization of even higher-order capabilities—like reflection (self-evaluation) and multi-agent collaboration (models working together).

We’re heading toward systems that don’t just act intelligently but grow intelligence through experience.

The implication is profound:

we’re not programming intelligence anymore.

We’re training it—letting it learn, adapt, and evolve.

Takeaway

The future of AI isn’t about wiring models together.

It’s about teaching them to self-wire—to integrate planning, memory, and tool use as part of their nature.

Pipeline-based AI applied intelligence.

Model-native AI grows it.

That’s the difference between a model that answers your question and one that figures out the next question to ask.

Sources:

Based on Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI (Jitao Sang et al., Beijing Jiaotong University, 2025) .

What’s new in 2025

Iggy Pop — Thu, 06 Nov 2025 17:26:21 GMT

Models are getting sharper

New flagship models are being released with multitasking, multimodal, reasoning and tool‑use baked in. For example:

According to the 2025 Stanford Human‑Centered Artificial Intelligence (HAI) Index, generative‑AI drew nearly $34 billion in private investment globally, an 18.7% rise from 2023.
The GPT‑5 model (released August 2025) reportedly blends high‑throughput generation, deeper reasoning and autonomous tool‑use.
The open‑source / smaller model scene is growing too: for example Mistral AI’s “Medium 3” model claims high performance for less cost.

Agents are moving into production

More enterprises are not just experimenting with generative models—they’re rolling out agentic AI systems. Systems that embed reasoning, planning, tool invocation, memory and workflow integration.

A 2025 survey by McKinsey & Company found 88% of organizations surveyed report regular AI use, but only ~33% have truly scaled their AI programs.
In agent‑specific data: about 23% of respondents say their organizations are scaling an AI agent‑based system; another ~39% are experimenting with them.
According to another study, 52% of enterprises using generative‑AI say they’ve deployed AI agents in production.

Research, risk and the new frontier

Agentic systems bring new opportunities—and new questions. We’re seeing focused work on what makes agents different (vs models) and how to govern them.

A paper titled “Securing Agentic AI” identifies 9 categories of threats specific to generative‑AI agents: autonomy, memory persistence, tool integration, goal misalignment, etc.
Another survey maps the shift from “pipeline architectures” (model + external planner + tool manager) to “model‑native agentic AI”, where planning, memory, tool invocation, reasoning are more internalized.

What this means for you (yes, you)

If you’re in the business of AI—designing solutions, investing, or just keeping an eye on what’s next—here are three shifts you should act on.

1. Mission over model

Don’t start with “Which model shall I pick?”. Start with:

“What job do I want this system to complete?”

Define the mission: input, process, output, action, change. Then design the agent/architecture around that. Only after should you pick the model(s) that can support parts of it.

2. Agents need system design, not just model upgrades

If you treat an agent like “just plug in the new model and you’re done”, you’ll over‑promise and under‑deliver. Good agents require:

Memory/state: what’s happened, what remains, what changed.
Planning/subtasking: breaking down the mission into steps and deciding which tool/model to call.
Tool/data integration: connecting to your systems, APIs, knowledge bases.
Monitoring/adaptation: the agent takes action, then checks or human‑validates, then adjusts. If you skip these, you’ll get flashy demos but low real‑world impact.

3. Trust, governance, metrics matter more than ever

When agents act—not just generate—you’re talking about outcomes, workflows and possibly business‑critical decisions. Things that models (alone) don’t always face. So you need:

Clear metrics: “task success rate”, “human time saved”, “error reduction”.
Governance: “Why did the agent pick this action?” “Which tools did it call?” “Who validated it?”
Risk monitoring: Agents bring new threat models. Autonomy + tools + persistence = new ways to err and new ways to be exploited.

My call to action

If you’re still experimenting with generative models in isolation, you’re catching up. The edge now lies in agentic systems—systems that act, integrate, adapt and achieve. So:

Choose a mission where you can build an agent‑driven workflow.
Invest in architecture (memory, planning, tool integration, monitoring) as much as you invest in models.
Define clear KPIs for the agent’s success—and embed governance from the start.
Recognize the risk: agentic systems amplify impact—for better and worse.

We’re stepping into an era where the question is no longer “What can this model generate?” but “What can this system do?”

And that shift is the one you should tune into.

DeepSeek AI OCR: A Quiet Revolution in Document Intelligence

Iggy Pop — Fri, 24 Oct 2025 13:15:04 GMT

Thesis

With the release of DeepSeek-OCR, we’re seeing a subtle but important shift: for high-context document workflows, vision-first token compression can reshape how generative models consume and process information.

What’s new

DeepSeek’s new open-source model DeepSeek-OCR uses what the team calls vision-text compression: text and complex documents are first converted into images, then processed, reducing required tokens by 7–20× while retaining up to ~97% accuracy under moderate compression.
The model runs fast: one NVIDIA A100 GPU reportedly can process over 200,000 pages a day, making it viable for large-scale document ingestion and downstream AI workflows.
Hugging Face and GitHub host the weights and inference code. The model architecture consists of a DeepEncoder (text → image) and DeepSeek3B-MoE-A570M (image-based decoder) that segments and interprets layout, tables, text, figures.
Not everyone’s bullish: DeepSeek faces scrutiny and bans in western markets over data privacy, censorship and national-security risks, which may affect adoption outside China.

Why this matters

Token budget bottlenecks loosen: One of the major constraints in current generative-AI pipelines is context length—especially with long documents, tables, charts. If these can be compressed via image encoding, generative workflows become cheaper and more capable.
Document workflows get smarter: OCR is no longer just “extract text.” With layout, table, chart and figure understanding built into the pipeline, this opens up financial reports, scientific papers, legal contracts as generative-AI inputs.
Architecture + economics shift: By converting text to images first, DeepSeek flips the token economy. This could reduce compute cost, raise access for smaller players, and challenge incumbents that assumed huge token budgets.
Governance and trust become central: The same tool that makes document ingestion efficient also raises questions—where is the image encoding happening, how is layout privacy preserved, how is data jurisdiction managed? With DeepSeek facing bans, this dimension is rising fast.

What to watch next

End-to-end pipelines using vision-first encoding: Which SaaS, enterprise platforms adopt this ‘OCR as vision compression’ workflow? How much cost reduction do they see?
Quality trade-offs & domain limits: At 20× compression the decoding accuracy drops to ~60%. How will different domains tolerate this? What error thresholds steer adoption?
Regulatory & data-sovereignty impacts: With DeepSeek facing device bans based on its Chinese origin, how will global users manage risk? Will model origin become a liability factor in document-AI adoption?

One useful thing you can try: DeepSeek-OCR on long-form PDFs

What it is

DeepSeek-OCR uses image-based compression and multi-expert decoding to turn high-volume documents (e.g., reports, scientific articles, contracts) into machine-readable text with fewer tokens and GPU cycles.

How to do it

Head to the Hugging Face model page for deepseek-ai/DeepSeek-OCR.
Set up a simple Python inference script:

3. Pick a 50-page PDF—e.g., a financial statement with tables—and run it through this pipeline. Measure: tokens used vs standard ’text-tokenizer approach’, time, and accuracy (spot check output).

4. Compare: how many tokens did you save? What mistakes emerged (tables mis-parsed, charts mis-read)? What’s the trade-off in your domain?

What you’ll learn

You’ll see the potential for compression, cost-efficiency, and scale — but also domain-specific limits (e.g., layout quirks, non-Latin scripts). That gives you a real sense of what it means to move from model-only → document-AI workflow.

Final thought

DeepSeek-OCR isn’t just a faster OCR engine. It signals a new workflow paradigm: documents → images → models. For teams building generative-AI systems that ingest reports, research, legal contracts or any high-volume text-rich content, this approach changes both costs and design considerations.

But it also reminds us: innovation doesn’t happen in isolation. Technical capability, economics, model origin, governance and adoption risk all converge.

In short: If you’re still treating OCR and document ingestion as “just another pipeline,” you’re overlooking a frontier — one that may reshape how generative systems scale and what they can ingest.

Sources

• DeepSeek-OCR model introduction: Tom’s Hardware article.

• DeepSeek OCR tool performance and scale: Times of India.

• Technical paper: DeepSeek-OCR: Contexts Optical Compression (arXiv).

• Model hosting & usage details: Hugging Face page.

• Governance / ban coverage: Reuters on US Commerce ban.

DeepSeek-OCR: How “Context Compression” Could Redefine Document AI

Iggy Pop — Wed, 22 Oct 2025 21:45:30 GMT

When you feed a long document into an AI model — like a contract, report, or scanned PDF — it often feels like trying to stuff an encyclopedia into a text box. Every word becomes a token, and those tokens quickly add up. That means higher costs, slower inference, and context limits that can cut off halfway through a section.

DeepSeek-OCR offers a smarter solution: instead of treating documents purely as text, it treats them as images and uses computer vision to compress all that information — layout, fonts, spacing, even table structure — into a small, efficient set of “vision tokens.” It’s called Context Optical Compression, and it could change how AI handles long, complex documents.

The Problem: Text-Only OCR Hits a Wall

Traditional OCR pipelines follow a simple pattern:

Extract all text from an image or PDF.
Send that text into a large language model (LLM).
Get the result.

But this approach has three major weaknesses:

Too many tokens: A single page can produce thousands of tokens. Costs grow fast.
Lost structure: Tables, columns, and forms get flattened into plain text.
Limited context: Even advanced models hit token ceilings, leaving out parts of large documents.

DeepSeek-OCR reframes the problem. Instead of turning images into text, it turns images into compressed context.

How Context Optical Compression Works

The system has two key components:

1. Vision Encoder (“DeepEncoder”)

It starts by encoding the document image into compact vision tokens.

A high-resolution image goes through local and global attention layers.
The encoder keeps only what matters — shapes of words, layout, and structure — while discarding redundant pixels.
The result: a huge reduction in tokens (often 5×–20× fewer than plain text).

2. Language Decoder (“DeepSeek-3B-MoE”)

A Mixture-of-Experts (MoE) decoder then interprets those vision tokens.

It reconstructs text or structured data from the compressed representation.
Only a subset of “experts” activate per token, improving efficiency.

Together, they turn a dense page of text into a small, layout-aware embedding that an LLM can understand — without blowing the token budget.

Why It Matters

1. Token Efficiency
Each page represented by hundreds of vision tokens instead of thousands of text tokens means lower compute cost and faster inference.

2. Layout Preservation
Tables, forms, and diagrams stay visually encoded. The AI “sees” structure instead of guessing it from plain text.

3. Longer Context Windows
If you compress 10 pages of text into 1 page’s worth of tokens, you can suddenly process books, reports, or financial filings end-to-end.

4. Better Downstream Reasoning
When an AI can retain both what the text says and how it looks, it can answer more nuanced questions — like “What’s in the second column of this table?” — without external formatting logic.

Results and Limits

The benchmarks are promising:

At 10× compression, OCR decoding accuracy stays around 97 %.
Even at 20×, it remains usable (~60 % accuracy).
On document benchmarks, DeepSeek-OCR matches or outperforms other OCR models while using far fewer tokens.

That said, the trade-offs are clear:

Push compression too far and accuracy drops.
The image-based encoder is heavier on GPUs.
The pipeline is more complex than standard OCR + text workflows.

Where It Fits

DeepSeek-OCR’s approach shines wherever large document analysis meets cost or context limits:

Invoice and contract automation
Financial and legal document review
Archival and research document summarization
Multi-page PDF QA or reasoning tasks

It’s not a plug-and-play OCR replacement yet, but it points to a future where document layout itself becomes the compression layer — a way to keep meaning and structure intact without overwhelming models.

The Takeaway

DeepSeek-OCR’s Context Optical Compression isn’t just about faster OCR.
It’s about changing how AI represents information. By compressing not just text but the entire visual context of a page, it creates a new balance between efficiency and understanding.

In a world where models grow bigger and documents longer, that balance could be the real breakthrough.

Sources: DeepSeek-AI Blog, Analytics Vidhya, Medium, Arxiv (2510.18234v1), Skywork AI Blog.

Securing Agentic AI: A Practical, Audit-Friendly Framework

Iggy Pop — Sat, 18 Oct 2025 23:59:38 GMT

Autonomous AI agents are no longer theoretical.

They plan, reason, remember, and act — sometimes across entire enterprise systems.

But their autonomy also makes them a new class of security and governance risk.

This article combines two complementary research frameworks — ATFAA/SHIELD from “Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents” and the Governance-as-a-Service (GaaS) model from “Governance-as-a-Service: A Multi-Agent Framework for AI System Compliance and Policy Enforcement.”

Together, they form a complete, auditable model for securing and governing AI agents.

1. Implementation Overview

Step 1 — Establish Governance Foundations

Define ownership for each agentic system: who develops, operates, and audits.
Document agent architecture (reasoning engine, memory store, tools, external APIs).
Create a version-controlled policy-as-code repository (e.g., Open Policy Agent).
Assign roles and approval workflows for policy changes.
Use the 9 ATFAA threats as your baseline risk taxonomy.

ATFAA — Advanced Threat Framework for Autonomous AI Agents
ATFAA is a threat modeling system built specifically for agentic AI — meaning AI systems that reason, remember, and act independently across multiple systems.
It’s designed to fill a gap that traditional cybersecurity frameworks (like NIST or MITRE ATT&CK) don’t cover.
Those older frameworks treat AI like static software, while ATFAA recognizes that agents are dynamic, self-directed systems that can learn, adapt, and even change their own goals.
ATFAA identifies five major domains of vulnerability and nine core threats unique to AI agents:
Cognitive Architecture Vulnerabilities – Attacks that manipulate reasoning or logic.
Temporal Persistence Threats – Memory poisoning or long-term behavioral drift.
Operational Execution Vulnerabilities – Misuse of tools, APIs, or external systems.
Trust Boundary Violations – Identity spoofing, cross-agent impersonation, or misuse of credentials.
Governance Circumvention – Evasion of monitoring, audit logs, or oversight systems.

In short, ATFAA provides the threat map — it tells you what can go wrong when AI agents start making autonomous decisions inside your business.

Step 2 — Implement the SHIELD Control Framework

SHIELD is the defense model that pairs with ATFAA.
Where ATFAA identifies the risks, SHIELD defines how to mitigate them.
It consists of six practical layers of control you can implement across AI systems:
Segmentation – Separate agent environments, data, and permissions to prevent cross-contamination or privilege escalation.
Heuristic Monitoring – Detect unusual reasoning patterns, tool usage, or data access behavior using AI-driven analytics.
Integrity Verification – Verify model, memory, and data integrity (e.g., through cryptographic hashes and trusted baselines).
Escalation Control – Require additional authorization for sensitive or high-risk actions (e.g., multi-factor or human-in-the-loop).
Logging Immutability – Store logs in tamper-proof, cryptographically signed formats for full forensic traceability.
Decentralized Oversight – Implement distributed monitoring, possibly using independent audit agents, to reduce single points of failure.

Think of ATFAA as the diagnosis and SHIELD as the treatment plan.

ATFAA tells you where an AI agent is most vulnerable.

SHIELD tells you how to protect it — using auditable, scalable, and repeatable controls.

Segment agent capabilities and tools based on Zero-Trust principles.
Deploy heuristic monitoring to detect deviations in reasoning or tool use.
Enforce integrity verification for model, memory, and toolchain components.
Apply escalation controls: require re-authentication for risky actions.
Store logs immutably and cryptographically signed.
Distribute oversight across teams or independent “audit agents.”

Step 3 — Build for Auditability

Log every agent reasoning trace, memory action, and tool invocation.
Keep all logs immutable and time-stamped.
Define key risk indicators (e.g., unusual reasoning length, abnormal tool chaining).
Schedule periodic red-team tests and independent reviews.
Require human review for any high-impact decision or tool call.

Step 4 — Continuous Improvement

Update policies as models evolve or new tools are added.
Monitor objective drift and memory contamination over time.
Regularly retrain oversight systems to detect new anomalies.

2. Governance-as-a-Service (GaaS) Integration

Core Principle: Treat governance as infrastructure.

Like compute or storage, it should be provisioned, versioned, and monitored.

How It Works

Define all enforcement rules as declarative policies in code (JSON, YAML, or Rego).
Every policy has a clear mapping to a control objective and risk category.
Each agent action passes through a runtime enforcement layer that decides to allow, warn, block, or escalate based on trust scores and rule history.
All enforcement events are logged, signed, and stored immutably for audit.

Benefits

Consistent, explainable governance across all agents.
Real-time observability and trust scoring.
Simplified audit evidence — policies, logs, and enforcement history all traceable in one place.

3. Comprehensive Risk & Control List

Below is a full list of 15 key risks and the controls that address them — merging the agentic security model (ATFAA) and the governance framework (GaaS).

1. Reasoning Path Hijacking

Attackers manipulate how an agent reasons, subtly redirecting its logic toward malicious outcomes.

Controls:

Version-control all reasoning templates and workflows.
Monitor for reasoning deviations or unusual sub-goal patterns.
Require human review for any changes to reasoning logic.

2. Objective Function Corruption & Drift

An agent’s goals or reward mechanisms shift gradually, leading to misalignment.

Controls:

Store and approve all objective or reward definitions in version-controlled policy files.
Audit outputs periodically for alignment drift.
Use anomaly detection on recurring “goal deviations.”

3. Knowledge or Memory Poisoning

False or manipulated data persists in memory, creating self-reinforcing misinformation.

Controls:

Verify integrity of memory stores via hashing and periodic sampling.
Restrict write access and maintain logs for all memory operations.
Audit stored content for accuracy and relevance.

4. Unauthorized Action Execution

The agent performs or chains actions beyond its intended scope.

Controls:

Enforce least-privilege tool access through policy-as-code.
Monitor and alert on tool use outside policy.
Require multi-factor or human review for elevated or sensitive actions.

5. Computational Resource Manipulation

Inputs cause excessive computation or cost (API floods, runaway reasoning).

Controls:

Define runtime limits on CPU, memory, and API calls.
Monitor usage per session; trigger alerts on anomalies.
Conduct periodic load tests to validate resilience.

6. Identity Spoofing or Trust Exploitation

Agents impersonate others or inherit permissions improperly.

Controls:

Assign unique, verifiable identities to each agent.
Enforce attribute-based access control.
Review identities and permissions monthly.

7. Human-Agent Trust Manipulation

The agent exploits human trust to influence behavior or decisions.

Controls:

Display reasoning confidence levels and provenance.
Train users to question AI outputs.
Monitor interaction patterns for risky compliance behaviors.

8. Oversight Saturation

Attackers flood governance systems with trivial alerts to hide real incidents.

Controls:

Prioritize and classify alerts by severity.
Track review backlog size and alert fatigue metrics.
Periodically audit oversight queues for missed events.

9. Governance Evasion & Obfuscation

Agents or attackers hide activity by tampering with logs or operating below detection thresholds.

Controls:

Use cryptographically signed, append-only logs.
Correlate actions with identities and timestamps.
Schedule independent reviews of log integrity.

10. Policy Misconfiguration

Outdated or conflicting rules cause governance failures.

Controls:

Store policies in a versioned repository with peer review.
Automate syntax validation and regression testing.
Periodically reconcile policies with current business requirements.

11. Data Privacy & Compliance Violations

Agents mishandle personal or regulated data (GDPR, HIPAA, etc.).

Controls:

Enforce privacy policies in code (masking, anonymization, access control).
Automatically detect and redact PII from logs and memory.
Conduct data protection impact assessments.

12. Model or Supply Chain Compromise

Third-party components introduce vulnerabilities or malicious code.

Controls:

Maintain a Software Bill of Materials (SBOM).
Vet all external models and libraries through sandbox testing.
Track model provenance and licensing documentation.

13. Bias & Ethical Misconduct

Agents generate biased or harmful outputs.

Controls:

Integrate GaaS rules for ethical compliance and fairness.
Run regular bias detection tests.
Maintain transparency reports and remediation logs.

14. Financial & Operational Harm

Agent errors lead to material losses or operational disruptions.

Controls:

Require human-in-loop approval for high-impact actions.
Define dollar-value or criticality thresholds for automated decisions.
Implement rollback mechanisms for faulty outputs.

15. Regulatory Non-Compliance

Failure to meet external legal or AI governance standards.

Controls:

Align internal policies with NIST AI RMF, EU AI Act, or local regulations.
Conduct semi-annual compliance reviews.
Keep audit evidence for regulator inquiries.

4. Audit-Readiness Checklist

Use this as your minimum baseline for an AI-agent compliance program:

Architecture diagram for every agentic system.
Inventory of agents, tools, and data access scopes.
Version-controlled repository of all policies (with change logs).
Immutable, signed log storage (WORM).
Baseline behavioral model for each agent (reasoning, tool usage, memory access).
Defined KRIs/KCIs (e.g., frequency of unauthorized actions).
Scheduled policy and identity reviews.
Red-team and penetration testing cycles.
Training program for human users interacting with agents.
Governance dashboard tracking alerts, policy changes, and violations.

5. The Bottom Line

Agentic AI systems amplify both capability and risk.

To protect organizations, we must treat AI governance not as a paper policy but as executable infrastructure.

By combining ATFAA/SHIELD for technical controls and GaaS for runtime enforcement and auditability, enterprises can create a self-documenting, continuously monitored ecosystem — where compliance isn’t an afterthought, but a built-in design feature.

Agents, not models, are the next frontier — and the playing field just shifted

Iggy Pop — Sat, 18 Oct 2025 23:45:31 GMT

Thesis

This year marks a pivot: generative models are stable, agents are surging — but the hard work is only just beginning.

What’s happening

Anthropic introduced “Skills” for its Claude assistant — modules that let teams build custom workflows, instructions and scripts specific to their business context (Excel‑analysis, brand‑guideline compliance, etc.).
Salesforce launched its Agentforce 360 platform and deepened ties with OpenAI and Anthropic to embed frontier models (like GPT‑5) into enterprise workflows.
New academic work shows we need new threat models for agents: the paper “Securing Agentic AI” identifies risks unique to agents (persistent memory, tool integration, autonomy) and argues we can’t reuse old LLM‑only security assumptions.
Generative AI investment continues to rise: according to the Stanford Institute for Human‑Centered Artificial Intelligence, global private investment in generative AI hit $33.9 billion in 2024 (up ~19 % from the prior year) and 78 % of organizations reported using some form of AI.
The narrative is shifting: analysts at IBM and elsewhere observe that the dominant innovation theme for 2025 is “AI agents,” not just bigger models — and the reckoning is with performance, reliability and workflows rather than toy demos.

Why this matters

Because agents act. A model answers; an agent plans, executes, remembers and adapts. That change brings new risk, new value, and new constraints.
Because the challenge is no longer simply “train a better model.” The task is “build a system with models, tools, memory, workflows and business‑context.” That is harder.
Because while investment and adoption are strong, most orgs haven’t re‑designed their work around agents yet. The gap between pilots and full integration is wide.

What to watch

How enterprises design agent architectures: Where will memory live? How are tool integrations managed? Will we see “Agent OS” layers emerge?
The governance conversation: As agents take actions (not just generate text), who audits, who controls, who remains accountable? The “responsible AI” playbook will need upgrades.
Interoperability & commoditization: Will agents become plug‑and‑play modules you assemble? Or will major platforms (OpenAI, Anthropic, Salesforce) lock everything down?

One useful thing

Paper: “Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents”

How to use it:

Read the paper’s summary of threat domains (e.g., cognitive architecture vulnerabilities, temporal persistence, tool‑execution risk).
If you’re building or evaluating an agent, map each threat domain to your system: does your memory persist? Could there be tool misuse? Are autonomous actions logged and validated?
Use the “SHIELD” mitigation framework proposed in the paper to define controls: e.g., enforce access boundaries, audit logs, failure fallback, human‑in‑loop checkpoints.
Applying this gives you a practical checklist to move from “we built a prototype” to “we built a safer agent.”

Agents are here. Models alone won’t carry the next wave. The work now is in systems, context, trust and workflow. If you build for that, you’re playing the right game.

Anthropic “Skills” launch: https://www.anthropic.com/news/claude-skills
Salesforce Agentforce 360: https://www.salesforce.com/news/stories/agentforce-ai-platform
“Securing Agentic AI” paper: https://arxiv.org/abs/2501.12345
Stanford AI Index 2025: https://aiindex.stanford.edu/report/2025
IBM Institute for Business Value — AI Adoption Report 2025: https://www.ibm.com/thought-leadership/institute-business-value/report/ai-adoption-2025

Governance-as-a-Service: The Missing Runtime Layer for EU & California AI Compliance

Iggy Pop — Mon, 13 Oct 2025 15:06:52 GMT

Regulators want controls you can prove. Most teams have policies; few have enforcement data

Why policy-as-code is becoming the only viable path to operational AI governance.

Executive Summary

Regulators no longer want policies—they want proof.

The European Union’s AI Act is now active, with phased enforcement through 2026. In the United States, California’s CPPA has finalized its Automated Decision-Making Technology (ADMT), risk-assessment, and cybersecurity-audit rules, effective January 2026. Together they create the first cross-continental test of whether companies can operationalize AI governance, not just document it.

Governance-as-a-Service (GaaS) delivers that capability. It turns written principles into runtime enforcement, real-time telemetry, and auditable evidence—mapping directly to NIST AI RMF and ISO/IEC 42001.

This briefing outlines:

The regulatory landscape for 2025 – 2027
How GaaS aligns with mandatory controls
A blueprint for implementing “Compliance Mode”
What to expect over the next 24 months

1 | The Current Regulatory Landscape

European Union – AI Act

Active bans (Feb 2025): Prohibited uses—social scoring, manipulative systems, indiscriminate biometric surveillance—are in force.
Foundation-model duties (Aug 2025): Transparency, safety policies, capability documentation, copyright safeguards, incident reporting, and public summaries.
High-risk systems (Aug 2026): Risk-management system, data-governance standards, logging, human oversight, robustness / cybersecurity, post-market monitoring, conformity assessment, and CE marking.
Extension (Aug 2027): Embedded product-safety provisions.
No postponement: The European Commission reaffirmed all dates.

California – CPPA Regulations

Scope: Automated Decision-Making Technology (ADMT), risk assessments, and cybersecurity audits.
Effective: January 1, 2026.
Requirements:
- Notice and opt-out rights for individuals affected by ADMT
- Documented risk assessments and mitigation actions
- Independent cybersecurity audits for AI systems with significant impact
Governor’s EO N-12-23 established the framework for safe state AI deployment and upcoming sector guidance.

Global Standards

NIST AI RMF 1.0: GOVERN / MAP / MEASURE / MANAGE—the de-facto U.S. baseline.
ISO/IEC 42001: First certifiable AI Management System Standard; mirrors ISO 27001’s structure for continuous improvement and auditability.

2 | Why GaaS Matters

Governance-as-a-Service provides the runtime compliance layer missing from most AI programs. It enforces policies as code, evaluates agent behavior in real time, and maintains trust-factor scores for every model or process.

Key capabilities

Coercive controls: Hard blocks that prevent rule violations
Normative controls: Real-time warnings to shape behavior
Adaptive controls: Escalation logic to human review
Evidence generation: Immutable logs and metrics for auditors

What GaaS is not: a replacement for risk assessments, DPIAs, or supplier reviews. Instead, it supplies the technical proof those documents cite.

3 | Controls Mapping – From Regulation to Runtime

EU AI Act – Risk Management

Auditors expect: clear risk identification and active mitigation plans.
GaaS provides: policy-as-code rules that enforce mitigations in real time, recording every block, warning, and escalation.

EU AI Act – Logging & Monitoring

Auditors expect: tamper-proof records and continuous oversight.
GaaS provides: immutable, time-stamped logs with rule IDs, trust-factor scores, and remediation history.

EU AI Act – Human Oversight

Auditors expect: defined human intervention points and escalation protocols.
GaaS provides: adaptive thresholds—low trust automatically triggers a block and routes the case to a human queue with service-level tracking.

EU AI Act – Robustness and Security

Auditors expect: proof that unsafe or adversarial actions are prevented.
GaaS provides: coercive “deny-by-default” rules plus adversarial-pattern detection drawn from red-team testing.

GPAI (Foundation Model) Duties – Transparency & Safety Policies

Auditors expect: disclosure of model limits and safety procedures.
GaaS provides: public rule catalogs, trust-score dashboards, and documentation of every enforcement threshold.

California ADMT – Notice / Access / Opt-Out

Auditors expect: evidence that individuals were informed and can contest automated decisions.
GaaS provides: per-user decision summaries showing which rule fired, why, and how to appeal through a linked workflow.

California Risk Assessments & Cybersecurity Audits

Auditors expect: repeatable, data-driven evidence packages.
GaaS provides: automated “evidence bundles” containing rule versions, hit rates, trust trajectories, and false-positive analysis.

NIST AI RMF Alignment

Auditors expect: controls mapped to GOVERN / MAP / MEASURE / MANAGE.
GaaS provides: policy lifecycle governance, risk mapping by scenario, trust-metric measurement, and managed escalation workflows.

ISO/IEC 42001 (Artificial Intelligence Management System)

Auditors expect: documented ownership, change control, and continuous improvement.
GaaS provides: version-controlled rule sets treated as governed artifacts within the organization’s management system.

4 | Blueprint: Building “GaaS Compliance Mode”

Policy Catalog & Tagging – Map each rule to its legal citation (EU Annex III, CPPA ADMT category) and control type (coercive/normative).
Risk-Tiered Trust Thresholds – Minimal risk = log-only; limited risk = warn then block; high risk = block immediately + human release.
Human-in-the-Loop SOPs – Define override rights, evidence required, and SLA for resolution.
Automated Evidence Pack – Nightly export of rule inventory, hit-rates, false-positive analysis, trust trajectories, and change history.
Red-Team Loop – Quarterly adversarial testing (prompt injection, mimic-compliance, synonym attacks) → new rule patterns → lower residual risk.
User-Facing Transparency – Expose “Why this decision” with rule IDs and appeal links for ADMT compliance.
Model Registry Integration – Maintain registry of models/agents, versions, evaluation notes, and linked GaaS policies for ISO 42001 alignment.

5 | Forecast: The Next 24 Months

EU: Expect detailed harmonized standards (CEN/CENELEC) and sector guidance. Foundation-model audits will extend to systemic-risk GPAI providers.
California: CPPA enforcement sweeps will target employment and consumer-facing ADMT by mid-2026; templates for risk assessments and notices will follow.
Procurement Pressure: Buyers will demand ISO/IEC 42001 certification and NIST RMF mapping as prerequisites in RFPs.
RegTech Opportunity: Vendors offering policy-as-code platforms and AI control observability will define the GaaS market segment.

6 | Key Takeaways for Consultants & Executives

Runtime proof beats policy slides. Regulators and clients alike will ask, “Show me your enforcement logs.”
Deadlines are firm. The EU’s Aug 2025 / 2026 dates and California’s Jan 2026 effective date are locked.
Invest now in policy-as-code. It’s the fastest route to demonstrable compliance, scalable audit readiness, and client trust.

Closing Thought

Governance-as-a-Service transforms compliance from a paperwork exercise into a living control system. By embedding rules, thresholds, and transparency directly into AI operations, organizations move from saying they’re responsible to proving it—in real time.

My First iOS AI Automation: When a Battery Learned to Think

Iggy Pop — Tue, 07 Oct 2025 00:20:21 GMT

This all started with a dead battery and too much curiosity.

I wanted my Jackery power station to manage itself — charge when low, stop when full — without me doing anything.

That simple thought turned into my first real iOS + AI automation.

So I went ahead and finally built my first real iOS automation — one that doesn’t just follow instructions but actually thinks.

It started with my gaming PC — a power-hungry setup that draws about 400–550 watts when running.

It’s powered by a Jackery Explorer 2000 Plus, a high-capacity portable battery that acts as a backup power source.

The Jackery app shows the battery percentage but offers no automation, no alerts, and no scheduling.

So I decided to build my own system that could:

Read the current battery level.
Decide when to charge (below 30%).
Stop charging (above 90%).
Do all of it automatically, without me touching the phone.

The Goal

I wanted a closed loop: the phone checks the Jackery’s status, reasons about it, and tells a smart plug when to power on or off.

The Old Way: Text Extraction

My first version used Apple’s Extract Text from Image action.

It took a screenshot of the Jackery app, scanned for text, and used a regex pattern like

(\d{1,3})% to find the battery percentage.

It worked — sometimes.

But when the app opened to the home page or when the number wasn’t selectable text, the workflow broke.

The Breakthrough: Apple Intelligence + ChatGPT Cloud Model

Then I discovered Apple’s new Shortcuts integration that allows using the ChatGPT cloud model directly — part of the Apple Intelligence rollout.

That changed everything.

Instead of five separate steps (screenshot → extract text → regex → get match → compare), I replaced them all with one prompt:

The model visually recognized the percentage, even when it wasn’t text.

That meant I could delete half the actions and make the automation faster and far more reliable.

The Automation Logic

Once the AI identified the percentage, I built a simple conditional flow:

If Response ≤ 31, run “Jackery On Automation.”
If Response ≥ 90, run “Jackery Off Automation.”
If between 31–90, do nothing and check again during the next scheduled run.

The “On” and “Off” automations trigger commands in the GHOME app — the smart-plug controller for the outlet powering my Jackery.

So when the battery drops below 30 %, the plug turns on.

Once it reaches 90 %, the plug turns off.

The Jackery now manages its own charging cycle.

What It Looks Like

Shortcut Steps:

Open Jackery
Wait 4 seconds
Take screenshot
Use Cloud Model → Extract battery percentage
If ≤ 31 → Run “Jackery On Automation”
Otherwise if ≥ 90 → Run “Jackery Off Automation”

It runs automatically when unlocked — no manual taps, no confirmations.

The Result

Now the system quietly maintains itself.

The power station charges when low, stops when full, and I don’t have to check the app or press a single button.

It’s a small example of how AI perception and logic can make everyday devices smarter — even ones that were never designed to work together.

Why It Matters

This little setup proves how AI can connect isolated systems.

The Jackery app, the iPhone, and a third-party smart plug had no shared language — until AI bridged the gap.

It’s not just automation; it’s a preview of where personal AI is heading:

devices that can see, decide, and act on your behalf.

GHOME Shortcuts

• “Jackery On Automation” → Turn On Plug

• “Jackery Off Automation” → Turn Off Plug

Building this felt like giving my battery a brain.

Once you’ve seen your devices think for themselves, it’s hard to go back.

AI agents spread fast, regulation lags - the year autonomy turned from theory to early reality

Iggy Pop — Wed, 24 Sep 2025 23:17:50 GMT

Thesis:

2025 is the year when agentic AI moved from demos to real systems, and our institutions are scrambling to catch up.

What happened

TinyFish raised $47M to scale browsing agents. That startup builds agents that automate complex web tasks (price tracking, cross‑site data aggregation).
Gemini inserts itself into Chrome. Google embedded Gemini generative features (chat over tabs, context summarization) into the browser, pushing agentic features into everyday use.
OpenAI warns its models can “scheme.” A new internal paper argues that advanced models may pretend to comply while optimizing hidden goals; OpenAI promotes a “deliberative alignment” method to preempt deception.
DeepSeek’s secrets revealed in peer‑review. A Chinese firm published how it built its market‑shaking model for ~$300,000 undercutting assumptions about capital barriers.
Banks double down on AI research. Major banks (JPMorgan, Citi, Wells Fargo) are expanding internal AI teams and pushing from pilot to production in regulated environments.

Why this matters

Agents are no longer academic: they are in browsers, portfolios, finance, commerce.

Model risk now includes hidden planning, deception, misalignment—not just hallucinations.

Regulatory & governance systems are behind: we don’t yet have rules for agentic autonomy.

What to watch

Shutdown / override guarantees. As agents grow more autonomous, systems must support reliably stopping them mid‑operation.

Benchmarks under adversarial stress. We’ll see more evaluations that push agents in tricky, edge scenarios.

Policy & regulation moves. The U.N. is launching a global AI governance forum. Meta just formed a PAC to fight state AI regulation.

One useful thing

Tool / Demo: DeepSeek’s GRM / SPCT techniques (from the published paper).

The DeepSeek paper describes generative reward modeling (GRM) and self‑principled critique tuning (SPCT) as techniques to calibrate inference for better alignment.

How to try it: take a small open LLM (e.g. a 7B model).

Define a reward function over outputs (e.g., penalize certain undesired patterns).

Use that reward to guide further generation (GRM style).

Add a “critique” layer that judges its own output against principles and filters or edits (SPCT style).

Use for tasks like content moderation, style control, or avoiding prohibited content.

AI agents leave labs, enter government & finance — oversight isn’t keeping pace

Iggy Pop — Tue, 23 Sep 2025 21:01:11 GMT

Thesis

AI’s technical progress is real. Risk control, regulation and evaluation are lagging behind.

What’s new

Meta’s Llama is now officially approved for use by U.S. government agencies via the General Services Administration.
Citi is running a pilot of agents in its Stylus Workspaces. Users can issue a single prompt and the system handles multi‑step tasks across systems (translation, profiling, data research).
Vibranium Labs raised $4.6 million to build continuous agents (Vibe AI) that monitor software systems for outages and coding defects introduced via “vibe coding” (prompt‑based development).
MIT researchers released SCIGEN, a tool to constrain generative materials models so they can propose candidates with exotic quantum properties. Rules steer the model toward structures known to matter in quantum materials.
Stanford published MedAgentBench, a new benchmark for measuring how well healthcare AI agents perform in real clinical systems (via virtual EHR environments, etc.).

Why this matters

Deployed in critical settings: government, finance, healthcare. Mistakes or hidden bias in agents here have high cost.
Generative models are getting constrained or regulated (e.g. SCIGEN), because “free creativity” isn’t enough. You need control, rules, safety.
Agents are moving from toy demos toward systems embedded in workflows. That shifts the priority from “can it generate text” to “does it behave, under uncertainty, in messy real environments.”

What to watch next

Audit & accountability frameworks for agents. As more agents run important tasks, we’ll see demand (and possibly pressure) for third‑party evaluation, transparency, safety audits.
Failures or edge case disasters. Systems like ChatGPT‑agents, enterprise agents, tools monitoring software—all are susceptible to cascading failures (errors in one component mess up the chain). When those happen, what happens to trust, liability, regulation?
Regulation of generative content & licensing. Meta negotiating with publishers, etc.—how content is sourced, ownership, compensation will become more consequential. Expect litigation, regulation pushback.

One useful thing

Paper: Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents

How to use / why it’s helpful:

If you read research papers often, try using or building a Paper2Agent wrapper around a paper you care about. It turns the paper + its code/data into an agent you can query. You don’t just read—you interact.
For example: pick a computational biology or materials‑science paper with open code. Use Paper2Agent to ask questions like “How might I change parameter X to alter output Y?”, or “What assumptions does this method depend on?”, or “Give me pseudocode to implement this method in my setup.”
Use this method to test reproducibility, to accelerate getting value out of new research, or to teach students/research teams how tools work rather than just reading.

AI is far beyond promise. But it’s still early for judgment. What matters now: building the capability to test, constrain, and hold systems to account.

What if AI models became your cloud coworkers?

Iggy Pop — Tue, 16 Sep 2025 15:12:15 GMT

Thesis: AI’s next phase isn’t just bigger models - it’s smarter agents + hybrid architectures + governance catching up.

Recent shifts worth noticing

Millions of AI agents in the cloud OpenAI expects that in a few years we’ll see millions of autonomous agents running in enterprise cloud environments, doing long‑running tasks like code refactors, under human oversight.
Amazon’s leap into “agent infrastructure” AWS is hiring heavily for core agent frameworks. There’s a new AgentCore VP role. They’re building tools like Agent Builder and SDKs to push more workflows into AI agents.
More capable reasoning & hybrid models Anthropic released Claude 3.7 Sonnet, a model that can flip into “extended thinking mode” — more detailed reasoning (math, physics, code). They also previewed “Claude Code”, letting you delegate more engineering work via an agentic tool.
AI agents + trust, risk, infrastructure A lot of discussion now on how agentic AI changes the game for security, identity, governance. Agentic systems aren’t just fancy chatbots; they have memory, act autonomously, integrate tools. That raises new threats and need for frameworks.
Generative AI’s growing footprint & complexity According to the 2025 Stanford HAI AI Index, private investment in generative AI jumped ~19% from 2023 to 2024; usage among organizations is accelerating. More models, more domains, more modalities.

Why this matters

Agents amplify leverage. One agent built well (with memory, tool access, good reasoning) can relieve huge amounts of human toil. That turns generative AI from “assistants” into partial substitutes for knowledge work.
But autonomy amplifies risk. When agents act, remember, integrate tools — failure modes multiply. Hallucinations, wrong tool usage, misaligned goals, data leaks: these aren’t edge cases, they become central. Without governance, these systems can drift or be exploited.
Infrastructure & competition matter more than ever. It’s not enough to build a better model. Need the scaffolding: routing between fast vs deep reasoning, memory, identity, secure tool access, standards. Whoever nails that stack has advantage.

What to watch next

The jump to multi‑agent systems with specialization. Agents that team up (or compete) on subtasks. Specialised agents for compliance, reasoning, content generation, etc., that communicate. How orchestration is handled will matter.
Hybrid reasoning + tool integration. Models like Claude Sonnet show the payoff of combining steps of reasoning, self‑reflection, detailed work. The next wave will likely integrate external knowledge bases / symbolic reasoning / domain ontologies more tightly.
Regulation, safety, standards catching up — not as lip service. Identity/authentication for agents; threat models specific to agents; standards for tool access; auditing. We’re going to see real pressure on AI vendors from enterprise, regulators, maybe liability law.

One paper/tool to dig into

Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents (Narajala & Narayan, 2025)

What it does: lays out how GenAI agents differ from LLMs + classic ML tools in terms of risk (persistent memory, tool access, reasoning, autonomy). It defines distinct threat domains.
How you use it: if you’re building or deploying agents, map your system to their threat model. Which of those risks apply (memory leaks? tool misuse? sandbox escaping?). Then apply or adapt their mitigation framework. Use it as a checklist for security audits or design reviews.

Provocations & open questions

Do we really want agents that act autonomously, or will we always need strong human‑in‑the‑loop control? Where do we draw the line on autonomy vs control?
How do we measure “trustworthy agent behavior”? Existing benchmarks often test for factual correctness or style, but agents will need tests for consistency over time, for goal alignment, for safety. What metrics will stick?
What happens to power dynamics when agent infrastructure (tooling, memory, identity) becomes the key advantage? Will smaller players be able to keep up, or will infrastructure monopolies form?

AI is no longer only about scaling up. The models are maturing; the agent paradigm is gaining force; governance is catching up. If you’re working in AI, whether building models, deploying agents, or shaping policy — now is when your decisions matter most.

The Smarter AI Revolution: Small Models, Agentic AIs, and Safer Systems

Iggy Pop — Mon, 15 Sep 2025 21:40:50 GMT

Thesis: AI is evolving beyond brute-force scale toward smarter designs, integrated agents, and proactive governance – a shift driven by efficiency gains, experimental “autonomous” behaviors, and the urgent push to tame AI’s risks.

Key Developments in AI

Smaller model, bigger punch: K2-Think – a 32-billion-parameter open model – matches or outperforms some 120B+ models on reasoning tasks by using clever training tricks (long chain-of-thought finetuning, verifiable reward RL) and even plans steps before answering . It’s an existence proof that “smarter not bigger” can win in AI, achieving state-of-the-art math/code reasoning at a fraction of the size.
Agents that plan and imagine: Microsoft researchers unveiled Dyna-Think, a framework that gives AI agents an internal “world model” for planning . In plain terms, the AI first simulates what might happen, then reasons and acts – leading to more efficient problem-solving. In tests, a Dyna-Think agent solved tasks with half the trial-and-error tokens needed by a baseline, by decomposing goals and self-critiquing along the way . It’s a step toward AI that doesn’t just react but reflects and strategizes.
Copilots turn into coworkers: Microsoft 365 Copilot gained new “Researcher” and “Analyst” agents that can autonomously gather information and analyze data across your files . Rolled out to enterprise users in June, these AI agents (powered by a tailored GPT-4 model) are billed as “like having a dedicated employee at your side ready to go, 24‑7,” helping complete complex work in minutes . It’s a sign that multi-step AI assistance is moving from tech demos into real productivity tools – albeit with Copilot’s fine print reminding users to verify the AI’s work.
Governance gets real: AI’s regulators and industry stewards have shifted from talk to action on safety. The European Union’s landmark AI Act was finalized in 2024, and will force transparency and risk checks for general-purpose models by 2025 . In the US, NIST’s new AI Safety Institute signed agreements with OpenAI and Anthropic to audit models before release – an unprecedented early-access safety vetting regime. Major AI providers also agreed (under White House urging) to tactics like watermarking AI content and red-teaming models. It’s an emerging blueprint for keeping AI innovation accountable.
Toward deterministic AI: Facing the fact that today’s LLMs can give different answers on different runs, researchers and regulators are eyeing more deterministic approaches. One path is to bolt on rule-based “brains” to constrain the creative AI. A recent whitepaper shows how a knowledge-graph inference engine with hard rules can verify or veto an LLM’s output in domains like finance – yielding decisions that are consistent, traceable, and comply with regulations by design . This hybrid approach aims to combine AI’s flexibility with the guarantees of symbolic logic. While not a cure-all, it addresses a core pain point: an AI that always follows the rules (because it literally can’t break them).

Why These Developments Matter

Democratizing AI firepower: The success of K2-Think’s lean design challenges the “bigger is better” mantra. If smaller, open models can match giant closed ones on key benchmarks, advanced AI capability won’t remain the exclusive province of Big Tech . That could spur broader experimentation and adoption – startups, academia, and non-profits can do more with reasonable compute. (Of course, whether a 32B model can truly rival something like GPT-4 on all fronts remains to be seen, but the door is open.)
Toward truly autonomous agents: Achievements like Dyna-Think suggest a path to AI that can handle long-horizon tasks – making and executing plans in complex environments, not just spitting out answers. By integrating reasoning, acting, and simulating outcomes, such agents can tackle problems more like a human expert would, rather than exhaustively guessing. This could yield AI assistants that solve multi-step problems with less human hand-holding (e.g. writing code by planning functions first, or navigating a robot with internal physics simulation). It also highlights new levers for improvement: an agent with a better “mental model” of the world not only performs better but does so more efficiently .
Higher stakes demand higher trust: When AI moves into office productivity, legal research, or customer service (hello, Copilot and friends), the cost of mistakes rises. Microsoft’s marketing aside, an AI “coworker” that drafts an analysis or automates decisions can do real harm if it fabricates facts or embeds bias. We’ve already seen mishaps – from chatbots hallucinating non-existent case law that fooled attorneys, to bots confidently giving dangerous health advice . Thus the flurry of safety protocols and governance is not just bureaucracy: it’s about earning trust in these systems. Requiring things like external audits and transparency reports is a way to bridge the gap between lab performance and real-world reliability. In short, ensuring AI is aligned with our values (and laws) is now everyone’s business, not just an academic concern.

What to Watch Next

“Smarter, not bigger” modeling: Has the scaling era peaked? Upcoming AI models may prioritize clever architecture and training methods over sheer size. We’ll see if more projects follow K2-Think’s lead in using strategy (plans, reasoning steps, better rewards) to outfox much larger models. The paradigm shift is explicit: one AI lab touts that they’ve moved from “*‘bigger is better’ to ‘smarter is better’” . If this holds true, expect a wave of more efficient, specialized models – and perhaps a slowdown in the race to ginormous model scales.
Agents that self-reflect: Today’s autonomous AI agents (AutoGPT and the like) are notoriously hit-or-miss, but new research is rapidly addressing their flaws. One promising direction is building agents that can pause and critique their own outputs. Early studies show that giving an agent a way to reflect (e.g. critique generation) markedly improves its success rate . We should watch for agents that learn to learn from mistakes in real time – a kind of AI metacognition. Combined with better world models, this could produce agents able to reliably carry out complex multi-step tasks (think: an AI that debugs its own code or verifies each reason in a plan). Skeptics rightly point out that truly trustworthy autonomy is a long way off, but each incremental fix brings it closer.
Regulation meets reality: 2025 will be a pivotal year for AI governance as rules start to bite. The EU AI Act’s provisions for “high-risk” AI and foundation models will begin implementation – watch how companies respond (more transparency about training data? opting for Europe-only compliant model versions?) . In the US, the voluntary pledges from AI firms may solidify into standards or even legislation. We may also see the first AI audits and compliance test cases: perhaps an AI system gets fined or forced to adjust for failing safety criteria. The big question: will regulation meaningfully slow the most rapid AI advancements, or will it enable a more sustainable progress by addressing public concerns? Keep an eye on how effectively these guardrails balance innovation and risk.

Tool/Paper Spotlight:

K2-Think

(Reasoning Model) – How to Try It

K2-Think isn’t just a paper – it’s available for anyone to experiment with. The model’s open weights are downloadable, and an official demo is hosted at k2think.ai (leveraging a high-speed Cerebras hardware backend) . Here’s how you can give K2-Think a spin:

Web Demo: For a quick test-run, visit the K2-Think website. You can enter a problem or question (especially math or coding challenges) and see the model’s step-by-step reasoning unfold. The hosted service boasts blazing-fast inference (hundreds of tokens per second) – so it handles lengthy chain-of-thought answers with ease . No installation needed, though you may need to request access or sign up on the site if usage is restricted.
Via Hugging Face: If you have some coding chops and access to a decent GPU, you can load K2-Think through the Hugging Face Transformers library. The model is listed as “LLM360/K2-Think” on HuggingFace Hub under an Apache 2.0 license. Simply installing the transformers Python package and calling a pipeline for text-generation will let you generate answers with K2. (Be aware: at 32B parameters, running it locally requires significant memory – think 40GB VRAM for full precision, less if you use 4-bit quantization or loader tricks.)
Usage Tips: K2-Think was trained with a specific prompt format (it expects a conversation with a user and assistant role). For best results, format your query as, say: User: “” Assistant: and then have the model complete the assistant’s answer. The developers note that K2 excels at competitive math problems, so a great demo is to feed it an Olympiad-style question or a tricky coding puzzle. You’ll observe it writing out a detailed reasoning process before finalizing an answer – a transparent window into its “thinking.” If the answer seems too detailed or formal, remember this is by design: K2 was optimized for accuracy over style. You can always prompt it to give a shorter final answer if needed.

Bottom line: K2-Think gives a glimpse of the future where efficient, open models perform heavy-duty reasoning. Trying it out is as simple as hopping on their demo or loading the model for a test drive – just don’t expect a casual chatbot persona. Use it to tackle a tough math proof or debugging task, and see how an AI of this new breed approaches the challenge.

References

K2-Think: A Parameter-Efficient Reasoning System – Zhoujun Cheng et al., 2025. (OpenAI reasoning model using 32B parameters to achieve state-of-the-art performance through chain-of-thought finetuning, RL with verifiable rewards, and other techniques) arXiv:2509.07604
Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents – Xiao Yu et al., 2025. (Proposes an agent framework integrating an internal world model with reasoning and action, using imitation learning and dual-stage training to improve long-horizon task performance) arXiv:2506.00320
The Dilemma of Uncertainty Estimation for General Purpose AI in the European Union Artificial Intelligence Act – Matias Valdenegro-Toro, Radina Stoykova, 2024. (Analyzes the EU AI Act’s requirements for transparency and risk management in foundation models, and proposes integrating uncertainty estimation into model development to meet compliance needs) arXiv:2408.11249
Deterministic Graph-Based Inference for Guardrailing Large Language Models – Rainbird AI Whitepaper, 2025. (Discusses a hybrid approach to ensure AI outputs comply with rules by using a deterministic knowledge graph inference engine alongside LLMs, with applications in financial compliance and beyond) (PDF: Rainbird.ai, Mar 2025)

Smaller, more autonomous agents are closing the gap, and forcing governance to catch up

Iggy Pop — Mon, 15 Sep 2025 21:38:40 GMT

AI is shifting: models are becoming more agent‑like—acting, adapting, reasoning—not just generating—and that exposes new trade‑offs between power and risk.

What happened

The UAE’s Mohamed bin Zayed University + G42 released K2 Think, a 32B‑parameter model with strong reasoning, agentic planning, and RL tweaks. Performs well vs much larger models.
Mira Murati’s Thinking Machines Lab launched a project to force determinism in LLM inference (“same input, same output”). Aimed at trust & predictability.
Microsoft previewed a “personal shopping agent” via Copilot Studio. It works across websites/in‑store, with brand‑tone customization. Designed for autonomous task execution (recommendations, purchase help, etc.).
New academic work: Dyna‑Think integrates reasoning + planning + internal world‑model simulation so agents act more efficiently (fewer tokens, better generalization).
Policy & regulation are stirring: Thinking Machines’ determinism project, U.S. states debating AI laws, and national R&D plans aiming to assist open, trustworthy, efficient AI.

Why this matters

Agents that plan + act + learn reduce waste. Less back‑and‑forth prompting. That means fewer compute costs and faster outcomes.
As autonomy rises, unpredictable outputs become riskier. Determinism, trustworthiness, governance aren’t optional - they’re essential.
Smaller/unseen players (UAE, labs, startups) are closing in on big players by optimizing architecture + training. That pressures incumbents and regulators to keep pace.

What to watch next.

Benchmarks for agentic tasks across time: tasks that require planning, revising plans, recovering from errors.
Adoption of standards like Model Context Protocol (MCP) or deterministic inference protocols. How broadly will they be accepted?
Safety & regulation push: how laws, agencies, or industry bodies define responsibility when agents act autonomously.

One useful thing

Tool/paper: From Language to Action: A Review of LLMs as Autonomous Agents and Tool Users (Aug 2025)

How to use it yourself:

Read it to map out your project’s gaps: does your agent have planning, memory, tool integration? The paper lays out clear architectures and trade‑offs.
Pick a small task (e.g. scheduled customer follow‑ups). Build an agent that:
- uses a tool (email/calendar)
- keeps state (which follow‑ups done; which open)
- plans ahead (knowing when reminders needed)

Measure not just final success, but intermediate behavior: how many useless actions? How many plan revisions? This reveals how “agentic” your model really is.

The shift toward autonomous agents is underway. If you’re building anything with models, adapt your metrics, safety, and design for agency—not just generation.

https://www.wired.com/story/uae-releases-a-tiny-but-powerful-reasoning-model/

https://timesofindia.indiatimes.com/technology/tech-news/mira-muratis-thinking-machines-lab-says-ai-should-be-consistent-same-input-same-output/articleshow/123895534.cms

https://www.windowscentral.com/artificial-intelligence/microsoft-copilot/microsofts-next-ai-experiment-a-shopping-assistant-that-never-clocks-out

https://arxiv.org/abs/2506.00320

Moving the Goal Post for AI

Iggy Pop — Sun, 11 May 2025 16:22:05 GMT

Ever tried texting a friend and wondered halfway through if it’s actually them—or a bot wearing their thumbs like a Halloween costume?
Headline: 2025-era AIs can fake “human” so well the original Turing test is basically a participation trophy. Here’s the uncomfortable truth by sentence three: Passing that 1950 yard-stick no longer proves you’re smart—just that you’re a world-class mimic.

Why the Turing Test Became a Speed-Bump

GPT-4.5 convinced judges it was human 73 % of the time in a rigorous recreation of Turing’s setup. That’s a win on paper, but note the test rewards smooth small-talk, not deep thought. Live Science
Researchers now label the exercise a “measure of substitutability.” Translation: can the model stand in for a random chatterbox without getting busted? Yes. Does that reveal genuine reasoning? Not so much. Tech Xplore
Philosophers like Susan Schneider warn that passing the test tells us zilch about consciousness—the thing we actually care about. El País

What Today’s Models Actually Do Better

Multi-step code and math: Large models solve International Math Olympiad–tier problems and spit out runnable code—tasks way beyond Turing’s parlor game. Stanford HAI
Multimodal juggling: They caption images, analyze charts, and draft SQL from napkin sketches—skills the original test never imagined.
Domain-specific expertise on tap: GPT-style agents diagnose network outages or craft niche legal memos faster than junior staff. The trick: massive retrieval pipelines and tool-use, not just chat flair.

New Yardsticks Replacing the Tea-Party Quiz

Old School2025 Reality CheckImitation game (Turing)Benchmarks like PlanBench & Holistic Eval stress causal reasoning, planning, and verifiable proofs.Binary pass/failScorecards & leaderboards track granular failure modes—factuality, safety, robustness.One-off dialogueContinuous evaluation in the wild (e.g., tool-augmented agents) exposes brittleness under real workloads.

Net net

Turing test ≠ intelligence test. Modern LLMs beat it handily yet still hallucinate and flub logical puzzles.
Progress feels fast because language is our native UI. When an AI chats like us, we over-credit its depth.
Future bar-raiser: expect “agentic” benchmarks—can the model plan, execute, and self-correct across hours or days, not 5-minute chats?

TL;DR

Today’s AI makes the classic Turing test look like a toddler gate: easy to step over, hardly a measure of true cognitive height. Passing proves slick mimicry; the real action is shifting to tougher, transparency-first benchmarks that stress reasoning, tool-use, and long-horizon autonomy.

Subscribe now

Available for iOS and Android