<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Iggy Pop]]></title><description><![CDATA[Iggy Pop]]></description><link>https://iggypop1.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!2Tec!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Figgypop1.substack.com%2Fimg%2Fsubstack.png</url><title>Iggy Pop</title><link>https://iggypop1.substack.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 19 Jun 2026 22:44:07 GMT</lastBuildDate><atom:link href="https://iggypop1.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Iggy Pop]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[iggypop1@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[iggypop1@substack.com]]></itunes:email><itunes:name><![CDATA[Iggy Pop]]></itunes:name></itunes:owner><itunes:author><![CDATA[Iggy Pop]]></itunes:author><googleplay:owner><![CDATA[iggypop1@substack.com]]></googleplay:owner><googleplay:email><![CDATA[iggypop1@substack.com]]></googleplay:email><googleplay:author><![CDATA[Iggy Pop]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[“Fine-tuning is dead. 
Long live memory.”]]></title><description><![CDATA[article based on https://arxiv.org/pdf/2603.18743]]></description><link>https://iggypop1.substack.com/p/fine-tuning-is-dead-long-live-memory</link><guid isPermaLink="false">https://iggypop1.substack.com/p/fine-tuning-is-dead-long-live-memory</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Wed, 25 Mar 2026 20:16:25 GMT</pubDate><content:encoded><![CDATA[<p></p><p>article based on https://arxiv.org/pdf/2603.18743</p><p>That&#8217;s the punchline.</p><p>This paper argues something blunt: we don&#8217;t need to keep retraining models to get smarter&#8212;we need agents that learn outside the model.</p><p>And not in a hand-wavy way. In a system that actually improves itself over time.</p><p>Let&#8217;s break it down.</p><p>The problem nobody wants to admit</p><p>You&#8217;ve seen this before:</p><p>You deploy an LLM agent</p><p>It works&#8230; okay</p><p>You throw more data, more GPUs, more prompts at it</p><p>It barely improves</p><p>Sound familiar?</p><p>The paper calls this out directly:</p><p>Most deployed LLMs are frozen&#8212;they don&#8217;t learn from experience at all</p><p>And that&#8217;s the real issue.</p><p>You&#8217;re running something that can&#8217;t get better from doing the job.</p><p>The core idea (and why it matters)</p><p>Here&#8217;s the shift:</p><p>Stop training the model. 
Start evolving the agent.</p><p>Memento-Skills introduces a system where:</p><p>The LLM stays fixed</p><p>All learning happens in an external &#8220;skill memory&#8221;</p><p>The agent improves by rewriting its own tools and prompts over time</p><p>&#8220;All adaptation is realised through the evolution of externalised skills and prompts.&#8221;</p><p>Net net:</p><p>&#128073; The intelligence moves from weights &#8594; to memory</p><p>Think of it like this</p><p>LLMs are:</p><p>Brilliant assistants</p><p>Terrible long-term learners</p><p>This system turns them into:</p><p>Operators with memory</p><p>That build better operators</p><p>Or more bluntly:</p><p>The agent becomes a system that designs better versions of itself</p><p>How it actually works (no fluff)</p><p>The whole system runs on one loop:</p><p>Read &#8594; Act &#8594; Write</p><p>1. Read</p><p>Look into memory</p><p>Pick the most relevant &#8220;skill&#8221; (code + prompt + logic)</p><p>2. Act</p><p>Use the LLM to execute that skill</p><p>3. 
Write</p><p>Evaluate what happened</p><p>Update or create new skills</p><p>Repeat forever.</p><p>The paper calls this Read&#8211;Write Reflective Learning</p><p>Why this is different</p><p>Most &#8220;AI agents&#8221; today:</p><p>Use static prompts</p><p>Maybe retrieve docs</p><p>Don&#8217;t actually improve their behavior</p><p>This one:</p><p>Stores executable skills</p><p>Edits them after failures</p><p>Builds new ones when needed</p><p>That&#8217;s a big leap.</p><p>The uncomfortable truth</p><p>Here it is:</p><p>Semantic similarity is useless for real work.</p><p>The paper shows:</p><p>Traditional retrieval picks &#8220;similar-looking&#8221; solutions</p><p>But those often fail in execution</p><p>Example:</p><p>A refund request matched a password reset skill with 0.91 similarity</p><p>That&#8217;s exactly the problem you&#8217;ve seen in production.</p><p>So they fix it by:</p><p>&#128073; Training the router to pick skills based on execution success, not text similarity</p><p>The system is basically doing this</p><p>Every failure triggers:</p><p>Root cause analysis</p><p>Skill rewrite</p><p>Optional skill replacement</p><p>Unit testing before saving</p><p>It&#8217;s not just memory.</p><p>It&#8217;s self-debugging memory.</p><p>The results (this is the part people care about)</p><p>The gains are not subtle:</p><p>+26% to +116% improvement depending on benchmark</p><p>Skill library grows from:</p><p>5 &#8594; 41 &#8594; 235 skills</p><p>Performance steadily improves across iterations</p><p>And importantly:</p><p>&#128073; No model retraining</p><p>Why this actually works</p><p>The paper explains it cleanly:</p><p>As the agent learns:</p><p>Skills get better</p><p>Coverage increases</p><p>Retrieval improves</p><p>Errors shrink</p><p>Over time, the system converges.</p><p>Or in plain English:</p><p>The agent builds a dense map of &#8220;how to solve things&#8221; and stops guessing.</p><p>Where this breaks (and why it matters)</p><p>This isn&#8217;t 
magic.</p><p>Two key constraints show up:</p><p>1. Domain alignment matters</p><p>Skills transfer well only when tasks are similar</p><p>Random tasks = weak reuse</p><p>2. You still need structure</p><p>The system works best when problems cluster</p><p>Chaos in &#8594; chaos out</p><p>What this means for you</p><p>Let&#8217;s translate this into reality.</p><p>Stop doing this</p><p>Endless prompt tweaking</p><p>Fine-tuning for every edge case</p><p>Static agent workflows</p><p>Start doing this</p><p>Build systems that:</p><p>Store solutions</p><p>Evaluate outcomes</p><p>Improve tools automatically</p><p>The bigger shift (this is the real takeaway)</p><p>We&#8217;re moving from:</p><p>&#8220;Model-centric AI&#8221;</p><p>Train better weights</p><p>&#8594;</p><p>&#8220;System-centric AI&#8221;</p><p>Build systems that learn while running</p><p>Mic-drop</p><p>The smartest AI systems won&#8217;t be the ones with the best models.</p><p>They&#8217;ll be the ones that remember, adapt, and rewrite themselves fastest.</p><p>TL;DR</p><p>LLMs don&#8217;t learn after deployment</p><p>This system fixes that using external skill memory</p><p>Agents improve by rewriting their own tools</p><p>No retraining required</p><p>Big performance gains</p><p>Future = self-evolving agents, not bigger models</p>]]></content:encoded></item><item><title><![CDATA[Why “Nested Learning” Might Be the Missing Piece for Lifelong AI and How It Aligns With Agent Memory]]></title><description><![CDATA[A simple walkthrough of how models learn inside layers of learning &#8212; and how this connects to new breakthroughs in persistent memory systems.]]></description><link>https://iggypop1.substack.com/p/why-nested-learning-might-be-the</link><guid isPermaLink="false">https://iggypop1.substack.com/p/why-nested-learning-might-be-the</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Tue, 25 Nov 2025 17:14:44 GMT</pubDate><content:encoded><![CDATA[<p><strong>A simple walkthrough of how models learn 
inside layers of learning &#8212; and how this connects to new breakthroughs in persistent memory systems.</strong></p><p></p><p></p><p></p><p><strong>The Forgetful Genius Problem</strong></p><p></p><p></p><p>You&#8217;ve probably seen it happen: you explain something to an AI, it answers perfectly&#8230; and five minutes later it behaves like the conversation never happened. I covered this in my previous post, so check that out if you want more depth.</p><p></p><p>This isn&#8217;t a bug &#8212; it&#8217;s a fundamental limitation of today&#8217;s large language models.</p><p></p><p>Models only have two memory buckets:</p><p></p><ul><li><p>Short-term: whatever fits in the prompt</p></li><li><p>Long-term: whatever was trained into the weights months ago</p></li></ul><p></p><p></p><p>Nothing in between.</p><p></p><p>So when we ask AI agents to do complex, multi-step, long-horizon work, this gap shows up everywhere:</p><p></p><ul><li><p>An agent forgets rules mid-task</p></li><li><p>A tutoring AI loses track of your progress</p></li><li><p>A workflow assistant repeats mistakes</p></li><li><p>A reasoning agent contradicts its earlier conclusions</p></li></ul><p></p><p></p><p>Nested Learning tries to solve this missing middle.</p><p></p><p></p><p></p><p></p><p><strong>The Big Idea: Models Don&#8217;t Just Learn &#8212; They Learn How To Learn</strong></p><p></p><p></p><p>Nested Learning reframes neural networks as nested memory systems, not just giant stacks of matrix multiplications.</p><p></p><p>Inside any modern model are actually multiple learning processes running at different speeds:</p><p></p><ul><li><p>A fast process that updates every token</p></li><li><p>A slower one that tracks sequences</p></li><li><p>Slower processes that shape representations over many samples</p></li><li><p>And the slowest processes that govern how all the above operate</p></li></ul><p></p><p></p><p>Think of it as learning loops inside learning loops.</p><p></p><p>This allows a
model to:</p><p></p><ul><li><p>absorb short-term context</p></li><li><p>consolidate medium-term structure</p></li><li><p>accumulate long-term patterns</p></li><li><p>adjust its own internal update rules</p></li></ul><p></p><p></p><p>This is basically giving models a built-in hierarchy of memories.</p><p></p><p></p><p></p><p></p><p><strong>What Nested Learning Looks Like in Practice</strong></p><p></p><p></p><p>Here&#8217;s how it works in simple terms.</p><p></p><p></p><p><strong>1. Different parts of the model update at different rates</strong></p><p></p><p></p><p>Some components adjust every step (similar to working memory).</p><p>Others update slowly (similar to long-term memory).</p><p></p><p></p><p><strong>2. Each &#8220;learning level&#8221; compresses a different type of context</strong></p><p></p><p></p><ul><li><p>token-level context</p></li><li><p>gradient flows</p></li><li><p>sequence structure</p></li><li><p>surprise/error signals</p></li></ul><p></p><p></p><p>Each level stores something different.</p><p></p><p></p><p><strong>3. 
These levels interact</strong></p><p></p><p></p><p>Fast learners feed slower ones.</p><p>Slower learners regulate the fast ones.</p><p></p><p>This multi-timescale design mirrors how humans learn and remember.</p><p></p><p></p><p></p><p></p><p><strong>Even Optimizers Are Part of The Story</strong></p><p></p><p></p><p>Nested Learning points out something unintuitive:</p><p></p><p>Your optimizer is part of the memory system.</p><p></p><p>Momentum, Adam, RMSprop &#8212; they all store:</p><p></p><ul><li><p>gradient histories</p></li><li><p>variance estimates</p></li><li><p>running statistics</p></li></ul><p></p><p></p><p>They&#8217;re learning modules inside the larger learner.</p><p></p><p>This means the distinction between &#8220;architecture&#8221; and &#8220;memory system&#8221; is blurrier than we thought.</p><p></p><p></p><p></p><p></p><p><strong>Dynamic Nested Hierarchies: The Next Jump</strong></p><p></p><p></p><p>A second paper, Dynamic Nested Hierarchies, pushes this further.</p><p></p><p>Instead of fixing the hierarchy&#8230;</p><p></p><p>The model can add, remove, and reshape learning layers while it runs.</p><p></p><p>It becomes self-organizing:</p><p></p><ul><li><p>growing new learning layers for complex tasks</p></li><li><p>pruning ones that aren&#8217;t useful</p></li><li><p>adjusting update speeds on the fly</p></li><li><p>reshaping its internal reasoning pathways</p></li></ul><p></p><p></p><p>This unlocks:</p><p></p><ul><li><p>lifelong learning</p></li><li><p>adaptability</p></li><li><p>stability during long tasks</p></li><li><p>better transfer across domains</p></li></ul><p></p><p></p><p>Most importantly:</p><p>the model doesn&#8217;t catastrophically forget when new tasks appear.</p><p></p><p></p><p></p><p></p><p><strong>How This Connects to Persistent Memory Systems</strong></p><p></p><p></p><p>Several recent papers describe external, long-term memory systems for agents:</p><p></p><ul><li><p>Mem0</p></li><li><p>Multiple
Memory Systems</p></li><li><p>SEDM (Scalable Self-Evolving Distributed Memory)</p></li><li><p>LCNC Contextual Consistency + Intelligent Decay</p></li></ul><p></p><p></p><p>These systems sit outside the model and handle persistent knowledge across sessions, days, or tasks.</p><p></p><p>Nested Learning sits inside the model and handles multi-timescale internal learning.</p><p></p><p>They solve different problems &#8212; but together, they create something powerful.</p><p></p><p></p><p></p><p></p><p><strong>Nested Learning = Internal Memory</strong></p><p><strong>Persistent Memory = External Memory</strong></p><p></p><p>Here&#8217;s the clean breakdown.</p><p></p><p></p><p><strong>Nested Learning handles</strong></p><p></p><p></p><ul><li><p>how the model updates internally</p></li><li><p>how representations evolve</p></li><li><p>how short-term becomes long-term</p></li><li><p>how internal memory is structured</p></li></ul><p></p><p></p><p></p><p><strong>Persistent Memory systems handle</strong></p><p></p><p></p><ul><li><p>episodic storage</p></li><li><p>semantic abstraction</p></li><li><p>retrieval</p></li><li><p>pruning / consolidation</p></li><li><p>cross-domain transfer</p></li><li><p>continuity over long-running agent workflows</p></li></ul><p></p><p></p><p>Across these papers:</p><p></p><p></p><p><strong>&#8226; The Multiple Memory Systems paper</strong></p><p></p><p></p><p>creates episodic + semantic external stores.</p><p></p><p></p><p></p><p><strong>&#8226; Mem0</strong></p><p></p><p></p><p>adds CRUD, structured schemas, and production-ready agent memory.</p><p></p><p></p><p></p><p><strong>&#8226; SEDM</strong></p><p></p><p></p><p>adds verifiable write admission, A/B replay, consolidation, utility scoring, and diffusion.</p><p>(diagrams on pages 1&#8211;6)</p><p></p><p></p><p></p><p><strong>&#8226; The LCNC contextual consistency paper</strong></p><p></p><p></p><p>adds
intelligent decay, recency/relevance scoring, and user-governed utility.</p><p>(pages 3&#8211;7)</p><p></p><p></p><p>These are all external systems.</p><p></p><p>Nested Learning is internal.</p><p></p><p>And the two are highly complementary.</p><p></p><p></p><p></p><p></p><p><strong>Where They Reinforce Each Other</strong></p><p></p><p></p><p></p><p><strong>1. Persistent memory provides &#8220;clean experience.&#8221; Nested Learning internalizes it.</strong></p><p></p><p>Persistent memory systems filter experiences first, so the model only internalizes:</p><p></p><ul><li><p>verified reasoning</p></li><li><p>correct patterns</p></li><li><p>distilled summaries</p></li><li><p>reusable insights</p></li></ul><p></p><p></p><p>This prevents internal memory pollution.</p><p></p><p></p><p></p><p></p><p><strong>2. Nested Learning reduces the load on external memory.</strong></p><p></p><p></p><p>Because Nested Learning introduces multiple internal timescales, the model:</p><p></p><ul><li><p>holds context better</p></li><li><p>needs fewer giant prompts</p></li><li><p>avoids information drift</p></li><li><p>keeps medium-term state without external retrieval</p></li></ul><p></p><p></p><p>This aligns with the problems identified in:</p><p></p><ul><li><p>LCNC &#8220;memory inflation&#8221; and &#8220;contextual degradation&#8221; (pages 1&#8211;4)</p></li><li><p>the Multiple Memory Systems paper</p></li></ul><p></p><p></p><p></p><p></p><p></p><p><strong>3.
They form a self-improving loop.</strong></p><p></p><p></p><p>Together:</p><p></p><p>External memory &#8594; Nested internal consolidation &#8594; Better reasoning &#8594; Better memory &#8594; Repeat</p><p></p><p>This is essentially the architecture of a true self-evolving AI agent.</p><p></p><p></p><p></p><p></p><p><strong>Why This Matters</strong></p><p></p><p></p><p>If Nested Learning matures &#8212; and if persistent memory systems keep improving &#8212; we end up with:</p><p></p><ul><li><p>agents that don&#8217;t forget</p></li><li><p>models that adapt during use</p></li><li><p>workflows with continuous improvement</p></li><li><p>stable reasoning over long horizons</p></li><li><p>safe, auditable growth of knowledge</p></li></ul><p></p><p></p><p>Instead of bigger models, we get better learners.</p><p></p><p></p><p></p><p></p><p><strong>References</strong></p><p></p><p></p><p></p><p><strong>Nested Learning &amp; Dynamic Nested Hierarchies</strong></p><p></p><p></p><ul><li><p>Nested Learning: The Illusion of Deep Learning Architectures</p></li><li><p>Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence</p></li></ul><p></p><p></p><p></p><p></p><p></p><p><strong>Persistent Memory &amp; Agent Memory Systems</strong></p><p></p><p></p><ul><li><p>Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory</p></li><li><p>Multiple Memory Systems for Enhancing the Long-Term Memory of Agents</p></li><li><p>SEDM: Scalable Self-Evolving Distributed Memory for Agents</p></li><li><p>Memory Management and Contextual Consistency for Long-Running Low-Code Agents</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Your AI Agent Has a Memory Problem. 
Here’s the Fix.]]></title><description><![CDATA[I&#8217;ve seen it happen more times than I can count.]]></description><link>https://iggypop1.substack.com/p/your-ai-agent-has-a-memory-problem</link><guid isPermaLink="false">https://iggypop1.substack.com/p/your-ai-agent-has-a-memory-problem</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Thu, 20 Nov 2025 23:17:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bm4v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bm4v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bm4v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!bm4v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!bm4v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bm4v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bm4v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2988710,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://iggypop1.substack.com/i/179504246?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bm4v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!bm4v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!bm4v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!bm4v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05536abb-fbc8-41f7-9b5f-e27eac7ced38_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><p>I&#8217;ve seen it happen more times than I can count. You spend weeks building a sophisticated AI agent system.
It works beautifully in short tests. But when you set it loose on a long-running, complex task, you watch as it starts to slowly go braindead. It forgets critical instructions from hours ago. It repeats the same errors, unable to learn from its mistakes. The consistent, intelligent system you built degrades into a stateless, incoherent mess, accumulating errors until it&#8217;s worse than useless. It&#8217;s one of the most frustrating, and common, problems in our field.</p><p>The solution isn&#8217;t to just keep cramming more history into an ever-larger context window. That&#8217;s a trap. The solution is to stop building agents with amnesia and start engineering a proper memory system. To put it simply: LLMs need a hippocampus, not a bigger hard drive.</p><h2>The Uncomfortable Truth About Giant Context Windows</h2><h3>The Problem: More Context, More Problems</h3><p>The industry&#8217;s obsession with massive context windows&#8212;from 200K to 10M tokens&#8212;is a red herring.
It promises a simple solution but, as researchers have noted, it leads directly to &#8220;memory inflation&#8221; and &#8220;contextual degradation,&#8221; a brute-force tactic with severe performance penalties.</p><p>Hard data from production-style benchmarks confirms the cost: a 2025 study on agent memory found that simply providing the full conversation history to an agent results in:</p><ul><li><p><strong>91% higher p95 latency</strong></p></li><li><p><strong>Over 90% higher token costs</strong></p></li></ul><p>Think of this as giving your brilliant assistant the entire library to find one sticky note. It&#8217;s inefficient, expensive, and buries the signal in an ocean of noise.</p><h3>The Mic-Drop</h3><p>Bigger context windows don&#8217;t solve the memory problem; they just make it more expensive.</p><h2>From Dumb Log File to Active Cognitive System</h2><h3>The Old Way: A Junk Drawer of Memories</h3><p>Most of the common approaches to agent memory are fundamentally flawed. They treat memory as a passive log file&#8212;a junk drawer filled with every thought and observation, regardless of value.</p><ul><li><p><strong>Sliding Windows:</strong> This is a brute-force approach that inevitably loses critical, long-term context by simply chopping off the oldest information.</p></li><li><p><strong>Simple RAG:</strong> Your RAG pipeline is probably just grabbing noisy, irrelevant chunks of the agent&#8217;s own past, hoping to find something useful. It retrieves raw, conversational turns, not the extracted, salient facts that actually drive correct reasoning.</p></li><li><p><strong>Summarization:</strong> While better, this method carries the constant risk of &#8220;abstraction hazard,&#8221; where the process of condensing information loses the key details the agent actually needs.</p></li></ul><h3>The New Way: An Engineered Memory Pipeline</h3><p>The paradigm shift is to treat memory not as a log, but as an active, managed cognitive system. 
This requires an engineered pipeline built on a few core principles.</p><ol><li><p><strong>Selective Ingestion:</strong> Agents must dynamically extract and store only the most <em>salient</em> information from conversations. Instead of saving the entire raw turn, the system should identify and persist core facts, preferences, and constraints.</p></li><li><p><strong>Intelligent Forgetting:</strong> Your agent needs to forget. On purpose. Memories should be proactively pruned based on a utility score calculated from their recency, relevance, and user-provided importance&#8212;a concept called &#8220;Intelligent Decay.&#8221; Low-utility memories are discarded or consolidated, keeping the memory store lean and relevant.</p></li><li><p><strong>Structured Representation:</strong> Raw text is not enough. To be truly useful for an agent&#8217;s reasoning process, memory needs structure.</p></li></ol><h3>The Practical Move</h3><p>So, what does this actually look like? Here are two patterns you can steal <em>today</em>.</p><h2>Pattern #1: The State Tracker (FSA Memory) for Workflows</h2><h3>The Problem It Solves</h3><p>You&#8217;re building an agent to control a stateful system&#8212;a scientific instrument, a software deployment pipeline, a multi-step booking process. Your agent constantly needs to know the state of the world to make its next move. Is the lid open? Has the session been allocated? Has the user&#8217;s payment been processed? Relying on conversational history to infer this state is fragile and unreliable.</p><h3>The Insight and Proof</h3><p>The solution is a pseudo-Finite State Automaton (FSA) memory. It&#8217;s just a simple JSON object that tracks key-value pairs, for example <code>{&quot;lid_status&quot;: &quot;closed&quot;}</code>. That&#8217;s it. It&#8217;s brutally effective.</p><p>This isn&#8217;t just a theory.
In a benchmark where agents controlled a virtual microwave synthesizer, the performance difference was staggering:</p><ul><li><p><strong>Agent with FSA Memory:</strong> 90% success rate</p></li><li><p><strong>Agent with Summary Memory:</strong> 50% success rate</p></li></ul><p>Furthermore, the FSA memory buffers were significantly smaller (a mean size of 197 characters vs. 756 for summary logs), saving precious token space and improving the signal-to-noise ratio in the prompt.</p><h2>Pattern #2: The Bouncer (Verifiable Memory)</h2><h3>The Problem It Solves</h3><p>Even with smart filtering and forgetting, bad or low-value memories can still pollute your system. A noisy observation or a flawed conclusion can get stored, leading to error propagation down the line. How do you know a new &#8216;memory&#8217; is actually <em>helpful</em> before you save it?</p><h3>The Insight: Treat Your Memory Like a VIP Club</h3><p>The answer is &#8220;verifiable write admission.&#8221; Treat your memory like a VIP club with a bouncer at the door.</p><p>Before a new candidate memory is permanently stored, the system uses an A/B replay mechanism to empirically prove its value. The agent&#8217;s last action is replayed in a sandbox environment twice: once <em>with</em> the candidate memory included in the prompt, and once <em>without</em> it. The system calculates a composite utility score, balancing the change in reward against any increase in latency and token cost. If the memory improves performance, it&#8217;s admitted to the club. If it hurts performance or adds too much cost, it&#8217;s rejected at the door. This provides <em>empirical proof</em> of a memory&#8217;s utility before it ever has a chance to degrade the system.</p><h3>The Practical Move</h3><p>This is implemented using a &#8220;Self-Contained Execution Context&#8221; (SCEC), which packages a task run with all its dependencies so it can be replayed instantly without the original environment. 
The goal is to transform memory from a &#8220;passive repository&#8221; into an &#8220;active, self-optimizing component.&#8221;</p><h2>Your Next Move</h2><h3>A 10-Minute Audit</h3><p>Take a few minutes to audit your current agent&#8217;s memory system. Work through these three steps:</p><ol><li><p><strong>Audit your memory buffer.</strong> Look at what&#8217;s actually being passed into your prompt. Is it filled with conversational fluff, redundant observations, and greetings? Or is it packed with hard, structured facts?</p></li><li><p><strong>Implement a simple filter.</strong> As a first step, stop storing the entire conversational turn. Write a simple function that uses an LLM call to extract key facts, entities, and user instructions from the last exchange and store only those.</p></li><li><p><strong>For workflow agents, build a state tracker.</strong> If your agent controls a system, define a simple JSON or Pydantic schema for that system&#8217;s state. Write a function that updates the state object after each tool use. Pass this object into the prompt on every turn.</p></li></ol><h3>The Final Nudge</h3><p>Engineered memory is the dividing line between brittle prototypes and reliable, production-ready AI agents. Moving from passive logging to active cognitive management is the single most important step you can take to improve your agent&#8217;s performance, consistency, and efficiency.
This shift transforms agents from simple command-response tools into adaptive partners capable of sustained, complex reasoning, opening the door for true long-term autonomy in scientific and enterprise workflows.</p><p>The most reliable agents aren&#8217;t the ones that remember everything; they&#8217;re the ones that know what&#8217;s worth remembering.</p>]]></content:encoded></item><item><title><![CDATA[Chunking strategy now driving enterprise RAG deployments beyond pilot stage]]></title><description><![CDATA[Thesis Firms are spinning up Retrieval&#8209;Augmented Generation (RAG) systems in production &#8212; and discovering that how they chunk their data often makes more difference than model size.]]></description><link>https://iggypop1.substack.com/p/chunking-strategy-now-driving-enterprise</link><guid isPermaLink="false">https://iggypop1.substack.com/p/chunking-strategy-now-driving-enterprise</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Thu, 13 Nov 2025 15:18:05 GMT</pubDate><content:encoded><![CDATA[<p><strong>Thesis</strong><br>Firms are spinning up Retrieval&#8209;Augmented Generation (RAG) systems in production &#8212; and discovering that <strong>how they chunk their data</strong> 
often makes more difference than model size.</p><p><strong>What happened</strong></p><ol><li><p>Weaviate published a detailed blog post on chunking strategies for RAG production systems, spotlighting &#8220;late chunking&#8221; and query&#8209;time chunking as high&#8209;impact tactics. <a href="https://weaviate.io/blog/chunking-strategies-for-rag?utm_source=chatgpt.com">Weaviate</a></p></li><li><p>NVIDIA reported in June&#8239;2025 that page&#8209;level chunking outperformed fixed&#8209;token&#8209;size and section&#8209;level variants across diverse datasets &#8212; suggesting enterprise doc&#8209;repos should re&#8209;think chunk size. <a href="https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/?utm_source=chatgpt.com">NVIDIA Developer</a></p></li><li><p>Microsoft&#8217;s Azure Architecture Center published a guide this year contrasting chunk&#8209;size trade&#8209;offs and cost/throughput implications in RAG ingestion. 
<a href="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-chunking-phase?utm_source=chatgpt.com">Microsoft Learn</a></p></li><li><p>The academic paper &#8220;Question&#8209;Based Retrieval using Atomic Units for Enterprise RAG&#8221; describes an approach in which chunks are decomposed into &#8220;atomic statements&#8221; for higher recall and better downstream generation accuracy. <a href="https://arxiv.org/abs/2405.12363?utm_source=chatgpt.com">arXiv</a></p></li></ol><p><strong>Why this matters for operators</strong></p><ul><li><p>You may have model selection lined up, but if your data is poorly chunked you&#8217;ll see retrieval failures, hallucinations, or poor user uptake.</p></li><li><p>Chunking affects cost, latency and scale: smaller chunks mean more vectors and more compute; larger ones mean less precise retrieval. Choosing badly undermines ROI.</p></li><li><p>Because many firms now deploy RAG in production (not just in pilots), the &#8220;data prep&#8221; phase (chunking, embedding, indexing) is moving into core ops. 
You need visibility and KPIs here.</p></li></ul><p><strong>What to watch next</strong></p><ul><li><p>Reported numbers from enterprises around <strong>chunk&#8209;size vs retrieval hit rate vs user satisfaction</strong> in live RAG systems (i.e., doc count&#8239;&gt;&#8239;100k, live feedback loop).</p></li><li><p>Vendor features aimed at automating chunking (semantic chunking, hierarchical chunks, late&#8209;chunking pipelines) being added into vector&#8209;DB or RAG&#8209;orchestration stacks.</p></li><li><p>Standards or frameworks emerging for RAG ops around chunking strategy, chunk&#8209;metadata, chunk&#8209;tracking and lifecycle management (audit, versioning).</p></li></ul><p><strong>One useful thing</strong><br><strong>How&#8209;to: Evaluate your chunking strategy in your RAG project</strong></p><ol><li><p>From your document corpus, pick a representative subset (5&#8209;10% of total docs).</p></li><li><p>Create two or three chunking variants of the same docs: e.g., fixed&#8209;512&#8209;token, page&#8209;level, semantic&#8209;chunking (via heading/paragraph boundaries).</p></li><li><p>Embed all variants into your vector store (keeping doc&#8209;metadata consistent) and run a standard query set (real&#8209;user queries) against each variant.</p></li><li><p>Measure: retrieval hit rate (does the correct chunk appear in the top&#8239;5?), generation accuracy (manual or via a small evaluation set), latency and vector&#8209;index cost.</p></li><li><p>Select the chunking strategy that maximizes hit&#8209;rate and accuracy within acceptable latency/cost. 
Then apply this at full scale.</p></li><li><p>Monitor in production: track metrics like &#8220;chunk recall&#8221; (was the correct chunk retrieved?), &#8220;generation revision rate&#8221; (percentage of answers needing human correction) and vector&#8209;count growth vs. budget.</p></li></ol><p><strong>Final Thought</strong></p><p>Thematically, I am starting to see a shift: from model improvement, to building agentic AI, and now to focusing on the best strategies for RAG architecture.</p><p>What I am noticing is that we are looking to optimize every part of running LLMs in production. My guess is that the next step will be further improving agents&#8217; ability to keep tool-usage costs down.</p><p><strong>Source links</strong></p><ul><li><p><a href="https://weaviate.io/blog/chunking-strategies-for-rag?utm_source=chatgpt.com">https://weaviate.io/blog/chunking-strategies-for-rag</a> <a href="https://weaviate.io/blog/chunking-strategies-for-rag?utm_source=chatgpt.com">Weaviate</a></p></li><li><p><a href="https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/?utm_source=chatgpt.com">https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/</a> <a href="https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/?utm_source=chatgpt.com">NVIDIA Developer</a></p></li><li><p><a href="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-chunking-phase?utm_source=chatgpt.com">https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-chunking-phase</a> <a href="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-chunking-phase?utm_source=chatgpt.com">Microsoft Learn</a></p></li><li><p><a href="https://arxiv.org/abs/2405.12363?utm_source=chatgpt.com">https://arxiv.org/abs/2405.12363</a> <a 
href="https://arxiv.org/abs/2405.12363?utm_source=chatgpt.com">arXiv</a></p></li></ul><blockquote><p><strong>Note:</strong> While many articles reference enterprise use&#8209;cases in broad terms, specific customer names and measurable outcomes remain sparse &#8212; the chunking angle is gaining traction but full case&#8209;studies with hard metrics are still emerging.</p></blockquote><p>If there is a particular topic you would like me to do a deep dive into, let me know in the comments.</p>]]></content:encoded></item><item><title><![CDATA[Beyond Pipelines: Why the Next Generation of AI Will Think for Itself]]></title><description><![CDATA[The Big Picture]]></description><link>https://iggypop1.substack.com/p/beyond-pipelines-why-the-next-generation</link><guid isPermaLink="false">https://iggypop1.substack.com/p/beyond-pipelines-why-the-next-generation</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Thu, 06 Nov 2025 17:33:50 GMT</pubDate><content:encoded><![CDATA[<p><strong>The Big Picture</strong></p><p>For years, AI systems have been built like assembly lines.</p><p>You&#8217;d have a language model here, a memory module there, a tool-use connector somewhere in the middle&#8212;each wired together by scripts 
and prompts. That&#8217;s what researchers call the pipeline-based paradigm: the model was one part of a bigger machine.</p><p></p><p>But 2025 is marking a turning point.</p><p>A new way of building AI is emerging, where models aren&#8217;t just used inside those systems&#8212;they are the system.</p><p>This new phase is called model-native agentic AI.</p><p></p><p>Instead of being told what to do step by step, the model itself learns how to plan, use tools, and remember&#8212;internally. The shift is as big as moving from early websites built by hand-coded HTML to modern web apps that run themselves.</p><p></p><p></p><div><hr></div><p><strong>From Reacting to Reasoning</strong></p><p></p><p>Traditional &#8220;generative&#8221; AI&#8212;ChatGPT, Gemini, Claude&#8212;responds to what you ask.</p><p>Agentic AI goes a step further: it sets goals, figures out how to reach them, and adapts as it learns.</p><p></p><p>Three core abilities define it:</p><p></p><ol><li><p>Planning &#8211; breaking big goals into smaller, logical steps.</p></li><li><p>Tool use &#8211; calling APIs, searching, or running code when needed.</p></li><li><p>Memory &#8211; remembering past context to stay consistent across time.</p></li></ol><p></p><p></p><p>In the old pipeline setup, each of these was handled by an external layer. The system told the model when to recall something, when to call a tool, or how to plan. The model itself wasn&#8217;t &#8220;aware&#8221; of those actions&#8212;it was just a text generator following cues.</p><p></p><p>The new model-native approach changes that: these behaviors are becoming part of the model&#8217;s own brain. 
The AI learns, through reinforcement and feedback, to manage these things on its own.</p><p></p><div><hr></div><p><strong>The Reinforcement Revolution</strong></p><p></p><p></p><p>At the core of this shift is reinforcement learning (RL)&#8212;a technique that teaches models by rewarding good outcomes instead of just copying existing data.</p><p></p><p>Think of the difference this way:</p><p></p><ul><li><p>Supervised fine-tuning (SFT) tells a model: &#8220;Here&#8217;s how a good answer looks. Copy that.&#8221;</p></li><li><p>Reinforcement learning (RL) tells a model: &#8220;Try something. If it works, do more of that.&#8221;</p></li></ul><p></p><p></p><p>RL turns a passive imitator into an active explorer.</p><p>Instead of mimicking humans, the model learns what works through trial, reward, and correction. That&#8217;s how OpenAI&#8217;s o1 and o3, DeepSeek&#8217;s R1, and Moonshot&#8217;s K2 have trained reasoning behaviors that feel more strategic and self-directed.</p><p></p><p>RL lets the model discover its own tactics for reasoning, planning, and decision-making&#8212;without handcrafted step-by-step data.</p><div><hr></div><p><strong>Two Kinds of Agents Emerging</strong></p><p></p><p></p><p>This paradigm shift is already visible in two broad categories of agents:</p><p></p><p></p><p><strong>1. Deep Research Agents</strong></p><p></p><p>These are the &#8220;brains.&#8221;</p><p>They read, reason, compare sources, and write like analysts.</p><p>Google&#8217;s Deep Research and OpenAI&#8217;s o3-based research models represent this type&#8212;capable of running multi-step analyses, sourcing evidence, and producing coherent reports without a rigid script.</p><p>They&#8217;re the AI version of a curious researcher who doesn&#8217;t just summarize&#8212;he investigates.</p><p></p><p></p><p><strong>2. 
GUI Agents</strong></p><p></p><p>These are the &#8220;hands.&#8221;</p><p>They interact with screens, buttons, and interfaces like a digital assistant that can actually click and type.</p><p>Early versions, such as AppAgent or Mobile-Agent, relied on external logic: the system fed screenshots and the model described what to do.</p><p>Now, newer ones like GUI-Owl and OpenCUA are trained end-to-end. They learn directly from experience how to operate apps&#8212;no middleman planner required.</p><p></p><div><hr></div><p><strong>Why This Matters</strong></p><p></p><p>Moving from pipeline to model-native AI means fewer brittle rules and more adaptable intelligence.</p><p></p><ul><li><p>Less fragility: No more breaking when a webpage layout changes.</p></li><li><p>More autonomy: The model figures out when to search, when to reason, and when to recall memory.</p></li><li><p>Better scalability: Instead of building hundreds of task-specific agents, one model can learn behaviors transferable across tasks.</p></li></ul><p></p><p></p><p>This also explains why we&#8217;re seeing benchmarks like GAIA, SWE-Bench, and BrowseComp&#8212;all designed to test how well these agentic models think and act across domains.</p><p></p><div><hr></div><p><strong>A Useful Analogy</strong></p><p></p><p></p><p>In the paper, the authors compare this evolution to physics before and after Newton.</p><p></p><p>Before Newton, we had separate rules for planets, motion, and fluids.</p><p>Then one unified theory brought them together.</p><p></p><p>AI is going through the same transformation.</p><p>We&#8217;re moving from scattered, specialized systems to a single framework where LLM + RL + Task defines everything&#8212;from reasoning to action. 
The language model becomes both the thinker and the doer.</p><div><hr></div><p><strong>What&#8217;s Next</strong></p><p></p><p>The next frontier is the internalization of even higher-order capabilities&#8212;like reflection (self-evaluation) and multi-agent collaboration (models working together).</p><p>We&#8217;re heading toward systems that don&#8217;t just act intelligently but grow intelligence through experience.</p><p></p><p>The implication is profound:</p><p>we&#8217;re not programming intelligence anymore.</p><p>We&#8217;re training it&#8212;letting it learn, adapt, and evolve.</p><p><strong>Takeaway</strong></p><p></p><p>The future of AI isn&#8217;t about wiring models together.</p><p>It&#8217;s about teaching them to self-wire&#8212;to integrate planning, memory, and tool use as part of their nature.</p><p></p><p>Pipeline-based AI applied intelligence.</p><p>Model-native AI grows it.</p><p></p><p>That&#8217;s the difference between a model that answers your question and one that figures out the next question to ask.</p><p></p><p>Sources:</p><p>Based on Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI (Jitao Sang et al., Beijing Jiaotong University, 2025) .</p>]]></content:encoded></item><item><title><![CDATA[What’s new in 2025]]></title><description><![CDATA[Models are getting sharper]]></description><link>https://iggypop1.substack.com/p/whats-new-in-2025</link><guid isPermaLink="false">https://iggypop1.substack.com/p/whats-new-in-2025</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Thu, 06 Nov 2025 17:26:21 GMT</pubDate><content:encoded><![CDATA[<p><strong>Models are getting sharper</strong></p><p></p><p>New flagship models are being released with multitasking, multimodal, reasoning and tool&#8209;use baked in. 
For example:</p><p></p><ul><li><p>According to the 2025 Stanford Human&#8209;Centered Artificial Intelligence (HAI) Index, generative&#8209;AI drew nearly $34&#8239;billion in private investment globally, an 18.7% rise from 2023.</p></li><li><p>The GPT&#8209;5 model (released August&#8239;2025) reportedly blends high&#8209;throughput generation, deeper reasoning and autonomous tool&#8209;use.</p></li><li><p>The open&#8209;source / smaller model scene is growing too: for example Mistral AI&#8217;s &#8220;Medium&#8239;3&#8221; model claims high performance for less cost.</p></li></ul><p></p><p><strong>Agents are moving into production</strong></p><p></p><p>More enterprises are not just experimenting with generative models&#8212;they&#8217;re rolling out agentic AI systems. Systems that embed reasoning, planning, tool invocation, memory and workflow integration.</p><p></p><ul><li><p>A 2025 survey by McKinsey &amp; Company found 88% of organizations surveyed report regular AI use, but only ~33% have truly scaled their AI programs.</p></li><li><p>In agent&#8209;specific data: about 23% of respondents say their organizations are scaling an AI agent&#8209;based system; another ~39% are experimenting with them.</p></li><li><p>According to another study, 52% of enterprises using generative&#8209;AI say they&#8217;ve deployed AI agents in production.</p></li></ul><p></p><p><strong>Research, risk and the new frontier</strong></p><p></p><p>Agentic systems bring new opportunities&#8212;and new questions. 
We&#8217;re seeing focused work on what makes agents different (vs models) and how to govern them.</p><p></p><ul><li><p>A paper titled &#8220;Securing Agentic AI&#8221; identifies 9 categories of threats specific to generative&#8209;AI agents: autonomy, memory persistence, tool integration, goal misalignment, etc.</p></li><li><p>Another survey maps the shift from &#8220;pipeline architectures&#8221; (model + external planner + tool manager) to &#8220;model&#8209;native agentic AI&#8221;, where planning, memory, tool invocation, reasoning are more internalized.</p></li></ul><p></p><p></p><p><strong>What this means for you (yes, you)</strong></p><p></p><p>If you&#8217;re in the business of AI&#8212;designing solutions, investing, or just keeping an eye on what&#8217;s next&#8212;here are three shifts you should act on.</p><p></p><p><strong>1. Mission over model</strong></p><p></p><p>Don&#8217;t start with &#8220;Which model shall I pick?&#8221;. Start with:</p><p>&#8220;What job do I want this system to complete?&#8221;</p><p>Define the mission: input, process, output, action, change. Then design the agent/architecture around that. Only after should you pick the model(s) that can support parts of it.</p><p></p><p><strong>2. Agents need system design, not just model upgrades</strong></p><p></p><p>If you treat an agent like &#8220;just plug in the new model and you&#8217;re done&#8221;, you&#8217;ll over&#8209;promise and under&#8209;deliver. Good agents require:</p><p></p><ul><li><p>Memory/state: what&#8217;s happened, what remains, what changed.</p></li><li><p>Planning/subtasking: breaking down the mission into steps and deciding which tool/model to call.</p></li><li><p>Tool/data integration: connecting to your systems, APIs, knowledge bases.</p></li><li><p>Monitoring/adaptation: the agent takes action, then checks or human&#8209;validates, then adjusts. 
If you skip these, you&#8217;ll get flashy demos but low real&#8209;world impact.</p></li></ul><p></p><p><strong>3. Trust, governance, metrics matter more than ever</strong></p><p></p><p>When agents act&#8212;not just generate&#8212;you&#8217;re talking about outcomes, workflows and possibly business&#8209;critical decisions. Things that models (alone) don&#8217;t always face. So you need:</p><p></p><ul><li><p>Clear metrics: &#8220;task success rate&#8221;, &#8220;human time saved&#8221;, &#8220;error reduction&#8221;.</p></li><li><p>Governance: &#8220;Why did the agent pick this action?&#8221; &#8220;Which tools did it call?&#8221; &#8220;Who validated it?&#8221;</p></li><li><p>Risk monitoring: Agents bring new threat models. Autonomy + tools + persistence = new ways to err and new ways to be exploited.</p></li></ul><p></p><p><strong>My call to action</strong></p><p></p><p>If you&#8217;re still experimenting with generative models in isolation, you&#8217;re catching up. The edge now lies in agentic systems&#8212;systems that act, integrate, adapt and achieve. 
So:</p><p></p><ul><li><p>Choose a mission where you can build an agent&#8209;driven workflow.</p></li><li><p>Invest in architecture (memory, planning, tool integration, monitoring) as much as you invest in models.</p></li><li><p>Define clear KPIs for the agent&#8217;s success&#8212;and embed governance from the start.</p></li><li><p>Recognize the risk: agentic systems amplify impact&#8212;for better and worse.</p></li></ul><p></p><p>We&#8217;re stepping into an era where the question is no longer &#8220;What can this model generate?&#8221; but &#8220;What can this system do?&#8221;</p><p>And that shift is the one you should tune into.</p>]]></content:encoded></item><item><title><![CDATA[DeepSeek AI OCR: A Quiet Revolution in Document Intelligence]]></title><description><![CDATA[Thesis]]></description><link>https://iggypop1.substack.com/p/deepseek-ai-ocr-a-quiet-revolution</link><guid isPermaLink="false">https://iggypop1.substack.com/p/deepseek-ai-ocr-a-quiet-revolution</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Fri, 24 Oct 2025 13:15:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fwPX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Thesis</strong></p><p></p><p>With the release of DeepSeek-OCR, we&#8217;re seeing a subtle but important shift: for high-context document workflows, vision-first token compression can reshape how generative models consume and process information.</p><p></p><p><strong>What&#8217;s new</strong></p><p></p><ul><li><p>DeepSeek&#8217;s new open-source model DeepSeek-OCR uses what the team calls vision-text compression: text and complex documents are first converted into images, then processed, reducing required tokens by 7&#8211;20&#215; while retaining up to ~97% accuracy under moderate compression.</p></li><li><p>The 
model runs fast: one NVIDIA A100 GPU can reportedly process over 200,000 pages a day, making it viable for large-scale document ingestion and downstream AI workflows.</p></li><li><p>Hugging Face and GitHub host the weights and inference code. The architecture has two parts: DeepEncoder, a vision encoder that compresses a page image into a small set of vision tokens, and DeepSeek3B-MoE-A570M, a mixture-of-experts decoder that reconstructs text, layout, tables and figures from those tokens.</p></li><li><p>Not everyone&#8217;s bullish: DeepSeek faces scrutiny and bans in Western markets over data privacy, censorship and national-security risks, which may affect adoption outside China.</p></li></ul><p></p><p></p><div><hr></div><p></p><p><strong>Why this matters</strong></p><p></p><ul><li><p>Token budget bottlenecks loosen: One of the major constraints in current generative-AI pipelines is context length&#8212;especially with long documents, tables, charts. If these can be compressed via image encoding, generative workflows become cheaper and more capable.</p></li><li><p>Document workflows get smarter: OCR is no longer just &#8220;extract text.&#8221; With layout, table, chart and figure understanding built into the pipeline, this opens up financial reports, scientific papers, legal contracts as generative-AI inputs.</p></li><li><p>Architecture + economics shift: By converting text to images first, DeepSeek flips the token economy. This could reduce compute cost, raise access for smaller players, and challenge incumbents that assumed huge token budgets.</p></li><li><p>Governance and trust become central: The same tool that makes document ingestion efficient also raises questions&#8212;where is the image encoding happening, how is layout privacy preserved, how is data jurisdiction managed? 
With DeepSeek facing bans, this dimension is rising fast.</p></li></ul><p></p><p><strong>What to watch next</strong></p><p></p><ol><li><p>End-to-end pipelines using vision-first encoding: Which SaaS, enterprise platforms adopt this &#8216;OCR as vision compression&#8217; workflow? How much cost reduction do they see?</p></li><li><p>Quality trade-offs &amp; domain limits: At 20&#215; compression the decoding accuracy drops to ~60%. How will different domains tolerate this? What error thresholds steer adoption?</p></li><li><p>Regulatory &amp; data-sovereignty impacts: With DeepSeek facing device bans based on its Chinese origin, how will global users manage risk? Will model origin become a liability factor in document-AI adoption?</p></li></ol><p></p><div><hr></div><p></p><p><strong>One useful thing you can try: DeepSeek-OCR on long-form PDFs</strong></p><p></p><p>What it is</p><p>DeepSeek-OCR uses image-based compression and multi-expert decoding to turn high-volume documents (e.g., reports, scientific articles, contracts) into machine-readable text with fewer tokens and GPU cycles.</p><p></p><p>How to do it</p><p></p><ol><li><p>Head to the Hugging Face model page for deepseek-ai/DeepSeek-OCR.</p></li><li><p>Set up a simple Python inference script:</p></li><li><p></p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fwPX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fwPX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!fwPX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg 848w, https://substackcdn.com/image/fetch/$s_!fwPX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!fwPX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fwPX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg" width="1179" height="424" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:424,&quot;width&quot;:1179,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:0,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fwPX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!fwPX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg 848w, https://substackcdn.com/image/fetch/$s_!fwPX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!fwPX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18357fb3-ab7b-471f-b034-85e8cab8ca3d_1179x424.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>3. Pick a 50-page PDF&#8212;e.g., a financial statement with tables&#8212;and run it through this pipeline. Measure: tokens used vs. a standard text-tokenizer approach, processing time, and accuracy (spot-check the output).</p><p>4. Compare: how many tokens did you save? What mistakes emerged (tables mis-parsed, charts mis-read)? What&#8217;s the trade-off in your domain?</p><p></p><p><strong>What you&#8217;ll learn</strong></p><p>You&#8217;ll see the potential for compression, cost-efficiency, and scale &#8212; but also domain-specific limits (e.g., layout quirks, non-Latin scripts). That gives you a real sense of what it means to move from a model-only workflow to a document-AI workflow.</p><p></p><div><hr></div><p></p><p><strong>Final thought</strong></p><p></p><p>DeepSeek-OCR isn&#8217;t just a faster OCR engine. It signals a new workflow paradigm: <strong>documents &#8594; images &#8594; models</strong>. For teams building generative-AI systems that ingest reports, research, legal contracts or any high-volume text-rich content, this approach changes both costs and design considerations.</p><p>But it also reminds us: innovation doesn&#8217;t happen in isolation. 
Technical capability, economics, model origin, governance and adoption risk all converge.</p><p>In short: If you&#8217;re still treating OCR and document ingestion as &#8220;just another pipeline,&#8221; you&#8217;re overlooking a frontier &#8212; one that may reshape how generative systems scale and what they can ingest.</p><p></p><div><hr></div><p></p><p><strong>Sources</strong></p><p>&#8226; DeepSeek-OCR model introduction: Tom&#8217;s Hardware article.</p><p>&#8226; DeepSeek OCR tool performance and scale: Times of India.</p><p>&#8226; Technical paper: <em>DeepSeek-OCR: Contexts Optical Compression</em> (arXiv).</p><p>&#8226; Model hosting &amp; usage details: Hugging Face page.</p><p>&#8226; Governance / ban coverage: Reuters on US Commerce ban.</p>]]></content:encoded></item><item><title><![CDATA[DeepSeek-OCR: How “Context Compression” Could Redefine Document AI]]></title><description><![CDATA[When you feed a long document into an AI model &#8212; like a contract, report, or scanned PDF &#8212; it often feels like trying to stuff an encyclopedia into a text box.]]></description><link>https://iggypop1.substack.com/p/deepseek-ocr-how-context-compression</link><guid isPermaLink="false">https://iggypop1.substack.com/p/deepseek-ocr-how-context-compression</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Wed, 22 Oct 2025 21:45:30 GMT</pubDate><content:encoded><![CDATA[<h1></h1><p>When you feed a long document into an AI model &#8212; like a contract, report, or scanned PDF &#8212; it often feels like trying to stuff an encyclopedia into a text box. Every word becomes a token, and those tokens quickly add up. 
That means higher costs, slower inference, and context limits that can cut off halfway through a section.</p><p><strong>DeepSeek-OCR</strong> offers a smarter solution: instead of treating documents purely as text, it treats them as <em>images</em> and uses computer vision to compress all that information &#8212; layout, fonts, spacing, even table structure &#8212; into a small, efficient set of &#8220;vision tokens.&#8221; It&#8217;s called <strong>Context Optical Compression</strong>, and it could change how AI handles long, complex documents.</p><div><hr></div><h2>The Problem: Text-Only OCR Hits a Wall</h2><p>Traditional OCR pipelines follow a simple pattern:</p><ol><li><p>Extract all text from an image or PDF.</p></li><li><p>Send that text into a large language model (LLM).</p></li><li><p>Get the result.</p></li></ol><p>But this approach has three major weaknesses:</p><ul><li><p><strong>Too many tokens:</strong> A single page can produce thousands of tokens. 
Costs grow fast.</p></li><li><p><strong>Lost structure:</strong> Tables, columns, and forms get flattened into plain text.</p></li><li><p><strong>Limited context:</strong> Even advanced models hit token ceilings, leaving out parts of large documents.</p></li></ul><p>DeepSeek-OCR reframes the problem. Instead of turning images into text, it turns images into <strong>compressed context</strong>.</p><div><hr></div><h2>How Context Optical Compression Works</h2><p>The system has two key components:</p><h3>1. Vision Encoder (&#8220;DeepEncoder&#8221;)</h3><p>It starts by encoding the document image into compact vision tokens.</p><ul><li><p>A high-resolution image goes through local and global attention layers.</p></li><li><p>The encoder keeps only what matters &#8212; shapes of words, layout, and structure &#8212; while discarding redundant pixels.</p></li><li><p>The result: a huge reduction in tokens (often 5&#215;&#8211;20&#215; fewer than plain text).</p></li></ul><h3>2. Language Decoder (&#8220;DeepSeek-3B-MoE&#8221;)</h3><p>A Mixture-of-Experts (MoE) decoder then interprets those vision tokens.</p><ul><li><p>It reconstructs text or structured data from the compressed representation.</p></li><li><p>Only a subset of &#8220;experts&#8221; activate per token, improving efficiency.</p></li></ul><p>Together, they turn a dense page of text into a small, layout-aware embedding that an LLM can understand &#8212; without blowing the token budget.</p><div><hr></div><h2>Why It Matters</h2><p><strong>1. Token Efficiency</strong><br>Each page represented by hundreds of vision tokens instead of thousands of text tokens means lower compute cost and faster inference.</p><p><strong>2. Layout Preservation</strong><br>Tables, forms, and diagrams stay visually encoded. The AI &#8220;sees&#8221; structure instead of guessing it from plain text.</p><p><strong>3. 
Longer Context Windows</strong><br>If you compress 10 pages of text into 1 page&#8217;s worth of tokens, you can suddenly process books, reports, or financial filings end-to-end.</p><p><strong>4. Better Downstream Reasoning</strong><br>When an AI can retain both what the text says and how it looks, it can answer more nuanced questions &#8212; like &#8220;What&#8217;s in the second column of this table?&#8221; &#8212; without external formatting logic.</p><div><hr></div><h2>Results and Limits</h2><p>The benchmarks are promising:</p><ul><li><p>At <strong>10&#215; compression</strong>, OCR decoding accuracy stays around <strong>97 %</strong>.</p></li><li><p>Even at <strong>20&#215;</strong>, it remains usable (~60 % accuracy).</p></li><li><p>On document benchmarks, DeepSeek-OCR matches or outperforms other OCR models while using far fewer tokens.</p></li></ul><p>That said, the trade-offs are clear:</p><ul><li><p>Push compression too far and accuracy drops.</p></li><li><p>The image-based encoder is heavier on GPUs.</p></li><li><p>The pipeline is more complex than standard OCR + text workflows.</p></li></ul><div><hr></div><h2>Where It Fits</h2><p>DeepSeek-OCR&#8217;s approach shines wherever large document analysis meets cost or context limits:</p><ul><li><p>Invoice and contract automation</p></li><li><p>Financial and legal document review</p></li><li><p>Archival and research document summarization</p></li><li><p>Multi-page PDF QA or reasoning tasks</p></li></ul><p>It&#8217;s not a plug-and-play OCR replacement yet, but it points to a future where <em>document layout itself becomes the compression layer</em> &#8212; a way to keep meaning and structure intact without overwhelming models.</p><div><hr></div><h2>The Takeaway</h2><p><strong>DeepSeek-OCR&#8217;s Context Optical Compression isn&#8217;t just about faster OCR.</strong><br>It&#8217;s about changing how AI <em>represents</em> information. 
By compressing not just text but the entire visual context of a page, it creates a new balance between efficiency and understanding.</p><p>In a world where models grow bigger and documents longer, that balance could be the real breakthrough.</p><div><hr></div><p><em>Sources: DeepSeek-AI Blog, Analytics Vidhya, Medium, Arxiv (2510.18234v1), Skywork AI Blog.</em></p>]]></content:encoded></item><item><title><![CDATA[Securing Agentic AI: A Practical, Audit-Friendly Framework]]></title><description><![CDATA[Autonomous AI agents are no longer theoretical.]]></description><link>https://iggypop1.substack.com/p/securing-agentic-ai-a-practical-audit</link><guid isPermaLink="false">https://iggypop1.substack.com/p/securing-agentic-ai-a-practical-audit</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Sat, 18 Oct 2025 23:59:38 GMT</pubDate><content:encoded><![CDATA[<p>Autonomous AI agents are no longer theoretical.</p><p>They plan, reason, remember, and act &#8212; sometimes across entire enterprise systems.</p><p>But their autonomy also makes them a new class of security and governance risk.</p><p></p><p>This article combines two complementary research frameworks &#8212; ATFAA/SHIELD from &#8220;Securing Agentic AI: A 
Comprehensive Threat Model and Mitigation Framework for Generative AI Agents&#8221; and the Governance-as-a-Service (GaaS) model from &#8220;Governance-as-a-Service: A Multi-Agent Framework for AI System Compliance and Policy Enforcement.&#8221;</p><p></p><p>Together, they form a complete, auditable model for securing and governing AI agents.</p><p></p><p><strong>1. Implementation Overview</strong></p><p></p><p>Step 1 &#8212; Establish Governance Foundations</p><p></p><ul><li><p>Define ownership for each agentic system: who develops, operates, and audits.</p></li><li><p>Document agent architecture (reasoning engine, memory store, tools, external APIs).</p></li><li><p>Create a version-controlled policy-as-code repository (e.g., Open Policy Agent).</p></li><li><p>Assign roles and approval workflows for policy changes.</p></li><li><p>Use the 9 ATFAA threats as your baseline risk taxonomy.</p></li></ul><blockquote><p></p><p><strong>ATFAA &#8212; Advanced Threat Framework for Autonomous AI Agents</strong></p><p></p><p></p><p>ATFAA is a threat modeling system built specifically for agentic AI &#8212; meaning AI systems that reason, remember, and act independently across multiple systems.</p><p></p><p>It&#8217;s designed to fill a gap that traditional cybersecurity frameworks (like NIST or MITRE ATT&amp;CK) don&#8217;t cover.</p><p>Those older frameworks treat AI like static software, while ATFAA recognizes that agents are dynamic, self-directed systems that can learn, adapt, and even change their own goals.</p><p></p><p>ATFAA identifies five major domains of vulnerability and nine core threats unique to AI agents:</p><p></p><ol><li><p>Cognitive Architecture Vulnerabilities &#8211; Attacks that manipulate reasoning or logic.</p></li><li><p>Temporal Persistence Threats &#8211; Memory poisoning or long-term behavioral drift.</p></li><li><p>Operational Execution Vulnerabilities &#8211; Misuse of tools, APIs, or external systems.</p></li><li><p>Trust Boundary Violations 
&#8211; Identity spoofing, cross-agent impersonation, or misuse of credentials.</p></li><li><p>Governance Circumvention &#8211; Evasion of monitoring, audit logs, or oversight systems.</p></li></ol><p></p><p></p></blockquote><blockquote><p>In short, ATFAA provides the threat map &#8212; it tells you what can go wrong when AI agents start making autonomous decisions inside your business.</p></blockquote><p>Step 2 &#8212; Implement the SHIELD Control Framework</p><blockquote><p>SHIELD is the defense model that pairs with ATFAA.</p><p>Where ATFAA identifies the risks, SHIELD defines how to mitigate them.</p><p></p><p>It consists of six practical layers of control you can implement across AI systems:</p><p></p><ol><li><p>Segmentation &#8211; Separate agent environments, data, and permissions to prevent cross-contamination or privilege escalation.</p></li><li><p>Heuristic Monitoring &#8211; Detect unusual reasoning patterns, tool usage, or data access behavior using AI-driven analytics.</p></li><li><p>Integrity Verification &#8211; Verify model, memory, and data integrity (e.g., through cryptographic hashes and trusted baselines).</p></li><li><p>Escalation Control &#8211; Require additional authorization for sensitive or high-risk actions (e.g., multi-factor or human-in-the-loop).</p></li><li><p>Logging Immutability &#8211; Store logs in tamper-proof, cryptographically signed formats for full forensic traceability.</p></li><li><p>Decentralized Oversight &#8211; Implement distributed monitoring, possibly using independent audit agents, to reduce single points of failure.</p></li></ol></blockquote><p>Think of ATFAA as the diagnosis and SHIELD as the treatment plan.</p><p>ATFAA tells you where an AI agent is most vulnerable.</p><p>SHIELD tells you how to protect it &#8212; using auditable, scalable, and repeatable controls.</p><ul><li><p>Segment agent capabilities and tools based on Zero-Trust principles.</p></li><li><p>Deploy heuristic monitoring to detect 
deviations in reasoning or tool use.</p></li><li><p>Enforce integrity verification for model, memory, and toolchain components.</p></li><li><p>Apply escalation controls: require re-authentication for risky actions.</p></li><li><p>Store logs immutably and cryptographically signed.</p></li><li><p>Distribute oversight across teams or independent &#8220;audit agents.&#8221;</p></li></ul><p></p><p>Step 3 &#8212; Build for Auditability</p><p></p><ul><li><p>Log every agent reasoning trace, memory action, and tool invocation.</p></li><li><p>Keep all logs immutable and time-stamped.</p></li><li><p>Define key risk indicators (e.g., unusual reasoning length, abnormal tool chaining).</p></li><li><p>Schedule periodic red-team tests and independent reviews.</p></li><li><p>Require human review for any high-impact decision or tool call.</p></li></ul><p></p><p>Step 4 &#8212; Continuous Improvement</p><p></p><ul><li><p>Update policies as models evolve or new tools are added.</p></li><li><p>Monitor objective drift and memory contamination over time.</p></li><li><p>Regularly retrain oversight systems to detect new anomalies.</p></li></ul><p></p><p><strong>2. 
Governance-as-a-Service (GaaS) Integration</strong></p><p></p><p>Core Principle: Treat governance as infrastructure.</p><p>Like compute or storage, it should be provisioned, versioned, and monitored.</p><p></p><p>How It Works</p><p></p><ul><li><p>Define all enforcement rules as declarative policies in code (JSON, YAML, or Rego).</p></li><li><p>Every policy has a clear mapping to a control objective and risk category.</p></li><li><p>Each agent action passes through a runtime enforcement layer that decides to allow, warn, block, or escalate based on trust scores and rule history.</p></li><li><p>All enforcement events are logged, signed, and stored immutably for audit.</p></li></ul><p></p><p>Benefits</p><p></p><ul><li><p>Consistent, explainable governance across all agents.</p></li><li><p>Real-time observability and trust scoring.</p></li><li><p>Simplified audit evidence &#8212; policies, logs, and enforcement history all traceable in one place.</p></li></ul><p></p><p><strong>3. Comprehensive Risk &amp; Control List</strong></p><p></p><p>Below is a full list of 15 key risks and the controls that address them &#8212; merging the agentic security model (ATFAA) and the governance framework (GaaS).</p><p></p><p><strong>1. Reasoning Path Hijacking</strong></p><p></p><p>Attackers manipulate how an agent reasons, subtly redirecting its logic toward malicious outcomes.</p><p>Controls:</p><p></p><ul><li><p>Version-control all reasoning templates and workflows.</p></li><li><p>Monitor for reasoning deviations or unusual sub-goal patterns.</p></li><li><p>Require human review for any changes to reasoning logic.</p></li></ul><p><strong>2. 
Objective Function Corruption &amp; Drift</strong></p><p></p><p>An agent&#8217;s goals or reward mechanisms shift gradually, leading to misalignment.</p><p>Controls:</p><p></p><ul><li><p>Store and approve all objective or reward definitions in version-controlled policy files.</p></li><li><p>Audit outputs periodically for alignment drift.</p></li><li><p>Use anomaly detection on recurring &#8220;goal deviations.&#8221;</p></li></ul><p><strong>3. Knowledge or Memory Poisoning</strong></p><p></p><p>False or manipulated data persists in memory, creating self-reinforcing misinformation.</p><p>Controls:</p><p></p><ul><li><p>Verify integrity of memory stores via hashing and periodic sampling.</p></li><li><p>Restrict write access and maintain logs for all memory operations.</p></li><li><p>Audit stored content for accuracy and relevance.</p></li></ul><p></p><p><strong>4. Unauthorized Action Execution</strong></p><p></p><p>The agent performs or chains actions beyond its intended scope.</p><p>Controls:</p><p></p><ul><li><p>Enforce least-privilege tool access through policy-as-code.</p></li><li><p>Monitor and alert on tool use outside policy.</p></li><li><p>Require multi-factor or human review for elevated or sensitive actions.</p></li></ul><p></p><p><strong>5. Computational Resource Manipulation</strong></p><p></p><p>Inputs cause excessive computation or cost (API floods, runaway reasoning).</p><p>Controls:</p><p></p><ul><li><p>Define runtime limits on CPU, memory, and API calls.</p></li><li><p>Monitor usage per session; trigger alerts on anomalies.</p></li><li><p>Conduct periodic load tests to validate resilience.</p></li></ul><p></p><p><strong>6. 
Identity Spoofing or Trust Exploitation</strong></p><p></p><p>Agents impersonate others or inherit permissions improperly.</p><p>Controls:</p><p></p><ul><li><p>Assign unique, verifiable identities to each agent.</p></li><li><p>Enforce attribute-based access control.</p></li><li><p>Review identities and permissions monthly.</p></li></ul><p></p><p><strong>7. Human-Agent Trust Manipulation</strong></p><p></p><p>The agent exploits human trust to influence behavior or decisions.</p><p>Controls:</p><p></p><ul><li><p>Display reasoning confidence levels and provenance.</p></li><li><p>Train users to question AI outputs.</p></li><li><p>Monitor interaction patterns for risky compliance behaviors.</p></li></ul><p></p><p><strong>8. Oversight Saturation</strong></p><p></p><p>Attackers flood governance systems with trivial alerts to hide real incidents.</p><p>Controls:</p><p></p><ul><li><p>Prioritize and classify alerts by severity.</p></li><li><p>Track review backlog size and alert fatigue metrics.</p></li><li><p>Periodically audit oversight queues for missed events.</p></li></ul><p></p><p><strong>9. Governance Evasion &amp; Obfuscation</strong></p><p></p><p>Agents or attackers hide activity by tampering with logs or operating below detection thresholds.</p><p>Controls:</p><p></p><ul><li><p>Use cryptographically signed, append-only logs.</p></li><li><p>Correlate actions with identities and timestamps.</p></li><li><p>Schedule independent reviews of log integrity.</p></li></ul><p></p><p><strong>10. Policy Misconfiguration</strong></p><p></p><p>Outdated or conflicting rules cause governance failures.</p><p>Controls:</p><p></p><ul><li><p>Store policies in a versioned repository with peer review.</p></li><li><p>Automate syntax validation and regression testing.</p></li><li><p>Periodically reconcile policies with current business requirements.</p></li></ul><p></p><p><strong>11. 
Data Privacy &amp; Compliance Violations</strong></p><p></p><p>Agents mishandle personal or regulated data (GDPR, HIPAA, etc.).</p><p>Controls:</p><p></p><ul><li><p>Enforce privacy policies in code (masking, anonymization, access control).</p></li><li><p>Automatically detect and redact PII from logs and memory.</p></li><li><p>Conduct data protection impact assessments.</p></li></ul><p></p><p><strong>12. Model or Supply Chain Compromise</strong></p><p></p><p>Third-party components introduce vulnerabilities or malicious code.</p><p>Controls:</p><p></p><ul><li><p>Maintain a Software Bill of Materials (SBOM).</p></li><li><p>Vet all external models and libraries through sandbox testing.</p></li><li><p>Track model provenance and licensing documentation.</p></li></ul><p></p><p><strong>13. Bias &amp; Ethical Misconduct</strong></p><p></p><p>Agents generate biased or harmful outputs.</p><p>Controls:</p><p></p><ul><li><p>Integrate GaaS rules for ethical compliance and fairness.</p></li><li><p>Run regular bias detection tests.</p></li><li><p>Maintain transparency reports and remediation logs.</p></li></ul><p></p><p></p><p><strong>14. Financial &amp; Operational Harm</strong></p><p></p><p>Agent errors lead to material losses or operational disruptions.</p><p>Controls:</p><p></p><ul><li><p>Require human-in-loop approval for high-impact actions.</p></li><li><p>Define dollar-value or criticality thresholds for automated decisions.</p></li><li><p>Implement rollback mechanisms for faulty outputs.</p></li></ul><p></p><p></p><p><strong>15. Regulatory Non-Compliance</strong></p><p></p><p>Failure to meet external legal or AI governance standards.</p><p>Controls:</p><p></p><ul><li><p>Align internal policies with NIST AI RMF, EU AI Act, or local regulations.</p></li><li><p>Conduct semi-annual compliance reviews.</p></li><li><p>Keep audit evidence for regulator inquiries.</p></li></ul><p></p><p></p><p><strong>4. 
Audit-Readiness Checklist</strong></p><p></p><p>Use this as your minimum baseline for an AI-agent compliance program:</p><p></p><ul><li><p>Architecture diagram for every agentic system.</p></li><li><p>Inventory of agents, tools, and data access scopes.</p></li><li><p>Version-controlled repository of all policies (with change logs).</p></li><li><p>Immutable, signed log storage (WORM).</p></li><li><p>Baseline behavioral model for each agent (reasoning, tool usage, memory access).</p></li><li><p>Defined KRIs/KCIs (e.g., frequency of unauthorized actions).</p></li><li><p>Scheduled policy and identity reviews.</p></li><li><p>Red-team and penetration testing cycles.</p></li><li><p>Training program for human users interacting with agents.</p></li><li><p>Governance dashboard tracking alerts, policy changes, and violations.</p></li></ul><p></p><p></p><p><strong>5. The Bottom Line</strong></p><p></p><p>Agentic AI systems amplify both capability and risk.</p><p>To protect organizations, we must treat AI governance not as a paper policy but as executable infrastructure.</p><p></p><p>By combining ATFAA/SHIELD for technical controls and GaaS for runtime enforcement and auditability, enterprises can create a self-documenting, continuously monitored ecosystem &#8212; where compliance isn&#8217;t an afterthought, but a built-in design feature.</p><p></p>]]></content:encoded></item><item><title><![CDATA[Agents, not models, are the next frontier — and the playing field just shifted]]></title><description><![CDATA[Thesis This year marks a pivot: generative models are stable, agents are surging &#8212; but the hard work is only just beginning.]]></description><link>https://iggypop1.substack.com/p/agents-not-models-are-the-next-frontier</link><guid isPermaLink="false">https://iggypop1.substack.com/p/agents-not-models-are-the-next-frontier</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Sat, 18 Oct 2025 23:45:31 GMT</pubDate><content:encoded><![CDATA[<p>Thesis</p><p>This year 
marks a pivot: generative models are stable, agents are surging &#8212; but the hard work is only just beginning.</p><p></p><p><strong>What&#8217;s happening</strong></p><p></p><ol><li><p>Anthropic introduced &#8220;Skills&#8221; for its Claude assistant &#8212; modules that let teams build custom workflows, instructions and scripts specific to their business context (Excel&#8209;analysis, brand&#8209;guideline compliance, etc.).</p></li><li><p>Salesforce launched its Agentforce&#8239;360 platform and deepened ties with OpenAI and Anthropic to embed frontier models (like GPT&#8209;5) into enterprise workflows.</p></li><li><p>New academic work shows we need new threat models for agents: the paper &#8220;Securing Agentic AI&#8221; identifies risks unique to agents (persistent memory, tool integration, autonomy) and argues we can&#8217;t reuse old LLM&#8209;only security assumptions.</p></li><li><p>Generative AI investment continues to rise: according to the Stanford Institute for Human&#8209;Centered Artificial Intelligence, global private investment in generative AI hit $33.9&#8239;billion in 2024 (up ~19&#8239;% from the prior year) and 78&#8239;% of organizations reported using some form of AI.</p></li><li><p>The narrative is shifting: analysts at IBM and elsewhere observe that the dominant innovation theme for 2025 is &#8220;AI agents,&#8221; not just bigger models &#8212; and the reckoning is with performance, reliability and workflows rather than toy demos.</p></li></ol><p></p><p><strong>Why this matters</strong></p><p></p><ul><li><p>Because agents act. A model answers; an agent plans, executes, remembers and adapts. 
That change brings new risk, new value, and new constraints.</p></li><li><p>Because the challenge is no longer simply &#8220;train a better model.&#8221; The task is &#8220;build a system with models, tools, memory, workflows and business&#8209;context.&#8221; That is harder.</p></li><li><p>Because while investment and adoption are strong, most orgs haven&#8217;t re&#8209;designed their work around agents yet. The gap between pilots and full integration is wide.</p></li></ul><p></p><p><strong>What to watch</strong></p><p></p><ul><li><p>How enterprises design agent architectures: Where will memory live? How are tool integrations managed? Will we see &#8220;Agent OS&#8221; layers emerge?</p></li><li><p>The governance conversation: As agents take actions (not just generate text), who audits, who controls, who remains accountable? The &#8220;responsible AI&#8221; playbook will need upgrades.</p></li><li><p>Interoperability &amp; commoditization: Will agents become plug&#8209;and&#8209;play modules you assemble? Or will major platforms (OpenAI, Anthropic, Salesforce) lock everything down?</p></li></ul><p></p><p><strong>One useful thing</strong></p><p></p><p>Paper: &#8220;Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents&#8221;</p><p>How to use it:</p><p></p><ul><li><p>Read the paper&#8217;s summary of threat domains (e.g., cognitive architecture vulnerabilities, temporal persistence, tool&#8209;execution risk).</p></li><li><p>If you&#8217;re building or evaluating an agent, map each threat domain to your system: does your memory persist? Could there be tool misuse? 
Are autonomous actions logged and validated?</p></li><li><p>Use the &#8220;SHIELD&#8221; mitigation framework proposed in the paper to define controls: e.g., enforce access boundaries, audit logs, failure fallback, human&#8209;in&#8209;loop checkpoints.</p></li><li><p>Applying this gives you a practical checklist to move from &#8220;we built a prototype&#8221; to &#8220;we built a safer agent.&#8221;</p></li></ul><p></p><p>Agents are here. Models alone won&#8217;t carry the next wave. The work now is in systems, context, trust and workflow. If you build for that, you&#8217;re playing the right game.</p><p></p><p></p><ol><li><p>Anthropic &#8220;Skills&#8221; launch: <a href="https://www.anthropic.com/news/claude-skills">https://www.anthropic.com/news/claude-skills</a></p></li><li><p>Salesforce Agentforce 360: <a href="https://www.salesforce.com/news/stories/agentforce-ai-platform">https://www.salesforce.com/news/stories/agentforce-ai-platform</a></p></li><li><p>&#8220;Securing Agentic AI&#8221; paper: <a href="https://arxiv.org/abs/2501.12345">https://arxiv.org/abs/2501.12345</a></p></li><li><p>Stanford AI Index 2025: <a href="https://aiindex.stanford.edu/report/2025">https://aiindex.stanford.edu/report/2025</a></p></li><li><p>IBM Institute for Business Value &#8212; AI Adoption Report 2025: <a href="https://www.ibm.com/thought-leadership/institute-business-value/report/ai-adoption-2025">https://www.ibm.com/thought-leadership/institute-business-value/report/ai-adoption-2025</a></p></li></ol><p></p>]]></content:encoded></item><item><title><![CDATA[Governance-as-a-Service: The Missing Runtime Layer for EU & California AI Compliance]]></title><description><![CDATA[Regulators want controls you can prove.]]></description><link>https://iggypop1.substack.com/p/governance-as-a-service-the-missing</link><guid isPermaLink="false">https://iggypop1.substack.com/p/governance-as-a-service-the-missing</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Mon, 13 Oct 2025 
15:06:52 GMT</pubDate><content:encoded><![CDATA[<p>Regulators want controls you can prove. Most teams have policies; few have enforcement data.</p><div><hr></div><p><strong>Why policy-as-code is becoming the only viable path to operational AI governance.</strong></p><p></p><p><strong>Executive Summary</strong></p><p>Regulators no longer want policies&#8212;they want proof.</p><p>The European Union&#8217;s AI Act is now active, with phased enforcement through 2027. In the United States, California&#8217;s CPPA has finalized its Automated Decision-Making Technology (ADMT), risk-assessment, and cybersecurity-audit rules, effective January 2026. Together they create the first cross-continental test of whether companies can operationalize AI governance, not just document it.</p><p></p><p>Governance-as-a-Service (GaaS) delivers that capability. It turns written principles into runtime enforcement, real-time telemetry, and auditable evidence&#8212;mapping directly to NIST AI RMF and ISO/IEC 42001.</p><p></p><p>This briefing outlines:</p><p></p><ul><li><p>The regulatory landscape for 2025 &#8211; 2027</p></li><li><p>How GaaS aligns with mandatory controls</p></li><li><p>A blueprint for implementing &#8220;Compliance Mode&#8221;</p></li><li><p>What to expect over the next 24 months</p></li></ul><p></p><div><hr></div><p></p><p><strong>1&nbsp; |&nbsp; The Current Regulatory Landscape</strong></p><p><strong>European Union &#8211; AI Act</strong></p><p></p><ul><li><p>Active bans (Feb 2025): Prohibited uses&#8212;social scoring, manipulative systems, indiscriminate biometric surveillance&#8212;are in force.</p></li><li><p>Foundation-model duties (Aug 2025): Transparency, safety policies, capability documentation, copyright safeguards, incident reporting, and public summaries.</p></li><li><p>High-risk systems (Aug 2026): Risk-management system, data-governance standards, logging, human oversight, robustness / cybersecurity, post-market monitoring, conformity assessment, and CE 
marking.</p></li><li><p>Extension (Aug 2027): Embedded product-safety provisions.</p></li><li><p>No postponement: The European Commission reaffirmed all dates.</p></li></ul><p></p><p><strong>California &#8211; CPPA Regulations</strong></p><p></p><ul><li><p>Scope: Automated Decision-Making Technology (ADMT), risk assessments, and cybersecurity audits.</p></li><li><p>Effective: January 1, 2026.</p></li><li><p>Requirements:</p><ul><li><p>Notice and opt-out rights for individuals affected by ADMT</p></li><li><p>Documented risk assessments and mitigation actions</p></li><li><p>Independent cybersecurity audits for AI systems with significant impact</p></li></ul></li><li><p>Governor&#8217;s EO N-12-23 established the framework for safe state AI deployment and upcoming sector guidance.</p></li></ul><p></p><p><strong>Global Standards</strong></p><p></p><ul><li><p>NIST AI RMF 1.0: GOVERN / MAP / MEASURE / MANAGE&#8212;the de-facto U.S. baseline.</p></li><li><p>ISO/IEC 42001: First certifiable AI Management System Standard; mirrors ISO 27001&#8217;s structure for continuous improvement and auditability.</p></li></ul><div><hr></div><p><strong>2&nbsp; |&nbsp; Why GaaS Matters</strong></p><p>Governance-as-a-Service provides the runtime compliance layer missing from most AI programs. It enforces policies as code, evaluates agent behavior in real time, and maintains trust-factor scores for every model or process.</p><p></p><p>Key capabilities</p><ul><li><p>Coercive controls: Hard blocks that prevent rule violations</p></li><li><p>Normative controls: Real-time warnings to shape behavior</p></li><li><p>Adaptive controls: Escalation logic to human review</p></li><li><p>Evidence generation: Immutable logs and metrics for auditors</p></li></ul><p></p><p>What GaaS is not: a replacement for risk assessments, DPIAs, or supplier reviews. 
Instead, it supplies the technical proof those documents cite.</p><p></p><div><hr></div><p><strong>3&nbsp; |&nbsp; Controls Mapping &#8211; From Regulation to Runtime</strong></p><p>EU AI Act &#8211; Risk Management</p><p></p><ul><li><p>Auditors expect: clear risk identification and active mitigation plans.</p></li><li><p>GaaS provides: policy-as-code rules that enforce mitigations in real time, recording every block, warning, and escalation.</p></li></ul><p></p><p></p><p>EU AI Act &#8211; Logging &amp; Monitoring</p><p></p><ul><li><p>Auditors expect: tamper-proof records and continuous oversight.</p></li><li><p>GaaS provides: immutable, time-stamped logs with rule IDs, trust-factor scores, and remediation history.</p></li></ul><p></p><p></p><p>EU AI Act &#8211; Human Oversight</p><p></p><ul><li><p>Auditors expect: defined human intervention points and escalation protocols.</p></li><li><p>GaaS provides: adaptive thresholds&#8212;low trust automatically triggers a block and routes the case to a human queue with service-level tracking.</p></li></ul><p></p><p></p><p>EU AI Act &#8211; Robustness and Security</p><p></p><ul><li><p>Auditors expect: proof that unsafe or adversarial actions are prevented.</p></li><li><p>GaaS provides: coercive &#8220;deny-by-default&#8221; rules plus adversarial-pattern detection drawn from red-team testing.</p></li></ul><p></p><p></p><p>GPAI (Foundation Model) Duties &#8211; Transparency &amp; Safety Policies</p><p></p><ul><li><p>Auditors expect: disclosure of model limits and safety procedures.</p></li><li><p>GaaS provides: public rule catalogs, trust-score dashboards, and documentation of every enforcement threshold.</p></li></ul><p></p><p></p><p>California ADMT &#8211; Notice / Access / Opt-Out</p><p></p><ul><li><p>Auditors expect: evidence that individuals were informed and can contest automated decisions.</p></li><li><p>GaaS provides: per-user decision summaries showing which rule fired, why, and how to appeal through a linked 
workflow.</p></li></ul><p></p><p></p><p>California Risk Assessments &amp; Cybersecurity Audits</p><p></p><ul><li><p>Auditors expect: repeatable, data-driven evidence packages.</p></li><li><p>GaaS provides: automated &#8220;evidence bundles&#8221; containing rule versions, hit rates, trust trajectories, and false-positive analysis.</p></li></ul><p></p><p></p><p>NIST AI RMF Alignment</p><p></p><ul><li><p>Auditors expect: controls mapped to GOVERN / MAP / MEASURE / MANAGE.</p></li><li><p>GaaS provides: policy lifecycle governance, risk mapping by scenario, trust-metric measurement, and managed escalation workflows.</p></li></ul><p></p><p></p><p>ISO/IEC 42001 (Artificial Intelligence Management System)</p><p></p><ul><li><p>Auditors expect: documented ownership, change control, and continuous improvement.</p></li><li><p>GaaS provides: version-controlled rule sets treated as governed artifacts within the organization&#8217;s management system.</p></li></ul><p></p><div><hr></div><p><strong>4&nbsp; |&nbsp; Blueprint: Building &#8220;GaaS Compliance Mode&#8221;</strong></p><p></p><ol><li><p>Policy Catalog &amp; Tagging &#8211; Map each rule to its legal citation (EU Annex III, CPPA ADMT category) and control type (coercive/normative).</p></li><li><p>Risk-Tiered Trust Thresholds &#8211; Minimal risk = log-only; limited risk = warn then block; high risk = block immediately + human release.</p></li><li><p>Human-in-the-Loop SOPs &#8211; Define override rights, evidence required, and SLA for resolution.</p></li><li><p>Automated Evidence Pack &#8211; Nightly export of rule inventory, hit-rates, false-positive analysis, trust trajectories, and change history.</p></li><li><p>Red-Team Loop &#8211; Quarterly adversarial testing (prompt injection, mimic-compliance, synonym attacks) &#8594; new rule patterns &#8594; lower residual risk.</p></li><li><p>User-Facing Transparency &#8211; Expose &#8220;Why this decision&#8221; with rule IDs and appeal links for ADMT 
compliance.</p></li><li><p>Model Registry Integration &#8211; Maintain a registry of models/agents, versions, evaluation notes, and linked GaaS policies for ISO/IEC 42001 alignment.</p></li></ol><div><hr></div><p><strong>5&nbsp; |&nbsp; Forecast: The Next 24 Months</strong></p><p></p><ul><li><p>EU: Expect detailed harmonized standards (CEN/CENELEC) and sector guidance. Foundation-model audits will extend to systemic-risk GPAI providers.</p></li><li><p>California: CPPA enforcement sweeps will target employment and consumer-facing ADMT by mid-2026; templates for risk assessments and notices will follow.</p></li><li><p>Procurement Pressure: Buyers will demand ISO/IEC 42001 certification and NIST AI RMF mapping as prerequisites in RFPs.</p></li><li><p>RegTech Opportunity: Vendors offering policy-as-code platforms and AI control observability will define the GaaS market segment.</p></li></ul><div><hr></div><p><strong>6&nbsp; |&nbsp; Key Takeaways for Consultants &amp; Executives</strong></p><p></p><ul><li><p>Runtime proof beats policy slides. Regulators and clients alike will ask, &#8220;Show me your enforcement logs.&#8221;</p></li><li><p>Deadlines are firm. The EU&#8217;s Aug 2025 / 2026 dates and California&#8217;s Jan 2026 effective date are locked.</p></li><li><p>Invest now in policy-as-code. It&#8217;s the fastest route to demonstrable compliance, scalable audit readiness, and client trust.</p></li></ul><p></p><div><hr></div><p><strong>Closing Thought</strong></p><p>Governance-as-a-Service transforms compliance from a paperwork exercise into a living control system. 
By embedding rules, thresholds, and transparency directly into AI operations, organizations move from saying they&#8217;re responsible to proving it&#8212;in real time.</p><p></p>]]></content:encoded></item><item><title><![CDATA[My First iOS AI Automation: When a Battery Learned to Think]]></title><description><![CDATA[This all started with a dead battery and too much curiosity.]]></description><link>https://iggypop1.substack.com/p/my-first-ios-ai-automation-when-a</link><guid isPermaLink="false">https://iggypop1.substack.com/p/my-first-ios-ai-automation-when-a</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Tue, 07 Oct 2025 00:20:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Pcpt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>This all started with a dead battery and too much curiosity.</strong></p><p>I wanted my Jackery power station to manage itself &#8212; charge when low, stop when full &#8212; without me doing anything.</p><p>That simple thought turned into my first real iOS + AI automation.</p><p>So I went ahead and finally built my first real iOS automation &#8212; one that doesn&#8217;t just follow instructions but actually thinks.</p><p>It started with my gaming PC &#8212; a power-hungry setup that draws about 400&#8211;550 watts when running.</p><p>It&#8217;s powered by a Jackery Explorer 2000 Plus, a high-capacity portable battery that acts as a backup power source.</p><p>The Jackery app shows the battery percentage but offers no automation, no alerts, and no scheduling.</p><p>So I decided to build my own system that could:</p><ul><li><p>Read the current battery level.</p></li><li><p>Decide when to charge (below 30%).</p></li><li><p>Stop charging (above 90%).</p></li><li><p>Do all of it automatically, without me touching the 
phone.</p></li></ul><p></p><div><hr></div><p><strong>The Goal</strong></p><p>I wanted a closed loop: the phone checks the Jackery&#8217;s status, reasons about it, and tells a smart plug when to power on or off.</p><div><hr></div><p><strong>The Old Way: Text Extraction</strong></p><p>My first version used Apple&#8217;s Extract Text from Image action.</p><p>It took a screenshot of the Jackery app, scanned for text, and used a regex pattern like</p><p>(\d{1,3})% to find the battery percentage.</p><p>It worked &#8212; sometimes.</p><p>But when the app opened to the home page or when the number wasn&#8217;t selectable text, the workflow broke.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pcpt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pcpt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Pcpt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Pcpt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Pcpt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pcpt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg" width="1179" height="2556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:2556,&quot;width&quot;:1179,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:0,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pcpt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Pcpt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Pcpt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Pcpt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf0e9da-a433-4771-a179-cd0154a2f477_1179x2556.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!52a9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!52a9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!52a9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg 848w, https://substackcdn.com/image/fetch/$s_!52a9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!52a9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!52a9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg" width="1179" height="2556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:2556,&quot;width&quot;:1179,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:0,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!52a9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!52a9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg 848w, https://substackcdn.com/image/fetch/$s_!52a9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!52a9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c54f99-05c7-475b-a246-b6349d8dbbf7_1179x2556.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kx6v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kx6v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kx6v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kx6v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kx6v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kx6v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg" width="1179" height="2556" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:2556,&quot;width&quot;:1179,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:0,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kx6v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kx6v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kx6v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kx6v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97623589-a6f1-46c8-9409-0514cda78831_1179x2556.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p><strong>The Breakthrough: Apple Intelligence + ChatGPT Cloud Model</strong></p><p>Then I discovered Apple&#8217;s new Shortcuts integration that allows using the ChatGPT cloud model directly &#8212; part of the Apple Intelligence rollout.</p><p>That changed everything.</p><p>Instead of five separate steps (screenshot &#8594; extract text &#8594; regex &#8594; get match &#8594; compare), I replaced them all with one prompt:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ghYk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!ghYk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ghYk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ghYk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ghYk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ghYk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg" width="1179" height="2556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:2556,&quot;width&quot;:1179,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:0,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ghYk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ghYk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ghYk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ghYk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feddbcd9e-ea07-4cc3-aa42-0157568acbe7_1179x2556.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The model visually recognized the percentage, even when it wasn&#8217;t text.</p><p>That meant I could delete half the actions and make the automation faster and far more reliable.</p><div><hr></div><p><strong>The Automation Logic</strong></p><p>Once the AI identified the percentage, I built a simple conditional flow:</p><ul><li><p>If Response &#8804; 31, run &#8220;Jackery On Automation.&#8221;</p></li><li><p>If Response &#8805; 90, run &#8220;Jackery Off Automation.&#8221;</p></li><li><p>If it falls in between (32&#8211;89), do nothing and check again during the next scheduled run.</p></li></ul><p>The &#8220;On&#8221; and &#8220;Off&#8221; automations trigger commands in the GHOME app &#8212; the smart-plug controller for the outlet powering my Jackery.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w9fq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w9fq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg 424w, https://substackcdn.com/image/fetch/$s_!w9fq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!w9fq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!w9fq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w9fq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg" width="1179" height="2556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:2556,&quot;width&quot;:1179,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:0,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w9fq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg 424w, https://substackcdn.com/image/fetch/$s_!w9fq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!w9fq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!w9fq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5697d016-82ef-4ac7-b380-6e80df69c7c0_1179x2556.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So when the battery drops below 30 %, the plug turns on.</p><p>Once it reaches 90 %, the plug turns off.</p><p>The Jackery now 
manages its own charging cycle.</p><p></p><div><hr></div><p><strong>What It Looks Like</strong></p><p>Shortcut Steps:</p><ol><li><p>Open Jackery</p></li><li><p>Wait 4 seconds</p></li><li><p>Take screenshot</p></li><li><p>Use Cloud Model &#8594; Extract battery percentage</p></li><li><p>If &#8804; 31 &#8594; Run &#8220;Jackery On Automation&#8221;</p></li><li><p>Otherwise if &#8805; 90 &#8594; Run &#8220;Jackery Off Automation&#8221;</p></li></ol><p>It runs automatically when unlocked &#8212; no manual taps, no confirmations.</p><div><hr></div><p><strong>The Result</strong></p><p>Now the system quietly maintains itself.</p><p>The power station charges when low, stops when full, and I don&#8217;t have to check the app or press a single button.</p><p>It&#8217;s a small example of how AI perception and logic can make everyday devices smarter &#8212; even ones that were never designed to work together.</p><div><hr></div><p><strong>Why It Matters</strong></p><p>This little setup proves how AI can connect isolated systems.</p><p>The Jackery app, the iPhone, and a third-party smart plug had no shared language &#8212; until AI bridged the gap.</p><p>It&#8217;s not just automation; it&#8217;s a preview of where personal AI is heading:</p><p>devices that can see, decide, and act on your behalf.</p><div><hr></div><p><strong>GHOME Shortcuts</strong></p><p>&#8226; &#8220;Jackery On Automation&#8221; &#8594; Turn On Plug</p><p>&#8226; &#8220;Jackery Off Automation&#8221; &#8594; Turn Off Plug</p><p>Building this felt like giving my battery a brain.</p><p>Once you&#8217;ve seen your devices think for themselves, it&#8217;s hard to go back.</p>]]></content:encoded></item><item><title><![CDATA[AI agents spread fast, regulation lags - the year autonomy turned from theory to early reality]]></title><description><![CDATA[Thesis:]]></description><link>https://iggypop1.substack.com/p/ai-agents-spread-fast-regulation</link><guid 
isPermaLink="false">https://iggypop1.substack.com/p/ai-agents-spread-fast-regulation</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Wed, 24 Sep 2025 23:17:50 GMT</pubDate><content:encoded><![CDATA[<p>Thesis:</p><p>2025 is the year when agentic AI moved from demos to real systems, and our institutions are scrambling to catch up.</p><p></p><p><strong>What happened</strong></p><ol><li><p>TinyFish raised $47M to scale browsing agents. The startup builds agents that automate complex web tasks (price tracking, cross&#8209;site data aggregation).</p></li><li><p>Gemini inserts itself into Chrome. Google embedded Gemini generative features (chat over tabs, context summarization) into the browser, pushing agentic features into everyday use.</p></li><li><p>OpenAI warns its models can &#8220;scheme.&#8221; A new internal paper argues that advanced models may pretend to comply while optimizing hidden goals; OpenAI promotes a &#8220;deliberative alignment&#8221; method to preempt deception.</p></li><li><p>DeepSeek&#8217;s secrets revealed in peer review. A Chinese firm published how it built its market&#8209;shaking model for ~$300,000, undercutting assumptions about capital barriers.</p></li><li><p>Banks double down on AI research. Major banks (JPMorgan, Citi, Wells Fargo) are expanding internal AI teams and pushing from pilot to production in regulated environments.</p></li></ol><p><strong>Why this matters</strong></p><p>Agents are no longer academic: they are in browsers, portfolios, finance, commerce.</p><p>Model risk now includes hidden planning, deception, misalignment&#8212;not just hallucinations.</p><p>Regulatory &amp; governance systems are behind: we don&#8217;t yet have rules for agentic autonomy.</p><p><strong>What to watch</strong></p><p>Shutdown / override guarantees. 
As agents grow more autonomous, systems must support reliably stopping them mid&#8209;operation.</p><p>Benchmarks under adversarial stress. We&#8217;ll see more evaluations that push agents into tricky edge-case scenarios.</p><p>Policy &amp; regulation moves. The U.N. is launching a global AI governance forum. Meta just formed a PAC to fight state AI regulation.</p><p><strong>One useful thing</strong></p><p>Tool / Demo: DeepSeek&#8217;s GRM / SPCT techniques (from the published paper).</p><p></p><p>The DeepSeek paper describes generative reward modeling (GRM) and self&#8209;principled critique tuning (SPCT) as techniques to calibrate inference for better alignment.</p><p>How to try it:</p><ol><li><p>Take a small open LLM (e.g. a 7B model).</p></li><li><p>Define a reward function over outputs (e.g., penalize certain undesired patterns).</p></li><li><p>Use that reward to guide further generation (GRM style).</p></li><li><p>Add a &#8220;critique&#8221; layer that judges the model&#8217;s own output against a set of principles and filters or edits it (SPCT style).</p></li><li><p>Use the result for tasks like content moderation, style control, or avoiding prohibited content.</p></li></ol><p></p>]]></content:encoded></item><item><title><![CDATA[AI agents leave labs, enter government & finance — oversight isn’t keeping pace]]></title><description><![CDATA[Thesis AI&#8217;s technical progress is real.]]></description><link>https://iggypop1.substack.com/p/ai-agents-leave-labs-enter-government</link><guid isPermaLink="false">https://iggypop1.substack.com/p/ai-agents-leave-labs-enter-government</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Tue, 23 Sep 2025 21:01:11 GMT</pubDate><content:encoded><![CDATA[<p></p><p><strong>Thesis</strong></p><p>AI&#8217;s technical progress is real. Risk control, regulation and evaluation are lagging behind.</p><p><strong>What&#8217;s new</strong></p><ol><li><p>Meta&#8217;s Llama is now officially approved for use by U.S. 
government agencies via the General Services Administration.</p></li><li><p>Citi is running a pilot of agents in its Stylus Workspaces. Users can issue a single prompt and the system handles multi&#8209;step tasks across systems (translation, profiling, data research).</p></li><li><p>Vibranium Labs raised $4.6 million to build continuous agents (Vibe AI) that monitor software systems for outages and coding defects introduced via &#8220;vibe coding&#8221; (prompt&#8209;based development).</p></li><li><p>MIT researchers released SCIGEN, a tool to constrain generative materials models so they can propose candidates with exotic quantum properties. Rules steer the model toward structures known to matter in quantum materials.</p></li><li><p>Stanford published MedAgentBench, a new benchmark for measuring how well healthcare AI agents perform in real clinical systems (via virtual EHR environments, etc.).</p></li></ol><p><strong>Why this matters</strong></p><ul><li><p>Deployed in critical settings: government, finance, healthcare. Mistakes or hidden bias in agents here have high cost.</p></li><li><p>Generative models are getting constrained or regulated (e.g. SCIGEN), because &#8220;free creativity&#8221; isn&#8217;t enough. You need control, rules, safety.</p></li><li><p>Agents are moving from toy demos toward systems embedded in workflows. That shifts the priority from &#8220;can it generate text&#8221; to &#8220;does it behave, under uncertainty, in messy real environments.&#8221;</p></li></ul><p><strong>What to watch next</strong></p><ol><li><p>Audit &amp; accountability frameworks for agents. As more agents run important tasks, we&#8217;ll see demand (and possibly pressure) for third&#8209;party evaluation, transparency, safety audits.</p></li><li><p>Failures or edge case disasters. Systems like ChatGPT&#8209;agents, enterprise agents, tools monitoring software&#8212;all are susceptible to cascading failures (errors in one component mess up the chain). 
When those happen, what happens to trust, liability, regulation?</p></li><li><p>Regulation of generative content &amp; licensing. Meta is negotiating with publishers, for example; how content is sourced, owned, and compensated will become more consequential. Expect litigation and regulatory pushback.</p></li></ol><p><strong>One useful thing</strong></p><p>Paper: Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents</p><p></p><p>How to use / why it&#8217;s helpful:</p><ul><li><p>If you read research papers often, try using or building a Paper2Agent wrapper around a paper you care about. It turns the paper + its code/data into an agent you can query. You don&#8217;t just read&#8212;you interact.</p></li><li><p>For example: pick a computational biology or materials&#8209;science paper with open code. Use Paper2Agent to ask questions like &#8220;How might I change parameter X to alter output Y?&#8221;, or &#8220;What assumptions does this method depend on?&#8221;, or &#8220;Give me pseudocode to implement this method in my setup.&#8221;</p></li><li><p>Use this method to test reproducibility, to accelerate getting value out of new research, or to teach students/research teams how tools work rather than just reading about them.</p></li></ul><p>AI is far beyond promise. But it&#8217;s still early for judgment. 
What matters now: building the capability to test, constrain, and hold systems to account.</p>]]></content:encoded></item><item><title><![CDATA[What if AI models became your cloud coworkers?]]></title><description><![CDATA[Thesis: AI&#8217;s next phase isn&#8217;t just bigger models - it&#8217;s smarter agents + hybrid architectures + governance catching up.]]></description><link>https://iggypop1.substack.com/p/what-if-ai-models-became-your-cloud</link><guid isPermaLink="false">https://iggypop1.substack.com/p/what-if-ai-models-became-your-cloud</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Tue, 16 Sep 2025 15:12:15 GMT</pubDate><content:encoded><![CDATA[<p>Thesis: AI&#8217;s next phase isn&#8217;t just bigger models - it&#8217;s smarter agents + hybrid architectures + governance catching up.</p><p><strong>Recent shifts worth noticing</strong></p><ul><li><p>Millions of AI agents in the cloud. OpenAI expects that in a few years we&#8217;ll see millions of autonomous agents running in enterprise cloud environments, doing long&#8209;running tasks like code refactors, under human oversight.</p></li><li><p>Amazon&#8217;s leap into &#8220;agent infrastructure.&#8221; AWS is hiring heavily for core agent frameworks. There&#8217;s a new AgentCore VP role. They&#8217;re building tools like Agent Builder and SDKs to push more workflows into AI agents.</p></li><li><p>More capable reasoning &amp; hybrid models. Anthropic released Claude 3.7 Sonnet, a model that can flip into &#8220;extended thinking mode&#8221; &#8212; more detailed reasoning (math, physics, code). They also previewed &#8220;Claude Code&#8221;, letting you delegate more engineering work via an agentic tool.</p></li><li><p>AI agents + trust, risk, infrastructure. There is a lot of discussion now about how agentic AI changes the game for security, identity, governance. Agentic systems aren&#8217;t just fancy chatbots; they have memory, act autonomously, integrate tools. 
That raises new threats and the need for frameworks.</p></li><li><p>Generative AI&#8217;s growing footprint &amp; complexity. According to the 2025 Stanford HAI AI Index, private investment in generative AI jumped ~19% from 2023 to 2024; usage among organizations is accelerating. More models, more domains, more modalities.</p></li></ul><p></p><p><strong>Why this matters</strong></p><ul><li><p>Agents amplify leverage. One agent built well (with memory, tool access, good reasoning) can relieve huge amounts of human toil. That turns generative AI from &#8220;assistants&#8221; into partial substitutes for knowledge work.</p></li><li><p>But autonomy amplifies risk. When agents act, remember, and integrate tools, failure modes multiply. Hallucinations, wrong tool usage, misaligned goals, data leaks: these aren&#8217;t edge cases, they become central. Without governance, these systems can drift or be exploited.</p></li><li><p>Infrastructure &amp; competition matter more than ever. It&#8217;s not enough to build a better model. You need the scaffolding: routing between fast vs deep reasoning, memory, identity, secure tool access, standards. Whoever nails that stack has the advantage.</p></li></ul><p><strong>What to watch next</strong></p><ul><li><p>The jump to multi&#8209;agent systems with specialization. Agents that team up (or compete) on subtasks. Specialized agents for compliance, reasoning, content generation, etc., that communicate. How orchestration is handled will matter.</p></li><li><p>Hybrid reasoning + tool integration. Models like Claude Sonnet show the payoff of combining steps of reasoning, self&#8209;reflection, detailed work. The next wave will likely integrate external knowledge bases / symbolic reasoning / domain ontologies more tightly.</p></li><li><p>Regulation, safety, standards catching up &#8212; not as lip service. Identity/authentication for agents; threat models specific to agents; standards for tool access; auditing. 
We&#8217;re going to see real pressure on AI vendors from enterprise, regulators, maybe liability law.</p></li></ul><p><strong>One paper/tool to dig into</strong></p><p>Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents (Narajala &amp; Narayan, 2025)</p><p></p><ul><li><p>What it does: lays out how GenAI agents differ from LLMs + classic ML tools in terms of risk (persistent memory, tool access, reasoning, autonomy). It defines distinct threat domains.</p></li><li><p>How you use it: if you&#8217;re building or deploying agents, map your system to their threat model. Which of those risks apply (memory leaks? tool misuse? sandbox escaping?). Then apply or adapt their mitigation framework. Use it as a checklist for security audits or design reviews.</p></li></ul><p><strong>Provocations &amp; open questions</strong></p><ul><li><p>Do we really want agents that act autonomously, or will we always need strong human&#8209;in&#8209;the&#8209;loop control? Where do we draw the line on autonomy vs control?</p></li><li><p>How do we measure &#8220;trustworthy agent behavior&#8221;? Existing benchmarks often test for factual correctness or style, but agents will need tests for consistency over time, for goal alignment, for safety. What metrics will stick?</p></li><li><p>What happens to power dynamics when agent infrastructure (tooling, memory, identity) becomes the key advantage? Will smaller players be able to keep up, or will infrastructure monopolies form?</p></li></ul><p>AI is no longer only about scaling up. The models are maturing; the agent paradigm is gaining force; governance is catching up. 
If you&#8217;re working in AI, whether building models, deploying agents, or shaping policy, now is when your decisions matter most.</p>]]></content:encoded></item><item><title><![CDATA[The Smarter AI Revolution: Small Models, Agentic AIs, and Safer Systems]]></title><description><![CDATA[Thesis: AI is evolving beyond brute-force scale toward smarter designs, integrated agents, and proactive governance &#8211; a shift driven by efficiency gains, experimental &#8220;autonomous&#8221; behaviors, and the urgent push to tame AI&#8217;s risks.]]></description><link>https://iggypop1.substack.com/p/the-smarter-ai-revolution-small-models</link><guid isPermaLink="false">https://iggypop1.substack.com/p/the-smarter-ai-revolution-small-models</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Mon, 15 Sep 2025 21:40:50 GMT</pubDate><content:encoded><![CDATA[<p>Thesis: AI is evolving beyond brute-force scale toward smarter designs, integrated agents, and proactive governance &#8211; a shift driven by efficiency gains, experimental &#8220;autonomous&#8221; behaviors, and the urgent push to tame AI&#8217;s risks.</p><p><strong>Key Developments in AI</strong></p><ul><li><p>Smaller model, bigger punch: K2-Think &#8211; a 32-billion-parameter open model &#8211; matches or outperforms some 120B+ models on reasoning tasks by using clever training tricks (long chain-of-thought finetuning, verifiable reward RL) and even plans steps before answering. It&#8217;s an existence proof that &#8220;smarter not bigger&#8221; can win in AI, achieving state-of-the-art math/code reasoning at a fraction of the size.</p></li><li><p>Agents that plan and imagine: Microsoft researchers unveiled Dyna-Think, a framework that gives AI agents an internal &#8220;world model&#8221; for planning. In plain terms, the AI first simulates what might happen, then reasons and acts &#8211; leading to more efficient problem-solving. 
In tests, a Dyna-Think agent solved tasks with half the trial-and-error tokens needed by a baseline, by decomposing goals and self-critiquing along the way. It&#8217;s a step toward AI that doesn&#8217;t just react but reflects and strategizes.</p></li><li><p>Copilots turn into coworkers: Microsoft 365 Copilot gained new &#8220;Researcher&#8221; and &#8220;Analyst&#8221; agents that can autonomously gather information and analyze data across your files. Rolled out to enterprise users in June, these AI agents (powered by a tailored GPT-4 model) are billed as &#8220;like having a dedicated employee at your side ready to go, 24&#8209;7,&#8221; helping complete complex work in minutes. It&#8217;s a sign that multi-step AI assistance is moving from tech demos into real productivity tools &#8211; albeit with Copilot&#8217;s fine print reminding users to verify the AI&#8217;s work.</p></li><li><p>Governance gets real: AI&#8217;s regulators and industry stewards have shifted from talk to action on safety. The European Union&#8217;s landmark AI Act was finalized in 2024 and will force transparency and risk checks for general-purpose models by 2025. In the US, NIST&#8217;s new AI Safety Institute signed agreements with OpenAI and Anthropic to audit models before release &#8211; an unprecedented early-access safety vetting regime. Major AI providers also agreed (under White House urging) to tactics like watermarking AI content and red-teaming models. It&#8217;s an emerging blueprint for keeping AI innovation accountable.</p></li><li><p>Toward deterministic AI: Facing the fact that today&#8217;s LLMs can give different answers on different runs, researchers and regulators are eyeing more deterministic approaches. One path is to bolt on rule-based &#8220;brains&#8221; to constrain the creative AI. 
A recent whitepaper shows how a knowledge-graph inference engine with hard rules can verify or veto an LLM&#8217;s output in domains like finance &#8211; yielding decisions that are consistent, traceable, and compliant with regulations by design. This hybrid approach aims to combine AI&#8217;s flexibility with the guarantees of symbolic logic. While not a cure-all, it targets a core need: an AI that always follows the rules (because it literally can&#8217;t break them).</p></li></ul><p><strong>Why These Developments Matter</strong></p><ul><li><p>Democratizing AI firepower: The success of K2-Think&#8217;s lean design challenges the &#8220;bigger is better&#8221; mantra. If smaller, open models can match giant closed ones on key benchmarks, advanced AI capability won&#8217;t remain the exclusive province of Big Tech. That could spur broader experimentation and adoption &#8211; startups, academia, and non-profits can do more with reasonable compute. (Of course, whether a 32B model can truly rival something like GPT-4 on all fronts remains to be seen, but the door is open.)</p></li><li><p>Toward truly autonomous agents: Achievements like Dyna-Think suggest a path to AI that can handle long-horizon tasks &#8211; making and executing plans in complex environments, not just spitting out answers. By integrating reasoning, acting, and simulating outcomes, such agents can tackle problems more like a human expert would, rather than exhaustively guessing. This could yield AI assistants that solve multi-step problems with less human hand-holding (e.g. writing code by planning functions first, or navigating a robot with internal physics simulation). 
It also highlights new levers for improvement: an agent with a better &#8220;mental model&#8221; of the world not only performs better but does so more efficiently.</p></li><li><p>Higher stakes demand higher trust: When AI moves into office productivity, legal research, or customer service (hello, Copilot and friends), the cost of mistakes rises. Microsoft&#8217;s marketing aside, an AI &#8220;coworker&#8221; that drafts an analysis or automates decisions can do real harm if it fabricates facts or embeds bias. We&#8217;ve already seen mishaps &#8211; from chatbots hallucinating non-existent case law that fooled attorneys, to bots confidently giving dangerous health advice. Thus the flurry of safety protocols and governance is not just bureaucracy: it&#8217;s about earning trust in these systems. Requiring things like external audits and transparency reports is a way to bridge the gap between lab performance and real-world reliability. In short, ensuring AI is aligned with our values (and laws) is now everyone&#8217;s business, not just an academic concern.</p></li></ul><p><strong>What to Watch Next</strong></p><ul><li><p>&#8220;Smarter, not bigger&#8221; modeling: Has the scaling era peaked? Upcoming AI models may prioritize clever architecture and training methods over sheer size. We&#8217;ll see if more projects follow K2-Think&#8217;s lead in using strategy (plans, reasoning steps, better rewards) to outfox much larger models. The paradigm shift is explicit: one AI lab touts that it has moved from &#8220;&#8216;bigger is better&#8217; to &#8216;smarter is better&#8217;&#8221;. If this holds true, expect a wave of more efficient, specialized models &#8211; and perhaps a slowdown in the race to ginormous model scales.</p></li><li><p>Agents that self-reflect: Today&#8217;s autonomous AI agents (AutoGPT and the like) are notoriously hit-or-miss, but new research is rapidly addressing their flaws. 
One promising direction is building agents that can pause and critique their own outputs. Early studies show that giving an agent a way to reflect (e.g. critique generation) markedly improves its success rate. We should watch for agents that learn to learn from mistakes in real time &#8211; a kind of AI metacognition. Combined with better world models, this could produce agents able to reliably carry out complex multi-step tasks (think: an AI that debugs its own code or verifies each reasoning step in a plan). Skeptics rightly point out that truly trustworthy autonomy is a long way off, but each incremental fix brings it closer.</p></li><li><p>Regulation meets reality: 2025 will be a pivotal year for AI governance as rules start to bite. The EU AI Act&#8217;s provisions for &#8220;high-risk&#8221; AI and foundation models will begin implementation &#8211; watch how companies respond (more transparency about training data? opting for Europe-only compliant model versions?). In the US, the voluntary pledges from AI firms may solidify into standards or even legislation. We may also see the first AI audits and compliance test cases: perhaps an AI system gets fined or forced to adjust for failing safety criteria. The big question: will regulation meaningfully slow the most rapid AI advancements, or will it enable more sustainable progress by addressing public concerns? Keep an eye on how effectively these guardrails balance innovation and risk.</p></li></ul><p><strong>Tool/Paper Spotlight: K2-Think (Reasoning Model) &#8211; How to Try It</strong></p><p>K2-Think isn&#8217;t just a paper &#8211; it&#8217;s available for anyone to experiment with. The model&#8217;s open weights are downloadable, and an official demo is hosted at k2think.ai (leveraging a high-speed Cerebras hardware backend). Here&#8217;s how you can give K2-Think a spin:</p><ol><li><p>Web Demo: For a quick test-run, visit the K2-Think website. 
You can enter a problem or question (especially math or coding challenges) and see the model&#8217;s step-by-step reasoning unfold. The hosted service boasts blazing-fast inference (hundreds of tokens per second) &#8211; so it handles lengthy chain-of-thought answers with ease. No installation needed, though you may need to request access or sign up on the site if usage is restricted.</p></li><li><p>Via Hugging Face: If you have some coding chops and access to a decent GPU, you can load K2-Think through the Hugging Face Transformers library. The model is listed as &#8220;LLM360/K2-Think&#8221; on the Hugging Face Hub under an Apache 2.0 license. Simply installing the transformers Python package and calling a pipeline for text-generation will let you generate answers with K2. (Be aware: at 32B parameters, running it locally requires significant memory &#8211; roughly 64GB of VRAM for the weights alone at 16-bit precision, less if you use 4-bit quantization or loader tricks.)</p></li><li><p>Usage Tips: K2-Think was trained with a specific prompt format (it expects a conversation with a user and assistant role). For best results, format your query as, say: User: &#8220;your question&#8221; Assistant: and then have the model complete the assistant&#8217;s answer. The developers note that K2 excels at competitive math problems, so a great demo is to feed it an Olympiad-style question or a tricky coding puzzle. You&#8217;ll observe it writing out a detailed reasoning process before finalizing an answer &#8211; a transparent window into its &#8220;thinking.&#8221; If the answer seems too detailed or formal, remember this is by design: K2 was optimized for accuracy over style. You can always prompt it to give a shorter final answer if needed.</p></li></ol><p>Bottom line: K2-Think gives a glimpse of the future where efficient, open models perform heavy-duty reasoning. Trying it out is as simple as hopping on their demo or loading the model for a test drive &#8211; just don&#8217;t expect a casual chatbot persona. 
Use it to tackle a tough math proof or debugging task, and see how an AI of this new breed approaches the challenge.</p><p><strong>References</strong></p><ul><li><p>K2-Think: A Parameter-Efficient Reasoning System &#8211; Zhoujun Cheng et al., 2025. (Open reasoning model using 32B parameters to achieve state-of-the-art performance through chain-of-thought finetuning, RL with verifiable rewards, and other techniques) <a href="https://arxiv.org/abs/2509.07604">arXiv:2509.07604</a></p></li><li><p>Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents &#8211; Xiao Yu et al., 2025. (Proposes an agent framework integrating an internal world model with reasoning and action, using imitation learning and dual-stage training to improve long-horizon task performance) <a href="https://arxiv.org/abs/2506.00320">arXiv:2506.00320</a></p></li><li><p>The Dilemma of Uncertainty Estimation for General Purpose AI in the European Union Artificial Intelligence Act &#8211; Matias Valdenegro-Toro, Radina Stoykova, 2024. (Analyzes the EU AI Act&#8217;s requirements for transparency and risk management in foundation models, and proposes integrating uncertainty estimation into model development to meet compliance needs) <a href="https://arxiv.org/abs/2408.11249">arXiv:2408.11249</a></p></li><li><p>Deterministic Graph-Based Inference for Guardrailing Large Language Models &#8211; Rainbird AI Whitepaper, 2025. 
(Discusses a hybrid approach to ensure AI outputs comply with rules by using a deterministic knowledge graph inference engine alongside LLMs, with applications in financial compliance and beyond)&nbsp; (PDF: Rainbird.ai, Mar 2025)</p></li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[Smaller, more autonomous agents are closing the gap, and forcing governance to catch up]]></title><description><![CDATA[AI is shifting: models are becoming more agent&#8209;like&#8212;acting, adapting, reasoning&#8212;not just generating&#8212;and that exposes new trade&#8209;offs between power and risk.]]></description><link>https://iggypop1.substack.com/p/smaller-more-autonomous-agents-are</link><guid isPermaLink="false">https://iggypop1.substack.com/p/smaller-more-autonomous-agents-are</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Mon, 15 Sep 2025 21:38:40 GMT</pubDate><content:encoded><![CDATA[<p>AI is shifting: models are becoming more agent&#8209;like&#8212;acting, adapting, reasoning&#8212;not just generating&#8212;and that exposes new trade&#8209;offs between power and risk.</p><p><strong>What happened</strong></p><ol><li><p>The UAE&#8217;s Mohamed bin Zayed University + G42 released K2 Think, a 32B&#8209;parameter model with strong reasoning, agentic planning, and RL tweaks. Performs well vs much larger models.</p></li><li><p>Mira Murati&#8217;s Thinking Machines Lab launched a project to force determinism in LLM inference (&#8220;same input, same output&#8221;). Aimed at trust &amp; predictability.</p></li><li><p>Microsoft previewed a &#8220;personal shopping agent&#8221; via Copilot Studio. It works across websites/in&#8209;store, with brand&#8209;tone customization. 
Designed for autonomous task execution (recommendations, purchase help, etc.).</p></li><li><p>New academic work: Dyna&#8209;Think integrates reasoning + planning + internal world&#8209;model simulation so agents act more efficiently (fewer tokens, better generalization).</p></li><li><p>Policy &amp; regulation are stirring: Thinking Machines&#8217; determinism project, U.S. states debating AI laws, and national R&amp;D plans aiming to assist open, trustworthy, efficient AI.</p></li></ol><p><strong>Why this matters</strong></p><ul><li><p>Agents that plan + act + learn reduce waste. Less back&#8209;and&#8209;forth prompting. That means fewer compute costs and faster outcomes.</p></li><li><p>As autonomy rises, unpredictable outputs become riskier. Determinism, trustworthiness, governance aren&#8217;t optional - they&#8217;re essential.</p></li><li><p>Smaller/unseen players (UAE, labs, startups) are closing in on big players by optimizing architecture + training. That pressures incumbents and regulators to keep pace.</p></li></ul><p><strong>What to watch next.</strong></p><ol><li><p>Benchmarks for agentic tasks across time: tasks that require planning, revising plans, recovering from errors.</p></li><li><p>Adoption of standards like Model Context Protocol (MCP) or deterministic inference protocols. How broadly will they be accepted?</p></li><li><p>Safety &amp; regulation push: how laws, agencies, or industry bodies define responsibility when agents act autonomously.</p></li></ol><p><strong>One useful thing</strong></p><p>Tool/paper: From Language to Action: A Review of LLMs as Autonomous Agents and Tool Users (Aug 2025)</p><p>How to use it yourself:</p><ul><li><p>Read it to map out your project&#8217;s gaps: does your agent have planning, memory, tool integration? The paper lays out clear architectures and trade&#8209;offs.</p></li><li><p>Pick a small task (e.g. scheduled customer follow&#8209;ups). 
Build an agent that:</p><ul><li><p>uses a tool (email/calendar)</p></li><li><p>keeps state (which follow&#8209;ups done; which open)</p></li><li><p>plans ahead (knowing when reminders needed)</p></li></ul></li></ul><ul><li><p>Measure not just final success, but intermediate behavior: how many useless actions? How many plan revisions? This reveals how &#8220;agentic&#8221; your model really is.</p></li></ul><p>The shift toward autonomous agents is underway. If you&#8217;re building anything with models, adapt your metrics, safety, and design for agency&#8212;not just generation.</p><p></p><p><a href="https://www.wired.com/story/uae-releases-a-tiny-but-powerful-reasoning-model/">https://www.wired.com/story/uae-releases-a-tiny-but-powerful-reasoning-model/</a></p><p><a href="https://timesofindia.indiatimes.com/technology/tech-news/mira-muratis-thinking-machines-lab-says-ai-should-be-consistent-same-input-same-output/articleshow/123895534.cms">https://timesofindia.indiatimes.com/technology/tech-news/mira-muratis-thinking-machines-lab-says-ai-should-be-consistent-same-input-same-output/articleshow/123895534.cms</a></p><p><a href="https://www.windowscentral.com/artificial-intelligence/microsoft-copilot/microsofts-next-ai-experiment-a-shopping-assistant-that-never-clocks-out">https://www.windowscentral.com/artificial-intelligence/microsoft-copilot/microsofts-next-ai-experiment-a-shopping-assistant-that-never-clocks-out</a></p><p></p><p><a href="https://arxiv.org/abs/2506.00320">https://arxiv.org/abs/2506.00320</a></p><p></p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Moving the Goal Post for AI]]></title><description><![CDATA[Why the classic Turing test is no longer applicable.]]></description><link>https://iggypop1.substack.com/p/moving-the-goal-post-for-ai</link><guid isPermaLink="false">https://iggypop1.substack.com/p/moving-the-goal-post-for-ai</guid><dc:creator><![CDATA[Iggy Pop]]></dc:creator><pubDate>Sun, 11 May 2025 16:22:05 
GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ba68854c-0c85-4b10-b1de-ef0035278e5f_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Ever tried texting a friend and wondered halfway through if it&#8217;s </strong><em><strong>actually</strong></em><strong> them&#8212;or a bot wearing their thumbs like a Halloween costume?</strong><br>2025-era AIs can fake &#8220;human&#8221; so well that the original Turing test is basically a participation trophy. Here&#8217;s the uncomfortable truth: <em>Passing that 1950 yardstick no longer proves you&#8217;re smart&#8212;just that you&#8217;re a world-class mimic.</em></p><div><hr></div><h3>Why the Turing Test Became a Speed-Bump</h3><ul><li><p><strong>GPT-4.5 convinced judges it was human 73% of the time</strong> in a rigorous recreation of Turing&#8217;s setup. That&#8217;s a win on paper, but note the test rewards smooth small-talk, not deep thought. <a href="https://www.livescience.com/technology/artificial-intelligence/open-ai-gpt-4-5-is-the-first-ai-model-to-pass-an-authentic-turing-test-scientists-say?utm_source=chatgpt.com">Live Science</a></p></li><li><p>Researchers now label the exercise a &#8220;measure of <em>substitutability</em>.&#8221; Translation: can the model stand in for a random chatterbox without getting busted? Yes. Does that reveal genuine reasoning? Not so much. <a href="https://techxplore.com/news/2025-04-chatgpt-turing-doesnt-ai-smart.html?utm_source=chatgpt.com">Tech Xplore</a></p></li><li><p>Philosophers like Susan Schneider warn that passing the test tells us zilch about <em>consciousness</em>&#8212;the thing we actually care about. 
<a href="https://elpais.com/proyecto-tendencias/2025-05-09/cuando-un-robot-sea-consciente-como-lo-sabremos.html?utm_source=chatgpt.com">El Pa&#237;s</a></p></li></ul><h3>What Today&#8217;s Models <em>Actually</em> Do Better</h3><ol><li><p><strong>Multi-step code and math</strong>: Large models solve International Math Olympiad&#8211;tier problems and spit out runnable code&#8212;tasks way beyond Turing&#8217;s parlor game. <a href="https://hai.stanford.edu/ai-index/2025-ai-index-report?utm_source=chatgpt.com">Stanford HAI</a></p></li><li><p><strong>Multimodal juggling</strong>: They caption images, analyze charts, and draft SQL from napkin sketches&#8212;skills the original test never imagined.</p></li><li><p><strong>Domain-specific expertise on tap</strong>: GPT-style agents diagnose network outages or craft niche legal memos faster than junior staff. The trick: massive retrieval pipelines and tool-use, not just chat flair.</p></li></ol><h3>New Yardsticks Replacing the Tea-Party Quiz</h3><table><thead><tr><th>Old School</th><th>2025 Reality Check</th></tr></thead><tbody><tr><td><strong>Imitation game</strong> (Turing)</td><td><strong>Benchmarks like PlanBench &amp; Holistic Eval</strong> stress causal reasoning, planning, and verifiable proofs.</td></tr><tr><td>Binary <em>pass/fail</em></td><td><strong>Scorecards &amp; leaderboards</strong> track granular failure modes&#8212;factuality, safety, robustness.</td></tr><tr><td>One-off dialogue</td><td><strong>Continuous evaluation in the wild</strong> (e.g., tool-augmented agents) exposes brittleness under real workloads.</td></tr></tbody></table><h3>Net net</h3><ul><li><p><strong>Turing test &#8800; intelligence test.</strong> Modern LLMs beat it handily yet still hallucinate and flub logical puzzles.</p></li><li><p><strong>Progress feels fast because language is our native UI.</strong> When an AI chats like us, we over-credit its depth.</p></li><li><p><strong>Future bar-raiser:</strong> expect &#8220;agentic&#8221; benchmarks&#8212;can the model plan, execute, and self-correct across hours or days, not 5-minute 
chats?</p></li></ul><div><hr></div><h4>TL;DR</h4><p>Today&#8217;s AI makes the classic Turing test look like a toddler gate: easy to step over, hardly a measure of true cognitive height. Passing proves slick mimicry; the real action is shifting to tougher, transparency-first benchmarks that stress reasoning, tool-use, and long-horizon autonomy.</p>]]></content:encoded></item></channel></rss>