Latent Space: The AI Engineer Podcast

Technology Podcasts


Location:

United States

Genres:

Technology Podcasts

Business & Economics Podcasts

Entrepreneurship

Description:

The podcast by and for AI Engineers! In 2025, over 10 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. We strive to give you everything from the definitive take on the Current Thing to the first introduction to the tech you'll be using in the next 3 months! We break news and run exclusive interviews with OpenAI, Anthropic, Gemini, Meta (Soumith Chintala), Sierra (Bret Taylor), tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space

Language:

English

Website:

https://www.latent.space/podcast

Episodes

AIE Europe Debrief + Agent Labs Thesis: Unsupervised Learning x Latent Space Crossover Special (2026)

4/23/2026

Today, we check in a year after the first Unsupervised Learning x Latent Space Crossover special to discuss everything that has changed (there is a lot) in the world of AI. This episode was recorded just after AIE Europe, but before the Cursor-xAI deal.

Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what’s real today, what will be real in the future and what it means for businesses and the world, helping builders, researchers and founders deconstruct and understand the biggest breakthroughs. Thanks to Jacob and the UL production team for hosting and editing this!

Jacob Effron
* LinkedIn: https://www.linkedin.com/in/jacobeffron/
* X: https://x.com/jacobeffron

Full Episode on Their YouTube

We discuss:
* swyx’s view from the center of the AI engineering zeitgeist: OpenClaw, harness engineering, context engineering, evals, observability, GPUs, multimodality, and why conference tracks now reveal what matters most in AI
* Whether AI infrastructure has finally stabilized: why “skills” may be the minimal viable packaging format for agents, why infra companies have had to reinvent themselves every year, and why application companies have had an easier time surviving model volatility
* The vertical vs. horizontal AI startup debate: why application companies can act as the outsourced AI team for enterprises, why some horizontal companies still matter, and why sandboxes may be the clearest reinvention of classic cloud infrastructure for the AI era
* The “agent lab” playbook: starting with frontier models, specializing for your domain, then training your own models once you have enough data, workload, and user behavior to justify the cost and latency savings
* Why domain-specific model training is real, not just marketing: how companies like Cursor and Cognition can get users to choose their in-house models, and why search, domain specialization, and distillation are becoming more important
* Open models, custom chips, and alternative inference infrastructure: why swyx has turned more bullish on open source, why non-NVIDIA hardware is suddenly getting real attention, and why every 10x speedup can unlock new product experiences
* What it means to sell to agents instead of humans: why agent experience may mostly just be good developer experience by another name, why APIs and docs matter more than ever, and how pretraining-data incumbents are compounding advantages in an agent-first world
* Why memory and personalization may become the next big wedge: today’s models mostly reward frequency of mentions, but in the future, swyx expects product choice to be shaped much more by personalized memory systems
* The state of the AI coding wars: why coding has become one of the largest and fastest-growing categories in AI, how Anthropic, OpenAI, Cursor, and Cognition have all ridden the wave, and why the category may still have more room to run
* Capability exploration vs. efficiency: why the industry is still in a token-maxing, experiment-heavy phase where people are rewarded for spending more rather than less
* Claude Code vs. Codex and the strange stickiness of coding products: why first magical product experiences may matter more than expected, and why the bigger mystery may be why only a few names have emerged as real winners so far
* What the end state of the coding market might look like: two major players, a longer tail of niche products, and possible disruption if Microsoft, Mistral, xAI, or the Chinese labs push harder into coding
* Where application companies still have room against the labs: why frontier labs are trying to expand into verticals like finance and healthcare, but still leave space for focused companies that own the workflow and the last mile
* Why coding may be a preview of every other AI market: the first category to truly go parabolic, the clearest example of foundation model companies colliding with application companies, and a template for how future vertical AI...

Duration:00:54:52

Shopify’s AI Phase Transition: 2026 Usage Explosion, Unlimited Opus-4.6 Token Budget, Tangle, Tangent, SimGym — with Mikhail Parakhin, Shopify CTO

4/22/2026

Early bird discounts for the San Francisco World’s Fair, the biggest AIE gathering of the year, end today - prices will go up by ~$500 tonight so do please lock in ASAP!

From near-universal AI tool adoption inside Shopify to internal systems for ML experimentation, auto-research, customer simulation, and ultra-low-latency search, Mikhail Parakhin joins us for a deep dive into what it actually looks like when a 20-year-old, $200B software company goes all-in on AI. We cover why Shopify has become much more vocal about its internal stack, what changed after the December model-quality inflection, and why the real bottleneck in AI coding is no longer generation, but review, CI/CD, and deployment stability.

We also go inside Tangle, Tangent, and SimGym, three major AI initiatives that Shopify is running to make experimentation reproducible, optimization automatic, customer behavior simulatable, and search and catalog intelligence faster and cheaper at scale. Along the way, Mikhail explains UCP, Liquid AI, why token budgets are directionally right but often measured badly, why AI-written code can still increase bugs in production, what makes Shopify’s customer simulation defensible, and what he learned from the Sydney era at Bing.

We discuss:
* Mikhail’s path from running a major Microsoft business unit spanning Windows, Edge, Bing, and ads to becoming CTO of Shopify
* Why Shopify is talking more publicly about AI now, and why staying at the frontier has become necessary for the company
* Shopify’s internal AI adoption curve, the December inflection, and why CLI-style tools are rising faster than traditional IDE-based tools
* Why Jensen Huang is directionally right on token budgets, but raw token count is still the wrong way to evaluate engineering output
* Why the real unlock is not more agents in parallel, but better critique loops, stronger models, and spending more on review than generation
* Why AI coding can still lead to more bugs in production even if models write cleaner code on average than humans
* Why Shopify built its own PR review flow, and why Mikhail thinks most off-the-shelf review tools miss the point
* How PR volume, test failures, and deployment rollback are becoming the real bottlenecks in the agent era
* Why Git, pull requests, and CI/CD may need a new metaphor once code is written at machine speed
* What Tangle is, and how Shopify uses it to make ML and data workflows reproducible, collaborative, and production-ready from the start
* Why Tangle is different from Airflow, and why content-addressed caching creates network effects across teams
* What Tangent is, and how Shopify is using auto-research loops to optimize search, themes, prompt compression, storage, and more
* Why Tangent is becoming a democratizing tool for PMs and domain experts, not just ML engineers
* Why AutoML finally feels real in the LLM era, and where auto-research still falls short today
* Why Tangle, Tangent, and SimGym become much more powerful when combined into one system
* What SimGym is, why simulated customers only work if you have real historical behavior, and why Shopify’s data gives it a moat
* How SimGym evolved from comparing A/B variants to telling merchants what to change on a single live storefront to raise conversions
* Why customer simulation is so expensive, from multimodal models to browser farms to serving and distillation costs
* How Shopify models merchant and buyer trajectories, runs counterfactuals, and thinks about interventions like discounts, campaigns, and notifications
* Why category-level behavior is so different across commerce, and why ideas like Chinese Restaurant Processes are showing up again in practice
* Shopify’s new UCP and catalog work, including runtime product search, bulk lookups, and identity linking
* Why Shopify is using Liquid AI, and why Mikhail sees it as the first genuinely competitive non-transformer architecture he has used in practice
* Where Liquid...

Duration:01:12:25
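The show notes above mention content-addressed caching as one reason Tangle differs from Airflow. The notes do not describe how Tangle implements it, so the following is only a minimal sketch of the general idea: a step's cache key is a hash of its name and inputs, so any team that runs the same step on the same inputs gets the stored result. The class `ContentAddressedCache`, the helper `tokenize`, and the in-memory store are all hypothetical names chosen for illustration, not part of Tangle.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class ContentAddressedCache:
    """Toy in-memory cache keyed by a hash of (step name, inputs).

    Hypothetical illustration only; not Tangle's actual design.
    """

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(step_name: str, inputs: Dict[str, Any]) -> str:
        # Canonical JSON (sorted keys) so equal inputs always hash equally.
        payload = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, step_name: str, fn: Callable[..., Any], inputs: Dict[str, Any]) -> Any:
        key = self.key_for(step_name, inputs)
        if key in self._store:
            # Same step and same inputs: reuse the stored result.
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = fn(**inputs)
        self._store[key] = result
        return result


def tokenize(text: str) -> list:
    # Stand-in for an expensive pipeline step.
    return text.lower().split()


cache = ContentAddressedCache()
first = cache.run("tokenize", tokenize, {"text": "Hello Shopify"})
second = cache.run("tokenize", tokenize, {"text": "Hello Shopify"})  # served from the cache
```

Because the key depends only on content, two people who never coordinate still share work whenever their steps and inputs match, which is one plausible reading of the "network effects across teams" claim.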

🔬 Training Transformers to solve 95% failure rate of Cancer Trials — Ron Alfa & Daniel Bear, Noetik

4/20/2026

Today, we explain this piece of “clickbait” from our guest! TL;DR: 95% of cancer treatments fail to pass clinical trials, but it may be a matching problem: if we better understood which patients have which tumors, and which of those will respond to which treatments, success rates could improve dramatically and millions of lives could be saved, with the treatments we ALREADY have. See our full episode dropping today:

Why Big Pharma is licensing AI Models

Tolstoy famously wrote, ‘All healthy cells are alike; each cancer cell is unhappy in its own way.’ Or something like that. Cancer might be the most misunderstood disease out there. It’s not one disease, it’s a family of diseases. Hundreds, maybe thousands, of unique diseases, each with its own underlying biology. With this lens, saying you’ll “cure cancer” is like saying you’ll solve legos. We keep hearing AI will cure cancer, but sadly it may not be so easy.

Today’s guests, Ron Alfa and Daniel Bear from Noetik, think they can use AI to break through a core bottleneck in the treatment development process. GSK recently signed a $50M deal for their technology that also includes an (undisclosed) long-term licensing deal for Noetik’s models like the recently announced TARIO-2, an autoregressive transformer trained on one of the largest sets of tumor spatial transcriptomics datasets in the world. Whole-plex spatial transcriptomics is the richest way to read a tumor, and approximately 0% of cancer patients going through standard care ever get one. TARIO-2 can now predict a ~19,000-gene spatial map from the H&E assay every patient already has.

Most big AI plays in BioTech have focused on discovery, and usually result in an in-house development effort (meaning tools companies usually become drug companies). This deal stands out in that it is a software licensing deal, and represents a commitment to a platform rather than a drug. With attention on other software tools for drug development (see the Boltz episode and Isomorphic for example), it is starting to look like the appetite of Pharma for biotech tools has finally started to grow. Why the sudden interest?

Cancer is hard

Biology is hard, cancer is harder. But despite this, we’ve made incredible progress. So many cancers that would have been death sentences twenty years ago are routinely survivable. It used to be our main strategy was just chemotherapy: poison you and hope the tumor dies before you do. Now, there are many treatments that actually kill a tumor and leave the rest of you intact! Immune checkpoint inhibitors like Keytruda and Opdivo target the defenses of dozens of tumor types. CAR-T therapy adds modified T-cells to your blood that can target B-cell malignancies very accurately. Antibody-drug conjugates such as trastuzumab deruxtecan combine a drug with an antibody, allowing it to target very specific (cancer) cells. We truly live in marvelous times.

With that said, we still have a long way to go. For every type of cancer with a miracle treatment, we have many more that are still death sentences. The world spends $20-30 billion a year trying to cure cancers, with hundreds of clinical trials yearly. Yet progress is slow, with a 95% failure rate in clinical trials.

The lab doesn’t translate to the clinic

Are we leaving something on the table? Enter Noetik and Ron Alfa. Ron’s core thesis is that many of these “failed” treatments actually work! But we’re not looking at the right patients with the right tumors. If only we had a way to really understand the unique types of cancer biologies and which patients will respond to which treatments, we might be able to show a much higher success rate. Millions of lives (and billions of dollars) may ride on this.

The Hard part: Blind Faith in Data Collection

Ron and Noetik had the conviction to spend almost two years just collecting data. Lots, and lots, and lots, of data. Noetik has acquired thousands of actual human tumors, and collects a large multimodal dataset of hundreds of millions of...

Duration:01:25:21

Notion’s Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future — Simon Last & Sarah Sachs of Notion

4/14/2026

For all those who missed out on London, see you in Miami next week!

Notion, the knowledge work decacorn, has been building AI tooling since before ChatGPT, with many hits, from Q&A in 2023 to unified AI in 2024 to Meeting Notes in 2025. At the end of their last Make user conference, Ryan Nystrom teased Notion 3.0’s Custom Agents - and they are finally embracing the Agent Lab playbook!

Sarah Sachs and Simon Last of Notion join us for a deep dive into how Notion built Custom Agents, why it took years and multiple rebuilds to get right, and what it means to turn a productivity tool into an agent-native system of record for enterprise work. We go inside the product, engineering, evals, pricing, and org design decisions behind one of the most ambitious AI product efforts in software today, from early failed tool-calling experiments in 2022 to agent harnesses, progressive tool disclosure, meeting notes as data capture, and the long-term vision for software factories and agentic work.

We discuss:
* Sarah and Simon’s path to launching Notion Custom Agents, and why the feature was rebuilt four or five times before it was ready for production
* Why early agent attempts failed: no tool-calling standard, short context windows, unreliable models, and too much complexity exposed to the model
* The “Agent Lab” thesis: not just wrapping a model, but understanding how people collaborate and building the right product system around frontier capabilities
* How Notion thinks about roadmap timing: not swimming upstream against model limitations, but also building early enough that the product is ready when the models are
* Why coding agents feel like the kernel of AGI, and how Notion is thinking about “software factories” made up of agents that spec, code, test, debug, review, and maintain codebases together
* How Sarah runs AI engineering at Notion (“notes from Token Town”): objective-setting over idea ownership, low-ego teams comfortable deleting their own work, and a culture designed to swarm around fast-changing opportunities
* The “Simon Vortex,” company hackathons, and why security gets pulled in early rather than late
* How Notion organizes AI: core AI capabilities and infrastructure, product packaging teams, and a broader company mandate that every product surface must increasingly work for both humans and agents
* Why prototypes have become much easier to build internally, and how “demos over memos” changes product development inside a tool the whole company already uses every day
* Notion’s eval philosophy: regression tests, launch-quality evals, and “frontier/headroom” evals that intentionally only pass ~30% of the time so the company can see where model capabilities are going
* What a “Model Behavior Engineer” is, and why Notion treats eval writing, failure analysis, and model understanding as a distinct function rather than just software engineering
* The changing role of software engineers in the age of coding agents, and why the new job looks less like typing code and more like supervising a rigorous outer system of agents, PRs, and verification loops
* How the “software factory” should work: specs, self-verification, bug flows, subagents, and minimizing human intervention while preserving the invariants that matter
* A live walkthrough of a Notion Custom Agent handling coworking space tenant applications by triaging email, enriching applicants with web search, and writing structured data into a Notion database
* How agents compose inside Notion: shared databases as primitives, agents invoking other agents, “manager agents” supervising dozens of specialized agents, and memory implemented simply as pages and databases
* Notion’s take on MCP vs CLI: why Simon is bullish on CLI’s self-debugging nature, where MCP still makes sense, and how Sarah thinks about capability, determinism, permissioning, and pricing alignment
* The evolution of Notion’s internal agent harness: from early JavaScript coding agents, to...

Duration:01:17:17

Extreme Harness Engineering for Token Billionaires: 1M LOC, 1B toks/day, 0% human code, 0% human review — Ryan Lopopolo, OpenAI Frontier & Symphony

4/7/2026

We’re proud to release this ahead of Ryan’s keynote at AIE Europe. Hit the bell, get notified when it is live! Attendees: come prepped for Ryan’s AMA with Vibhu after.

Move over, context engineering. Now it’s time for Harness engineering and the age of the token billionaires. Ryan Lopopolo of OpenAI is leading that charge, recently publishing a lengthy essay on Harness Eng that has become the talk of the town. In it, Ryan peeled back the curtains on how the recently announced OpenAI Frontier team has become OpenAI’s top Codex users, running a >1m LOC codebase with 0 human-written code and, crucially for the Dark Factory fans, no human REVIEWED code before merge. Ryan is admirably evangelical about this, calling it borderline “negligent” if you aren’t using >1B tokens a day (roughly $2-3k/day in token spend based on market rates and caching assumptions).

Over the past five months, they ran an extreme experiment: building and shipping an internal beta product with zero manually written code. Through the experiment, they adopted a different model of engineering work: when the agent failed, instead of prompting it better or to “try harder,” the team would ask “what capability, context, or structure is missing?” The result was Symphony, “a ghost library” and reference Elixir implementation (by Alex Kotliarskyi) that sets up a massive system of Codex agents, all extensively prompted with the specificity of a proper PRD spec, but without full implementation.

The future starts taking shape as one where coding agents stop being copilots and start becoming real teammates anyone can use, and Codex is doubling down on that mission with their Super Bowl messaging of “you can just build things”. Across Codex, internal observability stacks, and the multi-agent orchestration system his team calls Symphony, Ryan has been pushing what happens when you optimize an entire codebase, workflow, and organization around agent legibility instead of human habit.

We sat down with Ryan to dig into how OpenAI’s internal teams actually use Codex, why the real bottleneck in AI-native software development is now human attention rather than tokens, how fast build loops, observability, specs, and skills let agents operate autonomously, why software increasingly needs to be written for the model as much as for the engineer, and how Frontier points toward a future where agents can safely do economically valuable work across the enterprise.

We discuss:
* Ryan’s background from Snowflake, Brex, Stripe, and Citadel to OpenAI Frontier Product Exploration, where he works on new product development for deploying agents safely at enterprise scale
* The origin of “harness engineering” and the constraint that kicked off the whole experiment: Ryan deliberately refused to write code himself so the agent had to do the job end to end
* Building an internal product over five months with zero lines of human-written code, more than a million lines in the repo, and thousands of PRs across multiple Codex model generations
* Why early Codex was painfully slow, and how the team learned to decompose tasks, build better primitives, and gradually turn the agent into a much faster engineer than any individual human
* The obsession with fast build times: why one minute became the upper bound for the inner loop, and how the team repeatedly retooled the build system to keep agents productive
* Why humans became the bottleneck, and how Ryan’s team shifted from reviewing code directly to building systems, observability, and context that let agents review, fix, and merge work autonomously
* Skills, docs, tests, markdown trackers, and quality scores as ways of encoding engineering taste and non-functional requirements directly into context the agent can use
* The shift from predefined scaffolds to reasoning-model-led workflows, where the harness becomes the box and the model chooses how to proceed
* Symphony, OpenAI’s internal Elixir-based orchestration layer for...

Duration:01:12:43
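The notes above put >1B tokens a day at roughly $2-3k/day "based on market rates and caching assumptions" without giving the inputs. As a rough sketch of how such an estimate is assembled, the function below splits a day's tokens into uncached input, cached input, and output, then prices each bucket per million tokens. Every number passed in (the cache hit fraction, the output share, and all three prices) is a hypothetical placeholder chosen for illustration, not a quoted rate from OpenAI or anyone else.

```python
def daily_token_spend(
    tokens_per_day: float,
    cached_fraction: float,
    output_fraction: float,
    price_input_per_m: float,
    price_cached_per_m: float,
    price_output_per_m: float,
) -> float:
    """Estimate USD per day for a token mix and per-million-token prices."""
    millions = tokens_per_day / 1_000_000
    output_m = millions * output_fraction      # generated tokens
    input_m = millions - output_m              # all prompt tokens
    cached_m = input_m * cached_fraction       # prompt tokens served from cache
    uncached_m = input_m - cached_m            # prompt tokens billed at full rate
    return (
        uncached_m * price_input_per_m
        + cached_m * price_cached_per_m
        + output_m * price_output_per_m
    )


# Hypothetical inputs: 1B tokens/day, 10% output, 80% of input cached,
# and made-up prices of $5 / $0.50 / $15 per million tokens.
spend = daily_token_spend(
    tokens_per_day=1_000_000_000,
    cached_fraction=0.8,
    output_fraction=0.1,
    price_input_per_m=5.00,
    price_cached_per_m=0.50,
    price_output_per_m=15.00,
)
```

With these placeholder inputs the estimate works out to about $2,760 per day, and the split shows why the caching assumption matters: most of the volume sits in the cheapest bucket, so a lower cache hit rate moves the total quickly.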

Marc Andreessen introspects on The Death of the Browser, Pi + OpenClaw, and Why "This Time Is Different"

4/3/2026

Fresh off raising a monster $15B, Marc Andreessen has lived through multiple computing platform shifts firsthand, from Mosaic and Netscape to cofounding a16z. In this episode, Marc joins swyx and Alessio in a16z’s legendary Sand Hill Road office to argue that AI is not just another hype cycle, but the payoff of an “80-year overnight success”: from neural nets and expert systems to transformers, reasoning models, coding, agents, and recursive self-improvement. He lays out why he thinks this moment is different, why AI is finally escaping the old boom-bust pattern, and why the real bottleneck may be less about models than about the messy institutions, incentives, and social systems that struggle to absorb technological change.

This episode was a dream come true for us, and many thanks to Erik Torenberg for the assist in setting this up. Full episode on YouTube!

We discuss:
* Marc’s long view on AI: from the 1980s AI boom and expert systems to AlexNet, transformers, and why he sees today’s moment as the culmination of decades of compounding technical progress
* Why “this time is different”: the jump from LLMs to reasoning, coding, agents, and recursive self-improvement, and why Marc thinks these breakthroughs make AI real in a way prior cycles were not
* AI winters vs. “80-year overnight success”: why the field repeatedly swings between utopianism and doom, and why Marc thinks the underlying researchers were mostly right even when the timelines were wrong
* Scaling laws, Moore’s Law, and what to build: why he believes AI scaling laws will continue, why the outside world is messier than lab purists assume, and how startups can still create durable value on top of rapidly improving models
* The dot-com crash and AI infrastructure risk: Marc’s comparison between today’s AI capex boom and the fiber/data-center overbuild of 2000, plus why he thinks this cycle is different because the buyers are huge cash-rich incumbents and demand is already here
* Why old NVIDIA chips may be getting more valuable: the pace of software progress, chronic capacity shortages, and the idea that even current models are “sandbagged” by supply constraints
* Open source, edge inference, and the chip bottleneck: why Marc thinks local models, Apple Silicon, privacy, trust, and economics all point toward a major role for edge AI
* American vs. Chinese open source AI: DeepSeek as a “gift to the world,” why open models matter not just because they’re free but because they teach the world how things work, and how open source strategies may shift as the market consolidates
* Why Pi and OpenClaw matter so much: Marc’s claim that the combination of LLM + shell + filesystem + markdown + cron loop is one of the biggest software architecture breakthroughs in decades
* Agents as the new “Unix”: how agent state living in files allows portability across models and runtimes, and why self-modifying agents that can extend themselves may redefine what software even is
* The future of coding and programming languages: why Marc thinks software becomes abundant, why bots may translate freely across languages, and why “programming language” itself may stop being a salient concept
* Browsers, protocols, and human readability: lessons from Mosaic and the web, why text protocols and “view source” mattered, and how similar principles may shape AI-native systems
* Real-world OpenClaw use: health dashboards, sleep monitoring, smart homes, rewriting firmware on robot dogs, and why the most aggressive users are discovering both the power and danger of agents first
* Proof of human vs. proof of bot: why Marc thinks the internet’s bot problem is now unsolvable via detection alone, and why biometric + cryptographic proof of human becomes necessary

Timestamps
* 00:00 Marc on AI’s “80-Year Overnight Success”
* 00:01 A Quick Message From swyx
* 01:44 Inside a16z With Marc Andreessen
* 02:13 The Truth About a16z’s AI Pivot
* 03:29 Why This AI Boom Is Not Like 2016
* 06:33...

Duration:01:16:20

Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun

4/2/2026

We’ve been on a bit of a mini World Models series over the last quarter: from introducing the topic with Yi Tay, to exploring Marble with World Labs’ Fei-Fei Li and Justin Johnson, to previewing World Models learned from massive gaming datasets with General Intuition’s Pim de Witte (who has now written down their approach to World Models with Not Boring), to discussing the Cosmos World Model with Andrew White of Edison Scientific on our new Science pod, to writing up our own theses on Adversarial World Models. Meanwhile Nvidia, Waymo and Tesla have published their own approaches, Google has released Genie 3, and Yann LeCun has raised $1B for AMI and published LeWorldModel.

Today’s guests have a radically different approach to World Modeling from every player we just mentioned. While Genie 3 is impressive, its many flaws demonstrate the issues with its approach: terrain clipping, noninteractivity (single player, no physics/no objects other than the player move), and a maximum of 60 seconds of immersion. Moonlake AI (inspired by the Dreamworks logo) is the diametric opposite: immediately multiplayer, incredibly interactive, indefinite lifetime, and capable of MANY different kinds of world models by simulating environments, predicting outcomes, and planning over long horizons. This is enabled by bootstrapping from game engines and training custom agents.

In Towards Efficient World Models, Chris Manning and Ian Goodfellow join Fan-Yun in explaining why their approach to efficiency with structure and causality instead of just blind scaling is sorely needed:

SOTA models still show physical or spatial understanding glitches, such as solid objects floating in mid-air or moving “inside” other solid objects. If the goal is to plan for the next action, how often is a high-resolution pixel view necessary for modeling the world? Our bet is that there is a disproportionately large share of economically valuable tasks where such detail is not required. After all, humans with a wide variety of sensory limitations have little difficulty doing almost everything in the world. Furthermore, for a large number of purposes, describing a scene or a situation in a few words of language (“the car’s tires squealed as it cornered sharply”) is sufficient for understanding and planning. Experiments also show that humans only partially process visual input in a top-down, task-directed way, often making use of abstracted object-level modeling. In almost all cases, partial representations combined with semantic understanding are sufficient. …If the goal is to facilitate the understanding of causality in multimodal environments, then the world model, whether it is used in the virtual world or the physical world, must prioritize properties such as spatial and physical state consistency maintained over long time periods, and an ability to evolve the world that accurately reflects the consequences of actions.

That’s what Moonlake is building. Game engines are the right starting-point abstraction for efficiently extracting causal relationships, and Moonlake is building the interfaces and community (including their new $30,000 Creator Cup) to kickstart the flywheel of actions-to-observations. We were fortunate enough to attend their sessions at GDC 2026 (the Mecca of Game Devs), and were impressed by the huge variety and flexibility of the worlds people were building with Moonlake’s tools already! Live videos on the pod. Full Video Pod on YouTube!
Timestamps
00:00 Benchmarking Gets Hard
00:47 Meet Moonlake Founders
01:26 Why Build World Models
03:12 Structure Not Just Scale
05:37 Defining Action Conditioned Worlds
07:32 Abstraction Versus Bitter Lesson
14:39 Language Versus JEPA Debate
20:27 Reasoning Traces And Rendering Layer
37:00 Gameplay Over Graphics
38:02 Fiction Rules And World Tweaks
39:15 Code Engines Beat Learned Priors
41:10 Diffusion Scaling Limits
43:23 Symbolic Versus Diffusion Boundary
46:14 Platform Vision Beyond Games
50:24 Spatial Audio And Multimodal Latents
54:23...

Duration:01:06:47

Mistral: Voxtral TTS, Forge, Leanstral, & what's next for Mistral 4 — w/ Pavan Kumar Reddy & Guillaume Lample

3/30/2026

Mistral has been on an absolute tear - with frequent successful model launches it is easy to forget that they raised the largest European AI round in history last year. We were long overdue for a Mistral episode, and we were very fortunate to work with Sophia and Howard to catch up with Pavan (Voxtral lead) and Guillaume (Chief Scientist, Co-founder) on the occasion of this week’s Voxtral TTS launch: Mistral can’t directly say it, but the benchmarks do imply, that this is basically an open-weights ElevenLabs-level TTS model (Technically, it is a 4B Ministral based multilingual low-latency TTS open weights model that has a 68.4% win rate vs ElevenLabs Flash v2.5). The contributions are not just in the open weights but also in open research: We also spend a decent amount of the pod talking about their architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens (typically only applied in the Image Generation space, as seen in the Flow Matching NeurIPS workshop from the principal authors that we reference in the pod). You can catch up on the paper here and the full episode is live on youtube! Timestamps 00:00 Welcome and Guests00:22 Announcing Voxtral TTS01:41 Architecture and Codec02:53 Understanding vs Generation05:39 Flow Matching for Audio07:27 Real Time Voice Agents13:40 Efficiency and Model Strategy14:53 Voice Agents Vision17:56 Enterprise Deployment and Privacy23:39 Fine Tuning and Personalization25:22 Enterprise Voice Personalization26:09 Long-Form Speech Models26:58 Real-Time Encoder Advances27:45 Scaling Context for TTS28:53 What Makes Small Models30:37 Merging Modalities Tradeoffs33:05 Open Source Mission35:51 Lean and Formal Proofs38:40 Reasoning Transfer and Agents40:25 Next Frontiers in Training42:20 Hiring and AI for Science44:19 Forward Deployed Engineering46:22 Customer Feedback Loop48:29 Wrap Up and Thanks Transcript swyx: Okay, welcome to Latent Space. 
We’re here in the studio with our guest co-host Vibhu. Welcome. Thanks. Excited for this one as well as Guillaume and Pavan from Mistral. Welcome. Excited to be here. Guillaume: Thank you. swyx: Pavan, you are leading audio research at Mistral, and Guillaume, you’re chief scientist. Announcing Voxtral TTS swyx: What are we announcing today where we’re coordinating this release with you guys? Guillaume: Yeah, so we are releasing Voxtral TTS. So it’s our first audio model that generates speech. It’s not our first audio model. We had a couple of releases before. We had one in the summer that was Voxtral, our first audio model, but it was like transcription models, ASR. Like a few months later we released some update on top of this, supporting more languages. Also a lot of table-stakes features for our customers: context biasing, diarization, timestamping, and the transcription. We will start some real time model that can transcribe not just at the end of the, you just don’t need to feed your entire audio file, but that can also come in [00:01:00] real time. And here, this is the natural extension in the audio. So basically speech generation. So yeah. So we support nine languages. And this is a pretty small model 3D models. So very fast. And also status is the same level at the best model, but it’s much more efficient in terms of cost and also much in terms of cost. It’s also much to only a fraction of the cost of parking competitors. They, and we are also releasing the weights of this model swyx: yeah. Mainly linked, not, yeah. What’s the decision? Factor him. Guillaume: It’s a good question. Pavan: Ooh. swyx: Yeah. For provide any other sort of research notes to add on what Pavan: we maybe we’ll dive into it later in the podcast too. Architecture and Codec Pavan: But it’s a novel architecture that we developed in-house. We iterated on several internal architectures and ended up with an autoregressive flow-matching architecture. And also have a new in-house neural audio codec. 
Which, converts...

Duration:00:48:48

🔬Why There Is No "AlphaFold for Materials" — AI for Materials Discovery with Heather Kulik

3/24/2026

Materials science is the unsung hero of the science world. Behind every physical product you interact with are decades of research into getting the properties of materials just right. Your gym clothes contain synthetic fibers developed over decades. The glass screen, diodes, and chip substrate technology needed to read this blog post were only viable due to many teams of materials scientists. Our guest Prof. Heather Kulik was one of the first materials scientists to realize that there was alpha in combining computational tools with data-driven modeling: she did AI for science before it was cool. She has a hard-fought perspective on how to succeed in this field. Yes, she believes the wins are real. To get there you must work hard to deeply integrate domain expertise with AI techniques, and also maintain a discriminating mind. Ultimately what matters is that you succeed in the lab, and nature doesn’t care about how hyped a model is. These lessons personally resonated with the Latent.Space Science team and our own experience. This episode is a must-watch for all aspiring AI for science practitioners. A few highlights: Designing new polymers with AI: Heather’s group recently used AI to design new polymers that are significantly stronger. These materials were created and tested in the lab, and the scientists who built them were surprised by the designs. The AI had figured out certain building blocks could break in a novel way. The AI discovered a purely quantum mechanical effect, and after convincing their lab collaborators to actually synthesize it, the material turned out to be four times tougher! The twenty-two-atom ligand challenge: When asked about the role and need of human scientists, Heather points out that AI has a strong understanding of academic chemistry, but is still lacking intuition. Every time an LLM is updated, Heather asks it to design a ligand that contains exactly twenty-two heavy atoms. 
She has yet to find one that can succeed at this seemingly simple task that any expert could do in a second! Is this the chemistry counterpart to counting ‘r’s in strawberry? Side note: Heather joked that this comment would date itself immediately, so we decided to see if this was still true three months after recording. We found some interesting results! We asked both Claude and ChatGPT to design a 22-atom ligand for both a metal-organic framework (MOF) and a kinase protein. * For the kinase, both models got it right: Claude pulled out RDKit in a Python script and iterated on several designs, whereas ChatGPT just one-shotted it. * For MOFs, both models got it wrong, generating ligands with 21, 23, or 24 atoms, yet stubbornly not getting 22 atoms. Is there something different about how LLMs reason in the materials and bio domains? Materials vs biology: The two biggest domains of AI in science have been biology and materials. We asked Heather if there could be an AlphaFold moment for materials. Her answer reframes how we should think about the field: * First, the datasets in materials science are woefully lacking in comparison to the bio world. The closest to ground truth in most cases are noisy DFT datasets. These are just approximations to the real world! The datasets that are accurate are all boring; as Heather quipped, “We have really good datasets for really boring chemistry.” Furthermore, good experimental structures are hard to come by and require interpretation. So generating high-quality, novel datasets at scale would really drive the field forward. * More philosophically, AlphaFold is making predictions in a fairly limited space: there are just twenty amino acids. Sure, even here AlphaFold doesn’t get everything right, but it seems plausible that one could learn the entire design space. For materials, each element is a new set of interactions and chemistry, with little to no transferability. 
This is a massive open problem in material science that we hope some of the smartest AI scientists will want to work...

Duration:00:35:14

Dreamer: the Personal Agent OS — David Singleton

3/20/2026

For a limited time, Latent Spacenauts can skip the waitlist to join Dreamer and also compete for a $10,000 cash prize for the most useful tools for Dreamer! Thanks @dps! In 2024, David Singleton left Stripe and joined forces with Hugo Barra for a buzzy stealth startup named /dev/agents. This month they emerged as Dreamer, a consumer-first platform to discover, build, and use AI agents and agentic apps, centered on a personal “Sidekick” that helps users customize experiences via natural language. Sidekick is nothing less than an “agent that builds agents”, with all the complexity that that entails: You’ve seen many, many website builder, app builder, and even agent builder startups by now, but our favorite detail is the sheer amount of work that has gone into the “full stack” nature of the platform, including shipping their own SDK, logging, database, prompt management, serverless functions, and so on. Most platforms restrict the tech stack you can use just to get off the ground; Dreamer does it “right” by letting you push whatever arbitrary code you want to their VMs. Paying the Builders Of course former leaders of Stripe and Android would not stop at just building the tools; they are also building the ecosystem. Dreamer is deeply aware of the four-sided network effect it has going on and is ready to fund all of it - from hiring Builders in Residence to awarding $10,000 cash prizes to the best tool builders for the Dreamer ecosystem. It’s time to Dream! Full Video Episode on YouTube. Transcript [00:00:00] Meet Dreamer Purple [00:00:00] swyx: Okay, we’re here in the studio with David Singleton. Welcome. [00:00:08] David Singleton: Hey, swyx. It’s great to be here. [00:00:09] swyx: It’s great to have you. Uh, we have very sympa that your company color is the same as Latent Space’s color. [00:00:15] David Singleton: That’s right. Dreamer Purple. [00:00:17] swyx: It used to be /dev/agents, which I thought was very cool. It’s like your callback to /dev/payments. 
[00:00:22] David Singleton: Yeah. [00:00:22] swyx: And you were obviously CTO of Stripe. And talk to me about just the origin or thinking process behind Dreamer. Yeah. And maybe, maybe start with like, what, what is Dreamer? [00:00:31] David Singleton: Yeah. [00:00:31] What Is Dreamer [00:00:31] David Singleton: So Dreamer is a new product, uh, which everyone can come and play with today. Um, it’s a place where everyone, literally, everyone can discover, build, and enjoy and use AI agents and agentic apps. [00:00:45] And we really did design it for consumers, for folks who do not necessarily, uh, have any kind of technical background. It’s really aimed at everyone. I think often of my sister, she’s very smart. She’s not in the slightest bit technical. She has lots of problems in her life that [00:01:00] she would like to be able to have great software and intelligent software to solve. [00:01:04] But you know, even with the rise of tools like Claude Code and so forth, she’s got no way to get started. And Dreamer is a place where she can come in, grab some intelligent apps that other people in the community have built, start using them right away, and solve real problems in her life. [00:01:19] Sidekick And Waitlist [00:01:19] David Singleton: And at the core, we have a personal agent called the Sidekick. [00:01:24] Um, you can give your sidekick a name, you can give it its own personality, and it really helps you across your entire day, your life. It helps you use all of the agents on the platform, and it also helps you build anything you want. And we’ve been working on this for a little while. We recently launched in beta. [00:01:41] So anyone can go to dreamer.com, join the wait list. Um, and we have many, many, many people in the community now who are building really fun, really powerful, really useful agents and agentic apps for themselves. [00:01:54] swyx: I think we’re gonna go right into a demo. Yeah. I just wanna make an observation that, uh,...

Duration:01:03:35

Why Anthropic Thinks AI Should Have Its Own Computer — Felix Rieseberg of Claude Cowork & Claude Code Desktop

3/17/2026

Claude Cowork came out of an accident. Felix and the Anthropic team noticed something interesting with Claude Code: many users were using it primarily for all kinds of messy knowledge work instead of coding. Even technical builders would use it for lots of non-technical work. Even more shocking, Claude Cowork wrote itself. With a team of humans simply orchestrating multiple Claude Code instances, the tool was ready after a brief week and a half. This isn’t Felix’s first rodeo with impactful and playful desktop apps. He helped ship the Slack desktop app and is a core maintainer of Electron, the open-source software framework used for building cross-platform desktop applications, even putting Windows 95 into an Electron app that runs on macOS, Windows, and Linux. In this episode, Felix joins us to unpack why execution has suddenly become cheap enough that teams can “just build all the candidates” and why the real frontier in AI products is no longer better chat, but trusted task execution. He also shares why Anthropic is betting on local-first agent workflows, why skills may matter more than most people realize, and how the hardest questions ahead are about autonomy, safety, portability, and the changing shape of knowledge work itself. 
We discuss * Felix’s path: Slack desktop app, Electron, Windows 95 in JavaScript, and now building Claude Cowork at Anthropic * What Claude Cowork actually is: a more user-friendly, VM-based version of Claude Code designed to bring agentic workflows to non-terminal-native users * Why “user-friendly” does not mean “less powerful”: Cowork as a superset product, much like how VS Code initially looked simpler than Visual Studio but became more hackable and extensible * Anthropic’s prototype-first culture: why Cowork was built in 10 days using many pre-existing internal pieces, and how internal prototypes shaped the final product * Why execution is getting cheap: the shift from long memos, specs, and debate toward rapidly building multiple candidates and choosing based on reality instead of theory * The local debate: why Felix thinks Silicon Valley is undervaluing the local computer, and why putting Claude “where you work” is often more powerful * Why Claude gets its own computer: the VM as both a safety boundary and a capability unlock, letting Claude install tools, run scripts, and work more independently without constant approval * Safety through sandboxing: why “approve every command” is not a real long-term UX, and how virtual machines create a middle ground between uselessly safe and dangerously autonomous * How Cowork differs from Claude Code: coding evals vs. knowledge-work evals, different system-prompt tradeoffs, longer planning horizons, and heavier use of planning and clarification tools * Why skills matter: simple markdown-based instructions as a lightweight abstraction layer for reusable workflows, personalized automation, and portable agent behavior * Skills vs. 
MCPs: why Felix is increasingly interested in file-based, text-native interfaces that tell the model what to do, rather than forcing everything through rigid tool schemas * The portability problem: why personal skills should move across agent products, and the unresolved tension between public reusable workflows and private user-specific context * Real use cases already happening today: uploading videos, organizing files, handling taxes, managing calendars, debugging internal crashes, analyzing finances, and automating repetitive browser workflows * Why AI products should work with your existing stack: Anthropic’s bias toward integrating with Chrome, Office, and existing workflows instead of rebuilding every app from scratch * Computer use one year later: how much better it has gotten, why vision plus browser context is such a superpower, and why letting Claude see the thing it is working on changes everything * Why many “AI verticals” may get compressed: specialized wrappers may matter in the short term,...

Duration:01:26:59

Retrieval After RAG: Hybrid Search, Agents, and Database Design — Simon Hørup Eskildsen of Turbopuffer

3/12/2026

Turbopuffer came out of a reading app. In 2022, Simon was helping his friends at Readwise scale their infra for a highly requested feature: article recommendations and semantic search. Readwise was paying ~$5k/month for their relational database and vector search would cost ~$20k/month making the feature too expensive to ship. In 2023 after mulling over the problem from Readwise, Simon decided he wanted to “build a search engine” which became Turbopuffer. We discuss:• Simon’s path: Denmark → Shopify infra for nearly a decade → “angel engineering” across startups like Readwise, Replicate, and Causal → turbopuffer almost accidentally becoming a company • The Readwise origin story: building an early recommendation engine right after the ChatGPT moment, seeing it work, then realizing it would cost ~$30k/month for a company spending ~$5k/month total on infra and getting obsessed with fixing that cost structure • Why turbopuffer is “a search engine for unstructured data”: Simon’s belief that models can learn to reason, but can’t compress the world’s knowledge into a few terabytes of weights, so they need to connect to systems that hold truth in full fidelity • The three ingredients for building a great database company: a new workload, a new storage architecture, and the ability to eventually support every query plan customers will want on their data • The architecture bet behind turbopuffer: going all in on object storage and NVMe, avoiding a traditional consensus layer, and building around the cloud primitives that only became possible in the last few years • Why Simon hated operating Elasticsearch at Shopify: years of painful on-call experience shaped his obsession with simplicity, performance, and eliminating state spread across multiple systems • The Cursor story: launching turbopuffer as a scrappy side project, getting an email from Cursor the next day, flying out after a 4am call, and helping cut Cursor’s costs by 95% while fixing their per-user economics • The 
Notion story: buying dark fiber, tuning TCP windows, and eating cross-cloud costs because Simon refused to compromise on architecture just to close a deal faster • Why AI changes the build-vs-buy equation: it’s less about whether a company can build search infra internally, and more about whether they have time, especially if an external team can feel like an extension of their own • Why RAG isn’t dead: coding companies still rely heavily on search, and Simon sees hybrid retrieval (semantic, text, regex, SQL-style patterns) becoming more important, not less • How agentic workloads are changing search: the old pattern was one retrieval call up front; the new pattern is one agent firing many parallel queries at once, turning search into a highly concurrent tool call • Why turbopuffer is reducing query pricing: agentic systems are dramatically increasing query volume, and Simon expects retrieval infra to adapt to huge bursts of concurrent search rather than a small number of carefully chosen calls • The philosophy of “playing with open cards”: Simon’s habit of being radically honest with investors, including telling Lachy Groom he’d return the money if turbopuffer didn’t hit PMF by year-end • The “P99 engineer”: Simon’s framework for building a talent-dense company, rejecting by default unless someone on the team feels strongly enough to fight for the candidate Simon Hørup Eskildsen • LinkedIn: https://www.linkedin.com/in/sirupsen • X: https://x.com/Sirupsen • https://sirupsen.com/about turbopuffer • https://turbopuffer.com/ Full Video Pod Timestamps * 00:00:00 The PMF promise to Lachy Groom * 00:00:25 Intro and Simon's background * 00:02:19 What turbopuffer actually is * 00:06:26 Shopify, Elasticsearch, and the pain behind the company * 00:10:07 The Readwise experiment that sparked turbopuffer * 00:12:00 The insight Simon couldn’t stop thinking about * 00:17:00 S3 consistency, NVMe, and the architecture bet * 00:20:12 The Notion story: latency, dark fiber, and conviction * 00:25:03 Build vs....
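The agentic search pattern described in these notes (one agent firing many parallel queries instead of a single up-front retrieval call) can be sketched roughly as follows. This is a minimal illustration, not turbopuffer's API: the `DOCS` corpus, `search`, `agent_fan_out`, and `run` are all hypothetical names, and the substring match stands in for a real search backend.

```python
import asyncio

# Hypothetical in-memory corpus; a real system would call a search backend.
DOCS = {
    1: "object storage and NVMe caching",
    2: "tuning TCP windows over dark fiber",
    3: "hybrid retrieval with regex and text search",
}


async def search(query: str) -> list[int]:
    """Stand-in for one retrieval call: ids of docs containing the query."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    needle = query.lower()
    return [doc_id for doc_id, text in DOCS.items() if needle in text.lower()]


async def agent_fan_out(queries: list[str]) -> dict[str, list[int]]:
    """Issue every query concurrently and collect the results per query."""
    results = await asyncio.gather(*(search(q) for q in queries))
    return dict(zip(queries, results))


def run(queries: list[str]) -> dict[str, list[int]]:
    """Synchronous entry point for the sketch."""
    return asyncio.run(agent_fan_out(queries))
```

Calling `run(["NVMe", "regex"])` issues both lookups in one concurrent batch; with a real backend, the number of in-flight queries per agent is what drives the burst load mentioned above.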

Duration:01:00:32

NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

3/10/2026

Join Kyle, Nader, Vibhu, and swyx live at NVIDIA GTC next week! Now that AIE Europe tix are ~sold out, our attention turns to Miami and World’s Fair! The definitive AI Accelerator chip company has more than 10xed this AI Summer: And is now a $4.4 trillion megacorp… that is somehow still moving like a startup. We are blessed to have a unique relationship with our first ever NVIDIA guests: Kyle Kranen who gave a great inference keynote at the first World’s Fair and is one of the leading architects of NVIDIA Dynamo (a Datacenter scale inference framework supporting SGLang, TRT-LLM, vLLM), and Nader Khalil, a friend of swyx from our days in Celo in The Arena, who has been drawing developers at GTC since before they were even a glimmer in the eye of NVIDIA: Nader discusses how NVIDIA Brev has drastically reduced the barriers to entry for developers to get a top of the line GPU up and running, and Kyle explains NVIDIA Dynamo as a data center scale inference engine that optimizes serving by scaling out, leveraging techniques like prefill/decode disaggregation, scheduling, and Kubernetes-based orchestration, framed around cost, latency, and quality tradeoffs. We also dive into Jensen’s “SOL” (Speed of Light) first-principles urgency concept, long-context limits and model/hardware co-design, internal model APIs (https://build.nvidia.com), and upcoming Dynamo and agent sessions at GTC. 
Full Video pod on YouTube Timestamps * 00:00 Agent Security Basics * 00:39 Podcast Welcome and Guests * 07:19 Acquisition and DevEx Shift * 13:48 SOL Culture and Dynamo Setup * 27:38 Why Scale Out Wins * 29:02 Scale Up Limits Explained * 30:24 From Laptop to Multi Node * 33:07 Cost Quality Latency Tradeoffs * 38:42 Disaggregation Prefill vs Decode * 41:05 Kubernetes Scaling with Grove * 43:20 Context Length and Co Design * 57:34 Security Meets Agents * 58:01 Agent Permissions Model * 59:10 Build Nvidia Inference Gateway * 01:01:52 Hackathons And Autonomy Dreams * 01:10:26 Local GPUs And Scaling Inference * 01:15:31 Long Running Agents And SF Reflections Transcript Agent Security Basics Nader: Agents can do three things. They can access your files, they can access the internet, and then now they can write custom code and execute it. You literally only let an agent do two of those three things. If you can access your files and you can write custom code, you don’t want internet access because that’s one to see full vulnerability, right? If you have access to internet and your file system, you should know the full scope of what that agent’s capable of doing. Otherwise, now we can get injected or something that can happen. And so that’s a lot of what we’ve been thinking about is like, you know, how do we both enable this because it’s clearly the future. But then also, you know, what, what are these enforcement points that we can start to like protect? swyx: All right. Podcast Welcome and Guests swyx: Welcome to the Latent Space podcast in the Chroma studio. Welcome to all the guests here. Uh, we are back with our guest host Vibhu. Welcome. Good to have you back. And our friends, uh, Nader and Kyle from Nvidia. Welcome. Kyle: Yeah, thanks for having us. swyx: Yeah, thank you. Actually, I don’t even know your titles. Uh, I know you’re like architect something of Dynamo. Kyle: Yeah. I, I’m one of the engineering leaders [00:01:00] and architects of Dynamo. 
swyx: And you’re director of something and developers, developer tech. Nader: Yeah. swyx: You’re the developers, developers, developers guy at nvidia, Nader: open source agent marketing, brev, swyx: and like Nader: Devrel tools and stuff. swyx: Yeah. Been Nader: the focus. swyx: And we’re, we’re kind of recording this ahead of Nvidia, GTC, which is coming to town, uh, again, uh, or taking over town, uh, which, uh, which we’ll all be at. Um, and we’ll talk a little bit about your sessions and stuff. Yeah. Nader: We’re super excited for it. GTC Booth Stunt Stories swyx: One of my favorite memories for Nader, like you always do...
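The "two of those three things" rule Nader states in the transcript above (file access, internet access, custom code execution) can be sketched as a simple policy check. This is an illustrative sketch only, not NVIDIA's implementation: `AgentPermissions` and `is_allowed` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentPermissions:
    """The three capabilities named in the transcript (hypothetical model)."""
    file_access: bool
    internet_access: bool
    code_execution: bool

    def granted(self) -> int:
        """Count how many of the three capabilities are switched on."""
        return sum((self.file_access, self.internet_access, self.code_execution))


def is_allowed(perms: AgentPermissions) -> bool:
    """Apply the rule as stated: an agent may hold at most two of the three."""
    return perms.granted() <= 2
```

A configuration with files and code execution but no internet passes the check, while one that enables all three is rejected; a real enforcement point would sit wherever those capabilities are actually granted.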

Duration:01:23:37

Cursor's Third Era: Cloud Agents

3/5/2026

All speakers are announced at AIE EU, schedule coming soon. Join us there or in Miami with the renowned organizers of React Miami! Singapore CFP also open! We’ve called this out a few times over in AINews, but the overwhelming consensus in the Valley is that “the IDE is Dead”. In November it was just a gut feeling, but now we actually have data: even at the canonical “VSCode Fork” company, people are officially using more agents than tab autocomplete (the first wave of AI coding): Cursor has had cloud agents for a few months now, and this specific launch is around Computer Use, which has come a long way since we first talked with Anthropic about it in 2024, and which Jonas productized as Autotab: We also take the opportunity to do a live demo, talk about slash commands and subagents, and the future of continual learning and personalized coding models, something that Sam previously worked on at New Computer. (The fact that both of these folks were top-tier CEOs of their own startups and have now joined the insane talent density gathering at Cursor should also not be overlooked). Full Episode on YouTube! Please like and subscribe! Timestamps * 00:00 Agentic Code Experiments * 00:53 Why Cloud Agents Matter * 02:08 Testing First Pillar * 03:36 Video Reviews Second Pillar * 04:29 Remote Control Third Pillar * 06:17 Meta Demos and Bug Repro * 13:36 Slash Commands and MCPs * 18:19 From Tab to Team Workflow * 31:41 Minimal Web UI Philosophy * 32:40 Why No File Editor * 34:38 Full Stack Cursor Debate * 36:34 Model Choice and Auto Routing * 38:34 Parallel Agents and Best Of N * 41:41 Subagents and Context Management * 44:48 Grind Mode and Throughput Future * 01:00:24 Cloud Agent Onboarding and Memory Transcript EP 77 - CURSOR - Audio version [00:00:00] Agentic Code Experiments Samantha: This is another experiment that we ran last year and didn’t decide to ship at that time, but may come back to: an LLM judge, but one that was also agentic and could write code. 
So it wasn’t just picking but also taking the learnings from two models or N models that it was looking at and writing a new diff. And what we found was that there were strengths to using models from different model providers as the base level of this process. Basically you could get almost like a synergistic output that was better than having a very unified like bottom model tier. Jonas: We think that over the coming months, the big unlock is not going to be one person with a model getting more done, like the water flowing faster, and it will be making the pipe much wider and so parallelizing more, whether that’s swarms of agents or parallel agents, both of those are things that contribute to getting much more done in the same amount of time. Why Cloud Agents Matter swyx: This week, one of the biggest launches that Cursor’s ever done is cloud agents. I think you, you had [00:01:00] cloud agents before, but this was like, you give Cursor a computer, right? Yeah. So it’s just basically they bought Autotab and then they repackaged it. Is that what’s going on, or, Jonas: that’s a big part of it. Yeah. Cloud agents already ran in their own computers, but they were sort of sight-reading code. Yeah. And those computers were not, they were like blank VMs typically that were not set up for the DevEx for whatever repo the agent’s working on. One of the things that we talk about is if you put yourself in the model’s shoes and you were seeing tokens stream by and all you could do was sight-read code and spit out tokens and hope that you had done the right thing, swyx: no chance Jonas: I’d be so bad. Like you obviously you need to run the code. And so that I think also is probably not that contrarian of a take, but no one has done that yet. 
And so giving the model the tools to onboard itself and then use full computer use end-to-end pixels in coordinates out and have the cloud computer with different apps in it is the big unlock that we’ve seen internally in terms of use usage of this going from, oh, we use it for little...

Duration:01:06:39

Every Agent Needs a Box — Aaron Levie, Box

3/4/2026

The reception to our recent post on Code Reviews has been strong. Catch up! Amid a maelstrom of discussion on whether or not AI is killing SaaS, one of the top publicly listed SaaS companies in the world has just reported record revenues, clearing well over $1.1B in ARR for the first time with a 28% margin. As we comment on the pod, Aaron Levie is the rare public company CEO equally at home in both worlds of Silicon Valley and Wall Street/Main Street, by day helping 70% of the Fortune 500 with their Enterprise Advanced Suite, and yet by night is often found in the basements of early startups and tweeting viral insights about the future of agents. Now that both Cursor, Cloudflare, Perplexity, Anthropic and more have made Filesystems and Sandboxes and various forms of “Just Give the Agent a Box” cool (not just cool; it is now one of the single hottest areas in AI infrastructure growing 100% MoM), we find it a delightfully appropriate time to do the episode with the OG CEO who has been giving humans and computers Boxes since he was a college dropout pitching VCs at a Michael Arrington house party. Enjoy our special pod, with fan favorite returning guest/guest cohost Jeff Huber! Note: We didn’t directly discuss the AI vs SaaS debate - Aaron has done many, many, many other podcasts on that, and you should read his definitive essay on it. Most commentators do not understand SaaS businesses because they have never scaled one themselves, and deeply reflected on what the true value proposition of SaaS is. 
Full Video Episode Timestamps * 00:00 Adapting Work for Agents * 01:29 Why Every Agent Needs a Box * 04:38 Agent Governance and Identity * 11:28 Why Coding Agents Took Off First * 21:42 Context Engineering and Search Limits * 31:29 Inside Agent Evals * 33:23 Industries and Datasets * 35:22 Building the Agent Team * 38:50 Read Write Agent Workflows * 41:54 Docs Graphs and Founder Mode * 55:38 Token FOMO Culture * 56:31 Production Function Secrets * 01:01:08 Film Roots to Box * 01:03:38 AI Future of Movies * 01:06:47 Media DevRel and Engineering Transcript Adapting Work for Agents Aaron Levie: Like you don’t write code, you talk to an agent and it goes and does it for you, and you maybe at best review it. That’s even probably like, like largely not even what you’re doing. What’s happening is we are changing our work to make the agents effective. In that model, the agent didn’t really adapt to how we work. We basically adapted to how the agent works. All of the economy has to go through that exact same evolution. Right now, it’s a huge asset and an advantage for the teams that do it early and that are kinda wired into doing this ‘cause you’ll see compounding returns. But that’s just gonna take a while for most companies to actually go and get this deployed. swyx: Welcome to the Latent Space pod. We’re back in the Chroma studio with, uh, Chroma CEO Jeff Huber. Welcome, returning guest, now guest host. Aaron Levie: It’s a pleasure. Wow. How’d you get upgraded to, uh, to that? swyx: Because he’s like the perfect guy to be guest host for you. Aaron Levie: That makes sense actually, for We love context. We, we both really love context le we really do. We really do. swyx: Uh, and we’re here with, uh, Aaron Levie. Welcome. Aaron Levie: Thank you. Good to, uh, good to be [00:01:00] here. swyx: Uh, yeah. So we’ve all met offline and like chatted a little bit, but like, it’s always nice to get these things in person and conversation. Yeah. 
You just started off with so much energy. You’re, you’re super excited about agents. Aaron Levie: I love agents. swyx: Yeah. OpenClaw just got, got bought by OpenAI. No, not bought, but you know, you know what I mean? Aaron Levie: Some, some, you know, acquihire. swyx: Executive hire. Aaron Levie: Executive hire. Okay. Executive hire. swyx: Say, hey, that’s my term. Okay. Um, what are you pounding the table on on agents? You have so many insightful tweets. Why Every...

Duration:01:16:58

METR’s Joel Becker on exponential Time Horizon Evals, Threat Models, and the Limits of AI Productivity

2/27/2026

This is a free preview of a paid episode. To hear more, visit www.latent.space AIE Europe CFP and AIE World’s Fair paper submissions for CAIS peer review are due TODAY - do not delay! Last call ever. We’re excited to welcome METR for their first LS Pod, hopefully the first of many: METR are keepers of currently the single most infamous chart in AI: But every Latent Space reader should be sophisticated enough to know that the details matter and that hype and hyperbole go hand in hand in AI social media, because the millions of impressions that chart got from people who don’t understand or care about the nuances, disclaimers, and error bars far outreach the 69k views on the corrections by the people who actually made the chart: There’s a lot of nuance both in making benchmarks (as we discovered with OpenAI on our SWE-Bench Verified podcast) and in extrapolating results from them, especially where exponentials and sigmoids are concerned. METR’s Long Horizons work itself has known biases that the authors have responsibly disclosed, but that go far too underappreciated in the pursuit of doomer chart porn. If you’re interested in a short, sharable TED talk version of this pod, over at AIE CODE we were blessed to feature Joel twice, as a stage talk and with a longer form small workshop with Q&A: We also make sure to cover some of METR’s lesser known work on Threat Evaluation and on Developer Productivity, where 2x friend of the pod and now Zyphra founder Quentin Anthony was the ONLY productive participant! 
Finally, if you’re the sort to read these show notes to the end, then you definitely deserve some pictures of Joel shredding the guitar at Love Band Karaoke, which we mention at the end.

Full Video Pod

Timestamps

00:00 What METR Means
00:39 Podcast Intro With Joel
01:39 ME vs TR
03:33 Time Horizon Origin Story
04:56 Picking Tasks And Biases
09:13 Time Horizon Misconceptions
11:37 Opus 4.5 And Trendlines
14:27 Productivity Studies And Explosions
29:50 Compute Slows Progress
30:47 Algorithms Need Compute
32:45 Industry Spend and Data
34:57 Clusters and Shipping Timelines
36:44 Prediction Markets for Models
38:10 Manifold Alpha Story
43:04 Beyond Benchmarks Evals
51:39 METR Roadmap and Farewell

Transcript

Duration:00:56:14

[LIVE] Anthropic Distillation & How Models Cheat (SWE-Bench Dead) | Nathan Lambert & Sebastian Raschka

2/26/2026

Swyx joined SAIL! Thank you SAIL Media, Prof. Tom Yeh, 8Lee, Hamid Bagheri, c9n, and many others for tuning into SAIL Live #6 with Nathan Lambert and Sebastian Raschka, PhD. Sharing here for the LS paid subscribers. We covered: This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Duration:00:52:17

🔬Nature as a Computer: Prof. Max Welling, CuspAI on AI x Materials Science

2/25/2026

Editor’s note: CuspAI raised a $100m Series A in September and is rumored to have reached a unicorn valuation. They have all-star advisors from Geoff Hinton to Yann LeCun and a team of deep domain experts to tackle this next frontier in AI applications. In this episode, Max Welling traces the thread connecting quantum gravity, equivariant neural networks, diffusion models, and climate-focused materials discovery (yes, there is one!!!). We begin with a provocative framing: experiments as computation. Welling describes the idea of a “physics processing unit”—a world in which digital models and physical experiments work together, with nature itself acting as a kind of processor. It’s a grounded but ambitious vision of AI for science: not replacing chemists, but accelerating them.

Along the way, we discuss:

* Why symmetry and equivariance matter in deep learning
* The tradeoff between scale and inductive bias
* The deep mathematical links between diffusion models and stochastic thermodynamics
* Why materials—not software—may be the real bottleneck for AI and the energy transition
* What it actually takes to build an AI-driven materials platform

Max reflects on moving from curiosity-driven theoretical physics (including work with Gerard ‘t Hooft) toward impact-driven research in climate and energy. The result is a conversation about convergence: physics and machine learning, digital models and laboratory experiments, long-term ambition and incremental progress.

Full Video Episode

Timestamps

* 00:00:00 – The Physics Processing Unit (PPU): Nature as the Ultimate Computer
  * Max introduces the idea of a Physics Processing Unit — using real-world experiments as computation.
* 00:00:44 – From Quantum Gravity to AI for Materials
  * Brandon frames Max’s career arc: VAE pioneer → equivariant GNNs → materials startup founder.
* 00:01:34 – Curiosity vs Impact: How His Motivation Evolved
  * Max explains the shift from pure theoretical curiosity to climate-driven impact.
* 00:02:43 – Why CuspAI Exists: Technology as Climate Strategy
  * Politics struggles; technology scales. Why materials innovation became the focus.
* 00:03:39 – The Thread: Physics → Symmetry → Machine Learning
  * How gauge symmetry, group theory, and relativity informed equivariant neural networks.
* 00:06:52 – AI for Science Is Exploding (Not Emerging)
  * The funding surge and why AI-for-Science feels like a new industrial era.
* 00:07:53 – Why Now? The Two Catalysts Behind AI for Science
  * Protein folding, ML force fields, and the tipping point moment.
* 00:10:12 – How Engineers Can Enter AI for Science
  * Practical pathways: curriculum, workshops, cross-disciplinary training.
* 00:11:28 – Why Materials Matter More Than Software
  * The argument that everything—LLMs included—rests on materials innovation.
* 00:13:02 – Materials as a Search Engine
  * The vision: automated exploration of chemical space like querying Google.
* 00:14:48 – Inside CuspAI: The Platform Architecture
  * Generative models + multi-scale digital twin + experiment loop.
* 00:21:17 – Automating Chemistry: Human-in-the-Loop First
  * Start manual → modular tools → agents → increasing autonomy.
* 00:25:04 – Moonshots vs Incremental Wins
  * Balancing lighthouse materials with paid partnerships.
* 00:26:22 – Why Breakthroughs Will Still Require Humans
  * Automation is vertical-specific and iterative.
* 00:29:01 – What Is Equivariance (In Plain English)?
  * Symmetry in neural networks explained with the bottle example.
* 00:30:01 – Why Not Just Use Data Augmentation?
  * The optimization trade-off between inductive bias and data scale.
* 00:31:55 – Generative AI Meets Stochastic Thermodynamics
  * His upcoming book and the unification of diffusion models and physics.
* 00:33:44 – When the Book Drops (ICLR?)

Transcript

Max: I want to think of it as what I would call a physics processing unit, like a PPU, right? Which is you have digital processing units and then you have physics...

Duration:00:33:56

Claude Code for Finance + The Global Memory Shortage: Doug O'Laughlin, SemiAnalysis

2/24/2026

This is a free preview of a paid episode. To hear more, visit www.latent.space First speakers for AIE Europe and AIE Miami have been announced. If you’re in Asia/Aus, come by Singapore and Melbourne. AI Engineering is going global! One year ago today, Anthropic launched Claude Code, to not much fanfare. The word of mouth was incredibly strong, however, and so we were glad to be one of the first podcasts to invite Boris and Cat on in early May. As we discussed on the pod, all CC usage was API-based and therefore it was ridiculously expensive to do anything. This was then fixed by the team including Claude Code in the Claude Pro plan in early June, and then the virality caused us to make a rare trend call in late June. Now, 6 months on, Doug has just calculated that around 4% of GitHub is written by Claude Code. We talk about how Doug uses Claude Code to do SemiAnalysis work.

Memory Mania

In the second part of this episode, we also check in on Memory Mania, which is going to affect you (yes, you) at home if it hasn’t already.

Full Episode on YouTube

Timestamps

00:00 AI as Junior Analyst
00:59 Meet Swyx and Doug
03:30 From Value Mule to Semis
06:28 Moore’s Law Ends Thesis
12:02 Claude Code Awakening
32:02 Agent Swarms Reality Check
32:53 Kimi Swarm Benchmarks
37:31 Bots vs Zapier Automation
39:44 Claude Code Workflow Setup
57:54 AGI Metrics and GDP
01:04:48 Railroad CapEx Analogy
01:06:00 Funding Bubbles and Demand
01:08:11 Agents Replace Work Tools
01:13:56 Codex vs Claude Race
01:21:15 Microsoft and TPU Strategy
01:34:13 TPU Window vs Nvidia
01:36:30 HBM Supply Chain Squeeze
01:39:41 Memory Shock and CXL
01:45:20 Context Rationing Future
01:54:37 Writing and Trail Lessons

Transcript

[00:00:00] AI as Junior Analyst [00:00:00] Doug: This crap makes mistakes all the time. All the time. It is still just like a, like I think of it once again as like a junior analyst, right?
The analyst goes and does all this like really pain in the ass information and you bring it all together to make a good decision at the top. Historically what happens is that junior analyst, who I once was, went and gathered all that information, and after doing this enough times, there’s a meta level thinking that’s happening where it’s like, okay, here’s what I really understand and how this type of analysis, I’m an expert in, actually I’m very good at, I consistently have a hit rate. [00:00:28] Now I’m the expert, right? I don’t think that meta level learning is there yet. We’ll see if l ones do it, right? Everyone who’s spending one quadrillion dollars in the world thinks it will, it better, it better happen by if you’re spending, you know, a trillion dollars and there’s not meta level learning. [00:00:44] But for me, in our firm, that massively amplifies everyone who is an expert. ‘cause like you have to still do something that you can just like lop it up. It’s very obvious to me. What It’s slop. [00:00:59] Meet Swyx and Doug

Duration:02:04:13

⚡️SWE-Bench-Dead: The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

2/23/2026

Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog post (https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/) arguing that SWE-Bench Verified—long treated as a key “North Star” coding benchmark—has become saturated and highly contaminated, making it less useful for measuring real coding progress. SWE-Bench Verified originated as a major OpenAI-led cleanup of the original Princeton SWE-Bench benchmark, including a large human review effort with nearly 100 software engineers and multiple independent reviews to curate ~500 higher-quality tasks. But recent findings show that many remaining failures can reflect unfair or overly narrow tests (e.g., requiring specific naming or unspecified implementation details) rather than true model inability, and cite examples suggesting contamination such as models recalling repository-specific implementation details or task identifiers. From now on, OpenAI plans to stop reporting SWE-Bench Verified and instead focus on SWE-Bench Pro (from Scale), which is harder, more diverse (more repos and languages), includes longer tasks (1–4 hours and 4+ hours), and shows substantially less evidence of contamination under their “contamination auditor agent” analysis. We also discuss what future coding/agent benchmarks should measure beyond pass/fail tests—longer-horizon tasks, open-ended design decisions, code quality/maintainability, and real-world product-building—along with the tradeoffs between fast automated grading and human-intensive evaluation. 
00:00 Meet the Frontier Evals Team
00:56 Why SWE Bench Stalled
01:47 How Verified Was Built
04:32 Contamination In The Wild
06:16 Unfair Tests And Narrow Specs
08:40 When Benchmarks Saturate
10:28 Switching To SWE Bench Pro
12:31 What Great Coding Evals Measure
18:17 Beyond Tests Dollars And Autonomy
21:49 Preparedness And Future Directions

Get full access to Latent.Space at www.latent.space/subscribe

Duration:00:26:12