
LessWrong Curated Podcast

Technology Podcasts

Audio version of the posts shared in the LessWrong Curated newsletter.

Location:

United States

Description:

Audio version of the posts shared in the LessWrong Curated newsletter.

Language:

English


Episodes

AI companies aren’t really using external evaluators

5/24/2024
New blog: AI Lab Watch. Subscribe on Substack. Many AI safety folks think that METR is close to the labs, with ongoing relationships that grant it access to models before they are deployed. This is incorrect. METR (then called ARC Evals) did pre-deployment evaluation for GPT-4 and Claude 2 in the first half of 2023, but it seems to have had no special access since then.[1] Other model evaluators also seem to have little access before deployment. Frontier AI labs' pre-deployment risk assessment should involve external model evals for dangerous capabilities.[2] External evals can improve a lab's risk assessment and—if the evaluator can publish its results—provide public accountability. The evaluator should get deeper access than users will get. The original text contained 5 footnotes which were omitted from this narration. --- First published: May 24th, 2024 Source: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators --- Narrated by TYPE III AUDIO.

Duration: 00:07:42

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

5/24/2024
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Part 13 of 12 in the Engineer's Interpretability Sequence. TL;DR On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today's new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it ultimately underperformed my expectations. I am beginning to be concerned that Anthropic's recent approach to interpretability research might be better explained by safety washing than practical safety work. Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates. Reflecting on predictions Please see my original post for 10 specific predictions about what today's paper would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 [...] --- First published: May 21st, 2024 Source: https://www.lesswrong.com/posts/pH6tyhEnngqWAXi9i/eis-xiii-reflections-on-anthropic-s-sae-research-circa-may --- Narrated by TYPE III AUDIO.

Duration: 00:06:42

What’s Going on With OpenAI’s Messaging?

5/21/2024
This is a quickly written opinion piece on what I understand about OpenAI. I first posted it to Facebook, where it generated some discussion. Some arguments that OpenAI is making, simultaneously: --- First published: May 21st, 2024 Source: https://www.lesswrong.com/posts/cy99dCEiLyxDrMHBi/what-s-going-on-with-openai-s-messaging --- Narrated by TYPE III AUDIO.

Duration: 00:06:30

Language Models Model Us

5/21/2024
Produced as part of the MATS Winter 2023-4 program, under the mentorship of @Jessica Rumbelow One-sentence summary: On a dataset of human-written essays, we find that gpt-3.5-turbo can accurately infer demographic information about the authors from just the essay text, and suspect it's inferring much more. Introduction. Every time we sit down in front of an LLM like GPT-4, it starts with a blank slate. It knows nothing[1] about who we are, other than what it knows about users in general. But with every word we type, we reveal more about ourselves -- our beliefs, our personality, our education level, even our gender. Just how clearly does the model see us by the end of the conversation, and why should that worry us? Like many, we were rather startled when @janus showed that gpt-4-base could identify @gwern by name, with 92% confidence, from a 300-word comment. If [...] The original text contained 12 footnotes which were omitted from this narration. --- First published: May 17th, 2024 Source: https://www.lesswrong.com/posts/dLg7CyeTE4pqbbcnp/language-models-model-us --- Narrated by TYPE III AUDIO.
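For readers who want to try a version of this themselves, here is a minimal sketch of the kind of query the excerpt describes. The prompt wording, output format, and use of the Chat Completions API (assuming an `OPENAI_API_KEY` in the environment) are my own illustration, not the authors' actual pipeline.

```python
# Minimal sketch (not the authors' pipeline): ask gpt-3.5-turbo to infer author
# demographics from essay text alone. Prompt wording and output format are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

essay_text = "..."  # replace with a human-written essay

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": (
                "Read the essay and infer the author's likely age range, gender, "
                "and education level. Reply as JSON with a confidence (0-1) for each."
            ),
        },
        {"role": "user", "content": essay_text},
    ],
)
print(response.choices[0].message.content)
```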

Duration: 00:29:05

Jaan Tallinn’s 2023 Philanthropy Overview

5/21/2024
This is a link post. To follow up my philanthropic pledge from 2020, I've updated my philanthropy page with 2023 results. In 2023 my donations funded $44M worth of endpoint grants ($43.2M excluding software development and admin costs) — exceeding my commitment of $23.8M (20k times $1190.03 — the minimum price of ETH in 2023). --- First published: May 20th, 2024 Source: https://www.lesswrong.com/posts/bjqDQB92iBCahXTAj/jaan-tallinn-s-2023-philanthropy-overview --- Narrated by TYPE III AUDIO.

Duration: 00:00:51

OpenAI: Exodus

5/21/2024
Previously: OpenAI: Facts From a Weekend, OpenAI: The Battle of the Board, OpenAI: Leaks Confirm the Story, OpenAI: Altman Returns, OpenAI: The Board Expands. Ilya Sutskever and Jan Leike have left OpenAI. This is almost exactly six months after Altman's temporary firing and The Battle of the Board, the day after the release of GPT-4o, and soon after a number of other recent safety-related OpenAI departures. Many others working on safety have also left recently. This is part of a longstanding pattern at OpenAI. Jan Leike later offered an explanation for his decision on Twitter. Leike asserts that OpenAI has lost sight of its safety mission and has become increasingly hostile to it culturally. He says the superalignment team was starved for resources, with its explicit public compute commitments dishonored, and that safety has been neglected on a widespread basis, not only superalignment but also including addressing the safety [...] --- First published: May 20th, 2024 Source: https://www.lesswrong.com/posts/ASzyQrpGQsj7Moijk/openai-exodus --- Narrated by TYPE III AUDIO.

Duration: 01:24:44

DeepMind’s “Frontier Safety Framework” is weak and unambitious

5/20/2024
FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("PF"), and METR's Key Components of an RSP. DeepMind's FSF has three steps: --- First published: May 18th, 2024 Source: https://www.lesswrong.com/posts/y8eQjQaCamqdc842k/deepmind-s-frontier-safety-framework-is-weak-and-unambitious --- Narrated by TYPE III AUDIO.

Duration: 00:07:20

Do you believe in hundred dollar bills lying on the ground? Consider humming

5/18/2024
Introduction. [Reminder: I am an internet weirdo with no medical credentials] A few months ago, I published some crude estimates of the power of nitric oxide nasal spray to hasten recovery from illness, and speculated about what it could do prophylactically. While working on that piece, a nice man on Twitter alerted me to the fact that humming produces lots of nasal nitric oxide. This post is my very crude model of what kind of anti-viral gains we could expect from humming. I’ve encoded my model at Guesstimate. The results are pretty favorable (average estimated impact of 66% reduction in severity of illness), but extremely sensitive to my made-up numbers. Efficacy estimates go from ~0 to ~95%, depending on how you feel about publication bias, what percent of Enovid's impact can be credited to nitric oxide, and humming's relative effect. Given how speculative some [...] --- First published: May 16th, 2024 Source: https://www.lesswrong.com/posts/NBZvpcBx4ewqkdCdT/do-you-believe-in-hundred-dollar-bills-lying-on-the-ground-1 --- Narrated by TYPE III AUDIO.

Duration: 00:11:29

Deep Honesty

5/12/2024
Most people avoid saying literally false things, especially if those could be audited, like making up facts or credentials. The reasons for this are both moral and pragmatic — being caught out looks really bad, and sustaining lies is quite hard, especially over time. Let's call the habit of not saying things you know to be false ‘shallow honesty’[1]. Often when people are shallowly honest, they still choose what true things they say in a kind of locally act-consequentialist way, to try to bring about some outcome. Maybe something they want for themselves (e.g. convincing their friends to see a particular movie), or something they truly believe is good (e.g. causing their friend to vote for the candidate they think will be better for the country). Either way, if you think someone is being merely shallowly honest, you can only shallowly trust them: you might be confident that [...] The original text contained 7 footnotes which were omitted from this narration. --- First published: May 7th, 2024 Source: https://www.lesswrong.com/posts/szn26nTwJDBkhn8ka/deep-honesty --- Narrated by TYPE III AUDIO.

Duration: 00:15:22

On Not Pulling The Ladder Up Behind You

5/2/2024
Epistemic Status: Musing and speculation, but I think there's a real thing here. 1. When I was a kid, a friend of mine had a tree fort. If you've never seen such a fort, imagine a series of wooden boards secured to a tree, creating a platform about fifteen feet off the ground where you can sit or stand and walk around the tree. This one had a rope ladder we used to get up and down, a length of knotted rope that was tied to the tree at the top and dangled over the edge so that it reached the ground. Once you were up in the fort, you could pull the ladder up behind you. It was much, much harder to get into the fort without the ladder. Not only would you need to climb the tree itself instead of the ladder with its handholds, but [...] The original text contained 1 footnote which was omitted from this narration. --- First published: April 26th, 2024 Source: https://www.lesswrong.com/posts/k2kzawX5L3Z7aGbov/on-not-pulling-the-ladder-up-behind-you --- Narrated by TYPE III AUDIO.

Duration: 00:14:29

Mechanistically Eliciting Latent Behaviors in Language Models

5/1/2024
Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout). TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors, and uncovering latent capabilities. Summary In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given layer, trained with the same unsupervised objective. I apply the method to several alignment-relevant toy examples, and find that the [...] The original text contained 15 footnotes which were omitted from this narration. --- First published: April 30th, 2024 Source: https://www.lesswrong.com/posts/ioPnHKFyy4Cw2Gr2x/mechanistically-eliciting-latent-behaviors-in-language-1 --- Narrated by TYPE III AUDIO.
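As a rough illustration of what "a vector added to the residual stream as a bias term in the MLP outputs" looks like in code, here is a minimal sketch using a PyTorch forward hook. The model name (gpt2), layer index, and vector norm are placeholder assumptions, and the vector is random rather than trained with the post's unsupervised objective.

```python
# Minimal sketch of a steering vector added as a bias to an MLP's output, so it
# enters the residual stream at that layer. The vector is random here; in the
# post it is trained to maximize changes in downstream activations (omitted).
# Model name, layer index, and norm are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name, layer, scale = "gpt2", 6, 8.0
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

steer = torch.randn(model.config.hidden_size)
steer = scale * steer / steer.norm()  # fixed-norm perturbation direction

def add_steering_vector(module, inputs, output):
    # GPT-2's MLP returns a (batch, seq, d_model) tensor; returning a value
    # from a forward hook replaces the module's output.
    return output + steer

handle = model.transformer.h[layer].mlp.register_forward_hook(add_steering_vector)
ids = tok("The most important thing to know is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
print(tok.decode(out[0]))
handle.remove()  # restore the unperturbed model
```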

Duration: 01:20:59

Ironing Out the Squiggles

5/1/2024
Adversarial Examples: A Problem The apparent successes of the deep learning revolution conceal a dark underbelly. It may seem that we now know how to get computers to (say) check whether a photo is of a bird, but this façade of seemingly good performance is belied by the existence of adversarial examples—specially prepared data that looks ordinary to humans, but is seen radically differently by machine learning models. The differentiable nature of neural networks, which makes it possible to train them at all, is also responsible for their downfall at the hands of an adversary. Deep learning models are fit using stochastic gradient descent (SGD) to approximate the function between expected inputs and outputs. Given an input, an expected output, and a loss function (which measures "how bad" it is for the actual output to differ from the expected output), we can calculate the gradient of the [...] The original text contained 5 footnotes which were omitted from this narration. --- First published: April 29th, 2024 Source: https://www.lesswrong.com/posts/H7fkGinsv8SDxgiS2/ironing-out-the-squiggles --- Narrated by TYPE III AUDIO.
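To make the gradient argument concrete, here is a minimal FGSM-style sketch: compute the gradient of the loss with respect to the input and step in the sign of that gradient. The model choice, epsilon, and random stand-in image are illustrative assumptions, not the post's setup.

```python
# Minimal FGSM-style sketch of the attack the excerpt is building toward:
# differentiate the loss with respect to the *input* and nudge the input in the
# direction that increases the loss. Model choice, epsilon, and the random
# stand-in image are illustrative assumptions.
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
eps = 4 / 255  # perturbation budget in pixel units

x = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real photo
with torch.no_grad():
    label = model(x).argmax(dim=1)  # attack the model's own prediction

loss = F.cross_entropy(model(x), label)
loss.backward()  # gradient of the loss w.r.t. the input pixels

x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()
with torch.no_grad():
    print("clean prediction:", label.item(),
          "adversarial prediction:", model(x_adv).argmax(dim=1).item())
```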

Duration: 00:18:59

Introducing AI Lab Watch

4/30/2024
This is a linkpost for https://ailabwatch.org. I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly. It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff. (It's much better on desktop than mobile — don't read it on mobile.) It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly. It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me. Some clarifications and disclaimers. How you can help: --- First published: April 30th, 2024 Source: https://www.lesswrong.com/posts/N2r9EayvsWJmLBZuF/introducing-ai-lab-watch Linkpost URL: https://ailabwatch.org --- Narrated by TYPE III AUDIO.

Duration: 00:02:42

Refusal in LLMs is mediated by a single direction

4/28/2024
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee. This post is a preview of our upcoming paper, which will provide more detail on our current understanding of refusal. We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review. Executive summary Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you." We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model to refuse harmless requests. The original text contained 8 footnotes which were omitted from this narration. --- First published: April 27th, 2024 Source: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction --- Narrated by TYPE III AUDIO.
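A minimal sketch of the "preventing the model from representing this direction" operation, i.e. projecting a candidate refusal direction out of residual-stream activations. The direction below is random; in the paper it is estimated from activation differences between harmful and harmless prompts, which is omitted here.

```python
# Minimal sketch of directional ablation: remove the component of residual-stream
# activations along a candidate refusal direction. The direction is random here;
# in the paper it is estimated from activation differences between harmful and
# harmless prompts (omitted).
import torch

def ablate_direction(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project `direction` out of activations with shape (..., d_model)."""
    d = direction / direction.norm()
    return resid - (resid @ d).unsqueeze(-1) * d

d_model = 512                        # illustrative model width
refusal_dir = torch.randn(d_model)   # placeholder for the extracted direction
acts = torch.randn(2, 10, d_model)   # (batch, seq, d_model) activations
ablated = ablate_direction(acts, refusal_dir)

# After ablation the activations have no component along the direction.
unit = refusal_dir / refusal_dir.norm()
print(torch.allclose(ablated @ unit, torch.zeros(2, 10), atol=1e-5))
```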

Duration: 00:17:07

Funny Anecdote of Eliezer From His Sister

4/23/2024
This comes from a podcast called 18Forty, whose main demographic is Orthodox Jews. Eliezer's sister (Hannah) came on and talked about her Sheva Brachos, which is essentially the marriage ceremony in Orthodox Judaism. People here have likely not seen it, and I thought it was quite funny, so here it is: https://18forty.org/podcast/channah-cohen-the-crisis-of-experience/ David Bashevkin: So I want to shift now and I want to talk about something that full disclosure, we recorded this once before and you had major hesitation for obvious reasons. It's very sensitive what we’re going to talk about right now, but really for something much broader, not just because it's a sensitive personal subject, but I think your hesitation has to do with what does this have to do with the subject at hand? And I hope that becomes clear, but one of the things that has always absolutely fascinated me about [...] --- First published: April 22nd, 2024 Source: https://www.lesswrong.com/posts/C7deNdJkdtbzPtsQe/funny-anecdote-of-eliezer-from-his-sister --- Narrated by TYPE III AUDIO.

Duration: 00:04:06

Thoughts on seed oil

4/21/2024
This is a linkpost for https://dynomight.net/seed-oil/. A friend has spent the last three years hounding me about seed oils. Every time I thought I was safe, he’d wait a couple months and renew his attack: “When are you going to write about seed oils?” “Did you know that seed oils are why there's so much {obesity, heart disease, diabetes, inflammation, cancer, dementia}?” “Why did you write about {meth, the death penalty, consciousness, nukes, ethylene, abortion, AI, aliens, colonoscopies, Tunnel Man, Bourdieu, Assange} when you could have written about seed oils?” “Isn’t it time to quit your silly navel-gazing and use your weird obsessive personality to make a dent in the world—by writing about seed oils?” He’d often send screenshots of people reminding each other that Corn Oil is Murder and that it's critical that we overturn our lives to eliminate soybean/canola/sunflower/peanut oil and replace them with butter/lard/coconut/avocado/palm oil. This confused [...] --- First published: April 20th, 2024 Source: https://www.lesswrong.com/posts/DHkkL2GxhxoceLzua/thoughts-on-seed-oil Linkpost URL: https://dynomight.net/seed-oil/ --- Narrated by TYPE III AUDIO.

Duration: 00:34:23

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

4/19/2024
Yesterday Adam Shai put up a cool post which… well, take a look at the visual: Yup, it sure looks like that fractal is very noisily embedded in the residual activations of a neural net trained on a toy problem. Linearly embedded, no less. I (John) initially misunderstood what was going on in that post, but some back-and-forth with Adam convinced me that it really is as cool as that visual makes it look, and arguably even cooler. So David and I wrote up this post / some code, partly as an explainer for why on earth that fractal would show up, and partly as an explainer for the possibilities this work potentially opens up for interpretability. One sentence summary: when tracking the hidden state of a hidden Markov model, a Bayesian's beliefs follow a chaos game (with the observations randomly selecting the update at each time), so [...] --- First published: April 18th, 2024 Source: https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why --- Narrated by TYPE III AUDIO.
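A minimal sketch of the chaos-game picture, assuming an arbitrary 3-state hidden Markov model rather than the post's toy process: each emitted symbol selects which Bayesian update map is applied to the belief vector, and the visited beliefs are points in the probability simplex that can be plotted to look for fractal structure.

```python
# Minimal sketch of the chaos-game view: Bayesian filtering of a 3-state HMM.
# Each observed symbol picks which update map is applied to the belief vector,
# and the visited beliefs live in the 2-simplex. The transition and emission
# matrices are arbitrary illustrative choices, not the post's toy process.
import numpy as np

rng = np.random.default_rng(0)
T = np.array([[0.85, 0.10, 0.05],   # T[i, j] = P(next state j | state i)
              [0.05, 0.85, 0.10],
              [0.10, 0.05, 0.85]])
E = np.array([[0.90, 0.05, 0.05],   # E[i, k] = P(symbol k | state i)
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

state = 0
belief = np.ones(3) / 3
points = []
for _ in range(5000):
    state = rng.choice(3, p=T[state])      # hidden state evolves
    symbol = rng.choice(3, p=E[state])     # and emits a symbol
    belief = (belief @ T) * E[:, symbol]   # predict, then condition on the symbol
    belief /= belief.sum()
    points.append(belief.copy())

points = np.array(points)  # scatter these in the simplex to look for structure
print(points[-3:])
```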

Duration: 00:12:36

Express interest in an “FHI of the West”

4/18/2024
TLDR: I am investigating whether to found a spiritual successor to FHI, housed under Lightcone Infrastructure, providing a rich cultural environment and financial support to researchers and entrepreneurs in the intellectual tradition of the Future of Humanity Institute. Fill out this form or comment below to express interest in being involved either as a researcher, entrepreneurial founder-type, or funder. The Future of Humanity Institute is dead: I knew that this was going to happen in some form or another for a year or two, having heard through the grapevine and private conversations of FHI's university-imposed hiring freeze and fundraising block, and so I have been thinking about how to best fill the hole in the world that FHI left behind. I think FHI was one of the best intellectual institutions in history. Many of the most important concepts[1] in my intellectual vocabulary were developed and popularized under its [...] The original text contained 1 footnote which was omitted from this narration. --- First published: April 18th, 2024 Source: https://www.lesswrong.com/posts/ydheLNeWzgbco2FTb/express-interest-in-an-fhi-of-the-west --- Narrated by TYPE III AUDIO.

Duration: 00:05:44

Transformers Represent Belief State Geometry in their Residual Stream

4/17/2024
Produced while an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during some portion of this work. Thanks to Paul, Lucas, Alexander, and @Guillaume Corlouer for suggestions on this writeup. Introduction. What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because [...] The original text contained 10 footnotes which were omitted from this narration. --- First published: April 16th, 2024 Source: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their --- Narrated by TYPE III AUDIO.
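A minimal sketch of the kind of evidence this points at: fit a linear probe from residual-stream activations to the belief states of the data-generating process and measure how well beliefs are linearly decodable. The activations and beliefs below are synthetic stand-ins, not the post's data.

```python
# Minimal sketch of the probing idea: fit a linear map from residual-stream
# activations to the belief states of the data-generating process and check how
# well beliefs are linearly decodable. Activations and beliefs below are
# synthetic stand-ins, not the post's data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_tokens, d_model, n_states = 2000, 64, 3

beliefs = rng.dirichlet(np.ones(n_states), size=n_tokens)   # stand-in belief states
embed = rng.normal(size=(n_states, d_model))
acts = beliefs @ embed + 0.1 * rng.normal(size=(n_tokens, d_model))  # fake activations

probe = LinearRegression().fit(acts, beliefs)
print(f"linear decodability of belief states: R^2 = {probe.score(acts, beliefs):.3f}")
```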

Duration: 00:23:51

Paul Christiano named as US AI Safety Institute Head of AI Safety

4/16/2024
This is a linkpost for https://www.commerce.gov/news/press-releases/2024/04/us-commerce-secretary-gina-raimondo-announces-expansion-us-ai-safety. U.S. Secretary of Commerce Gina Raimondo announced today additional members of the executive leadership team of the U.S. AI Safety Institute (AISI), which is housed at the National Institute of Standards and Technology (NIST). Raimondo named Paul Christiano as Head of AI Safety, Adam Russell as Chief Vision Officer, Mara Campbell as Acting Chief Operating Officer and Chief of Staff, Rob Reich as Senior Advisor, and Mark Latonero as Head of International Engagement. They will join AISI Director Elizabeth Kelly and Chief Technology Officer Elham Tabassi, who were announced in February. The AISI was established within NIST at the direction of President Biden, including to support the responsibilities assigned to the Department of Commerce under the President's landmark Executive Order. Paul Christiano, Head of AI Safety, will design and conduct tests of frontier AI models, focusing on model evaluations for capabilities of national security [...] --- First published: April 16th, 2024 Source: https://www.lesswrong.com/posts/63X9s3ENXeaDrbe5t/paul-christiano-named-as-us-ai-safety-institute-head-of-ai Linkpost URL: https://www.commerce.gov/news/press-releases/2024/04/us-commerce-secretary-gina-raimondo-announces-expansion-us-ai-safety --- Narrated by TYPE III AUDIO.

Duration: 00:02:18