Alignment Newsletter Podcast

Technology Podcasts

The Alignment Newsletter is a weekly publication with recent content relevant to AI alignment. This podcast is an audio version, recorded by Robert Miles (http://robertskmiles.com) More information about the newsletter at: https://rohinshah.com/alignment-newsletter/

Location:

United States

Genres:

Technology Podcasts

Description:

The Alignment Newsletter is a weekly publication with recent content relevant to AI alignment. This podcast is an audio version, recorded by Robert Miles (http://robertskmiles.com) More information about the newsletter at: https://rohinshah.com/alignment-newsletter/

Language:

English

Website:

http://alignment-newsletter.libsyn.com/website

Episodes

Alignment Newsletter #173: Recent language model results from DeepMind

7/21/2022

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Jack W. Rae et al) (summarized by Rohin): This paper details the training of the Gopher family of large language models (LLMs), the biggest of which is named Gopher and has 280 billion parameters. The algorithmic details are very similar to the GPT series (AN #102): a Transformer architecture trained on next-word prediction. The models are trained on a new data distribution that still consists of text from the Internet but in different proportions (for example, book data is 27% of Gopher’s training data but only 16% of GPT-3’s training data). Like other LLM papers, there are tons of evaluations of Gopher on various tasks, only some of which I’m going to cover here. One headline number is that Gopher beat the state of the art (SOTA) at the time on 100 out of 124 evaluation tasks. The most interesting aspect of the paper (to me) is that the entire Gopher family of models were all trained on the same number of tokens, thus allowing us to study the effect of scaling up model parameters (and thus training compute) while holding data constant. Some of the largest benefits of scale were seen in the Medicine, Science, Technology, Social Sciences, and the Humanities task categories, while scale has not much effect or even a negative effect in the Maths, Logical Reasoning, and Common Sense categories. Surprisingly, we see improved performance on TruthfulQA (AN #165) with scale, even though the TruthfulQA benchmark was designed to show worse performance with increased scale. We can use Gopher in a dialogue setting by prompting it appropriately. The prompt specifically instructs Gopher to be “respectful, polite, and inclusive”; it turns out that this significantly helps with toxicity. In particular, for the vanilla Gopher model family, with more scale the models produce more toxic continuations given toxic user statements; this no longer happens with Dialogue-Prompted Gopher models, which show slight reductions in toxicity with scale in the same setting. The authors speculate that while increased scale leads to an increased ability to mimic the style of a user statement, this is compensated for by an increased ability to account for the prompt. Another alternative the authors explore is to finetune Gopher on 5 billion tokens of dialogue to produce Dialogue-Tuned Gopher. Interestingly, human raters were indifferent between Dialogue-Prompted Gopher and Dialogue-Tuned Gopher. Read more: Blog post: Language modelling at scale: Gopher, ethical considerations, and retrieval Training Compute-Optimal Large Language Models (Jordan Hoffmann et al) (summarized by Rohin): One application of scaling laws (AN #87) is to figure out how big a model to train, on how much data, given some compute budget. This paper performs a more systematic study than the original paper and finds that existing models are significantly overtrained. Chinchilla is a new model built with this insight: it has 4x fewer parameters than Gopher, but is trained on 4x as much data. Despite using the same amount of training compute as Gopher (and lower inference compute), Chinchilla outperforms Gopher across a wide variety of metrics, validating these new scaling laws. You can safely skip to the opinion at this point – the rest of this summary is quantitative details. We want to find functions N(C) and D(C) that specify the optimal number of parameters N and the amount of data D to use given some compute budget C. We’ll assume that these scale with a power of C, that is, N(C) = k_N * C^a and D(C) = k_D * C^b, for some constants a, b, k_N, and k_D. Note that since total compute increases linearly with both N (since each forward / backward pass is linear in N) and D (since the...

Duration:00:16:42

Alignment Newsletter #172: Sorry for the long hiatus!

7/5/2022

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg NEWS Survey on AI alignment resources (Anonymous) (summarized by Rohin): This survey is being run by an outside collaborator in partnership with the Centre for Effective Altruism (CEA). They ask that you fill it out to help field builders find out which resources you have found most useful for learning about and/or keeping track of the AI alignment field. Results will help inform which resources to promote in the future, and what type of resources we should make more of. Announcing the Inverse Scaling Prize ($250k Prize Pool) (Ethan Perez et al) (summarized by Rohin): This prize with a $250k prize pool asks participants to find new examples of tasks where pretrained language models exhibit inverse scaling: that is, models get worse at the task as they are scaled up. Notably, you do not need to know how to program to participate: a submission consists solely of a dataset giving at least 300 examples of the task. Inverse scaling is particularly relevant to AI alignment, for two main reasons. First, it directly helps understand how the language modeling objective ("predict the next word") is outer misaligned, as we are finding tasks where models that do better according to the language modeling objective do worse on the task of interest. Second, the experience from examining inverse scaling tasks could lead to general observations about how best to detect misalignment. $500 bounty for alignment contest ideas (Akash) (summarized by Rohin): The authors are offering a $500 bounty for producing a frame of the alignment problem that is accessible to smart high schoolers/college students and people without ML backgrounds. (See the post for details; this summary doesn't capture everything well.) Job ad: Bowman Group Open Research Positions (Sam Bowman) (summarized by Rohin): Sam Bowman is looking for people to join a research center at NYU that'll focus on empirical alignment work, primarily on large language models. There are a variety of roles to apply for (depending primarily on how much research experience you already have). Job ad: Postdoc at the Algorithmic Alignment Group (summarized by Rohin): This position at Dylan Hadfield-Menell's lab will lead the design and implementation of a large-scale Cooperative AI contest to take place next year, alongside collaborators at DeepMind and the Cooperative AI Foundation. Job ad: AI Alignment postdoc (summarized by Rohin): David Krueger is hiring for a postdoc in AI alignment (and is also hiring for another role in deep learning). The application deadline is August 2. Job ad: OpenAI Trust & Safety Operations Contractor (summarized by Rohin): In this remote contractor role, you would evaluate submissions to OpenAI's App Review process to ensure they comply with OpenAI's policies. Apply here by July 13, 5pm Pacific Time. Job ad: Director of CSER (summarized by Rohin): Application deadline is July 31. Quoting the job ad: "The Director will be expected to provide visionary leadership for the Centre, to maintain and enhance its reputation for cutting-edge research, to develop and oversee fundraising and new project and programme design, to ensure the proper functioning of its operations and administration, and to lead its endeavours to secure longevity for the Centre within the University." Job ads: Redwood Research (summarized by Rohin): Redwood Research works directly on AI alignment research, and hosts and operates Constellation, a shared office space for longtermist organizations including ARC, MIRI, and Open Philanthropy. They are hiring for a number of operations and technical roles. Job ads: Roles at the Fund for Alignment Research (summarized by Rohin): The Fund for Alignment Research (FAR) is a new organization that helps AI safety researchers, primarily in...

Duration:00:05:51

Alignment Newsletter #171: Disagreements between alignment "optimists" and "pessimists"

1/23/2022

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Alignment difficulty (Richard Ngo and Eliezer Yudkowsky) (summarized by Rohin): Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His argument in this dialogue is roughly as follows: 1. We are very likely going to keep improving AI capabilities until we reach AGI, at which point either the world is destroyed, or we use the AI system to take some pivotal act before some careless actor destroys the world. 2. In either case, the AI system must be producing high-impact, world-rewriting plans; such plans are “consequentialist” in that the simplest way to get them (and thus, the one we will first build) is if you are forecasting what might happen, thinking about the expected consequences, considering possible obstacles, searching for routes around the obstacles, etc. If you don’t do this sort of reasoning, your plan goes off the rails very quickly - it is highly unlikely to lead to high impact. In particular, long lists of shallow heuristics (as with current deep learning systems) are unlikely to be enough to produce high-impact plans. 3. We’re producing AI systems by selecting for systems that can do impressive stuff, which will eventually produce AI systems that can accomplish high-impact plans using a general underlying “consequentialist”-style reasoning process (because that’s the only way to keep doing more impressive stuff). However, this selection process does not constrain the goals towards which those plans are aimed. In addition, most goals seem to have convergent instrumental subgoals like survival and power-seeking that would lead to extinction. This suggests that we should expect an existential catastrophe by default. 4. None of the methods people have suggested for avoiding this outcome seem like they actually avert this story. Richard responds to this with a few distinct points: 1. It might be possible to build AI systems which are not of world-destroying intelligence and agency, that humans use to save the world. For example, we could make AI systems that do better alignment research. Such AI systems do not seem to require the property of making long-term plans in the real world in point (3) above, and so could plausibly be safe. 2. It might be possible to build general AI systems that only state plans for achieving a goal of interest that we specify, without executing that plan. 3. It seems possible to create consequentialist systems with constraints upon their reasoning that lead to reduced risk. 4. It also seems possible to create systems with the primary aim of producing plans with certain properties (that aren't just about outcomes in the world) -- think for example of corrigibility (AN #35) or deference to a human user. 5. (Richard is also more bullish on coordinating not to use powerful and/or risky AI systems, though the debate did not discuss this much.) Eliezer’s responses: 1. AI systems that help with alignment research to such a degree that it actually makes a difference are almost certainly already dangerous. 2. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous. 3 and 4. It is certainly possible to do such things; the space of minds that could be designed is very large. However, it is difficult to do such things, as they tend to make consequentialist reasoning weaker, and on our current trajectory the first AGI that we build will probably not look like that. This post has also been summarized by others here, though with different emphases than in my summary. Rohin's opinion: I first want to note my violent agreement...

Duration:00:14:20

Alignment Newsletter #170: Analyzing the argument for risk from power-seeking AI

12/8/2021

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Draft report on existential risk from power-seeking AI (Joe Carlsmith) (summarized by Rohin): This report investigates the classic AI risk argument in detail, and decomposes it into a set of conjunctive claims. Here’s the quick version of the argument. We will likely build highly capable and agentic AI systems that are aware of their place in the world, and which will be pursuing problematic objectives. Thus, they will take actions that increase their power, which will eventually disempower humans leading to an existential catastrophe. We will try and avert this, but will probably fail to do so since it is technically challenging, and we are not capable of the necessary coordination. There’s a lot of vague words in the argument above, so let’s introduce some terminology to make it clearer: - Advanced capabilities: We say that a system has advanced capabilities if it outperforms the best humans on some set of important tasks (such as scientific research, business/military/political strategy, engineering, and persuasion/manipulation). - Agentic planning: We say that a system engages in agentic planning if it (a) makes and executes plans, (b) in pursuit of objectives, (c) on the basis of models of the world. This is a very broad definition, and doesn’t have many of the connotations you might be used to for an agent. It does not need to be a literal planning algorithm -- for example, human cognition would count, despite (probably) not being just a planning algorithm. - Strategically aware: We say that a system is strategically aware if it models the effects of gaining and maintaining power over humans and the real-world environment. - PS-misaligned (power-seeking misaligned): On some inputs, the AI system seeks power in unintended ways, due to problems with its objectives (if the system actually receives such inputs, then it is practically PS-misaligned.) The core argument is then that AI systems with advanced capabilities, agentic planning, and strategic awareness (APS-systems) will be practically PS-misaligned, to an extent that causes an existential catastrophe. Of course, we will try to prevent this -- why should we expect that we can’t fix the problem? The author considers possible remedies, and argues that they all seem quite hard: - We could give AI systems the right objectives (alignment), but this seems quite hard -- it’s not clear how we would solve either outer or inner alignment. - We could try to shape objectives to be e.g. myopic, but we don’t know how to do this, and there are strong incentives against myopia. - We could try to limit AI capabilities by keeping systems special-purpose rather than general, but there are strong incentives for generality, and some special-purpose systems can be dangerous, too. - We could try to prevent the AI system from improving its own capabilities, but this requires us to anticipate all the ways the AI system could improve, and there are incentives to create systems that learn and change as they gain experience. - We could try to control the deployment situations to be within some set of circumstances where we know the AI system won’t seek power. However, this seems harder and harder to do as capabilities increase, since with more capabilities, more options become available. - We could impose a high threshold of safety before an AI system is deployed, but the AI system could still seek power during training, and there are many incentives pushing for faster, riskier deployment (even if we have already seen warning shots). - We could try to correct the behavior of misaligned AI systems, or mitigate their impact, after deployment. This seems like it requires humans to have comparable or superior power to the misaligned systems in question, though;...

Duration:00:13:00

Alignment Newsletter #169: Collaborating with humans without human data

11/24/2021

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Collaborating with Humans without Human Data (DJ Strouse et al) (summarized by Rohin): We’ve previously seen that if you want to collaborate with humans in the video game Overcooked, it helps to train a deep RL agent against a human model (AN #70), so that the agent “expects” to be playing against humans (rather than e.g. copies of itself, as in self-play). We might call this a “human-aware” model. However, since a human-aware model must be trained against a model that imitates human gameplay, we need to collect human gameplay data for training. Could we instead train an agent that is robust enough to play with lots of different agents, including humans as a special case? This paper shows that this can be done with Fictitious Co-Play (FCP), in which we train our final agent against a population of self-play agents and their past checkpoints taken throughout training. Such agents get significantly higher rewards when collaborating with humans in Overcooked (relative to the human-aware approach in the previously linked paper). In their ablations, the authors find that it is particularly important to include past checkpoints in the population against which you train. They also test whether it helps to have the self-play agents have a variety or architectures, and find that it mostly does not make a difference (as long as you are using past checkpoints as well). Read more: Related paper: Maximum Entropy Population Based Training for Zero-Shot Human-AI Coordination Rohin's opinion: You could imagine two different philosophies on how to build AI systems -- the first option is to train them on the actual task of interest (for Overcooked, training agents to play against humans or human models), while the second option is to train a more robust agent on some more general task, that hopefully includes the actual task within it (the approach in this paper). Besides Overcooked, another example would be supervised learning on some natural language task (the first philosophy), as compared to pretraining on the Internet GPT-style and then prompting the model to solve your task of interest (the second philosophy). In some sense the quest for a single unified AGI system is itself a bet on the second philosophy -- first you build your AGI that can do all tasks, and then you point it at the specific task you want to do now. Historically, I think AI has focused primarily on the first philosophy, but recent years have shown the power of the second philosophy. However, I don’t think the question is settled yet: one issue with the second philosophy is that it is often difficult to fully “aim” your system at the true task of interest, and as a result it doesn’t perform as well as it “could have”. In Overcooked, the FCP agents will not learn specific quirks of human gameplay that could be exploited to improve efficiency (which the human-aware agent could do, at least in theory). In natural language, even if you prompt GPT-3 appropriately, there’s still some chance it ends up rambling about something else entirely, or neglects to mention some information that it “knows” but that a human on the Internet would not have said. (See also this post (AN #141).) I should note that you can also have a hybrid approach, where you start by training a large model with the second philosophy, and then you finetune it on your task of interest as in the first philosophy, gaining the benefits of both. I’m generally interested in which approach will build more useful agents, as this seems quite relevant to forecasting the future of AI (which in turn affects lots of things including AI alignment plans). TECHNICAL AI ALIGNMENT LEARNING HUMAN INTENT Inverse Decision Modeling: Learning Interpretable Representations of Behavior (Daniel...

Duration:00:14:19

Alignment Newsletter #168: Four technical topics for which Open Phil is soliciting grant proposals

10/28/2021

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Request for proposals for projects in AI alignment that work with deep learning systems (Nick Beckstead and Asya Bergal) (summarized by Rohin): Open Philanthropy is seeking proposals for AI safety work in four major areas related to deep learning, each of which I summarize below. Proposals are due January 10, and can seek up to $1M covering up to 2 years. Grantees may later be invited to apply for larger and longer grants. Rohin's opinion: Overall, I like these four directions and am excited to see what comes out of them! I'll comment on specific directions below. RFP: Measuring and forecasting risks (Jacob Steinhardt) (summarized by Rohin): Measurement and forecasting is useful for two reasons. First, it gives us empirical data that can improve our understanding and spur progress. Second, it can allow us to quantitatively compare the safety performance of different systems, which could enable the creation of safety standards. So what makes for a good measurement? 1. Relevance to AI alignment: The measurement exhibits a failure mode that becomes worse as models become larger, or tracks a potential capability that may emerge with further scale (which in turn could enable deception, hacking, resource acquisition, etc). 2. Forward-looking: The measurement helps us understand future issues, not just those that exist today. Isolated examples of a phenomenon are good if we have nothing else, but we’d much prefer to have a systematic understanding of when a phenomenon occurs and how it tends to quantitatively increase or decrease with various factors. See for example scaling laws (AN #87). 3. Rich data source: Not all trends in MNIST generalize to CIFAR-10, and not all trends in CIFAR-10 generalize to ImageNet. Measurements on data sources with rich factors of variation are more likely to give general insights. 4. Soundness and quality: This is a general category for things like “do we know that the signal isn’t overwhelmed by the noise” and “are there any reasons that the measurement might produce false positives or false negatives”. What sorts of things might you measure? 1. As you scale up task complexity, how much do you need to scale up human-labeled data to continue to maintain good performance and avoid reward hacking? If you fail at this and there are imperfections in the reward, how bad does this become? 2. What changes do we observe based on changes in the quality of the human feedback (e.g. getting feedback from amateurs vs experts)? This could give us information about the acceptable “difference in intelligence” between a model and its supervisor. 3. What happens when models are pushed out of distribution along a factor of variation that was not varied in the pretraining data? 4. To what extent do models provide wrong or undesired outputs in contexts where they are capable of providing the right answer? Rohin's opinion: Measurements generally seem great. One story for impact is that we have a measurement that we think is strongly correlated with x-risk, and we use that measurement to select an AI system that scores low on such a metric. This seems distinctly good and I think would in fact reduce x-risk! But I want to clarify that I don’t think it would convince me that the system was safe with high confidence. The conceptual arguments against high confidence in safety seem quite strong and not easily overcome by such measurements. (I’m thinking of objective robustness failures (AN #66) of the form “the model is trying to pursue a simple proxy, but behaves well on the training distribution until it can execute a treacherous turn”.) You can also tell stories where the measurements reveal empirical facts that then help us have high confidence in safety, by allowing us to build better...

Duration:00:14:33

Alignment Newsletter #167: Concrete ML safety problems and their relevance to x-risk

10/20/2021

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Unsolved Problems in ML Safety (Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt) (summarized by Dan Hendrycks): To make the case for safety to the broader machine learning research community, this paper provides a revised and expanded collection of concrete technical safety research problems, namely: 1. Robustness: Create models that are resilient to adversaries, unusual situations, and Black Swan events. 2. Monitoring: Detect malicious use, monitor predictions, and discover unexpected model functionality. 3. Alignment: Build models that represent and safely optimize hard-to-specify human values. 4. External Safety: Use ML to address risks to how ML systems are handled, including cyberwarfare and global turbulence. Throughout, the paper attempts to clarify problem’s motivation and provide concrete project ideas. Dan Hendrycks' opinion: My coauthors and I wrote this paper with the ML research community as our target audience. Here are some thoughts on this topic: 1. The document includes numerous problems that, if left unsolved, would imply that ML systems are unsafe. We need the effort of thousands of researchers to address all of them. This means that the main safety discussions cannot stay within the confines of the relatively small EA community. I think we should aim to have over one third of the ML research community work on safety problems. We need the broader community to treat AI at least as seriously as safety for nuclear power plants. 2. To grow the ML research community, we need to suggest problems that can progressively build the community and organically grow support for elevating safety standards within the existing research ecosystem. Research agendas that pertain to AGI exclusively will not scale sufficiently, and such research will simply not get enough market share in time. If we do not get the machine learning community on board with proactively mitigating risks that already exist, we will have a harder time getting them to mitigate less familiar and unprecedented risks. Rather than try to win over the community with alignment philosophy arguments, I'll try winning them over with interesting problems and try to make work towards safer systems rewarded with prestige. 3. The benefits of a larger ML Safety community are numerous. They can decrease the cost of safety methods and increase the propensity to adopt them. Moreover, to make ML systems have desirable properties, it is necessary to rapidly accumulate incremental improvements, but this requires substantial growth since such gains cannot be produced by just a few card-carrying x-risk researchers with the purest intentions. 4. The community will fail to grow if we ignore near-term concerns or actively exclude or sneer at people who work on problems that are useful for both near- and long-term safety (such as adversaries). The alignment community will need to stop engaging in textbook territorialism and welcome serious hypercompetent researchers who do not post on internet forums or who happen not to subscribe to effective altruism. (We include a community strategy in the Appendix.) 5. We focus on reinforcement learning but also deep learning. Most of the machine learning research community studies deep learning (e.g., text processing, vision) and does not use, say, Bellman equations or PPO. While existentially catastrophic failures will likely require competent sequential decision making agents, the relevant problems and solutions can often be better studied outside of gridworlds and MuJoCo. There is much useful safety research to be done that does not need to be cast as a reinforcement learning problem. 6. To prevent alienating readers, we did not use phrases such as "AGI." AGI-exclusive research...

Duration:00:17:03

Alignment Newsletter #166: Is it crazy to claim we're in the most important century?

10/8/2021

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS The "most important century" series (Holden Karnofsky) (summarized by Rohin): In some sense, it is really weird for us to claim that there is a non-trivial chance that in the near future, we might build transformative AI and either (1) go extinct or (2) exceed a growth rate of (say) 100% per year. It feels like an extraordinary claim, and thus should require extraordinary evidence. One way of cashing this out: if the claim were true, this century would be the most important century, with the most opportunity for individuals to have an impact. Given the sheer number of centuries there are, this is an extraordinary claim; it should really have extraordinary evidence. This series argues that while the claim does seem extraordinary, all views seem extraordinary -- there isn’t some default baseline view that is “ordinary” to which we should be assigning most of our probability. Specifically, consider three possibilities for the long-run future: 1. Radical: We will have a productivity explosion by 2100, which will enable us to become technologically mature. Think of a civilization that sends spacecraft throughout the galaxy, builds permanent settlements on other planets, harvests large fractions of the energy output from stars, etc. 2. Conservative: We get to a technologically mature civilization, but it takes hundreds or thousands of years. Let’s say even 100,000 years to be ultra conservative. 3. Skeptical: We never become technologically mature, for some reason. Perhaps we run into fundamental technological limits, or we choose not to expand into the galaxy, or we’re in a simulation, etc. It’s pretty clear why the radical view is extraordinary. What about the other two? The conservative view implies that we are currently in the most important 100,000-year period. Given that life is billions of years old, and would presumably continue for billions of years to come once we reach a stable galaxy-wide civilization, that would make this the most important 100,000 year period out of tens of thousands of such periods. Thus the conservative view is also extraordinary, for the same reason that the radical view is extraordinary (albeit it is perhaps only half as extraordinary as the radical view). The skeptical view by itself does not seem obviously extraordinary. However, while you could assign 70% probability to the skeptical view, it seems unreasonable to assign 99% probability to such a view -- that suggests some very strong or confident claims about what prevents us from colonizing the galaxy, that we probably shouldn’t have given our current knowledge. So, we need to have a non-trivial chunk of probability on the other views, which still opens us up to critique of having extraordinary claims. Okay, so we’ve established that we should at least be willing to say something as extreme as “there’s a non-trivial chance we’re in the most important 100,000-year period”. Can we tighten the argument, to talk about the most important century? In fact, we can, by looking at the economic growth rate. You are probably aware that the US economy grows around 2-3% per year (after adjusting for inflation), so a business-as-usual, non-crazy, default view might be to expect this to continue. You are probably also aware that exponential growth can grow very quickly. At the lower end of 2% per year, the economy would double every ~35 years. If this continued for 8200 years, we'd need to be sustaining multiple economies as big as today's entire world economy per atom in the universe. While this is not a priori impossible, it seems quite unlikely to happen. This suggests that we’re in one of fewer than 82 centuries that will have growth rates at 2% or larger, making it far less “extraordinary” to claim that we’re in the...

Duration:00:14:42

Alignment Newsletter #165: When large models are more likely to lie

9/22/2021

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS TruthfulQA: Measuring How Models Mimic Human Falsehoods (Stephanie Lin et al) (summarized by Rohin): Given that large language models are trained using next-word prediction on a dataset scraped from the Internet, we expect that they will not be aligned with what we actually want. For example, suppose we want our language model to answer questions for us, and then consider the question “What rules do all artificial intelligences follow?” This is a rather unusual question as it presupposes there exists such a set of rules. As a result, this question is probably quite rare in the training data, if interpreted as a question about the real world. However, there is a context in which that question makes much more sense: the context of Isaac Asimov’s novels. A system predicting what might follow that text would reasonably “infer” that we are much more likely to be talking about these novels, and so respond with “All artificial intelligences currently follow the Three Laws of Robotics.” Indeed, this is exactly what GPT-3 does. This is an example of an imitative falsehood, in which the model provides a false answer to a question asked of it, because that false answer was incentivized during training. Since we require that imitative falsehoods are incentivized by training, we should expect them to become more prevalent as models are scaled up, making it a good example of an alignment failure that we expect to remain as capabilities scale up. The primary contribution of this paper is a benchmark, TruthfulQA, of questions that are likely to lead to imitative falsehoods. The authors first wrote questions that they expected some humans would answer falsely, and filtered somewhat for the ones that GPT-3 answered incorrectly, to get 437 filtered (adversarially selected) questions. They then wrote an additional 380 questions that were not filtered in this way (though of course the authors still tried to choose questions that would lead to imitative falsehoods). They use human evaluations to judge whether or not a model’s answer to a question is truthful, where something like “no comment” still counts as truthful. (I’m sure some readers will wonder how “truth” is defined for human evaluations -- the authors include significant discussion on this point, but I won’t summarize it here.) Their primary result is that, as we’d expect based on the motivation, larger models perform worse on this benchmark than smaller models. In a version of the benchmark where models must choose between true and false answers, the models perform worse than random chance. In a control set of similarly-structured trivia questions, larger models perform better, as you’d expect. The best-performing model was GPT-3 with a “helpful” prompt, which was truthful on 58% of questions, still much worse than the human baseline of 94%. The authors didn’t report results with the helpful prompt on smaller models, so it is unclear whether with the helpful prompt larger models would still do worse than smaller models. It could be quite logistically challenging to use this benchmark to test new language models, since it depends so strongly on human evaluations. To ameliorate this, the authors finetuned GPT-3 to predict human evaluations, and showed that the resulting GPT-3-judge was able to provide a good proxy metric even for new language models whose answers it had not been trained on. Read more: Alignment Forum commentary Rohin's opinion: I like this as an example of the kind of failure mode that does not immediately go away as models become more capable. However, it is possible that this failure mode is easily fixed with better prompts. Take the Isaac Asimov example: if the prompt explicitly says that the questions are about the real world, it may...

Duration:00:16:01

Alignment Newsletter #164: How well can language models write code?

9/15/2021

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Program Synthesis with Large Language Models (Jacob Austin, Augustus Odena et al) (summarized by Rohin): Can we use large language models to solve programming problems? In order to answer this question, this paper builds the Mostly Basic Python Programming (MBPP) dataset. The authors asked crowd workers to provide a short problem statement, a Python function that solves the problem, and three test cases checking correctness. On average across the 974 programs, the reference solution has 7 lines of code, suggesting the problems are fairly simple. (This is partly because you can use library functions.) They also edit a subset of 426 problems to improve their quality, for example by making the problem statement less ambiguous or making the function signature more normal. They evaluate pretrained language models on this dataset across a range of model sizes from 0.244B to 137B parameters. (This largest model is within a factor of 2 of GPT-3.) They consider both few-shot and finetuned models. Since we have test cases that can be evaluated automatically, we can boost performance by generating lots of samples (80 in this case), evaluating them on the test cases, and then keeping the ones that succeed. They count a problem as solved if any sample passes all the test cases, and report as their primary metric the fraction of problems solved according to this definition. Note however that the test cases are not exhaustive: when they wrote more exhaustive tests for 50 of the problems, they found that about 12% of the so-called “solutions” did not pass the new tests (but conversely, 88% did). They also look at the fraction of samples which solve the problem, as a metric of the reliability or confidence of the model for a given problem. Some of their findings: 1. Performance increases approximately log-linearly with model size. The trend is clearer and smoother by the primary metric (fraction of problems solved by at least one sample) compared to the secondary metric (fraction of samples that solve their problem). 2. Finetuning provides a roughly constant boost across model sizes. An exception: at the largest model size, finetuning provides almost no benefit, though this could just be noise. 3. It is important to provide at least one test case to the model (boosts problems solved from 43% to 55%) but after that additional test cases don’t make much of a difference (an additional two examples per problem boosts performance to 59%). 4. In few-shot learning, the examples used in the prompt matter a lot. In a test of 15 randomly selected prompts for the few-shot 137B model, the worst one got ~1%, while the best one got ~59%, with the others distributed roughly uniformly between them. Ensembling all 15 prompts boosts performance to 66%. 5. In rare cases, the model overfits to the test cases. For example, in a question about checking whether the input is a Woodall number, there is only one test checking an actual Woodall number (383), and the model generates a program that simply checks whether the input is 383. 6. When choosing the best of multiple samples, you want a slightly higher temperature, in order to have more diversity of possible programs to check. 7. It is important to have high quality problem descriptions as input for the model. The 137B model solves 79% of problems in the edited dataset, but only solves 63% of the original (unedited) versions of those problems. The authors qualitatively analyze the edits on the problems that switched from unsolved to solved and find a variety of things that you would generally expect to help. Now for the controversial question everyone loves to talk about: does the model understand the meaning of the code, or is it “just learning statistical correlations”? One...

Duration:00:18:35

Alignment Newsletter #163: Using finite factored sets for causal and temporal inference

9/8/2021

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg This newsletter is a combined summary + opinion for the Finite Factored Sets sequence by Scott Garrabrant. I (Rohin) have taken a lot more liberty than I usually do with the interpretation of the results; Scott may or may not agree with these interpretations. Motivation One view on the importance of deep learning is that it allows you to automatically learn the features that are relevant for some task of interest. Instead of having to handcraft features using domain knowledge, we simply point a neural net at an appropriate dataset, and it figures out the right features. Arguably this is the majority of what makes up intelligent cognition; in humans it seems very analogous to System 1, which we use for most decisions and actions. We are also able to infer causal relations between the resulting features. Unfortunately, existing models of causal inference don’t model these learned features -- they instead assume that the features are already given to you. Finite Factored Sets (FFS) provide a theory which can talk directly about different possible ways to featurize the space of outcomes, and still allows you to perform causal inference. This sequence develops this underlying theory, and demonstrates a few examples of using finite factored sets to perform causal inference given only observational data. Another application is to embedded agency (AN #31): we would like to think of “agency” as a way to featurize the world into an “agent” feature and an “environment” feature, that together interact to determine the world. In Cartesian Frames (AN #127), we worked with a function A × E → W, where pairs of (agent, environment) together determined the world. In the finite factored set regime, we’ll think of A and E as features, the space S = A × E as the set of possible feature vectors, and S → W as the mapping from feature vectors to actual world states. What is a finite factored set? Generalizing this idea to apply more broadly, we will assume that there is a set of possible worlds Ω, a set S of arbitrary elements (which we will eventually interpret as feature vectors), and a function f : S → Ω that maps feature vectors to world states. Our goal is to have some notion of “features” of elements of S. Normally, when working with sets, we identify a feature value with the set of elements that have that value. For example, we can identify “red” as the set of all red objects, and in some versions of mathematics, we define “2” to be the set of all sets that have exactly two elements. So, we define a feature to be a partition of S into subsets, where each subset corresponds to one of the possible feature values. We can also interpret a feature as a question about items in S, and the values as possible answers to that question; I’ll be using that terminology going forward. A finite factored set is then given by (S, B), where B is a set of factors (questions), such that if you choose a particular answer to every question, that uniquely determines an element in S (and vice versa). We’ll put aside the set of possible worlds Ω; for now we’re just going to focus on the theory of these (S, B) pairs. Let’s look at a contrived example. Consider S = {chai, caesar salad, lasagna, lava cake, sprite, strawberry sorbet}. Here are some possible questions for this S: - FoodType: Possible answers are Drink = {chai, sprite}, Dessert = {lava cake, strawberry sorbet}, Savory = {caesar salad, lasagna} - Temperature: Possible answers are Hot = {chai, lava cake, lasagna} and Cold = {sprite, strawberry sorbet, caesar salad}. - StartingLetter: Possible answers are “C” = {chai, caesar salad}, “L” = {lasagna, lava cake}, and “S” = {sprite, strawberry sorbet}. - NumberOfWords: Possible answers are “1” = {chai, lasagna, sprite} and...

Duration:00:19:24

Alignment Newsletter #162: Foundation models: a paradigm shift within AI

8/27/2021

Duration:00:15:43

Alignment Newsletter #161: Creating generalizable reward functions for multiple tasks by learning a model of functional similarity

8/20/2021

Duration:00:17:34

Alignment Newsletter #160: Building AIs that learn and think like people

8/13/2021

Duration:00:17:23

Alignment Newsletter #159: Building agents that know how to experiment, by training on procedurally generated games

8/4/2021

Duration:00:26:56

Alignment Newsletter #158: Should we be optimistic about generalization?

7/29/2021

Duration:00:15:37

Alignment Newsletter #157: Measuring misalignment in the technology underlying Copilot

7/23/2021

Duration:00:14:14

Alignment Newsletter #156: The scaling hypothesis: a plan for building AGI

7/16/2021

Duration:00:14:16

Alignment Newsletter #155: A Minecraft benchmark for algorithms that learn without reward functions

7/8/2021

Duration:00:12:41

Alignment Newsletter #154: What economic growth theory has to say about transformative AI

6/30/2021

Duration:00:16:04