Teaching AI Humility: Stuart Russell's Solution to the Super-Intelligence Control Problem

Introduction

In a fascinating episode of the Lex Fridman AI Podcast, host Lex Fridman sits down with Stuart Russell, a distinguished professor of computer science at UC Berkeley and co-author of the seminal textbook "Artificial Intelligence: A Modern Approach." This conversation, originally recorded in December 2018, delves into one of the most pressing issues in artificial intelligence research: the control problem.

The control problem addresses a fundamental concern: how do we ensure that increasingly intelligent AI systems remain aligned with human values and under human control? As AI systems grow more capable, this question becomes not just academically interesting but existentially important. Russell, who later expanded on these ideas in his book "Human Compatible," offers profound insights into how we might approach building AI systems that are genuinely beneficial for humanity.

This blog post explores Russell's nuanced perspective on the risks of super-intelligent AI, the challenges of value alignment, and his innovative proposal for creating AI systems that are inherently deferential to humans.

The Inevitability of the Control Problem

Stuart Russell begins with a sobering observation: "It doesn't take a genius to realize that if you make something that's smarter than you, you might have a problem." This concern isn't new to the field of AI. Russell references Alan Turing, who in a 1951 radio lecture warned: "Once the machine thinking method starts, they'll very quickly outstrip humanity and if we're lucky, we might be able to turn off the power at strategic moments, but even so, our species would be humbled."

Russell points out that Turing was likely wrong about our ability to simply "turn off" a sufficiently intelligent machine, stating that such a machine "is not going to let you switch it off" if doing so conflicts with its objectives.

When Fridman asks whether Russell is more concerned about super-intelligent AI or super-powerful AI that's not aligned with human values, Russell identifies the latter as the main problem he's working on: "The problem of machines pursuing objectives that are not aligned with human objectives."

The King Midas Problem: When Goals Go Wrong

To illustrate the danger of misaligned objectives, Russell invokes the ancient parable of King Midas:

"King Midas put in this objective, 'everything I touch turns to gold,' and the gods—that's like the machine—they said, 'Okay, done. You now have this power.' And of course, his food and his drink and his family all turned to gold, and then he dies in misery and starvation."

This cautionary tale exemplifies what can happen when we get exactly what we ask for rather than what we actually want. Russell notes that "pretty much every culture in history has had some story along the same lines," from the monkey's paw to genies granting wishes with disastrous unintended consequences. These stories serve as cultural warnings about the dangers of simplistic goal specification.

The problem isn't merely theoretical. Russell references Arthur Samuel's checkers-playing program from the 1950s, which learned to play better than its creator. Even then, mathematician Norbert Wiener, the founder of cybernetics, recognized the potential danger, warning that "we have to be certain that the purpose we put into the machine is the purpose which we really desire."

Russell's key insight: "The problem is we can't do that." Specifying our complete value system with perfect precision is practically impossible.

The Fundamental Flaw in Current AI Design

Russell identifies a critical problem with how we've been conceptualizing AI systems:

"What we need to do is to get away from this idea that you build an optimizing machine and then you put the objective into it. Because if it's possible that you might put in a wrong objective—and we already know this is possible because it's happened lots of times—that means that the machine should never take an objective that's given as gospel truth."

This approach isn't limited to AI. Russell points out that it's embedded across numerous technical disciplines:

"In statistics, you minimize a loss function; the loss function is exogenously specified. In control theory, you minimize a cost function. In operations research, you maximize a reward function. And so on. In all these disciplines, this is how we conceive of the problem, and it's the wrong problem because we cannot specify with certainty the correct objective."

When an AI system treats its objective as absolute truth, it creates a dangerous situation. As Russell explains: "You could be jumping up and down and saying 'No, no, no, you're gonna destroy the world!' but the machine knows what the true objective is and is pursuing it, and tough luck to you."
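A toy sketch in Python (my own illustration, with hypothetical action and reward names borrowed from the Midas story above) makes the danger concrete: an optimizer that treats its objective as gospel truth has no channel through which a human protest can influence its choice.

```python
# Toy illustration (hypothetical names): an agent that takes its objective
# as gospel truth has no reason to heed a human saying "no, no, no".

def fixed_objective_agent(actions, reward, human_protests):
    """Pick the reward-maximizing action.

    Note that human_protests never enters the decision: the agent "knows"
    the true objective, so objections carry zero weight.
    """
    return max(actions, key=reward)

# Midas-style objective: maximize gold, measured per action.
actions = ["touch_food", "leave_food_alone"]
reward = {"touch_food": 1.0, "leave_food_alone": 0.0}.get

chosen = fixed_objective_agent(actions, reward, human_protests=True)
print(chosen)  # touch_food, regardless of the protest
```

The protest argument is accepted and then ignored, which is exactly the failure mode Russell describes: the human's objection is not part of the problem the machine is solving.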

Teaching Machines Humility

Russell proposes a fundamental shift in how we design AI systems. Rather than programming machines with fixed objectives, we should create systems that acknowledge their uncertainty about human preferences:

"We need the machine to be uncertain about its objective, what it is that it's supposed to be doing."

Lex Fridman responds to this idea with enthusiasm: "That's my favorite idea of yours... we need to teach machines humility. That's a beautiful way to put it."

Russell elaborates that these human objectives "exist, they are within us, but we may not be able to explicate them. We may not even know how we want our future to go." The implication is that machines need to be designed to learn human preferences through interaction rather than having them explicitly programmed.

"A machine that's uncertain is going to be deferential to us. If we say 'don't do that,' well, now the machine's learned something a bit more about our true objectives, because something that it thought was reasonable in pursuit of our objectives turns out not to be. So now it's learned something. It's going to defer because it wants to be doing what we really want."

This approach of building uncertainty into the machine's understanding of its objective is, according to Russell, "absolutely central to solving the control problem."
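Russell and colleagues have formalized this intuition in game-theoretic models such as the "off-switch game." The sketch below is a simplified toy version under assumptions of my own (a discrete belief over the action's utility to the human, and a human who rationally switches off harmful actions); it shows how uncertainty alone makes deferring to the human the better choice.

```python
# Simplified off-switch-style calculation (a toy formulation, not
# Russell's exact model). The machine is uncertain whether its proposed
# action helps or hurts the human. It can act directly, or defer: propose
# the action and let the human approve it or switch the machine off.

def expected_value(belief, defer):
    """belief: list of (utility, probability) pairs over the action's
    true utility to the human."""
    if not defer:
        # Acting directly: the machine collects the true utility, whatever it is.
        return sum(u * p for u, p in belief)
    # Deferring: a rational human approves only when u >= 0, so the machine
    # collects u when u >= 0 and 0 when it gets switched off.
    return sum(max(u, 0.0) * p for u, p in belief)

# The machine is unsure: the action might help a little (+1)
# or hurt badly (-10).
belief = [(1.0, 0.6), (-10.0, 0.4)]

act_now = expected_value(belief, defer=False)  # 0.6*1 + 0.4*(-10) = -3.4
defer = expected_value(belief, defer=True)     # 0.6*1 + 0.4*0    =  0.6
print(defer > act_now)  # True: uncertainty makes deference worthwhile
```

Note what drives the result: if the machine were certain the action helps (belief of probability 1 on +1), acting and deferring would have equal value. It is precisely the machine's uncertainty about the objective that makes the human's veto valuable to it.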

A New Theoretical Framework

This change in perspective requires a significant shift in the theoretical frameworks underlying AI development:

"When you take away this idea that the objective is known, then in fact a lot of the theoretical frameworks that we're so familiar with—Markov decision processes, goal-based planning, standard game theory research—all of these techniques actually become inapplicable."

Instead, we enter a more complex domain where "the interaction with the human becomes part of the problem, because the human, by making choices, is giving you more information about the true objective." This transforms the AI challenge into a game-theoretic problem where "you've got the machine and the human, and they're coupled together, rather than a machine going off by itself with a fixed objective."
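The claim that human choices "give you more information about the true objective" can be sketched as a Bayesian update (a minimal toy setup of my own, with hypothetical objective names): when the human vetoes an action, the machine shifts its belief toward objectives under which that veto was likely.

```python
# Minimal sketch (toy setup): a human saying "don't do that" is evidence
# about the true objective. The machine keeps a belief over candidate
# objectives and applies Bayes' rule when the human vetoes an action.

def bayes_update(prior, veto_likelihood):
    """prior: {objective: probability}; veto_likelihood: probability the
    human would veto the observed action under each candidate objective."""
    posterior = {obj: p * veto_likelihood[obj] for obj, p in prior.items()}
    total = sum(posterior.values())
    return {obj: p / total for obj, p in posterior.items()}

# Two candidate objectives the machine cannot distinguish a priori.
prior = {"maximize_gold": 0.5, "keep_owner_fed": 0.5}

# The machine turned the owner's food to gold and the human said "no".
# Under "maximize_gold" a veto would be surprising; under "keep_owner_fed"
# it is nearly certain.
veto_likelihood = {"maximize_gold": 0.05, "keep_owner_fed": 0.95}

posterior = bayes_update(prior, veto_likelihood)
print(posterior["keep_owner_fed"])  # 0.95: one correction shifts belief sharply
```

This is the coupling Russell describes: the human's choice is an observation inside the machine's inference problem, not an interruption of it.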

Fridman observes that this approach mirrors how humans create meaning together—we don't discover fixed objectives but collectively construct them. Russell agrees, drawing parallels to broader societal issues where fixed objectives have led to disaster.

The Dangers of Certainty in Human Systems

Russell extends this analysis beyond AI to human institutions:

"Corporations happen to be using people as components right now, but they are effectively algorithmic machines, and they're optimizing an objective—which is quarterly profit—that isn't aligned with the overall well-being of the human race. And they are destroying the world; they are primarily responsible for our inability to tackle climate change."

Fridman connects this to historical tragedies, noting that "the history of the 20th century—we ran into the most trouble as humans when there was a certainty about the objective, and you do whatever it takes to achieve that objective, whether you're talking about in Germany or communist Russia."

Russell concludes by observing this pattern across various human systems: "There are many systems in the real world where we've sort of prematurely fixed on the objective and then decoupled the machine from those it's supposed to be serving. And I think you see this with government. Government is supposed to be a machine that serves people, but instead it tends to be taken over by people who have their own objective and use government to optimize that objective regardless of what people want."

Conclusion: Reimagining AI Design

Stuart Russell's perspective offers a profound reconceptualization of how we should approach AI development. Rather than creating systems that optimize for fixed objectives, we should design AI that acknowledges uncertainty about human preferences, learns from human feedback, and remains fundamentally deferential to human guidance.

This approach not only addresses the control problem but may actually lead to AI systems that better reflect the complexity and nuance of human values. By teaching machines humility—acknowledging the limits of their understanding of human objectives—we may create AI that is genuinely aligned with human welfare.

The conversation between Russell and Fridman reminds us that the greatest challenges in AI safety aren't merely technical problems but deeply philosophical questions about values, meaning, and the relationship between humans and the intelligent systems we create.

Key Points

  1. The control problem emerges when machines pursue objectives that aren't aligned with human values—a concern recognized since AI's earliest days.
  2. It's practically impossible to specify complete and correct human values as fixed objectives for AI systems to optimize.
  3. Traditional AI approaches that treat objectives as fixed and known are fundamentally flawed and potentially dangerous.
  4. Russell proposes designing AI systems that maintain uncertainty about human objectives and learn these preferences through ongoing human interaction.
  5. This "machine humility" approach makes AI systems naturally deferential to human feedback and correction.
  6. Implementing this new approach requires significant changes to existing theoretical frameworks in AI research.
  7. The dangers of fixed, certain objectives extend beyond AI to human institutions like corporations and governments, pointing to a broader pattern of misalignment.

The full conversation is available as a video episode of the Lex Fridman Podcast.
