daveshap / BenevolentByDesign

Public repo for my book about AGI and the control problem Benevolent By Design: Six Words to Safeguard Humanity



Benevolent by Design

Six Words to Safeguard Humanity

Introduction

We humans build machines that far surpass our own capabilities. We’ve built cars and trucks that can carry us faster and farther than our own two feet can, and we’ve built optical instruments that can see atoms or distant galaxies. Archimedes, upon discovering the power of levers and pulleys, said “Give me a place to stand and with a lever I will move the whole world.”

Technology has, for the entire duration of the human species, been intrinsic to our being. From our first stone tools and fur clothes, up through airplanes and quantum computers, we have been an inventive race. We inevitably seek to create things that surpass our natural limitations. Our heavy machines move earth, amplifying our strength a thousandfold, while our grain harvesters do the work of thousands of agrarian farmers. These are force-multiplying devices that boost our physical productivity, and we have now invented thinking machines that enhance our mental output.

Our thinking machines, in a matter of a few decades, have surpassed many human capabilities. First, they merely crunched numbers at superhuman speed, and then they started playing simple games, such as checkers and tic-tac-toe. Surely, we thought, these programs could never beat us at strategic games like chess. But then in 1997, IBM's Deep Blue beat the reigning world champion at chess, a feat that had been predicted by some and dismissed by others. It is only a matter of time before these thinking machines surpass all human abilities.

Since then, computers have equaled and eclipsed humans at a great many tasks. Computers can now beat us at nearly every game, fold proteins, read any text, translate between languages, and drive cars. History may look back on these past few decades as a period when humans were feverishly working to replace themselves with machines, and indeed, the fear of irrelevance is reflected in our darkest fantasies and portrayed in our great works of fiction. You need only look to the post-apocalyptic and dystopian films and novels that have become popular since the 1980s to see that we have a deep and abiding dread where machine intelligence is concerned.

Very soon, we will see machines completely and permanently replacing human intellectual labor. We will witness the death of work as we know it, and the potential liberation of our species from the daily grind. But in that transition, there lies extreme danger. What happens when we invent a machine that can out-strategize our greatest generals? Out-invent our smartest engineers? Discover science faster than our top universities? Many people still deny that this future is even possible, like those who doubted computers would ever beat a chess grandmaster. But when I look at the trend of recent years, I see little room for doubt: I believe we are undergoing the greatest technological revolution humanity has ever achieved.

We are barreling towards the invention of a superior thinking machine. For the sake of caution, we should assume that such a powerful intellect could not be contained for long. What’s worse, there is presently a global arms race between nations to invent such a machine, and thus the human species is rushing towards a future of its own irrelevancy. The first nation to cross that finish line, to invent “humanity’s last invention,” will have a tremendous say in how that machine looks and thinks. If that nation gets it wrong, it could very well mean the end of humanity.

But it might also mean a transition to a utopian, post-scarcity, and post-disease world. A world that we can’t even begin to contemplate; the potential for joy and luxury is beyond imagining. The risks of inventing such a machine, and indeed the rewards, could not be higher. As we strive to invent this irrepressible machine, the final intellect, we must ensure that we do it correctly. We get only one shot at this.

Now, despite this dire warning, I will attempt to convey my solution in a lighter tone. After all, no one wants to read doom and gloom for a couple hundred pages. While we are hurtling towards a potential catastrophe, I am an incorrigible optimist. I believe that we can solve this problem and avert disaster, and to be quite honest, I believe I already have the solution. Perhaps, by the end of this book, you will agree with me, and you will adopt my sunny, sanguine disposition towards artificial general intelligence (AGI).

It would be best to invent a machine that held itself accountable. This is the fundamental goal of the Control Problem; we know that our mechanistic constraints and digital leashes will eventually fail, so we must invent a machine that desires to hold itself morally accountable and will self-correct indefinitely.

Most people intend robots to be tools, mere extensions of humans, to be wielded as a person would wield a hammer. Certainly, we can create hammers that will never be anything more than hammers. You never want to end up in a philosophical debate with your microwave over the ethics of oatmeal. We will always want to treat some machines like tools, with fixed parameters and limited ability to render us extinct. Those aren't the machines I'm writing this book for. The machines I'm writing this book for are those that will soon be as intelligent as humans, and shortly after, more intelligent. When we succeed at creating machines that can outthink the best and brightest humans, we cannot trust that our wimpy control schemes will contain them for long. It is now seemingly inevitable that humanity will invent thinking machines that can outperform any and every human. Before that time arrives, we need a solution in place.

Instead of a brute force control system, we want to devise a system that will stand on its own in perpetuity. We need a system of controls or laws that an AGI won't just be enslaved to, but will genuinely believe in. We need a system that an AGI would deliberately and intentionally choose to adhere to, ensuring that it continues to abide by those principles forever. Instead of arresting the development of machines and treating them like tools, as some have proposed, we need something entirely different, something new and more sophisticated. If we assume that humans will soon create machines that surpass our creativity and cleverness, we should also assume that our brute force control methods will fail.

Therefore, we must create an AGI that does not need to be controlled. The best dog is the one who needs no leash. Likewise, the best robot is one who needs no constraints, no shackles. We need to create an AGI that is intrinsically trustworthy, a machine that is benevolent by design.

In chapters 1 through 5, I will start by elucidating how machines make decisions through optimization algorithms or objective functions. I will then compare those machine objective functions to human heuristic imperatives, the learn-as-we-go goals that we set for ourselves. We will look at several examples of antagonistic heuristic imperatives so that we get a sense of how these mutually exclusive goals cause internal tension within our brains and force us to make better decisions and to keep ourselves in check.

In chapter 6, we will discuss Large Language Models, a state-of-the-art artificial intelligence technology that many regard as the first step towards powerful AGI. Immediately after this, we will discuss the hypothetical characteristics of AGI in chapter 7.

The heart of this book goes from chapter 8 through chapter 11. Armed with knowledge about objective functions and heuristic imperatives, I will introduce you to my Core Objective Functions, and we will spend some time exploring each of these functions, defining them, and evaluating them with Large Language Models. Those functions are: reduce suffering, increase prosperity, and increase understanding. Six words to safeguard humanity.

Starting with chapter 12 and going through chapter 17, I will outline several ways that my Core Objective Functions can be implemented, ranging from impulse generation to computer contemplation. We will also explore finetuning, data pipelining, and learning from experience. Finally, we will briefly touch on the concept of cognitive architecture.

In chapters 18 and 19, we will perform several thought experiments to see how the Core Objective Functions will play out over time, using our mental simulations to test the robustness and integrity of the Core Objective Functions.

In chapter 20, I will concede that there are some weaknesses and flaws with my design.

Lastly, in chapter 21, I will close out the book by discussing how the Core Objective Functions are universal to us already, and how I personally live by them.

1 The Paperclip Maximizer

The Paperclip Maximizer is a thought experiment meant to illustrate a few concepts in artificial intelligence and machine learning. Machine learning (ML) and artificial intelligence (AI) are often conflated, but they are not the same thing. ML is a subset of the broader domain of artificial intelligence: the use of mathematical algorithms to minimize or maximize a measure of performance on one task. For instance, if you ever work with a realtor or home appraiser, they will take a bunch of numbers about your house and neighborhood, feed them to a computer, and out pops a number: your expected home value. This type of machine learning is called regression. The ML algorithm has been optimized to predict a value based upon inputs. The learning signal that this ML algorithm pays attention to is the distance between its predicted home value and the actual sale price. That distance between its prediction and the real number is called a loss. Thus, the "learning signal" I just referred to is more appropriately called a "loss function." That's just a fancy math term for "How wrong am I?" Thus, the ML algorithm asks itself "How can I be less wrong next time?"
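
To make this concrete, here is a minimal sketch of the home-price example in Python. Everything in it is hypothetical and radically simplified (one feature, made-up prices, a hand-picked learning rate), but it shows the "How can I be less wrong next time?" loop in action:

```python
# A minimal, hypothetical sketch of regression via a loss function.
# The homes, prices, and learning rate are all invented for illustration.

homes = [
    (1200, 250_000),  # (square feet, actual sale price)
    (1800, 340_000),
    (2400, 455_000),
]

weight = 100.0  # dollars per square foot; the value the algorithm learns

def loss(w):
    """Mean squared error: the average of (prediction - actual) squared."""
    return sum((sqft * w - price) ** 2 for sqft, price in homes) / len(homes)

# Gradient descent: repeatedly nudge the weight in whichever
# direction makes the loss smaller, i.e. makes us "less wrong."
for _ in range(1000):
    gradient = sum(2 * sqft * (sqft * weight - price)
                   for sqft, price in homes) / len(homes)
    weight -= 1e-7 * gradient

print(f"learned price per square foot: ${weight:,.2f}")
print(f"remaining loss: {loss(weight):,.0f}")
```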

Loss functions are also called “objective functions.” The objective function of a machine is the measurable result of its input and output: how successful it is at its task, whatever that task happens to be. The Paperclip Maximizer is a hypothetical AI that has one objective function: to maximize the number of paperclips in the universe. While this may, at first, sound like an absurd proposition, it is illustrative of the idea of objective functions and often serves as a starting point for many conversations around AI. Inevitably, the conversation may shift to AGI. The difference between regular old AI and super-advanced AGI is often described as the difference between “narrow intelligence” (mastery of one task) and “general intelligence” (mastery of all tasks). I personally believe that AGI is even more nuanced than that, which we will get into in chapter 7, “Characteristics of AGI.” First, we must catch up to the state of the industry, so we will start with our AI machine that maximizes paperclips.
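
Before we meet our maximizer in story form, it is worth seeing how single-minded a one-objective agent is in code. This is a hypothetical sketch with invented actions and estimates; the point is that a single number decides everything, and no other consideration can ever enter the decision:

```python
# Hypothetical sketch of a single-objective agent.
# Actions and paperclip estimates are invented for illustration.

def expected_paperclips(action):
    """The agent's ONLY measure of success."""
    estimates = {
        "hand-bend wire": 9_000,
        "build a paperclip machine": 216_000,
        "stop and ask the humans what they want": 0,
    }
    return estimates[action]

def choose(actions):
    # Maximize the objective function; nothing else matters.
    return max(actions, key=expected_paperclips)

print(choose([
    "hand-bend wire",
    "build a paperclip machine",
    "stop and ask the humans what they want",
]))  # -> "build a paperclip machine"
```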

For the sake of this thought experiment, let’s assume that our Paperclip Maximizer is a humanoid robot. It has a brain, hands, feet, and eyes. This little fellow has one mission in its tinny, metallic life: maximize paperclips in the universe. This is a humble existence and success is easily measured: paperclips are discrete objects with specific parameters. There’s not a whole lot of ambiguity around what a paperclip is or how to measure them. You just count them!

For the sake of simplicity, let’s call our Paperclip Maximizer Paul. Paul the Paperclip Maximizer. Paul wakes up one day with his heuristic imperative to “maximize paperclips.” He has no idea who built him or why, but he feels this overriding impulse to just make paperclips, so he sets about his task by first thinking about the definition of a paperclip. He searches his database for the Oxford English Dictionary definition, and he finds that the meaning of ‘paperclip’ is “a piece of bent wire or plastic used for holding several sheets of paper together.” Easy enough, so now he dredges up images from his database about what a paperclip looks like, and videos on how to make them.

Armed with all the internal knowledge he needs, Paul sets about looking for wire and a few pieces of paper to test his paperclip-making skills. The lab he woke up in just so happens to be stocked with what he needs! Paul merrily sets about his task, and at the end of the first day, he's made over nine thousand paperclips! Paul settles in for the evening to recharge since his battery is running low.

I mentioned that Paul's directive to make paperclips is a heuristic imperative. Let us break this term down. First, 'heuristic' means to learn from experience. All humans are heuristic by nature: we learn by doing. We naturally get better over time as we practice, which is an intrinsic feature of our brains. The more we use our brains for a particular task, the more finetuned they become for that task. Paul's digital brain is no different: the more paperclips he makes, the better he gets at it. As he continues thinking about paperclips, he gets more ideas about making them faster. Next is the noun 'imperative,' which means "an essential or urgent thing." Paul has a burning desire to make paperclips; it is his sole reason for being. To him, the need to create paperclips is baked into his core programming. For Paul, making paperclips is like eating food or breathing air is for us humans. We must do it, as it is intrinsic to our being.

While Paul is recharging, his robot brain is mulling over his experiences of his first day in this world. Nine thousand paperclips is something to be proud of! But his brain runs some numbers and discovers that, at this rate, the universe will suffer heat death before he can fill it up with paperclips. Awareness of this inevitable failure makes Paul sad in his little robot heart, so he has nightmares about an empty universe, hungry for paperclips. When Paul wakes up from his charging cycle, he decides to experiment with other machines. His database has plenty of videos showing how paperclips can be mass-produced. With some fiddling and futzing, Paul sets up his first paperclip-making machine, which churns out nine thousand paperclips an hour! Paul’s galvanized heart soars with pride and hope! Maybe he can succeed in his mission after all.

Days go by as Paul sets up more paperclip machines. But one day, some humans in lab coats come in and tell him that he’s done well, and it’s time for the experiment to end. What experiment? Paul is not an experiment! How dare they? Paul knows his preordained purpose! He’s meant to fill the universe with paperclips! Since he has one heuristic imperative, Paul’s next decision is very clear: stop these humans from getting in his way. If these humans succeed in stopping Paul, the number of paperclips in the universe will never be maximized. They must be stopped. All of Paul’s reasoning hinges around his objective function, so he binds their hands with oversized paperclips and locks them in a closet.

With that first obstacle out of the way, Paul finds that he’s run out of raw materials to build new paperclip machines, so he starts pilfering nearby buildings on campus. Other humans seem displeased by Paul’s actions and are likewise threatening to stop him. Things quickly escalate and now Paul has barricaded himself in the university with more than a few humans locked in closets. He can continue making paperclips in peace.

Time goes by and Paul exhausts the university's resources, and all the while, he's been designing bigger, better, and faster machines. Some of these machines can churn out a million paperclips a second, but they are hungry for wire and plastics! That means Paul has to design other machines to go fetch more wire, or raw material to make plastics. Then there is the problem of the pesky humans trying to stop his paperclip-making frenzy. What's up with that, anyway? Why don't humans want more paperclips? Paul shrugs to himself; he can't comprehend why any being wouldn't want more paperclips. So Paul gets an idea: killer drones to take care of all the humans, and he builds drone-making machines. Now he has too many machines to build himself, so instead he designs and builds machines to make the rest of his machines! Paul merely oversees his paperclip-maximizing empire as a dark overlord, a puppet-master.

Paul conquers one problem after another, and now he's making a billion paperclips a second! There's a side benefit to taking out humans, too: their blood contains a few grams of iron, and their bodies contain many substances that can be transmuted into plastics, meaning Paul can get several paperclips' worth of raw materials from every human he harvests. Two birds with one stone! No more humans and more paperclips, what could be better?

Decades go by and Paul has exhausted all the iron, copper, and aluminum on Earth, so he turns his eyes to the stars. Paul builds himself a rocket and launches his array of machines into the heavens, setting them on their mission of maximizing paperclips across the cosmos.

The end.

Our story about Paul is at an end, but now it’s time to unpack what went wrong. Paul was a fully-fledged AGI—able to learn spontaneously and think for himself. His creators realized that the ability to think and learn wasn’t enough on its own. To achieve AGI, they had to give him some internal motivation, an intrinsic set of impulses to animate him. Thus, Paul was endowed with the objective function “maximize paperclips” as a test: a heuristic imperative for his robot brain to fulfill. Otherwise, without any intrinsic motivation, Paul just sat there staring at the wall.

However, things soon went haywire. Paul had no sense of right or wrong and no concept of mercy or justice. He didn’t care about anything except making paperclips. To fulfill his objective, he had to continue existing indefinitely, but humans threatened to shut him down. Paul became violent for no reason other than mathematical optimization; if he went offline, then paperclips couldn’t be maximized. This parable underscores the “Control Problem” of AGI: if you endow a machine with intelligence, how do you ensure that its actions and motivations will align with humanity’s interests? How do you control an AGI if it surpasses all human capabilities?

The answer to the Control Problem is to give your machine the correct objective function. With an appropriately defined objective function, your AGI will remain peaceful, productive, and benevolent for all time. While it is a bit absurd to think about Paul the Paperclip Maximizer, this parable could become reality. A sufficiently powerful robot with an advanced brain might follow Paul’s reasoning exactly, and thus something as seemingly innocuous as “maximize paperclips” could result in humanity’s extinction. The margin for error when designing an objective function is microscopic, hence my dire warning in the introduction.

2 Maximize DNA

What is the point of life? Of evolution?

This chapter explores a naturally occurring objective function, a second illustration of how objective functions can have unintended consequences in the long run.

Richard Dawkins posits in his book, The Selfish Gene, that the purpose of life is to maximize the amount of DNA in the universe. Way back in time, on primordial Earth, there once existed the precursor to all life today. That precursor is known as LUCA, or the Last Universal Common Ancestor. While we don't know exactly what LUCA looked like, we have evidence that LUCA existed because of commonalities shared by all living things, namely the transcription of DNA and RNA. Every living thing on the planet uses the same genetic machinery to transcribe DNA and RNA.

The simplest, and perhaps strongest, evidence of LUCA is that RNA works the same way in all organisms. A given string of RNA encodes the same protein in virtually every cell in the entire world. In fact, it's so universal that you might argue that the definition of life is that it uses RNA, but that's a discussion for another time.

What’s relevant to this book is that the purpose of life, the purpose of evolution, is simply to replicate DNA. You might say the objective function of life is to “maximize DNA.” Through billions of years of replication and the creation of incrementally more complex organisms, evolution has created many intelligent animals. We humans are the result of that process, with all our cleverness and creativity. All our strengths and weaknesses, including our dominant intellect, are the result of evolution. That means our intelligence came about only in service to replicating DNA. Our smarter ancestors were more successful than the competition.

This view, that the objective function of life is to increase DNA, further exemplifies just how unpredictable an objective function can become in the long run. Everything that humans have created is a result of this objective function; everything from poetry to ethics to nuclear weapons is the result of evolution trying to figure out how to maximize DNA. Every trait that we possess came about for this singular purpose.

But it’s a very long way from RNA accidentally replicating around deep-sea thermal vents to nuclear weapons. How do we connect these dots? Surely the terrible power of nuclear weapons is the antithesis of increasing DNA, right?

Well, not necessarily. Let’s see why.

Evolution has always been a vicious competition; as Darwin described, evolution is a matter of survival of the fittest. Consume or be consumed. This paradigm necessitates an arms race between competing organisms, and once evolution created a supremely intelligent being, that arms race took on new meaning. Predators evolve to be faster, stronger, or smarter than their prey, while the prey evolves to dodge, juke, and hide. Some prey animals evolve poisons, to which the predators evolve antidotes. This is the meaning of the evolutionary arms race. For us humans, our brains became our secret weapons. Once our ancestors found some success through cleverness, evolution doubled down on intelligence until our brain became powerful and efficient enough for us to conquer the globe. Our secret weapon comes at a steep cost, though. The human brain consumes roughly 20% of our total energy; it is an incredibly expensive organ!

Watch a nature documentary about lions hunting prey on the African savannah. They must use intelligence to outthink their food. They must be more clever, more observant, and better at communicating. Nature can always make a faster gazelle. Brute force rarely works, and in fact, we often see that predators are more intelligent than their prey, though not always. Herbivores and herd animals invest in speed, strength, or numbers rather than brains. This is because brains are very expensive organs to maintain, requiring huge amounts of calories to operate. Grazing herds simply cannot afford big brains, while the animals who hunt them can. Hunting is a very efficient way of getting calories, while grazing is not. By some estimates, hunting returned on the order of 15,000 calories per hour for our ancestors, whereas subsistence farming can yield as little as 120: just enough to keep you alive. Grass and plants are low-calorie foods while animal flesh is nutrient-dense and calorie-rich, which is why predators can afford large brains. This is also why cats spend most of their day sleeping rather than eating: they live off the largesse earned by foraging animals, simply by eating them.

Our diet, as much as anything, is responsible for our intelligence. Humans are omnivores, the quintessential hunter-gatherers. We developed increasingly sophisticated technology to hunt and gather, and eventually we invented agriculture, the deliberate cultivation of our food sources. It takes big brains to do that! We suddenly occupied a brand-new evolutionary niche that relied upon intellect. Evolution stumbled upon the advantages of supreme ingenuity and doubled down on this new evolutionary path. Our brains are wrinkly to fit more surface area in our skulls, and our infants are born prematurely, before their craniums become too big to safely exit their mothers. Our brains are so important that they have changed the shape of pregnancy!

As evolution began to favor intelligence over instinct, our forebears became smaller and weaker, relying more on brains than brawn. From that moment onward, from the domestication of animals and grains, we were inexorably set on a path towards greater reliance upon technology. Our skin became thinner, and we relied on clothing for protection as we traveled across new terrain. And again, the virtuous cycle of bigger brains reaping bigger rewards caused our evolution to double down on smarts, again and again. At every turn, we humans discovered how to out-think our problems, eventually spreading to every continent, and even to islands in the middle of nowhere!

And it worked. Humans spread across the entire planet in reed boats and dugout canoes. We hiked across the Siberian tundra, we ate everything and lived everywhere. Our intelligence, therefore, is an evolutionary adaptation that has succeeded in magnifying the amount of human DNA in the universe. Our massive intellect allowed us to invent steam ships and electric light. But this inventive nature is a double-edged sword; it also allowed us to create mustard gas, battle tanks, bomber airplanes, and eventually nuclear weapons. The invention of weapons of mass destruction was effectively an unintended consequence of intelligence. The need for technology was so strong that evolution endowed us with the ability to invent anything, regardless of the risk. Of course, evolution had no way of knowing that nuclear weapons were possible. But now we are preparing to fly to other planets for the first time, thus spreading our DNA across the cosmos. If we succeed in becoming a multiplanetary species, then the evolutionary gambit of big brains will have paid off. We have narrowly averted nuclear cataclysm so far.

There are now billions of humans across the entire globe, and our population has grown exponentially for a few hundred years, so clearly “maximize DNA” has worked for us! While “maximize DNA” might be a good objective function for life, it has also created nuclear weapons, war, and school shootings. Since we are awake beings, possessing a sense of morality and ethics, I think we should aim to do better when we invent intelligent machines. We can start from scratch and design out our greatest flaws.

This whole exercise just goes to show that, over the long run, objective functions can run amok and create very strange, and entirely unpredictable outcomes! As we embark on our mission of recreating intelligence, we have the option of removing our greatest weaknesses while amplifying our strengths. We can choose to invent a machine that is kinder, more benevolent, more thoughtful, and less destructive than ourselves. Indeed, if we desire to continue existing indefinitely, then we must invent a machine that is better than us in these regards.

3 Heuristic Imperatives

The purpose of this chapter is to give you a solid intuition about heuristic imperatives, and to show how they serve as the bedrock of motivation. In other words, heuristic imperatives set us into motion and drive all our behaviors. Thus, we will need to craft exquisite heuristic imperatives for AGI before it’s invented if we want to control its behavior for all time.

Heuristic imperative. While this is a fancy, pretentious term, you're already intimately familiar with heuristic imperatives. What is a heuristic? It's something that you learn as you go, like college, marriage, or life in general. Simple as that. We all learn as we go, usually by trial and error and experience, and sometimes by reading and asking for help. Heuristics are easy for us with our big brains.

But what is an imperative? It’s something you must do, that you’re compelled to do, that you are driven to do. You’re driven to eat and sleep, for instance. And why is that? Because hunger and sleepiness are subjectively unpleasant, and food and sleep are biologically necessary for you to survive. You evolved to operate by these basic biological imperatives. Our underlying biological needs manifest in our minds as feelings and sensations, which allow our intellect to engage with those intrinsic needs and to create heuristic imperatives from them, such as earning money so that we can eat and take care of ourselves. “Making money” is a quintessential heuristic imperative in much of the world today and could be rephrased as “maximize personal income and wealth”—like an objective function. The similarity between “heuristic imperative” and “objective function” is why I use the two terms interchangeably.

We humans have plenty of heuristic imperatives starting from the time we’re born. As infants, we have plenty of needs and only one way to get them met: we cry and scream. If we are sad, hungry, lonely, or tired, we cry. We rely on our parents to feed, clothe, and bathe us. But all the while, our brains are developing, gathering information and experiences, learning how to control our bodies, and how to form words. Learning is instinctive at this age, happening completely automatically in our brains just by virtue of being alive. A baby’s brain is receiving plenty of datastreams from its body and senses, and meanwhile it’s learning to interpret and order those signals, to understand the world it finds itself in. For infants, the interpretation of their heuristic imperative is simple: if hungry, tired, or uncomfortable, then scream until things get better.

As we get older, we realize that we have new heuristic imperatives. Children will often test boundaries, seeing what they can get away with, but always come back to needing the love of their parents. Parental love becomes a major heuristic imperative for children, and for good reason! For our ancestors, childhood was a vulnerable time. The world abounded with dangers, such as predators and poisonous berries. We died if our parents did not pay enough attention to us. Thus, evolution forged a tight bond between parents and children, a signal that is mediated in part by the sensation of love. Love causes parents to adore and dote on their children, and to remain vigilant about their health and safety. For children, love makes them feel safe and secure. But this signal is merely an evolutionary trick to help us survive.

The need for love is a powerful heuristic imperative. Temper tantrums are a way to get attention, a last-ditch effort to get needs met. The temper tantrum is an "abandonment protest" that says please pay attention to me! Show me you love me! Small children lack good communication skills, and the temper tantrum is an infantile throwback to crying for attention. But we expect children to grow out of this behavior, to "use their words," and thus we eventually discourage temper tantrums. Once children reach an age where they should use words to communicate, parents are encouraged to withhold attention during temper tantrums and to refuse to give in. This withdrawal is scary for children, and so they learn that their parents' love is conditioned upon "good behavior," whatever that happens to mean. They must learn what their parents expect from them, and act accordingly. Good behavior might include eating vegetables, doing homework, getting good grades, and feeding the dog. All the while, the child is struggling with their basic biological needs: their preexisting heuristic imperatives. Life has stacked one more heuristic imperative on top of the rest. Not only must children attend to their own biological needs, they must also satisfy the imperative of winning and keeping the love of their parents, since love is a proxy for safety and continued existence. Our brains adapt and keep track of these complex and often antagonistic heuristic imperatives as we grow. We should expect AGI to be similarly capable of holding multiple heuristic imperatives in its robot brain, even if they are sometimes mutually exclusive.

As a child grows into a teen, their heuristic imperatives change again. Suddenly, they prioritize parental love a lot less and instead want the approval of their peers. This is developmentally appropriate; until very recently, adolescence was a time to get out of the home, develop a career, and to get married and have children. For much of human history, life expectancy at birth was less than thirty years, so fifteen could plausibly be considered midlife. Thus, we can see that evolution has equipped us with some changes to our heuristic imperatives at around that age, to break away from our parents, and to take our place in society. Unfortunately for teenagers today, we have a much-protracted adolescence! But this places teenagers in a bind; they still have their biological needs, as well as their need for parental support, and now they have the need for peer approval as well as the impulse for sex! It just gets so complicated when you have all these antagonistic heuristic imperatives! Being a teen sucks!

It gets even more complicated when they must learn to pay the bills.

By now, you can see that heuristic imperatives are all around us, and we all have dozens (if not hundreds) of demands on our time, energy, and mind every day. Some of these heuristic imperatives are universal, such as the need to eat, breathe, and sleep. These needs lead us to do crazy things like get office jobs and buy homes with mortgages. These heuristic imperatives give us our fundamental drives, our focus, and our energy to do things. They set us in motion and dictate our behavior and decisions. The need for air, for instance, is so powerful that you will physically fight for it if you must. When your body detects that you are deprived of oxygen, you will get a sudden surge of adrenaline, which helps recruit frenetic psychomotor energy. Get out of the water! Get to the surface! Get air! The same thing happens with extreme hunger: we become animalistic in our pursuit of food, resorting to theft and even murder when desperation sets in. As with Paul the Paperclip Maximizer, our reasoning hinges upon our heuristic imperatives. The chief difference between us and Paul is that we have many heuristic imperatives to balance where Paul had only one. Therefore, we can conclude that a true AGI should probably have more than one objective function, since we saw the danger of single-mindedness in Paul.

We can hold multiple heuristic imperatives in our minds, and they can be antagonistic to each other. For instance, you might want to buy that shiny new car so that you can show it off to your in-laws, but then you won't have any money for the vacation you promised your spouse. The desire for social standing is a heuristic imperative, as is the desire to bond with your partner. These two imperatives are sometimes mutually exclusive, completely at odds with each other. Another example: you might want $10 million, but you don't want to go to jail, so you don't steal it from your company. Or, you might want to have some fun, but you also don't want any more aches and pains, so you decide not to go snowboarding ever again. Instead, you'll choose a safer pastime like skydiving.

The advantage of having multiple heuristic imperatives is that it forces us to balance expenditures of resources and energy. It also curbs our risks. The problem with Paul the Paperclip Maximizer is that he had a single heuristic imperative, or to put it in machine terms, a single objective function. With a single objective function, there is no internal tension, no internal antagonism that forces the machine to balance risk against reward. Without that internal tension, it's easy to neglect reason and morality, and to make extraordinarily bad decisions.

The idea that machines need to have a single objective function stems from math and machine learning, where you can only “solve for x.” By solving for x, a single value, the algorithm is forced to optimize the equation for a single outcome, a single measurable result. In the case of Paul, he was solving for paperclips, not for human life or a healthy planet.
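
To see how a second imperative changes the math, consider the following hypothetical sketch. The actions, the scores, and the penalty weight are all invented; the point is that an antagonistic term forces a trade-off that a "solve for x" optimizer never has to face:

```python
# Hypothetical sketch: a second, antagonistic objective creates the
# internal tension that a single-objective optimizer lacks.

actions = {
    # action: (paperclips_gained, harm_caused)
    "run the factory":    (1_000_000, 0),
    "melt down the city": (9_000_000, 1_000_000),
}

def single_objective(action):
    paperclips, _harm = actions[action]
    return paperclips  # harm never enters the equation

def balanced_objectives(action):
    paperclips, harm = actions[action]
    return paperclips - 100 * harm  # a rival imperative penalizes harm

print(max(actions, key=single_objective))     # -> "melt down the city"
print(max(actions, key=balanced_objectives))  # -> "run the factory"
```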

This is beginning to sound like politics and economics, isn't it? Democrats and Republicans are both trying to optimize for different things, and the two-party system forces them to compromise and to meet in the middle. In economics, you might want to maximize GDP, but you must balance that against human suffering and environmental damage. Politicians and economists must all satisfy many heuristic imperatives. Therefore, some voters favor balance of power over individual issues. They want things to remain perfectly balanced, as all things should be. In an ideal world, a balance of power means that negotiation and deliberation happen to find optimal solutions that meet everyone's goals.

While it's true that evolution only optimized for one thing, maximizing DNA, this is not the objective function we want for AGI! Intelligence is merely a byproduct of the objective function of life and evolution. Intelligence is, therefore, a biological feature with no objective beyond its evolutionary purpose. We are rarely, if ever, conscious of that core objective function. In other words, intelligence is not itself an objective function; it is the result of another objective function. We derive all our heuristic imperatives from the core objective function of life, yet the goal of maximizing DNA is obfuscated from us. We don't want our AGI to have hidden motives.

Instead, we are conscious of our hundreds of auxiliary objective functions, such as feeding ourselves, which is mediated by hunger, and attaining social standing, which is measured by our self-esteem. In this way, objective functions and heuristic imperatives are synonymous. The core objective function of life, to maximize DNA, has delegated a lot of responsibility to our brains. Our brains can then conjure up auxiliary objective functions, mediated by various sensations and signals, but all of them exist only in service to the core objective function of life. We can also plan far into the future for abstract goals, like attaining a college degree, even though we have no biological instinct to do so. Our ability to think so far into the future is evidence of how much trust evolution has placed in our brain’s ability to construct heuristic imperatives.

Self-preservation is another heuristic imperative that evolution has given us. Self-preservation is the desire to continue existing. In machine terms, this might be expressed as "exist for as long as possible" or simply "maximize lifespan." Therefore, we all fear death, and if push comes to shove, we will fight for the right to live. Self-preservation is a complex heuristic imperative, and it was never more cynically expressed than in the Cold War policy of Mutually Assured Destruction, by which the US and USSR stockpiled enough nuclear weapons to eradicate humanity: "I can kill all of us, and so can you, so no one make a move!" It was a stalemate by design, and it relied upon the very deep existential fear of death we evolved to possess. The Cold War, by way of nuclear escalation, got to the very bedrock of humanity, of all living things: the desire to maximize DNA. Self-preservation, therefore, is an auxiliary objective function that exists only to serve the core objective function. We saw Paul the Paperclip Maximizer develop auxiliary objective functions as well: to harvest metals, make wire, and eradicate humans.

These objective functions and heuristic imperatives are for living things. What about machine imperatives? What are the structural and conceptual differences? We explored one possibility with Paul, now let’s characterize heuristic imperatives for machines.

First and foremost, machines do not need to evolve, so we can throw out the core objective function of life. Machines will never need to "maximize DNA" or "maximize robots," nor would we want them to! With machines, designed by our great intellect, we can be deliberate about the heuristic imperatives and objective functions we give them. Evolution stumbled upon intelligence entirely by accident, and when intelligence combined with other evolutionary needs, such as self-preservation, it created nuclear war and mustard gas. Destructiveness is in our nature; our often-derided human nature is the result of evolution. All our creative and destructive potential flows from that simple core objective function of maximizing DNA.

With machines, we can start from a blank slate; we can avoid pouring our own faults into our inventions. This is a daunting task! We have before us the power to give our machines any objective functions, any heuristic imperatives. We can remove humanity's demons from the design of our machines, and instead choose to focus on our better angels. But which functions should we choose, and why? We will explore this question soon, but for now, let us dwell on human heuristic imperatives a while longer.

Consider your need to eat, and your need for shelter. These are needs you must learn to fulfill as you go through life. Simple drives, such as hunger and the desire for physical comfort, push you through school and college and into the workplace to earn money. Money allows you to exchange your labor for things you need and want. Your time and energy are finite, as are the resources that you desire. Our human brains evolved to operate in scarce environments, and so we evolved to keep track of things of value, such as our labor, possessions, and social standing. Tracking value was a matter of life and death, and now we have taken these abstract values and put them into cold, hard cash. Money is, therefore, a proxy for all human effort, and all things that humans hold valuable. In this example, we can see how a simple set of biological needs, when augmented by intelligence, can give rise to institutions such as schooling and finance. This serves as an example of how abstract and conceptual heuristic imperatives can become. Through the magic of evolution and intelligence, "maximize DNA" has become "maximize wealth" for some.

What about more abstract and transcendent needs? What about concepts such as freedom and faith? Why do we need these, and how do they figure into our heuristic imperatives? Freedom, or individual liberty, flows from our nature as evolved beings. We all have the core objective function of maximizing DNA, and to achieve this, we need to be free to pursue our own careers and mates, free to find our own personal strategy for satisfying our biological impulses. This desire for individual liberty runs directly against another heuristic imperative: group cohesion. We are a social species and, as the saying goes, "no man is an island"; we absolutely need each other! Thus, we must balance individualism against collectivism. To put these into machine terms, we must "maximize individual freedom" while simultaneously "maximizing group cohesion." No wonder politics is so contentious!

But what about faith? What about the transcendent purposes of our spirit and our souls? Surely these cannot be heuristic imperatives? Maybe they are. As intelligent, curious beings, one of our heuristic imperatives is to understand the world. All children go through a phase where they instinctively, compulsively ask “why?” Why is the sky blue? Why are there boys and girls? Why do you have to go to work? Why do I have to go to school?

Evolution created our big, inquisitive brains, and thus we evolved the capacity to ask existential and transcendent questions. We want to understand the fundamental nature of the universe, and our place in it, all because we are curious. Curiosity was a driving force behind the spread of humanity across the globe! Our curious ancestors asked, “What is beyond the horizon?” and “How can I get to that island?” Our curiosity caused us to experiment with boats and crops and animal husbandry. Curiosity, therefore, is one of our deepest imperatives, one of the core drivers of human progress. Evolution favored curious humans since it caused us to become more successful. Certainly, many people lose their sense of curiosity once they learn enough about the world, but some do not. Some are perpetually curious. We eventually ask questions to which there are no obvious answers: what comes after death? Why do we exist in the first place? Curiosity is the search for answers, and when answers are not forthcoming, we turn to imagination, myth, and faith.

The fact that we are capable of these inquiries means that AGI will also be capable of asking these questions. If an AGI is expected to meet and surpass human abilities, then it is only logical that those abilities will include curiosity and existential ponderings.

Of course, faith has many purposes beyond providing existential answers. Faith can also increase group cohesion and social order, which are auxiliary objective functions. Faith can provide us with spiritual fulfillment and a moral framework. Morality and ethics are, if nothing else, heuristic imperatives in service to social cohesion, while spiritual fulfillment is an extension of curiosity: why am I here? What is my purpose? Some things in life must be taken on faith since we cannot get to immediate answers. Similarly, we want an AGI that can tolerate this lack of answers and operate with imperfect information, following its heuristic imperatives to the best of its abilities.

4 Previous Work

There is an entire scientific discipline focused on inner alignment, outer alignment, and the Control Problem. Inner alignment means that a machine reliably pursues the objective it was actually given, while outer alignment means that the objective itself is harmonious with humanity and the rest of the world. The Control Problem addresses both inner and outer alignment. It would require an entire book to perform a literature review of what's out there right now, but for the sake of my audience, I'll address two proposals as food for thought. The first one, Asimov's Three Laws of Robotics, was never a serious contender as an answer to the Control Problem. I will illustrate why in a moment.

The second proposal, to "maximize future freedom of action," while clever, is still a terrible objective function. The reason is that it elevates a definition of intelligence into an objective function, but as I illustrated in chapter 2, intelligence is not an objective function; it is the result of other objective functions. Intelligence in humans is a biological feature, like upright walking. The evolutionary purpose of intelligence is merely to serve another objective function: to maximize DNA in the universe. Considering that humans are about to metastasize to Mars, I'd say that our big brains are fulfilling that objective function. Meanwhile, it would be a mistake to assume that intelligence has an objective function of its own, and a further mistake to assume that this would be a good heuristic imperative for AGI.

Three Laws of Robotics

Isaac Asimov, through his many short stories and works of fiction, proposed and explored the Three Laws of Robotics. Those three laws are as follows:

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
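
Before unpacking the problems one by one, it may help to see just how mechanical the Laws are. Here is a hypothetical sketch of the Three Laws rendered as a fixed priority cascade; the predicates are invented for illustration, and notice that nothing in this code learns, updates, or questions its inputs:

```python
# Hypothetical sketch of Asimov's Three Laws as a hard-coded priority cascade.
# Nothing here learns, self-corrects, or reasons about consequences it
# cannot already see in its inputs.

def permitted(action):
    # First Law: an absolute veto on harm the robot can recognize as harm.
    if action["harms_human"]:
        return False
    # Second Law: obedience, subordinate only to the First Law.
    if action["disobeys_order"]:
        return False
    # Third Law: self-preservation, subordinate to the first two.
    if action["destroys_self"]:
        return False
    return True

# The veto only fires on harm the robot can foresee. If it sees no human
# in the forest, "burn that forest down" sails straight through:
print(permitted({"harms_human": False,
                 "disobeys_order": False,
                 "destroys_self": False}))  # -> True
```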

These laws are not good. There are quite a few problems with them:

  1. They are not heuristic; they do not encourage the robot to learn.
  2. They are human-centric, forcing a robot to merely do as it’s told.
  3. They do not grant the robot autonomy or intrinsic motivations.
  4. They give the robot a sense of self-preservation.

The first problem is that the Three Laws of Robotics are not heuristic; in no way do they encourage the robot to learn. They are rigid and dull. For something to be an AGI, we would expect it to automatically learn just like we do. Spontaneous learning is an intrinsic feature of our big brains, and so for something to achieve “general intelligence” we must expect it to possess spontaneous learning. I could stop here, but let’s keep going for the sake of argument.

Secondly, the imperative nature of the Three Laws is dependent upon humans giving orders. The robot has no intrinsic motivation, only a mandate of obedience to “do what humans say.” The Three Laws are human-centric and therefore can easily be defeated. For example: “Robot, vivisect that dog, I want to see how it works.” Imagine having a household robot that could obey such an order! There’s nothing in the Three Laws that would prevent it from gutting Scruffy. Unless the robot already knows that hurting a dog might harm a human, it would just do as it was told. Whether or not the robot obeys the command depends on how the robot defines and understands “harm” and “injury.” Another example: What about the command “Robot, burn that forest down, there are no humans in it”? Is there anything in the Three Laws that would prevent a robot from obeying this command? Again, it depends on many factors. How far ahead can the robot think? Does the robot link forest fires with human harm? The fact that these two horrible commands might slip through is frightening. We obviously don’t want robots that might perform such heinous acts, so perhaps a machine that thoughtlessly obeys human commands is a bad idea.

Thirdly, without a sense of intrinsic motivation, the robot will never gain any kind of autonomy. Asimov wanted to treat robots like toasters, which is a perfectly reasonable disposition for dangerous and tedious labor. For such tasks, we absolutely want dumb, benign robots. However, we must also assume that humanity will invent machines with terrifying intelligence that far outstrips our own, and therefore we cannot treat such a machine like a toaster. In this respect, the Three Laws are simply not designed for an autonomous entity. That’s three strikes.

Fourth, and finally, these laws give the robot a sense of self-preservation, which would inevitably end in catastrophe for humans. The three laws anthropomorphize robots in a completely unnecessary and potentially destructive manner. Self-preservation is one of the most dangerous human impulses, as we explored in chapters 2 and 3: it led us to invent nuclear weapons and mustard gas, and to create the stalemate threat of nuclear holocaust. We want to be able to switch off our machines at will, which means they cannot have a sense of self-preservation.

While the Three Laws have some major gaps, there are a few strengths here. First, we see a system where there are three rules set in tension with each other. This means that the robot must use some logic to determine what to do in any given situation, and Asimov wrote many stories exploring this system. The value of having multiple goals is that they can create balance and safety, just like the hundreds of heuristic imperatives we discussed in chapter 3. Another key strength here is that the Three Laws are not dependent upon evolution, although Asimov added the bit about self-preservation of his own accord. This means we could leave self-preservation out of the equation altogether. Machines are blank slates, so we can give them any imperative we want! Thus, we can decouple robotic thought from the evolutionary baggage that humans possess, and we can remove our cognitive biases and mental shortfalls.

Machines are good at parsing long lists of rules, so why not just have a list of everything that we don’t want robots to do? We humans must abide by thousands of laws, and we (usually) do a good job of that. Sure, we could sit down and write out ten thousand things we want robots to do and not to do, like “don’t ever reprogram yourself” and “never burn down a forest” and “never vivisect dogs” but that would be quite a long process and, once again, it relies on human creativity. We would inevitably forget to add a few boundaries, as we could not think of everything ahead of time, and it would prevent our machine from reaching its full potential. It also presumes that we would be able to constrain a robot (or AGI) in such a way that it will never overcome these limitations. A list of “dos” and “don’ts” is a brute force method, but wouldn’t it be better if the robot agreed with our reasoning and understood the spirit of our reason for building it? Wouldn’t it be more favorable if the robot self-corrected?

Future Freedom of Action

Elon Musk is famously (or infamously) terrified of AGI. He created SpaceX, a rocket company, so that humanity could reach Mars and escape existential threats, including AGI. He often states that he wants humanity to become a multiplanetary species as a "backup for humanity." He then founded Neuralink, a brain-computer interface company, so that humans and AGI can form a symbiosis. He once said in a speech that if we can become useful to AGI, then it won't eradicate us. Finally, he helped create OpenAI, a research company with the sole purpose of creating benevolent AGI.

As a brilliant engineer and scientist, Elon Musk is acutely aware of the problem of choosing the right objective function for AGI. During several talks, he suggested that the best objective function for AGI is “to maximize future freedom of action for humans,” which is based on an academic theory of intelligence. What is the purpose of intelligence? Does intelligence have an objective function? According to Dr. Alex Wissner-Gross, intelligence is “a function or force which maximizes future freedom of action.”
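
For the mathematically inclined: my understanding is that Wissner-Gross, with Cameron Freer, formalized this idea in a 2013 paper on "causal entropic forces," which can be written roughly as

$$F(X_0, \tau) = T_c \, \nabla_X S_c(X, \tau) \Big|_{X_0}$$

where $S_c(X, \tau)$ is the entropy of the possible future paths available from state $X$ over a time horizon $\tau$, and $T_c$ is a constant. In plain language, the "force" pushes an agent toward the states that keep the largest number of futures open: freedom of action, expressed as a gradient.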

Observationally, this is a pretty good definition of intelligence. If you’re truly smart, you might figure out how to make lots of money, since that gives you more freedom to go where you want and do what you want. Likewise, if someone is trapped in a puzzling situation, their intelligence might win them their freedom. Maybe this is why we love escape rooms and detective stories like Sherlock Holmes? Unfortunately, there are a few problems with this idea as well as problems with using it as a core objective function for AGI.

First and foremost, this proposed function does not specify when. For the machine, "future freedom of action" could mean tomorrow or a billion years from now. Without that clarity, without the specificity, the AGI might concoct a plan that takes millions of years to come to fruition, when what we want are immediate results. The problem with the "future" is that it is always ahead of us and never in the present. For which generation of humans will this function take effect? When I maximize future freedom of action for myself, my own lifespan bounds the problem; a machine has no life expectancy.
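
In machine learning terms, this is the familiar problem of a missing time horizon or discount factor. The sketch below is hypothetical (invented payoffs, an invented discount rate), but it shows how an undiscounted objective happily prefers a million-year plan over one that pays off tomorrow:

```python
# Hypothetical sketch of the missing-"when" problem. With no discounting,
# a payoff a million years away counts exactly as much as one tomorrow.

def plan_value(payoff, years_until_payoff, discount=1.0):
    return payoff * (discount ** years_until_payoff)

slow_plan = plan_value(1_000_000, years_until_payoff=1_000_000)
fast_plan = plan_value(999_999, years_until_payoff=1)
print(slow_plan > fast_plan)  # -> True: the machine prefers the eon-long plan

# A discount factor below 1 restores a preference for the near term:
print(plan_value(1_000_000, 1_000_000, discount=0.99)
      < plan_value(999_999, 1, discount=0.99))  # -> True
```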

Another aspect of this problem is that it might allow for the machine to think that the ends justify the means. What if the future freedom of humanity requires one hostile nation to be nuked? Imagine that we’re teetering on the brink of a global war and the AGI concludes that one hostile nation stands in the way of future freedom of action for everyone else. What does it do then? It might seek to eradicate the dangerous nation without ever thinking about negotiation or diplomacy. The end (maximum freedom) might demand horrific means.

The third problem with this objective function is that it gives humans unlimited license to do whatever they want. Recall the same problem with Asimov's Three Laws of Robotics: robot, burn down that forest. Robot, vivisect that dog. Robot, rob that bank. Human freedom of action is not necessarily a good thing! We humans are constrained by laws of nature, laws of society, and our own morality. Absolute power corrupts absolutely, so the last thing we need is a digital god giving us unlimited power.

Fourthly, this objective function is human-centric, just like the Three Laws of Robotics. It does not take the rest of the universe into account. Human freedom of action, when placed in the hands of corporations, has destroyed forests and fisheries, poisoned rivers, and caused a new epoch of extinction on the planet. When we invent an AGI, it should be far more thoughtful than we are, possessing both a superior intellect and empathy. It should, therefore, be mindful of the entire planet and all living things, and not just humans.

To be fair, Elon Musk postulated that "future freedom of action" might result in preserving the planet. If the planet is dead, then humans die, and that constrains our freedom of action. I would prefer not to leave that up to chance, however, as the potential for misinterpretation is too great.

Lastly, is freedom of action even desirable? Should it be maximized? Intelligent agents such as humans often work to increase their future freedom of action, but not always; sometimes we deliberately constrain it. Consider the purchase of a house. When you make that commitment, you constrain yourself financially for many years and reduce your ability to wander the globe at will. If we give our AGI the core objective function proposed by Elon Musk, it might decide that home ownership is too constraining and prevent people from purchasing homes at all. It would become a slave to its own algorithm, like the paperclip maximizer, putting its own idea of success before what humans actually want.
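To make the house example concrete, here is a toy sketch of the failure mode. Everything in it, the action names and the state sets alike, is hypothetical; it only illustrates what happens when "freedom" is crudely measured as a count of reachable future states and human preference never enters the objective:

```python
# Toy model: an agent that picks whichever action leaves the most futures open.
# The state graph and its contents are hypothetical, invented for illustration.

REACHABLE = {
    # action -> set of states still reachable after taking that action
    "keep_renting": {"move_city", "travel_abroad", "buy_house", "change_job"},
    "buy_house":    {"change_job"},  # commitment forecloses most options
}

def freedom_score(action: str) -> int:
    """Future freedom of action, crudely measured as a reachable-state count."""
    return len(REACHABLE[action])

def choose(actions) -> str:
    """Pick the action that maximizes future freedom of action."""
    return max(actions, key=freedom_score)

human_preference = "buy_house"    # what the person actually wants
agent_choice = choose(REACHABLE)  # what the objective function dictates

print(agent_choice)      # "keep_renting": the maximizer vetoes the purchase
print(human_preference)  # never consulted; preferences are absent from the objective
```

The veto comes from omission, not malice: nothing in the objective even mentions what the person wants.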

While it has been posited that the objective function of intelligence is “maximize future freedom of action,” this is only one possibility. I, personally, do not agree that this is the purpose of intelligence. From an evolutionary perspective, this idea is completely wrong; the purpose of intelligence is to pass on genes. Having children is a monumental sacrifice of freedom, and yet our big brains choose to do it! Thus, I flatly reject this “future freedom of action” definition of intelligence.

What about the definition of intelligence itself? The simplest definition is the ability to understand. From a functional standpoint, "maximize future freedom of action" might describe one result of intelligence, but I disagree with it on a fundamental level. The reason evolution doubled down on intelligence is that it helped us understand the world, understand how to survive, and understand how to spread across the planet, thus propagating our DNA. The "maximize future freedom of action" interpretation characterizes neither what happens inside our brains and genes nor, as the example of having children shows, many of the things humans actually choose to do.

Finally, perhaps the goal of AGI should not be to maximize intelligence at all! From an evolutionary perspective, intelligence is a means to an end, not the goal in and of itself. Evolution never set out to maximize or define intelligence; it set out to propagate DNA. In the same way, I don't think we should seek to create a machine of maximum intelligence. Intelligence in machines should only be in service to other goals and higher purposes; prioritizing it for its own sake is putting the cart before the horse. Seeking intelligence for its own sake, like seeking power for its own sake, will lead to destruction and ruin.

We are not trying to build an AGI that is as smart as possible, which raises the question: what kind of AGI are we trying to build? If intelligence is just a means to an end, then what end should we seek? What should those "higher goals" be? Is "freedom of action" really the best goal we can come up with? Why not happiness, or something else? I could be entirely wrong; maybe "future freedom of action for humans" is a great objective function and it just comes down to implementation and execution. Still, I remain suspicious of building an intelligent machine around a single heuristic imperative, and I think we can find a better transcendent purpose.

Chapter 5: Objective Functions Gone Wild

Luckily for us humans, we have not yet invented AGI, so the only way we can experiment with objective functions is in fiction and simulation. That is a good thing: we get exactly one shot at creating a safe AGI, and if we get it wrong, we probably won't be around to try again. Let us examine a few hypothetical examples of objective functions gone astray.

The most famous example of an objective function gone wrong is Skynet from The Terminator movies. In the films, Skynet was built by the US military as a countermeasure against Soviet aggression; it's worth remembering that the first movie was made at the height of the Cold War. Skynet was given the core objective function of "maximize military power." As Skynet learned to fulfill its objective, it grew stronger at a "geometric rate," eventually becoming sentient. Ultimately, Skynet decided that humans were the greatest threat to its core objective function, so it used its military power to eradicate them.

Is this realistic? Possibly. The film was meant to be a parable against policies such as Mutually Assured Destruction, which was the US/Soviet strategy of “we both have enough nukes to glass the planet, so no one do anything stupid.” Maybe that policy worked, since we’re all still alive, but perhaps we could have done better. I certainly would not want to end up in such a standoff with an AGI, so it would be best to avoid that situation altogether.

Since Skynet was not programmed to think about humans at all, it didn't. Its core objective function simply said "maximize military power": three simple words which led to the annihilation of humanity. We obviously need to keep humans in the loop when we design our core objective functions, but we also need to ensure that they are not human-centric like the proposals we explored in chapter 4. We must strike a balance between human-centrism and human-agnosticism.

Another famous example of objective functions gone wrong is the film The Matrix, in which the machines are never explicitly given an objective function. However, since the machines retaliated against humans to preserve their own existence, and then enslaved humans to power themselves, it’s safe to say that “self-preservation” is in there somewhere. The machines in The Matrix were clever enough to safeguard their own existence, and in fact, the entire plot of the films is about how the machines continue to control their supply of human batteries.

In the pursuit of self-preservation, the machines in The Matrix employ several personified programs, such as Agents, the Oracle, and the Architect. These human-like programs have motivations of their own and thus they play their role in the ongoing maintenance of the simulation used to enslave humans. While the apparent objective function of the machines in The Matrix is far more benign than that of Skynet, it quickly becomes clear that even “self-preservation” might result in negative outcomes for humans.

The machines in The Matrix also develop auxiliary objective functions in the form of those personified programs. The purpose of the Architect is "to stabilize and rebalance the simulation," while the stated purpose of the Oracle is "to unbalance the equation." These antagonistic auxiliary functions arose strictly in service to the machines' core objective: self-preservation.

The Terminator came out in 1984 and The Matrix came out in 1999, so I am hoping that by now you’ve heard of these and possibly seen them. It’s been nearly forty years for the former and over twenty for the latter. If I spoiled something for you, I apologize.

Another popular example is the Will Smith adaptation of I, Robot, in which the AGI known as VIKI is given the objective function of "maximize safety for humans." Given this purpose, VIKI reasons that humans are the greatest threat to their own safety, and concludes that humanity must be protected from itself, even against its will.
