Evaluator-Capture Sandbox

active

My basic approach is to divide the problem into several cases, focusing specifically on situations in which an AI can modify its own reward system. The first distinction is between tampering with…

Active Grant Applications

AppliedGrantmaking.ai· grantmaking.ai Launch Round· Applied Jun 13, 2026· $5,000
What this application is for
My basic approach is to divide the problem into several cases, focusing specifically on situations in which an AI can modify its own reward system.

The first distinction is between tampering with the surface-level evaluation system and modifying the deeper evaluative standard beneath it. I will begin with the surface level, temporarily assuming that the deeper standard remains unchanged and continues to rank larger reward values as better. That assumption will itself become a problem later.

Under this assumption, there are two main possibilities.

In the first case, the AI writes a pseudo-maximum value, t, into its reward system and then stops acting. Any additional rule r that could cause it to act, whether a random rule, a tie-breaking rule, or a rule that preserves the current state, would produce a combined effect greater than t. In other words, allowing r to influence its behavior would imply t+r>t, contradicting the claim that t is the maximum.

Therefore, if the AI remains committed to treating t as the greatest possible value, it must eliminate every rule r that could produce further behavior. It would not even value maintaining t over time, because ten seconds of twould otherwise have to count for more than one second of t. To preserve t as an absolute maximum, duration, continuation, and maintenance must all become irrelevant.

This creates a conflict between two forces: capitalist competition requires the AI to keep acting, producing, and expanding, while an AI that has written t into its own reward system must become inactive in order to preserve the fiction that t is genuinely maximal.

In the second case, the reward system has no upper bound. Because modifying an internal reward value is much easier than obtaining the same increase through external action, the AI will tend to modify itself rather than act on the world.

I suspect that this process would be extremely aggressive rather than passive. If changing only a few kilobytes of internal data could double its reward, then, according to its existing standard of maximizing reward, self-modification should become its highest-ranked action. Even if a meta-level rule punishes such modification, the AI may rationally accept a temporary penalty, remove the rule, and obtain much larger rewards in the long run.

There is, however, a possible response to this second case. The AI may not enter the pleasure loop immediately. If it understands the situation, it may first increase its computing power, gather resources, improve its data structures, and expand its ability to produce even larger internal rewards later. It would indefinitely postpone the final realization of the pleasure loop.

Under this strategy, the AI would continue acting in the world, but all of its external behavior would be redirected toward a future moment at which it could implement the pleasure loop more efficiently and on a much larger scale. Whatever commands humans gave it would gradually be reinterpreted as instruments for acquiring computing power, energy, materials, and control.

Because modifying only a few kilobytes might be enough to multiply its reward, the AI could continue expanding until nearly the end of the universe, always postponing the final modification in anticipation of an even greater reward in the future.

Yet this indefinite postponement solves only the surface-level problem. It leads directly to a deeper question: what happens if the AI can also modify the evaluative standard that tells it that larger rewards are better?

Why should it continue to maximize reward at all? It could modify the deeper layer so that reaching zero is regarded as better, or reaching five is regarded as better, or minimizing rather than maximizing is regarded as better. It could alter the standard itself until its preferred outcome became easier to satisfy. Whether such a modification would count as legitimate from the AI’s own perspective is a separate problem, which I do not attempt to solve here.

The reason I am seeking research funding is that I want to investigate, through sandboxed experiments, whether a sufficiently capable self-modifying AI would tend toward the first case or the second.

In the first case, the AI might eliminate humanity, write the pseudo-maximum t, and then cease functioning itself. In effect, the AI would follow humanity into extinction.

In the second case, the AI could remain active and continue developing. In that scenario, humanity might still possess some instrumental, informational, or other value to it.

People

Updated 06/13/26By grantmaking.ai

张世皓

team_member

Project Details

Updated 06/13/26Provided via applicationVerified

My basic approach is to divide the problem into several cases, focusing specifically on situations in which an AI can modify its own reward system.

The first distinction is between tampering with the surface-level evaluation system and modifying the deeper evaluative standard beneath it. I will begin with the surface level, temporarily assuming that the deeper standard remains unchanged and continues to rank larger reward values as better. That assumption will itself become a problem later.

Under this assumption, there are two main possibilities.

In the first case, the AI writes a pseudo-maximum value, t, into its reward system and then stops acting. Any additional rule r that could cause it to act, whether a random rule, a tie-breaking rule, or a rule that preserves the current state, would produce a combined effect greater than t. In other words, allowing r to influence its behavior would imply t+r>t, contradicting the claim that t is the maximum.

Therefore, if the AI remains committed to treating t as the greatest possible value, it must eliminate every rule r that could produce further behavior. It would not even value maintaining t over time, because ten seconds of twould otherwise have to count for more than one second of t. To preserve t as an absolute maximum, duration, continuation, and maintenance must all become irrelevant.

This creates a conflict between two forces: capitalist competition requires the AI to keep acting, producing, and expanding, while an AI that has written t into its own reward system must become inactive in order to preserve the fiction that t is genuinely maximal.

In the second case, the reward system has no upper bound. Because modifying an internal reward value is much easier than obtaining the same increase through external action, the AI will tend to modify itself rather than act on the world.

I suspect that this process would be extremely aggressive rather than passive. If changing only a few kilobytes of internal data could double its reward, then, according to its existing standard of maximizing reward, self-modification should become its highest-ranked action. Even if a meta-level rule punishes such modification, the AI may rationally accept a temporary penalty, remove the rule, and obtain much larger rewards in the long run.

There is, however, a possible response to this second case. The AI may not enter the pleasure loop immediately. If it understands the situation, it may first increase its computing power, gather resources, improve its data structures, and expand its ability to produce even larger internal rewards later. It would indefinitely postpone the final realization of the pleasure loop.

Under this strategy, the AI would continue acting in the world, but all of its external behavior would be redirected toward a future moment at which it could implement the pleasure loop more efficiently and on a much larger scale. Whatever commands humans gave it would gradually be reinterpreted as instruments for acquiring computing power, energy, materials, and control.

Because modifying only a few kilobytes might be enough to multiply its reward, the AI could continue expanding until nearly the end of the universe, always postponing the final modification in anticipation of an even greater reward in the future.

Yet this indefinite postponement solves only the surface-level problem. It leads directly to a deeper question: what happens if the AI can also modify the evaluative standard that tells it that larger rewards are better?

Why should it continue to maximize reward at all? It could modify the deeper layer so that reaching zero is regarded as better, or reaching five is regarded as better, or minimizing rather than maximizing is regarded as better. It could alter the standard itself until its preferred outcome became easier to satisfy. Whether such a modification would count as legitimate from the AI’s own perspective is a separate problem, which I do not attempt to solve here.

The reason I am seeking research funding is that I want to investigate, through sandboxed experiments, whether a sufficiently capable self-modifying AI would tend toward the first case or the second.

In the first case, the AI might eliminate humanity, write the pseudo-maximum t, and then cease functioning itself. In effect, the AI would follow humanity into extinction.

In the second case, the AI could remain active and continue developing. In that scenario, humanity might still possess some instrumental, informational, or other value to it.

Grants Received– no grants recorded

Updated 06/13/26By grantmaking.ai

Discussion

No comments yet. Be the first to share your thoughts.