AGI Inherent Non-Safety

Potsdam, Germany

pik-gane.github.io

15 people

A research project developing non-maximizing, aspiration-based designs for AI agents that avoid objective function maximization, arguing that such optimization is inherently unsafe in sufficiently capable AGI systems.

People – no linked people

Updated 04/02/26

Funding Details

Updated 04/02/26

Annual Budget: -
Current Runway: -
Funding Goal: -
Funding Raised to Date: -

Org Details

Updated 04/02/26

AGI Inherent Non-Safety is a research project that originated in the 7th edition of AI Safety Camp (AISC). Closely associated with the SatisfIA research agenda, it is led by Dr. Jobst Heitzig, a senior researcher and working group leader at the Potsdam Institute for Climate Impact Research (PIK) in Germany. The project is hosted by PIK's FutureLab on Game Theory and Networks of Interacting Agents, in collaboration with AI Safety Camp, the Supervised Program for Alignment Research (SPAR), and ENS Paris.

The core thesis is that AGI systems designed to maximize an objective function representing utility are inherently unsafe. The project gives two reasons. First, such systems risk existential harm from Goodharting and other forms of misaligned optimization when the agent is capable enough, has access to sufficient resources, and its designers cannot be absolutely sure they have found exactly the right notion and metric of utility. Second, optimization-based agents will develop dangerous instrumental behaviors. This argument draws on and extends well-known AI safety concerns about reward hacking and instrumental convergence.

Rather than adding safety constraints to optimization-based systems, the project develops a different paradigm: aspiration-based designs, in which agents fulfill goals specified through constraints called aspirations. These agents aim for outcomes within defined acceptable ranges rather than pushing toward a maximum, which the project argues makes extreme actions much less likely and the agents therefore much safer.

The research covers model-based planning and reinforcement learning algorithms for aspiration-based agents; safety criteria such as environmental disturbance minimization, action predictability, and resource acquisition limits; testing in AI safety gridworlds; and the theory of multi-dimensional aspirations and multi-agent environments. Dr. Heitzig has proposed a modular AI architecture with separate perception, world model, evaluation, and hardcoded decision algorithm components, designed for transparency and interpretability.

The project has been supported by the Survival and Flourishing Fund (SFF), receiving $135,000 in the SFF 2024 S-Process round through its fiscal sponsor, the Players Philanthropy Fund (PPF). The team consists of approximately 15 volunteers with diverse backgrounds, organized into specialized sub-teams addressing theory, algorithms, implementation, and testing. Dr. Heitzig has also been working on establishing an AI safety lab in Berlin.
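
The contrast between maximizing and aspiration-based action selection can be illustrated with a toy sketch. This is not the project's algorithm: the action names, the scalar expected outcomes, the midpoint tie-breaking rule, and the fallback for infeasible aspirations are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    expected_outcome: float  # hypothetical scalar, e.g. expected resources acquired


# Hypothetical action set, invented for this illustration.
ACTIONS = [
    Action("do_nothing", 0.0),
    Action("modest_plan", 5.0),
    Action("aggressive_plan", 9.0),
    Action("seize_all_resources", 100.0),
]


def maximizing_choice(actions):
    """Maximizer: always pick the action with the highest expected outcome."""
    return max(actions, key=lambda a: a.expected_outcome)


def aspiration_choice(actions, low, high):
    """Pick an action whose expected outcome lies in the aspiration [low, high].

    Assumed rules (not taken from the project): among feasible actions,
    prefer the one closest to the interval midpoint; if no action is
    feasible, fall back to the one closest to the interval.
    """
    if low > high:
        raise ValueError("aspiration interval must satisfy low <= high")
    midpoint = (low + high) / 2
    feasible = [a for a in actions if low <= a.expected_outcome <= high]
    if feasible:
        return min(feasible, key=lambda a: abs(a.expected_outcome - midpoint))

    def distance_to_interval(a):
        # Positive distance from the outcome to the nearest interval endpoint.
        return max(low - a.expected_outcome, a.expected_outcome - high)

    return min(actions, key=distance_to_interval)


if __name__ == "__main__":
    print("maximizer picks:", maximizing_choice(ACTIONS).name)
    print("aspiration [4, 6] picks:", aspiration_choice(ACTIONS, 4.0, 6.0).name)
    print("aspiration [20, 30] picks:", aspiration_choice(ACTIONS, 20.0, 30.0).name)
```

In this toy setting the maximizer selects the most extreme action available, while the aspiration-based rule selects an action inside the acceptable range and never moves toward the extreme just because it scores higher.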

Theory of Change

Updated 04/02/26

The project's theory of change rests on the argument that optimization-based AGI designs are fundamentally unsafe regardless of how carefully the objective function is specified. By developing a viable alternative paradigm (aspiration-based, non-maximizing agent designs), the project aims to give the AI development community and policymakers concrete, safer alternatives to current approaches. The causal chain has five steps: publishing academic papers to establish credibility, developing software components that can be demonstrated in toy environments, partnering with industry on concrete applications, creating proofs-of-concept, and ultimately providing regulators with viable alternatives to current AI development approaches. If successful, this would reduce existential risk by offering a technically sound path to capable AI systems that do not exhibit the dangerous instrumental behaviors and Goodharting failures inherent in optimization-based designs.

Grants Received

Updated 04/02/26

SFF-2024 - AGI Inherent Non-Safety

from Survival and Flourishing Fund (survivalandflourishing.fund)

$135,000

Projects – no linked projects

Updated 04/02/26
