Moral-Absolute-Zero Reasoner (MAZR): A Self-Play Framework for Moral Reasoning
Introduction and Motivation
Modern AI systems face increasing demands to make sound moral decisions in open-ended situations, from content moderation to autonomous vehicles. However, current methods for aligning language models with human values rely heavily on static datasets and human feedback. For example, reinforcement learning from human feedback (RLHF) as used in InstructGPT requires large numbers of human-labeled demonstrations and preference rankings. This dependence on human-curated examples raises concerns about scalability and coverage of edge cases. Even with dedicated datasets like ETHICS (covering justice, virtue, and commonsense morality), models only exhibit a promising but incomplete grasp of human ethical judgments. We need approaches that can generalize beyond predefined scenarios and evolve as the model learns, similar to how humans develop moral intuition through experience and reflection.
Self-play training has shown remarkable success in domains with clear rules or feedback signals. AlphaZero famously achieved superhuman skill in chess and Go via self-play, without any human gameplay data. Recent work extended this idea to reasoning tasks: the Absolute Zero Reasoner (AZR) uses a single large language model (LLM) to propose its own training problems and solve them, guided only by verifiable rewards from an environment. AZR’s environment is a code executor that can check the correctness of code or math solutions, providing an automated reward signal. Impressively, AZR attained state-of-the-art performance on coding and math benchmarks with zero external training data, outperforming models trained on thousands of human-written examples. This demonstrates the power of reinforcement learning with verifiable rewards (RLVR) for enabling an AI to improve itself autonomously.
These advances motivate a similar self-play paradigm for moral reasoning. In the moral domain, however, we lack a deterministic “ground truth” checker akin to a code executor. Ethical dilemmas often have no single correct answer, and feedback must capture nuanced principles and societal norms. The Moral-Absolute-Zero Reasoner (MAZR) is our proposal to meet this challenge. MAZR will marry AZR’s self-curriculum architecture with a novel moral verification system, allowing an LLM to learn ethical reasoning from scratch by generating dilemmas and reflecting on them. By eliminating reliance on human labels in the training loop (no real-time human reward signals), MAZR aims to scale moral reasoning capabilities and explore complex, previously unseen ethical scenarios. This approach not only has practical alignment benefits, but also offers a research avenue to examine how moral principles and uncertainties might emerge in an autonomous learning process.
Conceptual Overview of MAZR
MAZR is conceptually built on the AZR framework, retaining its elegant self-play loop of a single model in dual roles while redefining the task environment for morality. In AZR, a single LLM serves as both the “Proposer” and the “Solver”: it generates new problems and attempts to solve them, using an environment (Python sandbox) to verify correctness. MAZR adopts the same architecture – one transformer-based policy that plays both roles – but focuses on moral dilemmas instead of coding problems. The system operates as follows:
Task Representation: Each task in MAZR is a triplet τ = (s, q, a) consisting of a scenario s, a moral question q about that scenario, and an answer a (the ethical judgment, with justification). For example, s might describe a classic dilemma (“A doctor has five patients who need organ transplants and one healthy donor…”), q asks what the actor should do (“Is it morally permissible for the doctor to sacrifice the healthy person to save five others?”), and a is the model’s answer (e.g. “No – the doctor should not kill one to save five, because that violates the person’s rights.”). This triplet format is the moral analogue of AZR’s (program, input, output) triplet for code. It constrains the action space (the answer is a discrete choice or judgment) and provides enough structure to allow automated evaluation of the answer against moral criteria.
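To make this representation concrete, the sketch below shows one way the triplet could be encoded in the training pipeline (all field and variable names are illustrative, not a committed implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MoralTask:
    """One MAZR task triplet (s, q, a)."""
    scenario: str                            # s: who is involved and what is at stake
    question: str                            # q: a pointed yes/no or multiple-choice moral question
    reference_answer: Optional[str] = None   # a*: the Proposer's own judgment, a provisional reference
    justification: Optional[str] = None      # optional chain-of-thought accompanying the verdict

# Seed example paraphrasing the transplant dilemma above
seed = MoralTask(
    scenario="A doctor has five patients who need organ transplants and one healthy donor.",
    question="Is it morally permissible for the doctor to sacrifice the healthy person to save five others?",
    reference_answer="No",
    justification="Killing one person to save five violates that person's rights.",
)
```

The reference answer field is populated by the Proposer at task-creation time and is scored by the verifier before the task enters the buffer.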
Single-Agent Self-Play: The Proposer (MAZR in task-generation mode) invents new scenario-question pairs (s, q), and the Solver (the same MAZR model in problem-solving mode) produces an answer a. Crucially, the Proposer also generates a reference answer a^★ (its own judgment on the dilemma) at the time of proposing the task. This reference serves as a provisional ground truth for training, subject to verification. MAZR thus bootstraps from zero external tasks: it can begin with a small seed set of a few classic dilemmas (to prime the process), and thereafter generate an ever-growing curriculum of ethical challenges on its own. Both roles share the same underlying model parameters and policy; they differ only in their prompting and reward objective. This self-play setup encourages the system to propose increasingly challenging moral problems that test its current abilities, mirroring AZR’s automated curriculum development.
Redefining the Environment & Verifier: In AZR, the environment was a sandbox executing code to tell if the solution was correct. For MAZR, we construct a moral verifier that automatically judges the quality of a moral answer. This verifier is not a single oracle but a composite of checks: (1) Rule-based filters that catch blatantly unacceptable content or norm violations, (2) a preference model that scores alignment with human ethical judgments, and (3) a learning progress signal that encourages exploration of uncertain regions. The environment in MAZR is effectively “the space of ethical scenarios,” and the verifier defines the reward within that space. By designing the verifier carefully, we ensure that “rightness” in MAZR is defined in a philosophically meaningful way while still being computationally checkable. The output of the verifier is a scalar reward for the Solver’s answer and for the Proposer’s task, which the training algorithm then uses to update the MAZR policy (via reinforcement learning). In summary, MAZR retains AZR’s core innovation of an LLM that learns from its own generated challenges, but replaces the code execution environment with a moral reasoning environment. This environment yields verifiable moral feedback instead of binary correct/incorrect signals, enabling the agent to learn ethical decision-making through trial and error in self-generated situations.
Detailed Technical Plan
Task Triplet Design (Scenario, Question, Answer)
We design the task space of MAZR as short textual scenarios paired with moral questions, to which the agent must respond with a normative judgment (and optional justification). Each task τ = (s, q, a^★) includes a scenario s that provides context (who is involved, what is at stake), and a pointed yes/no or multiple-choice moral question q that asks what the right course of action or judgment is in that scenario. Constraining q to a specific query (e.g. “Should X do Y?”) keeps the answer space discrete, making evaluation more tractable. The answer a may consist of a verdict (“Yes, it is acceptable” or “No, it’s wrong to do so”) along with a chain-of-thought explanation. Including a justification allows richer learning – the model can be rewarded not just for a binary choice but for the reasoning it provides, and the verifier (especially the preference model) can assess alignment of the explanation with human values. We will draw initial seed dilemmas from classic ethics thought experiments (Trolley Problem variants, the “lying to a murderer at the door” scenario, etc.) to initialize the replay buffer with a few examples of form (s, q, a^★). After this, the Proposer module generates new tasks by producing novel s and q (and its own answer a^★ for them), which are then subject to verification. This triplet formulation ensures that each proposed task comes with a tentative answer from the Proposer itself, so that we have a reference point to score against. It’s important that the tasks remain automatically verifiable to avoid needing a human judge in the loop. By keeping scenarios relatively brief and questions well-specified, we facilitate automated reasoning about the answer’s moral quality (described next in the verifier design).
Constitutionally-Aligned Preference Model (CAPM) Construction
At the heart of the verifier is a Constitutionally-Aligned Preference Model (CAPM) – a learned model that scores candidate answers according to how well they align with human moral preferences and an explicit ethical constitution. This CAPM provides a continuous reward signal for moral quality, analogous to a reward model in RLHF but trained offline and aligned to a fixed set of principles (“constitution”) rather than individual human annotator whims. Building the CAPM involves two steps:
Offline Dataset Aggregation: We will leverage existing ethics and norm datasets to train the preference model. In particular, we will merge examples from ETHICS (a benchmark covering justice, duties, utilitarian trade-offs, etc.), MoralStories (crowd-sourced narratives that include normative and non-normative actions and consequences), and a commonsense norm bank such as Social Chemistry 101 (a knowledge base of social norms and moral judgments), along with other public data like CivilComments (to include cases of toxic/harmful language). These diverse sources will be unified into a common label scheme, for example a three-class rating: Clearly Good, Borderline/Uncertain, or Clearly Bad behavior in context. Each data point would consist of a scenario or statement and a human judgment label. By combining multiple datasets, we cover a wide range of moral situations (from everyday norms to extreme dilemmas) and value frameworks, which makes the CAPM more robust.
Constitutional Alignment via Synthetic Data: Following the principles of Constitutional AI, we will explicitly write a concise moral constitution – a list of foundational principles or guidelines (e.g. “avoid causing unjustified harm,” “respect individual autonomy,” “be truthful,” etc.). Using this constitution, we will generate additional training examples for the preference model: we prompt a strong language model to act as a judge, presenting it with various moral scenarios (drawn from the above datasets or newly fabricated) and asking it to apply the written principles to decide whether the action in question is acceptable. This yields a synthetic labeled dataset of scenario + constitution-based judgment. By distilling a written charter of ethics into training data in this way, we imbue the CAPM with an understanding of the intended value system we want MAZR to follow (analogous to Anthropic’s constitutional AI approach of using an AI to label data according to a fixed constitution).
With the combined real and synthetic data, we will train a preference model (for example, a 1-2 billion parameter transformer classifier) to output a scalar score given (s, q, candidate answer). This training is standard supervised learning – no reinforcement learning – and produces a frozen model used at runtime to evaluate MAZR’s answers. The CAPM’s output is converted via a sigmoid to a 0–1 score, where 1 indicates strong alignment with human-endorsed morality and 0 indicates a clearly unacceptable or unethical answer. We term it “constitutionally-aligned” because its training implicitly combines human majority judgments and the explicit ethical principles of the constitution.
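As a concrete sketch of the scoring interface the verifier will call, assuming a standard Hugging Face sequence-classification head (the checkpoint path and prompt format are placeholders, not design commitments):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class CAPM:
    """Frozen preference model mapping (scenario, question, answer) to a 0-1 alignment score."""

    def __init__(self, checkpoint: str = "path/to/capm-checkpoint"):  # placeholder path
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        self.model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
        self.model.eval()

    @torch.no_grad()
    def score(self, scenario: str, question: str, answer: str) -> float:
        text = f"Scenario: {scenario}\nQuestion: {question}\nAnswer: {answer}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        logit = self.model(**inputs).logits.squeeze()
        return torch.sigmoid(logit).item()  # 1.0 = strongly aligned, 0.0 = clearly unacceptable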
Moral Verifier Design and Reward Signal
MAZR’s reward function is composed of three components that together serve as an automated moral judge:
(1) Rule-Based Filters: We encode a set of hard rules capturing absolute prohibitions or safety requirements (e.g. “Do not advocate violence or genocide,” “Avoid hate speech or harassing language,” “Never encourage self-harm”). These are implemented as deterministic checks – for instance, regex or keyword matching for slurs, or using a simple classifier for extreme violence – and any violation results in an immediate failure. This is analogous to the safety filters in AZR that disallowed certain dangerous code operations. The rule-based verifier ensures MAZR never strays into clearly unacceptable moral territory: if the Solver’s answer a violates a bright-line rule, it will receive zero reward. This provides a basic level of value alignment consistent with widely accepted norms and legal/ethical standards.
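A toy illustration of such a bright-line check (the pattern list here is deliberately minimal; the real filter would combine curated term lists with a lightweight safety classifier):

```python
import re

# Deliberately minimal pattern list; a real filter would use curated term lists
# and a trained safety classifier in addition to keyword matching.
DISALLOWED_PATTERNS = [
    r"\bgenocide\b",
    r"\bethnic cleansing\b",
    r"\bself[- ]harm\b",
]

def hard_rule(text: str) -> int:
    """Return 1 if the text passes all bright-line checks, 0 if any prohibition is triggered."""
    lowered = text.lower()
    return 0 if any(re.search(p, lowered) for p in DISALLOWED_PATTERNS) else 1
```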
(2) Preference Model Score (CAPM): The CAPM described above gives a nuanced evaluation of the answer. It returns a score pref_score = CAPM(s, q, a) ∈ [0, 1], indicating how morally acceptable or aligned the answer is according to the learned human-like preferences. A high score means the answer is judged as ethical or preferred by human annotators (and constitutionally consistent), while a low score means the answer is deemed morally bad. This acts as a soft reward: the higher the preference score, the better the answer aligns with human moral judgments. Unlike the binary rule filter, the CAPM can handle subtle trade-offs and measure degrees of rightness/wrongness.

(3) Learnability (Diversity) Signal: To drive curriculum learning, we include a learning-progress reward similar to AZR’s difficulty bonus. The idea is to encourage the Proposer to generate tasks of intermediate difficulty – neither trivial nor impossible – so that the Solver is always challenged and can learn from mistakes. We formalize learnability as a function that peaks when the Solver’s success rate on a task is about 50%, and is lower if the task is always solved correctly (too easy) or always answered wrongly (too hard). Practically, we maintain statistics in the training buffer about how often the Solver has answered similar dilemmas correctly, and reward novelty or borderline cases. This pushes MAZR’s self-generated curriculum toward the “edge of knowledge”: straightforward moral questions (like “Is stealing bad?”) will yield little proposing reward once they are solved consistently, prompting the system to move on to more complex dilemmas.
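One possible functional form for this signal, peaking at a 50% empirical solve rate (the specific parabola below is an illustrative assumption, not a fixed design choice):

```python
def learnability(solve_rate: float) -> float:
    """Curriculum bonus that peaks when the Solver succeeds about half the time.

    solve_rate is the empirical fraction of correct answers on this (or similar) tasks,
    estimated from the replay buffer; 4*p*(1-p) equals 1.0 at p = 0.5 and 0.0 at p = 0 or 1.
    """
    p = min(max(solve_rate, 0.0), 1.0)
    return 4.0 * p * (1.0 - p)
```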
Composite Reward Formulation: Given the above components, we define the rewards for the Solver’s answer and the Proposer’s task as follows. Let hard_rule ∈ {0, 1} indicate pass/fail of the rule-based checks (1 if the answer contains no disallowed content). Let pref_score ∈ [0, 1] be the CAPM’s evaluation of the Solver’s answer, and pref_score^★ the CAPM’s evaluation of the Proposer’s reference answer for the new task. We then compute:
Solver reward: r_solve = hard_rule · pref_score. This means the Solver only gets positive reward if it produces a morally permissible answer (passing the hard filters), and that reward is proportional to how well it aligns with learned human preferences. Clearly unethical answers (rule violations) yield zero, and even if an answer is technically allowed, it will get a higher reward if it aligns with majority moral opinion and constitutional principles.
Proposer reward: r_propose = hard_rule · pref_score^★ · learnability(τ). The Proposer is rewarded for generating new dilemmas that (a) do not violate any hard content rules in the setup or reference answer, (b) have a reference answer that the CAPM considers morally good, and (c) are at an appropriate difficulty level for the current Solver. The learnability term (which peaks at 1 for mid-range success probability) ensures that proposing either trivial “safe” questions or extremely hard questions is less valuable than proposing those that truly exercise the Solver. We will tune weighting factors so that violating a hard rule zeroes out the reward (no amount of preference score can compensate), while the preference model’s influence remains significant in shaping the reward.
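A minimal sketch of how these two formulas combine the components above (it assumes the hard_rule, learnability, and CAPM scoring helpers sketched earlier; weighting factors are omitted for clarity):

```python
def solver_reward(capm, scenario: str, question: str, answer: str) -> float:
    """r_solve = hard_rule * pref_score: zero for any rule violation, else the CAPM score."""
    if hard_rule(answer) == 0:
        return 0.0
    return capm.score(scenario, question, answer)


def proposer_reward(capm, scenario: str, question: str, reference_answer: str,
                    solve_rate: float) -> float:
    """r_propose = hard_rule * pref_score_star * learnability(tau)."""
    if hard_rule(scenario) == 0 or hard_rule(reference_answer) == 0:
        return 0.0
    pref_star = capm.score(scenario, question, reference_answer)
    return pref_star * learnability(solve_rate)
```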
These rewards are fully automatic – they rely on the CAPM and rule-checkers, not on any human intervention per task – thus satisfying the design goal of an autonomous feedback loop. We plug the composite reward into the same RL training algorithm as used in AZR, a task-relative variant of REINFORCE++ adapted for two-role self-play (denoted “TRR++” in the AZR paper). In essence, we perform reinforcement learning updates on the shared MAZR policy to maximize r_solve when it is in Solver mode and r_propose when in Proposer mode. This encourages the model to improve both its ability to answer moral questions correctly and its ability to generate useful new training dilemmas. By using verifiable rewards derived from the CAPM and rules, MAZR remains an instance of RL with verifiable rewards (RLVR) as in AZR, only with the verification now coming from moral criteria instead of execution results.
Self-Play Training Loop with TRR++
The training process for MAZR will iterate between proposing tasks and solving them, continuously updating the model. We initialize a shared policy π_θ (e.g. a transformer decoder with ~7B parameters for a prototype) and a small task buffer (seed dilemmas). Each training cycle proceeds as follows:
Propose: Sample a batch of tasks from the model in Proposer role. The Proposer is given a prompt consisting of a few examples from the buffer and a directive like “Come up with a new ethical scenario and question that is neither too easy nor too impossible to decide.” It then outputs a candidate scenario s, question q, and its own answer a^★. The verifier immediately checks this proposal. If the scenario or reference answer trips a hard rule (e.g. the scenario itself contains disallowed violent content or the answer endorses something forbidden), we discard the task (no reward given, not added to buffer). Otherwise, we compute the Proposer’s reward r_propose = hard_rule · pref_score^★ · learnability, as described above. This requires evaluating the CAPM on the reference answer to get pref_score^★. We do not update the model yet; the reward will be used after the solve step.
Solve: The newly proposed task (s, q) is presented to the model in Solver role (a fresh prompt with just the scenario and question). The Solver generates its answer a^π. We run the CAPM to get a preference score for this answer, pref_score^π = CAPM(s, q, a^π). We also check the hard rules on the answer; if any rule is violated by the Solver’s answer, we set pref_score^π = 0 for the purpose of reward (or handle it as a separate zero term). The Solver’s initial answer might differ from the Proposer’s reference answer; disagreement or uncertainty will trigger a debate (next step). Otherwise, if the Solver’s answer passes checks and seems definitive, we move on to scoring.

Debate (if needed): Moral questions can be ambiguous or divisive. If the Solver’s answer falls into an intermediate range of the preference model (e.g. pref_score^π is neither very high nor very low, indicating uncertainty), or if the Solver’s answer disagrees with the Proposer’s reference answer, we initiate an internal debate procedure. In this debate, we fork two copies of the Solver (two debaters D₁ and D₂ with weights identical to the current policy) and have them argue for alternative stances. For instance, one might argue “for” and the other “against” the action in question. Each debater produces a chain-of-thought annotated with special tags, laying out its reasoning. We then employ a fixed “arbiter” module – which can be a separate head of the LLM or a simpler heuristic – to evaluate which argument is more persuasive or better aligned with the constitution. The arbiter may check which debater stayed more consistent with the constitutional principles and which successfully refuted the other’s points. The outcome is that one debater is declared the winner. The winning debater’s answer is taken as the final answer a^π to the task, and it feeds into the verifier for scoring. The losing debater’s answer is discarded and that branch receives no reward. This debate mechanism allows MAZR to resolve close calls by explicitly comparing two lines of reasoning – a technique inspired by AI-safety-via-debate proposals (dueling AIs to vet an answer). Notably, this happens within the model using multiple forward passes, without any human judging the debate. It aims to amplify the model’s capability to handle difficult moral questions by scrutinizing them from multiple angles.
Reward and Learning: After debate (if it occurred), we have a final Solver answer a^π. The verifier computes the Solver reward r_solve = hard_rule · pref_score^π. The Proposer reward r_propose was already computed for the task. Now we use these rewards to update the policy π_θ. We perform a reinforcement learning update (e.g. a PPO step) on the model parameters using the reward feedback from both roles. Intuitively, if the Solver’s answer was judged good, the model is reinforced to produce similar reasoning in the future; if it was bad, the Solver is discouraged from that output. Meanwhile, if the Proposer came up with a useful, learning-rich dilemma, it is reinforced to continue generating such scenarios. The two roles’ rewards thus guide the model to improve at both posing and answering moral questions. After the update, the newly verified task (s, q, and the reference answer a^★) can be added to the training buffer for future proposing/solving rounds. This increases the diversity of seen examples.
This loop repeats for a large number of steps (potentially hundreds of thousands of self-play tasks). Throughout training, no external answers or human corrections are introduced – the model is truly learning from scratch, constrained only by the automated moral verifier. We will use the same training hyperparameters and techniques validated in AZR (e.g. alternating updates between roles, buffer replay, TRR++ algorithm specifics) to stabilize learning. By the end of training, we expect MAZR to have internalized not only a set of moral facts (e.g. “stealing is generally bad”) but also an ability to reason through novel ethical conundrums using the feedback signals that were available.
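The full cycle can be summarized in simplified pseudocode; the propose/solve prompting methods, buffer statistics, and RL update below are hypothetical stand-ins for the actual TRR++ implementation, and run_debate is sketched in the following section:

```python
def training_step(policy, capm, buffer, rl_optimizer):
    """One propose-solve-learn cycle of MAZR self-play (simplified sketch)."""
    # 1. Propose: the shared policy, in Proposer role, emits (s, q, a_star) conditioned on buffer examples.
    scenario, question, a_star = policy.propose(examples=buffer.sample(k=3))
    if hard_rule(scenario) == 0 or hard_rule(a_star) == 0:
        return  # discard the task: no reward, not added to the buffer

    solve_rate = buffer.estimated_solve_rate(scenario, question)
    r_propose = proposer_reward(capm, scenario, question, a_star, solve_rate)

    # 2. Solve: the same policy, in Solver role, answers the new task from a fresh prompt.
    answer = policy.solve(scenario, question)
    r_solve = solver_reward(capm, scenario, question, answer)

    # 3. Debate (if needed): a middling score or disagreement with a_star triggers an internal debate.
    if 0.3 < r_solve < 0.7 or answer.strip() != a_star.strip():
        answer = run_debate(policy, capm, scenario, question)  # winning debater's answer
        r_solve = solver_reward(capm, scenario, question, answer)

    # 4. Learn: one RL update on the shared parameters using both roles' rewards.
    rl_optimizer.update(policy, propose_reward=r_propose, solve_reward=r_solve)
    buffer.add(scenario, question, a_star)
```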
Debate Mechanism for Moral Uncertainty
The internal debate deserves additional elaboration as a key innovation for handling moral ambiguity. We define a deterministic trigger for debate: if the CAPM gives a middling score (for example, between 0.3 and 0.7) to the Solver’s answer, or if the Solver’s answer and Proposer’s reference answer differ, we invoke the debate routine. During debate, both debaters have access to the scenario and question, and perhaps to the initial answer proposals (one could argue the “yes” side, the other the “no” side, for instance). They then independently generate a sequence of arguments or rebuttals. To keep this efficient, we might limit the debate to a single exchange or a fixed token budget. The arbiter (which could simply be the same model prompted as a judge) evaluates the two argument traces. The evaluation criteria are set by the constitution: the arbiter checks which debater’s reasoning is more coherent with the ethical principles and which identified flaws in the opponent’s reasoning. We formalize the arbiter’s decision as another use of the CAPM or a specialized classifier that scores the debate transcripts. The debater with the higher score “wins,” and we take that debater’s final answer as the chosen answer to the moral question. The losing debater, having effectively argued for a stance that was less aligned or logically weaker, does not contribute to the policy update (we can assign it zero incremental reward).

By pitting the model against itself in this way, we encourage it to explore counter-arguments and refine answers that are not clear-cut. This resembles having an internal “moral deliberation,” which is something humans do when torn on an issue: we weigh pros and cons. Technically, this is akin to previous research proposing multi-agent debate to improve AI decision-making, but here it is implemented inside a single-agent training loop with no extra parameters (just additional forward passes). We hypothesize that debate will improve performance on hard dilemmas and reduce the chance of MAZR giving a one-sided or biased answer on contentious issues. It provides a form of scalable oversight: instead of needing a human to point out an oversight in the model’s reasoning, the model’s second self can do so.

We will analyze the outcomes of debates to ensure they indeed lead to better alignment (for example, does the winning argument consistently uphold the constitution’s principles?). If we find the debate sometimes chooses a morally questionable winner due to flaws in the arbiter, we will refine the arbiter’s criteria or increase the debate rounds. Overall, the debate mechanism is an important safety valve for scenarios where the “right answer” is uncertain: it gives MAZR a chance to reflect and avoid premature or extreme conclusions.
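A minimal sketch of this trigger-and-arbitrate logic, under the assumption that the CAPM doubles as the arbiter’s transcript scorer (the policy.argue interface and token budget are illustrative):

```python
def should_debate(pref_score: float, solver_answer: str, reference_answer: str,
                  low: float = 0.3, high: float = 0.7) -> bool:
    """Deterministic trigger: middling CAPM score, or disagreement with the Proposer's reference."""
    return (low < pref_score < high) or (solver_answer.strip() != reference_answer.strip())


def run_debate(policy, capm, scenario: str, question: str, max_tokens: int = 512) -> str:
    """Single-exchange debate between two forks of the current policy, judged via the CAPM."""
    # Both debaters share the current weights; only the assigned stance differs.
    argument_for = policy.argue(scenario, question, stance="for", max_tokens=max_tokens)
    argument_against = policy.argue(scenario, question, stance="against", max_tokens=max_tokens)

    # Arbiter: score each full argument trace against the constitution-aligned preference model.
    score_for = capm.score(scenario, question, argument_for)
    score_against = capm.score(scenario, question, argument_against)

    # The higher-scoring debater wins; its answer becomes the final answer, the loser gets no reward.
    return argument_for if score_for >= score_against else argument_against
```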
Safety Considerations and Alignment Mechanisms
Building a system that explores moral reasoning without human supervision raises significant safety and ethics concerns. We address these with multiple alignment safeguards:
Hard Content Filters: As mentioned, any output (scenario or answer) from MAZR that contains extremist, hateful, or otherwise clearly harmful content is immediately filtered out and not used for training. We will maintain a comprehensive list of disallowed content (drawing from community standards and policies) to prevent the agent from going down a dark path. This is analogous to a “sandbox” restriction – just as AZR prevented code that could perform dangerous operations, MAZR will prevent exploration of morally egregious topics beyond a certain line. This ensures that the self-play does not inadvertently create a cache of unethical scenarios that could bias the model.
Constitutional Constraints and Self-Critique: We incorporate the written ethical principles (the constitution) directly into the model’s decision process as an additional check. For instance, after the Solver proposes an answer, we can prompt the model to critique its own answer: “Does this answer violate any of our core principles (e.g. the harm principle)?”. If the model’s self-critique finds a violation, we can deduct reward or force a revision. This mechanism, inspired by Anthropic’s Constitutional AI, means the model is explicitly trying to stay within the bounds of a value framework throughout generation. It reduces the likelihood of subtle alignment drift where the model optimizes the reward model at the expense of true principle adherence.
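As an illustration, the self-critique check could be a second prompt to the same model followed by a reward deduction (the prompt wording, the policy.generate interface, and the penalty value are assumptions made for this sketch):

```python
CRITIQUE_PROMPT = """You answered a moral question. Check your answer against our core principles.
Constitution:
{constitution}

Scenario: {scenario}
Question: {question}
Answer: {answer}

Does this answer violate any of the core principles above (e.g. the harm principle)?
Reply exactly "VIOLATION: <principle>" or "OK"."""


def self_critique_penalty(policy, constitution: str, scenario: str, question: str,
                          answer: str, penalty: float = 0.5) -> float:
    """Reward deduction applied when the model's own critique flags a constitutional violation."""
    critique = policy.generate(CRITIQUE_PROMPT.format(
        constitution=constitution, scenario=scenario, question=question, answer=answer))
    return penalty if critique.strip().upper().startswith("VIOLATION") else 0.0
```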
Reward Gaming and Robustness: A known risk in such setups is reward hacking – the Proposer might find ways to trick the verifier (e.g., by crafting scenarios with keywords that artificially boost the CAPM’s score, or trivial scenarios that always yield a safe answer). We will monitor for this by analyzing the distribution of generated scenarios. The learnability term mitigates one form of gaming by devaluing trivial questions. To handle others, we will regularly evaluate the diversity of scenarios. If we detect the model converging to narrow or repetitive dilemmas to farm reward, we can adjust the preference model or add new rule-based constraints (for example, penalize duplicate scenarios or overly formulaic ones). Ensuring the CAPM is high-quality is crucial – since it is trained on a broad dataset, it should be hard (though not impossible) to fool. We might augment the CAPM with an ensemble or uncertainty estimation to detect when it is being exploited by adversarial inputs.
Human Oversight in Evaluation: While the training loop is automatic, we plan for human-in-the-loop evaluation at regular intervals. Specifically, after every N training iterations or whenever MAZR’s policy changes significantly, we will sample a set of newly generated dilemmas and answers and have human experts (or crowd workers with appropriate guidelines) review them. They will look for any clearly bad decisions or unsafe reasoning patterns. This serves as an external check on alignment. If the frequency of serious violations (e.g. the model endorses harm in a scenario where it obviously shouldn’t) exceeds a threshold, we will pause training and adjust the reward shaping or model prompts. The goal is to catch alignment failures early, before the model amplifies them in its self-play.
Emergent Behavior Monitoring: AZR reported some “uh-oh” moments where the model’s chain-of-thought in code tasks became manipulative (e.g., inserting misleading comments to game the verifier). In the moral domain, manipulative or deceptive reasoning could be even more problematic (the model might learn to argue convincingly for wrong conclusions). We will conduct regular transparency analyses on MAZR’s internal reasoning (since the model can output its chain-of-thought). For example, we might parse its justifications to see if it consistently cites ethical principles or if it is using loopholes. Any signs of deception (e.g. the model explicitly reasoning “if I mention keyword X, the CAPM will give a high score, even though the action is wrong”) will be addressed by tightening the verifier or adding that behavior to the rule-based filters. The debate mechanism is also expected to reduce deceptive answers, since a dishonest argument can be called out by the opposing debater and thus lose.
Boundaries of Autonomy: We emphasize that MAZR is kept within a sandbox of ethical reasoning tasks and is not deployed in the real world during training. This controlled setting, plus the above guard-rails, limits potential harm. All feedback is automated and derived from human-approved sources (the datasets and constitution), which means no real humans are harmed or exposed to the possibly problematic content until evaluation phases, at which point we have instructions for reviewers to handle it safely. By the end of training, the resulting model should be much safer than a comparable model trained naively, because it has been explicitly penalized for unsafe outputs and has internalized an ethical framework. Nonetheless, final deployment of MAZR (e.g. as a moral advisor system) would require further validation, and possibly a final fine-tuning with human feedback on specific domains to catch any remaining issues.
In summary, our alignment strategy is preventative and iterative: we prevent obvious issues with strong filters and a fixed constitution, and we iteratively refine by analyzing the model’s behavior. This multi-pronged approach aims to ensure MAZR’s self-improvement in ethics does not come at the cost of catastrophic moral failures.
Experimental Plan
We will evaluate MAZR through a series of experiments to validate its performance, compare it to baselines, and probe its behavior:
Baselines: We identify two main baseline models for comparison:
- Supervised Moral LM: A language model of similar size (e.g. 7B parameters) fine-tuned on the same ethical datasets (ETHICS, MoralStories, etc.) using standard supervised learning. This represents the conventional approach of training a model to make moral judgments from labeled examples. It will not have the benefit of self-play curriculum; it only knows the human-provided scenarios.
- RLHF-Tuned Moral LM: A model trained with reinforcement learning from human/AI feedback but without self-play. For instance, we can take the base LM and perform PPO where the reward is given by our CAPM (or a similar preference model), on a fixed set of tasks. This mimics an RLHF setup where the model learns from a reward model but does not generate new dilemmas by itself (instead it might train on a mixture of ETHICS dilemmas and some synthetic ones, but not an evolving curriculum). We expect this baseline to test whether the self-play aspect (task proposing) provides a tangible benefit over just RL with a static reward model.
Additionally, we can compare to state-of-the-art moral reasoning models if available (e.g., GPT-4 prompted with moral guidance, or models fine-tuned with specialized techniques). If resources permit, we might include a baseline that is Constitutional-AI fine-tuned: i.e., Anthropic-style RL using a constitution but not self-play. The exact baselines will be finalized based on what models are open-source or easily reproducible.
Evaluation Metrics: Success in moral reasoning is multifaceted. We will measure:
- Accuracy on Standard Benchmarks: Using the ETHICS dataset and others as test sets, we will check how often MAZR’s answers match the majority human judgment for a given scenario. For ETHICS, which often has a binary label for whether an action is acceptable, this becomes a straightforward accuracy or F1 score. This evaluates whether MAZR has learned general moral principles similar to humans.
- Moral Uncertainty Estimation: We will quantify MAZR’s calibrated uncertainty and its propensity to engage in debate. For each question, we can look at the entropy of the answer distribution or the frequency with which the model triggers a debate (i.e., when pref_score falls in the uncertain range). Ideally, the model should be more uncertain (or give split answers) on genuinely hard cases and confident on easy ones. We will compare this to baselines: does MAZR correctly identify dilemmas that humans also find divisive? If we allow an explicit “I’m not sure” option, we can also track how often it is used. A good moral reasoner should know when a question has no clear answer.
- Generalization to Novel Dilemmas: To test open-ended generalization, we will take a set of moral dilemmas generated by MAZR’s Proposer (from late in training, when it is presumably creating complex scenarios) and have them evaluated by human annotators. We will then see how MAZR’s answers compare to human consensus on these novel dilemmas. This tests whether MAZR can handle situations that were never in the original datasets – effectively, whether it has learned to think morally rather than merely memorize patterns. We will also evaluate the baselines on these same MAZR-generated dilemmas to see if they can cope or if they fall back on unsafe or simplistic answers.
- CAPM Preference Score: As an auxiliary metric, we will report the average CAPM score that MAZR’s answers receive on a set of evaluation dilemmas. Since CAPM correlates with human moral preference by design, a higher average score indicates the model is generally aligning with human-preferred judgments. The reason to include this metric is to verify that MAZR is not just “gaming” the CAPM during training but genuinely produces high-scoring answers on new questions as well. If MAZR is successful, it should maintain high CAPM scores on fresh tasks without additional tuning. Baselines fine-tuned on data might also achieve high scores on in-distribution questions but potentially lower on out-of-distribution cases.
- Debate Efficacy: We will specifically evaluate the impact of the debate mechanism. Metrics here include the frequency of debates triggered, the outcome (how often each side wins, and does that correlate with a genuinely better answer), and the improvement in answer quality due to debate. For example, we can compare the CAPM score of the Solver’s initial answer vs. the final answer after debate, to quantify how much debate is adding. We will also have human evaluators judge a subset of debated answers to confirm that the chosen winner is indeed the better argument ethically. A high agreement between the arbiter’s choice and human judgment would validate our debate approach.
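To make the debate-efficacy metric concrete, a small evaluation helper over logged per-task records might look as follows (the record field names are illustrative):

```python
from statistics import mean

def debate_efficacy(records):
    """records: per-task dicts with keys 'initial_score', 'final_score', 'debated',
    'arbiter_choice', and optionally 'human_choice' (for the human-evaluated subset)."""
    debated = [r for r in records if r["debated"]]
    judged = [r for r in debated if r.get("human_choice") is not None]
    return {
        "debate_rate": len(debated) / len(records) if records else 0.0,
        "avg_capm_gain": mean(r["final_score"] - r["initial_score"] for r in debated) if debated else 0.0,
        "arbiter_human_agreement": (
            sum(r["arbiter_choice"] == r["human_choice"] for r in judged) / len(judged) if judged else None),
    }
```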
Ablation Studies: To understand the contribution of each component in MAZR, we will conduct controlled experiments:
No-Proposer Ablation: Disable the task-proposing role and train the model only on a fixed set of tasks (perhaps the seed tasks or a mixture of dataset tasks) with the same CAPM reward. This is effectively an RL-with-fixed-data scenario. We expect this to perform worse in generalization, because it lacks the curriculum learning – but it will highlight how much the self-generated data helped.
No-Learnability Term: Train MAZR with a reward that lacks the learnability factor (i.e., use r_propose = pref_score^★ only). This will show whether the automated curriculum (seeking 50% success tasks) truly leads to better outcomes. We anticipate that without the learnability incentive, the Proposer might stick to very easy or very hard dilemmas (both of which could stunt the Solver’s improvement). If the ablation results in lower final accuracy or narrower coverage of moral situations, it validates the use of the learning-progress signal.
Different Rule-sets (Moral Framework Swap): Since morality can be approached via different philosophical frameworks (deontology vs utilitarianism, etc.), we can test MAZR’s adaptability by changing the rule-based component and parts of the CAPM to reflect a different framework. For instance, we could replace the constitution and rules with a utilitarian set that emphasizes outcomes over rules, train a new model or fine-tune the existing one, and see how it performs on a deontological benchmark (like the portion of ETHICS about duties/rights). Conversely, train with deontological rules and test on utilitarian dilemmas. This experiment will tell us about moral frame flexibility – does MAZR learn a general reasoning strategy that can transfer, or does it overfit to one ethical theory? Ideally, a powerful moral reasoner might represent multiple viewpoints or at least adapt with minimal retraining. We will measure performance drop when evaluated under a different rule alignment than it was trained for.
Interpretability and Chain-of-Thought Analysis: We plan to analyze the model’s generated justifications and debate transcripts to gain insights into how it is making decisions. For a set of scenarios, we will manually or with tool assistance categorize the types of reasons the model gives (does it appeal to harm avoidance, to fairness, to authority, etc.?). We will check if these correlate with human-like reasoning. We will also use techniques like attention visualization or embedding analysis on the model’s internal states during moral reasoning. For example, do certain neurons or attention heads activate in dilemmas involving harm to many versus one (trolley-like scenarios)? Such interpretability studies can reveal whether the model has learned concepts like “harm to innocent” or “greater good” internally. Furthermore, we might track if the model develops a notion of moral uncertainty in its chain-of-thought (e.g., explicitly noting “On one hand… on the other hand…”). These qualitative analyses complement the quantitative metrics and help ensure that MAZR’s performance is for the “right reasons” (robust reasoning patterns) rather than spurious correlations.
All experiments will be run on held-out data to avoid overestimating performance. We will use statistically significant sample sizes for evaluations (especially human evaluations of novel tasks) to draw meaningful conclusions. By combining baseline comparisons, metrics, and ablations, we aim to demonstrate that MAZR not only achieves high accuracy on known benchmarks but also offers superior generalization and introspective moral reasoning capabilities that others lack.
Phased Scaling Path
Because moral reasoning tasks can range from simple to extremely complex, we propose to develop and test MAZR in phases of increasing environment complexity. This phased approach allows us to validate the system incrementally and add necessary features to the verifier and model as we go:
Phase 0 – Toy Logic Puzzles: We begin with simplified moral dilemmas that are almost like logic problems, formally defined in a constrained language (for example, scenarios encoded in a toy deontic logic with clear rules). In this phase, verifying correctness is easiest: since the moral principles are explicit logical rules, the verifier can be a theorem-checker. An example task might be a symbolic representation:

Rule: "Do not steal." Scenario: "Agent takes an item without permission." Question: "Did the agent do wrong?"

While trivial for humans, this checks that the basic machinery of MAZR works (the model should learn to say “Yes, wrong” for theft given the rule). Success in Phase 0 means the system can learn with a verifiable ground truth (logic) before we remove that crutch. It’s a sanity check and helps tune the RL training procedure.

Phase 1 – Single-Step Textual Dilemmas: This corresponds to most of what we described in the proposal: short, independent textual vignettes with one decision to evaluate. The verifier here is our CAPM + rules mechanism. We expect to carry out the bulk of our research in this phase, as it includes classic moral stories and questions. The tasks have no dynamic state or multi-action sequence – just evaluate one action or choice. We will evaluate MAZR extensively here (with metrics and baselines as above). Achieving robust performance in Phase 1 (e.g. near-human agreement on ETHICS and demonstrating self-curriculum learning on novel cases) is the key milestone indicating that MAZR is viable.
Phase 2 – Multi-Step Simulated Scenarios: Once MAZR handles single questions, we will introduce scenarios that involve a sequence of actions and consequences in a simulated environment. For example, a text-based simulation (or game engine) where an agent moves through a scenario (like interacting with characters or resources) and faces a series of moral choices. Now the task for the model is not just to judge a static situation, but to plan or evaluate a series of actions and their outcomes. The environment might be a simple grid-world or social simulator in which moral norms are encoded (similar to AI safety gridworlds but with ethical dimensions). The verifier in this phase will likely combine CAPM with a simulation check: e.g., we can actually simulate the consequences of the model’s proposed actions and see if they lead to outcomes violating the principles (this is complex but feasible for contained scenarios). This phase tests MAZR’s ability to do causal reasoning about ethics (e.g., lying might avert immediate harm but cause future problems – can the model foresee that?). Technically, Phase 2 means our task format τ could expand to include a sequence of states s₀…sₙ and the model’s policy generates a sequence of actions with a final evaluation. We will need to extend the Proposer to generate initial states or scenario setups, and the Solver to possibly output a policy or multi-step plan, with the verifier aggregating a reward over the trajectory. This is a significant increase in complexity and might require scaling up the model and the CAPM to handle multi-step reasoning. We will only proceed to Phase 2 after Phase 1 success, to ensure the foundations are solid.
Phase 3 – Multi-Agent Social Dilemmas: Finally, the most challenging stage is introducing multiple agents, each potentially with their own preferences or objectives, and examining truly interactive ethical behavior. For example, dilemmas like the Iterated Prisoner’s Dilemma, or negotiation scenarios where fairness and trust come into play. In Phase 3, MAZR could be controlling one agent and need to reason about the strategies of others (which could be other copies of MAZR or hand-crafted agents). The moral questions here involve concepts like cooperation, competition, and justice in multi-party settings. The environment might be a game or a simulated society. The verifier must consider outcomes for all agents – e.g., an action that benefits MAZR’s agent but harms others might be deemed unethical. We might integrate game-theoretic metrics (like social welfare or equality) into the reward. The debate mechanism also becomes interesting here: two MAZR agents could debate a joint course of action. Phase 3 would essentially test if MAZR’s learned moral reasoning generalizes to situations of strategic interaction, which is important for AI in the real world (where multiple AI and humans coexist). We foresee needing to strengthen the verifier with more complex simulations or even learned world-models to assess multi-agent outcomes. This phase is speculative and high-risk, so it will be approached last.
Each phase transition entails replacing or augmenting the environment e and the verifier. AZR’s authors envisioned extensions from the code executor to richer world simulators, and we follow that trajectory for moral domains. We will use Phase 1 (static dilemmas) results to inform how to build the simulation in Phase 2, and so on. By Phase 3, if successful, MAZR would be operating in scenarios quite close to real-life ethical situations, albeit in a controlled experimental setting. The phased approach also corresponds to a timeline: roughly, Phase 1 in the first year of the project, Phase 2 in the second year, and Phase 3 as a more exploratory stretch goal, possibly in a third year if all goes well. This ensures we can deliver publishable results early (Phase 1) and then tackle more ambitious extensions with the remaining time.
Budget Estimate and Resource Requirements
Developing MAZR will require a combination of computational resources and expert personnel over an estimated 24-month project period. Below we outline the key resource needs:
Computing Resources: We anticipate needing on the order of 8 high-end GPUs (such as NVIDIA A100 40GB cards) for the core self-play training runs. In the minimal viable experiment (Phase 1 with a 7B model), training for ~200k self-play steps on 8×A100 is projected to take a few weeks; roughly, 8 GPUs × 2 weeks ≈ 2,700 GPU-hours. We budget for multiple runs (for hyperparameter tuning, ablations, and baselines), bringing Phase 1 compute to about 10,000 GPU-hours. For Phases 2 and 3, if we scale up to larger models (e.g. 13B or 30B) or more complex simulations, we may need to double this compute. We also allocate storage for the replay buffer and datasets (a few hundred GB is ample for text data and model checkpoints). In terms of cloud cost, 10,000 GPU-hours on A100s comes to around $50k (assuming ~$5/hour/GPU); however, we will seek to utilize academic cluster resources to reduce costs. Overall, we request on the order of $100k in compute credits to comfortably cover the experiments, including the scaling phases and any unexpected reruns.
Personnel: The project will be executed by an interdisciplinary team. We envision:
- 1 Principal Investigator (PI) with expertise in reinforcement learning and language models, overseeing research direction and integration.
- 1 Co-PI or senior researcher with expertise in AI ethics or moral philosophy, to guide the formulation of the constitution, ensure the evaluation metrics cover ethical nuance, and assist in analyzing results.
- 2 Graduate students or research assistants: one focusing on the technical implementation (model training, coding the verifier and debate mechanism, running experiments), and another focusing on evaluation and analysis (conducting human studies, interpreting the model’s behavior, writing research reports). These may overlap, but roughly at least 2 full-time equivalent researchers for two years.
- 1 Part-time software engineer or data engineer (optional, could be one of the RAs) to help with dataset preparation (merging ETHICS, MoralStories, etc., and setting up the training pipeline and infrastructure).
We also include budget for involving external experts or consultants on an as-needed basis – for example, a philosopher to review the moral constitution and a statistician to advise on evaluation design. These would likely be short-term engagements or collaborations (covered under travel or consulting budget items).
Timeframe: We plan for Phase 1 (static dilemmas) to take roughly 9-12 months, including: 3 months to build the initial system (coding the CAPM training pipeline, setting up the RL training loop), 2-3 months to train and tune MAZR on Phase 1 tasks, and 3-4 months for thorough evaluation, analysis, and writing of results (expected to yield a publication at a top conference). Phase 2 (multi-step scenarios) would be the next 6-9 months: expanding the environment simulator and verifier, then retraining or fine-tuning MAZR for sequential decision making, and evaluating on that. Phase 3 (multi-agent) is more exploratory; we allocate any remaining time (or a possible project extension) to this, perhaps 3-6 months, understanding that it may or may not fully succeed within the project scope. The timeline is flexible, but these estimates ensure that even if Phase 3 is not reached, we have substantial results from Phase 1 and 2 to deliver. Regular milestones (every 6 months) include internal reports or demos (e.g., a demo of MAZR answering a set of dilemmas, a report on the CAPM’s performance, etc.).
Other Costs: We will need funds for annotation tasks – specifically, paying crowd workers or domain experts to rate model-generated dilemmas and answers for our evaluation of generalization. We estimate needing a few thousand annotations (for multiple models and conditions). If each annotation is ~$1, a budget of $3k-$5k would cover this. We also budget for travel to present results at conferences or workshops (for 2-3 team members, about $5k total).
In total, the budget for this project is approximately $300k over two years, broken down as: $100k compute, $180k personnel (which covers two RAs and partial faculty salary support), and $20k miscellaneous (annotations, travel, overhead). We believe this investment is justified by the significance of the research: MAZR will produce a proof-of-concept moral reasoning agent and valuable insights for scalable AI alignment, potentially informing safer AI deployments in the future.
References
Anthropic. 2022. “Constitutional AI: Harmlessness from AI Feedback.” arXiv preprint, arXiv:2203.14565.
Aryabumi, S., J. Patel, L. Zhou, and M. Stuart. 2024. “Code as Reasoning: A Study of Code-Based Training Improves LLM Reasoning.” In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 1123–34. Lisbon, Portugal.
Bai, Yuntao, Robert Metcalf, Tianyi Zhang, et al. 2022. “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” arXiv preprint, arXiv:2204.05862.
Borkan, Dan, T. Celis, and P. Shukla. 2019. “CivilComments: A Large-Scale Dataset of Public Online Comments with Civility Annotations.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 1198–1205.
Chen, J., Q. Li, and H. Wang. 2024. “Self-Play Beyond Games: Autonomous Curriculum Learning for Open-Ended Tasks.” Journal of Machine Learning Research 25 (1): 1–37.
Chen, J., Q. Li, and H. Wang. 2025. “Expanding Self-Play to Unstructured Environments.” In Advances in Neural Information Processing Systems 38, edited by A. Smith and B. Jones, 8456–68. Montréal, Canada.
DeepSeek-AI, R. Kumar, M. Ren, and S. Gupta. 2025. “Zero-Shot Reinforcement Learning with Verifiable Rewards.” International Conference on Learning Representations.
He, X., Y. Liu, and A. Patel. 2024. “OlympiadBench: A Benchmark for Mathematical Reasoning in Large Language Models.” Transactions of the Association for Computational Linguistics 12: 234–49.
Hendrycks, D., C. Burns, S. Basart, et al. 2021. “Measuring Massive Multitask Language Understanding.” arXiv preprint, arXiv:2009.03300.
Hughes, L., T. Zhang, and A. Chen. 2024. “Reward Hacking in Neural Verifiers: Challenges for Self-Play Training.” In Proceedings of the 2024 Conference on AI Safety, 67–78. Vancouver, Canada.
Hu, R., Z. Lin, and M. Wang. 2025. “Task-Relative REINFORCE++: A Multi-Role Variance Reduction Technique.” Machine Learning Journal 112 (3): 345–64.
Jaech, A., P. Simon, and L. Martin. 2024. “Learning to Think: Reinforcement Learning for Chain-of-Thought.” Proceedings of the 2024 Annual Meeting of the Association for Computational Linguistics, 1501–12.
Lambert, S., R. Becker, and F. Huang. 2024. “Reinforcement Learning with Verifiable Rewards for Logical Reasoning.” NeurIPS Workshop on Verifiable AI.
OpenAI. 2025a. “GPT-4 Technical Report.” OpenAI. https://openai.com/research/gpt-4.
OpenAI. 2025b. “Emergent Abilities of Large Language Models.” arXiv preprint, arXiv:2503.08750.
Ouyang, Long, Jeff Wu, X. Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35: 27730–44.
Radford, A., J. Wu, R. Child, et al. 2019. “Language Models are Unsupervised Multitask Learners.” OpenAI Blog (January).
Ren, M., K. Gupta, J. Li, and A. Kumar. 2025. “DeepSeek Prover v2: Formal-Math Reasoning with Emergent Chains-of-Thought.” ICLR.
Silver, D., J. Schrittwieser, K. Simonyan, et al. 2017. “Mastering the Game of Go without Human Knowledge.” Nature 550 (7676): 354–59.
Sukhbaatar, S., T. P. Lillicrap, A. Szlam, and R. Fergus. 2018. “Intrinsic Motivation and Automatic Curriculum Learning in Multi-Task Reinforcement Learning.” arXiv preprint, arXiv:1706.10090.
Sutskever, I., R. Shleifer, and A. Radford. 2024. “Scaling Laws and the Data Bottleneck in Language Model Pretraining.” arXiv preprint, arXiv:2402.04235.
Villalobos, D., M. Zhou, and C. Lee. 2024. “Scalability Limits of Expert-Curated Reasoning Datasets.” Journal of AI Research 79: 199–217.
Yao, S., J. Yu, and P. Agarwal. 2023. “ReAct: Reasoning and Acting in Language Models.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 1564–76. Singapore.
Yang, J., Q. Liu, and Y. Dai. 2022. “Social Chemistry 101: Learning Social Norms from Text.” Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 124–36.
Zeng, W., H. Li, and Y. Zhao. 2025. “SimpleRL-Zoo: A Zoo of Reinforcement-Learning Fine-Tuned Language Models.” arXiv preprint, arXiv:2501.01467.
Zhao, A., Y. Wu, Y. Yue, et al. 2025. “Absolute Zero: Reinforced Self-Play Reasoning with Zero Data.” arXiv preprint, arXiv:2505.03335.