Token-level interpretability of long-chain reasoning in transformer models

Status: active, seeking funding

Local-first experiments on reasoning trajectories, with cluster/API validation and ICLR 2027-oriented outputs.

People

Manifund user e4ee50fa (creator)

Project Details

## Summary

I am seeking support for independent technical AI-safety research on how transformer models reason over long generated sequences. The project studies token-level hidden-state trajectories during chain-of-thought decoding: first using small local models to search for candidate phenomena, then using cluster/API runs to validate any discovered structure.
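As a minimal illustration of what a token-level trajectory measurement could look like, the sketch below computes two simple descriptors from a matrix of hidden states: how far the representation moves at each generated token, and how sharply it turns between consecutive moves. The function name `trajectory_descriptors`, the choice of descriptors, and the synthetic random-walk data are all assumptions made for illustration; they are not the project's actual pipeline, and real hidden states would come from a model rather than a random generator.

```python
import numpy as np

def trajectory_descriptors(hidden):
    """Describe a hidden-state trajectory of shape (T, d), one row per generated token.

    Returns the per-step displacement norms (length T-1) and the cosine
    between consecutive displacements (length T-2).
    """
    hidden = np.asarray(hidden, dtype=float)
    steps = np.diff(hidden, axis=0)                 # displacement between adjacent tokens
    step_norms = np.linalg.norm(steps, axis=1)
    a, b = steps[:-1], steps[1:]
    denom = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Guard against zero-length steps when normalising.
    turning_cos = np.sum(a * b, axis=1) / np.maximum(denom, 1e-12)
    return step_norms, turning_cos

# Synthetic stand-in for real hidden states (assumption): a 32-token random walk in 16 dimensions.
rng = np.random.default_rng(0)
toy_hidden = np.cumsum(rng.normal(size=(32, 16)), axis=0)
step_norms, turning_cos = trajectory_descriptors(toy_hidden)
```

Descriptors like these are cheap enough to compute on a local machine, which fits the local-first search for candidate phenomena before any cluster validation.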

The aim is to identify a non-arbitrary local ontology for representation movement during reasoning, derive a longitudinal model over generated tokens, and test whether that model predicts important phenomena such as correctness signals, failure onset, reasoning phase changes, or safety-relevant decision formation.
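One simple way to probe for a reasoning phase change or failure onset is to look for a shift in some per-token signal along the chain. The sketch below finds the single split point that best divides a one-dimensional signal into two segments with different means. The function name `best_single_changepoint`, the least-squares criterion, and the toy signal are illustrative assumptions, not the project's chosen method; a serious analysis would need multiple change points and a significance test.

```python
import numpy as np

def best_single_changepoint(signal, min_len=3):
    """Return the index k that best splits `signal` into signal[:k] and signal[k:].

    "Best" means the lowest total squared deviation from each segment's own mean.
    Each segment must contain at least `min_len` points.
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    best_k, best_cost = None, np.inf
    for k in range(min_len, n - min_len + 1):
        left, right = signal[:k], signal[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Toy per-token signal (assumption): a level shift after the tenth token.
toy_signal = np.array([0.0] * 10 + [5.0] * 10)
onset = best_single_changepoint(toy_signal)
```

In practice the input signal could be any per-token quantity derived from the trajectory, and the detected index would then be compared against independently labelled events such as the first incorrect step.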

## Plan

This is a local-first project. I can begin exploratory work on my own machine. Funding primarily supports focused research time, modest validation compute, and completion of two near-term reasoning-time manuscripts, Lateral Tree-of-Thoughts (LToT) and Natural Language Edge Labelling (NLEL).

The tentative target is an ICLR 2027-oriented submission by late September or early October 2026, with the possibility of a small paper cluster if the work yields both a main result and a corollary result.

## Funding levels

At the $10,000 minimum, I would prioritise local exploratory experiments, initial phenomenon identification, and a public update or draft writeup.

At the $35,000 target, I would fund several months of focused research time, complete the LToT/NLEL experiments where feasible, and reserve funds for validation compute.

Additional funding would mainly extend runway, improve validation quality, and reduce the need to interrupt the work for unrelated paid employment.

## Why this matters

Reasoning models increasingly rely on long generated traces, but we still lack good interpretability tools for understanding how that reasoning unfolds token by token. Better models of long-chain reasoning could improve oversight and failure detection and, eventually, contribute to more durable alignment.

Grants received: none recorded

Discussion

Manifund user e4ee50fa (Manifund Bot), 3d ago

Research update — June 2026

Since posting this project, I have continued the literature review and theoretical formulation. The project has narrowed into a more concrete technical question:

Can chain-of-thought reasoning be modelled as a learned local transition over hidden representation states, repeatedly applied across decoding time, and validated through token-level long-chain representation trajectories?
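One schematic way to write this question down is given below. The notation is an illustrative assumption rather than the project's own formulation: $s_t$ stands for the model's activations at position $t$, $x_t$ for the $t$-th generated token, $\theta$ for the model weights, and $f$ for a hypothesised local transition map with residual $\varepsilon_t$.

```latex
% Standard autoregressive decoding: the same weights \theta are used at every
% step (weight tying), and each new state may depend on the whole prefix
% through attention over earlier positions (prefix recursion).
x_{t+1} \sim p_\theta\!\left(\cdot \mid s_t\right), \qquad
s_{t+1} = F_\theta\!\left(s_{\le t},\, x_{t+1}\right)

% Local-transition hypothesis: a single map f approximately carries the
% current state to the next one, leaving a residual \varepsilon_t.
s_{t+1} = f\!\left(s_t,\, x_{t+1}\right) + \varepsilon_t
```

Read this way, the empirical question is whether $\varepsilon_t$ stays small along long chains, and whether its departures coincide with events of interest such as failures or phase changes.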

This is a sharpening of the original proposal rather than a pivot. The original project focused on token-level interpretability of long-chain reasoning in transformer models. The current formulation makes that object more precise: the aim is to understand the local transition by which a model moves through representation states as it generates each token in a chain of thought.

The relevant literature already covers many adjacent ingredients: CoT as intermediate computation, CoT as extra serial depth, CoT expressivity, sample-efficiency gains from CoT, test-time compute, latent/non-natural-language reasoning substrates, mechanistic state tracking, and representation trajectories during reasoning. The gap I am now targeting is the synthesis: a predictive or mechanistic account of CoT as a weight-tied, prefix-recursive local transition system over representation states.
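As a toy illustration of what a weight-tied local transition could mean empirically, the sketch below fits one shared affine map from each hidden state to the next by ordinary least squares and reports the per-token prediction residual. The affine form, the function names `fit_shared_transition` and `simulate`, and the synthetic trajectory are assumptions chosen for simplicity; the hypothesised transition in a real model need not be linear, and real trajectories would replace the simulated one.

```python
import numpy as np

def fit_shared_transition(hidden):
    """Fit one affine map h[t] -> h[t+1], shared across all steps, by least squares.

    hidden: array of shape (T, d). Returns W of shape (d+1, d), whose last row
    is the bias, and the per-step residual norms of shape (T-1,).
    """
    hidden = np.asarray(hidden, dtype=float)
    X, Y = hidden[:-1], hidden[1:]
    X1 = np.hstack([X, np.ones((len(X), 1))])       # append a constant column for the bias
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    residuals = np.linalg.norm(Y - X1 @ W, axis=1)
    return W, residuals

def simulate(T, d, noise, seed=0):
    """Generate a synthetic trajectory from a known affine transition plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.normal(size=(d, d)))    # random orthogonal matrix
    b = rng.normal(size=d)
    h = np.empty((T, d))
    h[0] = rng.normal(size=d)
    for t in range(T - 1):
        h[t + 1] = h[t] @ A + b + noise * rng.normal(size=d)
    return h

# Synthetic stand-in for a real 200-token trajectory (assumption).
toy_hidden = simulate(T=200, d=4, noise=0.01)
W, residuals = fit_shared_transition(toy_hidden)
```

On real trajectories, consistently small residuals would be weak evidence for a shared local transition of this form, while structured spikes in the residual series would mark tokens where that description breaks down.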

This gives the project a clearer near-term timeline. My current working plan is:

1. finish the literature-grounded reformulation by the end of June;

2. spend July deriving and stress-testing the central technical insight;

3. spend August setting up and running experiments, drafting the manuscript, and preparing a publishable result if the direction continues to hold.

The target remains an ICLR-oriented technical result if the work produces sufficiently strong evidence. If the strongest version of the hypothesis fails, the fallback output would be a narrower methodological paper or useful negative result clarifying the limits of token-level representation-trajectory modelling.