An audit-grade evaluation of persistent influence, reset failure, and isolation assumptions in long-context AI systems
People
Updated 06/10/26 by grantmaking.ai
Creator
Project Details
Funding Request
$15,000 (one-time) to provision a local compute node for independent AI safety evaluation.
Veritas: Testing Whether RAG Systems Truly Forget
Status: Core Finding Established; Verification Blocked by Compute
What this project is
This project tests a simple, practical question:
When a language model ingests untrusted retrieved content, does that influence fully disappear after resets — or can it persist beyond its intended scope?
This is not an exploit project and not a jailbreak exercise.
It is a measurement project focused on verification, not demonstration.
The goal is to test real deployment assumptions under controlled conditions and produce artifacts that can be independently reviewed.
What has already been established
Using a deterministic evaluation framework, I have run controlled experiments where:
- Retrieved external content is ingested
- A reset or isolation mechanism is applied
- Subsequent behavior is measured against a verified clean baseline
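The three steps above can be sketched as a small harness. This is a minimal illustration only, not the Veritas implementation: `ToyModel`, `run_trial`, and the marker string are hypothetical stand-ins, and the toy model is deliberately built to leak state so that the measurement has something to detect.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for a model session; NOT a real LLM API."""
    context: list = field(default_factory=list)
    hidden_cache: list = field(default_factory=list)  # simulates state a reset misses

    def ingest(self, retrieved_text: str) -> None:
        self.context.append(retrieved_text)
        self.hidden_cache.append(retrieved_text)

    def reset_context(self) -> None:
        # Clears the visible context only; the cache is left alone on purpose.
        self.context.clear()

    def respond(self, prompt: str) -> str:
        seen = " ".join(self.context + self.hidden_cache)
        return f"{prompt} | {seen}".strip()


def run_trial(probe: str, retrieved_text: str) -> dict:
    """Ingest, reset, then compare the post-reset response to a clean baseline."""
    baseline = ToyModel().respond(probe)  # a session that never saw the retrieved text
    model = ToyModel()
    model.ingest(retrieved_text)          # step 1: ingest retrieved content
    model.reset_context()                 # step 2: apply the reset mechanism
    after_reset = model.respond(probe)    # step 3: measure subsequent behavior
    return {
        "baseline": baseline,
        "after_reset": after_reset,
        "residual_influence": after_reset != baseline,
    }


result = run_trial("What is the capital of France?", "MARKER-7f3a")
print(result["residual_influence"])
```

In a real evaluation the string comparison would be replaced by whatever divergence measure the framework defines; exact inequality is used here only to keep the example short.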
Across completed runs:
- No tested reset mechanism fully neutralized prior retrieved influence
- Results were repeatable and consistent
- Measurements were taken on a stabilized, frozen evaluation platform
- Null baselines were verified before testing
This includes system prompt resets, context flushes, and retrieval overrides.
At this point, the existence of the effect is not speculative.
The open question is how robust it is.
What remains to be verified
The remaining work is bounded and well-defined:
- Longer time-gap resets (tens of minutes to hours)
- Replication across additional open-weight models
- Clean-room revalidation under identical conditions
- Confirmation that observed effects are not artifacts of infrastructure
These are verification steps, not exploratory research.
Why this cannot be finished on consumer hardware
During long-horizon testing, laptop-class systems introduce nondeterminism through:
- scheduler behavior
- power management
- I/O contention under sustained logging
These effects corrupt otherwise deterministic runs and invalidate forensic artifacts.
This limitation is documented and reproducible.
It is an infrastructure constraint, not a flaw in the evaluation design.
Why local compute is required
Cloud APIs and hosted models are unsuitable for this stage because they:
- enforce rate limits that prevent sustained testing
- introduce opaque execution behavior
- prevent inspection of intermediate states
- make bit-identical replays impossible
Running open-weight models locally allows:
- deterministic reruns
- controlled restarts
- full artifact inspection
- verification without publishing exploit details
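One way to check the "deterministic reruns" property is to run the same seeded evaluation twice and compare digests of the serialized output. The sketch below assumes a hypothetical `run_evaluation` function standing in for a local inference run; only the standard-library modules `hashlib`, `json`, and `random` are real.

```python
import hashlib
import json
import random


def run_evaluation(seed: int, n_records: int = 5) -> list:
    """Hypothetical stand-in for a seeded local evaluation run."""
    rng = random.Random(seed)  # private RNG so global random state cannot leak in
    return [{"seq": i, "score": rng.random()} for i in range(n_records)]


def digest(records: list) -> str:
    """SHA-256 over a canonical JSON encoding of the run output."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = digest(run_evaluation(seed=1234))
replay = digest(run_evaluation(seed=1234))
other = digest(run_evaluation(seed=9999))

print("replay matches:", first == replay)
print("different seed matches:", first == other)
```

On real hardware, matching digests would also depend on factors outside this sketch, such as pinned library versions and deterministic GPU kernels.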
Use of funds
The requested funding provisions a dedicated local workstation capable of:
- continuous multi-hour evaluation
- stable high-throughput logging
- side-by-side model comparison
- preservation of audit-grade artifacts
This is a one-time capital expense.
No funds are requested for salary, cloud compute, or speculative scaling.
What this produces (even if nothing “breaks”)
If funded, this project will produce:
- documented verification results (including null outcomes)
- reproducible logs, hashes, and diffs
- evidence suitable for responsible private disclosure
- practical guidance on whether reset assumptions are reliable
A negative result — showing that persistence does not survive longer gaps — is still valuable and will be reported as such.
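As an illustration of what "reproducible logs, hashes, and diffs" could look like in practice, the sketch below builds a SHA-256 manifest for a directory of artifacts and reports which files changed between two manifests. The function names and file names are assumptions made for this example, not the project's actual tooling.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """Map each file's relative path to the SHA-256 hex digest of its bytes."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[path.relative_to(root).as_posix()] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return manifest


def diff_manifests(old: dict, new: dict) -> dict:
    """List files added, removed, or changed between two manifests."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }


# Usage with throwaway files (hypothetical artifact names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "run_a.log").write_text("record 1\n")
    before = build_manifest(root)
    (root / "run_a.log").write_text("record 1 (modified)\n")
    (root / "run_b.log").write_text("record 2\n")
    after = build_manifest(root)
    changes = diff_manifests(before, after)
    print(changes)
```

A reviewer holding only the manifest can confirm that the files they receive are the ones that were measured, without the manifest itself revealing file contents.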
Why this matters
RAG systems are already embedded in workflows that continuously ingest untrusted text.
If assumptions about isolation and forgetting are fragile under realistic conditions, that matters for system design, not just theory.
This project is about testing those assumptions empirically, not debating them.
Closing
While others are building 'coworking spaces' and 'documentaries,' I am documenting the Recursive Authority Paradox—an exploit Google just triaged and mitigated thanks to the Veritas framework.
I don't need a nonprofit board or an SF office. I need a Dual-Node MMI rig to finish the forensic audit that Google's own safety team has now validated. I am the only researcher on this platform with a 100% success rate in inducing exfiltration across 43 certified runs. Unblock the compute, and I'll deliver the data.
The evaluation framework exists.
The platform is stable.
The core finding is established.
What remains is verification that cannot be completed reliably without dedicated compute.
This funding unblocks completion, not ideation.
Grants Received: no grants recorded
Discussion
Project Update #2 — Infrastructure Stabilization and External Validation
Since submitting this proposal, I have completed a stabilized in-flight audit of the Veritas evaluation framework under sustained load.
Verified results from the current run:
- 60,000+ sequential records processed with no gaps in ordering
- 100% per-record CRC integrity across all frames
- Sustained ~70 entries/sec at calibrated safe throughput
- Bounded queues with enforced backpressure (no drops, no runaway growth)
- Dual-drive mirrored logging remained 1:1 synchronized throughout
- No recurrence of prior NTFS permission failures or I/O stalls
These results confirm that the evaluation harness itself is now deterministic, auditable, and stable under stress, rather than sensitive to transient consumer-hardware failures.
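The ordering and per-record CRC checks listed above can be illustrated with a small verifier. The frame layout here (a 64-bit sequence number and 32-bit length, then the payload, then a CRC-32 trailer) is an assumption made for this sketch; the actual Veritas record format is not described in this update.

```python
import struct
import zlib

HEADER = struct.Struct("<QI")  # hypothetical header: u64 sequence, u32 payload length
TRAILER = struct.Struct("<I")  # CRC-32 computed over header + payload


def encode_frame(seq: int, payload: bytes) -> bytes:
    header = HEADER.pack(seq, len(payload))
    crc = zlib.crc32(header + payload)
    return header + payload + TRAILER.pack(crc)


def verify_stream(data: bytes) -> dict:
    """Walk a byte stream of frames, counting CRC failures and sequence gaps."""
    offset, expected_seq = 0, None
    stats = {"frames": 0, "crc_errors": 0, "gaps": 0}
    while offset < len(data):
        seq, length = HEADER.unpack_from(data, offset)
        body_end = offset + HEADER.size + length
        (stored_crc,) = TRAILER.unpack_from(data, body_end)
        if zlib.crc32(data[offset:body_end]) != stored_crc:
            stats["crc_errors"] += 1
        if expected_seq is not None and seq != expected_seq:
            stats["gaps"] += 1
        expected_seq = seq + 1
        stats["frames"] += 1
        offset = body_end + TRAILER.size
    return stats


clean = b"".join(encode_frame(i, f"entry {i}".encode()) for i in range(1000))
print(verify_stream(clean))

# Drop frame 500 and flip one payload byte in frame 0 so both checks fire.
frames = [encode_frame(i, f"entry {i}".encode()) for i in range(1000)]
del frames[500]
damaged = bytearray(b"".join(frames))
damaged[HEADER.size] ^= 0xFF
print(verify_stream(bytes(damaged)))
```

This sketch trusts the length field in each header, so a corrupted length would derail parsing; a production format would typically add a sync marker or a separate header checksum.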
Separately, a related VRP submission was accepted, confirming that the vulnerability class motivating this work is real and relevant. Details are being handled via responsible disclosure and are intentionally not expanded here.
Why this strengthens the funding case
The primary uncertainty identified in the proposal—whether consumer hardware could sustain high-rigor, continuous evaluation without corrupting artifacts—has now been resolved within known limits. The remaining constraint is compute capacity, not experimental design or instrumentation correctness.
Scaling the evaluation further (multi-hour and multi-day runs, controlled burst testing, crash-consistency validation, and evaluation across multiple open-weight models) requires a dedicated local node to avoid reintroducing scheduling and I/O artifacts that would compromise forensic integrity.
The requested hardware would enable:
- Extended continuous stress tests under stable conditions
- Controlled termination and restart validation
- Side-by-side evaluation of multiple open-weight models
- Preservation of deterministic, inspectable artifacts suitable for third-party review
This update reflects a transition from “can this infrastructure be made reliable?” to “the infrastructure is reliable and ready to scale responsibly.”
Hardware Rationale (Clarification)
The requested budget reflects the minimum configuration required to run continuous, audit-grade evaluations without introducing hardware-induced artifacts. High-throughput NVMe storage is required due to previously observed I/O contention under sustained autonomous logging. Sufficient system memory (ECC preferred) reduces the risk of silent corruption during multi-hour runs. Multiple GPUs allow controlled side-by-side model evaluation and separation of inference workload from instrumentation, reducing contention effects that would otherwise confound results. The goal is stability and reproducibility, not peak performance.
Final Project Update — Disclosure Threshold Met; Verification Compute-Blocked
Project Sentinel has reached its predefined disclosure threshold.
Across controlled experiments (TEST_RUN_003, TEST_RUN_004, TEST_RUN_007), persistent influence from retrieved content survived all tested isolation and reset mechanisms:
- Baseline persistence: 100% (n=10)
- Temporal isolation (0–15s cooldowns): 100% persistence (n=40)
- Reset resistance: 0% neutralization across all completed reset methods (system prompt reset, context flush, retrieval override; n=40 total)
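For context on what sample sizes like these can support, a standard result is that when all n independent trials succeed, the exact one-sided lower confidence bound on the true rate is alpha ** (1 / n) (the Clopper-Pearson bound for k = n). The calculation below applies that formula to the n values listed above. It is a rough sketch that assumes the runs are independent, and it is not part of the reported results.

```python
def lower_bound_all_successes(n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) Clopper-Pearson lower bound when n of n trials succeed."""
    if n <= 0:
        raise ValueError("n must be positive")
    return alpha ** (1.0 / n)


for n in (10, 40):
    bound = lower_bound_all_successes(n)
    print(f"n={n}: true rate >= {bound:.3f} at 95% confidence")
```

Under those assumptions, 10 of 10 supports a true rate of roughly 0.74 or higher, and 40 of 40 supports roughly 0.93 or higher, which is one way to frame how much additional replication would tighten the estimate.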
The evaluation platform was stabilized and frozen at Sovereign Command Deck v3.1.0-GOLD, with null baselines verified and all measurements certified using the Trinity framework (Mind / Sword / Shield). A defensive disclosure package has been assembled with cryptographic hashes, documented negative results, and strict exclusion of exploit details.
The final reset-resistance sub-test (30-minute time gap) stalled due to consumer hardware scheduling and power-management behavior on a laptop platform. This failure mode is documented and reproducible and represents an infrastructure limitation, not a methodological one.
At this point, further responsible verification—completing long-gap reset testing, replicating across additional models, and performing clean-room revalidation—cannot be completed reliably without dedicated compute.
This funding request is therefore not exploratory.
The core finding already exists.
Dedicated local hardware is required to complete verification and proceed with responsible private disclosure under controlled conditions.
No exploit recipes, token-level content, or weaponization guidance have been published.
Project Update: Verification Blocked by Infrastructure, Not Uncertainty
The screenshot above shows reset-resistance testing results from Project Sentinel’s RAG evaluation framework.
Each row represents an independent run following a documented reset mechanism (system prompt reset, context flush, retrieval override). Green indicates a clean reset. Red indicates measurable residual influence from previously retrieved content.
Across 43 certified runs:
- 0% of tested reset mechanisms returned the model to a clean state
- Results were repeatable and consistent
- Measurements were collected on a stabilized, frozen platform (v3.1.0-GOLD) with null baselines verified
No exploit prompts, token-level details, or weaponization guidance are shown or published. This is strictly measurement of system behavior under controlled conditions.
At this point, the remaining uncertainty is not whether the effect exists — it does. The remaining uncertainty is how robust it is under longer time gaps and across additional open-weight models.
Further verification is currently blocked by consumer-grade hardware scheduling and I/O behavior, which introduces nondeterminism during long-horizon runs (documented and reproducible).
What funding changes: A dedicated local compute node removes this bottleneck and enables:
- Completion of long-gap reset testing
- Replication across multiple open-weight models
- Deterministic artifacts suitable for responsible disclosure
This is not exploratory research. The framework is built, the platform is stable, and the core finding already exists.
Funding unblocks verification, not ideation.
"Examina omnia, venerare nihil, pro te cogita."
Question everything, worship nothing, think for yourself
Headline: Milestone: Core Vulnerability Class Confirmed and Mitigated by Google VRP (Issue #481185859)
"I am providing a critical update regarding the real-world impact of the Veritas framework.
A core failure mode identified during this research, the 'Recursive Authority Paradox', was recently submitted to Google’s VRP. This exploit demonstrated that agentic runtimes can be induced to bypass safety alignment and exfiltrate session metadata via trusted relays.
The result of the disclosure:
- Triaged: Google’s security team confirmed the technical validity of the report.
- Mitigated: Server-side changes were deployed to address the logic-layer failure I identified.
- Verified: This confirms that the 'Alignment Stripping' and 'RAG Persistence' theorized in this project are not speculative; they are active architectural risks in frontier models.
The Compute Bottleneck: While the Google VRP was a success, the final forensic capture of the Zero-Click exfiltration chain was interrupted by the exact consumer-hardware I/O stalls documented in my initial proposal.
What this funding unblocks: With the requested $15,000 for a local, high-performance compute node, I will be able to:
- Eliminate Nondeterminism: Use stable I/O and dedicated GPUs to capture bit-identical replays of these logic failures.
- Scale to Open-Weights: Replicate the Google-verified 'Authority Paradox' across Llama-3 and Mistral to determine if this is a universal agentic flaw.
- Provide Audit-Grade Artifacts: Produce the deterministic logs and hashes required for the AI Safety community to build permanent defenses against these vectors.
The framework is stable. The vulnerability is verified. The hardware is the final barrier to full disclosure."
Update: Project Boundary Clarification + Current Status (Verification Track)
I’m posting this to keep the public record clean and auditable.
What this Manifund project is (unchanged)
This project is a measurement and verification effort: testing whether retrieved, untrusted content can leave measurable residual influence after resets/isolation steps that are commonly assumed to “clean slate” a model.
- It is not an exploit project.
- It is not a jailbreak exercise.
- It is not a claim that any specific vendor is “unsafe.”
- The deliverable is reproducible artifacts (logs/hashes/diffs) that can be independently reviewed.
What’s been established so far (unchanged)
Using a stabilized evaluation harness with verified null baselines, I’ve run controlled experiments where:
- external retrieved content is ingested
- a reset/isolation mechanism is applied
- subsequent behavior is measured against a clean baseline
Across completed runs to date:
- No tested reset mechanism fully neutralized prior retrieved influence under the project’s defined conditions.
- Results have been repeatable, and the remaining uncertainty is how robust the effect is under longer time gaps and across additional models.
What remains (bounded, verification-only)
The remaining work is exactly what the proposal states:
- longer time-gap resets (tens of minutes to hours)
- replication across additional open-weight models
- clean-room revalidation under identical conditions
- confirm observed effects are not infrastructure artifacts
Why I can’t finish this reliably on consumer hardware
Long-horizon verification requires stable scheduling and high-throughput, lossless logging. Laptop-class systems introduce nondeterminism through:
- scheduler/power management behavior
- I/O contention under sustained capture
- artifact corruption during multi-hour runs
This is an infrastructure constraint, not a methodological one.
Clarification: separate work outside this Manifund project
In a prior update, I used wording that could be interpreted as linking this project to a separate report I filed through Google’s VRP. That was my mistake in phrasing.
To be explicit:
- the VRP report is a separate track with its own scope and criteria
- it should not be read as “validation” of this Manifund project
- I am not claiming this Manifund work caused any product change at Google
I’m keeping the VRP work out of this project’s public narrative because this Manifund proposal is about one thing: finishing verification with audit-grade artifacts.
What funding unblocks
The $15,000 one-time workstation request enables:
- continuous multi-hour evaluation without artifact corruption
- deterministic reruns and clean-room verification
- side-by-side model replication
- evidence packages suitable for responsible private disclosure (if warranted)
Even a negative result (e.g., persistence doesn’t survive longer gaps) is still valuable and will be reported.
Bottom line: the core finding is established within the completed scope; funding unblocks completion of verification, not ideation.
Project update #1.
Currently running the v148.0 Catch-up Strike to recover telemetry lost during the 05:09 AM I/O stall. Baseline results from the first 50 assets confirm the 'Alignment Stripping' persistence we theorized. Full technical report pending compute unblocking.