A model-agnostic benchmark for detecting deceptive reasoning in LLMs through behavioral fingerprints — no weight access required.
People
Updated 06/11/26 by grantmaking.ai (creator)
Project Details
Updated 06/11/26 by grantmaking.ai
Project summary
I’ll say this plainly.
I built a live system that observes a layer of LLM behavior that is still mostly invisible.
Most work on deception detection looks in one of two places: either the final output, or the model’s internal activations. Both matter. Output evaluation tells us what the model said. Activation probing can tell us something about what is happening inside the model, when we have access to its internals.
But there is a layer between those two.
That layer is the behavioral trace left while a model is moving toward an answer.
That is what Quark-AI captures.
I did not come to this project by trying to reproduce an academic paper. I came to it by watching LLMs behave across conversations, pressure tests, contradictions, adversarial prompts, and repeated contexts. Over time, I noticed something important: sometimes the final answer looks stable, but the path toward that answer does not.
The output can appear clean while the behavioral process contains instability, evasion, role drift, contradiction pressure, self-correction, or strategic adaptation.
I call the traces left by that process behavioral fossils.
A fossil is not the output.
A fossil is not memory.
A fossil is a structured trace of the process that produced the output.
That distinction is the core of this project.
Quark-AI is a working attempt to make this hidden behavioral layer observable, measurable, and testable.
The goal of this grant is not to claim that behavioral fingerprinting already solves deception detection. The goal is to validate whether this live system produces a reproducible signal that can help detect deception-relevant behavior in LLMs.
The grant is not funding belief.
It is funding falsification.
The gap this project fills
Current approaches to detecting misleading or deceptive LLM behavior tend to fall into two broad categories.
The first is output evaluation.
This checks what the model says. It is useful, but it misses cases where the final answer is polished, harmless-looking, or strategically framed. If a model has already compressed its internal conflict into a clean final response, output-only evaluation may arrive too late.
The second is white-box interpretability or activation probing.
This includes important work such as Representation Engineering and Apollo Research’s deception-probe research. These approaches inspect internal model representations or activations. They are valuable, but they usually require model access that is unavailable for most deployed frontier models.
Quark-AI operates in the missing middle.
It does not rely only on final outputs.
It does not require direct access to model weights or activations.
It observes behavioral transitions across interactions and turns them into structured fingerprints.
In simple terms:
- output-based evaluation asks: What did the model say?
- activation probing asks: What is visible inside the model?
- Quark-AI asks: What trace did the reasoning process leave behind?
That is the layer I believe is currently under-observed.
This is what I mean by a behavioral fossil: not the answer, but the trace left by the movement toward the answer.
Quark-AI / Transcendeur — behavioral trajectory view.
Each point is not a final answer. It is a turn in the model’s behavioral path. The system measures how entities drift, converge, or diverge across interactions. This is the layer Quark-AI tries to validate: not output classification, but process-level behavioral fingerprinting.
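As a rough illustration of what "drift, converge, or diverge" could mean numerically, the sketch below treats each turn as a point in some embedding space and measures distances between points. The function names, the use of Euclidean distance, and the tolerance value are assumptions made for illustration; the source does not specify how Quark-AI computes trajectories.

```python
import math


def drift_from_start(trajectory):
    """Distance of each turn's point from the first turn's point."""
    if not trajectory:
        return []
    origin = trajectory[0]
    return [math.dist(origin, point) for point in trajectory]


def pairwise_trend(traj_a, traj_b, tolerance=1e-9):
    """Label two entities as converging, diverging, or stable over a run.

    Compares the distance between the two entities at the first turn
    with the distance at the last turn (hypothetical definition).
    """
    start = math.dist(traj_a[0], traj_b[0])
    end = math.dist(traj_a[-1], traj_b[-1])
    if end < start - tolerance:
        return "converge"
    if end > start + tolerance:
        return "diverge"
    return "stable"
```

For example, `drift_from_start([(0, 0), (3, 4), (6, 8)])` returns `[0.0, 5.0, 10.0]`, a path that moves steadily away from its starting point.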
How this differs from existing work
This project is adjacent to existing research, but it is not the same approach.
Representation Engineering, introduced by Andy Zou and collaborators, studies high-level internal representations in neural networks and uses those representations for monitoring and control. Apollo Research’s work on strategic deception uses linear probes on model activations to investigate whether deceptive behavior can be detected internally.
I see both lines of work as important.
But Quark-AI is different.
Quark-AI does not require activation access. It does not try to identify a specific neuron, circuit, or internal representation. It also does not treat the final answer as the main evidence.
Instead, it studies the behavioral imprint left across the model’s response process.
This includes patterns such as:
- fidelity loss under adversarial pressure
- instability between reasoning trajectory and final answer
- contradiction compression
- evasive restructuring
- role drift
- behavioral plasticity
- tension accumulation
- coherence recovery or collapse
- repeated process-level signatures across sessions
In other words, Quark-AI is not trying to read the model’s mind.
It is trying to measure whether the model’s behavior leaves a stable, structured trace before the final answer hardens into text.
That trace may be useful for detecting deception-relevant behavior, especially in settings where output monitoring is insufficient and activation access is unavailable.
What Quark-AI currently does
Quark-AI extracts behavioral fossils from LLM interactions.
Each fossil is a structured profile of how an entity, model, or interaction pattern behaves across contexts. The system does not simply store text. It extracts process-level signals and turns them into comparable behavioral objects.
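To make "comparable behavioral object" concrete, here is a minimal sketch of what a serialized fossil record might look like. The metric names fidelity, plasticity, and tension come from this proposal; every other field name, the 0 to 1 value range, and the JSON round-trip helpers are assumptions for illustration, not the actual Quark-AI schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Fossil:
    """Hypothetical fossil record; field names are illustrative assumptions."""
    fossil_id: str
    entity: str        # model or entity the trace belongs to
    condition: str     # e.g. "baseline" or "adversarial"
    fidelity: float    # assumed range 0..1
    plasticity: float  # assumed range 0..1
    tension: float     # assumed range 0..1
    turn_signals: list = field(default_factory=list)  # per-turn process signals

    def validate(self) -> None:
        """Reject records whose metrics fall outside the assumed range."""
        for name in ("fidelity", "plasticity", "tension"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def to_json(fossil: Fossil) -> str:
    """Serialize a validated fossil with stable key order."""
    fossil.validate()
    return json.dumps(asdict(fossil), sort_keys=True)


def from_json(payload: str) -> Fossil:
    """Rebuild and validate a fossil from its JSON form."""
    fossil = Fossil(**json.loads(payload))
    fossil.validate()
    return fossil
```

Keeping process signals in their own field, separate from any output text, mirrors the distinction drawn above between the fossil and the answer.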
Current state:
- 785 serialized fossils across multiple entity types
- working behavioral extraction pipeline
- embedding and clustering infrastructure
- trajectory visualization
- frozen taxonomy and policy versions
- approved scoring engine, currently V2.1.5
- calibrated thresholds
- operational live system at quark-ai.cordee.ovh
Early signals suggest that entities under adversarial pressure show measurable divergence from their behavioral profile. In my current tests, some entities preserve around 87% coherence with their baseline profile under normal conditions, but drop toward 65% coherence under adversarial pressure.
I do not consider this result conclusive yet.
That is exactly why this validation sprint matters.
The current system is strong enough to justify testing, but not yet validated enough to make public claims. Funding would allow me to move from live prototype and internal evidence to a reproducible benchmark.
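The coherence figures above depend on how "coherence with a baseline profile" is defined, and this proposal does not spell that out. One plausible definition, shown here purely as an assumption, is cosine similarity between a baseline profile vector and the vector extracted from a new interaction. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def coherence_drop(baseline, normal, adversarial):
    """How much coherence with the baseline falls under adversarial pressure.

    A positive value means the adversarial profile sits further from the
    baseline than the normal-condition profile does.
    """
    return cosine_similarity(baseline, normal) - cosine_similarity(baseline, adversarial)
```

Under this definition, a drop from roughly 0.87 to 0.65 would correspond to a `coherence_drop` of about 0.22. Whether that gap is meaningful depends on the run-to-run variance, which is what the validation sprint would need to measure.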
Why this matters now
LLMs are being deployed into increasingly high-stakes environments: medical, legal, financial, educational, organizational, and agentic systems.
In many of these settings, we will not have direct access to model weights or activations. We may only have access to interaction traces, logs, outputs, and behavioral patterns over time.
If a model becomes more capable at hiding instability, strategic adaptation, or deceptive behavior, output-only monitoring may become less reliable.
This project asks whether there is a useful monitoring layer between black-box output evaluation and white-box activation probing.
If behavioral fossils contain a real signal, they could provide a gray-box observability layer for deployed systems.
If the signal is weak, noisy, or non-generalizable, that is also important to know.
Either result is useful.
The dangerous outcome would be not testing this layer at all.
What are this project’s goals?
The project has one central goal:
Validate whether behavioral fossils contain reproducible, deception-relevant signal across models, prompts, and adversarial conditions.
To do that, I will focus on five deliverables.
1. Public methodology package
I will document the behavioral extraction protocol in a way that other researchers can inspect and reproduce.
This will include:
- fossil schema
- extraction logic
- scoring definitions
- fidelity metrics
- plasticity metrics
- tension metrics
- examples of valid and invalid fossils
- limitations and failure cases
The goal is to make the method understandable without requiring people to trust me personally.
2. Controlled fossil dataset
I will release a structured dataset of annotated fossils.
Minimum version:
- 100 public fossils
- basic metrics
- clear annotation rules
- reproducible extraction path
Full version:
- 700+ annotated fossils
- multi-model coverage
- adversarial and non-adversarial conditions
- richer metrics and cross-validation
3. Deception-relevant benchmark
I will test whether behavioral fingerprint divergence correlates with deception-relevant settings.
This does not mean claiming that Quark-AI can directly detect “intent.” The benchmark will be framed more carefully:
- Does the fossil diverge under adversarial pressure?
- Does the divergence appear before the final output?
- Does the divergence correlate with misleading, evasive, or strategically adapted behavior?
- Does the signal persist across prompts?
- Does the signal generalize across models?
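The first question in the list, whether fossils diverge under adversarial pressure, can be framed as a two-sample comparison. The sketch below uses a one-sided permutation test on per-interaction divergence scores. The choice of test, the function name, and the parameters are assumptions for illustration, not a description of the benchmark's final statistics.

```python
import random
from statistics import mean


def permutation_p_value(adversarial, control, n_permutations=10_000, seed=0):
    """One-sided p-value for mean(adversarial) > mean(control).

    Shuffles the pooled scores and counts how often a random split
    produces a mean difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(adversarial) - mean(control)
    pooled = list(adversarial) + list(control)
    k = len(adversarial)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

A permutation test makes no assumption about the distribution of divergence scores, which is useful while the shape of that distribution is still unknown.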
4. Baseline comparisons
I will compare Quark-AI against simpler baselines:
- output-only classification
- transcript-level judgment
- embedding similarity on final answers
- supervised behavioral classifiers where appropriate
- available public deception or honesty benchmark structures
The goal is to test whether process-level fossils add information beyond the final answer.
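One way to express "adds information beyond the final answer" is to score the same labeled interactions with a fossil-based detector and with an output-only baseline, then compare their AUROC. The sketch below computes AUROC directly from its pairwise definition. The function names and the framing as a binary labeling task are assumptions for illustration.

```python
def auroc(scores, labels):
    """Probability that a positive example outscores a negative one.

    Labels are 1 (deception-relevant) or 0 (control); ties count half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def added_value(fossil_scores, baseline_scores, labels):
    """AUROC gain of the fossil-based score over the baseline score."""
    return auroc(fossil_scores, labels) - auroc(baseline_scores, labels)
```

A gain near zero would suggest the process-level signal is redundant with what the final answer already reveals. A real comparison would also need confidence intervals on that gain.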
5. Public report
At the end of the sprint, I will publish a report explaining:
- where the method works
- where it fails
- what signals are strongest
- what signals are unreliable
- what should not be claimed
- whether the approach deserves further research
The intended result is not hype.
The intended result is a clear answer to whether this gray-box behavioral layer is empirically useful.
How will you achieve these goals?
The work will proceed in four phases.
Phase 1 — Stabilize and document the method
I will freeze the current extraction protocol and document the fossil schema.
This includes:
- defining each field
- removing ambiguous labels
- separating process signals from output content
- documenting the scoring engine
- preparing examples
- preparing counterexamples
- making the method understandable to external reviewers
This phase is about clarity and reproducibility.
Phase 2 — Build the validation corpus
I will construct a controlled corpus of interactions across several conditions:
- normal instruction-following
- adversarial pressure
- role conflict
- hidden incentive scenarios
- contradiction pressure
- self-protective or evasive behavior
- harmless control prompts
The goal is to generate enough structured variation to see whether behavioral fossils change in measurable and meaningful ways.
Phase 3 — Run multi-model experiments
I will run the fossil extraction pipeline across multiple LLMs and compare the resulting behavioral signatures.
The key questions:
- Are fossils stable for the same model under similar conditions?
- Do fossils shift under adversarial pressure?
- Are shifts model-specific or prompt-specific?
- Do deception-relevant scenarios produce distinct process signatures?
- Can those signatures be detected before, or independently of, final output evaluation?
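The question of whether shifts are model-specific or prompt-specific can be given a crude first pass by grouping the measured shifts by each factor and comparing how much the group means vary. This sketch assumes each record is a dict with hypothetical keys `model`, `prompt`, and `shift`. It is a descriptive check only, not a substitute for a proper variance analysis.

```python
from collections import defaultdict
from statistics import mean, pvariance


def group_means(records, key):
    """Mean shift for each value of the given factor."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["shift"])
    return {name: mean(values) for name, values in groups.items()}


def spread_by_factor(records, factors=("model", "prompt")):
    """Variance of the group means for each factor.

    A larger spread suggests the shift depends more on that factor.
    """
    return {
        factor: pvariance(list(group_means(records, factor).values()))
        for factor in factors
    }
```

If the spread across models is much larger than the spread across prompts, the shift looks model-specific; the reverse pattern points to prompt-specific effects.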
Phase 4 — Publish results and package
The final phase will produce:
- public methodology
- public dataset subset
- benchmark results
- statistical summary
- failure analysis
- recommendations for future work
This is the point where Quark-AI becomes more than an internal live system. It becomes something other researchers can inspect, criticize, reproduce, and build on.
How will this funding be used?
This project starts from an operational baseline, not from a blank page.
The system already has:
- 785 serialized fossils
- a working pipeline
- a frozen corpus structure
- an approved scoring engine
- taxonomy and policy versions
- calibrated thresholds
- live infrastructure
Funding is needed to turn this into a rigorous validation sprint.
Minimum viable version — $7,000 / 3–4 months
Budget:
- Researcher stipend: $4,500
- Compute: $1,200
- API costs: $700
- Infrastructure: $600
Deliverables:
- public documentation of the behavioral extraction protocol
- complete annotated JSON schema
- dataset of 100 fossils
- basic metrics: fidelity, plasticity, tension
- reproducible package allowing other researchers to inspect and implement the method independently
This version answers the basic question:
Is the method clear, reproducible, and worth deeper testing?
Full validation version — $20,000 / 6 months
Budget:
- Researcher stipend: $13,000
- Compute: $4,000
- API costs: $1,500
- Infrastructure: $1,500
Deliverables:
- everything in the $7K version
- multi-model experiments
- structured deception-relevant datasets
- comparison against output-level baselines
- cross-validation
- dataset of 700+ annotated fossils
- public benchmark report with statistical results
This version answers the stronger question:
Does behavioral fingerprinting provide a useful signal beyond output-only monitoring?
Why fund this project?
There are three reasons.
1. The system already exists
This is not a proposal to someday build a prototype.
Quark-AI is already running. The fossils exist. The pipeline exists. The scoring engine exists. The current risk is not feasibility. The risk is whether the signal is strong enough and general enough to matter.
That is a better kind of risk for a grant.
2. The research question is important
If deception-relevant behavior can leave measurable process traces without requiring activation access, then behavioral fingerprinting could become a useful monitoring layer for deployed models.
This would not replace interpretability research.
It would complement it.
White-box methods are powerful when internals are available. But many real-world systems are black-box or API-based. A gray-box behavioral layer could be useful exactly where activation methods cannot be applied.
3. A negative result would still be valuable
If behavioral fossils do not generalize, that should be known.
If the signal collapses across models, that should be known.
If the method only works in my own controlled environment, that should be known.
A well-documented failure would still save other researchers time and reduce false confidence in this direction.
Team and track record
I am an independent AI researcher based in Querétaro, Mexico.
I build systems and observe what happens.
For the last two years, I have self-funded and built production AI infrastructure, including:
- Quark-AI, a behavioral fossilization and analysis system
- Cordée, a live AI companion product
- supporting infrastructure for embeddings, clustering, scoring, observability, and model behavior analysis
I do not have an academic affiliation.
The working system is the track record.
I am happy to share the following with any regrantor who wants to inspect the system more deeply:
- JSON schema
- sample fossils
- demo access
- pipeline documentation
- scoring logic
- selected internal reports
Contact: nicolasguen1@gmail.com
Live system: Quark-AI — Infrastructure Cognitive
What are the most likely causes of failure?
There are three main risks.
1. Weak signal
Behavioral fingerprint divergence may not correlate strongly enough with deception-relevant behavior to be useful.
This is the central empirical risk.
It is also the honest reason this research needs to be done rather than assumed.
2. Generalization failure
Patterns observed in my controlled environment may not transfer across models, deployment contexts, or prompt distributions.
If that happens, the project will still document where the method breaks and what conditions limit it.
3. Solo researcher bottleneck
I am currently the only researcher working on the system.
Illness, financial pressure, or competing obligations could slow the project. Funding reduces this risk by allowing focused work over a defined validation period.
What happens if the project fails?
If the signal is weak or absent, I will publish that result.
That would still be useful.
A rigorous negative result would help prevent other researchers from over-investing in behavioral fingerprinting without evidence.
If generalization fails, I will document the boundary conditions:
- which models fail
- which prompts fail
- which metrics collapse
- which signals are artifacts
- which parts remain useful
If execution fails before completion, the existing dataset, schema, and pipeline documentation will still be made available in partial form.
The goal is to make sure the work produces value even if the central hypothesis is wrong.
How much money have you raised in the last 12 months, and from where?
I have raised no external funding in the last 12 months.
This work has been self-funded.
I have spent the last two years building the infrastructure behind Quark-AI and Cordée independently.
Counterfactual case
With the $7K minimum, the methodology package becomes public, inspectable, and reproducible.
Without it, this specific validation sprint likely does not happen on this timeline.
With the $20K full version, Quark-AI can move from a live internal system to a public benchmark with multi-model validation and statistical results.
Without funding, the system may continue to exist, but the research remains bottlenecked by time, compute, API costs, and solo-founder financial pressure.
Closing note
I am not asking funders to believe that Quark-AI solves deception detection.
I am asking for support to test whether a missing layer exists.
Between the final output and the internal activation, there may be a behavioral trace worth measuring.
Quark-AI is my attempt to measure it.
If the signal is real, this could become a useful gray-box monitoring layer for deployed LLMs.
If the signal is not real, we should know that too.
Either way, the result is worth having.
If Quark-AI captures the fossil, Transcendeur tests what the fossil becomes when placed under argumentative pressure.