Putting explainability at the forefront of AI text detection
Updated 06/10/26 by grantmaking.ai
Project Details
Come check out a (very experimental) demo to see if you'd like to back further research! 👉 https://ai-tells.tech
Inspiration
AI text detectors are everywhere right now, but they all have the same problem: they give you a label and nothing else. For example, "This is 99% AI," and you're supposed to just trust it. For students accused of cheating, for researchers trying to audit content, for anyone who wants to understand the decision, that's not good enough.
TELL (Traced Evidence for Learned Labels) is a research prototype that changes this. Instead of outputting only a score, it outputs evidence: specific spans in the original text, each with an explanation of why that phrase looks AI-generated or human-written. The goal is to move AI detection from a black box to something closer to a peer reviewer.
The pipeline has two parts. A policy model, trained with GRPO, learns to insert <tell explanation="..."> tags around spans in the text without changing a single word. Then a separate, frozen scoring model assigns each tell a continuous score between -1 (human) and +1 (AI). The final verdict is the average of the scores across all tells. Separating the "tagging" and the "scoring" into two different models is designed to resist reward hacking: the policy model can't just pick words it knows will score high; it has to find real evidence, because otherwise the frozen model will not give it a reward.
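To make the two-stage idea concrete, here is a minimal sketch of the verdict step, not the project's actual code. It assumes the tags are closed with `</tell>` (only the opening tag is shown above), that explanations contain no double quotes, and that `score_fn` is a hypothetical stand-in for the frozen scorer. The toy scorer and its values are made up purely for illustration.

```python
import re
from statistics import mean

# Assumed tag format: <tell explanation="...">span</tell> (the closing tag is an assumption).
TELL_RE = re.compile(r'<tell explanation="([^"]*)">(.*?)</tell>', re.DOTALL)


def strip_tells(tagged: str) -> str:
    """Remove the tell tags but keep the text they wrap."""
    return TELL_RE.sub(lambda m: m.group(2), tagged)


def extract_tells(tagged: str):
    """Return a list of (span, explanation) pairs in document order."""
    return [(m.group(2), m.group(1)) for m in TELL_RE.finditer(tagged)]


def verdict(original: str, tagged: str, score_fn):
    """Average the per-tell scores, or return None if the output is unusable.

    score_fn(span, explanation) is a hypothetical placeholder for the frozen
    scorer; it should return a value between -1 (human) and +1 (AI).
    """
    if strip_tells(tagged) != original:
        return None  # the tagger changed the text, so nothing is scored
    tells = extract_tells(tagged)
    if not tells:
        return None  # no evidence, no verdict
    # Clamp each score to the [-1, +1] range before averaging.
    scores = [max(-1.0, min(1.0, score_fn(span, why))) for span, why in tells]
    return mean(scores)


def toy_scorer(span: str, explanation: str) -> float:
    """Made-up scorer used only to exercise the sketch."""
    return 1.0 if "formulaic" in explanation else -0.5


original = "Moreover, we propose a method. It kinda works."
tagged = (
    '<tell explanation="formulaic transition">Moreover,</tell> we propose a method. '
    '<tell explanation="casual hedge">It kinda works.</tell>'
)
result = verdict(original, tagged, toy_scorer)
```

With the toy scorer, `result` is the average of 1.0 and -0.5. If the tagged output changes even one word of the original, `verdict` returns `None`, which mirrors the idea that an edited text earns no reward.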
The key research question I want to keep working on is generalization: right now, the model is trained on scientific abstracts in English, and it already shows promising results. But real-world AI detection needs to work across languages, domains, and writing styles, including non-native English. I'm Spanish, so I'm particularly interested in what happens with translated text or Spanish writing. My plan is to expand the training data, improve Z-score normalization for out-of-distribution (OOD) cases, and run more systematic evaluations.
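The proposal does not spell out how the Z-score normalization is computed. One common reading, sketched below under that assumption, is to express a raw score relative to the mean and standard deviation of reference scores collected for the same language or domain. The function names and all numbers here are hypothetical.

```python
from statistics import mean, stdev


def fit_reference(scores):
    """Mean and sample standard deviation of reference scores for one language or domain."""
    return mean(scores), stdev(scores)


def z_normalize(raw, mu, sigma):
    """Number of reference standard deviations that raw sits above the reference mean."""
    if sigma == 0:
        raise ValueError("reference scores have zero spread")
    return (raw - mu) / sigma


# Made-up reference scores for known human-written text in two languages.
english_ref = [-0.5, 0.0, 0.5]
spanish_ref = [0.0, 0.5, 1.0]

# The same raw score is judged against each language's own baseline.
z_en = z_normalize(0.5, *fit_reference(english_ref))
z_es = z_normalize(0.5, *fit_reference(spanish_ref))
```

In this toy setup the same raw score of 0.5 sits one standard deviation above the English baseline but exactly at the Spanish baseline, which is the kind of shift a per-language normalization is meant to absorb.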
Why I wanna build this
I got accused of using AI to write something once. The detector gave a high score, and that was it, no explanation, no evidence, nothing I could point at and say "that's wrong." I had no way to defend myself, and that felt super unfair.
But honestly, the more I thought about it, the more I realized the problem is not just about me. Detectors are being used to make real decisions, but a system that just says "99% AI" and moves on doesn't have any accountability; it's just a different kind of guessing.
So I wanted to build something that works more like a human reviewer: it points at specific things and says "this transition is too formulaic" or "this phrasing is unusual for a human." You can agree or disagree. You can push back and defend yourself using your critical thinking (and even use that to improve the system!).
How will this funding be used?
Honestly, the main bottleneck right now is compute. Training runs are expensive, and GRPO is sampling-heavy. This is how I'd use the funding:
- API costs for Tinker (training and sampling), and an inference engine (frozen scorer): around 70% of the budget.
- Evaluation: building a proper benchmark across domains and languages (including Spanish), and collecting or licensing some human-written text data: around 20%.
- Miscellaneous infrastructure (hosting the demo online, storage for checkpoints): around 10%.
The minimum funding would let me run two or three full training runs with different data mixtures and evaluate OOD generalization more carefully. With full funding I could also start the multilingual track.
Team and current progress
Right now this is a solo project. I'm Aldan Creo, an MSDS student at the Halıcıoğlu Data Science Institute (HDSI) at UC San Diego, funded by Fulbright. My research is on AI-generated text detection and LLM watermarking, working with Prof. Yu-Xiang Wang. In parallel, I work in Prof. Earlence Fernandes's lab on prompt injection.
TELL started as a hackathon project for DiamondHacks 2026, where I built the full pipeline from scratch. It's also closely related to my ongoing graduate research and a previous project (CoDeTect) on out-of-distribution code detection, where I proposed a similar Z-scoring approach for language normalization.
What are the most likely causes and outcomes if this project fails?
The most likely failure mode is that the policy model doesn't generalize beyond scientific abstracts. If that happens, the system works in a narrow domain but isn't practically useful. A second risk is that the frozen scorer becomes a bottleneck: if it's badly calibrated, the tells can be misleading even if the format is correct. Both of these are known issues I'm already working on. If the project fails completely, the worst outcome is that it stays a narrow research prototype, which is still useful for the scientific community and as a proof of concept.
How much money have you raised in the last 12 months, and from where?
I'm funded through the Fulbright program for my studies, but that doesn't cover research compute costs. I haven't raised any external funding specifically for this project.
Discussion
Update: I've put up a website with a very early experimental version - come check it out! https://ai-tells.tech :)
@acmc awesome to see that you've already launched; it's a nice-looking website!
Feedback: I pasted in a snippet and it was extremely slow (>3min), only to hit some kind of error. Imo, getting it to work relatively fast is quite important, both for a demo and for real-world usage.
I fear your demonstration is putting too much weight on whether a piece of text sounds like it is from an assistant or not, which reflects a small training foundation of probably mostly assistant-prompted generated text. It is quite good at telling when something sounds like stock-standard AI slop, but I feel anyone can do that. When I change just a few characters (em dashes to -, directional quotes to standard quotation marks, one period removed from an ellipsis) and replace the names of the speakers with Speaker 1/Speaker 2, the estimate drops from -0.15 human (wrong) to -0.95 human (wrong).
I threw in a few passages from rather nuance-heavy conversations I had, as well as some conversations I had with a smaller model. It got perfect marks against the local llama model that fits on my GPU. But at the moment it seems to be worse than a coin flip on ambiguous text, which is confusing, since that is the gap we're trying to fill with these types of detectors; obvious AI text is obvious. Pangram's research paper on their method seems to describe what you are doing, except done in a way that doesn't rely on AI to rate AI work; it instead works on the false positive rate by continually training on false positives so that it can actually detect what a false positive looks like, which has improved Pangram's results explosively in my opinion. I thought it was useless before, but it is worth examining their methodology. Here, the explanations the models gave, both on the experimental page and the other ones I saw by looking at the network request (as one broke), were massive underestimates of what an LLM can do and made sweeping assumptions about the depth or philosophical difficulty of a passage.
At times, it seems to rate a passage purely on whether it sounds empathetic enough to be human and doesn't sound fake. Unfortunately, we live in a time where nuanced, well-spoken AI will be commonplace soon enough, so that's not good enough either. I struggle to know whether more training would solve this issue: if the problem is that it needs to distinguish between meaningful text and the fakely meaningful text an AI writes, and the detector is itself an AI, then how can it know what is meaningful and what isn't, if that's the definition?
I submitted all my entries with correct/incorrect feedback so you can review them if you want. Interesting idea for an alternative solution to this problem with an uncommon training style for this purpose, but I fail to see the things it's intended to show off.
@rubyaftermidnight Hi! That's great feedback, thanks so much for the detailed comment!!
To be clear, I think the version you were trying was one of the earliest iterations. What I wanted to show at that time was just the idea of "tagging" the reasons why the system thinks something is / is not AI. But I didn't expect it to perform well, it was a super early experiment. My bad because I wasn't super clear on that - I've added a disclaimer to the website that I think makes things more clear, would that help?
> This system is a very early experimental prototype. Do NOT trust the model's predictions for real-world decisions, we've trained it for very few steps. And keep in mind that we iterate frequently, so any outputs you see here may change significantly as we keep developing it. Our goal here is to showcase how a real system could work, for example when it comes to showing the annotated reasons for the predictions, but you should not expect it to perform well, at least for now.
Since then, I've been working on better training and longer runs, and I think the system is much better (and more nuanced) now. Would you have some time to test it out and let me know if you think this iteration's better? (I added some examples to the website if you want inspiration) :)
Hey Aldan, I'm viewing your project, and your demo so far has been promising. Clearly labelling AI use and disclosing it seems appropriate and like one thing we really need. My main concern right now is that your app would be the perfect RL environment for training an LLM to be even less detectable. I was wondering if you had any theory about this failure mode?
Hi! Thanks for the nice feedback :) To be honest this is always gonna be an arms race IMO and at some point we won't be able to distinguish it anymore - i.e. when LLM output distributions are indistinguishable from humans' - maybe you'd like to read on this https://arxiv.org/abs/2303.11156
Austin, thank you so much! I wasn't sure if this idea would connect well with people outside my research bubble, so seeing someone actually back it is, to be honest, incredibly motivating!! It means a lot, thanks truly :)
@acmc you're welcome! I'm viewing this as a very cheap bet on someone outside of the standard EA/AI safety network, but otherwise with reasonable credentials, working in a problem area I think is interesting and important. For Manifund specifically, we're seeing a lot of AI generated project proposals, which is not disqualifying but can be a negative signal, and I'd love better, explainable detectors.
For fun, I put your app through Pangram, which flags your proposal as about half AI, haha: https://www.pangram.com/history/07a7f2de-1e42-4bd8-906c-8699f86061e6
@Austin hahah, yup, that's because I actually did use it, of course the ideas are mine but I wrote it with AI. Personally I don't think there's anything inherently wrong with that, but I do think there are degrees of "AI sloppiness", and it can be frustrating at times. So yeah, having detectors we can actually trust and understand is something I think we're all not thinking about enough and I hope this will help change that a bit. Thanks again! :)