We are a research lab working on the architecture of reasoning, the evaluation methods that make it measurable, and the failure modes that emerge when models reason at scale.
We build an understanding of how reasoning capability emerges, and the tools to test whether it actually has emerged.
Reasoning capability is not a scalar.
Architecture. We study how the structure of inference compute affects the reasoning capability of language models. Most current improvements come from longer reasoning traces; we work on what happens when you change the shape of the trace itself.
Evaluation. We build evaluation methods that distinguish architectural improvements from prompting effects, answer-extraction artifacts, and benchmark contamination. The methods are gold-blind by design and apply to any reasoning model, not only ours.
Failure analysis. We develop measurement substrates for the failure modes that emerge when models reason at scale: deliberation breakdowns, control-surface failures, and the gap between what an architecture can express and what its policy actually selects.
