HYVE Labs · Research Paper · May 2026
Operations as Primitives.
A compositional framework for training reasoning in small language models — twelve cognitive operations replacing the four classical philosophical modes.
Abstract
Existing approaches to training reasoning capabilities in language models organize their training curricula around philosophical categories of reasoning — deductive, inductive, abductive, and causal — treating these as the atomic units of cognition to be drilled into the model. We argue that this framing is mistaken at the level of cognitive architecture, and that this mistake imposes a ceiling on the reasoning quality achievable through supervised fine-tuning. Philosophical categories of reasoning are not primitives. They are post-hoc descriptions of complex cognitive episodes that decompose into more fundamental operations.
We propose an alternative framework in which reasoning is trained through twelve discrete cognitive operations — decomposition, premise identification, implication tracing, contradiction detection, evidence weighing, scope determination, temporal sequencing, absence reasoning, analogical mapping, confidence calibration, error recognition, and strategy selection — and we argue that the four classical philosophical reasoning modes emerge naturally as compositions of these primitives.
We further argue that a training curriculum built on operations-as-primitives requires structural diversity along three additional axes that have been under-theorized in the reasoning fine-tuning literature: record-type diversity, system-prompt variation, and domain-agnostic problem construction.
This paper presents the complete architectural framework as a defensive publication. We disclose the twelve operations, the six record types, the ten problem domains, the three-tier system prompt architecture, the seven-phase dataset generation pipeline, and the LoRA fine-tuning specification targeting the Phi-4 14B model. We do not present empirical results; this paper is a theoretical and architectural contribution intended to establish prior art on the framework and to invite empirical investigation by the broader research community.
The case against philosophical categories as primitives
The four classical reasoning categories are products of analytic philosophy and the philosophy of science. They were developed as descriptive categories for analyzing arguments after they had been made — not as generative categories for producing reasoning in real time.
The problem arises when these descriptive categories are imported into AI training as prescriptive templates. Three confusions follow:
Granularity. Philosophical categories describe reasoning at a much coarser grain than the cognitive operations that actually produce it. A single piece of deductive reasoning involves identifying premises, tracing their implications, checking for contradictions, weighing premises against each other, and calibrating confidence in the conclusion. Categorizing the whole episode as “deductive” obscures the operations that constitute it.
Direction. Philosophical categories are characterizations of completed reasoning. Cognitive operations are generative — what the reasoner is doing in the moment. A physician examining a patient does not think “I shall now deploy abductive reasoning.” She thinks “what would explain this constellation of findings?”
Composition. Real reasoning episodes almost always involve operations that span multiple philosophical categories. A diagnostic reasoning task may begin with deductive elimination, proceed through abductive inference, incorporate inductive reasoning, and conclude with causal reasoning. If training data is partitioned into categorical buckets, the model never sees naturally composed reasoning episodes.
The three confusions converge on a single failure mode in fine-tuned models: brittleness. A model trained on philosophical-category-labeled data learns to produce reasoning that looks like the category it has been prompted into but lacks the operational substrate that would make the reasoning robust.
The twelve cognitive operations
The operations were derived through cognitive task analysis on the reasoning demands of eleven distinct professional domains. Each operation can be specified and trained independently. Each is exercised across multiple domains, so training the operation on domain-agnostic content transfers to domain-specific reasoning. The operations compose to produce the four classical philosophical reasoning modes plus combinations that do not map cleanly onto any single classical mode.
| # | Operation | Description |
|---|---|---|
| 01 | Decomposition | Breaking a complex problem into well-defined sub-problems. |
| 02 | Premise Identification | Distinguishing what has been given from what is assumed and what is genuinely unknown. |
| 03 | Implication Tracing | Following claims forward to their necessary consequences. |
| 04 | Contradiction Detection | Recognizing when two claims, or a claim and an observation, cannot both hold. |
| 05 | Evidence Weighing | Assessing the relative strength of competing pieces of evidence. |
| 06 | Scope Determination | Deciding whether a stated rule or principle applies to a specific case. |
| 07 | Temporal Sequencing | Reasoning about the order in which events or rules occurred and how that order changes the analysis. |
| 08 | Absence Reasoning | Drawing inferences from what is not present. Negative evidence. |
| 09 | Analogical Mapping | Identifying structural similarity between cases that differ in surface features. |
| 10 | Confidence Calibration | Honestly assessing how certain one is, and why. A counterweight to the hallucination tendency. |
| 11 | Error Recognition | Catching one’s own mistakes mid-reasoning when a contradiction or absurdity fires. |
| 12 | Strategy Selection | The metacognitive operation. Choosing which operations to deploy on a given problem before beginning. |
The four classical philosophical modes correspond to specific compositions:
| Classical mode | Operations | Composition |
|---|---|---|
| Deductive | 2, 3, 4 | identify premises → trace implications → verify no contradictions |
| Inductive | 5, 9, 10 | weigh evidence → map analogically → calibrate confidence in generalization |
| Abductive | 1, 5, 10 | decompose observation → weigh candidate explanations → calibrate confidence in best |
| Causal | 3, 7, 9 | trace implication of putative cause → verify temporal precedence → map to analogous cause-effect cases |
The philosophical modes can be reconstructed from the operations; the operations cannot be reconstructed from the philosophical modes. This asymmetry is the technical content of the operations-as-primitives thesis.
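The compositions stated above can be written down as a small lookup structure. The following is a minimal sketch, not the authors' implementation: the names `Op`, `CLASSICAL_MODES`, `classify`, and `unused_ops` are hypothetical, and exact sequence matching is an assumption made only to keep the example short.

```python
from enum import IntEnum


class Op(IntEnum):
    """The twelve operations, numbered as in the source."""
    DECOMPOSITION = 1
    PREMISE_IDENTIFICATION = 2
    IMPLICATION_TRACING = 3
    CONTRADICTION_DETECTION = 4
    EVIDENCE_WEIGHING = 5
    SCOPE_DETERMINATION = 6
    TEMPORAL_SEQUENCING = 7
    ABSENCE_REASONING = 8
    ANALOGICAL_MAPPING = 9
    CONFIDENCE_CALIBRATION = 10
    ERROR_RECOGNITION = 11
    STRATEGY_SELECTION = 12


# Compositions exactly as listed in the text (operation numbers 2-3-4, 5-9-10, ...).
CLASSICAL_MODES = {
    "deductive": (Op(2), Op(3), Op(4)),
    "inductive": (Op(5), Op(9), Op(10)),
    "abductive": (Op(1), Op(5), Op(10)),
    "causal": (Op(3), Op(7), Op(9)),
}


def classify(trace):
    """Return the classical mode whose composition equals the trace, else None.

    Exact matching is an illustrative simplification (an assumption).
    """
    for mode, ops in CLASSICAL_MODES.items():
        if tuple(trace) == ops:
            return mode
    return None


def unused_ops():
    """Operations that appear in none of the four classical compositions."""
    used = {op for ops in CLASSICAL_MODES.values() for op in ops}
    return sorted(set(Op) - used)


print(classify([Op.PREMISE_IDENTIFICATION, Op.IMPLICATION_TRACING,
                Op.CONTRADICTION_DETECTION]))
print([op.name for op in unused_ops()])
```

Under the listed compositions, operations 6, 8, 11, and 12 belong to no classical mode, which is one concrete way to see the asymmetry: a label such as "deductive" names a sequence of operations but does not by itself recover the full operation set.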
The six record types
A reasoning curriculum must train more than the operations themselves. It must train the cognitive behaviors that surround and modulate the operations: how to recover from a wrong start, how to handle genuine ambiguity, how to recognize when information is insufficient, how to resist cognitive bias.
Existing fine-tuning datasets predominantly use a single record type: problem → reasoning chain → answer. This is the analogue of teaching a driving student by showing only successful drives. We propose six record types.
| # | Record Type | Share | Behavior Trained |
|---|---|---|---|
| 1 | Clean Multi-Operation Solve | 30% | Standard reasoning with explicit decision-point transparency |
| 2 | Productive Failure + Recovery | 24% | Error recognition and recovery — the most important record type |
| 3 | Competing Interpretations | 16% | Calibrated handling of genuine ambiguity |
| 4 | Strategy Selection / Cold Start | 14% | Metacognitive strategy diagnosis |
| 5 | Insufficient Information | 8% | Recognition of unanswerable questions |
| 6 | Adversarial Traps | 8% | Resistance to cognitive bias |
The most important type is Productive Failure + Recovery — the assistant begins with a plausible but incorrect approach, encounters a contradiction, explicitly identifies the failure, backtracks, and re-solves. This record type is almost entirely absent from existing reasoning datasets, and is the principal vehicle for training error recognition (Operation 11) and strategy switching.
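To make the mix concrete, the shares in the table can be turned into per-type example counts. This is a sketch under stated assumptions: the 25,000 total is the dataset target named in the fine-tuning specification, while the function name `allocate` and the largest-remainder rounding rule are hypothetical choices, not something the source specifies.

```python
# Shares (percent) copied from the record-type table.
RECORD_TYPE_SHARES = {
    "Clean Multi-Operation Solve": 30,
    "Productive Failure + Recovery": 24,
    "Competing Interpretations": 16,
    "Strategy Selection / Cold Start": 14,
    "Insufficient Information": 8,
    "Adversarial Traps": 8,
}


def allocate(total, shares_pct):
    """Split `total` examples by percentage share, keeping the sum exact.

    Uses integer arithmetic plus largest-remainder rounding (an assumed rule)
    so that totals which do not divide evenly still add up to `total`.
    """
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must sum to 100")
    counts = {name: total * pct // 100 for name, pct in shares_pct.items()}
    remainders = {name: total * pct % 100 for name, pct in shares_pct.items()}
    leftover = total - sum(counts.values())
    # Hand leftover examples to the types with the largest fractional parts.
    by_remainder = sorted(shares_pct, key=lambda n: remainders[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


for name, count in allocate(25_000, RECORD_TYPE_SHARES).items():
    print(f"{name}: {count}")
```

At 25,000 examples the shares divide evenly, so the rounding rule only matters for smaller pilot datasets.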
Three-tier system prompt architecture
The system prompt must be a meaningful control signal, not a static string the model learns to ignore. We specify a three-tier architecture: a global reasoning identity (Tier 1), a record-type modulation (Tier 2), and per-domain alignment (Tier 3). Each tier admits substantial paraphrastic variation across training records. The model learns to attend to system-prompt content because the content actually changes — and changes in ways correlated with the desired output behavior.
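One way to picture the three tiers is as three pools of paraphrases, with one variant sampled from each pool per training record. The sketch below is illustrative only: every prompt string, key name, and the function `build_system_prompt` is a hypothetical placeholder, since the source does not publish its prompt text.

```python
import random

# Hypothetical placeholder texts; NOT the authors' system prompts.
TIER1_IDENTITY = [
    "You are a careful reasoner. Show the decisions you make as you work.",
    "Reason step by step and make each decision point explicit.",
]
TIER2_RECORD_TYPE = {
    "clean_solve": [
        "Solve the problem directly and explain why each step follows.",
        "Work to an answer, justifying every step you take.",
    ],
    "insufficient_information": [
        "If the problem cannot be answered from what is given, say so and name what is missing.",
        "Check whether the given facts suffice; if not, state what else you would need.",
    ],
}
TIER3_DOMAIN = {
    "generic": [
        "No specialist knowledge is required for this problem.",
        "Treat this as a domain-neutral problem.",
    ],
    "clinical": [
        "Apply the reasoning to a clinical guideline setting.",
        "The context is clinical guideline application.",
    ],
}


def build_system_prompt(record_type, domain, rng):
    """Sample one paraphrase per tier and join them in tier order."""
    tiers = (TIER1_IDENTITY, TIER2_RECORD_TYPE[record_type], TIER3_DOMAIN[domain])
    return "\n\n".join(rng.choice(variants) for variants in tiers)


rng = random.Random(0)  # seeded so that dataset generation is reproducible
print(build_system_prompt("insufficient_information", "generic", rng))
```

Because the Tier 2 text is chosen by record type, the prompt content varies together with the target behavior, which is the correlation the paragraph above relies on.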
Domain-agnostic problem construction
Reasoning operations should be trained on content that exercises the operation without requiring domain knowledge. The operation of scope determination can be exercised on a problem about whether a hypothetical bylaw applies to a specific committee scenario; that same operation transfers, at inference time, to statutory interpretation, to takhsis al-'amm (specification of the general) in Islamic jurisprudence, to clinical guideline application, and to insurance policy interpretation. The operation is the carrier; the domain is the dressing.
Domain transfer is achieved at inference time through prompt-level domain alignment or, where deeper transfer is required, through lightweight secondary fine-tuning on a small per-domain corpus. The base reasoner remains domain-agnostic; the domain specializations are layered on top.
Fine-tuning specification
The full paper specifies the seven-phase dataset generation pipeline, the LoRA hyperparameter configuration, the target model (Phi-4 14B), and the proposed evaluation framework. The dataset target is 25,000 examples. This is a reasonable upper bound for a generation pipeline that prioritizes per-example quality, and a reasonable lower bound for covering twelve operations across six record types and ten problem domains with enough redundancy for stable training.
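The coverage argument can be checked with back-of-envelope arithmetic. The sketch below assumes, purely for illustration, that examples are spread uniformly over every (operation, record type, domain) combination and that each example counts toward a single operation; the source states neither, and the record-type shares are in fact non-uniform.

```python
from math import prod

# Axis sizes and dataset target as stated in the source.
AXES = {"operations": 12, "record_types": 6, "domains": 10}
TARGET_EXAMPLES = 25_000


def mean_examples_per_cell(total, axis_sizes):
    """Average examples per combination under a uniform spread (an assumption)."""
    return total / prod(axis_sizes)


cells = prod(AXES.values())
per_cell = mean_examples_per_cell(TARGET_EXAMPLES, AXES.values())
print(f"{cells} combinations, about {per_cell:.1f} examples each")
# prints: 720 combinations, about 34.7 examples each
```

Roughly 35 examples per combination gives some redundancy on average, though the thinner record types would sit well below that mean.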
We disclose the complete architectural framework. We protect the specific seed problems used in our own implementation and the verbatim text of our system prompt variants. The protected material is implementation detail; the disclosed material is the architecture itself.
Download
The full paper, including all twelve operation definitions, the complete compositional table, the full six-type response format specifications, the three-tier system prompt architecture, the seven-phase generation pipeline, the LoRA configuration, and the proposed evaluation framework, is available as a PDF.
Citation: Faruki, B. K. (2026). Operations as Primitives: A Compositional Framework for Training Reasoning in Small Language Models. HYVE Labs Research Paper, MindHYVE.ai, Inc.
Read further
Orchestration as the Locus of Value in Regulated AI Reasoning — the companion position paper. Operations-as-Primitives specifies how the orchestrator (Element 2) is trained.
The model is not the product — the introduction of the Metacognitive Reasoning Architecture (MRA) category that this work is part of.
Why Eve-Genesis trains reasoning modes, not answers — the editorial framing of the same compositional thesis.