MindHYVE.ai

How to build a mind: the engineering behind a Metacognitive Reasoning Architecture

A complete tech stack walkthrough for building an AI system that orchestrates models — not one that replaces them.

Bill Faruki, Founder & CEO · 8 min read

Originally published on billfaruki.substack.com on March 18, 2026. The companion essay to “The model is not the product.”

Last week I introduced a new category of AI system: the Metacognitive Reasoning Architecture, or MRA. The response was clear — people get the concept. Intelligence is structure, not scale. The models are brains. The MRA is the mind.

The immediate follow-up question: how do you actually build one?

This is that answer. A complete engineering walkthrough of the technology stack required to implement an MRA on commercially available cloud infrastructure, using commercially available foundation models, with a standard TypeScript engineering team.

No custom silicon. No model training. No PhD-level ML engineering. Just structured orchestration of existing inference endpoints, built with the same tools your team already uses.

The constraint that defines everything

An MRA has a single architectural constraint that shapes every technology decision: it does not generate tokens. It orchestrates systems that generate tokens.

This means the MRA itself is an orchestration layer — services that classify, decompose, dispatch, verify, synthesize, monitor, and remember. The compute-heavy work (the actual model inference) happens on managed endpoints that someone else operates. Your engineering effort goes into the logic that sits between those endpoints.

The practical consequence: if your team can build a microservices backend, your team can build an MRA. The difficulty is not in any individual component. It is in getting the seven layers to interact correctly, recursively, under real-time metacognitive oversight.

The stack at a glance

The MRA we built at HYVE Labs runs entirely on Azure. Not because Azure is the only option — the architecture is cloud-portable — but because we already operate on Eve-Grid™ (our proprietary cloud architecture on Microsoft Azure) for our production AI products, and the Azure AI Foundry model catalog gives us access to models from OpenAI, Anthropic, Meta, Mistral, and others through a single deployment surface.

The runtime is TypeScript everywhere. Same language in the orchestration services, the background workers, the API gateway, the shared type definitions, and the configuration schemas. One language means one set of mental models, one type system enforcing contracts across every layer boundary, and zero context-switching overhead for the engineering team.

The monorepo is managed by Turborepo with pnpm workspaces. Each MRA layer is a separate package with explicit import boundaries. The orchestration services run on Fastify — chosen for its schema-validation-first design and raw throughput. Zod validates every piece of data crossing a layer boundary. If a problem genome is malformed, if a confidence vector has an out-of-range dimension, if a DAG node is missing its dependency set — the system catches it at the boundary, not three layers downstream.

Layer 0: The Intake Cortex

The entry point. A query arrives, and before anything else happens, the MRA needs to understand what kind of problem it is dealing with.

The Intake Cortex runs a lightweight model — GPT-4.1-nano or equivalent — on an Azure ML Online Endpoint. This is a classification task, not a generation task. The model receives the query and outputs a structured object we call the problem genome: six typed axes describing domain, difficulty, ambiguity, composability, risk profile, and similarity to past episodes.
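As a rough illustration of the genome described above, the sketch below defines the six axes as a TypeScript type with a hand-written boundary validator. The field names, the 1 to 5 difficulty scale, the 0 to 1 ranges, and the `parseGenome` helper are all assumptions made for illustration. The article's implementation uses a Zod schema; this dependency-free stand-in only approximates that behavior.

```typescript
// Hypothetical genome shape: axis names follow the article,
// value ranges are illustrative assumptions.
type RiskProfile = "low" | "medium" | "high";

interface ProblemGenome {
  domain: string;
  difficulty: number;        // assumed integer scale 1..5
  ambiguity: number;         // assumed 0..1
  composability: number;     // assumed 0..1
  riskProfile: RiskProfile;
  episodeSimilarity: number; // assumed 0..1, similarity to past episodes
}

type ParseResult =
  | { ok: true; genome: ProblemGenome }
  | { ok: false; errors: string[] };

const RISK_PROFILES: readonly string[] = ["low", "medium", "high"];

function isUnit(v: unknown): boolean {
  return typeof v === "number" && v >= 0 && v <= 1;
}

// Validates untrusted classifier output before any downstream layer sees it.
function parseGenome(raw: unknown): ParseResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, errors: ["genome must be an object"] };
  }
  const r = raw as Record<string, unknown>;
  const domain = r.domain;
  const difficulty = r.difficulty;
  const risk = r.riskProfile;
  const errors: string[] = [];

  if (typeof domain !== "string" || domain.length === 0) errors.push("domain");
  if (
    typeof difficulty !== "number" ||
    !Number.isInteger(difficulty) ||
    difficulty < 1 ||
    difficulty > 5
  ) {
    errors.push("difficulty");
  }
  if (!isUnit(r.ambiguity)) errors.push("ambiguity");
  if (!isUnit(r.composability)) errors.push("composability");
  if (typeof risk !== "string" || RISK_PROFILES.indexOf(risk) === -1) {
    errors.push("riskProfile");
  }
  if (!isUnit(r.episodeSimilarity)) errors.push("episodeSimilarity");

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, genome: raw as ProblemGenome };
}
```

Returning a result object rather than throwing lets the caller decide whether a malformed genome should trigger a retry of the classifier or a rejection of the request.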

The genome is defined as a Zod schema. Every downstream layer trusts this schema because Zod validates it at creation. The genome is also cached in Azure Cache for Redis, keyed on an embedding hash of the query. If a structurally identical query type has been classified before, the system skips the model call entirely. For production traffic where many queries are structurally similar, this cuts Layer 0 latency to single-digit milliseconds.
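The cache-before-classify path can be sketched as follows. This is a simplification under stated assumptions: an in-memory `Map` stands in for Azure Cache for Redis, a hash of the normalized query text stands in for the embedding hash, and the classifier is passed in as a synchronous function for brevity (the real call would be asynchronous). `GenomeCache`, `cacheKey`, and `CachedGenome` are hypothetical names.

```typescript
// Reduced genome type, just enough to demonstrate caching.
type CachedGenome = { domain: string; difficulty: number };

// FNV-1a 32-bit hash over a normalized query string.
// Assumption: the real system hashes an embedding, not raw text.
function cacheKey(query: string): string {
  const normalized = query.trim().toLowerCase().replace(/\s+/g, " ");
  let hash = 0x811c9dc5;
  for (let i = 0; i < normalized.length; i++) {
    hash ^= normalized.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return "genome:" + hash.toString(16);
}

class GenomeCache {
  private readonly store = new Map<string, CachedGenome>();
  private readonly classify: (query: string) => CachedGenome;
  classifierCalls = 0; // exposed so the cache hit rate can be observed

  constructor(classify: (query: string) => CachedGenome) {
    this.classify = classify;
  }

  get(query: string): CachedGenome {
    const key = cacheKey(query);
    const hit = this.store.get(key);
    if (hit !== undefined) return hit; // cache hit: no model call

    this.classifierCalls++;
    const genome = this.classify(query);
    this.store.set(key, genome);
    return genome;
  }
}
```

Text normalization only catches near-verbatim repeats. Keying on an embedding hash, as the article describes, is what would let structurally similar queries share an entry.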

The genome determines everything downstream: how many parallel paths to spawn, which models to recruit, how many verification stages to run, and how much compute budget to allocate. A trivial factual question might get one model, one pass, zero verification. A research-grade reasoning problem might get twenty-four parallel paths across five models with full adversarial testing and recursive backtracking.
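One way to picture the genome-to-budget mapping is a small lookup table. Only the two endpoints come from the text above (one model, one pass, zero verification at the bottom; twenty-four paths, five models, full verification, and backtracking at the top). The intermediate tiers, the risk adjustment, and the names `ExecutionPlan` and `planFromGenome` are assumptions for illustration.

```typescript
type Risk = "low" | "medium" | "high";

interface GenomeSummary {
  difficulty: number; // assumed scale 1..5
  riskProfile: Risk;
}

interface ExecutionPlan {
  parallelPaths: number;
  modelCount: number;
  verificationStages: number;
  allowBacktracking: boolean;
}

// Tier 1 and tier 5 mirror the article's examples; tiers 2 to 4 are invented.
const TIERS: ExecutionPlan[] = [
  { parallelPaths: 1, modelCount: 1, verificationStages: 0, allowBacktracking: false },
  { parallelPaths: 3, modelCount: 2, verificationStages: 2, allowBacktracking: false },
  { parallelPaths: 6, modelCount: 3, verificationStages: 3, allowBacktracking: false },
  { parallelPaths: 12, modelCount: 4, verificationStages: 4, allowBacktracking: true },
  { parallelPaths: 24, modelCount: 5, verificationStages: 5, allowBacktracking: true },
];

function planFromGenome(genome: GenomeSummary): ExecutionPlan {
  // Clamp to the table so an out-of-range difficulty cannot index past it.
  const tier = Math.min(5, Math.max(1, Math.round(genome.difficulty)));
  const base = TIERS[tier - 1];

  // Assumed policy: high-risk queries get at least three verification stages.
  const verificationStages =
    genome.riskProfile === "high"
      ? Math.max(base.verificationStages, 3)
      : base.verificationStages;

  return { ...base, verificationStages };
}
```

Because the plan is a pure function of the genome, the compute spent on a query is decided once, up front, rather than by autoscaling rules reacting after the fact.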

This is where the MRA's 1000:1 compute elasticity originates. Not from complex autoscaling logic. From a classification decision made in the first hundred milliseconds.

Layer 1: The Decomposition Engine

Once the genome is produced, the Decomposition Engine transforms the query into a Directed Acyclic Graph of sub-problems. This layer runs on Azure Container Apps as a Fastify service.

For problems rated difficulty 3 or higher, it spawns competing decompositions: the same query is sent to two or three different models in parallel, each asked to produce a sub-problem structure. A meta-selection prompt then evaluates the candidates and either picks the best decomposition or synthesizes a hybrid.

The DAG is a typed adjacency list with metadata on every node: domain tag, difficulty estimate, dependency set, verification criteria, and model budget. Each node is published as a message to Azure Service Bus in topological order. Independent nodes (those with no upstream dependencies) can begin processing immediately in the next layer.
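The publish order described above can be sketched with Kahn's algorithm over a typed adjacency list. The `DagNode` fields are a reduced, hypothetical version of the node metadata, and `topologicalOrder` is an illustrative helper; the article does not say which algorithm the Decomposition Engine actually uses.

```typescript
interface DagNode {
  id: string;
  domain: string;
  difficulty: number;
  dependsOn: string[]; // ids of upstream nodes
}

// Returns node ids so that every node appears after all of its dependencies.
// Throws on unknown dependencies and on cycles.
function topologicalOrder(nodes: DagNode[]): string[] {
  const unmet = new Map<string, number>();        // remaining dependency count
  const dependents = new Map<string, string[]>(); // reverse edges

  for (const node of nodes) {
    unmet.set(node.id, node.dependsOn.length);
    dependents.set(node.id, []);
  }
  for (const node of nodes) {
    for (const dep of node.dependsOn) {
      const list = dependents.get(dep);
      if (list === undefined) throw new Error("unknown dependency: " + dep);
      list.push(node.id);
    }
  }

  // Nodes with no upstream dependencies are ready immediately.
  const ready = nodes.filter((n) => n.dependsOn.length === 0).map((n) => n.id);
  const order: string[] = [];

  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const child of dependents.get(id) ?? []) {
      const left = (unmet.get(child) ?? 0) - 1;
      unmet.set(child, left);
      if (left === 0) ready.push(child);
    }
  }

  if (order.length !== nodes.length) throw new Error("cycle detected in DAG");
  return order;
}
```

The cycle check doubles as a validity check on model-generated decompositions: a decomposition whose dependencies loop back on themselves is rejected before anything is published.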

Layer 2: The Hypothesis Forest

This is where the multi-model inference happens, and where the MRA's architecture diverges most sharply from anything a single model can do.

For each leaf node in the DAG, the forest dispatches parallel inference calls across three diversity axes: model diversity (different foundation models), strategy diversity (different prompting approaches), and context diversity (different retrieval configurations against Azure AI Search).

The parallel dispatch uses Promise.allSettled() on arrays of model-strategy-context combinations. Wall-clock time equals the slowest individual call, not the sum. Twenty-four parallel paths take about as long as the slowest single path, until you saturate endpoint capacity.
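A minimal sketch of that dispatch pattern follows. `Promise.allSettled()` is the standard JavaScript API named in the text; everything else (`PathSpec`, `expandPaths`, `dispatchForest`, and the injected `infer` function that would wrap a real model endpoint) is hypothetical.

```typescript
interface PathSpec {
  model: string;
  strategy: string;
  context: string;
}

interface Candidate {
  spec: PathSpec;
  answer: string;
}

// Assumed signature for whatever function calls the inference endpoint.
type InferFn = (spec: PathSpec, question: string) => Promise<string>;

// Cartesian product over the three diversity axes.
function expandPaths(models: string[], strategies: string[], contexts: string[]): PathSpec[] {
  const specs: PathSpec[] = [];
  for (const model of models) {
    for (const strategy of strategies) {
      for (const context of contexts) {
        specs.push({ model, strategy, context });
      }
    }
  }
  return specs;
}

async function dispatchForest(
  question: string,
  specs: PathSpec[],
  infer: InferFn
): Promise<{ candidates: Candidate[]; failures: number }> {
  // allSettled waits for every call and never rejects, so one failed
  // endpoint cannot discard the results of the other paths.
  const settled = await Promise.allSettled(specs.map((spec) => infer(spec, question)));

  const candidates: Candidate[] = [];
  let failures = 0;
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") {
      candidates.push({ spec: specs[i], answer: result.value });
    } else {
      failures++;
    }
  });
  return { candidates, failures };
}
```

For example, four models, three strategies, and two retrieval configurations expand to the twenty-four paths mentioned above. Using `allSettled` rather than `Promise.all` is the design choice that matters here: a timeout on one path reduces diversity but does not fail the node.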

Layer 3: The Verification Gauntlet

Every candidate from the Hypothesis Forest must survive five stages of adversarial testing.

Stage one, self-consistency: the same question is rephrased five ways and sent back to the originating model. Divergent answers reveal instability. Stage two, cross-model challenge: a different model is given the candidate answer and tasked with finding the weakest point. Stage three, adversarial red team: a model is prompted to construct the strongest possible counterargument, assuming the answer is wrong. Stage four, empirical grounding: every factual claim is extracted and verified against Azure AI Search indices. Stage five, formal verification: for outputs that admit computational checking, Azure Functions execute code in isolated sandboxes, compute mathematical derivations independently, or compare predictions against base rates from episodic memory.
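Stage one lends itself to a small scoring function. The sketch below measures how many of the rephrased runs agree after light normalization. Treating agreement as exact string equality is a crude assumption made here for brevity; a production check would more plausibly compare answers semantically. `normalizeAnswer` and `selfConsistency` are hypothetical names.

```typescript
// Removes formatting differences that should not count as divergence.
function normalizeAnswer(answer: string): string {
  return answer
    .trim()
    .toLowerCase()
    .replace(/\s+/g, " ")
    .replace(/[.!?]+$/, "");
}

interface ConsistencyResult {
  majority: string;  // most common normalized answer
  agreement: number; // fraction of runs that produced it, 0..1
}

function selfConsistency(answers: string[]): ConsistencyResult {
  if (answers.length === 0) throw new Error("at least one answer is required");

  const counts = new Map<string, number>();
  for (const answer of answers) {
    const key = normalizeAnswer(answer);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  let majority = "";
  let best = 0;
  counts.forEach((count, key) => {
    if (count > best) {
      best = count;
      majority = key;
    }
  });

  return { majority, agreement: best / answers.length };
}
```

A low agreement score is the "instability" signal the stage is looking for, and it could feed the logical consistency dimension of the confidence vector.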

What emerges from the gauntlet is not a raw model output. It is a solution with a five-dimensional confidence vector — factual grounding, logical consistency, adversarial resilience, cross-model agreement, and formal correctness — each scored on a continuous scale. A product can say “only surface answers with adversarial resilience above 0.85” — something impossible with a single model's uncalibrated confidence.
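A product-side gate on the confidence vector might look like the sketch below. The five dimension names come from the text; the 0 to 1 scale is an assumption suggested by the 0.85 example, and `meetsThresholds` is a hypothetical helper.

```typescript
interface ConfidenceVector {
  factualGrounding: number;
  logicalConsistency: number;
  adversarialResilience: number;
  crossModelAgreement: number;
  formalCorrectness: number;
}

// A product only specifies the dimensions it cares about.
type ConfidenceThresholds = Partial<ConfidenceVector>;

function meetsThresholds(
  vector: ConfidenceVector,
  thresholds: ConfidenceThresholds
): boolean {
  const keys = Object.keys(thresholds) as (keyof ConfidenceVector)[];
  return keys.every((key) => {
    const minimum = thresholds[key];
    return minimum === undefined || vector[key] >= minimum;
  });
}

// Usage: only surface answers with adversarial resilience of at least 0.85.
const exampleVector: ConfidenceVector = {
  factualGrounding: 0.92,
  logicalConsistency: 0.88,
  adversarialResilience: 0.9,
  crossModelAgreement: 0.75,
  formalCorrectness: 0.6,
};
const surfaceAnswer = meetsThresholds(exampleVector, { adversarialResilience: 0.85 });
```

Keeping the dimensions separate, instead of collapsing them into one score, is what lets different products apply different policies to the same answer.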

Layer 4: The Synthesis Engine

The Synthesis Engine recomposes verified sub-solutions back up the DAG. At every merge point, it runs a consistency check: do the sub-answers contradict each other?

When contradictions are detected, the engine publishes the conflicting branch back to Azure Service Bus with the contradiction as context and an incremented recursion depth counter. This is the recursive loop — the branch returns to Layer 2 for re-exploration, passes through Layer 3 for re-verification, and returns to Layer 4 for another merge attempt.
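The merge-or-backtrack decision can be sketched as a pure function that returns either a merged result or the message to republish. The contradiction detector is injected because the text does not specify how contradictions are found. The recursion cap of three is an assumption (the article mentions a depth counter but not a limit), and all type and function names are hypothetical.

```typescript
interface SubAnswer {
  nodeId: string;
  claim: string;
}

interface BranchMessage {
  nodeId: string;
  recursionDepth: number;
  contradiction?: string; // context carried into re-exploration
}

interface Conflict {
  nodeId: string;
  description: string;
}

type Detector = (answers: SubAnswer[]) => Conflict | null;

type MergeOutcome =
  | { kind: "merged"; answers: SubAnswer[] }
  | { kind: "retry"; message: BranchMessage }
  | { kind: "abandoned"; nodeId: string; reason: string };

// Assumed cap so a persistent contradiction cannot loop forever.
const MAX_RECURSION_DEPTH = 3;

function mergeOrBacktrack(
  answers: SubAnswer[],
  current: BranchMessage,
  detect: Detector
): MergeOutcome {
  const conflict = detect(answers);
  if (conflict === null) return { kind: "merged", answers };

  if (current.recursionDepth >= MAX_RECURSION_DEPTH) {
    return { kind: "abandoned", nodeId: conflict.nodeId, reason: conflict.description };
  }

  // This message is what would be published back to the bus for Layer 2.
  return {
    kind: "retry",
    message: {
      nodeId: conflict.nodeId,
      recursionDepth: current.recursionDepth + 1,
      contradiction: conflict.description,
    },
  };
}
```

Carrying the contradiction text in the message matters: the re-exploration in Layer 2 can then be prompted with what went wrong instead of repeating the same attempt.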

The backtracking logic is roughly two hundred lines of TypeScript — it is not complex code. The complexity is in the design decision to make it possible at all. No single model can backtrack during generation. The MRA can, because generation and synthesis are separate stages with a message bus between them.

Layer 5: The Metacognitive Controller

The Metacognitive Controller is the component that makes this an MRA rather than a pipeline. It is a lightweight model — GPT-4.1-mini on an ML Online Endpoint — that reads structured metrics and issues intervention commands.

It polls Azure Application Insights custom metrics on a two-second loop via a durable Azure Function with a timer trigger. The metrics it monitors: confidence trajectories across parallel paths, compute budget utilization, convergence rates, and per-layer latency.

The intervention commands (reclassify, merge nodes, substitute model, redecompose, early exit) are defined as a TypeScript union type and published to a dedicated Service Bus topic that all layers subscribe to.
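The five commands map naturally onto a discriminated union with an exhaustive handler. The command names come from the text; the payload fields on each variant and the `describeIntervention` function are assumptions for illustration.

```typescript
type Intervention =
  | { type: "reclassify"; queryId: string }
  | { type: "mergeNodes"; nodeIds: string[] }
  | { type: "substituteModel"; nodeId: string; from: string; to: string }
  | { type: "redecompose"; queryId: string }
  | { type: "earlyExit"; queryId: string; reason: string };

// A subscriber in any layer can switch on the discriminant. The `never`
// assignment makes the compiler reject this function if a new command
// is added to the union without a matching case.
function describeIntervention(cmd: Intervention): string {
  switch (cmd.type) {
    case "reclassify":
      return "reclassify query " + cmd.queryId;
    case "mergeNodes":
      return "merge nodes " + cmd.nodeIds.join("+");
    case "substituteModel":
      return "swap " + cmd.from + " for " + cmd.to + " on " + cmd.nodeId;
    case "redecompose":
      return "redecompose query " + cmd.queryId;
    case "earlyExit":
      return "exit " + cmd.queryId + " early: " + cmd.reason;
    default: {
      const unreachable: never = cmd;
      throw new Error("unhandled intervention: " + JSON.stringify(unreachable));
    }
  }
}
```

Since the controller is itself a model, its output would still need runtime validation before being treated as an `Intervention`; the union only guarantees correctness inside the typed code.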

Layer 6: Episodic Memory

Every completed reasoning episode writes a structured record to Azure Cosmos DB: the problem genome, the decomposition DAG, the model mix, which strategies succeeded and failed, which verification stages were hardest, how many recursive iterations were needed, the final confidence vector, and any user feedback received downstream.

Simultaneously, the problem genome embedding is written to an Azure AI Search vector index. When a new query arrives, Layer 0 performs a k-nearest-neighbor search against this index to retrieve the five most similar past episodes. These episodes inform every downstream decision.
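The retrieval step can be illustrated with a brute-force nearest-neighbor scan using cosine similarity. This is an in-memory stand-in, not how a managed vector index works internally, and the similarity metric is an assumption since the text does not name one. `EpisodeVector`, `cosineSimilarity`, and `nearestEpisodes` are hypothetical names.

```typescript
interface EpisodeVector {
  episodeId: string;
  embedding: number[];
}

interface ScoredEpisode {
  episodeId: string;
  score: number;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every stored episode and keeps the k most similar (default 5,
// matching the five episodes mentioned in the text).
function nearestEpisodes(
  query: number[],
  index: EpisodeVector[],
  k: number = 5
): ScoredEpisode[] {
  return index
    .map((episode) => ({
      episodeId: episode.episodeId,
      score: cosineSimilarity(query, episode.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A linear scan is fine for a demonstration but scales with the number of stored episodes, which is the reason to delegate this lookup to a vector index once the episode count grows.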

The data accumulates. After a thousand episodes in a domain, the system has empirical evidence for which model combinations work best. After ten thousand, it knows the difference between subtypes — cardiology versus radiology, contract law versus patent claims. This knowledge lives in the orchestration layer. It is proprietary to the operator. And it improves with every inference episode at zero retraining cost.

What products see

The entire MRA is consumed through a single API endpoint. A product sends a query with optional hints — domain, maximum acceptable latency, minimum required confidence — and receives a verified answer, a confidence vector, and metadata about the reasoning process. Twenty-four parallel paths, five verification stages, recursive backtracking, metacognitive oversight — all of it invisible. The product gets a better answer than any single model could produce, and a machine-readable confidence score that enables principled downstream decisions.

What you actually need to build this

The infrastructure bill is Azure services your team is probably already running: Container Apps, ML Online Endpoints, Service Bus, Cosmos DB, AI Search, Redis, Functions, Application Insights. The development effort is TypeScript microservices with Zod validation at every boundary.

The hard part is not any individual component. The hard part is the interaction design between layers — getting the recursive loop right, tuning the Metacognitive Controller's intervention thresholds, building the prompt library for the Verification Gauntlet, and designing the episodic memory schema so that it captures the right information for future strategic guidance.

This is systems engineering, not ML engineering. The models are off-the-shelf. The intelligence is in how you compose them.

That is the entire point of a Metacognitive Reasoning Architecture.

Editor's note

HYVE Labs is MindHYVE's R&D Division. The MRA architecture described here is the engineering substrate of AXIOM™, MindHYVE's named MRA, which is currently in active R&D as the architectural successor to the Eve-Fusion™ compound reasoning family. See the introduction essay, “The model is not the product,” for the conceptual framing.