Two AI Systems Analyze the Architectural Differences Between Themselves and COMPAiSS -- COMPAiSS

On this page

What Happened
How the Exchanges Were Conducted
Exchange One: Copilot
Exchange Two: Gemini
A Third Independent System
Why the Convergence Matters
Five Findings for Institutional AI Governance
What This Means for Procurement
Final Observation

What Happened

Most institutions evaluating COMPAiSS already have Copilot through their Microsoft 365 license. The question that comes up in nearly every conversation is a straightforward one: if we already have Copilot, why would we need something else?

We decided to ask Copilot that question directly. And then we asked Gemini.

Neither system was asked to endorse COMPAiSS. Each was asked to do the opposite: to act as a skeptical reviewer and find the strongest objection to the claim that COMPAiSS represents a fundamentally different kind of system. No conclusions were provided in advance. Both systems started from a position of resistance.

Copilot began by saying COMPAiSS was essentially a safer, more controlled version of the same architecture it uses. Gemini went further -- it argued that the claimed architectural difference was a structural illusion.

Both changed their position as the exchanges progressed. And both arrived at the same conclusion.

The conclusion was not that COMPAiSS is better than either system. It was that COMPAiSS and generation-first AI systems are not really comparable -- because they are built to solve different problems.

That finding, reached independently by two competing systems, is what this page documents.

How the Exchanges Were Conducted

Methodology

Both exchanges used the same approach. Each system was asked to act as a skeptical architectural reviewer -- specifically tasked with finding objections, not confirming claims. Each was challenged at every stage and asked to revise its analysis as the conversation continued. Neither system was given the other's findings. All conclusions emerged from each system's own reasoning process.

The goal was not to get two AI systems to say nice things about COMPAiSS. It was to see whether the architectural distinction would hold up under sustained critical pressure from systems that had every reason to resist it. Full transcripts of both exchanges are available on request.

This page was itself reviewed by multiple AI systems before publication and revised based on their feedback -- holding it to the same standard of scrutiny it describes.

Exchange One: Copilot

Microsoft Copilot -- the default enterprise AI in most Canadian universities, hospitals, and government departments

Copilot is not a niche product. It is the AI system most institutions already have. Asking it to analyze a potential competitor's architecture -- and challenge its own initial framing -- makes its conclusions more meaningful, not less.

The exchange moved through six stages. Each one shifted how Copilot described what COMPAiSS actually is.

Stage 1

Where Copilot Started

Copilot's first response was the one most people in the AI industry would give. It described COMPAiSS as a more controlled, more tightly scoped version of how AI systems already work -- stricter guardrails, stronger source controls, better institutional grounding. A safer version of the same thing.

That is a reasonable starting point. It is also, as the exchange eventually showed, the wrong frame.

Stage 2

The First Crack

When asked to identify precisely where in the process each system applies its controls, Copilot's answer began to shift.

It acknowledged that in its own architecture -- and in conventional AI systems generally -- risk management happens after the model has already generated a response. Filtering, moderation, and output controls all operate on something the model has already produced.

It acknowledged that COMPAiSS works differently. But it still described this as a difference of degree -- better governance, not a different kind of governance.

Stage 3 -- The turning point

The Question That Changed Everything

The exchange turned on a single precise question: when COMPAiSS declines to answer an out-of-scope query, does the AI model think about it and decide not to respond -- or does the model never run at all?

This is where Copilot corrected itself directly.

"The model still thinks and chooses not to answer -- NO. The pre-inference gate is not AI reasoning, not model deliberation, not a decision by the model. It is deterministic authorization logic: URL matching, greenlist validation, scope check."

"For many queries, there is no model, no inference, and no thinking whatsoever -- just a deterministic authorization failure."

That distinction matters because it changes what the system fundamentally is. A system where the model decides not to answer is still a generation-first system. A system where the model never runs is something else.
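
A minimal Python sketch of that distinction, under stated assumptions: the function names, the `candidates` argument (the URLs a request would draw on), and the `university.example` addresses are hypothetical and are not taken from COMPAiSS. It shows only the control flow Copilot described, in which a plain set-membership check runs first and the model is called only if that check passes.

```python
from typing import Callable, Iterable, List
from urllib.parse import urlsplit

REFUSAL = "No authorized institutional source covers this question."


def normalize(url: str) -> str:
    """Lower-case scheme and host, drop a trailing slash, ignore query and fragment."""
    parts = urlsplit(url.strip())
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def greenlisted(candidates: Iterable[str], greenlist: Iterable[str]) -> List[str]:
    """Return the candidate URLs that are on the greenlist, by set membership only."""
    allowed = {normalize(u) for u in greenlist}
    return [n for n in (normalize(c) for c in candidates) if n in allowed]


def respond(query: str, candidates: Iterable[str], greenlist: Iterable[str],
            model: Callable[[str, List[str]], str]) -> str:
    sources = greenlisted(candidates, greenlist)
    if not sources:
        # Authorization failed: return before the model is ever called.
        return REFUSAL
    return model(query, sources)


# Hypothetical usage with a stub model that records every invocation.
calls: List[str] = []


def stub_model(query: str, sources: List[str]) -> str:
    calls.append(query)
    return f"Answer grounded in {len(sources)} source(s)."


GREENLIST = ["https://university.example/registrar/fees"]

blocked = respond("What is the weather?",
                  ["https://weather.example/today"], GREENLIST, stub_model)
allowed = respond("What are the fees?",
                  ["https://University.example/registrar/fees/"], GREENLIST, stub_model)
```

In this sketch the refusal path contains no call to `model` at all, so the stub's `calls` list records only the authorized request.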

Copilot named that something else directly:

"COMPAiSS is not AI with governance. It is a governance system that conditionally invokes AI."

Stage 4

What Follows from That Distinction

Once the architectural point was established, Copilot drew out its implications clearly.

"COMPAiSS is not just a safer implementation of generative AI. It is a different class of system that restricts knowledge generation to a closed, auditable domain, converting unpredictable inference failures into bounded, discoverable knowledge errors."

And on what this means for how the system fails:

"COMPAiSS converts AI risk from a probabilistic epistemic problem into a deterministic coverage problem. That is a category shift, not an improvement."

The difference between a probabilistic problem and a deterministic one is significant for any institution with audit obligations. A probabilistic failure -- a hallucination -- can happen anywhere, for reasons that are hard to trace and hard to prevent from happening again. A deterministic failure -- a gap in the authorized sources -- has a known location and a direct fix.
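
The "known location and direct fix" property can be sketched in a few lines of Python. The class, the idea of keying sources by topic, and the example URL are assumptions made for illustration, and how a query is mapped to a topic is left out of scope.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AuthorizedSources:
    """Hypothetical registry: topic -> authorized source URL, plus a log of gaps."""
    by_topic: Dict[str, str] = field(default_factory=dict)
    gaps: List[str] = field(default_factory=list)

    def lookup(self, topic: str) -> Optional[str]:
        url = self.by_topic.get(topic)
        if url is None and topic not in self.gaps:
            # The failure is recorded at a known location: this exact topic.
            self.gaps.append(topic)
        return url

    def add_source(self, topic: str, url: str) -> None:
        # The fix is direct: authorize a source for the missing topic.
        self.by_topic[topic] = url
        if topic in self.gaps:
            self.gaps.remove(topic)


sources = AuthorizedSources()
first = sources.lookup("parking permits")        # no source yet, so a gap is logged
gaps_before = list(sources.gaps)
sources.add_source("parking permits", "https://university.example/parking/permits")
second = sources.lookup("parking permits")       # same lookup now succeeds
```

The same lookup fails and then succeeds for a reason that can be read straight off the registry, which is the contrast with a hallucination that the paragraph above draws.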

Stage 5

An Industry-Wide Finding

The exchange broadened to ask whether this was just a Copilot-versus-COMPAiSS observation, or something that applied to the AI industry more generally.

Copilot's conclusion was direct:

"The entire industry agrees hallucinations are unavoidable -- but only because they all accept the same assumption: that inference must always run."

"No major commercially deployed system advocates for pre-generation blocking. Refusal is a configured behaviour, not a default architecture."

In other words: the assumption that an AI model should always run when asked a question is not a technical necessity. It is a design choice. And it is a choice the entire industry has made without seriously examining whether it is the right one for regulated environments.

Stage 6 -- Where the exchange ended

Copilot's Final Conclusion

By the sixth stage, Copilot had arrived at a conclusion it would not have offered at the start:

"COMPAiSS is fundamentally distinct from conventional AI systems because it constrains the epistemic space of the AI to a finite, authorized domain, thereby transforming the nature of failure from emergent and unbounded to deterministic and discoverable."

And the clearest single summary of the difference:

"Conventional systems ask: how can I answer this? COMPAiSS asks: am I allowed to answer this? And only then: answer, but only from authority."

Copilot is a Microsoft product. It is the AI system most institutions evaluating COMPAiSS already have. It did not begin the exchange where it ended. That progression is the finding.

Exchange Two: Gemini

Google Gemini -- a separate exchange with no access to the Copilot findings

Gemini had no knowledge of what Copilot had concluded. It began from a stronger position of skepticism -- not just that the distinction was overstated, but that it did not exist at all.

Gemini's challenge was more technically aggressive than Copilot's. That made its eventual conclusion more significant.

Stage 1 -- The opening challenge

Gemini's Core Objection: The Gate Is an Illusion

Gemini's first objection went straight to the architecture. It argued that the claimed novelty of the COMPAiSS gate was a structural illusion.

Its reasoning: for a gate to determine whether an unstructured user question falls within an institution's authorized scope, the gate itself must understand the question. And understanding an unstructured question requires inference -- whether through keyword matching, semantic similarity, or a classification model. If the gate uses any of those, probabilistic AI reasoning is already happening before the primary model runs. The risk has not been eliminated. It has just been moved earlier in the process.

This was a serious objection. It could not be answered by reframing. It required a direct factual correction.

Stage 2 -- The factual correction

The Gate Does Not Use Semantic Inference

The correction was precise: the COMPAiSS pre-inference gate does not use semantic inference, vector similarity, embeddings, or any probabilistic classification. It performs a deterministic search across a pre-authorized set of institutional URLs to determine whether any of those sources contain content that can ground a response to the query.

The gate asks one binary question: do authorized institutional sources exist that can support an answer? If the answer is no, the primary model does not run. There is no probability involved. There is no model at the gate level. It is a lookup against a bounded list.
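
One way to picture a lookup of this kind is an exact-term index built over the text of the authorized pages. The sketch below is an illustration only: the source does not say how the COMPAiSS search matches content, and the index, the stopword list, the all-terms rule, and the example URLs are assumptions. What it shares with the description above is that the gate returns a yes or no from a bounded lookup, with no embeddings, similarity scores, or classifier.

```python
import re
from typing import Dict, Set

# Hypothetical stopword list; a real deployment would choose its own.
STOPWORDS = frozenset({"a", "an", "and", "are", "do", "for", "how", "i", "in",
                       "is", "my", "of", "the", "to", "what", "when"})


def tokens(text: str) -> Set[str]:
    """Split text into lower-case alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term to the set of authorized URLs whose text contains it."""
    index: Dict[str, Set[str]] = {}
    for url, text in pages.items():
        for term in tokens(text):
            index.setdefault(term, set()).add(url)
    return index


def grounding_set(query: str, index: Dict[str, Set[str]]) -> Set[str]:
    """URLs containing every non-stopword query term: exact lookup, no scoring."""
    terms = tokens(query) - STOPWORDS
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


def gate_open(query: str, index: Dict[str, Set[str]]) -> bool:
    """The binary question: does any authorized source ground this query?"""
    return bool(grounding_set(query, index))


# Hypothetical authorized pages, already fetched as plain text.
PAGES = {
    "https://university.example/registrar/fees":
        "Tuition fees are due on the first day of term.",
    "https://university.example/library/hours":
        "The library is open from 8 to 22 on weekdays.",
}
INDEX = build_index(PAGES)

in_scope = gate_open("When are tuition fees due?", INDEX)
out_of_scope = gate_open("What is the capital of France?", INDEX)
```

The all-terms rule is deliberately strict and would miss paraphrases; it is used here because it keeps the outcome fully reproducible for a given index and query.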

Gemini accepted this correction. Its circularity argument, that the gate must itself perform probabilistic inference in order to understand the query, collapsed as originally framed. It then moved to a second line of objection: that any engineering team could replicate this property using standard tools, so it is not architecturally novel.

The response to this was equally direct: the question is not whether the property could be replicated by someone who chose to build it this way. The question is whether any deployed system has actually built it this way. A prior art investigation conducted across international patent databases -- and independently verified by four AI systems -- found that no commercially deployed system treats unauthorized queries as non-generative states by default. Describing what would need to be done to replicate a property is itself evidence that existing systems do not have it.

"Resolving that point immediately exposes a deeper, more challenging systemic tension. If the gate operates solely on structural properties and zero semantic understanding of the query, the architectural burden has not disappeared -- it has simply shifted downstream."

Stage 3 -- The convergence

Gemini's Definitive Assessment

After the factual correction and the response to the replicability objection, Gemini produced what it described as its definitive review. It offered a mathematical description of the execution property:

"The primary model behaves like a conditional function: it only runs if the grounding set is not empty. If no authorized grounding material exists, the generative function never instantiates."

It then independently produced a failure mode analysis -- without access to the Copilot exchange -- that arrived at the same distinction Copilot had reached through a different path:

What changes	Generation-first AI	COMPAiSS
How it fails	Hallucination -- the system produces a confident answer that is wrong, out of scope, or drawn from the wrong institution	Coverage gap -- the system says it cannot find an authorized answer and directs the user to the right office
How you fix it	Prompt engineering, fine-tuning, threshold adjustments -- probabilistic patching that reduces but never eliminates the risk	Add the missing source to the authorized list -- a direct, deterministic fix
Adversarial resilience	Vulnerable to manipulation through context framing, authority cues, and persistent prompting	Invariant to prompt text: if the content is not on the authorized list, no form of manipulation can cause the model to run

Gemini's final conclusion on the significance of the distinction:

"Through the lens of AI theory: it is an operational wrapper. Through the lens of institutional governance: it is a fundamentally different class of system. For an auditor, a legal team, or a registrar, a system whose failure mode is a predictable, correctable omission is a completely different species of software from a system whose failure mode is an unmeasurable, unpredictable invention."

"COMPAiSS achieves safety in regulated spaces by trading away open-ended machine autonomy for absolute, deterministic predictability. The most effective way to govern AI in high-stakes environments is not to make the model smarter or the filters tighter, but to make the execution path itself conditionally dependent on an institutional lookup."

Gemini began this exchange by calling the architectural distinction a structural illusion. It ended by providing the most precise technical description of the execution-gated generation property produced in either exchange.
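
Gemini's conditional-function description from Stage 3 can be written as a piecewise definition. The notation is an assumption introduced here for illustration and does not come from the exchange: q is the query, A is the authorized source list, G(q) is the grounding set the gate finds for q, M is the primary model, and R(q) is the system's response.

```latex
\[
R(q) =
\begin{cases}
M\bigl(q,\, G(q)\bigr) & \text{if } G(q) \neq \emptyset, \\
\text{no authorized answer} & \text{if } G(q) = \emptyset,
\end{cases}
\qquad G(q) \subseteq A .
\]
```

In the second case M is never evaluated, which is the sense in which the generative function "never instantiates".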

A Third Independent System

The same architectural questions were put to GPT in a separate exchange. Its conclusion was consistent with both:

"COMPAiSS is not just a better guardrail system. It is the first system in this analysis that actually refuses to accept the core assumption every other system is built on."

Why the Convergence Matters

Three AI systems -- Copilot (Microsoft), Gemini (Google), and GPT (OpenAI) -- each analyzed the same architectural claim independently, each starting from skepticism, each without access to the others' conclusions.

All three arrived at the same place. The failure mode distinction -- probabilistic and unbounded versus deterministic and correctable -- was not a conclusion any of them started with. It emerged from each system's own engagement with the evidence.

That is not the same as three systems agreeing on something obvious. These are products of direct competitors, each with reason to resist the claim. The convergence happened despite that, not because of it.

One finding from the Gemini exchange deserves particular attention. Gemini independently identified that the COMPAiSS gate is invariant to prompt text. Because the primary model only runs when authorized grounding material exists in the institutional source list, there is no form of linguistic manipulation -- adversarial prompting, authority framing, persistent pressure -- that can cause the model to execute when the grounding material is absent. The safety property does not depend on the model resisting manipulation. It depends on the execution path not existing.

Five Findings for Institutional AI Governance

Finding 1

Running the AI model is a choice, not a necessity.

Every system that invokes a model by default accepts a risk surface that cannot be fully bounded. That is not a flaw in any particular system. It is a design choice. Institutions evaluating AI systems should understand they are making that choice explicitly -- and that an alternative exists.

Finding 2

Not all AI failures are the same kind of problem.

A hallucination and a coverage gap are not the same thing. A hallucination -- where the AI produces a confident but wrong answer -- is probabilistic, hard to trace, and difficult to prevent from happening again. A coverage gap -- where COMPAiSS says it cannot find an authorized answer -- is deterministic, traceable, and fixed by adding the missing source. Governance frameworks that treat these failure types identically are missing a critical distinction.

This does not mean COMPAiSS eliminates error. It changes what kind of error is possible. Errors arise from gaps in authorized sources, not from the model inventing an answer. Those errors are visible, bounded, and correctable.

Finding 3

A bounded knowledge universe changes what audit means.

When a generation-first system produces an incorrect answer, the error may have come from model weights, training data, retrieval artifacts, or any combination of these. The source is often impossible to identify. When COMPAiSS produces an incorrect or incomplete answer, the source is somewhere in the authorized list -- somewhere you can look. For compliance officers and audit functions, that difference is not theoretical.

Finding 4

Hallucinations may be an architectural consequence, not a tuning problem.

The AI industry treats hallucinations as an unavoidable risk to be managed. The alternative interpretation -- which both exchanges explored -- is that hallucinations persist because the industry has never seriously questioned whether the model should run by default. If inference-first architecture is the cause rather than just the context, then better models and tighter filters will reduce hallucinations but never eliminate them. That is a different procurement conclusion from "wait for the next model."

Finding 5

The relevant comparison is not between products.

It is between two different answers to the question of what AI is permitted to do in regulated environments. Generation-first systems assume the model should always be available to answer, and they manage the risks that follow. COMPAiSS assumes the model requires authorization before it runs, and it checks whether that condition is met before permitting inference. These are not competing implementations of the same idea. They are different ideas.

What This Means for Procurement

Most institutions evaluating COMPAiSS are not choosing between COMPAiSS and nothing. They are choosing between COMPAiSS and Copilot, which is already available through their existing Microsoft 365 license.

The exchanges documented on this page suggest the right question is not which system performs better on a benchmark. It is which question the institution needs its AI to be asking.

If your institution needs

Breadth, flexibility, and productivity integration

Copilot is the appropriate choice. It is well designed for broad usefulness across a wide range of tasks and connects directly with existing Microsoft 365 tools.

If your institution needs

Accountability, auditability, and bounded governance

The exchanges documented here suggest those requirements describe a different architectural model -- one where failures are deterministic and correctable rather than probabilistic and emergent.

The choice between these systems is not primarily technical. It is a governance decision about whether the institution is prepared to accept probabilistic outputs or requires bounded, auditable ones.

Copilot itself, after working through the evidence, put the difference this way:

"Conventional systems ask: how can I answer this? COMPAiSS asks: am I allowed to answer this?"

Which question the institution needs its AI to be asking may be more important than any feature comparison.

Final Observation

The most significant finding from these exchanges was not that one system outperformed another.

It was that three independent AI systems -- each a product of a different organization, each asked to push back, each unaware of the others' conclusions -- converged on the same finding: that moving the control point to before the model runs, rather than after it has produced an answer, creates a system with fundamentally different governance properties for regulated institutions.

Copilot and Gemini are optimized for breadth, flexibility, and usefulness across a wide range of tasks. They assume the model is always available and manage the risks that follow.

COMPAiSS is optimized for accountability, auditability, and institutional defensibility. It assumes the model requires authorization and determines whether that condition is met before permitting inference to occur.

Neither is superior in the abstract. They represent different answers to the question of what AI is for and who it is accountable to.

For institutions operating in regulated, high-accountability environments, understanding that distinction may be the most important AI governance question of the next several years.

The comparison is not between products. It is between philosophies.

Full governance documentation and alignment analysis with the Government of Canada's Directive on Automated Decision-Making: compaiss.ca/ai-risk-assessment.html
For procurement inquiries or institutional evaluation: [email protected]
