MAC: Multi-Agent Control — Symbolic MIMO Channel Framework
Multi-Agent Control (MAC) framework for auraison. Simulation code: control-plane/backend/mac/
1. Introduction
This document formalizes multi-agent context communication as a symbolic multi-input multi-output (MIMO) channel, using the language of information theory. Agents exchange symbolic messages over this channel, and the messages carry context: a structured representation of knowledge, beliefs, or environment state.
2. Multi-Agent Communication as a Channel
Consider a system of $N$ agents $A_1, \dots, A_N$.
Each agent maintains an internal context state $C_i(t) \in \mathcal{C}$,
where $\mathcal{C}$ is the space of symbolic representations (graphs, tokens, plans, embeddings, etc.).
When agents communicate, they transmit messages $M_i(t) = f_i(C_i(t))$
that are functions of their internal context.
The messages pass through a communication channel and are received by other agents.
3. MIMO Symbolic Channel Formulation
Let
$$\mathbf{M}(t) = \big(M_1(t), \dots, M_N(t)\big)$$
be the vector of transmitted messages.
The channel produces received messages
$$\mathbf{Y}(t) = \big(Y_1(t), \dots, Y_N(t)\big)$$
with conditional probability
$$p\big(\mathbf{Y}(t) \mid \mathbf{M}(t)\big).$$
This is formally a MIMO communication channel: $N$ inputs, $N$ outputs, and a joint transition law that couples them.
Each agent then updates its context:
$$C_i(t+1) = g_i\big(C_i(t), Y_i(t)\big).$$
Thus the full system evolution is
$$\mathbf{C}(t+1) = G\big(\mathbf{C}(t), \mathbf{Y}(t)\big), \qquad \mathbf{Y}(t) \sim p\big(\cdot \mid \mathbf{M}(t)\big), \qquad \mathbf{M}(t) = F\big(\mathbf{C}(t)\big).$$
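A minimal sketch of this encode–channel–update loop, assuming a toy symbolic vocabulary, a simple symbol-dropping channel, and hypothetical `encode`/`channel`/`update` functions (none of these are part of the MAC codebase):

```python
import random

VOCAB = ["find", "grasp", "place", "mug", "table"]  # toy symbolic alphabet

def encode(context):
    """f_i: map an agent's context (a set of symbols) to a message (a list of symbols)."""
    return sorted(context)

def channel(messages, drop_prob=0.1):
    """p(Y|M): each symbol is independently dropped with probability drop_prob."""
    return [[s for s in m if random.random() > drop_prob] for m in messages]

def update(context, received):
    """g_i: merge received symbols into the agent's context."""
    return context | set(received)

# N agents, each starting with a partial view of the task
contexts = [{"find", "mug"}, {"grasp"}, {"place", "table"}]

for t in range(3):
    msgs = [encode(c) for c in contexts]              # M(t) = F(C(t))
    recv = channel(msgs)                              # Y(t) ~ p(. | M(t))
    pooled = [s for m in recv for s in m]             # broadcast: pool all received symbols
    contexts = [update(c, pooled) for c in contexts]  # C(t+1) = G(C(t), Y(t))

print(contexts)
```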
4. Symbolic Nature of the Channel
Unlike classical communication systems where symbols are bits, here the symbols belong to a structured alphabet $\mathcal{X}$.
Examples:
| Symbol type | Meaning |
|---|---|
| Natural language tokens | reasoning traces |
| PDDL operators | plans |
| JSON tool calls | actions |
| embeddings | semantic summaries |
| scene graphs | environment state |
Thus messages are sequences $M_i = (x_1, x_2, \dots, x_L)$ with each $x_k \in \mathcal{X}$.
5. Information-Theoretic Quantities
Channel capacity
The maximum information exchange between agents:
$$C = \max_{p(\mathbf{M})} I(\mathbf{M}; \mathbf{Y})$$
This represents the maximum context transfer rate.
Mutual information between agents
For two agents $i$ and $j$:
$$I(C_i; C_j) = H(C_i) - H(C_i \mid C_j)$$
measures shared knowledge.
Communication increases this mutual information.
Context compression
Agents typically compress context before transmission:
$$M_i = \phi_i(C_i)$$
Information theory interpretation:
$$R_i < H(C_i)$$
where
- $R_i$ = message rate
- $H(C_i)$ = entropy of the context.
Large contexts require summarization or embeddings.
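A minimal sketch of how the shared-knowledge quantity $I(C_i; C_j)$ could be estimated empirically, assuming two agents' contexts are reduced to discrete symbol histograms (the plug-in estimator below is a simplification and is biased for small samples):

```python
import numpy as np

def mutual_information(joint_counts):
    """Plug-in estimate of I(C_i; C_j) from a joint count table over symbol pairs."""
    p = joint_counts / joint_counts.sum()
    pi = p.sum(axis=1, keepdims=True)   # marginal distribution of agent i
    pj = p.sum(axis=0, keepdims=True)   # marginal distribution of agent j
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / (pi @ pj)[mask])))

# joint occurrence counts of 3 context symbols observed at two agents (toy data)
counts = np.array([[30, 5, 1],
                   [4, 25, 3],
                   [2, 6, 24]], dtype=float)
print(f"I(C_i; C_j) ~= {mutual_information(counts):.3f} bits")
```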
6. Noise and Ambiguity
Language communication introduces channel noise: the received message $Y_j$ is, with some probability, a corrupted version of the transmitted message $M_i$.
Examples:
| Noise source | Effect |
|---|---|
| ambiguous language | semantic distortion |
| hallucination | channel corruption |
| lossy summarization | information loss |
| tool failures | message drop |
This makes the channel stochastic.
7. Context Synchronization Problem
Agents aim to minimize context divergence
$$D_{\mathrm{KL}}\big(p(C_i) \,\|\, p(C_j)\big)$$
where $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence.
Communication protocols attempt to enforce
$$D_{\mathrm{KL}}\big(p(C_i) \,\|\, p(C_j)\big) \to 0 \quad \text{as } t \to \infty.$$
This is analogous to distributed consensus in multi-agent systems.
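A minimal sketch of measuring that divergence, assuming each agent's context is summarized as a probability distribution over a shared symbol vocabulary (the smoothing constant is an arbitrary choice to keep the divergence finite):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q) in bits, with additive smoothing so zero entries stay finite."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log2(p / q)))

# symbol distributions of two agents before and after a communication round (toy data)
agent_i_before, agent_j_before = [0.7, 0.2, 0.1], [0.2, 0.5, 0.3]
agent_i_after,  agent_j_after  = [0.5, 0.3, 0.2], [0.4, 0.35, 0.25]

print("divergence before:", kl_divergence(agent_i_before, agent_j_before))
print("divergence after: ", kl_divergence(agent_i_after, agent_j_after))
```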
8. Relation to Agent Architectures
In modern AI systems:
| Component | Information-theory role |
|---|---|
| LLM reasoning | encoder |
| tool output | channel observation |
| prompt construction | message encoding |
| memory store | channel state |
| agent update | decoder |
Thus the system becomes a pipeline of the form
$$C_i \;\xrightarrow{\text{encode}}\; M_i \;\xrightarrow{\text{channel}}\; Y_j \;\xrightarrow{\text{decode}}\; C_j'.$$
9. Emergent Communication
If agents learn to communicate, the system optimizes
$$\max_{f_i, g_i} \; \sum_{i \ne j} I(C_i; C_j)$$
subject to bandwidth constraints.
This leads to emergent symbolic protocols, similar to:
- differentiable communication channels in multi-agent RL
- language emergence studies.
10. Multi-Agent Communication Graph
Communication often follows a graph
$$G = (V, E)$$
where
- $V$ = agents
- $E$ = communication channels.
Information then flows only along edges: agent $i$ can transmit to agent $j$ only if $(i, j) \in E$,
which resembles network information theory.
11. Interpretation for Agentic AI Systems
In modern agent frameworks such as Claude Code agents, Ray distributed agents, OpenClaw multi-agent orchestration, and robotics VLA agents, each agent acts as:
Context → Encoder → Message → Channel → Decoder → Updated Context
Context = memory + environment state + reasoning traces.
12. Practical Implications
This formulation explains several practical phenomena.
Context window limits
A bandwidth constraint: the prompt caps the message rate at $R \le B$ tokens.
Summarization agents
Compression operators $\phi: C \mapsto \hat{C}$ with $|\hat{C}| \ll |C|$.
Planning agents
Encoding structured plans instead of raw context.
Vector stores
Externalizing channel memory.
13. Summary of Channel Formulation
Multi-agent context communication can be modeled as:
$$M_i(t) = f_i\big(C_i(t)\big), \qquad \mathbf{Y}(t) \sim p\big(\cdot \mid \mathbf{M}(t)\big), \qquad C_i(t+1) = g_i\big(C_i(t), Y_i(t)\big)$$
where:
- agents are distributed information processors
- communication is a symbolic MIMO channel
- the objective is maximizing mutual context information under bandwidth and noise constraints.
14. Rate-Distortion Interpretation
The rate–distortion perspective provides a precise way to understand why modern agent architectures — LLM context windows, summarization, vector retrieval, and memory stores — appear to work well. In a multi-agent or single-agent reasoning system, the fundamental constraint is limited channel bandwidth, which forces compression of context before reasoning or communication.
14.1 Context Window as a Bandwidth Constraint
Let an agent possess a full internal state $C$,
which includes
- observations
- memory
- tool outputs
- reasoning traces
- environment state.
The entropy of this context is $H(C)$.
However, the LLM can only receive a limited number of tokens.
If the context window allows $B$ tokens, then the transmitted representation $X$ must satisfy
$$|X| \le B.$$
Thus
$$X = \phi(C)$$
is a compressed representation of the context.
This is exactly the rate constraint
$$R \le B$$
in rate–distortion theory.
14.2 Distortion of Context
Compression inevitably loses information.
Define a distortion function
$$d(C, \hat{C}),$$
where $\hat{C}$ is the reconstructed context used by the model.
Examples of distortion:
| Distortion type | Example |
|---|---|
| semantic loss | missing facts |
| temporal loss | missing earlier events |
| reasoning loss | lost intermediate thoughts |
| structural loss | incomplete graph |
14.3 Rate-Distortion Function
The optimal trade-off is defined by
$$R(D) = \min_{p(\hat{C} \mid C)} I(C; \hat{C})$$
subject to
$$\mathbb{E}\big[d(C, \hat{C})\big] \le D.$$
Interpretation:
- $R$ = tokens transmitted
- $D$ = context error
- $I(C; \hat{C})$ = preserved information.
14.4 Why Summaries Work
A summarization agent computes
$$S = \phi(C), \qquad |S| \le B,$$
designed to minimize distortion for a given token budget.
The ideal summarizer approximates
$$\phi^* = \arg\min_{\phi} \; \mathbb{E}\big[d(C, \hat{C})\big]$$
subject to
$$|S| \le B.$$
Thus summarization is lossy compression optimized for reasoning relevance.
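A minimal sketch of how this trade-off could be measured empirically. Here `summarize(text, max_tokens)` and `task_error(summary)` are placeholder names for whatever summarizer and downstream evaluation an experiment actually uses; sweeping the token budget traces an empirical rate–distortion curve:

```python
def rate_distortion_curve(corpus, budgets, summarize, task_error):
    """For each token budget B (rate), measure downstream error (distortion proxy)."""
    curve = []
    for budget in budgets:
        summary = summarize(corpus, max_tokens=budget)   # lossy compression at rate <= B
        distortion = task_error(summary)                 # proxy for E[d(C, C_hat)]
        curve.append((budget, distortion))
    return curve

if __name__ == "__main__":
    # stand-in summarizer and scorer, purely illustrative
    summarize = lambda text, max_tokens: " ".join(text.split()[:max_tokens])
    task_error = lambda summary: 1.0 / (1.0 + len(summary.split()))  # toy distortion
    corpus = "agent observations reasoning traces tool outputs " * 50
    for rate, dist in rate_distortion_curve(corpus, [50, 100, 200, 500], summarize, task_error):
        print(f"B={rate:4d} tokens -> distortion {dist:.4f}")
```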
14.5 Vector Retrieval as Side Information
RAG systems modify the channel model.
Instead of sending the full context, the system transmits a query $q$,
which retrieves external information $K$
from a memory store.
The LLM receives
$$(X, K).$$
This resembles source coding with side information (Wyner–Ziv coding).
The rate–distortion function becomes the conditional one,
$$R(D \mid K) = \min_{p(\hat{C} \mid C, K)} I(C; \hat{C} \mid K) \;\le\; R(D).$$
The external memory reduces the required bandwidth.
14.6 Multi-Agent Memory Sharing
In a multi-agent system:
Agent $i$ transmits compressed context
$$M_i = \phi_i(C_i)$$
to another agent $j$.
The receiving agent reconstructs
$$\hat{C}_i = g_j(M_i)$$
and updates its state
$$C_j \leftarrow u_j(C_j, \hat{C}_i).$$
The efficiency depends on
$$I(C_i; \hat{C}_i),$$
the preserved information between agents.
14.7 Why Planning Helps
Structured plans reduce entropy.
Raw context entropy: $H(C)$.
Plan representation entropy: $H(P)$,
with
$$H(P) \ll H(C).$$
Example:
Instead of transmitting
observations + reasoning + history
the agent transmits
PLAN:
1. find mug
2. grasp mug
3. place mug on table
This acts as semantic compression.
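A minimal sketch of how this entropy gap could be checked, using compressed size (gzip) as a crude proxy for entropy; the example strings are illustrative only:

```python
import gzip

def compressed_bytes(text: str) -> int:
    """Crude entropy proxy: size of the gzip-compressed UTF-8 encoding."""
    return len(gzip.compress(text.encode("utf-8")))

raw_context = (
    "Observation: the mug is on the counter near the sink. "
    "Reasoning: approach the counter, avoid the chair, extend the arm. "
    "History: previous grasp failed due to occlusion; retried after moving left. "
) * 20

plan = "PLAN:\n1. find mug\n2. grasp mug\n3. place mug on table\n"

print("H(C) proxy (raw context):", compressed_bytes(raw_context), "bytes")
print("H(P) proxy (plan):       ", compressed_bytes(plan), "bytes")
```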
14.8 Agent Architectures as Communication Systems
Modern agent systems implicitly implement rate–distortion optimization.
| Component | Role |
|---|---|
| summarizer agent | lossy compression |
| vector DB | side information |
| planner | semantic compression |
| memory store | external entropy reservoir |
| context window | channel capacity |
14.9 Optimal Memory Architecture
From an information theory standpoint, the optimal architecture contains three layers.
- Short-term memory (prompt) — High fidelity, limited bandwidth.
- Semantic memory (vector store) — Medium fidelity retrieval.
- Episodic archive (lakehouse / logs) — High entropy, rarely accessed.
This hierarchy approximates successive refinement coding.
14.10 Interpretation for Robotics and VLA Agents
In physical AI systems (e.g., VLA models controlling robots):
The agent must compress its observations (images, proprioception, language instructions, history)
into a representation usable by the policy.
Vision encoders act as rate-limited encoders:
$$z = E(o),$$
where $z$ is a low-dimensional latent.
The robot policy receives
$$a = \pi(z).$$
This is again a rate–distortion optimized representation.
14.11 Implication for Agent Scaling
Scaling agent systems is largely about optimizing information flow.
Three main strategies appear:
- Increase channel capacity — larger context windows.
- Improve compression — better summarization.
- Add side information — RAG memory.
14.12 Key Insight
The central constraint of agentic AI systems is not compute but information bandwidth.
The architecture that wins is the one that best solves
$$\min_{\phi} \; \mathbb{E}\big[d(C, \hat{C})\big] \quad \text{subject to} \quad R \le B,$$
which is precisely the rate–distortion problem.
15. Soft Symbols and Semantic Distortion
Once symbols are represented as continuous embeddings rather than discrete tokens, the classical Hamming distance becomes inappropriate. Hallucination in this setting corresponds to semantic drift — a sequence of symbols that deviates from the meaning of the correct answer — rather than bit flips. The problem becomes one of semantic distortion in a continuous representation space.
15.1 Soft Symbol Representation
Let a symbolic vocabulary be
$$\mathcal{X} = \{x_1, \dots, x_V\}.$$
Each symbol is mapped to an embedding
$$\phi(x) \in \mathbb{R}^d.$$
Example
"cat" → [0.12, -0.33, ..., 0.87]
"dog" → [0.10, -0.29, ..., 0.91]
A sequence
$$s = (x_1, \dots, x_L)$$
becomes
$$\Phi(s) = \big(\phi(x_1), \dots, \phi(x_L)\big).$$
Thus the channel now transmits vectors instead of discrete tokens.
15.2 Continuous MIMO Channel
Each agent transmits a compressed embedding representation
$$\mathbf{e}_i = \phi(M_i).$$
The channel corrupts the vectors
$$\mathbf{y}_i = \mathbf{e}_i + \mathbf{n}_i,$$
where
$$\mathbf{n}_i \sim \mathcal{N}(0, \sigma^2 I).$$
This models
- reasoning noise
- summarization distortion
- LLM generation variability.
The receiver performs fusion
$$\hat{\mathbf{e}} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{y}_i.$$
15.3 Semantic Distortion Metric
Instead of Hamming distance we measure distortion in embedding space.
A natural choice is cosine distance
$$d(\mathbf{a}, \mathbf{b}) = 1 - \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}.$$
Sequence distortion:
$$D(s, \hat{s}) = \frac{1}{L} \sum_{i=1}^{L} d\big(\phi(x_i), \phi(\hat{x}_i)\big).$$
15.4 Hallucination as Semantic Deviation
Let
$$s^* = (x_1^*, \dots, x_L^*)$$
be the ground-truth semantic sequence.
A predicted sequence $\hat{s}$
is considered hallucinated if
$$D(s^*, \hat{s}) > \tau$$
for some semantic tolerance threshold $\tau$.
Thus hallucination becomes semantic deviation beyond tolerance.
15.5 Multi-Agent Noise Suppression
Suppose each agent sends
$$\mathbf{y}_i = \mathbf{e} + \mathbf{n}_i, \qquad i = 1, \dots, k.$$
If noise is independent, the optimal estimator is
$$\hat{\mathbf{e}} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{y}_i.$$
Noise variance becomes
$$\mathrm{Var}(\hat{\mathbf{e}} - \mathbf{e}) = \frac{\sigma^2}{k}.$$
Therefore
$$\hat{\mathbf{e}} \to \mathbf{e}$$
as the number of agents increases.
This provides a theoretical justification for multi-agent reasoning reducing hallucination.
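A minimal numerical check of the $\sigma^2/k$ variance reduction under the stated independence assumption, using synthetic Gaussian noise (per-coordinate MSE compared against $\sigma^2/k$):

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, trials = 32, 0.5, 2000
e = rng.normal(size=d)                                  # true embedding

for k in (1, 2, 4, 8, 16):
    errors = []
    for _ in range(trials):
        y = e + rng.normal(scale=sigma, size=(k, d))    # k independent noisy observations
        e_hat = y.mean(axis=0)                          # fused estimate
        errors.append(np.mean((e_hat - e) ** 2))
    print(f"k={k:2d}: empirical MSE={np.mean(errors):.4f}  (sigma^2/k={sigma**2 / k:.4f})")
```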
15.6 Semantic Majority Voting
Instead of token voting, we perform embedding consensus.
Algorithm:
```
for each position i:
    gather embeddings from agents
    compute centroid
    choose symbol whose embedding is nearest to centroid
```
This is equivalent to minimum semantic distortion decoding.
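A minimal implementation sketch of this decoder, assuming a small synthetic vocabulary embedding matrix and synthetic noise (both purely illustrative):

```python
import numpy as np

def semantic_majority_vote(agent_embeddings, vocab_embeddings):
    """Minimum-semantic-distortion decoding: for each position, average the agents'
    embeddings and pick the vocabulary symbol nearest to the centroid (cosine)."""
    def unit(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)

    vocab_unit = unit(vocab_embeddings)                   # (V, d)
    decoded = []
    for position in zip(*agent_embeddings):               # k embeddings per position
        centroid = unit(np.mean(position, axis=0))        # (d,)
        decoded.append(int(np.argmax(vocab_unit @ centroid)))
    return decoded

# toy example: 4 vocabulary symbols in 8 dimensions, 3 agents, sequence length 5
rng = np.random.default_rng(0)
vocab = rng.normal(size=(4, 8))
truth = rng.integers(0, 4, size=5)
agents = [vocab[truth] + rng.normal(scale=0.6, size=(5, 8)) for _ in range(3)]

print("true symbols:   ", truth.tolist())
print("decoded symbols:", semantic_majority_vote(agents, vocab))
```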
15.7 Compression in Embedding Space
Agents may compress embeddings using a projection
$$\mathbf{z} = W \mathbf{e},$$
where
$$W \in \mathbb{R}^{m \times d}, \qquad m < d.$$
This models
- summarization
- reasoning traces
- compressed plans.
Decoding reconstructs
$$\hat{\mathbf{e}} = W^{+} \mathbf{z},$$
similar to MIMO linear receivers.
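A minimal sketch of that compress/reconstruct step using a random projection and its pseudo-inverse; the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 16                      # original and compressed embedding dimensions
W = rng.normal(size=(m, d))        # compression (projection) matrix

e = rng.normal(size=d)             # original embedding
z = W @ e                          # compressed representation sent over the channel
e_hat = np.linalg.pinv(W) @ z      # linear reconstruction, as in a MIMO linear receiver

relative_error = np.linalg.norm(e - e_hat) / np.linalg.norm(e)
print(f"relative reconstruction error at {m}/{d} compression: {relative_error:.3f}")
```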
15.8 Python Extension Concept
Replace discrete tokens with embeddings.
Example sketch:
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def semantic_distance(a, b):
    """Cosine distance between two embedding vectors."""
    return 1 - cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

def mimo_fusion(embeddings):
    """Fuse noisy agent embeddings by averaging (optimal for i.i.d. noise)."""
    return np.mean(embeddings, axis=0)

def detect_hallucination(true_seq, pred_seq, tau):
    """Flag a sequence as hallucinated when mean semantic distortion exceeds tau."""
    d = np.mean([
        semantic_distance(t, p)
        for t, p in zip(true_seq, pred_seq)
    ])
    return d > tau
```
15.9 Experimental Demonstration
The project experiment becomes:
- generate semantic sequence
- embed tokens
- simulate noisy channel
- compare
  - single agent
  - vs multi-agent fusion
- measure
  - semantic distortion
  - hallucination probability.
Plot:
- Agents vs hallucination rate
- Agents vs distortion
- Compression vs distortion
15.10 Vector Gaussian MIMO Interpretation
With embeddings the channel becomes a vector Gaussian MIMO channel
$$\mathbf{Y} = H \mathbf{E} + \mathbf{N}.$$
Decoding attempts to estimate $\mathbf{E}$.
Hallucination corresponds to large reconstruction error in semantic space.
This aligns the LLM hallucination problem with
- rate–distortion theory
- distributed estimation
- semantic communications theory.
16. Turbo Codes and Iterative Belief Propagation
The connection to turbo codes and iterative belief propagation arises because both multi-agent systems and turbo decoders attempt to estimate latent variables from multiple noisy observations through iterative probabilistic refinement. Once symbols are represented as soft embeddings, the multi-agent communication system becomes mathematically close to turbo decoding and belief propagation.
16.1 Soft Symbols and Log-Likelihoods
In classical channel coding, a received symbol is not decoded as a hard bit but as a soft value, typically a log-likelihood ratio (LLR):
$$L(y) = \log \frac{p(y \mid x = 0)}{p(y \mid x = 1)}.$$
These soft values allow iterative algorithms to refine estimates.
In the embedding formulation, each symbol has a vector representation $\phi(x) \in \mathbb{R}^d$.
A noisy observation is
$$\mathbf{y} = \phi(x) + \mathbf{n}.$$
Instead of an LLR over two symbols, we now have a distribution over the vocabulary:
$$p(x \mid \mathbf{y}) \;\propto\; p(x)\, \exp\!\left(-\frac{\|\mathbf{y} - \phi(x)\|^2}{2\sigma^2}\right).$$
This is directly analogous to soft decoding.
16.2 Multi-Agent Observations as Parallel Channels
Suppose multiple agents produce semantic observations
$$\mathbf{y}_1, \dots, \mathbf{y}_k.$$
Each observation is
$$\mathbf{y}_i = \phi(x) + \mathbf{n}_i.$$
This resembles parallel channels in coding theory.
The optimal estimator combines likelihoods:
$$p(x \mid \mathbf{y}_1, \dots, \mathbf{y}_k) \;\propto\; p(x) \prod_{i=1}^{k} p(\mathbf{y}_i \mid x).$$
Taking logs,
$$\log p(x \mid \mathbf{y}_1, \dots, \mathbf{y}_k) = \log p(x) + \sum_{i=1}^{k} \log p(\mathbf{y}_i \mid x) + \text{const}.$$
Thus every agent contributes soft evidence.
This is exactly how soft information is accumulated in turbo decoding.
16.3 Iterative Belief Propagation Interpretation
Consider a factor graph.
Nodes:
- latent symbols
- agent observations
- sequence constraints (grammar, reasoning structure)
Edges represent dependencies.
The joint distribution factorizes over the graph,
$$p(x_{1:L}, \mathbf{y}) \;\propto\; \prod_{a} \psi_a(x_{\partial a}) \prod_{i} p(\mathbf{y}_i \mid x_i).$$
Belief propagation updates messages between variable and factor nodes:
$$\mu_{x \to a}(x) \propto \prod_{b \ne a} \mu_{b \to x}(x), \qquad \mu_{a \to x}(x) \propto \sum_{x_{\partial a} \setminus x} \psi_a(x_{\partial a}) \prod_{x' \ne x} \mu_{x' \to a}(x').$$
These messages iteratively refine the estimate of the symbol sequence.
This is structurally identical to turbo decoding loops.
16.4 Hallucination as Decoding Error
In coding theory:
$$\text{decoding error} \iff \hat{x} \ne x.$$
In the semantic case:
$$\text{hallucination} \iff \hat{x} \not\approx x,$$
where approximation is defined in embedding space.
Thus hallucination corresponds to
$$P\big(D(s^*, \hat{s}) > \tau\big),$$
or equivalently semantic distortion exceeding a threshold.
16.5 Turbo Code Analogy
Turbo codes contain:
- two encoders
- interleaver
- iterative decoder exchanging soft information.
Mapping to multi-agent reasoning:
| Turbo component | Agent system |
|---|---|
| encoder 1 | reasoning agent |
| encoder 2 | retrieval agent |
| interleaver | context transformation |
| noisy channel | LLM generation noise |
| soft decoder | consensus reasoning |
| iterations | debate / refinement loop |
Each iteration improves the posterior
$$p(x \mid \text{all agent evidence}).$$
16.6 Multi-Agent Debate as Iterative Decoding
Consider agents exchanging beliefs
$$b_i(x), \qquad i = 1, \dots, k.$$
Iteration rule:
$$b_i^{(t+1)}(x) \;\propto\; p(\mathbf{y}_i \mid x) \prod_{j \in \mathcal{N}(i)} b_j^{(t)}(x).$$
This resembles loopy belief propagation.
The process converges to a consensus distribution over symbols.
16.7 Semantic Channel Capacity
For embeddings the channel becomes a vector Gaussian channel
$$\mathbf{y} = \mathbf{e} + \mathbf{n}, \qquad \mathbf{n} \sim \mathcal{N}(0, \sigma^2 I).$$
Capacity (per dimension):
$$C = \frac{1}{2} \log_2\!\left(1 + \frac{P}{\sigma^2}\right).$$
Multiple agents increase the effective SNR.
Noise variance becomes
$$\frac{\sigma^2}{k}.$$
Thus hallucination probability decreases exponentially with agent count.
16.8 Practical Algorithm
A turbo-like iterative fusion algorithm can be implemented.
Example sketch:
```python
import numpy as np

def iterative_semantic_fusion(agent_embeddings, vocab_embeddings, iterations=5):
    """Turbo-style fusion: repeatedly fold each agent's soft evidence into a
    posterior belief over the vocabulary, renormalizing after every update."""
    belief = np.ones(len(vocab_embeddings)) / len(vocab_embeddings)
    for _ in range(iterations):
        for y in agent_embeddings:
            # Gaussian-channel likelihood of each vocabulary embedding given observation y
            likelihood = np.exp(-np.linalg.norm(
                vocab_embeddings - y, axis=1
            )**2)
            belief = belief * likelihood
            belief = belief / belief.sum()
    return belief
```
This produces a posterior distribution over symbols.
16.9 Interpretation for LLM Systems
This viewpoint explains why the following techniques reduce hallucination:
- self-consistency decoding
- multi-agent debate
- reflection loops
- tool verification
All of them act as additional parity constraints on the latent semantic sequence.
The system is effectively performing error-correcting decoding of meaning.
16.10 Key Insight
Multi-agent reasoning architectures behave like semantic error-correcting codes.
The latent meaning sequence is encoded through multiple reasoning paths and noisy language generation processes. Iterative belief updates gradually remove inconsistencies, analogous to turbo decoding removing channel noise.
17. Chain-of-Thought as Parity Checks
The analogy between chain-of-thought (CoT) reasoning and parity constraints in error-correcting codes (particularly LDPC codes) becomes precise once reasoning steps are treated as latent variables that impose structural constraints on the final answer. Under this view, hallucination is analogous to decoding error in a noisy channel, and intermediate reasoning provides redundancy that enables error correction.
17.1 Latent Reasoning Graph
Let the correct semantic answer be a latent variable $a$.
A reasoning trace consists of intermediate steps
$$r_1, r_2, \dots, r_T.$$
Each step imposes a constraint relating the answer and other steps,
for example
$$f_t(r_t, r_{t-1}, a) = 0.$$
The full reasoning structure can be represented as a factor graph:
Nodes:
- latent answer
- reasoning steps
- observations (prompt, retrieved facts)
Edges encode dependencies.
17.2 Channel Noise in LLM Generation
LLM generation introduces noise: the generated steps and answer $(\hat{r}_1, \dots, \hat{r}_T, \hat{a})$ are corrupted versions of the intended $(r_1, \dots, r_T, a)$,
where the noise may represent:
- token sampling randomness
- incomplete context
- reasoning mistakes.
Without reasoning steps the model directly predicts
$$\hat{a} \sim p(a \mid \text{prompt})$$
from the prompt, equivalent to single-shot decoding.
This is fragile.
17.3 Chain-of-Thought as Redundant Encoding
When the model produces reasoning steps, the answer is not produced independently.
Instead, the answer is generated jointly with the trace:
$$p(a, r_1, \dots, r_T \mid \text{prompt}) = p(a \mid r_{1:T}, \text{prompt}) \prod_{t=1}^{T} p(r_t \mid r_{<t}, \text{prompt}).$$
Thus the reasoning trace creates redundant constraints.
This resembles a linear code
$$H\mathbf{x} = 0,$$
where
- $\mathbf{x}$ = codeword bits
- $H$ = parity check matrix.
In the reasoning case, the constraints take the form
$$f_t(a, r_{1:T}) = 0, \qquad t = 1, \dots, T.$$
Each reasoning step acts like a parity check equation.
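A minimal numerical illustration of the parity-check idea, using a small binary linear code (the matrix below is an arbitrary toy example, not a specific standard code); the reasoning analogy replaces bit constraints with consistency checks between steps and the answer:

```python
import numpy as np

# Parity-check matrix of a small (7,4) binary linear code: H @ x = 0 (mod 2)
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def parity_violations(x):
    """Number of parity-check equations violated by word x."""
    return int(np.sum((H @ x) % 2))

codeword = np.array([1, 0, 1, 1, 0, 1, 0])   # satisfies all checks
corrupted = codeword.copy()
corrupted[2] ^= 1                            # a single "hallucinated" bit flip

print("violations (clean):    ", parity_violations(codeword))
print("violations (corrupted):", parity_violations(corrupted))
```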
17.4 LDPC-Style Factor Graph
An LDPC code is represented as a bipartite graph.
Variable nodes: the codeword bits $x_1, \dots, x_n$.
Check nodes enforce parity constraints $x_{i_1} \oplus x_{i_2} \oplus \dots \oplus x_{i_m} = 0$ over small subsets of bits.
In the reasoning analogy:
Variable nodes:
- answer token embeddings
- reasoning step embeddings.
Check nodes:
- logical constraints
- arithmetic relationships
- consistency with retrieved facts.
Graph structure:
```
      answer node
      /    |    \
    r1     r2    r3
      \    |    /
      constraints
```
17.5 Belief Propagation Decoding
In LDPC decoding, messages propagate between nodes.
Message from variable node to check node: $\mu_{v \to c}(x)$.
Message from check node to variable node: $\mu_{c \to v}(x)$.
Iterative updates refine probability estimates.
In reasoning systems:
- reasoning steps update beliefs about the answer
- the answer updates beliefs about steps.
The process resembles
- self-reflection
- debate
- verification loops.
17.6 Self-Consistency as Monte Carlo Decoding
Self-consistency sampling generates multiple reasoning paths
$$r^{(1)}, r^{(2)}, \dots, r^{(m)}.$$
Each produces a candidate answer.
The final answer is selected by majority vote or likelihood.
This approximates marginalizing the posterior
$$p(a \mid \text{prompt}) = \sum_{r} p(a \mid r, \text{prompt})\, p(r \mid \text{prompt}),$$
similar to ensemble decoding.
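A minimal sketch of this Monte Carlo view, where `sample_reasoning_path(prompt)` is a toy stand-in for one stochastic chain-of-thought sample that returns a candidate answer:

```python
from collections import Counter
import random

def sample_reasoning_path(prompt):
    """Toy stand-in for one stochastic chain-of-thought sample returning an answer."""
    # correct answer "12" most of the time, occasional hallucinations
    return random.choices(["12", "14", "9"], weights=[0.7, 0.2, 0.1])[0]

def self_consistency(prompt, num_samples=20):
    """Monte Carlo decoding: sample m reasoning paths and take the modal answer,
    approximating argmax_a sum_r p(a | r) p(r | prompt)."""
    answers = Counter(sample_reasoning_path(prompt) for _ in range(num_samples))
    return answers.most_common(1)[0][0], answers

answer, votes = self_consistency("What is 3 * 4?")
print("selected answer:", answer, "| vote counts:", dict(votes))
```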
17.7 Hallucination as Parity Violation
If a reasoning step contradicts another step, i.e. some constraint $f_t(a, r_{1:T}) \ne 0$,
the system detects the inconsistency.
Examples:
Arithmetic reasoning
2+3 = 5
5+4 = 9
Answer = 8
Constraint violation reveals an error.
Thus reasoning steps act as error-detecting redundancy.
17.8 Information-Theoretic Interpretation
Suppose the final answer has entropy $H(a)$.
Adding reasoning steps increases the transmitted information to
$$H(a, r_1, \dots, r_T) \ge H(a),$$
but because the trace is redundant with the answer,
$$H(a \mid r_1, \dots, r_T) \ll H(a).$$
This is exactly the mechanism used in error-correcting codes.
17.9 Implications for Multi-Agent Systems
If multiple agents generate reasoning traces
$$r^{(1)}_{1:T}, \; r^{(2)}_{1:T}, \; \dots,$$
the system forms a large joint constraint graph over the shared latent answer $a$.
More constraints reduce the feasible answer space.
Thus hallucination probability decreases.
17.10 Visualization of the Analogy
Coding theory:
message → encoder → noisy channel → decoder
LLM reasoning:
prompt → reasoning trace → noisy generation → verification / consensus
Redundancy in reasoning plays the role of coding gain.
17.11 Key Insight
Chain-of-thought reasoning functions as a semantic error-correcting code. Intermediate steps introduce redundancy that constrains the answer space, allowing iterative inference mechanisms — similar to turbo decoding or belief propagation — to correct errors introduced by the stochastic generation process.
18. Critical Review and Testable Hypotheses
Added 2026-03-10. The preceding sections present an elegant narrative connecting multi-agent LLM systems to information theory. Several claims are metaphorical rather than mathematical. Below is a critical analysis and 24 testable hypotheses organized by topic.
18.1 Cross-Cutting Weaknesses
| # | Issue | Impact |
|---|---|---|
| W1 | MIMO misnomer — the formulation describes agents independently observing the same source and fusing results. This is SIMO (single-input, multiple-output) / diversity combining, not true MIMO. True MIMO requires cross-channel interference (off-diagonal H matrix entries). | Scaling predictions are wrong: diversity gain ~ log k, not spatial multiplexing gain ~ k. |
| W2 | i.i.d. noise assumption — all theoretical gains (σ²/k, exponential error decay with agent count) require independent agent errors. Agents sharing the same LLM weights, training data, and prompt produce correlated errors. | Actual effective variance is σ²(1+(k-1)ρ)/k. For ρ→1 (identical agents, identical prompts) there is zero noise reduction. |
| W3 | Additive Gaussian noise on embeddings is untested — the most critical modeling assumption. LLM errors are structured, multimodal, and context-dependent, not i.i.d. Gaussian perturbations. | The entire soft-symbol formulation (§15), noise suppression proof, and channel capacity formula depend on this. |
| W4 | Analogies without structural verification — turbo codes, LDPC, Wyner-Ziv are invoked by analogy but the necessary conditions (interleaver design, check node sparsity, side-information independence) are not verified. | Claims about coding gain, belief propagation convergence, and rate reduction are suggestive but unproven. |
18.2 Hypothesis Map
Task 1: MIMO Symbolic Communication (auraison-ncq.1)
- H1.1 — Multi-agent context exchange on independent tasks behaves as SIMO (diversity gain ~ log k), not MIMO (capacity gain ~ k).
  - Test: Measure mutual information I(C_i; C_j) before/after communication for k=2..8 agents on a shared reasoning task; fit to log(k) vs linear k.
- H1.2 — Correlated agent errors (shared LLM backbone, same prompt) reduce the effective diversity gain below the independent-noise bound σ²/k.
  - Test: Compare hallucination rates for k agents using the same vs different LLMs; measure noise correlation coefficient ρ and verify effective variance is σ²(1+(k-1)ρ)/k.
- H1.3 — The communication graph topology G=(V,E) affects convergence rate of context synchronization D(C_i||C_j)→0.
  - Test: Compare star, ring, and fully-connected topologies for N=5 agents; measure rounds to reach D < ε.
Task 2: Rate-Distortion Interpretation (auraison-ncq.2)
- H2.1 — LLM summarization approximates the rate-distortion bound.
  - Test: For a fixed corpus C, generate summaries at budgets B ∈ {50, 100, 200, 500, 1000} tokens. Measure downstream task accuracy (proxy for distortion). Plot R vs D and compare against the Shannon lower bound for a fitted source model.
- H2.2 — RAG reduces the effective rate needed to achieve a given distortion level.
  - Test: Compare task accuracy at fixed context budget B with and without RAG retrieval. Measure the rate savings ΔR = R_no_RAG(D) - R_RAG(D) and verify it equals I(C; K) as Wyner-Ziv predicts.
- H2.3 — Plan representations achieve lower entropy than raw context for equivalent task performance.
  - Test: Measure H(plan) vs H(raw context) using a compression proxy (gzip ratio), verify H(P) << H(C) while downstream task accuracy remains within tolerance D.
- H2.4 — The three-layer memory hierarchy outperforms flat retrieval.
  - Test: Compare task accuracy for (prompt-only) vs (prompt+vector) vs (prompt+vector+archive) at fixed total token budget; verify diminishing returns consistent with successive refinement.
Task 3: Python MIMO Simulator (auraison-ncq.3)
- H3.1 — Multi-agent majority voting reduces hallucination rate proportional to 1/k under synthetic i.i.d. noise.
  - Test: Sweep k ∈ {1,2,3,5,7,10} agents, noise rates p ∈ {0.05, 0.1, 0.2, 0.3}, measure false positive rate; fit to theoretical curve P_err ~ p^k (a minimal simulation sketch follows this list).
- H3.2 — The hallucination reduction saturates or reverses beyond a critical agent count k* when agent noise is correlated.
  - Test: Introduce noise correlation ρ ∈ {0, 0.2, 0.5, 0.8} between agents; identify k* where adding agents no longer helps.
- H3.3 — There exists an optimal compression budget B* that minimizes hallucination for a given number of agents.
  - Test: Sweep B and k jointly; plot the 2D surface of hallucination rate vs (B, k) and identify the Pareto frontier.
- H3.4 — Structured noise (clustered errors, systematic biases) defeats majority voting faster than i.i.d. noise.
  - Test: Compare i.i.d. vs bursty vs systematic noise models at equal average error rate; measure the gap in hallucination rates.
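A minimal simulation sketch in the spirit of H3.1 and H3.2, using synthetic symbol-flip noise with an optional correlation knob; all parameters (vocabulary size, noise rate, trial counts) are illustrative:

```python
import numpy as np

def majority_vote_error(k, p, rho=0.0, vocab=5, trials=10_000, seed=0):
    """Estimate the per-symbol error rate of k-agent majority voting.

    With probability rho the round is fully correlated: all agents share one error
    event and one wrong symbol. Otherwise agents err independently (rho=0 is i.i.d.).
    Ties in the vote break toward the lowest index, i.e. toward the true symbol 0."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(trials):
        true = 0
        if rng.random() < rho:
            err = rng.random() < p
            symbol = int(rng.integers(1, vocab)) if err else true
            votes = [symbol] * k                                   # correlated round
        else:
            votes = [int(rng.integers(1, vocab)) if rng.random() < p else true
                     for _ in range(k)]                            # independent round
        decoded = int(np.bincount(votes, minlength=vocab).argmax())
        errors += decoded != true
    return errors / trials

for rho in (0.0, 0.5):
    rates = [majority_vote_error(k, p=0.3, rho=rho) for k in (1, 3, 5, 7)]
    print(f"rho={rho}: error rate vs k =", [round(r, 3) for r in rates])
```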
Task 4: Soft Symbols and Semantic Distortion (auraison-ncq.4)
- H4.1 ★ — LLM token-level errors are NOT well-modeled by additive Gaussian noise in embedding space.
  - Test: Collect (correct_token, hallucinated_token) pairs from a real LLM on a QA benchmark. Compute the distribution of error vectors e_err = φ(hallucinated) - φ(correct). Test for Gaussianity (Shapiro-Wilk, Q-Q plot). Prediction: the distribution will be heavy-tailed and multimodal.
- H4.2 — Cosine centroid fusion outperforms token-level majority voting for semantically similar error modes.
  - Test: Generate k agent responses where errors are near-synonyms (e.g., 'big'/'large'/'huge'). Compare centroid-nearest-neighbor decoding vs majority vote. Prediction: centroid fusion wins when errors cluster semantically.
- H4.3 — Centroid fusion FAILS when errors are adversarially distributed.
  - Test: Construct scenarios where k-1 agents hallucinate a semantically coherent but wrong answer. Verify that centroid fusion amplifies the error rather than correcting it.
- H4.4 — The effective noise reduction with real LLM agents follows σ²(1+(k-1)ρ)/k, not σ²/k.
  - Test: Run k ∈ {1..5} instances of the same LLM on the same questions at temperature > 0. Measure pairwise error correlation ρ. Verify the actual distortion reduction matches the correlated-noise formula.
Task 5: Turbo Codes, Belief Propagation, CoT as LDPC (auraison-ncq.5)
- H5.1 — Multi-agent debate converges to a consensus distribution, and the number of iterations to convergence scales with graph connectivity.
  - Test: Run k ∈ {2,3,5} agents in iterative belief exchange on factual QA. Measure KL divergence between agent beliefs at each iteration. Track convergence rate and identify cases of oscillation or divergence.
- H5.2 — Longer CoT traces reduce hallucination following a diminishing-returns curve analogous to coding gain.
  - Test: For a fixed task, vary CoT budget T ∈ {0, 1, 2, 4, 8, 16} steps. Measure error rate. Fit to the coding gain curve P_err ~ exp(-αT). Prediction: there exists T* beyond which additional steps add noise rather than redundancy.
- H5.3 ★ — Multi-agent debate with DIFFERENT model families (decorrelated errors) outperforms debate with identical models, analogous to the interleaver effect in turbo codes.
  - Test: Compare hallucination rates for (a) 3× GPT-4, (b) GPT-4 + Claude + Gemini, (c) 3× Claude, on a shared benchmark. Prediction: mixed-model ensemble (b) achieves the lowest error rate.
- H5.4 — Self-consistency sampling does NOT approximate the true posterior for out-of-distribution questions.
  - Test: Compare self-consistency answer distribution vs ground truth distribution on questions where the LLM has known systematic biases. Prediction: self-consistency amplifies the bias rather than correcting it.
- H5.5 — Explicit parity-check verification (tool use, calculator, code execution) provides stronger error correction than implicit reasoning redundancy.
  - Test: Compare error rates on arithmetic/logic tasks for (a) CoT-only, (b) CoT + tool verification, (c) multi-agent debate without tools. Prediction: (b) dominates.
Task 6: Transformers as Soft-Symbol Encoders (auraison-ncq.6)
- H6.1 — Attention head diversity provides a diversity gain analogous to multiple projection matrices.
  - Test: Measure pairwise cosine similarity between attention head outputs. Ablate individual heads and measure hallucination rate increase — heads with more diverse projections should be more critical.
- H6.2 ★ — Token embeddings that are closer in cosine distance are more frequently confused in hallucinations.
  - Test: Collect hallucination pairs (correct, hallucinated) from a real LLM. Measure cosine distance between their embeddings. Compare against random token pairs. Prediction: hallucinated tokens are significantly closer to the correct token than random.
- H6.3 — The softmax temperature acts as a noise parameter in the channel model.
  - Test: Vary temperature T ∈ {0.1, 0.3, 0.5, 0.7, 1.0, 1.5} and measure hallucination rate on a factual QA benchmark. Fit to a channel error rate model P_err = f(T). Prediction: error rate increases monotonically with T, consistent with σ² ∝ T.
- H6.4 — Fine-tuning reshapes the embedding space to increase minimum distance between confusable tokens, analogous to constellation optimization.
  - Test: Compare the embedding-space geometry (minimum pairwise distance among top-k confusable tokens) before and after RLHF/DPO fine-tuning. Prediction: fine-tuning increases separation between frequently confused token clusters.
18.3 Priority Hypotheses
The three starred (★) hypotheses are the most critical to validate first:
- H4.1 — Noise characterization. This is the foundation: if LLM errors aren't Gaussian, the entire soft-symbol framework needs a different noise model.
- H5.3 — Model diversity as interleaver. This is the strongest practical prediction and the easiest to test with existing LLM APIs.
- H6.2 — Embedding proximity predicts hallucination. This would empirically ground the connection between transformers and the channel model.
18.4 What Would Make This Publishable
- Empirical validation of H4.1 (noise distribution characterization)
- Demonstration of H5.3 (mixed-model ensemble > same-model ensemble)
- Rate-distortion curve measurement (H2.1) showing LLM summarization approaches the bound
- A revised formulation that accounts for correlated noise and non-Gaussian error structure