Most progress in AI looks like this: more data, more compute, more signal.¹ This is a small story about the opposite. Over the past few months, I built a system that pulls facts out of a knowledge graph to ground an LLM's answers. The architectural change that ended up mattering most was a deliberate subtraction. We cut a chunk of information out of one part of the model on purpose.
The problem
If you have used an LLM, you have seen it confidently make things up. Retrieval-augmented generation, or RAG, addresses this by looking up real information before answering. Knowledge graphs are a particularly good source of facts. They store the world as a web of entities and relationships: structured, explicit, and inspectable. "Christopher Nolan directed Inception" is a graph fact.
The hard case is multi-hop questions: the kind that need a chain of facts. "Who designed the concert hall on top of an old cocoa warehouse in Hamburg?" To answer that, a retriever has to find the Elbphilharmonie, then find that Herzog & de Meuron designed it, and hold on to both facts. The current state of the art in graph-based RAG, SubgraphRAG, scores each candidate fact in the graph independently. That works for one-hop questions. For chains, it can keep individually high-scoring facts while quietly losing the chain that connects them. Recall on multi-hop questions sags. This project picks at exactly that gap.
The pipeline, in five steps
Concretely, here is what happens when our system gets a question like that one:
- Pick anchors. Out of the millions of facts in the graph, pull a couple of dozen that look semantically related to the question. Think of them as bookmarks: landmarks that orient the rest of the search. We pick them by cosine similarity between question and fact embeddings.
- Expand the neighbourhood. Starting from the topic of the question (the Elbphilharmonie) and from each anchor, walk outward across the graph for a hop or two and collect every fact you pass. You now have a candidate set of a few thousand triples.
- Tag every entity with its position. For each entity in that neighbourhood, write down where it sits structurally: how far from the topic, how far from each anchor, whether it lies on a shortest path between them. Pure geometry. Nothing about meaning yet. (These first three steps are sketched in code after this list.)
- Refine, but blind the gate. A small graph neural network now updates each entity by listening to its neighbours. The catch: it is only allowed to read the structural tags when deciding whose messages to amplify. Content gets updated, but content cannot influence whose voice carries. That is the deliberate subtraction.
- Score. A tiny scoring head rates each candidate fact for relevance, given the question and the refined entity embeddings. The top k go to the LLM.
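To make the first three steps concrete, here is a minimal sketch. Everything in it is illustrative: the helper names, the NetworkX graph with a `relation` edge attribute, and the hop and anchor counts are my assumptions, not the project's actual code.

```python
import networkx as nx
import numpy as np

def pick_anchors(q_emb, fact_embs, k=24):
    """Step 1: top-k facts by cosine similarity to the question embedding."""
    f = fact_embs / np.linalg.norm(fact_embs, axis=1, keepdims=True)
    q = q_emb / np.linalg.norm(q_emb)
    return np.argsort(-(f @ q))[:k]

def expand_neighbourhood(G, seeds, hops=2):
    """Step 2: collect every triple within `hops` of the topic or an anchor."""
    reached = set(seeds)
    for _ in range(hops):
        reached |= {nbr for node in list(reached) for nbr in G.neighbors(node)}
    return [(u, d["relation"], v) for u, v, d in G.edges(reached, data=True)]

def structural_tags(G, entities, topic, anchors):
    """Step 3: pure geometry -- distances and shortest-path membership."""
    inf = float("inf")
    d_topic = nx.single_source_shortest_path_length(G, topic)
    d_anchor = {a: nx.single_source_shortest_path_length(G, a) for a in anchors}
    tags = {}
    for e in entities:
        dt = d_topic.get(e, inf)
        das = [d_anchor[a].get(e, inf) for a in anchors]
        # e lies on a shortest topic-to-anchor path exactly when its two
        # legs add up to the direct topic-to-anchor distance
        on_path = any(
            d_topic.get(a, inf) < inf and dt + da == d_topic.get(a, inf)
            for a, da in zip(anchors, das)
        )
        tags[e] = (dt, min(das), float(on_path))
    return tags
```

Note what `structural_tags` never touches: an embedding. The tags that feed the gate in step four are computed from graph distances alone.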
Why blinding the gate matters
The classic problem with stacking message-passing layers in a graph is oversmoothing. Each round, every node averages a little of itself toward its neighbours. After a few rounds, every entity in a connected region looks the same.
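The collapse is easy to reproduce. A toy run in NumPy (a sketch of the phenomenon, not our model): start with random node features on a small connected graph and repeatedly average each node with its neighbours.

```python
import numpy as np

# Toy oversmoothing: on a connected graph, repeated neighbour-averaging
# drives every node's features toward the same vector.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)      # row-normalised adjacency
X = np.random.default_rng(0).normal(size=(4, 3))

for _ in range(10):
    X = 0.5 * X + 0.5 * (P @ X)           # keep half, average half
print(X.std(axis=0))  # per-feature spread across nodes shrinks toward zero
```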
The standard fix is to gate the messages: let each edge decide how much of its content gets through. The natural thing to gate on is similarity: "listen more to neighbours that already look like me." That is exactly where the trouble starts. Similar nodes amplify each other faster than dissimilar ones, so the soup forms quicker, not slower. The SubgraphRAG authors saw this and concluded that GNNs hurt graph retrieval.
Our gate is forbidden from looking at content. It only reads the structural tags (distance from topic, distance from anchors, path indicators) when deciding whose voice carries. The feedback loop is severed.
Watch the loop below. Same graph, same starting colours, two gates. The semantic panel collapses into a uniform soup; the structural panel barely moves.
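In code, the whole trick is which tensors the gate is allowed to read. Here is a minimal PyTorch sketch of one structurally-gated message-passing round; the layer shapes, the sigmoid gate, and the residual update are my assumptions, not the project's exact architecture.

```python
import torch
import torch.nn as nn

class StructGatedLayer(nn.Module):
    """One message-passing round whose gate reads only structural tags.

    h:     [n, d] content embeddings (these get refined)
    s:     [n, t] structural tags (read-only: distances, path indicators)
    edges: [2, m] (src, dst) index pairs
    """
    def __init__(self, d, t):
        super().__init__()
        self.msg = nn.Linear(d, d)       # transforms neighbour content
        self.gate = nn.Linear(2 * t, 1)  # sees ONLY the structural tags

    def forward(self, h, s, edges):
        src, dst = edges
        # Edge weights come from the endpoints' tags alone: content never
        # decides whose voice carries, so the similarity feedback loop
        # described above cannot form.
        logits = self.gate(torch.cat([s[src], s[dst]], dim=-1)).squeeze(-1)
        w = torch.sigmoid(logits)
        m = w.unsqueeze(-1) * self.msg(h[src])          # weighted messages
        agg = torch.zeros_like(h).index_add_(0, dst, m)
        return h + agg  # residual: refine content without replacing it
```

A semantic gate differs by one line: compute `logits` from `h[src]` and `h[dst]` instead of `s[src]` and `s[dst]`. That single substitution is what lets similar nodes amplify each other. The step-five scoring head can then be an ordinary MLP over the question embedding and each candidate triple's refined head and tail embeddings, though the exact features are again an assumption here.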
What this buys us
Across all questions on the WebQSP benchmark, our retriever lifts triple recall by 2.2 percentage points at k = 100, from 88.3 to 90.5 (averaged over three seeds). The bigger story is multi-hop. On chained questions, the kind that broke SubgraphRAG, we are roughly five points ahead at k = 100.
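For clarity on the metric: triple recall at k is the fraction of a question's gold triples that survive into the top-k retrieved list, averaged over questions. A sketch of the computation (our evaluation script may differ in details such as tie-breaking):

```python
def triple_recall_at_k(ranked, gold, k=100):
    """Fraction of gold triples present in the top-k retrieved triples."""
    top = set(ranked[:k])
    return sum(t in top for t in gold) / len(gold)
```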
Two ablations are worth pulling out:
- Removing the structural gate while keeping the GNN halves the gain over SubgraphRAG at k = 100, from 2.2 points down to roughly one. The gate is doing the work, not the GNN alone.
- Removing the GNN entirely is the largest single ablation, costing 5.4 points of recall. Put together: the structurally-gated GNN is the biggest contributor to the result.
What this changes
SubgraphRAG's authors concluded that GNNs hurt graph retrieval, and blamed semantic diffusion noise for it. That conclusion held for the GNN they tried, which was unconstrained. Our result reverses it: with a structural-only gate, the GNN becomes the single biggest contributor to recall. Turns out, GNNs are not broken for retrieval. They just need a smaller job.
Caveats and a closing thought
This is a project result, not a production system. We have evaluated retrieval, not the answer the LLM produces with the retrieved facts. Whether the recall gain translates to fewer hallucinated answers is a separate experiment we have not run.
Still, the architectural lesson generalises beyond this project. When a part of your model is causing trouble, sometimes the right move is not to give it more information. It is to take some away.
In a graph, where you are is sometimes more useful than what you mean.
Footnotes
1. Rich Sutton, The Bitter Lesson (2019). The methods that have actually worked in AI, over and over, are the ones that scale with compute and lean on general learning rather than hand-engineered structure. This post is a small exception, on a task where the structure being learned is the data.