AI Agent Memory Stack: Graphiti + FalkorDB + bge-m3 (Real Setup)

For a stretch of time our agent acted like it had amnesia. Every new session it greeted us as if it had never seen any of the old work, even though the code that wrote memories ran fine every single time, with no errors at all.

It was only when we opened the actual graph that we found everything sitting there: 43 episodes and 185 facts, all written successfully. The read side was returning none of them.

We mentioned this bug only in passing last time, as the origin note for a post about the architecture of trustworthy memory. This is the full story behind that line, plus the real stack around it: the specific tools we run, how they fit together, and the spots where this stack fails quietly without telling anyone.

Part 1: The drawer we read from was the wrong one

The cause of the "writes work, reads don't" symptom lived in a tiny seam that is very easy to miss.

The graph database we use has a setting that says where to store data. The catch is that this same setting doubles as the key for the graph itself. The write side fired memories into the graph tied to the agent's identity, while the read side pointed at an empty default graph. Both looked like they were "working normally" on their own, just in different drawers. So the memories were all there, and recall came back empty anyway.

The real fix was a single line: point the read side at the same place as the write side. The moment that landed, recall could see all 43 episodes again.
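
The failure mode is easy to reproduce in miniature. The sketch below is a toy in-memory stand-in, not Graphiti or FalkorDB code: `ToyGraphStore` and the graph names `agent-main` and `default` are made up for illustration. It only shows how a write and a read can both succeed while pointing at different graph keys.

```python
from collections import defaultdict


class ToyGraphStore:
    """Toy stand-in for a graph database that holds several named graphs."""

    def __init__(self) -> None:
        # Each graph key maps to its own independent list of episodes.
        self._graphs: dict[str, list[str]] = defaultdict(list)

    def write(self, graph_key: str, episode: str) -> None:
        self._graphs[graph_key].append(episode)

    def read(self, graph_key: str) -> list[str]:
        # Reading an unknown key is not an error; it just returns nothing.
        return list(self._graphs.get(graph_key, []))


store = ToyGraphStore()
WRITE_KEY = "agent-main"    # hypothetical: the key the write side used
BUGGY_READ_KEY = "default"  # hypothetical: the empty default graph

for i in range(43):
    store.write(WRITE_KEY, f"episode {i}")

before_fix = store.read(BUGGY_READ_KEY)  # succeeds, but comes back empty
after_fix = store.read(WRITE_KEY)        # same key as the write side

print(len(before_fix), len(after_fix))
```

Neither call raises, which is the whole problem: the only visible difference is the count that comes back.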

A bug like this is scarier than one that crashes, because it never makes a sound. Everything is green, there are no errors, and from the outside the system looks like it remembers. You only catch it when you open the graph and look with your own eyes.

Part 2: The real stack underneath the word "memory"

Once you see that memory splits into a write side and a read side like this, it gets easier to picture what the stack is actually made of. Each layer has a clearly separate job.

Graphiti handles episodic memory. It takes work in one chunk at a time (an episode), then extracts facts and relationships and stores them in a graph over time.
FalkorDB is the graph database sitting underneath Graphiti. Things and people become nodes, and the relationships between them become edges.
bge-m3, running through Ollama, is the embedder. It turns text into vectors so you can search by meaning, not just match exact words.
A small extraction LLM reads the raw episode and decides what counts as a fact and what counts as an entity worth keeping.
A thin CLI of our own keeps record and recall running as one loop.

So the phrase "give the AI a memory," which sounds like a single button, is really five layers wired together. And every seam between layers is a place where things can break silently.

Part 3: Why you need both a graph and an embedder

A common question is why this does not just stop at a single vector database, since that is the popular default.

A vector store is great at one thing: finding items whose meaning is close. Ask "have we discussed something like this before?" and it answers well. But ask "who is involved in this, and how does it connect to which case and when?" and that question needs relationships and a time order, which is the graph's job.

So we use the two together. The embedder helps surface the chunks that are relevant by meaning, and the graph holds how those chunks connect and which came before which. Drop either one and memory can only answer half the question.
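
As a rough illustration of that split, the toy below ranks chunks by cosine similarity and then follows edges from the best match in recorded order. The vectors, node names, relation labels, and ordering numbers are all invented for the example. A real setup would get its vectors from the embedder and its edges from the graph database.

```python
import math

# Hypothetical 3-dim "embeddings"; real ones would come from the embedder.
vectors = {
    "recall returned zero facts": [0.9, 0.1, 0.0],
    "timeout raised for cold path": [0.1, 0.9, 0.1],
    "read side pointed at default graph": [0.8, 0.2, 0.1],
}

# Hypothetical edges: (source, relation, target, order in which it was recorded).
edges = [
    ("recall returned zero facts", "FOLLOWED_BY", "timeout raised for cold path", 2),
    ("recall returned zero facts", "CAUSED_BY", "read side pointed at default graph", 1),
]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec: list[float]) -> str:
    """Vector step: which stored chunk is closest in meaning?"""
    return max(vectors, key=lambda name: cosine(query_vec, vectors[name]))


def related(node: str) -> list[tuple[str, str, int]]:
    """Graph step: what is connected to that chunk, earliest first?"""
    hits = [(rel, dst, order) for src, rel, dst, order in edges if src == node]
    return sorted(hits, key=lambda hit: hit[2])


best = nearest([1.0, 0.0, 0.0])
timeline = related(best)
```

The vector step alone can say which chunk is closest. Only the edge list can say what caused it and what came next.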

Part 4: Why we record sparingly instead of recording everything

Every time we record one episode, the extraction LLM has to read it and pull facts out. That work burns quota. When we ran on a free tier, the extraction quota was around 20 calls a day, which works out to roughly 2 memories a day, implying something like 10 calls per recorded episode.

That limit forced a better design. Instead of recording every session automatically, we record only at the end of meaningful work, through a ritual that already writes a summary. We then trim each chunk to around 1,000 characters and keep only the durable facts, not long narration.
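
The budget arithmetic and the trimming step might look something like this. The figure of 10 extraction calls per episode is an inference from the two numbers above (20 calls, roughly 2 memories), not a measured value, and `trim_summary` is a hypothetical helper rather than anything Graphiti provides.

```python
DAILY_CALL_QUOTA = 20    # free-tier figure quoted above
CALLS_PER_EPISODE = 10   # assumption inferred from "20 calls, roughly 2 memories"
MAX_CHARS = 1000         # target chunk size quoted above


def episodes_per_day(quota: int = DAILY_CALL_QUOTA, cost: int = CALLS_PER_EPISODE) -> int:
    """How many episodes fit in the daily extraction budget."""
    return quota // cost


def trim_summary(text: str, limit: int = MAX_CHARS) -> str:
    """Cut a summary to at most `limit` characters, preferring a sentence end."""
    text = " ".join(text.split())  # collapse runs of whitespace and newlines
    if len(text) <= limit:
        return text
    head = text[:limit]
    cut = head.rfind(". ")
    # Use the last sentence boundary if it sits in the second half of the
    # window; otherwise fall back to a hard cut at the limit.
    return head[: cut + 1] if cut >= limit // 2 else head
```

Trimming by length only enforces the size limit. Deciding which facts are durable still has to happen when the summary is written.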

It turns out the money constraint made the memory better, because it forces a choice about what is worth remembering rather than dumping everything in to rot.

One more thing we settled early: all the memory lives under a single shared group key. Work from development sessions and work from chat use the same graph, so recall can cross between them and remember something that happened in the other place.

Part 5: When the system "went down" but really hadn't

On another day, recall threw a timeout. The agent answered in a hollow way, like it could not bring anything to mind. The first instinct was "the tunnel to the database must be dead," because its status showed 255.

But before touching anything, we verified first. A PING straight at the port came back with a PONG, and the database itself reported it had been up for 6 days. So the tunnel was not dead. That 255 was a stale value left over from an earlier network blip. A status code reports the last result, not the current state.
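
A live check of that kind can be as small as the sketch below. It assumes the database answers the Redis wire protocol, as FalkorDB does, and that no authentication is required. The host and port are whatever your own tunnel exposes, and `ping` is a hypothetical helper name.

```python
import socket


def ping(host: str, port: int, timeout: float = 2.0) -> bool:
    """Send a Redis-protocol PING and report whether the reply is +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # inline command form of the protocol
            reply = conn.recv(64)
    except OSError:
        # Refused, unreachable, or timed out: the live check failed.
        return False
    return reply.startswith(b"+PONG")
```

The point is that this asks the port right now, instead of reading a status value that was recorded at some earlier moment.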

The real issue was that the timeout was set to 5 seconds, below what the cold path needs. When the system has just woken up, the first embedder pass alone takes 3.66 seconds, and once you add the graph and a cross-region connection, the total reaches around 10 seconds. A recall that ran all the way through clocked 9.7 seconds, so it kept hitting the 5-second ceiling.

The fix was to raise the ceiling to 12 to 15 seconds. The key point is that a timeout is a ceiling, not a fixed wait. Once the system is warm, recall still comes back in 1 to 2 seconds as before. Raising the ceiling costs the warm path nothing; the only thing that changes is that the cold path now has room to finish.
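
To see the ceiling-versus-wait distinction in code, here is a small simulation using `asyncio.wait_for`. `fake_recall` is a stand-in that just sleeps, and all durations are scaled down 10x so the demo finishes in under two seconds. None of this is the real recall path.

```python
import asyncio
import time


async def fake_recall(latency: float) -> str:
    # Stand-in for a real recall; `latency` simulates cold vs warm paths.
    await asyncio.sleep(latency)
    return "facts"


async def timed_recall(latency: float, ceiling: float) -> tuple[str, float]:
    """Run a recall under a timeout ceiling; return the outcome and elapsed time."""
    start = time.perf_counter()
    try:
        result = await asyncio.wait_for(fake_recall(latency), timeout=ceiling)
    except asyncio.TimeoutError:
        result = "timeout"
    return result, time.perf_counter() - start


# Scaled 10x down: 0.97 stands for the 9.7 s cold recall,
# 0.15 for a roughly 1.5 s warm one.
cold_low = asyncio.run(timed_recall(0.97, ceiling=0.5))   # stands for the 5 s ceiling
cold_high = asyncio.run(timed_recall(0.97, ceiling=1.5))  # stands for a 15 s ceiling
warm_high = asyncio.run(timed_recall(0.15, ceiling=1.5))  # stands for a 15 s ceiling
```

The warm call under the higher ceiling still returns as soon as its own work is done. The ceiling only matters to the call that would otherwise have been cut off.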

Part 6: If you are wiring up your own memory stack

The two stories above came from different bugs, but they teach the same thing: an agent's memory can fail quietly at any time, even while the logs stay green. So before you trust that it remembers, try these two checks.

First, prove the round-trip with your own eyes. Write one memory in, then ask for it back and confirm you get the same chunk. Do not trust "the write passed, no errors," because the write side and the read side may be pointing at different drawers.
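
One way to script that check is sketched below. `write` and `recall` are hypothetical wrappers around whatever record and recall commands your own stack exposes. The in-memory `graphs` dict and its two keys exist only so the example runs on its own.

```python
import uuid
from typing import Callable


def round_trip_ok(
    write: Callable[[str], None],
    recall: Callable[[str], list[str]],
) -> bool:
    """Write a unique marker, then confirm recall returns that exact marker."""
    marker = f"round-trip-check-{uuid.uuid4().hex}"
    write(marker)
    return any(marker in chunk for chunk in recall(marker))


# Toy backend with two "drawers" (graph keys); both names are made up.
graphs: dict[str, list[str]] = {"agent-main": [], "default": []}


def write_main(text: str) -> None:
    graphs["agent-main"].append(text)


def recall_from(key: str) -> Callable[[str], list[str]]:
    return lambda query: [chunk for chunk in graphs[key] if query in chunk]


same_drawer = round_trip_ok(write_main, recall_from("agent-main"))
wrong_drawer = round_trip_ok(write_main, recall_from("default"))
```

With an extraction step in between, the recalled text may be a reworded fact and not the original chunk, so a real check might need to look for a distinctive token instead of an exact match.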

Second, time the real cold path, meaning the first call after the system has just woken up, not a warm one. Then set the timeout from that number, not from a guess that "search should be fast."
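
A minimal way to take that measurement and turn it into a ceiling is sketched here. `measure` and `suggest_timeout` are hypothetical helpers, and the 1.3x headroom factor is an assumption chosen so that the 9.7-second recall quoted earlier lands inside the 12 to 15 second range.

```python
import time
from typing import Callable


def measure(call: Callable[[], object], warm_runs: int = 3) -> tuple[float, float]:
    """Return (cold_seconds, slowest_warm_seconds) for a callable.

    The first invocation is treated as the cold path. That only holds if the
    process and its backends really were idle before the measurement.
    """
    def once() -> float:
        start = time.perf_counter()
        call()
        return time.perf_counter() - start

    cold = once()
    warm = max(once() for _ in range(warm_runs))
    return cold, warm


def suggest_timeout(cold_seconds: float, headroom: float = 1.3) -> float:
    """Timeout ceiling: measured cold latency plus a safety margin."""
    return round(cold_seconds * headroom, 1)


ceiling = suggest_timeout(9.7)  # 12.6 with the default headroom
```

Passing your own recall function to `measure` right after a restart gives the cold number. The warm number is there as a sanity check that the higher ceiling has not changed normal behavior.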

Memory you can trust is not measured when everything is fine. It is measured when it quietly stops working, and whether you catch it in time.

Sources and references

Every number in this post (43 episodes / 185 facts / 3.66s cold embed / 9.7s full recall / 5s timeout raised to 12 to 15s) comes from the runs and logs of the system where we wired this stack up and use it for real on our own fleet (Jun 2026). Measured, not borrowed.
