How We Built the Memory Scoring Engine
When a coding agent starts a new session, it has a limited context window. You can't dump every memory into the prompt — you need to select the ones that matter most for the current task. This is the job of the memory scoring engine: given a set of memories and a retrieval context, rank them by relevance and return the top N.
The naive approach is to sort by recency. Recent memories are more likely to be relevant. But recency alone misses a lot. A memory about a critical architectural decision from three months ago might be far more important than a trivial preference recorded yesterday.
Our scoring engine combines four signals: recency, frequency, semantic proximity, and explicit feedback.
Recency is computed as an exponential decay function. Memories lose relevance over time, but the decay curve is tunable per memory type. Architectural decisions decay slowly. Debugging notes decay quickly. The half-life for each type was calibrated against real-world usage patterns across several hundred developer sessions.
Frequency tracks how often a memory has been accessed or referenced. Memories that the agent retrieves repeatedly are likely foundational — they represent recurring patterns or decisions that affect many parts of the codebase. Frequency acts as a positive reinforcement signal: the more useful a memory proves to be, the higher it ranks.
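One way to turn raw access counts into a bounded signal is a saturating curve, so a memory's rank grows with repeated use but cannot dominate the other signals. The exact normalization is an assumption; the engine only requires that the output be comparable across memories:

```python
import math

def frequency_score(access_count: int, scale: float = 10.0) -> float:
    """Squash an access count into [0, 1) with diminishing returns.
    `scale` is a hypothetical tuning knob: at access_count == scale the
    score is about 0.63, and it approaches 1.0 asymptotically."""
    return 1.0 - math.exp(-access_count / scale)
```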
Semantic proximity uses lightweight embedding similarity to match memories against the current retrieval query. We use a local embedding model (no cloud dependency) to compute vector similarity between the query context and each memory's content. This is the signal that handles topical relevance — when you're working on authentication, auth-related memories should rank higher.
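Once the local model has produced embeddings, the proximity signal reduces to vector similarity. A minimal cosine-similarity sketch, assuming embeddings arrive as plain float lists:

```python
import math

def cosine_similarity(query_vec: list[float], memory_vec: list[float]) -> float:
    """Cosine similarity between a query embedding and a memory embedding.
    Higher means more topically related; zero vectors score 0.0."""
    dot = sum(q * m for q, m in zip(query_vec, memory_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_m = math.sqrt(sum(m * m for m in memory_vec))
    if norm_q == 0.0 or norm_m == 0.0:
        return 0.0
    return dot / (norm_q * norm_m)
```

In practice the embedding model, not this arithmetic, does the heavy lifting; the point is that the comparison itself is cheap and runs entirely locally.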
Feedback is the explicit signal from the developer or agent. When a memory is marked as helpful or unhelpful, that feedback adjusts the score directly. Positive feedback provides a durable boost; negative feedback suppresses the memory in future retrievals without deleting it.
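One way to fold helpful/unhelpful marks into a bounded score is a smoothed ratio centered on neutral, so negative feedback suppresses a memory without zeroing it out. The formula below is illustrative; the article does not specify the actual adjustment:

```python
def feedback_score(helpful: int, unhelpful: int) -> float:
    """Map feedback counts into [0, 1], with 0.5 meaning no signal.
    The +1 smoothing term (an assumption) keeps a single vote from
    swinging the score to an extreme."""
    total = helpful + unhelpful
    if total == 0:
        return 0.5  # no feedback yet: neutral
    return 0.5 + 0.5 * (helpful - unhelpful) / (total + 1)
```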
The final score is a weighted combination of these four signals. The weights are configurable, but the defaults were chosen through iterative testing: semantic proximity contributes the most (40%), followed by recency (25%), frequency (20%), and feedback (15%).
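Putting the pieces together, the weighted combination and the top-N selection look roughly like this, assuming each signal has already been normalized to [0, 1]:

```python
# Default weights from the text: semantic 40%, recency 25%,
# frequency 20%, feedback 15%. They sum to 1.0 and are configurable.
DEFAULT_WEIGHTS = {
    "semantic": 0.40,
    "recency": 0.25,
    "frequency": 0.20,
    "feedback": 0.15,
}

def combined_score(signals: dict[str, float],
                   weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of per-signal scores; missing signals count as 0."""
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())

def rank_top_n(scored_memories: list[tuple[str, float]], n: int) -> list[tuple[str, float]]:
    """Return the top-N (memory_id, score) pairs, highest score first."""
    return sorted(scored_memories, key=lambda pair: pair[1], reverse=True)[:n]
```

Because `combined_score` takes the signal values as inputs rather than caching them on the memory, it naturally supports the retrieval-time scoring discussed below.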
One design decision worth highlighting: we compute scores at retrieval time, not at write time. This means the ranking is always fresh — a memory's importance is a function of when and how you're asking for it, not a static property assigned when it was created. This dynamic scoring is what makes the system adaptive: the same memory can be highly relevant in one context and irrelevant in another.
The scoring engine runs entirely locally, typically completing a full ranking of several thousand memories in under 50 milliseconds. This is fast enough to be invisible to the developer and to the agent — which is exactly the point. Good infrastructure disappears.