Stop Re-Explaining Your Infrastructure to an LLM Every Monday Morning

Apr 06, 2026

If you do incident response, platform work, or anything else where problems recur, you know the feeling. An alert fires. You open a new chat. You paste the metric query, describe your cluster topology, explain which slot is logical vs physical, and slowly rebuild the AI’s understanding of your setup from scratch. Again.

I got tired of this. So I built a tool to fix it.

Session Manager in 30 Seconds

Session Manager is an open-source MCP server that records your full conversation history and lets you resume it later. Think of it as a notebook that writes itself — every investigation, every architecture discussion, every debugging session gets saved automatically.

You investigate an incident on Tuesday. On Friday, the same alert fires. You load that session and pick up where you left off. The AI already has your cluster topology, your naming conventions, what you tried, and what worked.

It works with any MCP-compatible client. Start a session in one tool, continue in another — your conversation history follows you.
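To make the idea of a self-writing notebook concrete, here is a minimal sketch of an append-only conversation log. It is not Session Manager's actual implementation: the `SessionStore` class, the one-file-per-session JSON Lines layout, and the `pg-slot-inactive` session id are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


class SessionStore:
    """Hypothetical append-only store: one JSON Lines file per session."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, session_id: str) -> Path:
        return self.root / f"{session_id}.jsonl"

    def append(self, session_id: str, role: str, content: str) -> None:
        # Write each message as it arrives, so the record builds itself.
        with self._path(session_id).open("a", encoding="utf-8") as f:
            f.write(json.dumps({"role": role, "content": content}) + "\n")

    def load(self, session_id: str) -> list[dict]:
        # Resuming a session means reading the messages back in order.
        path = self._path(session_id)
        if not path.exists():
            return []
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


# Usage: record two messages, then reload them as a later client would.
store = SessionStore(Path(tempfile.mkdtemp()) / "sessions")
store.append("pg-slot-inactive", "user", "Replication slot went inactive.")
store.append("pg-slot-inactive", "assistant", "Check pg_replication_slots.")
history = store.load("pg-slot-inactive")
```

Because the history lives outside any single chat client, a second client only needs the session id to pick up the same conversation.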

Resuming Past Work

I got an alert on a Tuesday: a PostgreSQL logical replication slot went inactive. The subscriber worker was crash-looping every 2 seconds. After a few disable/enable cycles, the original slot recovered — but then a second slot showed up as inactive.

I tried dropping it. The drop returned success. The slot reappeared. I dropped it again. It came back again. Three rounds of this before I figured out what was happening — `orders_pub` was a **publication name**, not the slot name. Same prefix, two different PostgreSQL objects. And pgAdmin’s implicit transaction was silently rolling back the drop.
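The name mix-up is easy to check for once you know to look. The sketch below classifies a name against the two PostgreSQL catalogs involved, `pg_replication_slots` and `pg_publication`. The query strings use real catalog columns, but the `classify` helper and the slot name `orders_pub_slot` are hypothetical; in practice the two sets would be filled from the query results rather than hard-coded.

```python
# Catalog lookups that would populate the sets below on a real server.
SLOTS_SQL = "SELECT slot_name, slot_type, active FROM pg_replication_slots"
PUBLICATIONS_SQL = "SELECT pubname FROM pg_publication"


def classify(name: str, slot_names: set[str], publication_names: set[str]) -> str:
    """Report which kind of PostgreSQL object a name refers to."""
    is_slot = name in slot_names
    is_publication = name in publication_names
    if is_slot and is_publication:
        return "both"
    if is_slot:
        return "slot"
    if is_publication:
        return "publication"
    return "unknown"


# Hard-coded sample data standing in for the query results (hypothetical names).
slots = {"orders_pub_slot"}
publications = {"orders_pub"}

for candidate in ("orders_pub", "orders_pub_slot"):
    print(f"{candidate}: {classify(candidate, slots, publications)}")
```

Running a check like this before a drop makes it clear which object the name actually refers to, and re-running the slot query afterwards confirms whether the drop stuck.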

48 messages, about an hour. The kind of investigation where you build up context step by step — cluster topology, naming conventions, what each slot does, what you already tried, what failed and why.

A week later, the same alert fires. Without Session Manager, I start from scratch. With it:

switch_session

I start at message 49, not message 1. The AI already knows the cluster, the slots, what worked, and why we scoped the PromQL the way we did.
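Under the hood, an MCP client invokes a server tool with a JSON-RPC `tools/call` request. The sketch below builds such a request for `switch_session`. The envelope follows the MCP tool-call shape, but the `session_id` argument name and its value are assumptions, since the tool's real parameters are not shown here.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects a unique id per request.
_request_ids = count(1)


def build_tool_call(tool: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Hypothetical argument: the real tool may identify sessions differently.
request = build_tool_call(
    "switch_session", {"session_id": "pg-replication-slot-inactive"}
)
print(json.dumps(request, indent=2))
```

In normal use the client sends this for you; the point is that resuming a session is a single tool call rather than an hour of re-explaining.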

After the investigation, I used Session Manager to generate a playbook and added it to our alert annotations. Next time this fires, whoever is on call has the steps before they even open an AI chat. Investigate once, record it, generate a playbook, and attach it to the alert. The investigation pays forward.

Switching Between Projects

Resuming investigations is useful, but it’s not where Session Manager helps me most. The real value is juggling parallel work.

I had a security hardening project running across 58 VMs: SSH into each host, check its state, make the changes, update the tracker.

Here, Session Manager and the LLM act as a sidekick that keeps notes: one more pair of eyes.

In between, urgent issues kept coming in. An alert fires, a deployment breaks, something needs attention now. I’d drop the security work, switch to the incident, resolve it, and then... what was I doing? Which VMs had I already processed? What was the exact offboarding workflow?

switch_session

The AI loads the session. It knows I’ve completed 19 of 58 hosts, which playbooks I’m using, which users are retained vs removed on each host, and what the next batch looks like. I pick up exactly where I left off.

That’s the workflow I use every day. Multiple sessions open for different projects. When I need to switch context, I switch sessions. No re-explaining, no “let me bring you up to speed.”

80+ Sessions and Counting

I’ve been using Session Manager for a few months now. Incident investigations, security hardening across a VM fleet, architecture discussions, deployment troubleshooting. The ability to switch context and resume work from days or weeks ago has changed how I work with AI tools.

The code is open source: sessionmngr

If you want to try the hosted version — start a session in one client, continue in another, your conversation history follows you — I’m looking for early testers. Leave your email, and I’ll send you an API key.

I’m particularly interested in hearing from people who do incident response or infrastructure work. What’s actually useful? What’s missing? That’s what I want to learn.
