Claude Code plugins: how we evaluate before installing

Last week our timeline had one name on it: Ponytail, the plugin that makes your agent write code as short as a lazy senior dev. People shared the pretty numbers everywhere. Trending right alongside it: Headroom, a context compressor that gained tens of thousands of stars in a few weeks. We almost hit /plugin install on both. But we stopped and asked one question: is "it is popular" the same thing as "it fits us"?

By the end of the week we had installed neither on our main setup. Not because they are bad. Both are genuinely good. They just did not pass the test we run on every tool. This post is that test, plus the tools we already had that made adding more unnecessary.

First, the basicsWhat Claude Code plugins are

A Claude Code plugin is a shareable extension package: it bundles several things that add capability to Claude Code into one unit. A single plugin can carry skills (capabilities the model invokes on its own), subagents (helpers it delegates subtasks to), hooks (event handlers that run automatically), MCP servers (bridges out to external tools like GitHub or Slack), plus slash commands and default settings, all in one install.

Installing has three beats: add a marketplace, which is a catalog of plugins, with /plugin marketplace add; browse what's available with /plugin; then install one at a time with /plugin install. Plugins from different marketplaces are namespaced so their commands don't collide.

The common mix-up is treating plugin, skill, and MCP as the same thing. A skill is one unit of capability, MCP is the way to reach an external tool, and a plugin is the box that wraps those up so they can be shared and versioned. Once you see it's a box you can install in seconds, the question that matters isn't "how do I install it" but "which ones should I install", which is the rest of this article.

Part 1Popular is not a buy signal

Whatever is trending was built for the median user, not for your workflow or your values. The more hype, the higher the pressure to install fast, and that is exactly the moment to slow down, not speed up. Hype is a signal to evaluate, not a signal to install. The difference is small but expensive, because every tool you add is not free. It takes room in your head, eats your context budget, and costs you debugging time when something breaks.

Part 2The test: blast radius times measurable pain

We do not judge a tool by how cool the feature is. We judge it on two axes: one, blast radius, how far it spreads if it breaks; two, measurable pain, is there a problem this tool actually solves right now that we can put a number on. Drop a tool onto those axes and the answer usually shows up on its own:

Zero radius (free, does not touch the existing system): try it, nothing to lose.
Narrow radius (a library wired into one spot you own the code for): make it an experiment, turn it on only when there is real pain.
Whole-system radius (sits in the path everything flows through): send it back, the risk is not worth it.

There is a third axis: does it fight how you like to work. We build ahead. A tool that keeps pushing us to cut everything to the bone every single turn will keep braking that habit, no matter how short the code it writes.

Part 3Check the killer condition before you spend effort measuring

The Headroom test taught us the most. Headroom compresses tokens before they reach the model, to save on the API bill. We meant to measure real before/after on a job with a big context blob. But before measuring, we asked the one question that ends the whole plan:

The job we would put Headroom on, does it pay for the model per token, or as a flat subscription?

The answer was subscription, every job. The agent we run calls the model through a flat-quota path, not a per-token bill. Which means there is no per-token bill to compress in the first place. So we cut the entire measurement plan on one question that took under a minute to check.

The lesson: find the killer condition first and check it first. Do not spend effort measuring something that might get thrown out before you even start.

Part 4Two real examples, and what we already had

Ponytail fell on the redundancy-and-values axis. The tools for trimming code bloat, we already had in the kit, and we want you to see them in case you have them too:

/simplify in Claude Code: reviews a diff and does the cutting for you, not just pointing at it.
/code-review: covers both bloat and bugs in one command, a superset of /simplify.
caveman: compresses the agent's words, a different axis from compressing code, stacks without overlap.

What Ponytail adds, a list of things for you to go cut yourself, overlaps with the /simplify we already have, and ours does the cutting for us on top of that. Add that it fights our build-ahead habit, and that it injects rules every turn, eating context budget the whole time. As for the "80-94% less code" number, that is Ponytail's own benchmark, not ours, and even if it holds, it does not answer a pain we do not have yet. Verdict: skip. But skip is not never. The day Ponytail does the cutting itself instead of handing us a list to go cut, and drops the every-turn rule injection, it earns a fresh look. Until then, what we already have covers the axis.

Headroom fell on the killer condition above, as a whole-system install. But it is not a bad tool. The day we have a job that genuinely pays per token, we will come back and try it as a scoped library, measuring before/after in that one spot, not putting a proxy in the path of the whole system. Verdict: not now, but the wake-up condition is written down.

Both verdicts came from the same criteria, not from the hype.

Where this lands

A test you can use right away when a new tool trends:

Ask first: "popular" or "fits us."
Put it on two axes: blast radius times measurable pain.
Find the killer condition and check it before spending effort.
Check whether the kit you already have covers that axis before adding a new one.
Write down what would flip the no: a condition or a date. A no with no trip-wire quietly hardens into a veto while the tools keep getting better.

Passes all five, then install. Does not pass, you still have what already works. The hype will keep coming. It is your criteria that should stay put.

Where to start: next time your finger moves to install, stop and ask question 1. That one is enough.

References

Ponytail repo (80-94% is the maintainer's benchmark) github.com/DietrichGebert/ponytail
Headroom repo (60-95%, maintainer's) github.com/chopratejas/headroom
our own session decision (2026-06-17)

In the same series

“Make it pretty” is the wrong prompt: a design skill that refuses to code until strategy questions are answered

A plugin passed the gate, now who gets it? Curate a per-agent skill suite instead of installing everything for everyone

Evaluating before you install is step one. Once installed, the next step is auditing which MCP servers your agent actually calls and pruning the rest. Read how to audit after install

Another way to shape Claude Code to your hand is a hook, the seam where you teach your agent new habits. Try a voice notification for which task finished and which tab, with Claude Code hooks.

Evaluating before you let something in is the same discipline as reviewing code before merge: why a second model should review the code AI wrote

This is one layer of the full production AI agent architecture (7 layers).

Our timeline was full of new plugins. We evaluated first and installed none.