A morning with SWE-agent
Like almost everybody, I’ve been using LLMs more and more recently. I’ve used ChatGPT to generate quick scripts, and I use GitHub Copilot autocompletions in my IDE; but I haven’t yet tried going as far as getting an LLM to autonomously make a complicated change in an existing codebase. So, in the spirit of my post on 20 minutes with Alpaca.cpp, here's a post about a morning with SWE-agent.
I recently discovered a regression in the library I use for syntax highlighting. I had poked at it a little bit myself before filing a bug, but couldn’t immediately figure out what the problem was and decided it wasn’t worth my time. It seemed like a good challenge for this experiment, though: the issue is well scoped, straightforward to understand, and I’d already written a test case for the agent to evaluate itself against. On the other hand, it’s also quite challenging to figure out what’s actually causing the bug.
SWE-bench is a benchmark test suite for exactly this kind of problem, and its website lists a whole bunch of tools and their success rates. I decided to try SWE-agent in particular because:
- “SWE-agent 1.0 (Claude 3.7 Sonnet)” is currently at the top of the SWE-bench Full leaderboard
- SWE-agent’s Hello World example points it at a GitHub issue; since I had already written a GitHub issue for my highlight.js problem, that seemed like an easy place to start.
Installing SWE-agent
I set up a new virtual machine to run everything on a clean slate, using Vagrant with a very straightforward configuration that installs Ubuntu, upgrades the system, and installs python3-venv and Docker:
Vagrant.configure("2") do |config|
  config.vm.box = "cloud-image/ubuntu-24.04"
  config.vm.define :node do |libvirt|
    libvirt.vm.provider :libvirt do |domain|
      domain.memory = "1024"
    end
  end
  config.vm.provision "shell", inline: <<-SHELL
    apt-get update
    apt-get upgrade --yes
    apt install --yes python3.12-venv
    curl -fsSL https://get.docker.com -o get-docker.sh
    sh get-docker.sh
    usermod -a -G docker vagrant
  SHELL
end
Then I booted it up to let it do the install, rebooting afterwards for good measure:
vagrant up && vagrant reload && vagrant ssh
Next, I followed the SWE-agent install instructions and put it in a Python virtual environment:
git clone https://github.com/SWE-agent/SWE-agent.git
cd SWE-agent
python3 -m venv venv
venv/bin/python -m pip install --upgrade pip
venv/bin/pip install --editable .
Then I gave it an Anthropic API key and ran the Hello World example, which completed successfully.
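For the record, that step looked roughly like this (a sketch: I’m assuming SWE-agent picks the key up from the ANTHROPIC_API_KEY environment variable, and the test-repo URLs are the example target from the SWE-agent docs):
export ANTHROPIC_API_KEY="sk-ant-..."   # placeholder key
venv/bin/sweagent run \
  --agent.model.name=claude-3-5-sonnet-20241022 \
  --agent.model.per_instance_cost_limit=2.00 \
  --env.repo.github_url=https://github.com/SWE-agent/test-repo \
  --problem_statement.github_url=https://github.com/SWE-agent/test-repo/issues/1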
Setting up the environment
After that, there followed a whole period of experimentation to get the agent’s working environment set up correctly. I started with the Hello World example, swapping out only the repository and my issue description:
venv/bin/sweagent run \
--agent.model.name=claude-3-5-sonnet-20241022 \
--agent.model.per_instance_cost_limit=2.00 \
--env.repo.github_url=https://github.com/highlightjs/highlight.js \
--problem_statement.github_url=https://github.com/highlightjs/highlight.js/issues/4234
Prompt improvements
The first obvious problem was that, after reading a bunch of JavaScript code, the agent went off and started writing Python code instead of running the test cases. I quickly realized this was caused by the default prompt, which is specified in config/anthropic_filemap.yaml:
I've uploaded a python code repository in the directory {{working_dir}}. Consider the following PR description:
<pr_description> {{problem_statement}} </pr_description>
Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met? I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way! [...]
This was causing the AI to go down a bit of a garden path; it did eventually figure out what was going on and switch to writing JavaScript, but that took extra steps and time. I swapped out “python” for “JavaScript” and modified all the references to “a script to reproduce the error” to advise it to apply the changes to the existing test files instead.
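Mechanically, the first of those swaps is just a one-word substitution in the config file; something like this sketch (applied in place; the full set of prompt changes I ended up with is in the diff further down):
sed -i 's/a python code repository/a JavaScript code repository/' config/anthropic_filemap.yaml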
Docker image
The next issue was that it was defaulting to using the Docker python:3.11 image to run commands, which was missing npm and related tools. (Claude was smart enough to start running apt-get install npm on its own once it discovered that the tools it needed weren’t there, but I didn’t want to put it through the trouble of dealing with that every time I ran it.) Using the SWE-agent environment guide, I set up a Docker container based on Ubuntu 24.10 with node and npm installed, plus the swe-rex runtime interface:
FROM ubuntu:oracular
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
WORKDIR /
# Node toolchain for building and testing highlight.js, plus pipx to install swe-rex
RUN apt-get update
RUN apt-get install --yes nodejs npm pipx
# swe-rex is the runtime interface SWE-agent uses to run commands inside the container
RUN pipx install swe-rex
RUN pipx ensurepath
ENV PATH="$PATH:/root/.local/bin/"
SHELL ["/bin/bash", "-c"]
I built this with docker build -t agent-environment -f agent-environment.Dockerfile . and added --env.deployment.image=agent-environment to my sweagent run command.
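As a quick sanity check (not something SWE-agent needs, just a way to confirm the image has the tools before burning API credits on a run):
docker run --rm agent-environment bash -c 'node --version && npm --version && pipx list'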
Some things take a while
Next I discovered that when the agent ran npm install it was timing out after 30 seconds, leaving it unable to run any tests. It tried to soldier on anyway, making wild guesses as to what the problem was, but I stopped it and added --agent.tools.execution_timeout=120 as another flag.
This ought to be cached
Around this time it crashed halfway through a run, because GitHub rate-limited an API request for the PR description. (The stack trace and timing implied that sweagent was re-downloading the description on every step, which seems very inefficient, but I didn’t dig into this.) I could have given it a GitHub API token, but I thought it made more sense to clone the repo and make a copy of the PR description locally so I could reuse them on each run, so I did that and updated the flags to --env.repo.path ~/highlight.js --problem_statement.path ~/issue4234.md.
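Concretely, the local setup was something like this (a sketch; ~/issue4234.md just needs to contain the issue text, so pasting it in by hand works too, but here I’m assuming the GitHub CLI is installed and authenticated):
git clone https://github.com/highlightjs/highlight.js ~/highlight.js
gh issue view 4234 --repo highlightjs/highlight.js --json title,body \
  --jq '"# " + .title + "\n\n" + .body' > ~/issue4234.md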
No less, only cat
At one point I suggested to the model that it should check git show d78749a so that it could see the change which had caused the problem. Unfortunately, when it tried that the command timed out after 2 minutes, with no output. (The model concluded, reasonably, that git was broken and decided to continue without trying to diagnose why.) It seems that whatever shell environment SWE-Rex sets up doesn’t work well with interactive commands like less; I fixed that issue with --env.post_startup_commands 'export PAGER=cat'.
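For what it’s worth, git’s own --no-pager flag does the same thing on a per-command basis, so either of these keeps git show from sitting in a pager (a sketch, run inside the repository clone):
export PAGER=cat              # what the post_startup_commands flag arranges for the agent's shell
git --no-pager show d78749a   # or override the pager for a single command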
Long test output
Another problem I noticed is that when it ran npm run test, the output had the name of every single passing test. There’s no reason for the LLM to have to read all that every time; it’s costing me extra money and also probably using up context window space that could be used for something better. The mocha test runner doesn’t seem to have a good way to quieten output so I gave the LLM a manual hint in the prompt:
(To reduce extraneous noise, run the tests with npm run test | tail -n20.)
Look at it go!
Once I’d made all those changes, SWE-agent could really start getting into the meat of the problem and I started getting promising-looking output. Here’s the command I ended up with:
~/SWE-agent/venv/bin/sweagent run \
--agent.model.name=claude-3-5-sonnet-20241022 \
--agent.model.per_instance_cost_limit=2.00 \
--env.deployment.image=agent-environment \
--agent.tools.execution_timeout=120 \
--env.repo.path ~/highlight.js \
--env.post_startup_commands 'export PAGER=cat' \
--problem_statement.path ~/issue4234.md
And the changes I made to the original prompt:
instance_template: |-
- <uploaded_files>
- {{working_dir}}
- </uploaded_files>
- I've uploaded a python code repository in the directory {{working_dir}}. Consider the following PR description:
+ I've uploaded a JavaScript code repository in the directory {{working_dir}}. Consider the following PR description:
<pr_description>
{{problem_statement}}
</pr_description>
Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met?
- I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!
Your task is to make the minimal changes to non-tests files in the {{working_dir}} directory to ensure the <pr_description> is satisfied.
+ You should also make sure to update the test files and run the tests to make sure your changes didn't break anything else.
+ (To reduce extraneous noise, run the tests with `npm run test | tail -n20`.)
Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to find and read code relevant to the <pr_description>
- 2. Create a script to reproduce the error and execute it with `python <filename.py>` using the bash tool, to confirm the error
+ 2. Update the test files and run them to reproduce the error using the bash tool, to confirm the error
3. Edit the sourcecode of the repo to resolve the issue
- 4. Rerun your reproduce script and confirm that the error is fixed!
+ 4. Rerun the tests and confirm that the error is fixed!
- 5. Think about edgecases and make sure your fix handles them as well
+
Your thinking should be thorough and so it's fine if it's very long.
@@ -39,13 +42,9 @@ agent:
- |
Thank you for your work on this issue. Please carefully follow the steps below to help review your changes.
- 1. If you made any changes to your code after running the reproduction script, please run the reproduction script again.
- If the reproduction script is failing, please revisit your changes and make sure they are correct.
- If you have already removed your reproduction script, please ignore this step.
- 2. Remove your reproduction script (if you haven't done so already).
- 3. If you have modified any TEST files, please revert them to the state they had before you started fixing the issue.
- You can do this with `git checkout -- /path/to/test/file.py`. Use below <diff> to find the files you need to revert.
- 4. Run the submit command again to confirm.
+ 1. If you made any changes to your code after running the tests, please run the tests again.
+ If the tests are failing, please revisit your changes and make sure they are correct.
+ 2. Run the submit command again to confirm.
Here is a list of all of your changes:
However, it ended up going around in circles and desperately trying to change things. I tried to give it some hints as to what it would need to look at:
Some hints:
- use git show d78749a to view the commit which caused the regression; make sure you understand and explain why that change was made before trying to fix things.
- It will probably be helpful to investigate how hljs.inherit works.
It did take my advice, but that didn't help; after starting off with a reasonable thought process and investigation, it fell into a loop of desperately saying it was trying “one final approach” and then making random changes (or sometimes “changes” which didn’t change anything) which didn’t make any progress towards fixing the issue.
Claude 3.7
Since Claude 3.5 wasn’t getting anywhere, I decided to try 3.7, which is more expensive but which Anthropic currently describes as “Our most intelligent model.” This also required setting max_output_tokens manually, for reasons I didn’t bother to investigate:
venv/bin/sweagent run \
--agent.model.name=claude-3-7-sonnet-20250219 \
--agent.model.max_output_tokens=64000 \
...
This was more promising! It even managed to discover some mistakes that I had made in the original issue description: in my proposed test case, I had accidentally left out the colon after TODO and also copy-pasted the wrong class name. It separately noticed both of these problems and called them out in its thought process, although it assumed that the issue description was accurate and explained that it had to modify the code to support this as a special case.
Once I corrected the issue description, it continued to work better than 3.5, in that it didn't get stuck repeating “one final approach”. However, it did still devolve into repeatedly making changes and hoping the tests would pass this time. It did seem to be proceeding with somewhat more intention, but it still spent most of its time making a change, discovering it had broken some test, making another change to try to fix that test, and discovering something else had broken. It never took a step back and went, “hmm, I think I need to investigate deeper and figure out how this is supposed to work.”
In one case it did produce a patch which had all of the test cases passing, but it did that by identifying all of the failing test cases it had caused and modifying them to match the results it had produced. To be fair, that could be a reasonable decision under other circumstances, but the result here is ugly syntax highlighting at best:
mycmd --disable-foo<span class="hljs-comment">
# a keyword as part of a parameter</span>
some-cmd set-some-setting
Conclusion
LLMs are amazing, and continue to get better! I spent a mere $5 in Anthropic credits on this experiment, and was impressed by how much the computer managed to troubleshoot its way through on its own. It would have been nice to see it go off and do a deep dive into the library’s internals to work out why this bug happened, but having tried to diagnose the issue myself I can’t exactly blame it for not succeeding at that. Even so, it seems like autonomous agents are already pretty capable; it got the build process up and running far faster than I did, and without consulting any documentation.
I don’t think I’m going to continue using SWE-agent, though. It’s entirely designed around one-shot tasks (which is what SWE-bench’s measurements are based on), but I think it would be a lot more useful if I could collaborate with the agent, letting it do all of the easy bits and giving it guidance where necessary. OpenHands, which is currently in third place on SWE-bench Full, appears to have an interface like that, so I’m planning to try it out next.
I also feel like this experience has solidified a growing feeling I had about one of the problems with LLMs as they stand today: they act like a person, but they act like a different person every time. It turns out that an important part of having co-workers is that you get to know them over time, learn how they think and what their strengths and weaknesses are, and can generally predict how they’ll behave. An LLM is a distillation of the entire human experience, and it doesn’t have a stable identity from run to run, so the built-in human anthropomorphization instinct gets thrown off. This is probably a surmountable challenge, but the best we seem to have right now is writing detailed prompts, which as far as I can tell only goes so far towards improving it.