Every so often, we would get this error from a specific test:
The error was intermittent, so we fell into the seductive but dangerous trap of simply rerunning our tests whenever it failed. Recently, I had a bit of time and decided to dig into it and fix it once and for all.
My first task was to see if I could reproduce the error locally. We often saw the error when running the tests on our Jenkins continuous integration server, so there was the possibility that the problem was environmental. However, we also knew that the failure was intermittent, so we couldn’t be sure it was environmental even if the test passed locally a few times.
I rigged up a small shell script to simply run the test over and over again while I wandered away from my computer. The script looked something like:
1 2 3 4 5 6 7 8 9 10 11 12
I’m sure there’s a proper statistical manner to determine how many times the test would have to be run without failing to be reasonably certain that the test won’t fail, but I didn’t have to worry about that - the test failed somewhere within the first ten or so runs.
Now that I knew the test could fail on my local setup, it was time to dig into what the test was doing. The test was fairly concise and readable (which I highly appreciated) and looked something like:
1 2 3 4 5
At this point, I cloned the capybara repository and started poking
git grep quickly found where
defined. Following the thread of code led to
has_no_selector?, which calls the
all method. This method
had a pretty clear split between the “finding” part of the code and
the “filtering” part. There was no magic I used here to see this, just
previous experience debugging race conditions.
I opened up the installed gem and inserted a sleep directly into the code between the “finding” and “filtering” sections. It’s ugly doing this, but it’s good to try to not change too many things at once when debugging. I played with the sleep value a bit and eventually found a value that reliably reproduced the failure. Success!
Well, maybe not complete success, but at least a step in the right direction. Even though I could reproduce the problem, I had only reproduced it in our production application, and I had modified my installed gem directly. It was time to make a nice test case.
Since I knew that I would want to make changes to Capybara, I used the
:path parameter in the Gemfile to point to my local checkout of
Capybara. This is an awesome feature of Bundler that you might not
know about. It also means I’m not messing with my generally-available
copy of Capybara, which is good for my sanity.
I then created a stripped-down version of the test, the same as the example above. After getting everything hooked up, I ran the test but it didn’t fail. This was bad news - I had done a few big steps between the production app and the smaller test case - which one of them could have changed the behavior?
Well, almost. Even though I had a test case, I still needed to show that code to someone who could do something about it. Checking back at the Capybara website, I looked for how to submit a ticket. Right at the top is a nice, clear comment:
Need help? Ask on the mailing list (please do not open an issue on GitHub)
So, I pushed my changes to Capybara to my fork and updated my test app to use a remote git version of the gem (another cool Bundler feature). I then pushed my test case to GitHub as well. I took a bit of time to create a short README so that anyone stumbling on the test app would have a clue as to what it was.
After that, it was just a small matter to write up a clear email to the Capybara list. I still find emailing new lists scary. Who knows how the list will respond? This time I got a nice surprise:
That’s a very nice bug report, Jake.
It appears to be a bug indeed. I’ve been able to reproduce it on master as well.
A GitHub issue has been opened, and the bug is well on the way to being fixed. Yay for Open Source!