Debugging a Rare Bug (1 in 800 Autograder Runs) with AI

A little bit of background

This is the last semester of my undergraduate life at UofM, and I'm taking a rather challenging course: distributed systems.

Just like other courses with coding projects, the correctness of our code is determined by an autograder. The difference, though—for anyone new to distributed systems—is that the programs we write are nondeterministic, since things happen in parallel on different machines (more specifically, coroutines are used to simulate different machines). It is then entirely possible for a wrong piece of code to pass the autograder many times and only occasionally fail once or twice.

I believe the autograder code is also hard to write: in the second project, I found a bug in the autograder that failed correct solutions at a rate of 1 in 20. This is not directly linked to the bugs I want to describe, but it planted a subtle mindset in me: if I fail at a very low rate, it might be the autograder that is wrong, not my code.

We have a total of 4 projects. The third project is a Paxos-based KV server, where we write both the Paxos server and the KV server built on the Paxos server's interface. For anyone who doesn't know what Paxos is: it is basically a mechanism that gives you a total order of events across servers, robust against server and network failures as long as a majority stays healthy. The last project adds horizontal scaling: the key space is split into shards, shard ownership can move between server groups, and the Paxos server is again used to provide reliability.

That is to say, we were reusing our code from Project 3 in Project 4. I didn't have time to test Project 3 as extensively as the other projects (thousands of runs), but I did run it a few times locally and it passed, and it also passed the autograder.

Now you have all the context, and here's where Project 4 starts.

We passed the autograder, but something was off

Project 4 has two parts. Part A is rather simple: it is the manager that controls which group of servers owns which shard. I passed it with one submission on Apr 11. Part B is where we implement the KV servers themselves, responsible for moving shards and serving all the client requests.

In the afternoon of Apr 15, after several days of struggling to debug and refactor my Part B code, I had a version that passed the local public tests about 80% of the time. I also had a take-home exam about to be due, so I knew I wouldn't be debugging for the rest of the day, and I decided to make my first submission attempt for Part B and see how many points I could get. The autograder contains some private test cases, including some "very evil" (said Prof. Brian Noble) serializability tests. The submission was more of a staging submission to measure my progress, since we get one submission per day, and why not use it while my code still has bugs? It's free.

Then I saw the result: 100/100, all public and private tests passed. You can imagine how shocked I was: I went in with the mindset of "this is not gonna pass; I just want to know how far I've gotten" and came out with the "right" answer. I also knew I must have bugs somewhere, because this was really a crappy version after some brutal refactoring; honestly, I had just been tinkering around to see what could make a change. I might have been lucky enough to pass the "weak" (not actually) local tests, but how could I pass the evil private tests?

Unlike with Project 3, I had some spare time this round, and the fact that I already had full marks reassured me that I could push hard on debugging Project 4 and play around with it.

To begin debugging, I started with the information I had:

  1. Project 3 passed the autograder and is probably correct since I ran it more than 10 times locally.
  2. Project 4 is essentially Project 3 + some more code.

From this, I had two initial speculations: either the autograder was buggy again, or the newly added code in Project 4 was to blame.

One ID with two values

The ID structure in my Paxos implementation

I can't avoid talking a little bit about Paxos for you to understand this part. Even if you're familiar with it, this section describes some of my implementation choices, and they are important for understanding the whole story.

Paxos needs an equality operator to distinguish between decided values. For example, say you know that the 5th operation in the total order is already decided, and a client asks you to PUT something to a key. The natural thing to do is to try to get everyone to agree that the next operation, i.e. the 6th, is the PUT you want. However, it may well be that the 6th slot goes to some other server's proposal, and the only way to know is to check whether the decided value is equal to the value you proposed. If it is, you know you secured the slot; if not, you try the 7th slot, and so on.
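To make this concrete, here is a minimal sketch of that claim-a-slot loop in Go (the language these projects are written in). It assumes a lab-style Paxos API with a Start(seq, v) call to propose a value at a slot and some way to wait until a slot is decided; every name below (commitOp, nextSeq, waitDecided, applyOp) is my own illustration, not the actual course interface.

```go
// commitOp keeps proposing op at successive slots until op itself becomes
// the decided value for some slot, then returns that slot number.
func (kv *KVServer) commitOp(op Op) int {
	for seq := kv.nextSeq; ; seq++ {
		kv.px.Start(seq, op)           // propose op for slot seq
		decided := kv.waitDecided(seq) // block until slot seq has a decided value
		if decided.ID == op.ID {       // the equality check described above
			kv.nextSeq = seq + 1
			return seq // we secured this slot
		}
		kv.applyOp(decided) // someone else's proposal won; apply it, try the next slot
	}
}
```

How that ID-based equality check works is exactly what comes next.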

The simplest way to implement this equality is to tag each value you propose with the proposing server's ID and an ID that is unique within that server. The combination (server ID, unique ID) then uniquely identifies a value.
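In code, this could be as simple as a pair of comparable fields. Here is a sketch with hypothetical names, not my actual project code:

```go
// OpID uniquely identifies a proposed value across the whole system.
type OpID struct {
	ServerID int   // which server created this proposal
	Seq      int64 // unique within that server, e.g. an incrementing counter
}

// Op is what the upper layer hands to Paxos: an ID for equality checks,
// plus the actual payload, which Paxos never interprets.
type Op struct {
	ID    OpID
	Value interface{} // opaque to the Paxos layer
}
```

Since both fields of OpID are plain integers, two Ops can be compared with a simple == on their IDs, no matter what Value holds.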

The reason for using this rather than comparing the value itself is that the value is usually passed down by the upper layer as an interface, so the Paxos layer has no idea what it means (it cannot even know its type). The task of the Paxos layer is simply to place it in some slot of the global order. It could very well be that the client is retrying at more than one Paxos server, with each server proposing the same value under its own server ID and unique ID, so the log ends up deciding many copies of the same value. But that is totally fine: it is the user of the Paxos layer who identifies this repetition, as they are the ones who can interpret the value data and the ones retrying multiple times. The only job Paxos does is to determine a global order for whatever it is given.
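As an illustration of that upper-layer deduplication, continuing with the hypothetical names from the sketches above: the KV layer can type-assert the payload (which Paxos cannot do) and key the dedup on the client's own request ID, since the duplicate log entries carry different (server ID, unique ID) tags but the same client request inside.

```go
// applyOp executes a decided operation at most once. ClientRequest and
// ReqID are hypothetical: some client-chosen identifier inside the payload.
func (kv *KVServer) applyOp(op Op) {
	req := op.Value.(ClientRequest) // the KV layer knows the payload's real type
	if kv.applied[req.ReqID] {
		return // this request already took effect in an earlier slot
	}
	kv.applied[req.ReqID] = true
	// ... execute req (Get/Put/Append, shard moves, ...) against the store ...
}
```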

With all that said, the thing I want to convey is that (server ID, unique ID) should uniquely determine a value in the Paxos global order, where the unique ID is unique within its server.


To be completed
