A little bit of background
This is the last semester of my undergraduate life at UofM, and I took a rather challenging course: distributed systems.
Just like other courses with coding projects, the correctness of our code is determined by an autograder. The difference, for anyone unfamiliar with distributed systems, is that the programs we write are nondeterministic, since things happen in parallel on different machines (more specifically, coroutines are used to simulate the different machines). It is therefore very likely that a wrong piece of code passes the autograder many times but occasionally fails once or twice. This is annoying because, when it comes to the correctness of my code, even for a school project, I'm a recovering perfectionist... but my relapse rate remains 100%.
I believe the autograder code is also hard to write: in the second project I found a bug in the autograder that failed correct solutions at a rate of about 1 in 20, and I was awarded a bagel in the next lecture by the professor (yes, that satisfied my vanity). This is not directly related to the bugs I want to describe, but it instilled in me a subtle mindset: if I fail at a very low rate, it might be the autograder that is wrong, not my code.
We have a total of 4 projects. The third project is a Paxos-based KV server, where we need to write both the Paxos server and the KV server that uses the Paxos server's interface. For anyone who doesn't know what Paxos is: it is basically an amazing mechanism that gives you a total order of events across servers, robust against server and network failures as long as a majority of servers stays healthy. The last project adds horizontal scaling: the data is sharded, the ownership of shards can move between server groups, and the Paxos server is again used to provide reliability.
That is to say, we were reusing our code from project 3 in project 4 as the infrastructure for all the high-level things. I didn't have time to extensively test project 3 like I did for the rest of the projects (thousands of times), but I did run it a few times and it passed, and it also passed the autograder.
Now you have all the context, and here's when project 4 starts.
We passed the autograder, but something was off
Project 4 has 2 parts. Part A is rather simple: it is the manager that controls which group of servers owns which shard. I passed it with one submission on Apr 11. Part B is where we implement the KV servers themselves, which are responsible for moving shards and handling all the client requests.
On the afternoon of Apr 15, after several days of struggling to debug and refactor my Part B code, I had a version that passed the local public tests about 80% of the time. I had another take-home exam, for the notorious EECS 311, about to be due, and I knew I wouldn't be debugging my project 4 code for the rest of the day. So I decided to make my first attempt at Part B and see how many points I could get. The autograder contains some private test cases, including some "very evil" (said Prof. Brian Noble) serializability tests. The submission was really more of a staging submission to check the progress I was making, since we get one submission per day, and why not use it while my code still has bugs? It's free. Then I saw the result: 100/100, all public and private tests passed. You can imagine how shocked I was: I went in with the mindset of "this is not gonna pass; I just want to know how far I've gotten" and came out with the "right" answer. I also knew that I must have bugs somewhere, because it was really a crappy version after some brutal refactoring; honestly, I had just been tinkering around to see what could make a difference. After all, it had a 20% chance of not passing the local tests! I might have been lucky enough to pass the "weak" (not actually weak) local tests, but how could I pass the evil private tests?
Unlike with Project 3, I did have a lot of spare time this time, and the fact that I had already gotten full marks reassured me that I could push hard on debugging project 4 and play around with it.
To begin debugging it, I started with the information I had:
- Project 3 passed the autograder and is probably correct since I ran it more than 10 times locally.
- Project 4 is essentially Project 3 + some more code.
From this, I had two initial speculations: either the autograder was buggy again, or the newly added code in project 4 was to blame.
One ID, two values
The ID structure in my Paxos implementation
I cannot avoid talking a little bit about Paxos for you to understand this part. Even if you're familiar with it, this section covers some of my implementation choices, and they are kind of important for understanding the whole story.
Paxos needs an equality operator to distinguish between decided values. For example, let's say you know that the 5th operation in the total order is already decided, and a client asks you to PUT something to a key. The natural thing to do is to try to get everyone to agree that the next operation, aka the 6th operation, is the PUT you want. However, it could very well be that the 6th slot is taken by some other server's proposal, and you can only know that by checking whether the decided value is equal to the value you proposed. If it is equal, then you know you secured the slot; if it is not, you try the 7th operation, and so on.
The simplest way to implement this equality is to attach to each value you want to propose the proposing server's ID plus an ID that is unique within that server. The combination (server ID, unique ID) then uniquely identifies a value.
The reason for using this rather than comparing the value itself is that: 1. the value is usually passed down by the upper layer as an interface, so the Paxos layer does not know what it means (it cannot even know what type it is), and 2. even if the Paxos layer could check the content, there can be identical contents that are not duplicates (for example, APPENDing the same content to the same key twice). The task of the Paxos layer is simply to place the value somewhere in the global order. It could very well be that a client is retrying at more than one Paxos server, and those servers all propose the same value with their own server ID and unique ID, ending up deciding many copies of the same value in the log. But that is totally fine, since it is the user of the Paxos layer (in this case the shard KV service) who identifies and deals with this duplication, as they are the ones who can interpret the value data and are the ones retrying multiple times. The only job Paxos does is to determine a global order for any given input.
With all that said, the thing I want to convey is that (server_id, uniq_id) should uniquely determine a value in the Paxos global order, where uniq_id is unique within its server.
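To make that concrete, here is a minimal sketch of the idea in Go. This is not my actual project code: the names (Op, sameOp, addOp) and the Start/WaitDecided interface are placeholders I made up for illustration.

// A minimal sketch, not my actual project code.
package sketch

// Op is the value handed to the Paxos layer. Paxos never looks inside
// Payload; it only uses (ServerID, UniqID) to tell proposals apart.
type Op struct {
    Type     int
    Payload  interface{}
    UniqID   int64
    ServerID int
}

// sameOp reports whether two decided values came from the same proposal.
// Comparing payloads would be wrong: two different proposals can carry
// identical payloads (e.g. the same APPEND retried by a client).
func sameOp(a, b Op) bool {
    return a.ServerID == b.ServerID && a.UniqID == b.UniqID
}

// paxos is a stand-in for the real Paxos peer interface.
type paxos interface {
    Start(seq int, v interface{})    // begin agreement for slot seq
    WaitDecided(seq int) interface{} // block until slot seq is decided
}

// addOp keeps proposing op into the next free slot until the decided value
// of some slot is op itself, then returns that slot number.
func addOp(px paxos, nextSeq int, op Op) int {
    for seq := nextSeq; ; seq++ {
        px.Start(seq, op)
        decided := px.WaitDecided(seq).(Op)
        if sameOp(decided, op) {
            return seq // we secured this slot
        }
        // Someone else's proposal won slot seq; apply it and try seq+1.
    }
}

The whole scheme hinges on (server_id, uniq_id) actually being unique per proposal, which is exactly the property this story is about.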
The failing message
The symptoms seemed random. For example, here are two failing patterns:
Sometimes I have wrong answers for value queries:
--- FAIL: TestConcurrentUnreliable (3.50s)
shardkv_test.go:349: Get(6) expected 3715455544997849684 got
shardkv_test.go:349: Get(8) expected 45254913212864862875727999996981590320 got 5727999996981590320
shardkv_test.go:365: something is wrong
FAIL
exit status 1
FAIL umich.edu/eecs491/proj4/shardkv 3.524s
--- FAIL: Test failed on run 101/10000000 ===
Sometimes I have stuck servers:
--- FAIL: TestConcurrentUnreliable (31.35s)
shardkv_test.go:368: Parallel client 0 did not complete
FAIL
exit status 1
FAIL umich.edu/eecs491/proj4/shardkv 31.379s
--- FAIL: Test failed on run 32/10000000 ===
Still, there are some findings I could tell just by looking at the failure messages: for the wrong-answer case, it seems that the values of certain keys were lost at some point, because every case showed a later query that either could not find the key or could only get a trailing substring of the value.
This could be the result of two things:
- The query was answered by a server that is inconsistent with its group.
  - This would not itself be the root cause; it could result from many different root causes: a bad implementation of one or more operation handlers, a bad low-level Paxos layer (unlikely, since it passed the project 3 private tests), a bad high-level shardkv server layer, etc.
- The group is consistent, but during a shard ownership transition, the values of certain keys are lost.
For the stuck server issue, I didn't have a good idea just by looking at the failing message, and the next step was to take a look at the logs.
Dig into the logs
By the way, to change the subject completely: logging is more useful than tools like gdb for debugging a concurrent program. For those who love gdb, sorry! I was using good old print debugging because of the nature of concurrent programs: if you break at one point, the rest of the threads just keep going, which can make the error path impossible to reproduce.
Now back to the original subject: I was logging exhaustively, which was very much necessary to reconstruct the scene.
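For what it's worth, the logging setup was nothing fancy: conceptually just one prefixed logger per simulated server, something like this sketch (the helper name is made up, not my actual code):

// A sketch of per-server logging for a concurrent test run.
package sketch

import (
    "fmt"
    "log"
    "os"
)

// newServerLogger tags every line with the server ID and a microsecond
// timestamp, so interleaved output from many goroutines can be untangled.
func newServerLogger(me int) *log.Logger {
    return log.New(os.Stdout, fmt.Sprintf("[Server %d] ", me), log.Lmicroseconds)
}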
--- FAIL: TestConcurrentUnreliable (31.35s)
shardkv_test.go:368: Parallel client 0 did not complete
FAIL
exit status 1
FAIL umich.edu/eecs491/proj4/shardkv 31.379s
--- FAIL: Test failed on run 32/10000000 ===
For this error message, I wanted to trace back what client 0 was doing. While tracing it, I noticed some funny things happening around a proposal with (uniq_id=1, server_id=0).
Each operation (op) was logged in this format: {op_type {payload} uniq_id server_id}. You don't really need to pay attention to the content of the payload, but do notice that the lines below are all about (uniq_id=1, server_id=0).
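In case that shape looks cryptic: it is just Go's %v formatting of a struct, which prints the fields in declaration order inside braces. As a made-up example (the field names and the simplified payload are placeholders, not my real types):

// How the log shape maps onto a struct: %v prints fields in declaration order.
package sketch

import "fmt"

// op mirrors the field order seen in the logs: type, payload, uniq_id, server_id.
type op struct {
    Type     int
    Payload  interface{}
    UniqID   int64
    ServerID int
}

func printExample() {
    o := op{Type: 1, Payload: struct{ A, B int }{1, 2}, UniqID: 1, ServerID: 0}
    fmt.Printf("%v\n", o) // prints: {1 {1 2} 1 0}
}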
The logs are:
line 487:
Server 0 to apply decided op {1 {1 2 true map[] map[]} 1 0}
and also line 3543:
[Peer 2] Status called for seq=2, returning fate=1, value={1 {1 2 true map[] map[]} 1 0}
RSM 2 applying operation at seq 2: {1 {1 2 true map[] map[]} 1 0}
Server 2 to apply decided op {1 {1 2 true map[] map[]} 1 0}
And also later:
line 3688:
Server 0 adding op: {1 {2 1 true map[] map[]} 1 0}
RSM 0 starting AddOp for value: {1 {2 1 true map[] map[]} 1 0}
RSM 0 inserted new operation at seq 1: {1 {2 1 true map[] map[]} 1 0}
RSM 0 starting instance 1 for value: {1 {2 1 true map[] map[]} 1 0}
[Proposer 0] Starting proposal for seq=1 with value={1 {2 1 true map[] map[]} 1 0}
and also line 3803:
Server 0 to apply decided op {1 {2 1 true map[] map[]} 1 0}
What just happened was that at the very beginning, Server 0 proposed something (1 {1 2 true map[] map[]}) at uniq_id=1, and that proposal was accepted by the majority, as evidenced by Server 2 applying that decided op. However, at line 3688, Server 0 somehow wanted to propose another payload (1 {2 1 true map[] map[]}), at the exact same uniq_id=1!
This shouldn't happen, since uniq_ids were designed to be unique within a server; in my case they simply increment from 0.
What's worse, this proposal was then accepted and applied (Server 0 to apply decided op {1 {2 1 true map[] map[]} 1 0})! Totally illegal!!
Unless... the uniq_id was stored in memory, so if a server got killed and restarted, it would propose uniq_ids starting from 0 again.
However, this couldn't be correct for two reasons:
- The servers do not die, according to the spec; this is not tested. I did doubt this, but I was convinced after actually reading the autograder code.
- Even if a server could be killed, the behavior/symptom should be different. The Paxos group had already agreed on something before: the op {1 {1 2 true map[] map[]} 1 0}. The re-proposing server should be told that yes, (1, 0) was accepted, and that the value is {1 {1 2 true map[] map[]} 1 0}. That would still be buggy, since the server would apply a different operation while thinking the new one was done, but it should not result in the server applying the new operation.
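For reference, the uniq_id generator in question was nothing more exotic than an in-memory counter, conceptually like this sketch (made-up names, not my actual code):

// A sketch of the per-server uniq_id scheme: a counter starting at 0,
// bumped atomically for every new proposal. It lives only in memory,
// so a restart would reset it, which is the scenario discussed above.
package sketch

import "sync/atomic"

type idGen struct {
    next int64
}

// newID hands out 0, 1, 2, ... and never repeats within one process.
func (g *idGen) newID() int64 {
    return atomic.AddInt64(&g.next, 1) - 1
}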
All that said, I wasn't 100% sure. So I spent a couple of days trying to avoid any potential uniq_id duplication by changing the generation rules. I tried randomly choosing an int32, which didn't help. I tried an int64, which didn't help. I tried a uuid, which didn't help. I tried concatenating two uuids together, which didn't help.
Two uuids are essentially 256 bits of information! At that size, theoretically, we could generate a TON of uniq_ids (as many as \(1.5\times 10^{37}\)) and still have less than a 0.1% chance of a uniq_id collision. Even if the clients were DDoSing my poor server at a rate of 9999999999999999 requests per second, it would take approximately 47 trillion years to reach that number, a timespan orders of magnitude longer than the current age of the universe. No client would be that evil for that long.
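For the curious, those numbers fall out of the standard birthday bound, treating the two concatenated uuids as \(N = 2^{256}\) possible IDs and asking for a collision probability of \(p = 0.001\):

\[
n \approx \sqrt{2N\ln\tfrac{1}{1-p}} \approx \sqrt{2 \cdot 2^{256} \cdot 0.001} \approx 1.5\times 10^{37},
\]
\[
\frac{1.5\times 10^{37}}{10^{16}\ \text{req/s}} \approx 1.5\times 10^{21}\ \text{s} \approx 4.8\times 10^{13}\ \text{years} \approx 47\ \text{trillion years}.
\]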
This is all math I did after I figured out the bugs. Back then I was so convinced that making the uniq_ids unique would solve the problem that I went on to try a 1024-bit uniq_id, which didn't help, a 2048-bit uniq_id, which didn't help, and eventually a 4096-bit uniq_id, which, of course, didn't help either. :(
At this point, although I hadn't done the math, I was convinced that a collision in uniq_id was not the root cause. To find the real one, I almost certainly had to get my hands dirty. :(
Be patient and really dig into the logs
Still writing on it