fer's brain dump

Race condition when you don’t even know you’re racing

April 14, 2021
Filed under dev

Who doesn’t remember War Games (1983)?

A strange game. The only winning move is not to play.

WOPR

And yet, a stranger game is a game no party knows they’re playing.

At a previous client I was one of the people in charge of migrating their Python code from managing files on NFS to an S3 solution. Which sounds easy if you ignore the fact that the main thing their app did was moving files around: extracting them, tarballing them, splitting them by lines and size, sending them… all sorts of operations that routinely work on paths and file descriptors, which no longer exist when you deal with S3 objects (or boto3, for that matter).

In the end it was quite a lot of work, and most of my code in the endeavor was a layer that turned S3 objects into file-like objects when we needed them to be so, and some extra metadata caching to prevent zillions of network roundtrips where previously the Linux VFS had our backs covered.
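That caching layer was, in spirit, little more than memoizing the HEAD requests. A minimal sketch of the idea (the class name `S3MetadataCache` and its interface are my own invention, not the client's code), assuming a boto3-style client with a `head_object` method:

```python
class S3MetadataCache:
    """Cache HEAD-style metadata (size, mtime, ...) per (bucket, key).

    The Linux VFS used to give us this for free; over S3 every stat-like
    call is a network roundtrip unless we remember the answer.
    """

    def __init__(self, client):
        self._client = client
        self._cache = {}

    def stat(self, bucket, key):
        ident = (bucket, key)
        if ident not in self._cache:
            # one roundtrip per object instead of one per stat()
            self._cache[ident] = self._client.head_object(Bucket=bucket, Key=key)
        return self._cache[ident]

    def invalidate(self, bucket, key):
        # must be called after any write, or the cache happily lies
        self._cache.pop((bucket, key), None)
```

Any real version also needs TTLs and size bounds, but the idea is just that: pay the network once, answer the zillion stat calls from memory.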

All seemed well and good, but one of my client's partners, just one of dozens, started complaining about receiving corrupted files from us. By partner I don't mean some team on the next floor or in the next building, but a different company across the ocean, and a third, barely approachable party was in charge of forwarding the file. The file itself was more or less the same size every day, in the multiple-GiB range.

A (us) --SFTP--> B (some other guys) --SFTP--> C (them)

Our first reaction was to verify our own archive: the S3 object we had was correct, data valid, and indeed their checksum didn’t match ours.
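Checksumming a multi-GiB object without materializing it is straightforward with chunked reads; something along these lines (a sketch, not the actual tooling we used):

```python
import hashlib

def checksum_fileobj(fileobj, chunk_size=1024 * 1024):
    """SHA-256 of a file-like object, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()
```

For the S3 side you can feed it `s3.Object(bucket, key).get()["Body"]`, which is already a file-like streaming body, and compare the hex digest against whatever the partner computes on their end.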

After a call with the C guys, it turned out their tarball would extract most files, but from a certain point on there were errors, and even some files that did extract were often corrupt. The tarball's size matched but, we suspected, towards the end the bytes turned sour. We had no means to see what data was off, since corporate firewalls prevented us from simply rsyncing each other, and the file was just too big to send otherwise. Going through B was also a no-go: a new contract would have had to be signed for such an outrageous idea. We half-joked about mailing a pendrive, but security didn't like the non-joke part of it.

Constrained by corporate policies, and after some back and forth in our test environments, it was clear that our change triggered this corruption, so we had to rule out that the corruption took place while sending.

We already had tests in place with a mock SFTP server, but we had never checked that the file arrived uncorrupted because, c'mon, TCP and SSH should detect and correct any issue, right? Wrong: maybe some library we used, or even our own code, was misbehaving and sending garbage. So it was a good moment to pay down technical debt, and we extended these tests only to discover… nothing. Everything worked correctly under these new, extensive tests. But at least that debt was paid.

Current status:

S3 off:
A --SFTP--> B --SFTP--> C
OK   OK     OK   OK     OK


S3 on:
A --SFTP--> B --SFTP--> C
OK   ?      ?    ?      KO

Trying to book a meeting with B didn't seem possible, but something was happening between us and them, or between them and our partner. It raises questions about certain business decisions, but I'm not paid to be a business consultant 🤷.

Without any input from this party to work with, there wasn't much to do other than look very, very hard at our code and think about how it could change the behavior of a system we knew nothing about. Nice.

The code could have looked like this (IANAL, but I wrote this from memory, hence it's my own work done in my free time):

def sendFile(self, fileProxy, paramiko_client, target_path):
    if fileProxy.is_s3:
        # let boto3 download the object straight onto the remote SFTP file
        with paramiko_client.open(target_path, 'wb') as remote_file:
            self.s3.meta.client.download_fileobj(fileProxy.bucket, fileProxy.key, remote_file)
    else:
        # plain POSIX path: paramiko reads the local file and writes it out
        paramiko_client.put(fileProxy.path, target_path)

Basically, if we're in S3 mode, we let boto3 write onto the file-like object returned by paramiko, and call it a day. fileProxy is a special file-like object that behaves as whatever it should be depending on the underlying object (S3 object, actual local file, HDFS file, Ceph object…). Very smart, very simple, but also very proprietary.
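For flavor, a toy version of such a proxy (the real one is proprietary; this sketch invents a minimal read-only interface and only covers two backends):

```python
class FileProxy:
    """Toy file-like proxy: read() works the same whatever the backend is."""

    def __init__(self, *, path=None, s3_object=None):
        self.is_s3 = s3_object is not None
        if self.is_s3:
            # boto3's get()["Body"] is a streaming, file-like body
            self._fl = s3_object.get()["Body"]
        else:
            self.path = path
            self._fl = open(path, "rb")

    def read(self, size=-1):
        return self._fl.read(size)

    def close(self):
        self._fl.close()
```

The real thing also exposed the write side, seeking where the backend allowed it, and the metadata caching mentioned earlier, but the point is the uniform interface.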

If we are in regular POSIX fs mode, paramiko does both the reading and writing, without any file-like object in the middle.

The two branches are different, yes, but arguably equivalent.

I was desperate. I was about to leave on holiday, and this wasn't the type of problem I like to have in the back of my head. It felt like a bugfixing version of The Trial: the software does something wrong but you are not given the tools to figure out what. Something in boto3/botocore was probably doing this, but I couldn't know for sure, and there was nowhere to start looking to find out.

One night I turned on the TV to numb my mind, which is something I almost never do, and some 800 m race was on. Starting gun, athletes running, and in time crossing the finish line. The podium was set, but runners kept arriving for a bit. Two of them were notably slower, and just as I expected them to reach the line, the coverage switched to the long jump. And then it clicked.

When a file is large enough (over boto3's multipart threshold, 8 MiB by default), download_fileobj launches threads to handle each multipart piece. Once the last block of data is written, the final size of the file won't change anymore, but other threads may still be writing to earlier parts of the file. So if a process monitors file size as its signal for "file is ready", it can fire before the transfer is actually complete. B's system forwarding the file before A had finished writing it explained pretty much all of this head-scratching behavior.
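The effect is easy to reproduce locally without boto3 or threads at all. Writing the last part of a file first makes the file reach its final size immediately, while the earlier parts are still holes full of zero bytes; boto3's download threads do effectively the same seek-and-write, just concurrently and with real S3 parts:

```python
import os
import tempfile

PART_SIZE = 5
DATA = b"abcdefghijklmnopqrst"  # 4 toy parts of 5 bytes each

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
path = tmp.name  # empty file, like a fresh download target

def write_part(index):
    # what each download thread effectively does: seek to its offset, write
    with open(path, "r+b") as f:
        f.seek(index * PART_SIZE)
        f.write(DATA[index * PART_SIZE:(index + 1) * PART_SIZE])

write_part(3)  # the "last runner" crosses the line first
size_after_last_part = os.path.getsize(path)   # already the final size!
with open(path, "rb") as f:
    content_now = f.read()                     # ...but mostly NUL bytes

for i in (0, 1, 2):  # the remaining "threads" finish
    write_part(i)
with open(path, "rb") as f:
    content_done = f.read()
os.remove(path)
```

At the point where only the last part exists, `size_after_last_part` is already 20 while `content_now` is mostly zeros; anyone who ships the file on a "size stopped changing" trigger ships garbage.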

I VPN’d in, and pushed the fix for the team to test during my absence:

def sendFile(self, fileProxy, paramiko_client, target_path):
    if fileProxy.is_s3:
        # paramiko reads the proxy sequentially and writes it out in order
        paramiko_client.putfo(fileProxy, target_path)
    else:
        paramiko_client.put(fileProxy.path, target_path)

Where we use putfo, with the magical file-like object I talked about, to ensure the data is written linearly by paramiko as it is read, linearly or not, by boto3.
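What putfo buys us is ordering: at its core it just loops read() → write(), so the remote file only ever grows from the front. A sketch of that loop (paramiko's actual implementation also handles callbacks, stat confirmation and window sizes):

```python
def copy_sequentially(src, dst, chunk_size=32768):
    """Pump a file-like src into dst front to back, putfo-style.

    The remote side never sees a sparse file: byte N is only written
    after bytes 0..N-1, so "size reached final size" finally does
    imply "data is complete".
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        dst.write(chunk)
        total += len(chunk)
```

boto3 is still free to fetch the object however it likes on its side of the proxy; the SFTP side only ever sees an append-only stream.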

Before leaving I also sent an email asking B whether my assumptions were correct. My guesswork, given our connection speeds and the file size, pointed at a window of around 10 seconds.

When I returned to the office, there was no answer, but everyone was happy. We won the race against B that we didn’t know we were running.

Why not use mtime instead?

—Someone not at B