ReC98

📝 Posted:: 2021-06-06 00:53 UTC
🚚 Summary of:: P0143, P0144, P0145
⌨ Commits:: (Website) 9069fb7...c8ac7e5, (Website) c8ac7e5...69dd597, (Website) 69dd597...71417b6
💰 Funded by:: [Anonymous], Yanga, Lmocinemod
🏷 Tags:

Who said working on the website was "fun"? That code is a mess. This right here is the first time I seriously wrote a website from (almost) scratch. Its main job is to parse over a Git repository and calculate numbers, so any additional bulky frameworks would only be in the way, and probably need to be run on some sort of wobbly, unmaintainable "stack" anyway, right? 😛 📝 As with the main project though, I'm only beginning to figure out the best structure for this, and these new features prompted quite a lot of upfront refactoring…

Before I start ranting though, let's quickly summarize the most visible change, the new tag system for this blog!

Yes, I manually went through every one of the 82 posts I've written so far, and assigned labels to them.
The per-project (rec98 and website) and per-game (th01 th02 th03 th04 th05) tags are automatically generated from the database and the Git commit history, respectively. That might have ended us up with a fair bit of category clutter, as any single change to a tiny aspect is enough for a blog post to be tagged with an otherwise unrelated game. For now, it doesn't seem too much of an issue though.
Filtering already works for an arbitrary number of tags. Right now, these are always combined with AND – no arbitrary boolean expressions for tag filtering yet.
Adding filters simply works by adding components to the URL path: https://rec98.nmlgc.net/blog/tag/tag1/tag2/tag3/… and so on.
Hovering over any tag shows a brief description of what that tag is about. Some of the terms really needed a definition, so I just added one for all of them. Hope you all enjoy them!
These descriptions are also shown on the new tag overview page, which now kind of doubles as a glossary.

Finally, the order page now shows the exact number of pushes a contribution will fund – no more manual divisions required. Shoutout to the one email I received, which pointed out this potential improvement!

As for the "invisible" changes: The one main feature of this website, the aforementioned calculation of the progress metrics, also turned out as its biggest annoyance over the years. It takes a little while to parse all the big .ASM files in the source tree, once for every push that can affect the average number of removed instructions and unlabeled addresses. And without a cache, we've had to do that every time we re-launch the app server process.
Fundamentally, this is – you might have guessed it – a dependency tracking problem, with two inputs: the .ASM files from the ReC98 repo, and the Golang code that calculates the instruction and PI numbers. Sure, the code has been pretty stable, but what if we do end up extending it one day? I've always disliked manually specified version numbers for use cases like this one, where the problem at hand could be exactly solved with a hashing function, without being prone to human error.

(Sidenote: That's why I never actively supported thcrap mods that affected gameplay while I was still working on that project. We still want to be able to save and share replays made on modded games, but I do not want to subject users to the unacceptable burden of manually remembering which version of which patch stack they've recorded a given replay with. So, we'd somehow need to calculate a hash of everything that defines the gameplay, exclude the things that don't, and only show replays that were recorded on the hash that matches the currently running patch stack. Well, turns out that True Touhou Fans™ quite enjoy watching the games get broken in every possible way. That's the way ZUN intended the games to be experienced, after all. Otherwise, he'd be constantly maintaining the games and shipping bugfix patches… 🤷)

Now, why haven't I been caching the progress numbers all along? Well, parallelizing that parsing process onto all available CPU cores seemed enough in 2019 when this site launched. Back then, the estimates were calculated from slightly over 10 million lines of ASM, which took about 7 seconds to be parsed on my mid-range dev system.
Fast forward to P0142 though, and we have to parse 34.3 million lines of ASM, which takes about 26 seconds on my dev system. That would have only got worse with every new delivery, especially since this production server doesn't have as many cores.

I was thinking about a "doing less" approach for a while: Parsing only the files that had changed between the start and end commit of a push, and keeping those deltas across push boundaries. However, that turned out to be slightly more complex than the few hours I wanted to spend on it. And who knows how well that would have scaled. We've still got a few hundred pushes left to go before we're done here, after all.

So with the tag system, as always, taking longer and consuming more pushes than I had planned, the time had come to finally address the underlying dependency tracking problem.
Initially, this sounded like a nail that was tailor-made for 📝 my favorite hammer, Tup: Move the parser to a separate binary, gather the list of all commits via git rev-list, and run that parser binary on every one of the commits returned. That should end up correctly tracking the relevant parts of .git/ and the new binary as inputs, and cause the commits to be re-parsed if the parser binary changes, right? Too bad that Tup both refuses to track anything inside .git/, and can't track a Golang binary either, due to all of the compiler's unpredictable outputs into its build cache. But can't we at least turn off–

> The build cache is now required as a step toward eliminating $GOPATH/pkg. — Go 1.12 release notes

Oh, wonderful. Hey, I always liked $GOPATH! 🙁

But sure, Golang is too smart anyway to require an external build system. The compiler's build ID is exactly what we need to correctly invalidate the progress number cache. Surely there is a way to retrieve the build ID for any package that makes up a binary at runtime via some kind of reflection, right? Right? …Of course not, in the great Unix tradition, this functionality is only available as a CLI tool that prints its result to stdout. 🙄
But sure, no problem, let's just exec() a separate process on the parser's library package file… oh wait, such a thing doesn't exist anymore, unless you manually install the package. This would have added another complication to the build process, and you'd still have to manually locate the package file, with its version-specific directory name. That might have worked out in the end, but figuring all this out would have probably gone way beyond the budget.

OK, but who cares about packages? We just care about one single file here, anyway. Didn't they put the official Golang source code parser into the standard library? Maybe that can give us something close to the build ID, by hashing the abstract syntax tree of that file. Well, for starters, one does not simply serialize the returned AST. At least into Golang's own, most "native" Gob format, which requires all types from the go/ast package to be manually registered first.
That leaves ast.Fprint() as the only thing close to a ready-made serialization function… and guess what, that one suffers from Golang's typical non-deterministic order when rendering any map to a string. 🤦

Guess there's no way around the simplest, most stupid way of simply calculating any cryptographically secure hash over the ASM parser file. 😶 It's not like we frequently change comments in this file, but still, this could have been so much nicer.
Oh well, at least I did get that issue resolved now, in an acceptable way. If you ever happened to see this website rebuilding: That should now be a matter of seconds, rather than minutes. Next up: Shinki's background animations!

⮜ Blog