Here we go, the finale of the Shuusou Gyoku Linux port, culminating in packages for the Arch Linux AUR and Flathub! No intro, this is huge enough as it is.
Before we could compile anything for Linux, I still needed to add GCC/Clang support to my Tup building blocks, in what's hopefully the last piece of build system-related work for a while. Of course, the decision to use one compiler over the other for the Linux build hinges entirely on their respective support for C++ standard library modules. I 📝 rolled out import std; for the Windows build last time and absolutely do not want to code without it anymore. According to the cppreference compiler support table at the time I started development, we had the choice between
experimental support in the not-yet-released GCC 15, and
partial support as of Clang 17, two versions ago.
GCC's current implementation does compile in current snapshot builds, but still throws lots of errors when used within the Shuusou Gyoku codebase. Clang's allegedly partial support, on the other hand, turned out just fine for our purposes. So for now, Clang it is, despite not being the preferred C/C++ compiler on most Linux distributions. In the meantime, please forgive the additional run-time dependency on libc++, its C++ standard library implementation. 🙇 Let's hope that it all will actually work in GCC 15 once that version comes out sometime in 2025.
At a high level, my Tup building blocks only have to do a single thing to support standard library modules with a given compiler: Finding the std and std.compat module interface units at the compiler's standard locations, and compiling them with the same compiler flags used for the rest of the project. Visual Studio got the right idea about this: If you compile on its command prompts, you're already using a custom shell with environment variables that define the necessary paths and parameters for your target platform. Therefore, it makes sense to store these module units at such an easily reachable path – and sure enough, you can reliably find the std module unit at %VCToolsInstallDir%\modules\std.ixx. While this is hands down the optimal way of locating this file, I can understand why GCC and Clang would want module lookup to work in generic shells without polluting environment variables. In this case, asking some compiler binary for that path is a decent second-best option.
Unfortunately, that would have been way too simple. Instead, these two compilers approached the problem from the angle of general module usage within the common build systems out there:
Using modules within a project introduces a new kind of dependency relation between C++ source files, forcing all such code to be compiled in an implicitly defined order. For Tup, this isn't much of a problem because it has always required 📝 order-relevant dependencies to be explicitly specified. So it's been quite amusing for me to hear all these CMake-entrenched CppCon speakers in recent years comment on how this aspect of modules places such a burden on build systems… 🤭
Then again, their goal is a world where devs just write import name_of_module; and the build system figures out a project's dependency graph on its own by scanning all source files prior to compilation. Or rather, asking the compiler to parse the source files and dump out this information, using the -fdeps-* options on GCC, the separate clang-scan-deps tool for Clang, or the cl /scanDependencies option for MSVC.
Because each of the three major compilers has its own implementation of modules, it's understandable why the options and tools are different. Obviously though, CMake is interested in at least getting all three to output the dependency information in the same format. So they got onto the C++ committee's SG15 working group and proposed a JSON format, which GCC and Clang subsequently implemented.
But wait! The source files for the std and std.compat modules don't lie inside the source tree and couldn't be found by such a scan over the declared project files. So SG15 later simply proposed using the same JSON format for this purpose and installing such a JSON file together with the standard library implementation.
But wait! That only shifted the problem, because now we need to find that JSON file. What does the paper have to say on that issue?
For the Standard Library:
The build system should be able to query the toolchain (either the compiler or relevant packaging tools) for the location of that metadata file.
Wonderful. Just what we wanted to do all along, only with an additional layer of indirection that now forces every build system to include a JSON parser somewhere in its architecture. 🤦
In CMake's defense, they did try to get other build systems, including Tup, involved in these proposals. Can't really complain now if that was the consensus of everybody who wanted to engage in this discussion at the time. Still, what a sad irony that they reached out to Tup users on the exact day in 2019 at which I retired from thcrap and shelved all my plans of using Tup for modern C++ code…
So, to locate the interface units of standard library modules on Clang and GCC, a build system must do the following:
Ask the compiler for the path to the modules.json file, using the 30-year-old -print-file-name option.
GCC and Clang implement this option in the worst possible way by basically conditionally prepending a path to the argument and then printing it back out again. If the compiler can't find the given file within its inscrutable list of paths or you made a typo, you can only detect this by string-comparing its output with your parameter. I can't imagine any use case that wouldn't prefer an error instead.
Clang was supposed to offer the conceptually saner -print-library-module-manifest-path option, but of course, this is modern C++, and every single good idea must be accompanied by at least one other half-baked design or implementation decision.
Load the JSON file with the returned file name.
Parse the JSON file.
Scan the "modules" array for an entry whose "logical-name" matches the name of the standard module you're looking for.
Discover that the "source-path" is actually relative and will need to be turned into an absolute one for your compilation command line. Thankfully, it's just relative to the path of the JSON file we just parsed.
Sure, you can turn everything into a one-liner on Linux shells, but at what cost?
You might argue that Tup rules are a rather contrived case. Tup by itself can't store the output of processes in variables because rule generation and rule execution are two separate phases, so we need to call clang -print-file-name at both of the places in the command line where we need the file name. But, uh, CMake's implementation is 170 lines long…
At least it's pretty straightforward to then use these compiled modules. As far as our Tup building blocks are concerned, it's just another explicit input and a set of command-line flags, indistinguishable from a library. For Clang, the -fmodule-file=module_name=path option is all that's required for mapping the logical module names to the respective compiled debug or release version.
GCC, however, decided to tragically over-engineer this mapping by devising a plaintext protocol for a microservice like it's 2014. Reading the usage documentation is truly soul-crushing as GCC tries everything in its power to not be like Clang and just have simple parameters. Fortunately, this mapper does support files as the closest alternative to parameters, which we can just echo from Tup for some 📝 90's response file nostalgia. At least I won't have to entertain this folly for a moment longer after the Lua code is written and working…
So modules are justifiably hard and we should cut compiler writers some slack for having to come up with an entirely new way of serializing C++ code that still works with headers. But surely, there won't be any problems with the smaller new C++ features I've started using. If they've been working in MSVC, they surely do in Clang as well, right? Right…?
Once again, C++ standard versions are proven to be utterly meaningless to anyone outside the committee and the CppCon presenters who try to convince you they matter. Here's the list of features that still don't work in Clang in early 2025:
C++20's std::jthread, which fixes an important design flaw of C++'s regular thread class. This would have been very unfortunate if I hadn't coincidentally already rewritten my threading code to use SDL's more portable thread API as part of the Windows 98 backport. Thus, I could adopt that work into this delivery, gifting a much-needed extra 0.3 pushes of content to the Windows 98 backport. 🙌
C++17's std::from_chars() for floating-point values, which we use to parse 📝 gain factors for waveform BGM out of Vorbis comment tags. This one is a medium-sized tragedy: Since it's not worth it to polyfill this function with a third-party library for just a single call, the best thing we can do is to fall back on strtof() from the C standard library. Why wasn't I using this function all along, you may ask? Well, as we all know by now, the C standard library is complete and utter trash, and strtof() is no exception, suffering from locale braindeath. (A sketch of this fallback follows below.)
A good chunk() (ha) of the C++23 range adaptors. As a rather new addition to the language, I've only made sporadic use of them so far to get a feel for their optimal usage. But as it turns out, sporadic use of range adaptors makes very little sense because the code is much simpler and easier to read without them. And this is what the C++ committee has been demanding our respect for all this time? They have played us for absolute fools.
The -2 might look slightly cryptic at first, but since this code is part of a constinit block, we'd get a compiler error if we either wrote too few elements (and left parts of the array uninitialized) or wrote too many (and thus out of the array's bounds). Therefore, the number can't be anything else.
It almost looked like it'd finally be time for my long-drafted rant about the state of modern C++, but the language just barely redeemed itself with the last two sentences there. Some other time, then…
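To round off that std::from_chars() point: here's a minimal sketch of the kind of fallback involved, using a hypothetical ParseGain() helper rather than the actual code from the codebase:

```cpp
#include <charconv>  // std::from_chars()
#include <cstdlib>   // strtof()
#include <optional>
#include <string>
#include <string_view>

// Parses a decimal gain factor out of a Vorbis comment value.
// Hypothetical helper for illustration purposes only.
std::optional<float> ParseGain(std::string_view str)
{
#if defined(__cpp_lib_to_chars)
	// The C++17 way: locale-independent, and no null termination required.
	float ret = 0.0f;
	const auto result = std::from_chars(str.data(), (str.data() + str.size()), ret);
	if(result.ec != std::errc{}) {
		return std::nullopt;
	}
	return ret;
#else
	// libc++ fallback: strtof() needs a null-terminated string and parses
	// according to the current locale, which might not use '.' as the
	// decimal separator.
	const std::string terminated{ str };
	char *end = nullptr;
	const float ret = strtof(terminated.c_str(), &end);
	if(end == terminated.c_str()) {
		return std::nullopt;
	}
	return ret;
#endif
}
```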
On the bright side, all my portability work on game logic code had exactly the effect I was hoping for: Everything just worked after the first successful compilation, with zero weird run-time bugs resulting from the move from a 32-bit MSVC build to 64-bit Clang. 🎉
Before we can tackle text rendering as the last subsystem that still needs to be ported away from Windows, we need to take a quick look at the font situation. Even if we don't care about pixel-perfectly matching the game's text rendering on Windows, MS Gothic seems to be the only font that fits the game's design at all:
All text areas are dimensioned around the exact metrics of MS Gothic's embedded bitmaps. In menus, each half-width character is expected to be exactly 7×14 pixels large because most of the submenu items are aligned with spaces. In text boxes and the Music Room, glyphs can be smaller than the intended 8×16 pixels per half-width character, but they can't be larger without cutting off something somewhere.
Only bitmap fonts can deliver the sharp and pixelated look the game goes for. Subpixel rendering techniques are crucial for making vector fonts look good, but quickly get ugly when applied to drop-shadowed text rendered at these small sizes:
That's MS Gothic in both pictures. The smoothed rendering on the help text might arguably look nicer, but it clashes very badly with the drop shadow in the menus.
However, MS Gothic is non-free and any use of the font outside of a Windows system violates Microsoft's EULA. In spite of that, the AUR offers three ways of installing this font:
The ttf-ms-*auto-* packages download a Windows 10 or 11 ISO from a somewhat official download link on Microsoft's CDN and extract the font files from there. Probably good enough if downloading 5 GB only to scrape a single 9 MB font file out of that image doesn't somehow feel wrong to you.
The regular ttf-ms-win* packages (the ones without -auto or -cdn in their name) leave it up to you where exactly you get the files from. While these are the clearest options in how they let you manually perform the EULA infringement, this manual nature breaks automated AUR helpers. And honestly, requiring you to copy over all 141 font files shipped with modern Windows is massively overkill when we only need a single one of them. At that point, you might as well just copy msgothic.ttc to ~/.local/share/fonts and not bother with any package. Which, by the way, works for every distro as well as Flatpaks, which can freely access fonts on the host system.
You might want to go the extra mile and use any of these methods for perfectly accurate text rendering on Linux, and supporting MS Gothic should definitely be part of the intended scope of this port. But we can't expect this from everyone, and we need to find something that we can bundle as part of the Flatpak.
So, we need an alternative free Japanese font that fits the metric constraints of MS Gothic, has embedded bitmaps at the exact sizes we need, and ideally looks somewhat close. Checking all these boxes is not too easy; Japanese fonts with a full set of all Kanji in Shift-JIS are a niche to begin with, and nobody within this niche advertises embedded bitmaps. As the DPI resolutions of all our screens only get higher, well-designed modern fonts are increasingly unlikely to have them, thus further limiting the pool to old fonts that have long been abandoned and probably only survived on websites that barely function anymore.
Ultimately, the ideal alternative turned out to be a font named IPAMonaGothic, which I found while digging through the Winetricks source code. While its embedded bitmaps only cover the first half of MS Gothic's size range, with font heights between 10 and 16 pixels rather than going all the way up to 22, that happens to be exactly the range we need for this game.
If you're a PC-98 hardware fan, the difference between these two fonts is probably already reminding you of the stylistic difference between NEC's and Epson's versions of the ROM font.
Both of these screenshots were made on Windows. Obviously, the Linux port shouldn't settle for anything less than pixel-perfectly matching these reference renderings with both fonts.
Alright then, how are we going to get these fonts onto the screen with something that isn't GDI? With all the emphasis on embedded bitmaps, you might come to the conclusion that all we want to do is to place these bitmap glyphs next to each other on a monospaced grid. Thus, all we'd need is a TTF/OTF library that gives us the bitmap for a given Unicode code point. Why should we use any potentially system-specific API then?
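To make this hypothetical minimal approach concrete, here's roughly what it would look like when asking FreeType directly for an embedded bitmap; the file name, size, and lack of error handling are all just for illustration:

```cpp
#include <ft2build.h>
#include FT_FREETYPE_H

// A sketch of the minimal approach: load the embedded bitmap of a single
// code point straight out of the font file. Error handling omitted, and the
// face is deliberately leaked to keep the sketch short.
FT_Bitmap LoadGlyphBitmap(const char *font_path, FT_ULong code_point)
{
	FT_Library library;
	FT_Init_FreeType(&library);

	FT_Face face;
	FT_New_Face(library, font_path, 0, &face);

	// With a pixel size that exactly matches one of the embedded strikes,
	// FreeType picks the bitmap instead of rasterizing the outline.
	FT_Set_Pixel_Sizes(face, 0, 16);
	FT_Load_Char(face, code_point, FT_LOAD_RENDER);

	// The glyph image, ready to be blitted onto a monospaced grid.
	return face->glyph->bitmap;
}
```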
But if we instead approach this from the point of view of GDI's feature set, it does seem better to match a standard Windows text rendering API with the equivalent stack of text rendering libraries that are typically used by Linux desktop environments. And indeed, there are also solid reasons why this is a better idea for now:
There actually is a single instance where this game uses MS Gothic at a height of 24 pixels, which is too large to be covered by its embedded bitmaps and thus requires rasterization of vector outlines. Whenever the SCL parser encounters an unknown opcode, it shows this error message:
Modders may very well end up seeing this one as a result of bugs in SCL compilers.
You might see debug text as not worth bothering with, but then there's Kioh Gyoku. Not only does that game display its text at much bigger sizes throughout, but it also renders every string at 3× its final size before downscaling, similar to the 2× scale factor used by the 640×480 Windows Touhou games. Going for a full-featured solution that works with both embedded bitmaps and outlines saves us time later.
We'd be ready for translations into even the most complex-to-render non-ASCII scripts.
Since our fonts might not support these scripts, having the API fall back on other fonts installed in the system as necessary would allow us to add these translations independently of figuring out the font situation for them.
In fact, text rendering must technically already support glyph fallback because 📝 the BGM pack selection just displays path names, which count as user input. If people use code points in their BGM pack folder names that aren't covered by either of our two fonts, they probably have some font installed on their system that can display them. Also, the missing .DAT file screen further below in that post shows that GDI already does glyph fallback with emoji, so wouldn't it be lame if the Linux version didn't have at least feature parity in this regard? Instead, the Linux stack would actually outperform GDI thanks to the former's natural support for color emoji. 🎨
Since we're explicitly porting to desktop Linux here, using the standard Linux text rendering stack is the least bloated option because Linux users will have it installed anyway. We can still reach for more minimalistic alternatives later once we do port this game to something other than Linux.
Let's look at what this stack consists of and how the libraries interact with each other:
FreeType provides access to everything related to the rendering of TTF and OTF fonts, including their embedded bitmaps, as well as a rasterizer for regular vector glyphs. It's completely obvious why we need this library.
GLib2 is a collection of various general utility functions that modern non-C languages would have in their standard libraries. Most notably, it provides the tables and APIs for Unicode character data, but its iconv wrapper also comes in quite handy for converting the Shift-JIS text from the original .DAT files to UTF-8 without additional dependencies.
FriBidi implements the Unicode Bidirectional Algorithm, just in case you've thrown some Arabic or Hebrew into your string.
HarfBuzz implements shaping, i.e., the translation of raw Unicode into a sequence of glyph indices and positions depending on what's supported by the font. We might not strictly need this library right now, but it's completely obvious why we will eventually need it for translations.
Fontconfig manages all fonts installed on the system, maps user-friendly font names to file names, tracks their Unicode coverage, and offers a central place for storing various font tweaking options.
Normally, games wouldn't need this library because they just bundle all the fonts they need and hardcode any required tweaking settings to make them look as intended. Looking back at our font situation though, installing MS Gothic in a system-wide way through a package that puts the font into a standard location will be the simplest method of meeting that optional dependency. This is a reasonable assumption in a neatly packaged Linux system where the font is just another item on the game's dependency list, but also within a Flatpak, where "system-wide" includes any fonts shipped with the image. If we now assume that IPAMonaGothic is installed in the same way, we can let Fontconfig handle the actual selection. All we need to do is to specify a preference for MS Gothic over IPAMonaGothic, and Fontconfig will take care of the rest, without us writing a single line of TTF-loading code. (A sketch of this follows below.)
Pango combines the three libraries above into an API that somewhat matches GDI's simplicity, laying out text in one or multiple lines based on the shaped output of HarfBuzz and substituting glyphs as necessary based on Fontconfig information. The actual rendering, however, is delegated to…
Cairo, a… "2D graphics library"? Now why would we need one of those if all we want is a buffer filled with pixels? Wikipedia's description emphasizes its vector graphics capabilities, which seems to describe the library better than the nondescript blurb on its official website, but doesn't FreeType already do this for text? After looking at it for way too long, the best summary I can come up with is "a collection of font rasterization code that should have maybe been part of FreeType, plus the aforementioned general 2D vector graphics code we don't need". Just like Pango wraps HarfBuzz and Fontconfig to lay out the individual glyphs, Cairo wraps FreeType and raw pixel buffers to actually place these glyphs on its surface abstraction. (And also Fontconfig because of all its configuration settings that can influence the rendering.) Ultimately, this means that each font is represented by a HarfBuzz+FreeType handle, a Pango+Cairo handle, and a Cairo+FreeType handle, which I'm sure won't be relevant later on. 👀
Pango does have a raw FreeType backend that could render text entirely without Cairo, but it's not really maintained and supports neither embedded bitmaps nor color emoji. So we don't have much of a choice in the matter.
Created using pango-view -t 'effective. Power لُلُصّبُلُلصّبُررً ॣ ॣh ॣ ॣ🌈冗' --font='MS Gothic 16px' --backend=cairo.
Created using pango-view -t 'effective. Power لُلُصّبُلُلصّبُررً ॣ ॣh ॣ ॣ🌈冗' --font='MS Gothic 16px' --backend=ft2.
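As for the Fontconfig-driven font preference mentioned above: expressing it can be as small as a comma-separated family list in a Pango font description. A sketch, assuming the rest of the Pango/Cairo setup and with the size just as an example:

```cpp
#include <pango/pangocairo.h>

// Sketch: Pango accepts a comma-separated family list, and Fontconfig then
// resolves it to the first family that is actually installed.
void SetPreferredFont(PangoLayout *layout)
{
	PangoFontDescription *desc = pango_font_description_from_string(
		"MS Gothic,IPAMonaGothic 16px"
	);
	pango_layout_set_font_description(layout, desc);
	pango_font_description_free(desc);
}
```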
Fun fact: Since Cairo also manages the temporary CPU image buffer we draw on and then hand to SDL, our backend for Shuusou Gyoku ends up with 3× as many Cairo function calls as Pango function calls.
In the end, a typical desktop Linux program requires every single one of these 8 libraries to end up with a combined API that resembles Ye Olde Win32 GDI in terms of functionality and abstraction level. Sure, the combination of these eight is more powerful than GDI, offering e.g. affine transformations and text rendering along a curved path. But you can't remove any of these libraries without falling behind GDI.
Even then, my Linux implementation of text rendering for Shuusou Gyoku still ended up slightly longer than the GDI one due to all the Pango and Cairo contexts we have to manually manage. But I did come up with a nice trick to reduce at least our usage of Cairo: Since GDI needs to be used together with DirectDraw, the GDI implementation must keep a system-memory copy of the entire 📝 text surface due to 📝 DirectDraw's possibility of surface loss. But since we only use Cairo with SDL, the Cairo surface in system memory does not actually need to match the SDL-managed GPU texture. Thus, we can reduce the Cairo surface to the role of a merely temporary system-memory buffer that is only as large as the single largest text rectangle, and then copy this single rectangle to the intended packed place within the texture. I probably wouldn't have realized this if the seemingly simplest way to limit rendering to a fixed rectangle within a Cairo surface didn't involve creating another Cairo surface, which turned out to be quite cumbersome.
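Here's a rough sketch of that idea, with made-up names, no error handling, and the pixel format question glossed over; the actual implementation differs in its details:

```cpp
#include <SDL.h>
#include <cairo.h>
#include <pango/pangocairo.h>

// Sketch: render one string into a small temporary Cairo surface, then copy
// just that rectangle into its packed position inside the SDL texture.
// Assumes the texture uses a 32-bit format compatible with Cairo's ARGB32.
void RenderTextRect(SDL_Texture *texture, const SDL_Rect& dst, PangoLayout *layout)
{
	// Only needs to be as large as the single largest text rectangle.
	cairo_surface_t *surface = cairo_image_surface_create(
		CAIRO_FORMAT_ARGB32, dst.w, dst.h
	);
	cairo_t *cr = cairo_create(surface);

	cairo_set_source_rgb(cr, 1.0, 1.0, 1.0);
	pango_cairo_show_layout(cr, layout);
	cairo_surface_flush(surface);

	// Copy the rendered rectangle to its intended place within the texture.
	SDL_UpdateTexture(
		texture, &dst,
		cairo_image_surface_get_data(surface),
		cairo_image_surface_get_stride(surface)
	);

	cairo_destroy(cr);
	cairo_surface_destroy(surface);
}
```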
But can this stack deliver the pixel-perfect rendering we'd like to have? Well, almost:
Cue hours of debugging to find the cause behind these vertical shifts. The overview above already suggested it, but this bug hunt really drove home how this entire stack of libraries is a huge pile of redundantly implemented functionality that interacts with and overrides each other in undocumented and mostly unconfigurable ways. Normally, I don't have much of a problem with that as long as I can step through the code, but stepping through Cairo and especially Pango is a special kind of awful. Both libraries implement dynamic typing and object-oriented paradigms in C, thus hiding their actually interesting algorithms under layers and layers of "clean" management functions. But the worst part is a particularly unexpected piece of recursion: To lay out a paragraph of text, Pango requires a few font metrics, which it calculates by laying out a language-specific paragraph of example text. No, I do not like stepping through functions that much, please don't put a call to the text layout function into the text layout function to make me debug while I debug, dawg…
It'll probably take many more years until most of this stack has been displaced with the planned Rust rewrites. But honestly, I don't have great hopes as long as they stay with this pile-of-libraries approach. This pile doesn't even deserve to be called a stack given the circular dependency between FreeType and HarfBuzz…
Ultimately, these are the bugs we're seeing here:
When rendering strings that contain both Japanese and Latin characters with MS Gothic, the Japanese characters are pushed down by about 1/8th of the font height. This one was already reported in June 2023 and is a bug in either HarfBuzz, Pango, or MS Gothic. With the main HarfBuzz developer confused and without an idea for a clean solution, the bug has remained unfixed for 1½ years.
For now, the best workaround would be to revert the commit that introduced the baseline shift. Since the Flatpak release can bundle whatever special version of whatever library it needs, I can patch this bug away there, but distro-specific packages or self-compiled builds would have to patch Pango themselves. LD_LIBRARY_PATH is a clean way of opting into the patched library without interfering with the regular updates of your distro, but there's still a definite hurdle to setting it up.
The remaining 1-pixel vertical shift is, weirdly enough, caused by hinting. Now why would a technique intended for improving the sharpness of outline fonts even apply to bitmap fonts to begin with? As you might have guessed, the pile-of-libraries approach strikes once more:
We can override Cairo's metric hinting defaults with the API documented in the page I linked above. But we must only do so conditionally because 16-pixel MS Gothic does require metric hinting for its glyph placement to match GDI. The resulting hack is very much not pretty.
Cairo's font options can only be really changed at the level of a Cairo context. Any Pango font handle created from a Pango layout mapped to a Cairo context will get a copy of that context's font options at creation time. And of course, the Pango level treats these options as an implementation detail that cannot be modified from the outside. So, we need to figure out the font using raw Fontconfig calls instead of Pango's abstraction. Oh, and this copy also forces us to recreate the Pango layout if we change between 14- and 16-pixel MS Gothic, which is not necessary with IPAMonaGothic.
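Reduced to the Cairo and Pango calls involved, the override looks roughly like this; how the font is identified and how the condition is derived (via raw Fontconfig calls) is left out:

```cpp
#include <cairo.h>
#include <pango/pangocairo.h>

// Sketch: turn off metric hinting unless the current font needs it.
// 16-pixel MS Gothic relies on it for its glyph placement to match GDI.
void ApplyHintingWorkaround(PangoContext *context, bool needs_metric_hinting)
{
	cairo_font_options_t *options = cairo_font_options_create();
	cairo_font_options_set_hint_metrics(options, (needs_metric_hinting
		? CAIRO_HINT_METRICS_ON
		: CAIRO_HINT_METRICS_OFF
	));

	// Layouts created from this context copy these options at creation time,
	// so existing layouts might have to be recreated afterwards.
	pango_cairo_context_set_font_options(context, options);
	cairo_font_options_destroy(options);
}
```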
Don't you love it when the concerns are so separated that they end up overlapping again? I'm so looking forward to writing my own bitmap font renderer for the multilingual PC-98 translations, where the memory constraints of conventional DOS RAM make it infeasible to use any libraries of this pile to begin with 😛
Before we can package this port for Flathub, there's one more obstacle we have to deal with. Flathub mandates that any published and publicly listed app must come with an icon that's at least 128×128 pixels in size. pbg did not include the game's original 32×32 icon in the MIT-licensed source code release, but even if he did, just taking that icon and upscaling it by 4× would simultaneously look lame and more official than it perhaps should.
So, the backers decided to commission a new one, depicting VIVIT in her title screen pose but drawn in a different style as to not look too official. Mr. Tremolo Measure quickly responded to our search and Ember2528 liked his PC-98-esque pixel art style, so that's what we went for:
However, the problem with pixel art icons is that they're strongly tied to specific resolutions. This clashes with modern operating system UIs that want to almost arbitrarily scale icons depending on the context they appear in. You can still go for pixel art, and it sure looks gorgeous if their resolution exactly matches the size a GUI wants to display them at. But that's a big if – if the size doesn't match and the icon gets scaled, the resulting blurry mess lacks all the definition you typically expect from pixel art. Even nearest-neighbor integer upscaling looks cheap rather than stylized, as the coarse pixel grid of the icon clashes with the finer pixel grid of everything surrounding it.
So you'd want multiple versions of your icon that cover all the exact sizes it will appear at, which is definitely more expensive than a single smooth piece of scalable vector artwork. On a cursory look through Windows 11, I found no fewer than 7 different sizes that icons are displayed at:
16×16 in the title bar and all of Explorer's list views
24×24 in the taskbar
28×28 in the small icon next to the file name in Explorer's detail pane (which is never sharp for some reason, even if you provide a 28×28 variant?!)
32×32 in the old-style Properties window
48×48 in Explorer's Medium icons view
96×96 in Explorer's Large icons view, and the large icon in its detail pane
256×256 in Explorer's Extra large icons view
And that's just at 1× display scaling and the default zooming factors in Explorer.
But it gets worse. Adding our commissioned multi-resolution icon to an .exe seems simple enough:
Bundle the individual images into a single .ico file using magick in1.png in2.png … out.ico
Write a small resource script, call rc, and add the resulting .res file to the link command line
Be amazed as that icon appears in the title and task bars without you writing a single line of code, thanks to SDL's window creation code automatically setting the first icon it finds inside the executable
But what's going on in Explorer?
Same Extra large icons setting for both.
That's the 48×48 variant sitting all tiny in the center of a 256×256 box, in a context where we expect exactly what we get for the .ico file. Did I just stumble right into the next underdocumented detail? What was the point of having a different set of rules for icons in .exe files? Make that 📝 another Raymond Chen explanation I'm dying to hear…
Until then, here's what the rules appear to be:
256×256 is the one and only mandatory size for high-res program icons on Windows.
48×48 is the next smallest supported size, as unbelievable as that sounds. Windows will never use any other icon variant in between. Some sites claim that 64×64 is supported as well, but I sure couldn't confirm that in my tests.
Those 96×96 use cases from the list above? Yup, Windows will never actually display an embedded 96×96 icon at its native resolution, and either scale up the 48×48 variant (in the Large icons view) or scale down the 256×256 variant (in the detail pane).
You only ever see an embedded icon with a size between 48×48 and 256×256 if it's the only icon available – and then it still gets scaled to 48×48. Or to 96×96, depending on what Explorer feels like.
Getting different results in your tests? Try rebuilding the icon cache, because of course Windows still struggles with cache invalidation. This must have caused unspeakable amounts of miscommunication with artists over the decades.
Oh well, let's nearest-neighbor-scale our 128×128 icon by 2× and move on to Linux, where we won't have such archaic restrictions…
…which is not to say that pixel art icons don't come with their own issues there. 🥲
On Linux, this kind of metadata is not part of the ELF format, but is typically stored in separate Desktop Entry files, which are analogous to .lnk shortcuts on Windows. Their plaintext nature already suggests that icon assignment is refreshingly sane compared to the craziness we've seen above, and indeed, you simply refer to PNG or even SVG files in a separate directory tree that supports arbitrary size variants and even different themes. For non-SVG icons, menus and panels can then pick the best size variant depending on how many pixels they allot to an icon. The overwhelming majority of the ones I've seen do a good job at picking exactly the icon you'd expect, and bugs are rare.
But how would this work for title and task bars once you started the app? If you launched it through a Desktop Entry, a smart window manager might remember that you did and automatically use the entry's icon for every window spawned by the app's process. Apparently though, this feature is rather rare, maybe because it only covers this single use case. What about just directly starting an app's binary from a shell-like environment without going through a Desktop Entry? You wouldn't expect window managers to maintain a reverse mapping from binaries to Desktop Entries just to also support icons in this other case.
So, there must be some way for a program to tell the window manager which icon it's supposed to use. Let's see what SDL has to offer… and the documentation only lists a single function that takes a single image buffer and transfers its pixels to the X11 or Wayland server, overriding any previous icon. 😶
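That single function is SDL_SetWindowIcon(), and its use boils down to this sketch, with the icon pixels assumed to come from wherever you loaded them:

```cpp
#include <SDL.h>

// Sketch: SDL 2's one documented way of setting a window icon.
// `pixels` is assumed to point at a single 128×128 RGBA image.
void SetWindowIcon(SDL_Window *window, void *pixels)
{
	SDL_Surface *icon = SDL_CreateRGBSurfaceWithFormatFrom(
		pixels, 128, 128, 32, (128 * 4), SDL_PIXELFORMAT_RGBA32
	);
	SDL_SetWindowIcon(window, icon); // Copies the pixels; one size, no variants.
	SDL_FreeSurface(icon);
}
```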
Well great, another piece of modern technology that works against pixel art icons. How can we know which size variant we should pick if icon sizing is the job of the window manager? For the same reason, this function used to be unimplemented in the Wayland backend until the committee of Wayland stakeholders agreed on the xdg-toplevel-icon protocol last year.
Now, we could query the size of the window decorations at all four edges to at least get an approximation, but that approach creates even more problems:
Which edge do we pick? The top one? The largest one? How can we possibly be sure that the one we pick is the one that will show the icon?
Even if we picked the correct edge, the icon will likely be smaller and not cover the full area. Again, anything less than an exact match isn't good enough for pixel art.
This function is not implemented on Wayland because client windows aren't supposed to care about how the server is decorating them.
But even among X11 window managers, there's at least one that doesn't report back the border sizes immediately after window creation. 🙄
Most importantly though: What if that icon is also used in a taskbar whose icons have a different size than the ones in title bars? Both X11's _NET_WM_ICON property and Wayland's xdg-toplevel-icon-v1 protocol support multiple size variants, but SDL's function does not expose this possibility. It might look as if SDL 3 supports this use case via its new support for alternate images in surfaces, but this feature is currently only used for mouse cursors. That sounds like a pull request waiting to happen though, I can't think of a reason not to do the same for icons. contribution-ideas?
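For reference, a manual pixel transfer that does include multiple size variants would roughly look like this on X11, bypassing SDL entirely; a sketch only, with a made-up Icon structure and the usual Xlib quirk of passing 32-bit property data as an array of long:

```cpp
#include <X11/Xlib.h>
#include <X11/Xatom.h>
#include <cstdint>
#include <vector>

// Sketch: _NET_WM_ICON is a flat sequence of (width, height, ARGB pixels)
// blocks, one per size variant, so the window manager can pick the best fit.
struct Icon {
	long width, height;
	const uint32_t *argb; // width × height pixels
};

void SetNetWMIcon(Display *display, Window window, const std::vector<Icon>& icons)
{
	std::vector<long> data;
	for(const auto& icon : icons) {
		data.push_back(icon.width);
		data.push_back(icon.height);
		for(long i = 0; i < (icon.width * icon.height); i++) {
			data.push_back(icon.argb[i]);
		}
	}
	XChangeProperty(
		display, window, XInternAtom(display, "_NET_WM_ICON", False),
		XA_CARDINAL, 32, PropModeReplace,
		reinterpret_cast<const unsigned char *>(data.data()),
		static_cast<int>(data.size())
	);
}
```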
But if SDL 2's single window icon function used to be unsupported on Wayland, did SDL 2 apps just not have icons on Wayland before October 2024?
Digging deeper reveals the tragically undocumented SDL_VIDEO_X11_WMCLASS environment variable, which does what we were hoping to find all along. If you set it to the name of your program's Desktop Entry file, the window manager is supposed to locate the file, parse it, read out the Icon value, and perform the usual icon and size lookup. Window class names are a standard property in both X11 and Wayland, and since SDL helpfully falls back on this variable even on Wayland, it will work on both of them.
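One way of doing this from within the program itself is a single setenv() call before video initialization; a sketch, with the entry name below being a placeholder rather than the actual ID used by the port:

```cpp
#include <SDL.h>
#include <cstdlib>

int main(int, char **)
{
	// The 0 keeps any value the user might have already set. The name must
	// match the Desktop Entry file, minus the .desktop extension.
	setenv("SDL_VIDEO_X11_WMCLASS", "org.example.ShuusouGyoku", 0);

	SDL_Init(SDL_INIT_VIDEO);
	// … window creation and the rest of the game …
	SDL_Quit();
	return 0;
}
```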
Or at least it should. Ultimately, it's up to the window manager to actually implement class-derived icons, and sadly, correct support is not as widespread as you would expect.
How would I know this? Because I've tested them all. 🥲 That is, all non-AUR options listed on the Arch Wiki's Desktop environment and Window manager pages that provide something vaguely resembling a desktop you can launch arbitrary programs from:
| WM / DE | Manually transferred pixels | Class-derived icons | Notes |
| --- | --- | --- | --- |
| awesome | ✔️ | | Does not report border sizes back to SDL immediately after window creation |
| Blackbox | | | |
| bspwm | | | No title bars |
| Budgie | ✔️ | ✔️ | Title bars have no icons. Taskbar falls back on the icon from the Desktop Entry file the app was launched with. |
| Cinnamon | ✔️ | ✔️ | Title bars have no icons, but they work fine in the taskbar. Points out the difference between native and Flatpak apps! |
| COSMIC | ✔️ | ✔️ | Title bars have no icons, but they work fine in the taskbar. Points out the difference between native and Flatpak apps! |
| Cutefish | ➖ | | Title bars have no icons. The status bar only seems to support the X11 _NET_WM_ICON property, and not the older XWMHints mechanism used by e.g. xterm. |
| Deepin | | | Did not start |
| Enlightenment | ✔️ | ➖ | Taskbar falls back on the icon from the Desktop Entry file the app was launched with. Only picks the correctly scaled icon variant in about half of the places, and just scales the largest one in the other half. |
| Fluxbox | ✔️ | | |
| GNOME Flashback / Metacity | ✔️ | | Title bars have no icons |
| GNOME | ✔️ | ✔️ | Title bars have no icons |
| GNOME Classic | | | How do you get this running? The variables just start regular GNOME. |
| | | | Taskbar only supports manually transferred icons. Scaling of class-derived icons in title bars is broken. |
| xmonad | | | No title bars |
I tested all window managers, compositors, and/or desktop environments at their latest version as of January 2025 in their default configuration. There were no differences between the X11 and Wayland versions for the ones that offer both.
Yes, you can probably rice title bars and icons onto WMs that don't have them by default. I don't have the time.
That's only 6 out of 33 window managers with a bug-free implementation of class-derived icons, and still 6 out of 28 if we disregard all the tiling window managers where icons are not in scope. If you actually want icons in the title bar, the number drops to just 2, KDE and Pantheon. I'm really impressed by IceWM there though, beating all other similarly old and minimal window managers by shipping with an almost correct implementation.
For now, we'll stay with class-derived icons for budget reasons, but we could add a pixel transfer solution in the future. And that was the 2,000-word story behind this single line of code… 📕
On to packaging then, starting with Arch! Writing my first PKGBUILD was a breeze; as you'd expect from the Arch Wiki, the format and process are very well documented, and the AUR provides tons of examples in case you still need any.
The PKGBUILD guidelines have some opinions about how to handle submodules, but applying them would complicate the PKGBUILD quite a bit while bringing us nowhere close to the 📝 nirvana of shallow and sparse submodules I've scripted earlier. But since PKGBUILDs are just shell scripts that can naturally call other shell scripts, we can just ignore these guidelines, run build.sh, and end up with a simpler PKGBUILD and the intended shorter and less bloated package creation process.
Sadly, PKGBUILDs don't easily support specifying a dependency on either one of two packages, which we would need to codify the font situation. Due to the way the AUR packages both IPAMonaGothic and MS Gothic together with their Mincho and proportional variants, either of them would be Shuusou Gyoku's largest individual dependency. So you'd only want to install one or the other, but probably not both. We could resolve this by editing the PKGBUILDs of both font packages and adding a provides entry for a new and potentially controversial virtual package like ttf-japanese-14-and-16-pixel-bitmap that Shuusou Gyoku could then depend on. But with both of the packages being exclusive to the AUR, this dependency would still be annoying to resolve and you'd have no context about the difference.
Thus, the best we can do is to turn both MS Gothic and IPAMonaGothic into optional dependencies with a short one-line description of the difference, and elaborating on this difference in a comment at the top of the PKGBUILD. Thankfully, the culture around Arch makes this a non-issue because you can reasonably expect people to read your PKGBUILD if they build something from the AUR to begin with. You do always read the PKGBUILD, right?
Flatpak, on the other hand… I'm not at all opposed to the fundamental idea of installing another distro on top of an already existing distro for wider ABI compatibility; heck, Flatpak is basically no different from Wine or WSL in this regard. It's just that this particular ABI-widening distro works in a rather… unnatural way that crosses the border into utter cringe at times.
There are enough rants about Flatpak from a user's perspective out there, criticizing the bloat relative to native packages, the security implications of bundling libraries, and the questionable utility of its sandbox. But something I rarely see people talk about is just how awful Flatpak is from a developer's point of view:
The documentation is written in this weird way that presents Flatpak and its concepts in complete isolation. Without drawing any connections to previous packaging and dependency management systems you might have worked with, it left a lot of my seemingly basic questions unanswered. While it is important to explain your concepts with example code, the lack of a simple and complete reference of the manifest format doesn't exactly inspire confidence in what you're doing. Eventually, I just resorted to cross-checking features in the JSON Schema to get a better idea of what's actually possible.
The ABI-expanding distro part of Flatpak is actually called the Freedesktop platform, a currently 680 MB large stack of typical GUI application libraries updated once a year. It's accompanied by the Freedesktop SDK containing the matching development libraries and tools in another 1.7 GB. As the name implies, this distro is maintained by a separate entity with a homepage that makes the entire thing look deeply self-important and unprofessional. A blurry 25 FPS logo video, a front page full of spelling mistakes, a big focus on sponsors and events… come on, you have one job, and it's compiling and packaging a bunch of open-source libraries. Was this a result of the usual corporate move of creating more departments in order to shift blame and responsibility?
Optics aside, their documentation is even more bizarrely useless. The single bit of actually useful information I was looking for – the concrete list of packages bundled as part of their runtimes, and their versions – is best found by going straight to their code repo.
The manifest of a Flatpak app can be written in your preferred lesser evil of the two most popular markup languages: JSON (slightly ugly for humans and machines alike), or YAML, the underspecified mess that uses syntactically significant whitespace while outlawing the closest thing we have to a semantic indentation character. Oh well, YAML at least supports comments, and we sure sorely need them to justify our bleeding-edge C++ module setup to the Flathub maintainers.
Adding more dependencies on top of the basic runtime can be done by either using runtime extensions or BaseApps. That's two entirely separate concepts that appear to do the same thing on the surface, except that you can only have one BaseApp. The documentation then waffles on and tries to explain both concepts with words that have meaning in isolation but once again answer exactly zero of my questions. Must a BaseApp contain a collection of at least two dependencies or why would anyone ever write the sentence that raises this question? Why do they judge BaseApps to be a "specialized concept" without elaborating, as if to suggest that their audience is too dumb to understand them? Why does a page named Dependencies document extensions as if I wanted to prepare my own package for extension by others? Why be all weird and require "extension points" to be defined when it all just comes down to overlaying another filesystem? Who cares about the special significance of the .Debug, .Locale, and .Sources conventions in the context of dependencies?
In the end, you once again get a clearer idea by simply looking at how existing code uses these concepts. Basically, SDK extensions = build-time dependencies, BaseApps = run-time dependencies, and extension points don't matter at all for our purposes because you can just arbitrarily extend the org.freedesktop.Sdk anyway. 🤷
Speaking of extensions: This exact architectural split between build-time and run-time dependencies is why the org.freedesktop.Sdk.Extension.llvm19 extension packages Clang, but not libc++. When questioned about this omission, one of the maintainers responded with the lamest of excuses: Copying the library would be inconvenient (for them), and something they can't even imagine a use case for. Um, guys? Here's a table. Compare the color of each cell between GCC and Clang. There's your use case.
Thankfully, you can build libc++ without building LLVM as a whole. Seeing how building libc++ takes basically no time at all compared to the rest of LLVM just raises even more questions about not simply providing some kind of script to copy it over.
Speaking of XDG directories, why do they create the .flatpak-builder cache directory in the current working directory and not under $XDG_CACHE_HOME where it belongs?
The modules in a Flatpak work in a similarly layered way as the commands in a Dockerfile, causing edits to a lower layer to evict previous builds of all successive layers from the cache. Any tweaking work in the lower layers therefore suffers from the same disruptive workflow you might already know from Docker, where you constantly shift the layers around to minimize unnecessary rebuilds because there's never an optimal order. Will we ever see container bros move on from layers to a proper build graph of the entire system? The stagnation in this space is saddening.
The --ccache option sort of mitigates the layering by at least caching object files in .flatpak-builder/ccache, which reduces repeated C compilation to a mere file copy from the cache to the package. But not only is this option not enabled by default, it also doesn't appear in any of the flatpak-builder example command lines in the documentation.
Also, it only appears to work with GCC, and setting CCACHE_COMPILERTYPE=clang seems to have no effect. Fortunately, my investment into C++ modules pays off here as well and keeps compile times decently short.
flatpak-builder doesn't validate the manifest schema? Misspelled or misplaced properties just silently do nothing?
Speaking of validation, why does flatpak-builder-lint take 8 seconds to validate a manifest, even if it just consists of a single line? Sure, it's written in Python, but that's an order of magnitude too slow for even that language.
No tab completion for any of the org.flatpak.Builder tools. Sandbox working as designed, I guess 🤷
Git submodule handling. Oh my goodness.
Flatpak recursively clones and checks out all of a repository's submodules. This might be necessary for some codebases, but not for this one: The Linux build doesn't need the SDL submodule, and nothing needs the second miniaudio submodule that the dr_libs use for their testing code. And since these recursive submodules don't opt into shallow clones, you end up with lots of disk space wasted for no reason; 166.1 MiB in our case.
Except that it's actually twice that amount. There's the download cache that persists across multiple flatpak-builder runs, and then there's the temporary directory the build runs in, which gets a full second clone of the entire tree of submodules. This isn't Windows 8, there are no excuses for not using read-only symlinks.
None of this would be too bad if we could just do the same thing we did with Arch, ignore the default or recommended submodule processing, and let our shell script run the show and selectively download and check out the submodules required for the Linux build. But no – the build process of a Flatpak is strictly separated into a download stage and a build stage, and the build stage cannot access the network. Once again, Flatpak would have the option to allow build-time network access, but enabling it would mean no hosting and discoverability on Flathub for you.
I guess it makes sense from a security point of view, as reviewers would only have to audit a fixed set of declaratively specified sources rather than all code run by the build commands? But even this can only ever apply to the initial review. Allowing app developers to push updates independently from the Flathub maintainers is one of Flathub's biggest selling points. Once you're in, you or your supply chain can simply hide the malware in an updated version of a module source. 🤷
Getting Tup to work within the Flatpak build environment is slightly tricky. The build sandbox doesn't provide access to the kernel's FUSE module, which Tup uses to track syscalls by default. Thankfully, Tup also supports syscall tracking via LD_PRELOAD, which allows us to still build Shuusou Gyoku in a parallelized way with a regular Tup binary. Imagine compiling FUSE from source only to make Tup compile, but then having to build the game via a tup generated single-threaded shell script…
One common user complaint about Flatpak is that it allows Windows app developers to stick to their beloved and un-Linux-y way of bundling all dependencies, as if they actually ever enjoyed doing that. In reality, it's not the app authors, but the Flathub maintainers and submission reviewers who do everything in their power to prevent Flathub from turning into a typical package manager. Since they ended up with a system where every new extension to the Freedesktop SDK somehow places a burden on the maintainers, they're quick to shut down everything they consider a bad idea, including a Tup package I submitted. What a great job for people who always wanted to be gatekeepers and arbiters of good ideas. If your system treats CMake as one of two blessed build systems that get first-class support, we already fundamentally disagree on basic questions of good taste.
Because even the build stages of individual modules are sandboxed from each other, the only way to persist a module's build outputs for further modules is by installing them into the same /app/ path that the final application is supposed to live in. Since most of these foundational modules will be libraries, /app/ will be full of C header files, static library files, and library-related tooling that you don't want to bloat your shipped package. Docker solves this with multi-stage builds: After building your app into an image full of all build-time dependencies and other artifacts vomited out by your build system, you can start from a fresh, minimal base image and selectively copy over only the files your app actually needs to run. Flatpak solves this in the opposite way, merely letting you manually clean up after your dependencies in the end. At least they support wildcards…
So you've built your Flatpak, but it has an issue that your native build doesn't have and it's time for some debugging. You open up a shell into the image, fire up gdb… and don't get debug symbols despite your build definitely emitting them. The documentation mentions that debug symbols are placed into a separate package, just like Arch Linux's makepkg does it, but the suggested command line to install them doesn't work:
error: No remote refs found for ‘$FLATPAK_ID’
The apparently correct command line can only be found in third-party blog posts. Pulling the package directly out of the builder cache is as random as it gets for someone not deeply familiar with the system.
Before you publish your package, you might want to inspect the bundle to make sure that your --cleanup entries actually covered all the library bloat you suddenly have to care about. Flatpak also adds a few slight annoyances there:
You could look into the build directory (not the repo directory! Very important difference! 🤪) you pass to flatpak-builder, but it also contains all the debug files and source code.
You could open the --devel shell and inspect the contents of /app/. This shell environment is rather minimal and misses both a lot of typical Linux userland tools and (of course) a package manager, but ls and find work and can do the job.
So if all of Flatpak feels like Docker anyway, why isn't it built on top of Docker to begin with? Instead, we got what amounts to a worse copy that doesn't innovate in any way I can notice. Why throw away compatibility with all of Docker's existing tooling just to gain hash-based deduplication at the file level for a couple of images? How can they seriously use a tagline like "Git for apps", which only makes sense for very, very loose definitions of "Git"?
Or maybe all the innovation went into the portals that make this thing work at all, and have at least this little game work indistinguishably from a native build past the initial load time…
… except when parts of it don't! 🤣 Audio is only supported through PulseAudio, which you might not have installed on Arch Linux. Thus, Flatpak ironically enforces another dependency on the host system that the app itself might not have needed.
Alright, you've submitted your app, incorporated the changes requested by the reviewers, waited a while, and now your app is live and has its own page on Flathub. You'd think I'd be done ranting at this point, but no:
You give them nice lossless PNG screenshots and icons, and they convert both of them to lossy WebP with clearly visible compression artifacts. How about some trust in the fact that people who give you small PNG files know what they're doing? Or at least a programmatic check of whether such a lossy recompression even noticeably improves the file size, instead of blindly blowing up our icon to 4.58× the size of the original PNG. Source-quality images are way more important to me than brand colors.
The screenshot area on the app pages has a fixed height of 468 pixels. Is this some kind of a sick joke? How could anyone look at that height and not go "nah, that looks wrong, 12 more pixels and we'd be VGA-compatible, barely makes a difference anyway"?
That leaves us with two choices:
Crop those 12 pixels out of the raw game screenshots I originally wanted to have there, or
Take the screenshots of the game running inside a decorated window instead.
The latter probably isn't the worst idea as it also gives us a chance to show off the 16×16 variant of the icon at its intended size. But I sure didn't immediately find a KDE theme that both has 16-pixel window icons (unlike Breeze's 15 pixels at the Small size) and doesn't have obscenely large and asymmetric shadows (unlike Materia or Klassy). Shoutout to the Arc theme for matching all these constraints!
Might as well try converting these images to lossless WebP while I'm at it, in the hope that they then leave them alone… but nope, they still get lossily recompressed! 🤪 You know what, I'm not gonna bother with the rest of their guidelines, this is an embarrassment.
Finally, game controller support comes with a very similar asterisk. By default, it's disabled just like any other piece of hardware, and the documentation tells you to specify --device=input to activate it. However, this specific permission is a fairly recent development in Flatpak terms and thus isn't widely available yet? Therefore, the reviewers don't yet allow it in manifests, and your only alternative is a blanket permission for all devices in the user's system. But then, Flathub lists your app as having potentially unsafe user device (and even webcam!) access, even though you had no alternative except for disabling game controller support. What a nice sandbox they have there… 🙄
If that's the supposed future of shipping programs on Linux, they've sure made this dev look back into the past with newfound fondness. I'm now more motivated than ever to separately package Shuusou Gyoku for every distribution, if only to see whether there's just a single distro out there whose packaging system is worse than Flatpak. But then again, packaging this game for other distros is one of the most obvious contribution-ideas there is.
In the end though, the fact that we need to patch Pango to correctly render MS Gothic means that there is a point to shipping Shuusou Gyoku as a Flatpak, beyond just having a single package that works on every distro. And with a download size of 3.4 MiB and an installed size of 6.4 MiB, Shuusou Gyoku almost exemplifies the ideal use case of Flatpak: Apart from miniaudio, BLAKE3, the IPAMonaGothic font, the temporary libc++, and the patched Pango, all other dependencies of the Linux port happen to be part of the Freedesktop runtime and don't add more bloat to the system.
And so, we finally have a 100% native Linux port of Shuusou Gyoku, working and packaged, after 36 pushes! 🎉 But as usual, there's always that last bit of optional work left. The three biggest remaining portability gaps are
guaranteed support for ARM CPUs, where the project currently fails to build on Flathub due to a Tup issue, and who knows what other issues there might be,
Despite 📝 spending 10 pushes on accurate waveform BGM, MIDI support seems to be the most worthwhile feature out of the three. The whole point of the BGM work was that Linux doesn't have a native MIDI synth, so why should packagers or even the users themselves jump through the hoops of setting up some kind of softsynth if it most likely won't sound remotely close to an SC-88Pro? But if you already did, the lack of support might indeed seem unexpected.
But as described in the issue, MIDI support can also mean "a Windows-like plug-and-play" experience, without downloading a BGM pack. Despite the resulting unauthentic sound, this might also be a worthwhile thing to fund if we consider that 14 of the 17 YouTube channels that have uploaded Shuusou Gyoku videos since P0275 still had MIDI playing through the Microsoft GS Wavetable Synth and didn't bother to set up a BGM pack.
Finally, we might want to patch IPAMonaGothic at some point down the line. While a fix for the ascent and descent values that achieves perfect glyph placement without relying on hinting hacks would merely be nice to have, matching the Unicode coverage of its embedded bitmaps with MS Gothic will be crucial for non-ASCII Latin script translations. IPAMonaGothic's outlines do cover the entire Latin-1 Supplement block, but the font is missing embedded bitmaps for all of this block's small letters. Since the existing outlines prevent any glyph fallback in both Fontconfig and GDI, letters like ä, ö, ü, and ñ currently render as spaces.
Like most Japanese fonts from the Shift-JIS era, IPAMonaGothic also suffers from Greek and Cyrillic glyphs being full-width. For translations into those scripts, we'd probably just hunt for a different font, but it's not worth doing that for Latin scripts that are only missing a few special characters.
Ideally, I'd like to apply these edits by modifying the embedded bitmaps in a more controlled, documented, and diffable way and then recompiling the font using a pipeline of some sort. The whole field of fonts often feels impenetrable because the usual editing workflow involves throwing a binary file into a bulky GUI tool and writing out a new binary file, and it doesn't have to be this way. But it looks like I'd have to write key parts of that pipeline myself:
The venerable ttx provides no comfort features for embedded bitmaps and simply dumps their binary representation as hex strings.
The more modern UFO format does specify embedded images, but both of the biggest implementations (defcon and ufoLib2) just throw away any embedded bitmaps, and thus, the whole selling point of such tools.
That would increase the price of translations by about one extra push if you all agree that this is a good idea. If not, then we just go for the usual way of patching the .ttf file after all. In any case, we then get to host the edited font at a much nicer place than the Wayback Machine.
P0002
Build system improvements, part 2 (Preparations / Codebase cleanup)
P0003
Build system improvements, part 3 (Lua rewrite of the Tupfile / Tup bugfixes for MS-DOS Player)
P0004
Build system improvements, part 4 (Merging the 16-bit build part into the Tupfile)
P0281
Build system improvements, part 5 (MS-DOS Player bugfixes and performance tuning for Turbo C++ 4.0J)
P0282
Build system improvements, part 6 (Generating an ideal dumb batch script for 32-bit platforms)
P0283
Build system improvements, part 7 (Researching and working around Windows 9x batch file limits)
P0284
#include cleanup, part 1/2 / Decompilation (TH04/TH05 .REC loading)
P0285
#include cleanup, part 2/2 / Decompilation (TH02 MAIN.EXE High Score entry)
💰 Funded by:
GhostPhanom, [Anonymous], Blue Bolt, Yanga
I'm 13 days late, but 🎉 ReC98 is now 10 years old! 🎉 On June 26, 2014, I first tried exporting IDA's disassembly of TH05's OP.EXE and reassembling and linking the resulting file back into a binary, and was amazed that it actually yielded an identical binary. Now, this doesn't actually mean that I've spent 10 years working on this project; priorities have been shifting and continue to shift, and time-consuming mistakes were certainly made. Still, it's a good occasion to finally fully realize the good future for ReC98 that GhostPhanom invested in with the very first financial contribution back in 2018, deliver the last three of the first four reserved pushes, cross another piece of time-consuming maintenance off the list, and prepare the build process for hopefully the next 10 years.
But why did it take 8 pushes and over two months to restore feature parity with the old system? 🥲
The original plan for ReC98's good future was quite different from what I ended up shipping here. Before I started writing the code for this website in August 2019, I focused on feature-completing the experimental 16-bit DOS build system for Borland compilers that I'd been developing since 2018, and which would form the foundation of my internal development work in the following years. Eventually, I wanted to polish and publicly release this system as soon as people stopped throwing money at me. But as of November 2019, just one month after launch, the store kept selling out with everyone investing into all the flashier goals, so that release never happened.
In theory, this build system remains the optimal way of developing with old Borland compilers on a real PC-98 (or any other 32-bit single-core system) and outside of Borland's IDE, even after the changes introduced by this delivery. In practice though, you're soon going to realize that there are lots of issues I'd have to revisit in case any PC-98 homebrew developers are interested in funding me to finish and release this tool…
The main idea behind the system still has its charm: Your build script is a regular C++ program that #includes the build system as a static library and passes fixed structures with names of source files and build flags. By employing static structure initialization, even a 1994 Turbo C++ would let you define the whole build at compile time, although this certainly requires some dank preprocessor magic to remain anywhere near readable at ReC98 scale. 🪄 While this system does require a bootstrapping process, the resulting binary can then use the same dependency-checking mechanisms to recompile and overwrite itself if you change the C++ build code later. Since DOS simply loads an entire binary into RAM before executing it, there is no file lock to worry about, and overwriting the originating binary is something you can just do.
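To make this more concrete, here's a minimal sketch of what such a statically initialized build definition could look like. The structure layout, flags, and source file names are all made up for illustration and don't reflect the actual API of that system:

struct BuildTarget {
    const char *binary;      // output executable
    const char *flags;       // one shared set of compiler flags per target…
    const char *sources[8];  // …for this target's translation units
};

// The entire build, defined at compile time through aggregate initialization:
static const BuildTarget BUILD[] = {
    { "OP.EXE",      "-ml", { "op.cpp", "menu.cpp" } },
    { "REIIDEN.EXE", "-ml", { "main.cpp", "player.cpp", "boss.cpp" } },
};

// main() would then hand BUILD to the #included build-system library, which
// runs the dependency checks and spawns the necessary TCC and TLINK processes.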
Later on, the system also made use of batched compilation: By passing more than one source file to TCC.EXE, you get to avoid TCC's quite noticeable startup times, thus speeding up the build proportional to the number of translation units in each batch. Of course, this requires that every passed source file is supposed to be compiled with the same set of command-line flags, but that's a generally good complexity-reducing guideline to follow in a build script. I went even further and enforced this guideline in the system itself, thus truly making per-file compiler command line switches considered harmful. Thanks to Turbo C++'s #pragma option, changing the command line isn't even necessary for the few unfortunate cases where parts of ZUN's code were compiled with inconsistent flags.
I combined all these ideas with a general approach of "targeting DOSBox": By maximizing DOS syscalls and minimizing algorithms and data structures, we spend as much time as possible in DOSBox's native-code DOS implementation, which should give us a performance advantage over DOS-native implementations of MAKE that typically follow the opposite approach.
Of course, all this only matters if the system is correct and reliable at its core. Tup teaches us that it's fundamentally impossible to have a reliable generic build system without
augmenting the build graph with all actual files read and written by each invoked build tool, which involves tracing all file-related syscalls, and
persistently serializing the full build graph every time the system runs, allowing later runs to detect every possible kind of change in the build script and rebuild or clean up accordingly.
Unfortunately, the design limitations of my system only allowed half-baked attempts at solving both of these prerequisites:
If your build system is not supposed to be generic and only intended to work with specific tools that emit reliable dependency information, you can replace syscall tracing with a parser for those specific formats. This is what my build system was doing, reading dependency information out of each .OBJ file's OMF COMENT record.
Since DOS command lines are limited to 127 bytes, DOS compilers support reading additional arguments from response files, typically indicated with an @ next to their path on the command line. If we now put every parameter passed to TCC or TLINK into a response file and leave these files on disk afterward, we've effectively serialized all command-line arguments of the entire build into a makeshift database. In later builds, the system can then detect changed command-line arguments by comparing the existing response files from the previous run with the new contents it would write based on the current build structures. This way, we still only recompile the parts of the codebase that are affected by the changed arguments, which is fundamentally impossible with Makefiles. A minimal sketch of this comparison follows below.
But this strategy only covers changes within each binary's compile or link arguments, and ignores the required deletions in "the database" when removing binaries between build runs. This is a non-issue as long as we keep decompiling on master, but as soon as we switch between master and similarly old commits on the debloated/anniversary branches, we can get very confusing errors:
The symptom is a calling convention mismatch: The two vector functions use __cdecl on master and pascal on debloated/anniversary. We've switched from anniversary (which compiles to ANNIV.EXE) back to master (which compiles to REIIDEN.EXE) here, so the .obj file on disk still uses the pascal calling convention. The build system, however, only checks the response files associated with the current target binary (REIIDEN.EXE) and therefore assumes that the .obj files still reflect the (unchanged) command-line flags in the TCC response file associated with this binary. And if none of the inputs of these .obj files changed between the two branches, they aren't rebuilt after switching, even though they would need to be.
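For concreteness, the response file comparison described two paragraphs above boils down to something like this helper. It's a hypothetical modern-C++ sketch, not the original build system's code:

#include <fstream>
#include <sstream>
#include <string>

// Compares the response file left on disk by the previous run against the
// arguments that the current build structures would produce. Returns true
// (and rewrites the file) if they differ or if no response file existed yet,
// i.e. if the associated build step has to be re-run for this reason.
bool response_file_changed(const std::string& path, const std::string& new_args)
{
    std::ifstream old_file(path, std::ios::binary);
    std::ostringstream old_args;
    old_args << old_file.rdbuf();
    if (old_file && (old_args.str() == new_args)) {
        return false; // arguments unchanged
    }
    std::ofstream(path, std::ios::binary) << new_args;
    return true;
}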
Apparently, there's also such a thing as "too much batching", because TCC would suddenly stop applying certain compiler optimizations at very specific places if too many files were compiled within a single process? At least you quickly remember which source files you then need to manually touch and recompile to make the binaries match ZUN's original ones again…
But the final nail in the coffin was something I'd notice on every single build: 5 years down the line, even the performance argument wasn't convincing anymore. The strategy of minimizing emulated code still left me with an 𝑂(𝑛) algorithm, and with this entire thing still being single-threaded, there was no force to counteract the dependency check times as they grew linearly with the number of source files.
At P0280, each build run would perform a total of 28,130 file-related DOS syscalls to figure out which source files have changed and need to be rebuilt. At some point, this was bound to become noticeable even despite these syscalls being native, not to mention that they're still surrounded by emulator code that must convert their parameters and results to and from the DOS ABI. And with the increasing delays before TCC would do its actual work, the entire thing started feeling increasingly jankier.
While this system was waiting to be eventually finished, the public master branch kept using the Makefile that dates back to early 2015. Back then, it didn't take long for me to abandon raw dumb batch files because Make was simply the most straightforward way of ensuring that the build process would abort on the first compile error.
The following years also proved that Makefile syntax is quite well-suited for expressing the build rules of a codebase at this scale. The built-in support for automatically turning long commands into response files was especially helpful because of how naturally it works together with batched compilation. Both of these advantages culminate in this wonderfully arcane incantation of ASCII special characters and syntactically significant linebreaks:
tcc … @&&|
$**
|
Which translates to "take the filenames of all dependents of this explicit rule, write them into a temporary file with an autogenerated name, insert this filename into the tcc … @ command line, and delete the file after the command finished executing". The @ is part of TCC's command-line interface, the rest is all MAKE syntax.
But 📝 as we all know by now, these surface-level niceties change nothing about Makefiles inherently being unreliable trash due to implementing none of the aforementioned two essential properties of a generic build system. Borland got so close to a correct and reliable implementation of autodependencies, but that would have just covered one of the two properties. Due to this unreliability, the old build16b.bat called Borland's MAKER.EXE with the -B flag, recompiling everything all the time. Not only did this leave modders with a much worse build process than I was using internally, but it also eventually got old for me to merge my internal branch onto master before every delivery. Let's finally rectify that and work towards a single good build process for everyone.
As you would expect by now, I've once again migrated to Tup's Lua syntax. Rewriting it all makes you realize once again how complex the PC-98 Touhou build process is: It has to cover 2 programming languages, 2 pipeline steps, and 3 third-party libraries, and currently generates a total of 39 executables, including the small programs I wrote for research. The final Lua code comprises over 1,300 lines – but then again, if I had written it in 📝 Zig, it would certainly be as long or even longer due to manual memory management. The Tup building blocks I constructed for Shuusou Gyoku quickly turned out to be the wrong abstraction for a project that has no debug builds, but their 📝 basic idea of a branching tree of command-line options remained at the foundation of this script as well.
This rewrite also provided an excellent opportunity for finally dumping all the intermediate compilation outputs into a separate dedicated obj/ subdirectory, leaving bin/ nice and clean with only the final executables. I've also merged this new system into most of the public branches of the GitHub repo.
As soon as I first tried to build it all though, I was greeted with a particularly nasty Tup bug. Due to how DOS specified file metadata mutation, MS-DOS Player has to open every file in a way that current Tup treats as a write access… but since unannotated file writes introduce the risk of a malformed build graph if these files are read by another build command later on, Tup providently deletes these files after the command finished executing. And by these files, I mean TCC.EXE as well as every one of its C library header files opened during compilation.
Due to a minor unsolved question about a failing test case, my fix has not been merged yet. But even if it was, we're now faced with a problem: If you previously chose to set up Tup for ReC98 or 📝 Shuusou Gyoku and are maybe still running 📝 my 32-bit build from September 2020, running the new build.bat would in fact delete the most important files of your Turbo C++ 4.0J installation, forcing you to reinstall it or restore it from a backup. So what do we do?
Should my custom build get a special version number so that the surrounding batch file can fail if the version number of your installed Tup is lower?
Or do I just put a message somewhere, which some people invariably won't read?
The easiest solution, however, is to just put a fixed Tup binary directly into the ReC98 repo. This not only allows me to make Tup mandatory for 64-bit builds, but also cuts out one step in the build environment setup that at least one person previously complained about. *nix users might not like this idea all too much (or do they?), but then again, TASM32 and the Windows-exclusive MS-DOS Player require Wine anyway. Running Tup through Wine as well means that there's only one PATH to worry about, and you get to take advantage of the tool checks in the surrounding batch file.
If you're one of those people who doesn't trust binaries in Git repos, the repo also links to instructions for building this binary yourself. Replicating this specific optimized binary is slightly more involved than the classic ./configure && make && make install trinity, so having these instructions is a good idea regardless of the fact that Tup's GPL license requires it.
One particularly interesting aspect of the Lua code is the way it handles sprite dependencies:
If build commands read from files that were created by other build commands, Tup requires these input dependencies to be spelled out so that it can arrange the build graph and parallelize the build correctly. We could simply put every sprite into a single array and automatically pass that as an extra input to every source file, but that would effectively split the build into a "sprite convert" and "code compile" phase. Spelling out every individual dependency allows such source files to be compiled as soon as possible, before (and in parallel to) the rest of the sprites they don't depend on. Similarly, code files without sprite dependencies can compile before the first sprite got converted, or even before the sprite converter itself got compiled and linked, maximizing the throughput of the overall build process.
Running a 30-year-old DOS toolchain in a parallel build system also introduces new issues, though. The easiest and recommended way of compiling and linking a program in Turbo C++ is a single tcc invocation:
tcc … main.cpp utils.cpp master.lib
This performs a batched compilation of main.cpp and utils.cpp within a single TCC process, and then launches TLINK to link the resulting .obj files into main.exe, together with the C++ runtime library and any needed objects from master.lib. The linking step works by TCC generating a TLINK command line and writing it into a response file with the fixed name turboc.$ln… which obviously can't work in a parallel build where multiple TCC processes will want to link different executables via the same response file.
Therefore, we have to launch TLINK with a custom response file ourselves. This file is echo'd as a separate parallel build rule, and the Lua code that constructs its contents has to replicate TCC's logic for picking the correct C++ runtime .lib file for the selected memory model.
The response file for TH02's ZUN_RES.COM, consisting of the C++ standard library, two files of ZUN code, and master.lib.
While this does add more string formatting logic, not relying on TCC to launch TLINK actually removes the one possible PATH-related error case I previously documented in the README. Back in 2021 when I first stumbled over the issue, it took a few hours of RE to figure this out. I don't want these hours to go to waste, so here's a Gist, and here's the text replicated for SEO reasons:
Issue: TCC compiles, but fails to link, with Unable to execute command 'tlink.exe'
Cause: This happens when invoking TCC as a compiler+linker, without the -c flag. To locate TLINK, TCC needlessly copies the PATH environment variable into a statically allocated 128-byte buffer. It then constructs absolute tlink.exe filenames for each of the semicolon- or \0-terminated paths, writing these into a buffer that immediately follows the 128-byte PATH buffer in memory. The search is finished as soon as TCC finds an existing file, which gives precedence to earlier paths in the PATH. If the search didn't complete until a potential "final" path that runs past the 128 bytes, the final attempted filename will consist of the part that still managed to fit into the buffer, followed by the previously attempted path.
Workaround: Make sure that the BIN\ path to Turbo C++ is fully contained within the first 127 bytes of the PATH inside your DOS system. (The 128th byte must either be a separating ; or the terminating \0 of the PATH string.)
Now that DOS emulation is an integral component of the single-part build process, it even makes sense to compile our pipeline tools as 16-bit DOS executables and then emulate them as part of the build. Sure, it's technically slower, but realistically it doesn't matter: Our only current pipeline tools are 📝 the converter for hardcoded sprites and the 📝 ZUN.COM generators, both of which involve very little code and are rarely run during regular development after the initial full build. In return, we get to drop that awkward dependency on the separate Borland C++ 5.5 compiler for Windows and yet another additional manual setup step. 🗑️ Once PC-98 Touhou becomes portable, we're probably going to require a modern compiler anyway, so you can now delete that one as well.
That gives us perfect dependency tracking and minimal parallel rebuilds across the whole codebase! While MS-DOS Player is noticeably slower than DOSBox-X, it's not going to matter all too much; unless you change one of the more central header files, you're rarely if ever going to cause a full rebuild. Then again, given that I'm going to use this setup for at least a couple of years, it's worth taking a closer look at why exactly the compilation performance is so underwhelming …
On the surface, MS-DOS Player seems like the right tool for our job, with a lot of advantages over DOSBox:
It doesn't spawn a window that boots an entire emulated PC, but is instead
perfectly integrated into the Windows console. Using it in a modern developer console would allow you to click on a compile error and have your editor immediately open the relevant file and jump to that specific line! With DOSBox, this basic comfort feature was previously unthinkable.
Heck, Takeda Toshiya originally developed it to run the equally vintage LSI C-86 compiler on 64-bit Windows. Fixing any potential issues we'd run into would be well within the scope of the project.
It consists of just a single comparatively small binary that we could just drop into the ReC98 repo. No manual setup steps required.
But once I began integrating it, I quickly noticed two glaring flaws:
Back in 2009, Takeda Toshiya chose to start the project by writing a custom DOS implementation from scratch. He was aware of DOSBox, but only adapted small tricky parts of its source code rather than starting with the DOSBox codebase and ripping out everything he didn't need. This matches the more research-oriented nature that all of his projects appear to follow, where the primary goal of writing the code is a personal understanding of the problem domain rather than a widely usable piece of software. MS-DOS Player is even the outlier in this regard, with Takeda Toshiya describing it as 珍しく実用的かもしれません ("this one might, for once, actually be practical"). I am definitely sympathetic to this mindset; heck, my old internal build system falls under this category too, being so specialized and narrow that it made little sense to use it outside of ReC98. But when you apply it to emulators for niche systems, you end up with exactly the current PC-98 emulation scene, where there's no single universally good emulator because all of them have some inaccuracy somewhere. This scene is too small for you not to eventually become part of someone else's supply chain… 🥲
Emulating DOS is a particularly poor fit for a research/NIH project because it's Hyrum's Law incarnate. With the lack of memory protection in Real Mode, programs could freely access internal DOS (and even BIOS) data structures if they only knew where to look, and frequently did. It might look as if "DOS command-line tools" just equals x86 plus INT 21h, but soon you'll also be emulating the BIOS, PIC, PIT, EMS, XMS, and probably a few more things, all with their individual quirks that some application out there relies on. DOSBox simply had much more time to grow and mature and figure out all of these details by trial and error. If you start a DOS emulator from scratch, you're bound to duplicate all this research as people want to use your emulator to run more and more programs, until you've ended up with what's effectively a clone of DOSBox's exact logic. Unless, of course, you draw a line somewhere and limit the scope of the DOS and BIOS emulation. But given how many people have wanted to use MS-DOS Player for running DOS TUIs in arbitrarily sized terminal windows with arbitrary fonts, that's not what happened. I guess it made sense for this use case before DOSBox-X gained a TTF output mode in late 2020?
As usual, I wouldn't mention this if I didn't run into two bugs when combining MS-DOS Player with Turbo C++ and Tup. Both of these originated from workarounds for inaccuracies in the DOS emulation that date back to MS-DOS Player's initial release and were thankfully no longer necessary with the accuracy improvements implemented in the years since.
For CPU emulation, MS-DOS Player can use either MAME's or Neko Project 21/W's x86 core, both of which are interpreters and won't win any performance contests. The NP21/W core is significantly better optimized and runs ≈41% faster, but still pales in comparison to DOSBox-X's dynamic recompiler. Running the same sequential commands that the P0280 Makefile would execute, the upstream 2024-03-02 NP21/W core build of MS-DOS Player would take to compile the entire ReC98 codebase on my system, whereas DOSBox-X's dynamic core manages the same in , or 94% faster.
Granted, even the DOSBox-X performance is much slower than we would like it to be. Most of it can be blamed on the awkward time in the early-to-mid-90s when Turbo C++ 4.0J came out. This was the time when DOS applications had long grown past the limitations of the x86 Real Mode and required DOS extenders or even sillier hacks to actually use all the RAM in a typical system of that period, but Win32 didn't exist yet to put developers out of this misery. As such, this compiler not only requires at least a 386 CPU, but also brings its own DOS extender (DPMI16BI.OVL) plus a loader for said extender (RTM.EXE), both of which need to be emulated alongside the compiler, to the great annoyance of emulator maintainers 30 years later. Even MS-DOS Player's README file notes how Protected Mode adds a lot of complexity and slowdown:
8086 binaries are much faster than 80286/80386/80486/Pentium4/IA32 binaries.
If you don't need the protected mode or new mnemonics added after 80286,
I recommend i86_x86 or i86_x64 binary.
The immediate reaction to these performance numbers is obvious: Let's just put DOSBox-X's dynamic recompiler into MS-DOS Player, right?! 🙌 Except that once you look at DOSBox-X, you immediately get why Takeda Toshiya might have preferred to start from scratch. Its codebase is a historically grown tangled mess, requiring intimate familiarity and a significant engineering effort to isolate the dynamic core in the first place. I did spend a few days trying to untangle and copy it all over into MS-DOS Player… only to be greeted with an infinite loop as soon as everything compiled for the first time. 😶 Yeah, no, that's bound to turn into a budget-exceeding maintenance nightmare.
Instead, let's look at squeezing at least some additional performance out of what we already have. A generic emulator for the entire CISCy instruction set of the 80386, with complete support for Protected Mode, but it's only supposed to run the subset of instructions and features used by a specific compiler and linker as fast as possible… wait a moment, that sounds like a use case for profile-guided optimization! This is the first time I've encountered a situation that would justify the required 2-phase build process and lengthy profile collection – after all, writing into some sort of database for every function call does slow down MS-DOS Player by roughly 15×. However, profiling just the compilation of our most complex translation unit (📝 TH01 YuugenMagan) and the linking of our largest executable (TH01's REIIDEN.EXE) should be representative enough.
I'll get to the performance numbers later, but even the build output is quite intriguing. Based on this profile, Visual Studio chooses to optimize only 104 out of MS-DOS Player's 1976 functions for speed and the rest for size, shaving off a nice 109 KiB from the binary. Presumably, keeping rare code small is also considered kind of fast these days because it takes up less space in your CPU's instruction cache once it does get executed?
With PGO as our foundation, let's run a performance profile and see if there are any further code-level optimizations worth trying out:
Removing redundant memset() calls: MS-DOS Player is written in a very C-like style of C++, and initializes a bunch of its statically allocated data by memset()ing it with 00 bytes at startup. This is strictly redundant even in C; Section 6.7.9/10 of the C standard mandates that all static data is zero-initialized by default. In turn, the program loaders of modern operating systems employ all sorts of paging tricks to reduce the CPU cost (and actual RAM usage!) of this initialization as much as possible. If you manually memset() afterward, you throw all these advantages out of the window.
Of course, these calls would only ever show up among the top CPU consumers in a performance profile if a program uses a large amount of static data, but the hardcoded 32 MiB of emulated RAM in ≥i386-supporting builds definitely qualifies. Zeroing 32.8 MiB of memory makes up a significant chunk of the runtime of some of the shorter build steps and quickly adds up; a full rebuild of the ReC98 codebase currently spawns a total of 361 MS-DOS Player instances, totaling 11.5 GiB of needless memory writes.
Limiting the emulated instruction set: NP21/W's x86 core emulates everything up to the SSE3 extension from 2004, but Turbo C++ 4.0J's x86 instruction set usage doesn't stretch past the 386. It doesn't even need the x87 FPU for compiling code that involves floating-point constants. Disabling all these unneeded extensions speeds up x86's infamously annoying instruction decoding, and also reduces the size of the MS-DOS Player binary by another 149.5 KiB. The source code already had macros for this purpose, and only needed a slight fix for the code to compile with these macros disabled.
Removing x86 paging: Borland's DOS extender uses segmented memory addressing even in Protected Mode. This allows us to remove the MMU emulation and the corresponding "are we paging" check for every memory access.
Removing cycle counting: When emulating a whole system, counting the cycles of each instruction is important for accurately synchronizing the CPU with other pieces of hardware. As hinted above, MS-DOS Player does emulate and periodically update a few pieces of hardware outside the CPU, but we need none of them for a build tool.
Testing Takeda Toshiya's optimizations: In a nice turn of events, Takeda Toshiya merged every single one of my bugfixes and optimization flags into his upstream codebase. He even agreed with my memset() and cycle counting removal optimizations, which are now part of all upstream builds as of 2024-06-24. For the 2024-06-27 build, he claims to have gone even further than my more minimal optimization, so let's see how these additional changes affect our build process.
Further risky optimizations: A lot of the remaining slowness of x86 emulation comes from the segmentation and protection fault checks required for every memory access. If we assume that the emulator only ever executes correct code, we can remove these checks and implement further shortcuts based on their absence.
The L[DEFGS]S group of instructions that load a segment and offset register from a 32-bit far pointer, for example, are both frequently used in Turbo C++ 4.0J code and particularly expensive to emulate. Intel specified their Real Mode operation as loading the segment and offset part in two separate 16-bit reads. But if we assume that neither of those reads can fault, we can compress them into a single 32-bit read and thus only perform the costly address translation once rather than twice. Emulator authors are probably rolling their eyes at this gross violation of Intel documentation now, but it's at least worth a try to see just how much performance we could get out of it.
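For illustration, the shortcut boils down to the difference between these two functions. The guest memory model here is a heavily simplified hypothetical stand-in, not the actual NP21/W code:

#include <cstdint>
#include <cstring>

// Hypothetical, heavily simplified stand-in for guest RAM and address translation.
static uint8_t  guest_ram[1 << 20];
static uint32_t translate(uint32_t addr) { return addr; } // limit/fault checks would live here

static uint16_t read16(uint32_t addr) { uint16_t v; std::memcpy(&v, guest_ram + translate(addr), 2); return v; }
static uint32_t read32(uint32_t addr) { uint32_t v; std::memcpy(&v, guest_ram + translate(addr), 4); return v; }

struct FarPointer { uint16_t offset, segment; };

// As documented by Intel: two 16-bit reads, and therefore two address translations.
FarPointer load_far_pointer_exact(uint32_t addr)
{
    return { read16(addr), read16(addr + 2) };
}

// The shortcut: a single 32-bit read and a single translation, split up on the
// host side. Only valid under the risky assumption that neither half can fault.
FarPointer load_far_pointer_fast(uint32_t addr)
{
    const uint32_t both = read32(addr);
    return { uint16_t(both & 0xFFFF), uint16_t(both >> 16) };
}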
Measured on a 6-year-old 6-core Intel Core i5 8400T on Windows 11. The first number in each column represents the codebase before the #include cleanup explained below, and the second one corresponds to this commit. All builds are 64-bit, 32-bit builds were ≈5% slower across the board. I kept the fastest run within three attempts; as Tup parallelizes the build process across all CPU cores, it's common for the long-running full build to take up to a few seconds longer depending on what else is running on your system. Tup's standard output is also redirected to a file here; its regular terminal output and nice progress bar will add more slowdown on top.
The key takeaways:
By merely disabling certain x86 features from MS-DOS Player and retaining the accuracy of the remaining emulation, we get speedups of ≈60% (full build), ≈70% (median TU), and ≈80% (largest TU).
≈25% (full build), ≈29% (median TU), and ≈41% (largest TU) of this speedup came from Visual Studio's profile-guided optimization, with no changes to the MS-DOS Player codebase.
The effects of removing cycle counting are the biggest surprise. Between ≈17% and ≈23%, just for removing one subtraction per emulated instruction? Turns out that in the absence of a "target cycle amount" setting, the x86 emulation loop previously ran for only a single cycle. This caused the PIC check to run after every instruction, followed by PIT, serial I/O, keyboard, mouse, and CRTC update code every millisecond. Without cycle counting, the x86 loop actually keeps running until a CPU exception is raised or the emulated process terminates, skipping the hardware code during the vast majority of the program's execution time.
While Takeda Toshiya's changes in the 2024-06-27 build completely throw out the cycle counter and clean up process termination, they also reintroduce the hardware updates that made up the majority of the cycle removal speedup. This explains the results we're getting: The small speedup for full rebuilds is too insignificant to bother with and might even fall within a statistical margin of error, but the build slows down more and more the longer the emulated process runs. Compiling and linking YuugenMagan takes a whole 14% longer on generic builds, and ≈9-12% longer on PGO builds. I did another in-between test that just removed the x86 loop from the cycle removal version, and got exactly the same numbers. This just goes to show how much removing two writes to a fixed memory address per emulated instruction actually matters. Let's not merge back this one, and stay on top of 2024-06-24 for the time being.
The risky optimizations of ignoring segment limits and speeding up 32-bit segment+offset pointer load instructions could yield a further speedup. However, most of these changes boil down to removing branches that would never be taken when emulating correct x86 code. Consequently, these branches get recorded as unlikely during PGO training, which then causes the profile-guided rebuild to rearrange the instructions on these branches in a way that favors the common case, leaving the rest of their effective removal to your CPU's branch predictor. As such, the 10%-15% speedup we can observe in generic builds collapses down to 2%-6% in PGO builds. At this rate and with these absolute durations, it's not worth it to maintain what's strictly a more inaccurate fork of Neko Project 21/W's x86 core.
The redundant header inclusions afforded by #include guards do in fact have a measurable performance cost on Turbo C++ 4.0J, slowing down compile times by 5%.
But how does this compare to DOSBox-X's dynamic core? Dynamic recompilers need some kind of cache to ensure that every block of original ASM gets recompiled only once, which gives them an advantage in long-running processes after the initial warmup. As a result, DOSBox-X compiles and links YuugenMagan in , ≈92% faster than even our optimized MS-DOS Player build. That percentage resembles the slowdown we were initially getting when comparing full rebuilds between DOSBox-X and MS-DOS Player, as if we hadn't optimized anything.
On paper, this would mean that DOSBox-X barely lost any of its huge advantage when it comes to single-threaded compile+link performance. In practice, though, this metric is supposed to measure a typical decompilation or modding workflow that focuses on repeatedly editing a single file. Thus, a more appropriate comparison would also have to add the aforementioned constant 28,130 syscalls that my old build system required to detect that this is the one file/binary that needs to be recompiled/relinked. The video at the top of this blog post happens to capture the best time () I got for the detection process on DOSBox-X. This is almost as slow as the compilation and linking itself, and would have only gotten slower as we continue decompiling the rest of the games. Tup, on the other hand, performs its filesystem scan in a near-constant , matching the claim in Section 4.7 of its paper, and thus shrinking the performance difference to ≈14% after all. Sure, merging the dynamic core would have been even better (contribution-ideas, anyone?), but this is good enough for now.
Just like with Tup, I've also placed this optimized binary directly into the ReC98 repo and added the specific build instructions to the GitHub release page.
I do have more far-reaching ideas for further optimizing Neko Project 21/W's x86 core for this specific case of repeated switches between Real Mode and Protected Mode while still retaining the interpreted nature of this core, but these already strained the budget enough.
The perhaps more important remaining bottleneck, however, is hiding in the actual DOS emulation. Right now, a Tup-driven full rebuild spawns a total of 361 MS-DOS Player processes, which means that we're booting an emulated DOS 361 times. This isn't as bad as it sounds, as "booting DOS" basically just involves initializing a bunch of internal DOS structures in conventional memory to meaningful values. However, these structures also include a few environment variables like PATH, APPEND, or TEMP/TMP, which MS-DOS Player seamlessly integrates by translating them from their value on the Windows host system to the DOS 8.3 format. This could be one of the main reasons why MS-DOS Player is a native Windows program rather than being cross-platform:
On Windows, this path translation is as simple as calling GetShortPathNameA(), which returns a unique 8.3 name for every component along the path.
Also, drive letters are an integral part of the DOS INT 21h API, and Windows still uses them as well.
However, the NT kernel doesn't actually use drive letters, and views them as just a legacy abstraction over its reality of volume GUIDs. Converting paths back and forth between these two views therefore requires it to communicate with a
mount point manager service, which can coincidentally also be observed in debug builds of Tup.
As a result, calling any path-retrieving API is a surprisingly expensive operation on modern Windows. When running a small sprite through our 📝 sprite converter, MS-DOS Player's boot process makes up 56% of the runtime, with 64% of that boot time (or 36% of the entire runtime) being spent on path translation. The actual x86 emulation to run the program only takes up 6.5% of the runtime, with the remaining 37.5% spent on initializing the multithreaded C++ runtime.
But then again, the truly optimal solution would not involve MS-DOS Player at all. If you followed general video game hacking news in May, you'll probably remember the N64 community putting the concept of statically recompiled game ports on the map. In case you're wondering where this seemingly sudden innovation came from and whether a reverse-engineered decompilation project like ReC98 is obsolete now, I wrote a new FAQ entry about why this hype, although justified, is at least in part misguided. tl;dr: None of this can be meaningfully applied to PC-98 games at the moment.
On the other hand, recompiling our compiler would not only be a reasonable thing to attempt, but exactly the kind of problem that recompilation solves best. A 16-bit command-line tool has none of the pesky hardware factors that drag down the usefulness of recompilations when it comes to game ports, and a recompiled port could run even faster than it would on 32-bit Windows. Sure, it's not as flashy as a recompiled game, but if we got a few generous backers, it would still be a great investment into improving the state of static x86 recompilation by simply having another open-source project in that space. Not to mention that it would be a great foundation for improving Turbo C++ 4.0J's code generation and optimizations, which would allow us to simplify lots of awkward pieces of ZUN code… 🤩
That takes care of building ReC98 on 64-bit platforms, but what about the 32-bit ones we used to support? The previous split of the build process into a Tup-driven 32-bit part and a Makefile-driven 16-bit part sure was awkward and I'm glad it's gone, but it did give you the choice between 1) emulating the 16-bit part or 2) running both parts natively on 32-bit Windows. While Tup's upstream Windows builds are 64-bit-only, it made sense to 📝 compile a custom 32-bit version and thus turn any 32-bit Windows ≥Vista into the perfect build platform for ReC98. Older Windows versions that can't run Tup had to build the 32-bit part using a separately maintained dumb batch script created by tup generate, but again, due to Make being trash, they were fully rebuilding the entire codebase every time anyway.
Driving the entire build via Tup changes all of that. Now, it makes little sense to continue using 32-bit Tup:
We need to DLL-inject into a 64-bit MS-DOS Player. Sure, we could compile a 32-bit build of MS-DOS Player, but why would we? If we look at current market shares, nobody runs 32-bit Windows anymore, not even by accident. If you run 32-bit Windows in 2024, it's because you know what you're doing and made a conscious choice for the niche use case of natively running DOS programs. Emulating them defeats the whole point of setting up this environment to begin with.
It would make sense if Tup could inject into DOS programs, but it can't.
Also, as we're going to see later, requiring Windows ≥Vista goes in the opposite direction of what we want for a 32-bit build. The earlier the Windows version, the better it is at running native DOS tools.
This means that we could now only support 32-bit Windows via an even larger tup generated batch file. We'd have to move the MS-DOS Player prefix of the respective command lines into an environment variable to make Tup use the same rules for both itself and the batch file, but the result seems to work…
…but it's really slow, especially on Windows 9x. 🐌 If we look back at the theory behind my previous custom build system, we can already tell why: Efficiently building ReC98 requires a completely different approach depending on whether you're running a typical modern multi-core 64-bit system or a vintage single-core 32-bit system. On the former, you'd want to parallelize the slow emulation as much as you can, so you maximize the amount of TCC processes to keep all CPU cores as busy as possible. But on the latter, you'd want the exact opposite – there, the biggest annoyance is the repeated startup and shutdown of the VDM, TCC, and its DOS extender, so you want to continue batching translation units into as few TCC processes as possible.
CMake fans will probably feel vindicated now, thinking "that sounds exactly like you need a meta build system 🤪". Leaving aside the fact that the output vomited by all of CMake's Makefile generators is a disgusting monstrosity that's far removed from addressing any performance concerns, we sure could solve this problem by adding another layer of abstraction. But then, I'd have to rewrite my working Lua script into either C++ or (heaven forbid) Batch, which are the only options we'd have for bootstrapping without adding any further dependencies, and I really wouldn't want to do that. Alternatively, we could fork Tup and modify tup generate to rewrite the low-level build rules that end up in Tup's database.
But why should we go for any of these if the Lua script already describes the build in a high-level declarative way? The most appropriate place for transforming the build rules is the Lua script itself…
… if there wasn't the slight problem of Tup forbidding file writes from Lua. 🥲 Presumably, this limitation exists because there is no way of replicating these writes in a tup generated dumb shell script, and it does make sense from that point of view.
But wait, printing to stdout or stderr works, and we always invoke Tup from a batch file anyway. You can now tell where this is going. Hey, exfiltrating commands from a build script to the build system via standard I/O streams works for Rust's Cargo too!
Just like Cargo, we want to add a sufficiently unique prefix to every line of the generated batch script to distinguish it from Tup's other output. Since Tup only reruns the Lua script – and would therefore print the batch file – if the script changed between the previous and current build run, we only want to overwrite the batch file if we got one or more lines. Getting all of this to work wasn't all too easy; we're once again entering the more awful parts of Batch syntax here, which apparently are so terrible that Wine doesn't even bother to correctly implement parts of it. 😩
Most importantly, we don't really want to redirect any of Tup's standard I/O streams. Redirecting stdout disables console output coloring and the pretty progress bar at the bottom, and looping over stderr instead of stdout in Batch is incredibly awkward. Ideally, we'd run a second Tup process with a sub-command that would just evaluate the Lua script if it changed - and fortunately, tup parse does exactly that. 😌
In the end, the optimally fast and ERRORLEVEL-preserving solution involves two temporary files. But since creating files between two Tup runs causes it to reparse the Lua code, which would print the batch file to the unfiltered stdout, we have to hide these temporary files from Tup by placing them into its .tup/ database directory. 🤪
On a more positive note, programmatically generating batches from single-file TCC rules turned out to be a great idea. Since the Lua code maps command-line flags to arrays of input files, it can also batch across binaries, surpassing my old system in this regard. This works especially well on the debloated and anniversary branches, which replace ZUN's little command-line flag inconsistencies with a single set of good optimization flags that every translation unit is compiled with.
Time to fire up some VMs then… only to see the build failing on Windows 9x with multiple unhelpful Bad command or file name errors. Clearly, the long echo lines that write our response files run up against some length limit in command.com and need to be split into multiple ones. Windows 9x's limit is larger than the 127 characters of DOS, that's for sure, and the exact number should just be one search away…
…except that it's not the 1024 characters recounted in a surviving newsgroup post. Sure, lines are truncated to 1023 bytes and that off-by-one error is no big deal in this context, but that's not the whole story:
: This not unrealistic command line is 137 bytes long and fails on Windows 9x?!
> echo -DA=1 2 3 a/b/c/d/1 a/b/c/d/2 a/b/c/d/3 a/b/c/d/4 a/b/c/d/5 a/b/c/d/6 a/b/c/d/7 a/b/c/d/8 a/b/c/d/9 a/b/c/d/10 a/b/c/d/11 a/b/c/d/12
Bad command or file name
Wait, what, something about / being the SWITCHAR? And not even just that…
: Down to 132 bytes… and 32 "assignments"?
> echo a=0 b=1 c=2 d=3 e=4 f=5 g=6 h=7 i=8 j=9 k=0 l=1 m=2 n=3 o=4 p=5 q=6 r=7 s=8 t=9 u=0 v=1 w=2 x=3 y=4 z=5 a=0 b=1 c=2 d=3 e=4 f=5
Bad command or file name
And what's perhaps the worst example:
: 64 slashes. Works on DOS, works on `cmd.exe`, fails on 9x.
> echo ////////////////////////////////////////////////////////////////
Bad command or file name
My complete set of test cases: 2024-07-09-Win9x-batch-tokenizer-tests.bat
So, time to load command.com into DOSBox-X's debugger and step through some code. 🤷 The earliest NT-based Windows versions were ported to a variety of CPUs and therefore received the then-all-new cmd.exe shell written in C, whereas Windows 9x's command.com was still built on top of the dense hand-written ASM code that originated in the very first DOS versions. Fortunately though, Microsoft open-sourced one of the later DOS versions in April. This made it somewhat easier to cross-reference the disassembly even though the Windows 9x version significantly diverged in the parts we're interested in.
And indeed: After truncating to 1023 bytes and parsing out any redirectors, each line is split into tokens around whitespace and = signs and before every occurrence of the SWITCHAR. These tokens are written into a statically allocated 64-element array, and once the code tries to write the 65th element, we get the Bad command or file name error instead.
#    String   Switch flag
0    echo
1    -DA
2    1
3    2
4    3
5    a
6    /B       🚩
7    /C       🚩
8    /D       🚩
9    /1       🚩
10   a
11   /B       🚩
12   /C       🚩
13   /D       🚩
14   /2       🚩
The first few elements of command.com's internal argument array after calling the Windows 9x equivalent of parseline with my initial example string. Note how all the "switches" got capitalized and annotated with a flag, whereas the = sign no longer appears in either string or flag form.
Needless to say, this makes no sense. Both DOS and Windows pass command lines as a single string to newly created processes, and since this tokenization is lossy, command.com will just have to pass the original string anyway. If your shell wants to handle tokenization at a central place, it should happen after it decided that the command matches a builtin that can actually make use of a pointer to the resulting token array – or better yet, as the first call of each builtin's code. Doing it before is patently ridiculous.
I don't know what's worse – the fact that Windows 9x blindly grinds each batch line through this tokenizer, or the fact that no documentation of this behavior has survived on today's Internet, if any even ever existed. The closest thing I found was this page that doesn't exist anymore, and it also just contains a mere hint rather than a clear description of the issue. Even the usual Batch experts who document everything else seem to have a blind spot when it comes to this specific issue. As do emulators: DOSBox and FreeDOS only reimplement the sane DOS versions of command.com, and Wine only reimplements cmd.exe.
Oh well. 71 lines of Lua later, the resulting batch file does in fact work everywhere:
The clear performance winner at 11.15 seconds after the initial tool check, though sadly bottlenecked by strangely long TASM32 startup times. As for TCC, even this performance is merely the worst case for a recompiled port: Modern compiler optimizations are probably going to shave off another second or two, and implementing support for #pragma once into the recompiled code will get us the aforementioned 5% on top.
If you run this on VirtualBox on modern Windows, make sure to disable Hyper-V to avoid the slower snail execution mode. 🐢
Building in Windows XP under Hyper-V exchanges Windows 98's slow TASM32 startup times for slightly slower DOS performance, resulting in a still decent 13.4 seconds.
29.5 seconds?! Surely something is getting emulated here. And this is the best time I randomly got; my initial preview recording took 55 seconds which is closer to DOSBox-X's dynamic core than it is to Windows 9x. Given how poorly 32-bit Windows 10 performs, Microsoft should have probably discontinued 32-bit Windows after 8 already. If any 16-bit program you could possibly want to run is either too slow or likely to exhibit other compatibility issues (📝 Shuusou Gyoku, anyone?), the existence of 32-bit Windows 10 is nothing but a maintenance burden. Especially because Windows 10 simultaneously overhauled the console subsystem, which is bound to cause compatibility issues anyway. It sure did for me back in 2019 when I tried to get my build system to work…
But wait, there's more! The codebase now compiles on all 32-bit Windows systems I've tested, and yields binaries that are equivalent to ZUN's… except on 32-bit Windows 10. 🙄 Suddenly, we're facing the exact same batched compilation bug from my custom build system again, with REIIDEN.EXE being 16 bytes larger than it's supposed to be.
Looks like I have to look into that issue after all, but figuring out the exact cause by debugging TCC would take ages again. Thankfully, trial and error quickly revealed a functioning workaround: Separating translation unit filenames in the response file with two spaces rather than one. Really, I couldn't make this up. This is the most ridiculous workaround for a bug I've encountered in a long time.
The TCC response file generation code for all current decompiled TH04 code, split into multiple echo calls based on the Windows 9x batch tokenizer rules and with double spaces between each parameter for added "safety". Would this also have been the solution for the batched compilation bugs I was experiencing with my old build system in DOSBox? I suddenly was unable to reproduce these bugs, so we won't know for the time being…
Hopefully, you've now got the impression that supporting any kind of 32-bit Windows build is way more of a liability than an asset these days, at least for this specific project. "Real hardware", "motivating a TCC recompilation", and "not dropping previous features" really were the only reasons for putting up with the sheer jank and testing effort I had to go through. And I wouldn't even be surprised if real-hardware developers told me that the first reason doesn't actually hold up because compiling ReC98 on actual PC-98 hardware is slow enough that they'd rather compile it on their main machine and then transfer the binaries over some kind of network connection.
I guess it also made for some mildly interesting blog content, but this was definitely the last time I bothered with such a wide variety of Windows versions without being explicitly funded to do so. If I ever get to recompile TCC, it will be 64-bit only by default as well.
Instead, let's have a tier list of supported build platforms that clearly defines what I am maintaining, with just the most convincing 32-bit Windows version in Tier 1. Initially, that was supposed to be Windows 98 SE due to its superior performance, but that's just unreasonable if key parts of the OS remain undocumented and make no sense. So, XP it is.
*nix fans will probably once again be disappointed to see their preferred OS in Tier 2. But at least, all we'd need for that to move up to Tier 1 is a CI configuration, contributed either via funding me or sending a PR. (Look, even more contribution-ideas!)
Getting rid of the Wine requirement for a fully cross-platform build process wouldn't be too unrealistic either, but would require us to make a few quality decisions, as usual:
Do we run the DOS tools by creating a cross-platform MS-DOS Player fork, or do we statically recompile them?
Do we replace 32-bit Windows TASM with the 16-bit DOS TASM.EXE or TASMX.EXE, which we then either run through our forked MS-DOS Player or recompile? This would further slow down the build and require us to get rid of these nice long non-8.3 filenames… 😕 I'd only recommend this after the looming librarization of ZUN's master.lib fork is completed.
Or do we try migrating to JWasm again? As an open-source assembler that aims for MASM compatibility, it's the closest we can get to TASM, but it's not a drop-in replacement by any means. I already tried in late 2014, but encountered too many issues and quickly abandoned the idea. Maybe it works better now that we have less ASM? In any case, this migration would only get easier the less ASM code we have remaining in the codebase as we get closer to the 100% finalization mark.
Y'know what I think would be the best idea for right now, though? Savoring this new build system and spending an extended amount of time doing actual decompilation or modding for a change.
Now that even full rebuilds are decently fast, let's make use of that productivity boost by doing some urgent and far-reaching code cleanup that touches almost every single C++ source file. The most immediately annoying quirk of this codebase was the silly way each translation unit #included the headers it needed. Many years ago, I measured that repeatedly including the same header did significantly impact Turbo C++ 4.0J's compilation times, regardless of any include guards inside. As a consequence of this discovery, I slightly overreacted and decided to just not use any include guards, ever. After all, this emulated build process is slow enough, and we don't want it to needlessly slow down even more! This way, redundantly including any file that adds more than just a few #define macros won't even compile, throwing lots of Multiple definition errors.
Consequently, the headers themselves #included almost nothing. Starting a new translation unit therefore always involved figuring out and spelling out the transitive dependencies of the headers the new unit actually wanted to use, in a short trial-and-error process. While not too bad by itself, this was bound to become quite counterproductive once we got closer to porting these games: If some inlined function in a header needed access to, let's say, PC-98-specific I/O ports as an implementation detail, the header would have externalized this dependency to the top-level translation unit, which in turn made that unit appear to contain PC-98-native code even if the unit's own code was perfectly portable.
But once we start making some of these implicit transitive dependencies optional, it all stops being justifiable. Sometimes, a.hpp declared things that required declarations from b.hpp, but these things were used so rarely that it didn't justify adding #include "b.hpp" to all translation units that #include "a.hpp". So how about conditionally declaring these things based on previously #included headers?
#if (defined(SUBPIXEL_HPP) && defined(PLANAR_H))
// Sets the [tile_ring] tile at (x, y) to the given VRAM offset.
void tile_ring_set_vo(subpixel_t x, subpixel_t y, vram_offset_t image_vo);
#endif
You can maybe do this in a project that consistently sorts the #include lists in every translation unit… err, no, don't do this, ever, it's awful. Just separate that declaration out into another header.
Now that we've measured that the sane alternative of include guards comes with a performance cost of just 5% and we've further reduced its effective impact by parallelizing the build, it's worth it to take that cost in exchange for a tidy codebase without such surprises. From now on, every header file will #include its own dependencies and be a valid translation unit that must compile on its own without errors. In turn, this allows us to remove at least 1,000 #includes of transitive dependencies from .cpp files. 🗑️
However, that 5% number was only measured after I reduced these redundant #includes to their absolute minimum. So it still makes sense to only add include guards where they are absolutely necessary – i.e., transitively dependent headers included from more than one other file – and continue to (ab)use the Multiple definition compiler errors as a way of communicating "you're probably #including too many headers, try removing a few". Certainly a less annoying error than Undefined symbol.
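In practice, the new convention looks something like this – a minimal sketch, assuming a hypothetical tile_ring.hpp as the new home of the declaration from the snippet above, and guessing the file names behind the SUBPIXEL_HPP / PLANAR_H guards:
#ifndef TILE_RING_HPP
#define TILE_RING_HPP
// Guard only because this header would be transitively included from more
// than one other file.

// The header pulls in its own dependencies, so any .cpp file – or the
// header itself, treated as a translation unit – compiles without errors.
#include "subpixel.hpp"
#include "planar.h"

// Sets the [tile_ring] tile at (x, y) to the given VRAM offset.
void tile_ring_set_vo(subpixel_t x, subpixel_t y, vram_offset_t image_vo);

#endif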
Since all of this went way over the 7-push mark, we've got some small bits of RE and PI work to round it all out. The .REC loader in TH04 and TH05 is completely unremarkable, but I've got at least a bit to say about TH02's High Score menu. I already decompiled MAINE.EXE's post-Staff Roll variant in 2015, so we were only missing the almost identical MAIN.EXE variant shown after a Game Over or when quitting out of the game. The two variants are similar enough that it only took a small bit of work to bring my old 2015 code up to current standards, which allowed me to quickly push TH02 over the 40% RE mark.
Functionally, the two variants only differ in two assignments, but ZUN once again chose to copy-paste the entire code to handle them. This was one of ZUN's better copy-pasting jobs though – and honestly, I can't even imagine how you would mess up a menu that's entirely rendered on the PC-98's text RAM. It almost makes you wonder whether ZUN actually used the same #if ENDING preprocessor branching that my decompilation uses… until the visual inconsistencies in the alignment of the place numbers and their labels clearly give it away as copy-pasted.
Next up: Starting the big Seihou summer! Fortunately, waiting two more months was worth it: In mid-June, Microsoft released a preview version of Visual Studio that, in response to my bug report, finally, finally makes C++ standard library modules fully usable. Let's clean up that codebase for real, and put this game into a window.
Technical debt, part 10… in which two of the PMD-related functions came
with such complex ramifications that they required one full push after
all, leaving no room for the additional decompilations I wanted to do. At
least, this did end up being the final one, completing all
SHARED segments for the time being.
The first one of these functions determines the BGM and sound effect
modes, combining the resident type of the PMD driver with the Option menu
setting. The TH04 and TH05 version is apparently coded quite smartly, as
PC-98 Touhou only needs to distinguish "OPN- /
PC-9801-26K-compatible sound sources handled by PMD.COM"
from "everything else", since all other PMD varieties are
OPNA- / PC-9801-86-compatible.
Therefore, I only documented those two results returned from PMD's
AH=09h function. I'll leave a comprehensive, fully documented
enum to interested contributors, since that would involve research into
basically the entire history of the PC-9800 series, and even the clearly
out-of-scope PC-88VA. After all, distinguishing between more versions of
the PMD driver in the Option menu (and adding new sprites for them!) is
strictly mod territory.
The honor of being the final decompiled function in any SHARED
segment went to TH04's snd_load(). TH04 contains by far the
sanest version of this function: Readable C code, no new ZUN bugs (and
still missing file I/O error handling, of course)… but wait, what about
that actual file read syscall, using the INT 21h, AH=3Fh DOS
file read API? Reading up to a hardcoded number of bytes into PMD's or
MMD's song or sound effect buffer, 20 KiB in TH02-TH04, 64 KiB in
TH05… that's kind of weird. About time we looked closer into this.
Turns out that no, KAJA's driver doesn't give you the full 64 KiB of one
memory segment for these, as especially TH05's code might suggest to
anyone unfamiliar with these drivers. Instead,
you can customize the size of these buffers on its command line. In
GAME.BAT, ZUN allocates 8 KiB for FM songs, 2 KiB for sound
effects, and 12 KiB for MMD files in TH02… which means that the hardcoded
sizes in snd_load() are completely wrong, no matter how you
look at them. Consequently, this read syscall
will overflow PMD's or MMD's song or sound effect buffer if the
given file is larger than the respective buffer size.
Now, ZUN could have simply hardcoded the sizes from GAME.BAT
instead, and it would have been fine. As it also turns out though,
PMD has an API function (AH=22h) to retrieve the actual
buffer sizes, provided for exactly that purpose. There is little excuse
not to use it, as it also gives you PMD's default sizes if you don't
specify any yourself.
(Unless your build process enumerates all PMD files that are part of the
game, and bakes the largest size into both snd_load() and
GAME.BAT. That would even work with MMD, which doesn't have
an equivalent for AH=22h.)
What'd be the consequence of loading a larger file then? Well, since we
don't get a full segment, let's look at the theoretical limit first.
PMD prefers to keep both its driver code and the data buffers in a single
memory segment. As a result, the limit for the combined size of the song,
instrument, and sound effect buffer is determined by the amount of
code in the driver itself. In PMD86 version 4.8o (bundled with TH04
and TH05) for example, the remaining size for these buffers is exactly
45,555 bytes. Being an actually good programmer who doesn't blindly trust
user input, KAJA thankfully validates the sizes given via the
/M, /V, and /E command-line options
before letting the driver reside in memory, and shuts down with an error
message if they exceed 40 KiB. Would have been even better if he calculated
the exact size – even in the current
PMD version 4.8s from
January 2020, it's still a hardcoded value (see line 8581).
Either way: If the file is larger than this maximum, the concrete effect
is down to the INT 21h, AH=3Fh implementation in the
underlying DOS version. DOS 3.3 treats the destination address as linear
and reads past the end of the segment,
DOS
5.0 and DOSBox-X truncate the number of bytes to not exceed the remaining
space in the segment, and maybe there's even a DOS that wraps around
and ends up overwriting the PMD driver code. In any case: You will
overwrite what's after the driver in memory – typically, the game .EXE and
its master.lib functions.
It almost feels like a happy accident that this doesn't cause issues in
the original games. The largest PMD file in any of the 4 games, the -86
version of 幽夢 ~ Inanimate Dream, takes up 8,099 bytes,
just under the 8,192 byte limit for BGM. For modders, I'd really recommend
implementing this properly, with PMD's AH=22h function and
error handling, once position independence has been reached.
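For illustration, here's a minimal sketch of what such a bounds-checked
loader could look like. The pmd_song_buffer_size() and pmd_song_buffer()
helpers are made up and merely stand in for a wrapper around PMD's AH=22h
call and for the target buffer; everything else is plain Borland-style DOS
file I/O, and none of this is ZUN's actual code:
// Hypothetical, bounds-checked take on the BGM branch of snd_load().
#include <fcntl.h>
#include <io.h>

extern unsigned int pmd_song_buffer_size(void); // hypothetical AH=22h wrapper
extern void *pmd_song_buffer(void);             // hypothetical target buffer

int snd_load_song_checked(const char *fn)
{
	int ok = 0;
	const int handle = open(fn, (O_RDONLY | O_BINARY));
	if(handle == -1) {
		return 0; // ZUN's versions have no error handling here
	}
	const long size = filelength(handle);
	if((size >= 0) && (size <= (long)pmd_song_buffer_size())) {
		// Can't overflow the buffer anymore, no matter the DOS version.
		ok = (read(handle, pmd_song_buffer(), (unsigned int)size) != -1);
	}
	close(handle);
	return ok;
}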
Whew, didn't think I'd be doing more research into KAJA's drivers during
regular ReC98 development! That's probably been the final time though, as
all involved functions are now decompiled, and I'm unlikely to iterate
over them again.
And that's it! Repaid the biggest chunk of technical debt, time for some
actual progress again. Next up: Reopening the store tomorrow, and waiting
for new priorities. If we got nothing by Sunday, I'm going to put the
pending [Anonymous] pushes towards some work on the website.
P0137
Separating translation units, part 8/10 (focused around TH03) + Segment alignment research
💰 Funded by:
[Anonymous]
Whoops, the build was broken again? Since
P0127 from
mid-November 2020, on TASM32 version 5.3, which also happens to be the
one in the DevKit… That version changed the alignment for the default
segments of certain memory models when requesting .386
support. And since redefining segment alignment apparently is highly
illegal and absolutely has to be a build error, some of the stand-alone
.ASM translation units didn't assemble anymore on this version. I only
spotted this myself because I casually compiled ReC98 somewhere else –
on my development system, I happened to have TASM32 version 5.0 in the
PATH during all this time.
At least this was a good occasion to
get rid of some
weird segment alignment workarounds from 2015, and replace them with the
superior convention of using the USE16 modifier for the
.MODEL directive.
ReC98 would highly benefit from a build server – both in order to
immediately spot issues like this one, and as a service for modders.
Even more so than the usual open-source project of its size, I would say.
But that might be exactly
because it doesn't seem like something you can trivially outsource
to one of the big CI providers for open-source projects, and quickly set
it up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular
64-bit Windows 10 and DOSBox running the exact build tools from the DevKit.
Ideally, though, such a server should really run the optimal configuration
of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step
to run natively, which already is something that no popular CI service out
there offers. Then, we'd optimally expand to Linux, every other Windows
version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd
be a lot. An experimental project all on its own, with additional hosting
costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest
there is once the store reopens (which will be at the beginning of May, at
the latest). That aside, it would 📝 also be
a great project for outside contributors!
So, technical debt, part 8… and right away, we're faced with TH03's
low-level input function, which
📝 once 📝 again 📝 insists on being word-aligned in a way we
can't fake without duplicating translation units.
Being undecompilable isn't exactly the best property for a function that
has been interesting to modders in the past: In 2018,
spaztron64 created an
ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human
multiplayer mode: 2021-04-04-TH03-WASD-2player.zip
However, this remapping attempt remained quite limited, since we hadn't
(and still haven't) reached full position independence for TH03 yet.
There's quite some potential for size optimizations in this function, which
would allow more BIOS key groups to already be used right now, but it's not
all that obvious to modders who aren't intimately familiar with x86 ASM.
Therefore, I really wouldn't want to keep such a long and important
function in ASM if we don't absolutely have to…
… and apparently, that's all the motivation I needed? So I took the risk,
and spent the first half of this push on reverse-engineering
TCC.EXE, to hopefully find a way to get word-aligned code
segments out of Turbo C++ after all.
And there is! The -WX option, used for creating
DPMI
applications, messes up all sorts of code generation aspects in weird
ways, but does in fact mark the code segment as word-aligned. We can
consider ourselves quite lucky that we get to use Turbo C++ 4.0, because
this feature isn't available in any previous version of Borland's C++
compilers.
That allowed us to restore all the decompilations I previously threw away…
well, two of the three, that lookup table generator was too much of a mess
in C. But what an abuse this is. The
subtly different code generation has basically required one creative
workaround per usage of -WX. For example, enabling that option
causes the regular PUSH BP and POP BP prolog and
epilog instructions to be wrapped with INC BP and
DEC BP, for some reason:
a_function_compiled_with_wx proc
inc bp ; ???
push bp
mov bp, sp
; [… function code …]
pop bp
dec bp ; ???
ret
a_function_compiled_with_wx endp
Luckily again, all the functions that currently require -WX
don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty
close: snd_se_reset(void) is one of the functions that require
word alignment. Previously, it shared a translation unit with the
immediately following snd_se_play(int new_se), which does take
a parameter, and therefore would have had its prolog and epilog code messed
up by -WX.
Since the latter function has a consistent (and thus, fakeable) alignment,
I simply split that code segment into two, with a new -WX
translation unit for just snd_se_reset(void). Problem solved –
after all, two C++ translation units are still better than one ASM
translation unit. Especially with all the
previous #include improvements.
The rest was more of the usual, getting us 74% done with repaying the
technical debt in the SHARED segment. A lot of the remaining
26% is TH04 needing to catch up with TH03 and TH05, which takes
comparatively little time. With some good luck, we might get this
done within the next push… that is, if we aren't confronted with all too
many more disgusting decompilations, like the two functions that ended this
push.
If we are, we might be needing 10 pushes to complete this after all, but
that piece of research was definitely worth the delay. Next up: One more of
these.
P0099
TH01 decompilation (Pellets, part 1)
P0100
TH01 decompilation (Pellets, part 2)
P0101
TH01 decompilation (Pellets, part 3)
P0102
TH01 decompilation (Pellets, part 4)
💰 Funded by:
Ember2528, Yanga
Well, make that three days. Trying to figure out all the details behind
the sprite flickering was absolutely dreadful…
It started out easy enough, though. Unsurprisingly, TH01 had a quite
limited pellet system compared to TH04 and TH05:
The cap is 100, rather than 240 in TH04 or 180 in TH05.
Only 6 special motion functions (with one of them broken and unused)
instead of 10. This is where you find the code that generates SinGyoku's
chase pellets, Kikuri's small spinning multi-pellet circles, and
Konngara's rain pellets that bounce down from the top of the playfield.
A tiny selection of preconfigured multi-pellet groups. Rather than
TH04's and TH05's freely configurable n-way spreads, stacks, and rings,
TH01 only provides abstractions for 2-, 3-, 4-, and 5-way spreads (yup,
no 6-way or beyond), with a fixed narrow or wide angle between the
individual pellets. The resulting pellets are also hardcoded to linear
motion, and can't use the special motion functions. Maybe not the best
code, but still kind of cute, since the generated groups do follow a
clear logic.
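To illustrate that last point, here's a generic sketch – with made-up
names and types, and definitely not ZUN's actual fixed-point code – of
what such a fixed-angle n-way spread boils down to: n pellets fanned out
symmetrically around a center angle, all moving linearly at the same
speed.
#include <math.h>

struct pellet_velocity_t { float x, y; };

// [delta] would be one of the two hardcoded "narrow"/"wide" gaps;
// TH01's groups only ever cover n between 2 and 5.
void spread_nway(
	pellet_velocity_t out[], int n, float center_angle, float delta, float speed
)
{
	// Center the fan: for n = 3, the offsets are -delta, 0, +delta;
	// for n = 4, they are ±0.5·delta and ±1.5·delta.
	float angle = (center_angle - ((delta * (n - 1)) / 2.0f));
	for(int i = 0; i < n; i++) {
		out[i].x = ((float)cos(angle) * speed);
		out[i].y = ((float)sin(angle) * speed);
		angle += delta;
	}
}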
As expected from TH01, the code comes with its fair share of smaller,
insignificant ZUN bugs and oversights. As you would also expect
though, the sprite flickering points to the biggest and most consequential
flaw in all of this.
Apparently, it started with ZUN getting the impression that it's only
possible to use the PC-98 EGC for fast blitting of all 4 bitplanes in one
CPU instruction if you blit 16 horizontal pixels (= 2 bytes) at a time.
Consequently, he only wrote one function for EGC-accelerated sprite
unblitting, which can only operate on a "grid" of 16×1 tiles in VRAM. But
wait, pellets are not only just 8×8, but can also be placed at any
unaligned X position…
… yet the game still insists on using this 16-dot-aligned function to
unblit pellets, forcing itself into using a super sloppy 16×8 rectangle
for the job. 🤦 ZUN then tried to mitigate the resulting flickering in two
hilarious ways that just make it worse:
An… "interlaced rendering" mode? This one's activated for all Stage 15
and 20 fights, and separates pellets into two halves that are rendered on
alternating frames. Collision detection with the Yin-Yang Orb and the
player is only done for the visible half, but collision detection with
player shots is still done for all pellets every frame, as are
motion updates – so that pellets don't end up moving half as fast as they
should.
So yeah, your eyes weren't deceiving you. The game does effectively
drop its perceived frame rate in the Elis, Kikuri, Sariel, and Konngara
fights, and it does so deliberately.
📝 Just like player shots, pellets
are also unblitted, moved, and rendered in a single function.
Thanks to the 16×8 rectangle, there's now the (completely unnecessary)
possibility of accidentally unblitting parts of a sprite that was
previously drawn into the 8 pixels right of a pellet. And this
is where ZUN went "oh, I
know, let's test the entire 16 pixels, and in case we got an entity
there, we simply make the pellet invisible for this frame! Then
we don't even have to unblit it later!"
Except that this is only done for the first 3 elements of the player
shot array…?! Which don't even necessarily have to contain the 3 shots
fired last. It's not done for the player sprite, the Orb, or, heck,
other pellets that come earlier in the pellet array. (At least
we avoided going 𝑂(𝑛²) there?)
Actually, and I'm only realizing this now as I type this blog post:
This test is done even if the shots at those array elements aren't
active. So, pellets tend to be made invisible based on comparisons
with garbage data.
And then you notice that the player shot
unblit/move/render function is actually only ever called from the
pellet unblit/move/render function on the one global instance
of the player shot manager class, after pellets were unblitted. So, we
end up with a fixed per-frame call sequence that by itself already ensures
we can't ever unblit a previously rendered shot with a pellet. Sure, as
terrible as this one function call is from
a software architecture perspective, it was enough to fix this issue.
Yet we don't even get the intended positive effect, and walk away with
pellets that are made temporarily invisible for no reason at all. So,
uh, maybe it all just was an attempt at increasing the
frame rate on lower-spec PC-98 models?
Yup, that's it, we've found the most stupid piece of code in this game,
period. It'll be hard to top this.
I'm confident that it's possible to turn TH01 into a well-written, fluid
PC-98 game, with no flickering, and no perceived lag, once it's
position-independent. With some more in-depth knowledge and documentation
on the EGC (remember, there's still
📝 this one TH03 push waiting to be funded),
you might even be able to continue using that piece of blitter hardware.
And no, you certainly won't need ASM micro-optimizations – just a bit of
knowledge about which optimizations Turbo C++ does on its own, and what
you'd have to improve in your own code. It'd be very hard to write
worse code than what you find in TH01 itself.
(Godbolt for Turbo C++ 4.0J when?
Seriously though, that would 📝 also be a
great project for outside contributors!)
Oh well. In contrast to TH04 and TH05, where 4 pushes only covered all the
involved data types, they were enough to completely cover all of
the pellet code in TH01. Everything's already decompiled, and we never
have to look at it again. 😌 And with that, TH01 has also gone from by far
the least RE'd to the most RE'd game within ReC98, in just half a year! 🎉
Still, that was enough TH01 game logic for a while.
Next up: Making up for the delay with some
more relaxing and easy pieces of TH01 code that hopefully make just a
bit more sense than all this garbage. More image formats, mainly.
Sadly, we've already reached the end of fast triple-speed TH01 progress
with 📝 the last push, which decompiled the
last segment shared by all three of TH01's executables. There's still a
bit of double-speed progress left though, with a small number of code
segments that are shared between just two of the three executables.
At the end of the first one of these, we've got all the code for the .GRZ
format – which is yet another run-length encoded image format, but this
time storing up to 16 full 640×400 16-color images with an alpha bit. This
one is exclusively used to wastefully store Konngara's sword slash and
kuji-in kill
animations. Due to… suboptimal code organization, the code for the format
is also present in OP.EXE, despite not being used there. But
hey, that brings TH01 to over 20% in RE!
Decoupling the RLE command stream from the pixel data sounds like a nice
idea at first, allowing the format to efficiently encode a variety of
animation frames displayed all over the screen… if ZUN actually made
use of it. The RLE stream also has quite some ridiculous overhead,
starting with 1 byte to store the 1-bit command (putting a single 8×1
pixel block, or entering a run of N such blocks). Run commands then store
another 1-byte run length, which has to be followed by another
command byte to identify the run as putting N blocks, or skipping N blocks.
And the pixel data is just a sequence of these blocks for all 4 bitplanes,
in uncompressed form…
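To make that overhead more concrete, here's a rough sketch of how such a
command stream might be parsed. Only the structure – 1 command byte, then
a 1-byte run length plus yet another command byte for runs – comes from
the description above; the concrete command values and the callback
interface are assumptions, and this is not the decompiled TH01 code:
#include <stddef.h>

enum { GRZ_CMD_SINGLE = 0, GRZ_CMD_RUN = 1 }; // assumed values
enum { GRZ_RUN_PUT = 0, GRZ_RUN_SKIP = 1 };   // assumed values

// Callbacks supplied by the caller: blit the next uncompressed 8×1 block
// (4 bitplanes) from the pixel stream, or skip one block's worth of
// destination space.
typedef void grz_block_fn(void* ctx);

void grz_run_commands(
	const unsigned char* cmd, size_t cmd_count,
	grz_block_fn* put, grz_block_fn* skip, void* ctx
)
{
	size_t i = 0;
	while(i < cmd_count) {
		if(cmd[i++] == GRZ_CMD_SINGLE) {
			put(ctx); // one whole command byte for a single 8×1 block
			continue;
		}
		if((i + 2) > cmd_count) {
			break; // truncated stream
		}
		// Run: 1 length byte, plus yet another command byte that only
		// decides whether the run puts or skips [length] blocks.
		const unsigned char length = cmd[i++];
		const unsigned char kind = cmd[i++];
		for(unsigned char n = 0; n < length; n++) {
			((kind == GRZ_RUN_SKIP) ? skip : put)(ctx);
		}
	}
}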
Also, have some rips of all the images this format is used for:
To make these, I just wrote a small viewer, calling the same decompiled
TH01 code: 2020-03-07-grzview.zip
Obviously, this means that it not only must be run on a PC-98, but also
discards the alpha information.
If any backers are really interested in having a proper converter
to and from PNG, I can implement that in an upcoming push… although that
would be the perfect thing for outside contributors to do.
Next up, we got some code for the PI format… oh, wait, the actual files
are called "GRP" in TH01.