Here we go, the finale of the Shuusou Gyoku Linux port, culminating in packages for the Arch Linux AUR and Flathub! No intro, this is huge enough as it is.
Before we could compile anything for Linux, I still needed to add GCC/Clang support to my Tup building blocks, in what's hopefully the last piece of build system-related work for a while. Of course, the decision to use one compiler over the other for the Linux build hinges entirely on their respective support for C++ standard library modules. I 📝 rolled out import std; for the Windows build last time and absolutely do not want to code without it anymore. According to the cppreference compiler support table at the time I started development, we had the choice between
experimental support in the not-yet-released GCC 15, and
partial support as of Clang 17, two versions ago.
GCC's current implementation does compile in current snapshot builds, but still throws lots of errors when used within the Shuusou Gyoku codebase. Clang's allegedly partial support, on the other hand, turned out just fine for our purposes. So for now, Clang it is, despite not being the preferred C/C++ compiler on most Linux distributions. In the meantime, please forgive the additional run-time dependency on libc++, its C++ standard library implementation. 🙇 Let's hope that it all will actually work in GCC 15 once that version comes out sometime in 2025.
At a high level, my Tup building blocks only have to do a single thing to support standard library modules with a given compiler: Finding the std and std.compat module interface units at the compiler's standard locations, and compiling them with the same compiler flags used for the rest of the project. Visual Studio got the right idea about this: If you compile on its command prompts, you're already using a custom shell with environment variables that define the necessary paths and parameters for your target platform. Therefore, it makes sense to store these module units at such an easily reachable path – and sure enough, you can reliably find the std module unit at %VCToolsInstallDir%\modules\std.ixx. While this is hands down the optimal way of locating this file, I can understand why GCC and Clang would want module lookup to work in generic shells without polluting environment variables. In this case, asking some compiler binary for that path is a decent second-best option.
Unfortunately, that would have been way too simple. Instead, these two compilers approached the problem from the angle of general module usage within the common build systems out there:
Using modules within a project introduces a new kind of dependency relation between C++ source files, forcing all such code to be compiled in an implicitly defined order. For Tup, this isn't much of a problem because it has always required 📝 order-relevant dependencies to be explicitly specified. So it's been quite amusing for me to hear all these CMake-entrenched CppCon speakers in recent years comment on how this aspect of modules places such a burden on build systems… 🤭
Then again, their goal is a world where devs just write import name_of_module; and the build system figures out a project's dependency graph on its own by scanning all source files prior to compilation. Or rather, asking the compiler to parse the source files and dump out this information, using the -fdeps-* options on GCC, the separate clang-scan-deps tool for Clang, or the cl /scanDependencies option for MSVC.
Because each of the three major compilers has its own implementation of modules, it's understandable why the options and tools are different. Obviously though, CMake is interested in at least getting all three to output the dependency information in the same format. So they got onto the C++ committee's SG15 working group and proposed a JSON format, which GCC and Clang subsequently implemented.
But wait! The source files for the std and std.compat modules don't lie inside the source tree and couldn't be found by such a scan over the declared project files. So SG15 later simply proposed using the same JSON format for this purpose and installing such a JSON file together with the standard library implementation.
But wait! That only shifted the problem, because now we need to find that JSON file. What does the paper have to say on that issue?
For the Standard Library:
The build system should be able to query the toolchain (either the compiler or relevant packaging tools) for the location of that metadata file.
Wonderful. Just what we wanted to do all along, only with an additional layer of indirection that now forces every build system to include a JSON parser somewhere in its architecture. 🤦
In CMake's defense, they did try to get other build systems, including Tup, involved in these proposals. Can't really complain now if that was the consensus of everybody who wanted to engage in this discussion at the time. Still, what a sad irony that they reached out to Tup users on the exact day in 2019 on which I retired from thcrap and shelved all my plans of using Tup for modern C++ code…
So, to locate the interface units of standard library modules on Clang and GCC, a build system must do the following:
Ask the compiler for the path to the modules.json file, using the 30-year-old -print-file-name option.
GCC and Clang implement this option in the worst possible way by basically conditionally prepending a path to the argument and then printing it back out again. If the compiler can't find the given file within its inscrutable list of paths or you made a typo, you can only detect this by string-comparing its output with your parameter. I can't imagine any use case that wouldn't prefer an error instead.
Clang was supposed to offer the conceptually saner -print-library-module-manifest-path option, but of course, this is modern C++, and every single good idea must be accompanied by at least one other half-baked design or implementation decision.
Load the JSON file with the returned file name.
Parse the JSON file.
Scan the "modules" array for an entry whose "logical-name" matches the name of the standard module you're looking for.
Discover that the "source-path" is actually relative and will need to be turned into an absolute one for your compilation command line. Thankfully, it's just relative to the path of the JSON file we just parsed.
Sure, you can turn everything into a one-liner on Linux shells, but at what cost?
You might argue that Tup rules are a rather contrived case. Tup by itself can't store the output of processes in variables because rule generation and rule execution are two separate phases, so we need to call clang -print-file-name at both of the places in the command line where we need the file name. But, uh, CMake's implementation is 170 lines long…
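As a rough illustration of the lookup steps listed above, here is a minimal C++ sketch of the two parts that don't need a JSON parser: detecting the echoed-back argument of -print-file-name, and resolving a relative "source-path" against the directory of the JSON file. Both function names are made up for this example, and capturing the compiler's output as well as parsing the JSON are assumed to happen elsewhere.

```cpp
#include <filesystem>
#include <optional>
#include <string_view>

// `-print-file-name=<query>` prints the bare query back if the compiler
// couldn't find the file, so an unchanged output is the only failure signal.
// (Hypothetical helper; `compiler_output` is the captured stdout.)
std::optional<std::filesystem::path> manifest_path_from_output(
    std::string_view query, std::string_view compiler_output
) {
    // Captured process output typically ends with a line break; strip it.
    while(!compiler_output.empty() && (
        (compiler_output.back() == '\n') || (compiler_output.back() == '\r')
    )) {
        compiler_output.remove_suffix(1);
    }
    if(compiler_output.empty() || (compiler_output == query)) {
        return std::nullopt;
    }
    return std::filesystem::path{ compiler_output };
}

// "source-path" entries are relative to the directory of the JSON file.
// (Hypothetical helper; both arguments come from the steps above.)
std::filesystem::path resolve_source_path(
    const std::filesystem::path& manifest,
    const std::filesystem::path& source
) {
    if(source.is_absolute()) {
        return source;
    }
    return (manifest.parent_path() / source).lexically_normal();
}
```

With a made-up manifest location of /usr/lib/libc++.modules.json and a "source-path" of ../share/libc++/v1/std.cppm, the second function would yield /usr/share/libc++/v1/std.cppm. Actual paths differ between distributions and compiler versions.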
At least it's pretty straightforward to then use these compiled modules. As far as our Tup building blocks are concerned, it's just another explicit input and a set of command-line flags, indistinguishable from a library. For Clang, the -fmodule-file=module_name=path option is all that's required for mapping the logical module names to the respective compiled debug or release version.
GCC, however, decided to tragically over-engineer this mapping by devising a plaintext protocol for a microservice like it's 2014. Reading the usage documentation is truly soul-crushing as GCC tries everything in its power to not be like Clang and just have simple parameters. Fortunately, this mapper does support files as the closest alternative to parameters, which we can just echo from Tup for some 📝 90's response file nostalgia. At least I won't have to entertain this folly for a moment longer after the Lua code is written and working…
So modules are justifiably hard and we should cut compiler writers some slack for having to come up with an entirely new way of serializing C++ code that still works with headers. But surely, there won't be any problems with the smaller new C++ features I've started using. If they've been working in MSVC, they surely do in Clang as well, right? Right…?
Once again, C++ standard versions are proven to be utterly meaningless to anyone outside the committee and the CppCon presenters who try to convince you they matter. Here's the list of features that still don't work in Clang in early 2025:
C++20's std::jthread, which fixes an important design flaw of C++'s regular thread class. This would have been very unfortunate if I hadn't coincidentally already rewritten my threading code to use SDL's more portable thread API as part of the Windows 98 backport. Thus, I could adopt that work into this delivery, gifting a much-needed extra 0.3 pushes of content to the Windows 98 backport. 🙌
C++17's std::from_chars() for floating-point values, which we use to parse 📝 gain factors for waveform BGM out of Vorbis comment tags. This one is a medium-sized tragedy: Since it's not worth it to polyfill this function with a third-party library for just a single call, the best thing we can do is to fall back on strtof() from the C standard library. Why wasn't I using this function all along, you may ask? Well, as we all know by now, the C standard library is complete and utter trash, and strtof() is no exception by suffering from locale braindeath.
A good chunk() (ha) of the C++23 range adaptors. Since they're a rather new addition to the language, I've only made sporadic use of them so far to get a feel for their optimal usage. But as it turns out, sporadic use of range adaptors makes very little sense because the code is much simpler and easier to read without them. And this is what the C++ committee has been demanding our respect for all this time? They have played us for absolute fools.
The -2 might look slightly cryptic at first, but since this code is part of a constinit block, we'd get a compiler error if we either wrote too few elements (and left parts of the array uninitialized) or wrote too many (and thus out of the array's bounds). Therefore, the number can't be anything else.
It almost looked like it'd finally be time for my long-drafted rant about the state of modern C++, but the language just barely redeemed itself with the last two sentences there. Some other time, then…
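Regarding the missing floating-point std::from_chars() mentioned above: the locale dependence of strtof() could in principle also be sidestepped with standard C++ streams, since a stream can be pinned to the classic "C" locale regardless of the global one. The following is only a sketch of that alternative with a made-up function name. It is not the strtof() fallback the port went with, and it is certainly heavier than a single C function call:

```cpp
#include <locale>
#include <optional>
#include <sstream>
#include <string>
#include <string_view>

// Parses a float from the start of `text`, always using '.' as the
// decimal separator. Returns no value if `text` doesn't start with a number.
// (Hypothetical helper, not part of the Shuusou Gyoku codebase.)
std::optional<float> parse_float_classic(std::string_view text) {
    std::istringstream stream{ std::string{ text } };

    // The classic locale is the "C" locale, independent of any setlocale()
    // call made elsewhere in the process.
    stream.imbue(std::locale::classic());

    float value = 0.0f;
    if(!(stream >> value)) {
        return std::nullopt;
    }
    return value;
}
```

Calling parse_float_classic("1.5") would then return 1.5f no matter which locale the rest of the process runs under, while a string that contains no number at all returns an empty optional.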
On the bright side, all my portability work on game logic code had exactly the effect I was hoping for: Everything just worked after the first successful compilation, with zero weird run-time bugs resulting from the move from a 32-bit MSVC build to 64-bit Clang. 🎉
Before we can tackle text rendering as the last subsystem that still needs to be ported away from Windows, we need to take a quick look at the font situation. Even if we don't care about pixel-perfectly matching the game's text rendering on Windows, MS Gothic seems to be the only font that fits the game's design at all:
All text areas are dimensioned around the exact metrics of MS Gothic's embedded bitmaps. In menus, each half-width character is expected to be exactly 7×14 pixels large because most of the submenu items are aligned with spaces. In text boxes and the Music Room, glyphs can be smaller than the intended 8×16 pixels per half-width character, but they can't be larger without cutting off something somewhere.
Only bitmap fonts can deliver the sharp and pixelated look the game goes for. Subpixel rendering techniques are crucial for making vector fonts look good, but quickly get ugly when applied to drop-shadowed text rendered at these small sizes:
That's MS Gothic in both pictures. The smoothed rendering on the help text might arguably look nicer, but it clashes very badly with the drop shadow in the menus.
However, MS Gothic is non-free and any use of the font outside of a Windows system violates Microsoft's EULA. In spite of that, the AUR offers three ways of installing this font regardless:
The ttf-ms-*auto-* packages download a Windows 10 or 11 ISO from a somewhat official download link on Microsoft's CDN and extract the font files from there. Probably good enough if downloading 5 GB only to scrape a single 9 MB font file out of that image doesn't somehow feel wrong to you.
The regular, non-auto or -cdn ttf-ms-win* packages leave it up to you where exactly you get the files from. While these are the clearest options in how they let you manually perform the EULA infringement, this manual nature breaks automated AUR helpers. And honestly, requiring you to copy over all 141 font files shipped with modern Windows is massively overkill when we only need a single one of them. At that point, you might as well just copy msgothic.ttc to ~/.local/share/fonts and not bother with any package. Which, by the way, works for every distro as well as Flatpaks, which can freely access fonts on the host system.
You might want to go the extra mile and use any of these methods for perfectly accurate text rendering on Linux, and supporting MS Gothic should definitely be part of the intended scope of this port. But we can't expect this from everyone, and we need to find something that we can bundle as part of the Flatpak.
So, we need an alternative free Japanese font that fits the metric constraints of MS Gothic, has embedded bitmaps at the exact sizes we need, and ideally looks somewhat close. Checking all these boxes is not too easy; Japanese fonts with a full set of all Kanji in Shift-JIS are a niche to begin with, and nobody within this niche advertises embedded bitmaps. As the DPI resolutions of all our screens only get higher, well-designed modern fonts are increasingly unlikely to have them, thus further limiting the pool to old fonts that have long been abandoned and probably only survived on websites that barely function anymore.
Ultimately, the ideal alternative turned out to be a font named IPAMonaGothic, which I found while digging through the Winetricks source code. Its embedded bitmaps only cover the first half of MS Gothic's range, i.e., font heights between 10 and 16 pixels rather than going all the way to 22 pixels, but that happens to be exactly the range we need for this game.
If you're a PC-98 hardware fan, the difference between these two fonts is probably already reminding you of the stylistic difference between NEC's and Epson's versions of the ROM font.
Both of these screenshots were made on Windows. Obviously, the Linux port shouldn't settle for anything less than pixel-perfectly matching these reference renderings with both fonts.
Alright then, how are we going to get these fonts onto the screen with something that isn't GDI? With all the emphasis on embedded bitmaps, you might come to the conclusion that all we want to do is to place these bitmap glyphs next to each other on a monospaced grid. Thus, all we'd need is a TTF/OTF library that gives us the bitmap for a given Unicode code point. Why should we use any potentially system-specific API then?
But if we instead approach this from the point of view of GDI's feature set, it does seem better to match a standard Windows text rendering API with the equivalent stack of text rendering libraries that are typically used by Linux desktop environments. And indeed, there are also solid reasons why this is a better idea for now:
There actually is a single instance where this game uses MS Gothic at a height of 24 pixels, which is too large to be covered by its embedded bitmaps and thus requires rasterization of vector outlines. Whenever the SCL parser encounters an unknown opcode, it shows this error message:
Modders may very well end up seeing this one as a result of bugs in SCL compilers.
You might see debug text as not worth bothering with, but then there's Kioh Gyoku. Not only does that game display its text at much bigger sizes throughout, but it also renders every string at 3× the size it is ultimately downscaled to, similar to the 2× scale factor used by the 640×480 Windows Touhou games. Going for a full-featured solution that works with both embedded bitmaps and outlines saves us time later.
We'd be ready for translations into even the most complex-to-render non-ASCII scripts.
Since our fonts might not support these scripts, having the API fall back on other fonts installed in the system as necessary would allow us to add these translations independently of figuring out the font situation for them.
In fact, text rendering must technically already support glyph fallback because 📝 the BGM pack selection just displays path names, which count as user input. If people use code points in their BGM pack folder names that aren't covered by either of our two fonts, they probably have some font installed on their system that can display them. Also, the missing .DAT file screen further below in that post shows that GDI already does glyph fallback with emoji, so wouldn't it be lame if the Linux version didn't have at least feature parity in this regard? Instead, the Linux stack would actually outperform GDI thanks to the former's natural support for color emoji. 🎨
Since we're explicitly porting to desktop Linux here, using the standard Linux text rendering stack is the least bloated option because Linux users will have it installed anyway. We can still reach for more minimalistic alternatives later once we do port this game to something other than Linux.
Let's look at what this stack consists of and how the libraries interact with each other:
FreeType provides access to everything related to the rendering of TTF and OTF fonts, including their embedded bitmaps, as well as a rasterizer for regular vector glyphs. It's completely obvious why we need this library.
GLib2 is a collection of various general utility functions that modern non-C languages would have in their standard libraries. Most notably, it provides the tables and APIs for Unicode character data, but its iconv wrapper also comes in quite handy for converting the Shift-JIS text from the original .DAT files to UTF-8 without additional dependencies.
FriBidi implements the Unicode Bidirectional Algorithm, just in case you've thrown some Arabic or Hebrew into your string.
HarfBuzz implements shaping, i.e., the translation of raw Unicode into a sequence of glyph indices and positions depending on what's supported by the font. We might not strictly need this library right now, but it's completely obvious why we will eventually need it for translations.
Fontconfig manages all fonts installed on the system, maps user-friendly font names to file names, tracks their Unicode coverage, and offers a central place for storing various font tweaking options.
Normally, games wouldn't need this library because they just bundle all the fonts they need and hardcode any required tweaking settings to make them look as intended. Looking back at our font situation though, installing MS Gothic in a system-wide way through a package that puts the font into a standard location will be the simplest method of meeting that optional dependency. This is a reasonable assumption in a neatly packaged Linux system where the font is just another item on the game's dependency list, but also within a Flatpak, where "system-wide" includes any fonts shipped with the image. If we now assume that IPAMonaGothic is installed in the same way, we can let Fontconfig handle the actual selection. All we need to do is to specify a preference for MS Gothic over IPAMonaGothic, and Fontconfig will take care of the rest, without us writing a single line of TTF-loading code.
Pango combines the three libraries above into an API that somewhat matches GDI's simplicity, laying out text in one or multiple lines based on the shaped output of HarfBuzz and substituting glyphs as necessary based on Fontconfig information. The actual rendering, however, is delegated to…
Cairo, a… "2D graphics library"? Now why would we need one of those if all we want is a buffer filled with pixels? Wikipedia's description emphasizes its vector graphics capabilities, which seems to describe the library better than the nondescript blurb on its official website, but doesn't FreeType already do this for text? After looking at it for way too long, the best summary I can come up with is "a collection of font rasterization code that should have maybe been part of FreeType, plus the aforementioned general 2D vector graphics code we don't need". Just like Pango wraps HarfBuzz and Fontconfig to lay out the individual glyphs, Cairo wraps FreeType and raw pixel buffers to actually place these glyphs on its surface abstraction. (And also Fontconfig because of all its configuration settings that can influence the rendering.) Ultimately, this means that each font is represented by a HarfBuzz+FreeType handle, a Pango+Cairo handle, and a Cairo+FreeType handle, which I'm sure won't be relevant later on. 👀
Pango does have a raw FreeType backend that could render text entirely without Cairo, but it's not really maintained and supports neither embedded bitmaps nor color emoji. So we don't have much of a choice in the matter.
Created using pango-view -t 'effective. Power لُلُصّبُلُلصّبُررً ॣ ॣh ॣ ॣ🌈冗' --font='MS Gothic 16px' --backend=cairo.
Created using pango-view -t 'effective. Power لُلُصّبُلُلصّبُررً ॣ ॣh ॣ ॣ🌈冗' --font='MS Gothic 16px' --backend=ft2.
Fun fact: Since Cairo also manages the temporary CPU image buffer we draw on and then hand to SDL, our backend for Shuusou Gyoku ends up with 3× as many Cairo function calls as Pango function calls.
In the end, a typical desktop Linux program requires every single one of these 8 libraries to end up with a combined API that resembles Ye Olde Win32 GDI in terms of functionality and abstraction level. Sure, the combination of these eight is more powerful than GDI, offering e.g. affine transformations and text rendering along a curved path. But you can't remove any of these libraries without falling behind GDI.
Even then, my Linux implementation of text rendering for Shuusou Gyoku still ended up slightly longer than the GDI one due to all the Pango and Cairo contexts we have to manually manage. But I did come up with a nice trick to reduce at least our usage of Cairo: Since GDI needs to be used together with DirectDraw, the GDI implementation must keep a system-memory copy of the entire 📝 text surface due to 📝 DirectDraw's possibility of surface loss. But since we only use Cairo with SDL, the Cairo surface in system memory does not actually need to match the SDL-managed GPU texture. Thus, we can reduce the Cairo surface to the role of a merely temporary system-memory buffer that is only as large as the single largest text rectangle, and then copy this single rectangle to the intended packed place within the texture. I probably wouldn't have realized this if the seemingly most simple way to limit rendering to a fixed rectangle within a Cairo surface didn't involve creating another Cairo surface, which turned out to be quite cumbersome.
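To make the buffer trick described above a bit more concrete, here is a simplified C++ sketch that uses plain pixel vectors in place of the Cairo surface and the SDL texture. All type and function names are made up for illustration, pixels are assumed to be 32 bits wide, and strides are counted in pixels rather than bytes. The real implementation would go through the Cairo and SDL APIs instead (SDL's SDL_UpdateTexture() takes a destination rectangle and would be one candidate for the copy, though that is an assumption on my part).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rect { int left, top, w, h; };
struct Size { int w, h; };

// The temporary buffer only needs to fit the single largest text rectangle,
// not the whole packed surface.
Size temp_buffer_size(const std::vector<Rect>& rects) {
    Size ret{ 0, 0 };
    for(const auto& r : rects) {
        ret.w = std::max(ret.w, r.w);
        ret.h = std::max(ret.h, r.h);
    }
    return ret;
}

// Copies the top-left (dst.w × dst.h) pixels of the temporary buffer to
// their packed position `dst` inside the larger texture. The caller must
// ensure that `dst` fits into both buffers.
void copy_to_packed(
    std::vector<std::uint32_t>& texture, int texture_w,
    const std::vector<std::uint32_t>& temp, int temp_w,
    const Rect& dst
) {
    for(int y = 0; y < dst.h; y++) {
        const auto src_offset = (static_cast<std::ptrdiff_t>(y) * temp_w);
        const auto dst_offset = (
            (static_cast<std::ptrdiff_t>(dst.top + y) * texture_w) + dst.left
        );
        std::copy_n(
            (temp.begin() + src_offset), dst.w, (texture.begin() + dst_offset)
        );
    }
}
```

The point of the design is that rendering always starts at the origin of the small buffer, so no clipping to a sub-rectangle of a larger surface is needed on the rendering side.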
But can this stack deliver the pixel-perfect rendering we'd like to have? Well, almost:
Cue hours of debugging to find the cause behind these vertical shifts. The overview above already suggested it, but this bug hunt really drove home how this entire stack of libraries is a huge pile of redundantly implemented functionality that interacts with and overrides each other in undocumented and mostly unconfigurable ways. Normally, I don't have much of a problem with that as long as I can step through the code, but stepping through Cairo and especially Pango is a special kind of awful. Both libraries implement dynamic typing and object-oriented paradigms in C, thus hiding their actually interesting algorithms under layers and layers of "clean" management functions. But the worst part is a particularly unexpected piece of recursion: To lay out a paragraph of text, Pango requires a few font metrics, which it calculates by laying out a language-specific paragraph of example text. No, I do not like stepping through functions that much, please don't put a call to the text layout function into the text layout function to make me debug while I debug, dawg…
It'll probably take many more years until most of this stack has been displaced with the planned Rust rewrites. But honestly, I don't have great hopes as long as they stay with this pile-of-libraries approach. This pile doesn't even deserve to be called a stack given the circular dependency between FreeType and HarfBuzz…
Ultimately, these are the bugs we're seeing here:
When rendering strings that contain both Japanese and Latin characters with MS Gothic, the Japanese characters are pushed down by about 1/8th of the font height. This one was already reported in June 2023 and is a bug in either HarfBuzz, Pango, or MS Gothic. With the main HarfBuzz developer confused and without an idea for a clean solution, the bug has remained unfixed for 1½ years.
For now, the best workaround would be to revert the commit that introduced the baseline shift. Since the Flatpak release can bundle whatever special version of whatever library it needs, I can patch this bug away there, but distro-specific packages or self-compiled builds would have to patch Pango themselves. LD_LIBRARY_PATH is a clean way of opting into the patched library without interfering with the regular updates of your distro, but there's still a definite hurdle to setting it up.
The remaining 1-pixel vertical shift is, weirdly enough, caused by hinting. Now why would a technique intended for improving the sharpness of outline fonts even apply to bitmap fonts to begin with? As you might have guessed, the pile-of-libraries approach strikes once more:
We can override Cairo's metric hinting defaults with the API documented in the page I linked above. But we must only do so conditionally because 16-pixel MS Gothic does require metric hinting for its glyph placement to match GDI. The resulting hack is very much not pretty.
Cairo's font options can only be really changed at the level of a Cairo context. Any Pango font handle created from a Pango layout mapped to a Cairo context will get a copy of that context's font options at creation time. And of course, the Pango level treats these options as an implementation detail that cannot be modified from the outside. So, we need to figure out the font using raw Fontconfig calls instead of Pango's abstraction. Oh, and this copy also forces us to recreate the Pango layout if we change between 14- and 16-pixel MS Gothic, which is not necessary with IPAMonaGothic.
Don't you love it when the concerns are so separated that they end up overlapping again? I'm so looking forward to writing my own bitmap font renderer for the multilingual PC-98 translations, where the memory constraints of conventional DOS RAM make it infeasible to use any libraries of this pile to begin with 😛
Before we can package this port for Flathub, there's one more obstacle we have to deal with. Flathub mandates that any published and publicly listed app must come with an icon that's at least 128×128 pixels in size. pbg did not include the game's original 32×32 icon in the MIT-licensed source code release, but even if he had, just taking that icon and upscaling it by 4× would simultaneously look lame and more official than it perhaps should.
So, the backers decided to commission a new one, depicting VIVIT in her title screen pose but drawn in a different style so as not to look too official. Mr. Tremolo Measure quickly responded to our search and Ember2528 liked his PC-98-esque pixel art style, so that's what we went for:
However, the problem with pixel art icons is that they're strongly tied to specific resolutions. This clashes with modern operating system UIs that want to almost arbitrarily scale icons depending on the context they appear in. You can still go for pixel art, and it sure looks gorgeous if the icon's resolution exactly matches the size a GUI wants to display it at. But that's a big if – if the size doesn't match and the icon gets scaled, the resulting blurry mess lacks all the definition you typically expect from pixel art. Even nearest-neighbor integer upscaling looks cheap rather than stylized, as the coarse pixel grid of the icon clashes with the finer pixel grid of everything surrounding it.
So you'd want multiple versions of your icon that cover all the exact sizes it will appear at, which is definitely more expensive than a single smooth piece of scalable vector artwork. On a cursory look through Windows 11, I found no fewer than 7 different sizes that icons are displayed at:
16×16 in the title bar and all of Explorer's list views
24×24 in the taskbar
28×28 in the small icon next to the file name in Explorer's detail pane (which is never sharp for some reason, even if you provide a 28×28 variant?!)
32×32 in the old-style Properties window
48×48 in Explorer's Medium icons view
96×96 in Explorer's Large icons view, and the large icon in its detail pane
256×256 in Explorer's Extra large icons view
And that's just at 1× display scaling and the default zooming factors in Explorer.
But it gets worse. Adding our commissioned multi-resolution icon to an .exe seems simple enough:
Bundle the individual images into a single .ico file using magick in1.png in2.png … out.ico
Write a small resource script, call rc, and add the resulting .res file to the link command line
Be amazed as that icon appears in the title and task bars without you writing a single line of code, thanks to SDL's window creation code automatically setting the first icon it finds inside the executable
But what's going on in Explorer?
Same Extra large icons setting for both.
That's the 48×48 variant sitting all tiny in the center of a 256×256 box, in a context where we expect exactly what we get for the .ico file. Did I just stumble right into the next underdocumented detail? What was the point of having a different set of rules for icons in .exe files? Make that 📝 another Raymond Chen explanation I'm dying to hear…
Until then, here's what the rules appear to be:
256×256 is the one and only mandatory size for high-res program icons on Windows.
48×48 is the next smallest supported size, as unbelievable as that sounds. Windows will never use any other icon variant in between. Some sites claim that 64×64 is supported as well, but I sure couldn't confirm that in my tests.
Those 96×96 use cases from the list above? Yup, Windows will never actually display an embedded 96×96 icon at its native resolution, and either scale up the 48×48 variant (in the Large icons view) or scale down the 256×256 variant (in the detail pane).
You only ever see an embedded icon with a size between 48×48 and 256×256 if it's the only icon available – and then it still gets scaled to 48×48. Or to 96×96, depending on how Explorer feels.
Getting different results in your tests? Try rebuilding the icon cache, because of course Windows still struggles with cache invalidation. This must have caused unspeakable amounts of miscommunication with artists over the decades.
Oh well, let's nearest-neighbor-scale our 128×128 icon by 2× and move on to Linux, where we won't have such archaic restrictions…
…which is not to say that pixel art icons don't come with their own issues there. 🥲
On Linux, this kind of metadata is not part of the ELF format, but is typically stored in separate Desktop Entry files, which are analogous to .lnk shortcuts on Windows. Their plaintext nature already suggests that icon assignment is refreshingly sane compared to the craziness we've seen above, and indeed, you simply refer to PNG or even SVG files in a separate directory tree that supports arbitrary size variants and even different themes. For non-SVG icons, menus and panels can then pick the best size variant depending on how many pixels they allot to an icon. The overwhelming majority of the ones I've seen do a good job at picking exactly the icon you'd expect, and bugs are rare.
But how would this work for title and task bars once you started the app? If you launched it through a Desktop Entry, a smart window manager might remember that you did and automatically use the entry's icon for every window spawned by the app's process. Apparently though, this feature is rather rare, maybe because it only covers this single use case. What about just directly starting an app's binary from a shell-like environment without going through a Desktop Entry? You wouldn't expect window managers to maintain a reverse mapping from binaries to Desktop Entries just to also support icons in this other case.
So, there must be some way for a program to tell the window manager which icon it's supposed to use. Let's see what SDL has to offer… and the documentation only lists a single function that takes a single image buffer and transfers its pixels to the X11 or Wayland server, overriding any previous icon. 😶
Well great, another piece of modern technology that works against pixel art icons. How can we know which size variant we should pick if icon sizing is the job of the window manager? For the same reason, this function used to be unimplemented in the Wayland backend until the committee of Wayland stakeholders agreed on the xdg-toplevel-icon protocol last year.
Now, we could query the size of the window decorations at all four edges to at least get an approximation, but that approach creates even more problems:
Which edge do we pick? The top one? The largest one? How can we possibly be sure that the one we pick is the one that will show the icon?
Even if we picked the correct edge, the icon will likely be smaller and not cover the full area. Again, anything less than an exact match isn't good enough for pixel art.
This function is not implemented on Wayland because client windows aren't supposed to care about how the server is decorating them.
But even among X11 window managers, there's at least one that doesn't report back the border sizes immediately after window creation. 🙄
Most importantly though: What if that icon is also used in a taskbar whose icons have a different size than the ones in title bars? Both X11's _NET_WM_ICON property and Wayland's xdg-toplevel-icon-v1 protocol support multiple size variants, but SDL's function does not expose this possibility. It might look as if SDL 3 supports this use case via its new support for alternate images in surfaces, but this feature is currently only used for mouse cursors. That sounds like a pull request waiting to happen though, I can't think of a reason not to do the same for icons. contribution-ideas?
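To make the "exact match or nothing" requirement for pixel art more concrete, here's a minimal sketch of how a consumer of multiple size variants could pick one. The IconVariant type and the PickCrispVariant() function are hypothetical names made up for this example; neither SDL nor any window manager is claimed to work this way.

```cpp
#include <cstdint>
#include <vector>

// One square size variant of an icon. Hypothetical type for illustration;
// `size` is the edge length in pixels.
struct IconVariant {
	uint32_t size;
};

// Returns the variant that can fill `target` pixels without blurring:
// an exact match if there is one, otherwise the largest variant whose size
// divides `target` evenly, so that nearest-neighbor scaling by an integer
// factor stays crisp. Returns nullptr if no variant qualifies.
const IconVariant* PickCrispVariant(
	const std::vector<IconVariant>& variants, uint32_t target
) {
	const IconVariant* best = nullptr;
	for(const auto& v : variants) {
		if((v.size == 0) || (v.size > target) || ((target % v.size) != 0)) {
			continue;
		}
		if(!best || (v.size > best->size)) {
			best = &v;
		}
	}
	return best;
}
```

With variants of 16, 32, 48, and 128 pixels, a 96-pixel slot would get the 48-pixel variant at 2×, while a 20-pixel slot would get nothing at all, which is exactly the case where a window manager has to fall back on blurry scaling.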
But if SDL 2's single window icon function used to be unsupported on Wayland, did SDL 2 apps just not have icons on Wayland before October 2024?
Digging deeper reveals the tragically undocumented SDL_VIDEO_X11_WMCLASS environment variable, which does what we were hoping to find all along. If you set it to the name of your program's Desktop Entry file, the window manager is supposed to locate the file, parse it, read out the Icon value, and perform the usual icon and size lookup. Window class names are a standard property in both X11 and Wayland, and since SDL helpfully falls back on this variable even on Wayland, it will work on both of them.
Or at least it should. Ultimately, it's up to the window manager to actually implement class-derived icons, and sadly, correct support is not as widespread as you would expect.
How would I know this? Because I've tested them all. 🥲 That is, all non-AUR options listed on the Arch Wiki's Desktop environment and Window manager pages that provide something vaguely resembling a desktop you can launch arbitrary programs from:
Table columns: Manually transferred pixels, Class-derived icons
Notes on individual table entries:
Does not report border sizes back to SDL immediately after window creation
No title bars
Title bars have no icons. Taskbar falls back on the icon from the Desktop Entry file the app was launched with.
Title bars have no icons, but they work fine in the taskbar. Points out the difference between native and Flatpak apps!
Title bars have no icons, but they work fine in the taskbar. Points out the difference between native and Flatpak apps!
Title bars have no icons. The status bar only seems to support the X11 _NET_WM_ICON property, and not the older XWMHints mechanism used by e.g. xterm.
Did not start
Taskbar falls back on the icon from the Desktop Entry file the app was launched with. Only picks the correctly scaled icon variant in about half of the places, and just scales the largest one in the other half.
GNOME Flashback / Metacity
Title bars have no icons
Title bars have no icons
GNOME Classic
How do you get this running? The variables just start regular GNOME.
Taskbar only supports manually transferred icons. Scaling of class-derived icons in title bars is broken.
No title bars
I tested all window managers, compositors, and/or desktop environments at their latest version as of January 2025 in their default configuration. There were no differences between the X11 and Wayland versions for the ones that offer both.
Yes, you can probably rice title bars and icons onto WMs that don't have them by default. I don't have the time.
That's only 6 out of 33 window managers with a bug-free implementation of class-derived icons, and still 6 out of 28 if we disregard all the tiling window managers where icons are not in scope. If you actually want icons in the title bar, the number drops to just 2, KDE and Pantheon. I'm really impressed by IceWM there though, beating all other similarly old and minimal window managers by shipping with an almost correct implementation.
For now, we'll stay with class-derived icons for budget reasons, but we could add a pixel transfer solution in the future. And that was the 2,000-word story behind this single line of code… 📕
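For reference, setting the variable from within the program could look roughly like the sketch below. It assumes a POSIX system with setenv(), a hypothetical Desktop Entry name, and that the call happens before SDL initializes its video subsystem; it's not necessarily the exact line used in the game.

```cpp
#include <cstdlib>  // std::getenv()
#include <stdlib.h> // POSIX setenv()

// Hypothetical name of the Desktop Entry file (without the .desktop
// extension); substitute the ID your package actually installs.
constexpr const char* DESKTOP_ENTRY_ID = "com.example.Game";

// Tells SDL 2 which window class to report, which window managers can then
// map to the Desktop Entry and its Icon value. Should be called before SDL's
// video subsystem is initialized. Passing 0 as the last parameter keeps any
// value the user has already set in their environment.
bool SetWindowClassForIconLookup()
{
	return (setenv("SDL_VIDEO_X11_WMCLASS", DESKTOP_ENTRY_ID, 0) == 0);
}
```

Not overwriting an existing value is a deliberate choice here: It lets users or packagers point the window at a differently named Desktop Entry without recompiling.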
On to packaging then, starting with Arch! Writing my first PKGBUILD was a breeze; as you'd expect from the Arch Wiki, the format and process are very well documented, and the AUR provides tons of examples in case you still need any.
The PKGBUILD guidelines have some opinions about how to handle submodules, but applying them would complicate the PKGBUILD quite a bit while bringing us nowhere close to the 📝 nirvana of shallow and sparse submodules I've scripted earlier. But since PKGBUILDs are just shell scripts that can naturally call other shell scripts, we can just ignore these guidelines, run that existing submodule script, and end up with a simpler PKGBUILD and the intended shorter and less bloated package creation process.
Sadly, PKGBUILDs don't easily support specifying a dependency on either one of two packages, which we would need to codify the font situation. Due to the way the AUR packages both IPAMonaGothic and MS Gothic together with their Mincho and proportional variants, either of them would be Shuusou Gyoku's largest individual dependency. So you'd only want to install one or the other, but probably not both. We could resolve this by editing the PKGBUILDs of both font packages and adding a provides entry for a new and potentially controversial virtual package like ttf-japanese-14-and-16-pixel-bitmap that Shuusou Gyoku could then depend on. But with both of the packages being exclusive to the AUR, this dependency would still be annoying to resolve and you'd have no context about the difference.
Thus, the best we can do is to turn both MS Gothic and IPAMonaGothic into optional dependencies with a short one-line description of the difference, and to elaborate on this difference in a comment at the top of the PKGBUILD. Thankfully, the culture around Arch makes this a non-issue because you can reasonably expect people to read your PKGBUILD if they build something from the AUR to begin with. You do always read the PKGBUILD, right?
Flatpak, on the other hand… I'm not at all opposed to the fundamental idea of installing another distro on top of an already existing distro for wider ABI compatibility; heck, Flatpak is basically no different from Wine or WSL in this regard. It's just that this particular ABI-widening distro works in a rather… unnatural way that crosses the border into utter cringe at times.
There are enough rants about Flatpak from a user's perspective out there, criticizing the bloat relative to native packages, the security implications of bundling libraries, and the questionable utility of its sandbox. But something I rarely see people talk about is just how awful Flatpak is from a developer's point of view:
The documentation is written in this weird way that presents Flatpak and its concepts in complete isolation. Without drawing any connections to previous packaging and dependency management systems you might have worked with, it left a lot of my seemingly basic questions unanswered. While it is important to explain your concepts with example code, the lack of a simple and complete reference of the manifest format doesn't exactly inspire confidence in what you're doing. Eventually, I just resorted to cross-checking features in the JSON Schema to get a better idea of what's actually possible.
The ABI-expanding distro part of Flatpak is actually called the Freedesktop platform, a currently 680 MB large stack of typical GUI application libraries updated once a year. It's accompanied by the Freedesktop SDK containing the matching development libraries and tools in another 1.7 GB. As the name implies, this distro is maintained by a separate entity with a homepage that makes the entire thing look deeply self-important and unprofessional. A blurry 25 FPS logo video, a front page full of spelling mistakes, a big focus on sponsors and events… come on, you have one job, and it's compiling and packaging a bunch of open-source libraries. Was this a result of the usual corporate move of creating more departments in order to shift blame and responsibility?
Optics aside, their documentation is even more bizarrely useless. The single bit of actually useful information I was looking for – i.e., the concrete list of packages bundled as part of their runtimes and their versions – is best found by going straight to their code repo.
The manifest of a Flatpak app can be written in your preferred lesser evil of the two most popular markup languages: JSON (slightly ugly for humans and machines alike), or YAML, the underspecified mess that uses syntactically significant whitespace while outlawing the closest thing we have to a semantic indentation character. Oh well, YAML at least supports comments, and we sure sorely need them to justify our bleeding-edge C++ module setup to the Flathub maintainers.
Adding more dependencies on top of the basic runtime can be done by either using runtime extensions or BaseApps. That's two entirely separate concepts that appear to do the same thing on the surface, except that you can only have one BaseApp. The documentation then waffles on and tries to explain both concepts with words that have meaning in isolation but once again answer exactly zero of my questions. Must a BaseApp contain a collection of at least two dependencies or why would anyone ever write the sentence that raises this question? Why do they judge BaseApps to be a "specialized concept" without elaborating, as if to suggest that their audience is too dumb to understand them? Why does a page named Dependencies document extensions as if I wanted to prepare my own package for extension by others? Why be all weird and require "extension points" to be defined when it all just comes down to overlaying another filesystem? Who cares about the special significance of the .Debug, .Locale, and .Sources conventions in the context of dependencies?
In the end, you once again get a clearer idea by simply looking at how existing code uses these concepts. Basically, SDK extensions = build-time dependencies, BaseApps = run-time dependencies, and extension points don't matter at all for our purposes because you can just arbitrarily extend the org.freedesktop.Sdk anyway. 🤷
Speaking of extensions: This exact architectural split between build-time and run-time dependencies is why the org.freedesktop.Sdk.Extension.llvm19 extension packages Clang, but not libc++. When questioned about this omission, one of the maintainers responded with the lamest of excuses: Copying the library would be inconvenient (for them), and something we can't even imagine a use case for. Um, guys? Here's a table. Compare the color of each cell between GCC and Clang. There's your use case.
Thankfully, you can build libc++ without building LLVM as a whole. Seeing how building libc++ takes basically no time at all compared to the rest of LLVM just raises even more questions about not simply providing some kind of script to copy it over.
Speaking of XDG directories, why do they create the .flatpak-builder cache directory in the current working directory and not under $XDG_CACHE_HOME where it belongs?
The modules in a Flatpak work in a similarly layered way as the commands in a Dockerfile, causing edits to a lower layer to evict previous builds of all successive layers from the cache. Any tweaking work in the lower layers therefore suffers from the same disruptive workflow you might already know from Docker, where you constantly shift the layers around to minimize unnecessary rebuilds because there's never an optimal order. Will we ever see container bros move on from layers to a proper build graph of the entire system? The stagnation in this space is saddening.
The --ccache option sort of mitigates the layering by at least caching object files in .flatpak-builder/ccache, which reduces repeated C compilation to a mere file copy from the cache to the package. But not only is this option not enabled by default, it also doesn't appear in any of the flatpak-builder example command lines in the documentation.
Also, it only appears to work with GCC, and setting CCACHE_COMPILERTYPE=clang seems to have no effect. Fortunately, my investment into C++ modules pays off here as well and keeps compile times decently short.
flatpak-builder doesn't validate the manifest schema? Misspelled or misplaced properties just silently do nothing?
Speaking of validation, why does flatpak-builder-lint take 8 seconds to validate a manifest, even if it just consists of a single line? Sure, it's written in Python, but that's an order of magnitude too slow for even that language.
No tab completion for any of the org.flatpak.Builder tools. Sandbox working as designed, I guess 🤷
Git submodule handling. Oh my goodness.
Flatpak recursively clones and checks out all of a repository's submodules. This might be necessary for some codebases, but not for this one: The Linux build doesn't need the SDL submodule, and nothing needs the second miniaudio submodule that the dr_libs use for their testing code. And if these recursive submodules didn't opt into shallow clones, you end up with lots of disk space wasted for no reason; 166.1 MiB in our case.
Except that it's actually twice that amount. There's the download cache that persists across multiple flatpak-builder runs, and then there's the temporary directory the build runs in, which gets a full second clone of the entire tree of submodules. This isn't Windows 8, there are no excuses for not using read-only symlinks.
None of this would be too bad if we could just do the same thing we did with Arch, ignore the default or recommended submodule processing, and let our shell script run the show and selectively download and check out the submodules required for the Linux build. But no – the build process of a Flatpak is strictly separated into a download stage and a build stage, and the build stage cannot access the network. Once again, Flatpak would have the option to allow build-time network access, but enabling it would mean no hosting and discoverability on Flathub for you.
I guess it makes sense from a security point of view, as reviewers would only have to audit a fixed set of declaratively specified sources rather than all code run by the build commands? But even this can only ever apply to the initial review. Allowing app developers to push updates independently from the Flathub maintainers is one of Flathub's biggest selling points. Once you're in, you or your supply chain can just simply hide the malware in an updated version of a module source. 🤷
Getting Tup to work within the Flatpak build environment is slightly tricky. The build sandbox doesn't provide access to the kernel's FUSE module, which Tup uses to track syscalls by default. Thankfully, Tup also supports syscall tracking via LD_PRELOAD, which allows us to still build Shuusou Gyoku in a parallelized way with a regular Tup binary. Imagine compiling FUSE from source only to make Tup compile, but then having to build the game via a single-threaded shell script created by tup generate…
One common user complaint about Flatpak is that it allows Windows app developers to stick to their beloved and un-Linux-y way of bundling all dependencies, as if they actually ever enjoyed doing that. In reality, it's not the app authors, but the Flathub maintainers and submission reviewers who do everything in their power to prevent Flathub from turning into a typical package manager. Since they ended up with a system where every new extension to the Freedesktop SDK somehow places a burden on the maintainers, they're quick to shut down everything they consider a bad idea, including a Tup package I submitted. What a great job for people who always wanted to be gatekeepers and arbiters of good ideas. If your system treats CMake as one of two blessed build systems that get first-class support, we already fundamentally disagree on basic questions of good taste.
Because even the build stages of individual modules are sandboxed from each other, the only way to persist a module's build outputs for further modules is by installing them into the same /app/ path that the final application is supposed to live in. Since most of these foundational modules will be libraries, /app/ will be full of C header files, static library files, and library-related tooling that you don't want to bloat your shipped package. Docker solves this with multi-stage builds: After building your app into an image full of all build-time dependencies and other artifacts vomited out by your build system, you can start from a fresh, minimal base image and selectively copy over only the files your app actually needs to run. Flatpak solves this in the opposite way, merely letting you manually clean up after your dependencies in the end. At least they support wildcards…
So you've built your Flatpak, but it has an issue that your native build doesn't have and it's time for some debugging. You open up a shell into the image, fire up gdb… and don't get debug symbols despite your build definitely emitting them. The documentation mentions that debug symbols are placed into a separate package, just like Arch Linux's makepkg does it, but the suggested command line to install them doesn't work:
error: No remote refs found for ‘$FLATPAK_ID’
The apparently correct command line can only be found in third-party blog posts. Pulling the package directly out of the builder cache is as random as it gets for someone not deeply familiar with the system.
Before you publish your package, you might want to inspect the bundle to make sure that your --cleanup entries actually covered all the library bloat you suddenly have to care about. Flatpak also adds a few slight annoyances there:
You could look into the build directory (not the repo directory! Very important difference! 🤪) you pass to flatpak-builder, but it also contains all the debug files and source code.
You could open the --devel shell and inspect the contents of /app/. This shell environment is rather minimal and misses both a lot of typical Linux userland tools and (of course) a package manager, but ls and find work and can do the job.
So if all of Flatpak feels like Docker anyway, why isn't it built on top of Docker to begin with? Instead, we got what amounts to a worse copy that doesn't innovate in any way I can notice. Why throw away compatibility with all of Docker's existing tooling just to gain hash-based deduplication at the file level for a couple of images? How can they seriously use a tagline like "Git for apps", which only makes sense for very, very loose definitions of "Git"?
Or maybe all the innovation went into the portals that make this thing work at all, and have at least this little game work indistinguishably from a native build past the initial load time…
… except when parts of it don't! 🤣 Audio is only supported through PulseAudio, which you might not have installed on Arch Linux. Thus, Flatpak ironically enforces another dependency on the host system that the app itself might not have needed.
Alright, you've submitted your app, incorporated the changes requested by the reviewers, waited a while, and now your app is live and has its own page on Flathub. You'd think I'd be done ranting at this point, but no:
You give them nice lossless PNG screenshots and icons, and they convert both of them to lossy WebP with clearly visible compression artifacts. How about some trust in the fact that people who give you small PNG files know what they're doing, verified by a programmatic check of whether such a lossy recompression even noticeably improves the file size, instead of blindly blowing up our icon to 4.58× the size of the original PNG? Source-quality images are way more important to me than brand colors.
The screenshot area on the app pages has a fixed height of 468 pixels. Is this some kind of a sick joke? How could anyone look at that height and not go "nah, that looks wrong, 12 more pixels and we'd be VGA-compatible, barely makes a difference anyway"?
That leaves us with two choices:
Crop those 12 pixels out of the raw game screenshots I originally wanted to have there, or
The latter probably isn't the worst idea as it also gives us a chance to show off the 16×16 variant of the icon at its intended size. But I sure didn't immediately find a KDE theme that both has 16-pixel window icons (unlike Breeze's 15 pixels at the Small size) and doesn't have obscenely large and asymmetric shadows (unlike Materia or Klassy). Shoutout to the Arc theme for matching all these constraints!
Might as well try converting these images to lossless WebP while I'm at it, in the hope that they then leave them alone… but nope, they still get lossily recompressed! 🤪 You know what, I'm not gonna bother with the rest of their guidelines, this is an embarrassment.
Finally, game controller support comes with a very similar asterisk. By default, it's disabled just like any other piece of hardware, and the documentation tells you to specify --device=input to activate it. However, this specific permission is a fairly recent development in Flatpak terms and thus isn't widely available yet? Therefore, the reviewers don't yet allow it in manifests, and your only alternative is a blanket permission for all devices in the user's system. But then, Flathub lists your app as having potentially unsafe user device (and even webcam!) access, even though you had no alternative except for disabling game controller support. What a nice sandbox they have there… 🙄
If that's the supposed future of shipping programs on Linux, they've sure made this dev look back into the past with newfound fondness. I'm now more motivated than ever to separately package Shuusou Gyoku for every distribution, if only to see whether there's just a single distro out there whose packaging system is worse than Flatpak. But then again, packaging this game for other distros is one of the most obvious contribution-ideas there is.
In the end though, the fact that we need to patch Pango to correctly render MS Gothic means that there is a point to shipping Shuusou Gyoku as a Flatpak, beyond just having a single package that works on every distro. And with a download size of 3.4 MiB and an installed size of 6.4 MiB, Shuusou Gyoku almost exemplifies the ideal use case of Flatpak: Apart from miniaudio, BLAKE3, the IPAMonaGothic font, the temporary libc++, and the patched Pango, all other dependencies of the Linux port happen to be part of the Freedesktop runtime and don't add more bloat to the system.
And so, we finally have a 100% native Linux port of Shuusou Gyoku, working and packaged, after 36 pushes! 🎉 But as usual, there's always that last bit of optional work left. The three biggest remaining portability gaps are
guaranteed support for ARM CPUs, which currently fail to build the project on Flathub due to a Tup issue, and who knows what other issues there might be,
Despite 📝 spending 10 pushes on accurate waveform BGM, MIDI support seems to be the most worthwhile feature out of the three. The whole point of the BGM work was that Linux doesn't have a native MIDI synth, so why should packagers or even the users themselves jump through the hoops of setting up some kind of softsynth if it most likely won't sound remotely close to an SC-88Pro? But if you already did, the lack of support might indeed seem unexpected.
But as described in the issue, MIDI support can also mean a "Windows-like plug-and-play" experience, without downloading a BGM pack. Despite the resulting unauthentic sound, this might also be a worthwhile thing to fund if we consider that 14 of the 17 YouTube channels that have uploaded Shuusou Gyoku videos since P0275 still had MIDI playing through the Microsoft GS Wavetable Synth and didn't bother to set up a BGM pack.
Finally, we might want to patch IPAMonaGothic at some point down the line. While a fix for the ascent and descent values that achieves perfect glyph placement without relying on hinting hacks would merely be nice to have, matching the Unicode coverage of its embedded bitmaps with MS Gothic will be crucial for non-ASCII Latin script translations. IPAMonaGothic's outlines do cover the entire Latin-1 Supplement block, but the font is missing embedded bitmaps for all of this block's small letters. Since the existing outlines prevent any glyph fallback in both Fontconfig and GDI, letters like ä, ö, ü, and ñ currently render as spaces.
Like most Japanese fonts from the Shift-JIS era, IPAMonaGothic also suffers from Greek and Cyrillic glyphs being full-width. For translations into those scripts, we'd probably just hunt for a different font, but that's not worth doing for Latin scripts that are only missing a few special characters.
Ideally, I'd like to apply these edits by modifying the embedded bitmaps in a more controlled, documented, and diffable way and then recompiling the font using a pipeline of some sort. The whole field of fonts often feels impenetrable because the usual editing workflow involves throwing a binary file into a bulky GUI tool and writing out a new binary file, and it doesn't have to be this way. But it looks like I'd have to write key parts of that pipeline myself:
The venerable ttx provides no comfort features for embedded bitmaps and simply dumps their binary representation as hex strings.
The more modern UFO format does specify embedded images, but both of the biggest implementations (defcon and ufoLib2) just throw away any embedded bitmaps, and thus, the whole selling point of such tools.
That would increase the price of translations by about one extra push if you all agree that this is a good idea. If not, then we just go for the usual way of patching the .ttf file after all. In any case, we then get to host the edited font at a much nicer place than the Wayback Machine.
tupblocks (import std; support)
Seihou / Shuusou Gyoku (Code cleanup + Game logic portability, part 2/? + Fixes for bugs and landmines)
Seihou / Shuusou Gyoku (Getting pbg's code through static analysis)
Seihou / Shuusou Gyoku (Game logic portability, part 3/? + Graphics refactoring, part 3/5: Preparations and colors)
Seihou / Shuusou Gyoku (Graphics refactoring, part 4/5: Geometry, enumeration, and software rendering)
Seihou / Shuusou Gyoku (Graphics refactoring, part 5/5: Clipping, sprites, and initialization)
Seihou / Shuusou Gyoku (Cross-platform APIs, part 3/?: Main loop + Main menu refactoring)
Seihou / Shuusou Gyoku (Cross-platform APIs, part 4/?: SDL_Renderer backend)
Seihou / Shuusou Gyoku (Window and scaling modes, part 1/2)
Seihou / Shuusou Gyoku (Window and scaling modes, part 2/2 + Hotkeys) + Website (Adding missing money amounts to the log)
💰 Funded by:
Ember2528, [Anonymous]
🏷️ Tags:
And then, the Shuusou Gyoku renderer rewrite escalated to another 10-push monster that delayed the planned Seihou Summer™ straight into mid-fall. Guess that's just how things go these days at my current level of quality. Testing and polish made up half of the development time of this new build, which probably doesn't surprise anyone who has ever dealt with GPUs and drivers…
But first, let's finally deploy C++23 Standard Library Modules! I've been waiting for the promised compile-time improvements of modules for 4 years now, so I was bound to jump at the very first possible opportunity to use them in a project. Unfortunately, MSVC further complicates such a migration by adding one particularly annoying proprietary requirement:
Our own code wants to use both static analysis and modules.
MSVC therefore insists that the modules are also compiled with static analysis enabled.
But this in turn forces every other translation unit that consumes these modules, including pbg's code, to be built with static analysis enabled as well, …
… which means we're now faced with hundreds of little warnings and C++ Core Guideline violations from pbg's code. Sure, we could just disable all warnings when compiling pbg's source files and get on with rolling out modules, because they would still count as "statically analyzed" in this case. But that's silly. As development continues and we write more of our own modern code, more and more of it will invariably end up within pbg's files, merging and intertwining with original game code. Therefore, not analyzing these files is bound to leave more and more potential issues undetected. Heck, I've already committed a static initialization order fiasco by accident that only turned into an actual crash halfway through the development of these 10 pushes. Static analysis would have caught that issue.
So let's meet in the middle. Focus on a sensible subset of warnings that we would appreciate in our own code or that could reveal bugs or portability issues in pbg's code, but disable anything that would lead to giant and dangerous refactors or that won't apply to our own code. For example, it would sure be nice to rewrite certain instances of goto spaghetti into something more structured, but since we ourselves won't use goto, it's not worth worrying about within a porting project.
After deduplicating lots of code to reduce the sheer number of warnings, the single biggest remaining group of issues were the C-style casts littered throughout the code. These combine the unconstrained unsafety of C with the fact that most of them use the classic uppercase integer types from <windows.h>, adding a further portability aspect to this class of issues.
Perhaps the biggest problem with them, however, is that casts are a unary operator with its own place in the precedence hierarchy. If you don't surround them with even more brackets to indicate the exact order of operations, you can confuse and mislead the hell out of anyone trying to read your code. This is how we end up with the single most devious piece of arithmetic I've found in this game so far:
If you don't look at vintage C code all day, this cast looks redundant at first glance. Why would you separately cast the result of this expression to the type of the receiving variable? However, casting has higher precedence than division, so the code actually downcasts the dividend, (t->d+4), not the result of the division. And why would pbg do that? Because the regular, untyped 4 is implicitly an int, C promotes t->d to int as well, thus avoiding the intended 8-bit overflow. If t->d is 252, removing the cast would therefore result in
((int{ 252 } + int{ 4 }) / 8) =
256 / 8 =
32, not the 0 we wanted to have. And since this line is part of the sprite selection for VIVIT-captured-'s feather bullets, omitting the cast has a visible effect on the game:
The first file in GRAPH.DAT explains what we're seeing here.
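The difference between the two readings can be checked at compile time. The following is a small standalone sketch with hypothetical function names, not code from the game; it only reproduces the arithmetic described above for a value of 252.

```cpp
#include <cstdint>

// The cast binds tighter than the division: The promoted sum is truncated
// back to 8 bits first, and only then divided.
constexpr int cast_then_divide(uint8_t d)
{
	return ((uint8_t)(d + 4) / 8);
}

// Without the cast, the sum stays a promoted int and never wraps around.
constexpr int divide_only(uint8_t d)
{
	return ((d + 4) / 8);
}

static_assert(cast_then_divide(252) == 0, "8-bit wraparound before division");
static_assert(divide_only(252) == 32, "no wraparound without the cast");
```

For every input below 252, both functions return the same value, which is what makes this kind of bug so easy to miss.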
So let's add brackets and replace the C-style cast with a C++ static_cast to make this more readable:
const auto d = (static_cast<uint8_t>(t->d + 4) / 8);
But that only addresses the precedence pitfall and doesn't tell us why we need that cast in the first place. Can we be more explicit?
const auto d = (((t->d + 4) & 0xFF) / 8);
That might be better, but still assumes familiarity with integer promotion for that mask to not appear redundant. What's the strongest way we could scream integer promotion to anyone trying to touch this code?
const auto d = (Cast::down_sign<uint8_t>(t->d + 4) / 8);
Of course, I also added a lengthy comment above this line.
Now we're talking! Cast::down_sign() uses static_asserts to enforce that its argument must be both larger and differently signed than the target type inside the angle brackets. This unmistakably clarifies that we want to truncate a promoted integer addition because the code wouldn't even compile if the argument was already a uint8_t. As such, this new set of casts I came up with goes even further in terms of clarifying intent than the gsl::narrow_cast() proposed by the C++ Core Guidelines, which is purely informational.
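For readers who want to play around with the idea, such a cast could be sketched as follows. This is a hypothetical re-creation based purely on the two constraints described above, not a copy of the real Cast::down_sign(); the template parameter names and error messages are my own.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

namespace Cast {

// Hypothetical sketch: Truncates [v] to the smaller and differently signed
// type [To], and refuses to compile for any other combination of types.
template <typename To, typename From> constexpr To down_sign(From v) {
	static_assert(std::is_integral_v<To> && std::is_integral_v<From>);
	static_assert(
		(sizeof(From) > sizeof(To)),
		"argument must be larger than the target type"
	);
	static_assert(
		(std::is_signed_v<From> != std::is_signed_v<To>),
		"argument must be differently signed than the target type"
	);
	return static_cast<To>(v);
}

} // namespace Cast

constexpr std::uint8_t example_d = 252;

// `example_d + 4` is a promoted `int`, so this compiles and truncates to 0.
static_assert((Cast::down_sign<std::uint8_t>(example_d + 4) / 8) == 0);

// This would fail to compile because the argument is already a `uint8_t`:
// Cast::down_sign<std::uint8_t>(example_d);
```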
OK, so replacing C-style casts is better for readability, but why care about it during a porting project? Wouldn't it be more efficient to just typedef the <windows.h> types for the Linux code and be done with it? Well, the ECL and SCL interpreters provide another good reason not to do that:
In these instances, the DWORD type communicates that this codebase originally targeted Windows, and implies that the cmd buffer stores these 32-bit values in little-endian format. Therefore, replacing DWORD with the seemingly more portable uint32_t would actually be worse as it no longer communicates the endianness assumption. Instead, let's make the endianness explicit:
No surprises once we port this game to a big-endian system – and much fewer characters than a pedantic reinterpret_cast, too.
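As a rough illustration of what explicit endianness can look like, here is a minimal sketch of a byte-wise little-endian read. The read_u32_le() name and the sample buffer are assumptions made for this example and are not necessarily the notation used in the actual codebase.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: Assembles a 32-bit value from 4 bytes stored in
// little-endian order. Because it only operates on individual bytes, it
// returns the same result on little-endian and big-endian hosts, and doesn't
// care about the alignment of [p].
constexpr std::uint32_t read_u32_le(const std::uint8_t* p) {
	return (
		(std::uint32_t{ p[0] } <<  0) |
		(std::uint32_t{ p[1] } <<  8) |
		(std::uint32_t{ p[2] } << 16) |
		(std::uint32_t{ p[3] } << 24)
	);
}

// One opcode byte followed by a 32-bit little-endian parameter.
constexpr std::uint8_t EXAMPLE_CMD[] = { 0x01, 0x78, 0x56, 0x34, 0x12 };
static_assert(read_u32_le(&EXAMPLE_CMD[1]) == 0x12345678);
```

On little-endian targets, current compilers typically collapse these shifts into a single load, so spelling out the byte order shouldn't cost anything at run time.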
With that and another pile of improvements for my Tup building blocks, we finally get to deploy import std; across the codebase, and improve our build times by…
…not exactly the mid-three-digit percentages I was hoping for. Previously, a full parallel compilation of the Debug build took roughly 23.9s on my 6-year-old 6-core Intel Core i5-8400T. With modules, we now need to compile the C++ standard library a single time on every from-scratch rebuild or after a compiler version update, which adds an unparallelizable ~5.8s to the build time. After that though, all C++ code compiles within ~12.4s, yielding a still decent 92% speedup for regular development. 🎉 Let's look more closely into these numbers and the resulting state of the codebase:
Expecting three-digit speedups was definitely a bit premature as there were still several game-code translation units that #include <windows.h>. The subsequent graphics work removed a few more of these instances, which did bring the speedup into the three-digit range with a compilation time of ~11.6s by the end of P0295.
Supporting import-then-#include is crucial for supporting gradual migrations from headers to modules, but this is one of the most challenging features for compilers to implement, with both MSVC and Clang struggling. By now, MSVC admirably seems to handle all of the cases I ran into, except for this one:
import std.compat;
inline bool LaserHITCHK(/* … */)
{
	// […]
	// Causes the compiler to instantiate the overloaded C++ version of
	// std::abs() via the global namespace re-export in `std.compat`,
	// not the C version.
	w = abs(-sinl(d,tx) + cosl(d,ty));
	// […]
}
// Later, in another header file included via <windows.h>…
// This header defines the C version of abs(), thus causing a duplicate
// definition error.
#include <stdlib.h>
The best solution here is to simply not define functions in headers. We could also blame this one on the std.compat module which re-exports the C standard library into the global namespace and thus creates these duplicated definitions in the first place, but come on, std::uint32_t is 13 characters. That is way too much typing and screen space for referring to basic fixed-size integer types.
📝 As we've thoroughly explored last time, Tup still ain't batching. Could it be that Tup's paradigm of spawning one cl.exe process per translation unit prevents us from using modules to their full throughput potential? And would the /cgthreads1 flag help in this regard? Let's do some profiling using cl.exe's undocumented /Bt flag to find out how the compilation times are distributed between the parsing and semantic analysis frontend (c1*.dll) and the code generation backend (c2.dll):
Game code (60 TUs around migration, 58 TUs at end of P0295)
Cumulative frontend and backend compilation times of a Debug build on my system, as reported by /Bt, together with the total real time. Since the library code is all C and therefore unaffected by modules, the numbers are the average of the builds at all three tested commits.
So yes, the Tup tax is real and adds somewhere between 30 and 40 ms per translation unit to the compilation time. cl.exe is simply better at parallelizing itself than any attempt to parallelize it from the outside. It feels inevitable that I'll eventually just fork Tup and add this batching functionality myself; the entire trajectory of my development career has been pointing towards that goal, and it would be the logical conclusion of my C++ build frustrations. But certainly not any time soon; the cost is not too high all things considered, I update libraries maybe once every second push, and I'll have done enough build system work for the foreseeable future after the Linux port is done.
These numbers also explain why /cgthreads1 has no measurable performance benefit for this codebase. You might think it's a good idea because Tup spawns one parallel cl.exe process per CPU core and we can't get any more real parallelism in such a situation. However, that's not what this option does – it only limits the number of code generation threads, and as the numbers show, code generation is the opposite of our bottleneck.
However, these compile time improvements come at the cost of modules completely breaking any of the major LSPs at this point in time:
The C++ extension for Visual Studio Code crashes with this error in any file that includes several headers in addition to modules:
IntelliSense process crash detected: handle_initialize
Quick info operation failed: FE: 'Compiler exited with error - No IL available'
Consequently, it no longer provides any IntelliSense for either header or standard library code.
The big Visual Studio IDE politely remarks that C++ IntelliSense support for C++20 Modules is currently experimental and then silently doesn't provide IntelliSense for anything either.
When given a compile_commands.json from Tup via tup compiledb, clangd does continue to provide IntelliSense for both header code and the C++ standard library, but its actual lack of module support puts so many false-positive squiggly lines all over the code that it's not worth using either.
But in the end, the halved compile times during regular development are well worth sacrificing IntelliSense for the time being… especially given that I am the only one who has to live in this codebase. 🧠 And besides, modules bring their own set of productivity boosts to further offset this loss: We can now freely use modern C++ standard library features at a minuscule fraction of their usual compile time cost, and get to cut down the number of necessary #include directives. Once you've experienced the simplicity of import std;, headers and their associated micro-optimization of #include costs immediately feel archaic. Try the equally undocumented /d1reportTime flag to get an idea of the compile time impact of function definitions and template instantiations inside headers… I've definitely been moving quite a few of those to .cpp files within these 10 pushes.
However, it still felt like the earliest possible point in time where doing this was feasible at all. Without LSP support, modules still feel way too bleeding-edge for a feature that was added to the C++ standard 4 years ago. This is why I only chose to use them for covering the C++ standard library for now, as we have yet to see how well GCC or Clang handle it all for the Linux port. If we run into any issues, it makes sense to polyfill any workarounds as part of the Tup building blocks instead of bloating the code with all the standard library header inclusions I'm so glad to have gotten rid of.
Well, almost all of them, because we still have to #include <assert.h> and <stddef.h>: Modules can't expose preprocessor macros, and C++23 has no macro-less alternative for assert() and offsetof(). 🤦 [[assume()]] exists, but it's the exact opposite of assert(). How disappointing.
As expected, static analysis also brought a small number of pbg code pearls into focus. This list would have fit better into the static analysis section, but I figured that my audience might not necessarily care about C++ all that much, so here it is:
Shuusou Gyoku only ever seeds its RNG in three places:
At program startup (with 0),
immediately before the game picks a random attract replay after 10 seconds of no input in the top level of the menu (with the current system time in milliseconds), and, obviously,
when starting a replay (with the replay's recorded seed), which ironically counteracts the above seed immediately after the game selected the replay.
Since neither the main menu nor any of the three weapon previews utilize the RNG, any new unrecorded round started immediately after launching the .exe will always start with a seed of 0. Similarly, recorded rounds calculate their seed from the next two RNG numbers, and will always start with a seed of 347 in the same situation. RNG manipulation is therefore as simple as crafting a replay file with the intended seed, starting its playback, and immediately quitting back to the main menu. The stage of the crafted replay only matters insofar as Stage 6 starts out by reading 320 numbers from the RNG to initialize its wavy clock and shooting star animations, so you'd preferably use any other stage as all of them take a while until they read their first random number.
Of course, even a shmup with a fixed seed is only as deterministic as the input it receives from the player, and typical human input deviations will quickly add more randomness back into the game.
The effective cap of stage enemies, player shots, enemy bullets, lasers, and items is 1 entity smaller than their static array sizes would suggest. pbg did this to work around a potential out-of-bounds write in a generic management function.
The in-game score display no longer overflows into negative numbers once the score exceeds (2³¹ - 1) points. Shuusou Gyoku did track the score using a signed 64-bit integer, but pbg accidentally used a 32-bit specifier for sprintf().
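The score bug is easy to reproduce in isolation. Since passing a 64-bit integer to a 32-bit format specifier is undefined behavior, the sketch below emulates the old result with an explicit truncation instead. Both function names and the plain format strings are assumptions; the game's real format string is not shown here.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Fixed version: The specifier matches the 64-bit type of the score.
inline std::string format_score(std::int64_t score) {
	char buf[32];
	std::snprintf(buf, sizeof(buf), "%" PRId64, score);
	return buf;
}

// Emulation of the bug: A 32-bit specifier effectively only sees the lower
// 32 bits of the score. The conversion is modular as of C++20, and behaves
// the same way on all mainstream compilers before that.
inline std::string format_score_truncated(std::int64_t score) {
	char buf[32];
	const auto lower = static_cast<std::int32_t>(score);
	std::snprintf(buf, sizeof(buf), "%" PRId32, lower);
	return buf;
}
```

Feeding 2³¹ into both functions yields 2147483648 from the first and -2147483648 from the second.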
Alright, on to graphics! With font rendering and surface management mostly taken care of last year, the main focus for this final stretch was on all the geometric shapes and color gradients. pbg placed a bunch of rather game-specific code in the platform layer directly next to the Direct3D API calls, including point generation for circles and even the colors of gradient rectangles, gradient polygons, and the Music Room's spectrum analyzer. We don't want to duplicate any of this as part of the new SDL graphics layer, so I moved it all into a new game-level geometry system. By placing both the 8-bit and 16-bit approaches next to each other, this new system also draws more attention to the different approaches used at each bit depth.
So far, so boring. Said differences themselves are rather interesting though, as this refactor uncovered all of the remaining inconsistencies between the two modes:
In 8-bit mode, the game draws circles by writing pixels along the accurate outline into the framebuffer. The hardware-accelerated equivalent for the 16-bit mode would be a large unwieldy point list, so the game instead approximates circles by drawing straight lines along a regular 32-sided polygon:
It's not like the APIs prevent the 16-bit mode from taking the same approach as the 8-bit mode, so I suppose that pbg profiled this and concluded that lines offloaded to the GPU performed better than locking the framebuffer and writing pixels? Then again, given Shuusou Gyoku's comparatively high system requirements…
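The polygon approximation boils down to very little code. The following is a minimal sketch that uses floating-point angles for simplicity; the Point type, the function name, and the use of radians are assumptions for this example, and the game's own point generation may well work with different angle units.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

struct Point {
	float x;
	float y;
};

constexpr std::size_t CIRCLE_SIDES = 32;
constexpr float PI = 3.14159265358979323846f;

// Returns the corners of a regular 32-sided polygon as a closed line strip.
// The first point is repeated at the end so that a single line strip draw
// call closes the outline.
inline std::array<Point, (CIRCLE_SIDES + 1)> circle_line_strip(
	Point center, float radius
) {
	std::array<Point, (CIRCLE_SIDES + 1)> ret;
	for(std::size_t i = 0; i <= CIRCLE_SIDES; i++) {
		// The modulo makes the last point bit-identical to the first one.
		const float step = static_cast<float>(i % CIRCLE_SIDES);
		const float angle = ((2.0f * PI * step) / CIRCLE_SIDES);
		ret[i] = {
			(center.x + (radius * std::cos(angle))),
			(center.y + (radius * std::sin(angle))),
		};
	}
	return ret;
}
```

With 33 points per circle regardless of its radius, the vertex count stays small and constant, which is the main appeal over a pixel-exact point list.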
There's an off-by-one error in the playfield clipping region for Direct3D-rendered shapes, which ends at (511, 479) instead of (512, 480):
The fix is obvious.
There's an off-by-one error in the 8-bit rendering code for opaque rectangles that causes them to appear 1 pixel wider than in 16-bit mode. The red backgrounds behind the currently entered score are the only such boxes in the entire game; the transparent rectangles used everywhere else are drawn with the same width in both modes.
The game code also clearly asks for 400 and 14 pixels, respectively.
If we move the nice and accurate 8-bit circle outlines closer to the edge of the playfield, we discover, you guessed it, yet another off-by-one error:
No circle pixels at the right edge of the playfield. Obviously, I had to fix bug #2, the off-by-one error in the clipping region for Direct3D-rendered shapes, in order for the line approximation to not also get clipped at the same coordinate.
The final off-by-one clipping error can be found in the filled circle part of homing lasers in 8-bit mode, but it's so minor that it doesn't deserve its own screenshot.
Now that all of the more complex geometry is generated as part of game code, I could simplify most of the engine's graphics layer down to the classic immediate primitives of early 3D rendering: Line strips, triangle strips, and triangle fans, although I'm retaining pbg's dedicated functions for filled boxes and single gradient lines in case a backend can or needs to use special abstractions for these. (Hint, hint…)
So, let's add an SDL graphics backend! With all the earlier preparation work, most of the SDL-specific sprite and geometry code turned out as a very thin wrapper around the, for once, truly simple function calls of the DirectMedia layer. Texture loading from the original color-keyed BMP files, for example, turned into a sequence of 7 straight-line function calls, with most of the work done by SDL_LoadBMP_RW(), SDL_SetColorKey(), and SDL_CreateTextureFromSurface(). And although SDL_LoadBMP_RW() definitely has its fair share of unnecessary allocations and copies, the whole sequence still loads textures ~300 µs faster than the old GDI and DirectDraw backend.
Being more modern than our immediate geometry primitives, SDL's triangle renderer only either renders vertex buffers as triangle lists or requires a corresponding index buffer to realize triangle strips and fans. On paper, this would require an additional memory allocation for each rendered shape. But since we know that Shuusou Gyoku never passes more than 66 vertices at once to the backend, we can be fancy and compute two constant index buffers at compile time. 🧠 SDL_RenderGeometryRaw() is the true star of the show here: Not only does it allow us to decouple position and color data compared to SDL's default packed vertex structure, but it even allows the neat size optimization of 8-bit index buffers instead of enforcing 32-bit ones.
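The compile-time index buffer idea can be sketched like this. Only the 66-vertex limit and the 8-bit index size come from the text above; all names are hypothetical and the real implementation may be structured differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t VERTICES_MAX = 66;
constexpr std::size_t TRIANGLES_MAX = (VERTICES_MAX - 2);
using IndexBuffer = std::array<std::uint8_t, (TRIANGLES_MAX * 3)>;

// Triangle fan: Every triangle shares vertex 0.
constexpr IndexBuffer make_fan_indices() {
	IndexBuffer ret{};
	for(std::size_t i = 0; i < TRIANGLES_MAX; i++) {
		ret[(i * 3) + 0] = 0;
		ret[(i * 3) + 1] = static_cast<std::uint8_t>(i + 1);
		ret[(i * 3) + 2] = static_cast<std::uint8_t>(i + 2);
	}
	return ret;
}

// Triangle strip: Every triangle reuses the last two vertices of the previous
// one. This sketch ignores winding order, which only matters with culling.
constexpr IndexBuffer make_strip_indices() {
	IndexBuffer ret{};
	for(std::size_t i = 0; i < TRIANGLES_MAX; i++) {
		ret[(i * 3) + 0] = static_cast<std::uint8_t>(i + 0);
		ret[(i * 3) + 1] = static_cast<std::uint8_t>(i + 1);
		ret[(i * 3) + 2] = static_cast<std::uint8_t>(i + 2);
	}
	return ret;
}

constexpr IndexBuffer FAN_INDICES = make_fan_indices();
constexpr IndexBuffer STRIP_INDICES = make_strip_indices();

// Shapes with fewer vertices simply use a prefix of the same buffer.
constexpr std::size_t index_count_for(std::size_t vertex_count) {
	return ((vertex_count < 3) ? 0 : ((vertex_count - 2) * 3));
}

static_assert(FAN_INDICES.back() == (VERTICES_MAX - 1)); // Fits into 8 bits
static_assert(index_count_for(VERTICES_MAX) == FAN_INDICES.size());
```

A draw call would then pass `FAN_INDICES.data()`, `index_count_for(n)`, and an index size of 1 as the last three arguments of SDL_RenderGeometryRaw(), with no per-shape allocation.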
By far the funniest porting solution can be found in the Music Room's spectrum analyzer, which calls for 144 1-pixel gradient lines of varying heights. SDL_Renderer has no API for rendering lines with multiple colors… which means that we have to render them as 144 quads with a width of 1 pixel.
The wireframe was generated via a raw glPolygonMode(GL_FRONT_AND_BACK, GL_LINE);
But all these simple abstractions have to be implemented somehow, and this is where we get to perhaps the biggest technical advantage of SDL_Renderer over pbg's old graphics backend. We're no longer locked into just a single underlying graphics API like Direct3D 2, but can choose any of the APIs that the team implemented the high-level renderer abstraction for. We can even switch between them at runtime!
On Windows, we have the choice between 3 Direct3D versions, 2 OpenGL versions, and the software renderer. And as we're going to see, all we should do here is define a sensible default and then allow players to override it in a dedicated menu:
Huh, we default to OpenGL 2.1? Aren't we still on Windows?
Since such a menu is pretty much asking for people to try every GPU ever with every one of these APIs, there are bound to be bugs with certain combinations. To prevent the potentially infinite workload, these bugs are exempt from my usual free bugfix policy as long as we can get the game working on at least one API without issues. The new initialization code should be resilient enough to automatically fall back on one of SDL's other driver APIs in case the default OpenGL 2.1 fails to initialize for whatever reason, and we can still fight about the best default API.
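The fallback logic itself is independent of SDL and can be sketched generically. In the snippet below, `try_init` is a hypothetical stand-in for whatever attempts to create a renderer for a given driver name; in SDL 2, the list of names would come from SDL_GetNumRenderDrivers() and SDL_GetRenderDriverInfo(). This is an assumption about the general shape of such code, not the game's actual initialization routine.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Tries the preferred driver first and then every other available driver in
// list order. Returns the name of the first driver that initialized
// successfully, or nothing if all of them failed.
inline std::optional<std::string> init_with_fallback(
	std::string_view preferred,
	const std::vector<std::string>& available,
	const std::function<bool(std::string_view)>& try_init
) {
	if(try_init(preferred)) {
		return std::string{ preferred };
	}
	for(const auto& driver : available) {
		if((driver != preferred) && try_init(driver)) {
			return driver;
		}
	}
	return std::nullopt;
}
```

Returning the name of the driver that actually got initialized allows the options menu to display the real state instead of the configured wish.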
But let's assume the hopefully usual case of a functional GPU with at least decently written drivers where most of the APIs will work without visible issues. Which of them is the most performant/power-saving one on any given system? With every API having a slightly different idea about 3D rendering, there are bound to be some performance differences, and maybe these even differ between GPUs. But just how large would they be?
The answer is yes:
FPS (lowest | median) / API
Intel Core i5-2520M (2011) Intel HD Graphics 3000 (2011)
Computed using pbg's original per-second debugging algorithm. Except for the Intel i7-4790 test, all of these use SDL's default geometry scaling mode as explained further below. The GeForce GTX 1070 could probably be twice as fast if it weren't inside a laptop that thermal-throttles after about 10 seconds of unlimited rendering.
The two tested replays decently represent the entire game: In Stage 6, the software renderer frequently drops into low 1-digit FPS numbers as it struggles with the blending effects used by the Laser shot type's bomb, whereas GPUs enjoy the absence of background tiles. In the Extra Stage, it's the other way round: The tiled background and a certain large bullet cancel emphasize the inefficiency of unbatched rendering on GPUs, but the software renderer has a comparatively much easier time.
And that's why I picked OpenGL as the default. It's either the best or close to the best choice everywhere, and in the one case where it isn't, it doesn't matter because the GPU is powerful enough for the game anyway.
If those numbers still look way too low for what Shuusou Gyoku is (because they kind of do), you can try enabling SDL's draw call batching by setting the environment variable SDL_RENDER_BATCHING to 1. This at least doubles the FPS for all hardware-accelerated APIs on the Intel UHD 630 in the Extra Stage, and astonishingly turns Direct3D 11 from the slowest API into by far the fastest one, speeding it up by 22× for a median FPS of 1617. I only didn't activate batching by default because it causes stability issues with OpenGL ES 2.0 on the same system. But honestly, if even a mid-range laptop from 13 years ago manages a stable 60 FPS on the default OpenGL driver while still scaling the game, there's no real need to spend budget on performance improvements.
If anything, these numbers justify my choice of not focusing on a specific one of these APIs when coding retro games. There are only very few fields that target a wider range of systems with their software than retrogaming, and as we've seen, each of SDL's supported APIs could be the optimal choice on some system out there.
📝 Last year, it seemed as if the 西方Project logo screen's lens ball effect would be one of the more tricky things to port to SDL_Renderer, and that impression was definitely accurate.
The effect works by capturing the original 140×140 pixels under the moving lens ball from the framebuffer into a temporary buffer and then overwriting the framebuffer pixels by shifting and stretching the captured ones according to a pre-calculated table. With DirectDraw, this is no big deal because you can simply lock the framebuffer for read and write access. If it weren't for the fact that you need to either generate or hand-write different code for every supported bit depth, this would be one of the most natural effects you could implement with such an API. Modern graphics APIs, however, don't offer this luxury because it didn't take long for this feature to become a liability. Even 20 years ago, you'd rather write this sort of effect as a pixel shader that would directly run on the GPU in a much more accelerated way. Which is a non-starter for us – we sure ain't breaking SDL's abstractions to write a separate shader for every one of SDL_Renderer's supported APIs just for a single effect in the logo screen.
As such, SDL_Renderer doesn't even begin to provide framebuffer locking. We can only get close by splitting the two operations:
Writing can only be done by getting the new pixels onto a texture first. Which in turn can either be done by updating a rectangular area with prepared pixel data from system memory, or locking a rectangular area and writing the pixels into a buffer. However, even SDL_LockTexture() is explicitly labeled as write-only. By returning an effectively uninitialized texture, you're forced to software-render your entire scene onto this texture anyway after locking.
This little detail in the API contract makes locking entirely unusable for this lens effect. Its code does not write to every pixel within the 140×140 area and relies on the unwritten pixels retaining their rendered color, just as you would expect regular memory to behave. If we are forced to prepare the full 140×140 pixels on the CPU, we might as well just go for the simpler and faster SDL_UpdateTexture().
Also, if SDL says "write-only access", does this mean we can't even be sure that the locked buffer is readable after we wrote some pixels and before we unlock the texture again? We'd only have to look at the PC-98's GRCG for an example of memory-mapped I/O where reading and writing can work fundamentally differently depending on the mode register. The OpenGL driver implements texture locking by allocating a separate buffer in main memory and then uploading this modified buffer to the GPU via glTexSubImage2D() upon unlocking, but the docs do leave open the possibility for a driver to return a pointer to GPU memory we can't or shouldn't read from.
In fact, the only sanctioned way of reading pixels back from a texture involves turning the texture into a render target and calling SDL_RenderReadPixels().
Within these API limitations, we can now cobble together a first solution:
1) Rely on render-to-texture being supported. This is the case for all APIs that are currently implemented for SDL 2's renderer and SDL 3 even made support mandatory, but who knows if we ever get our hands on one of the elusive SDL 2 console ports under NDA and encounter one of them that doesn't support it…
2) Create a 640×480 texture that serves as our editable framebuffer.
3) Create a 140×140 buffer in main memory, serving as the input and output buffer for the effect. We don't need the full 640×480 here because the effect only modifies the pixels below the magnified 140×140 area and doesn't push them further outside.
4) Retain the original main-memory 140×140 buffer from the DirectDraw implementation that captures the current frame's pixels under the lens ball before we modify the pixels.
Each frame, we then
a) render the scene onto 2),
b) capture the magnified area using SDL_RenderReadPixels(), reading from 2) and writing to 3),
c) copy 3) to 4) using a regular memcpy(),
d) apply the lens effect by shifting around pixels, reading from 4) and writing to 3),
e) write 3) back to 2), and finally
f) use 2) as the texture for a quad that scales the texture to the size of the window.
Compared to the DirectDraw approach, this adds the technical insecurity of render-to-texture support, one additional texture, one additional fullscreen blit, at least one additional buffer, and two additional copies that comprise a round-trip from GPU to CPU and back. It surely would have worked, but the documentation suggestions and horror stories surrounding SDL_RenderReadPixels() put me off even trying that approach. Also, it would turn out to clash with an implementation detail we're going to look at later.
However, our scene merely consists of a 320×42 image on top of a black background. If we need the resulting pixels in CPU-accessible memory anyway, there's little point in hardware-rendering such a simple scene to begin with, especially if SDL lets you create independent software renderers that support the same draw calls but explicitly write pixels to buffers in regular system memory under your full control.
This simplifies our solution to the following:
1) Create a 640×480 surface in main memory, acting as the target surface for SDL_CreateSoftwareRenderer(). But since the potentially hardware-accelerated renderer drivers can't render pixels from such surfaces, we still have to
2) create an additional 640×480 texture in write-only GPU memory.
3) Retain the original main-memory 140×140 buffer from the DirectDraw implementation that captures the current frame's pixels under the lens ball before we modify the pixels.
Each frame, we then
a) software-render the scene onto 1),
b) capture the magnified area using a regular memcpy(), reading from 1) and writing to 3),
c) apply the lens effect by shifting around pixels, reading from 3) and writing to 1),
d) upload all of 1) onto 2), and finally
e) use 2) as the texture for a quad that scales the texture to the size of the window.
This cuts out the GPU→CPU pixel transfer and replaces the second lens pixel buffer with a software-rendered surface that we can freely manipulate. This seems to require more memory at first, but this memory would actually come in handy for screenshots later on. It also requires the game to enter and leave the new dedicated software rendering mode to ensure that the 西方Project image gets loaded as a system-memory "texture" instead of a GPU-memory one, but that's just two additional calls in the logo and title loading functions.
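The CPU side of this simplified approach, capturing the pixels under the lens and then writing the shifted ones back, can be sketched with plain buffers. The 32-bit pixel type and especially the table format (one captured-pixel index per lens pixel, with -1 meaning "leave untouched") are assumptions for this example; the game's pre-calculated table is laid out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Pixel = std::uint32_t;
constexpr std::size_t SURFACE_W = 640;
constexpr std::size_t SURFACE_H = 480;
constexpr std::size_t LENS_SIZE = 140;

struct LensState {
	// The software-rendered 640×480 surface.
	std::vector<Pixel> surface = std::vector<Pixel>(SURFACE_W * SURFACE_H);

	// The 140×140 capture buffer.
	std::vector<Pixel> captured = std::vector<Pixel>(LENS_SIZE * LENS_SIZE);

	// Hypothetical table format: For each lens pixel, the index of the
	// captured pixel to show there, or -1 to keep the rendered pixel.
	std::vector<std::int32_t> table = std::vector<std::int32_t>(
		(LENS_SIZE * LENS_SIZE), -1
	);
};

inline void apply_lens(LensState& s, std::size_t left, std::size_t top) {
	assert((left + LENS_SIZE) <= SURFACE_W);
	assert((top + LENS_SIZE) <= SURFACE_H);

	// Capture the area under the lens, row by row.
	for(std::size_t y = 0; y < LENS_SIZE; y++) {
		const Pixel* src = &s.surface[((top + y) * SURFACE_W) + left];
		Pixel* dst = &s.captured[y * LENS_SIZE];
		std::memcpy(dst, src, (LENS_SIZE * sizeof(Pixel)));
	}

	// Overwrite only those surface pixels that the table covers. All other
	// pixels retain their rendered color.
	for(std::size_t i = 0; i < s.table.size(); i++) {
		const std::int32_t src = s.table[i];
		if(src < 0) {
			continue;
		}
		const std::size_t x = (left + (i % LENS_SIZE));
		const std::size_t y = (top + (i / LENS_SIZE));
		s.surface[(y * SURFACE_W) + x] = s.captured[static_cast<std::size_t>(src)];
	}
}
```

The remaining two steps would then be a single SDL_UpdateTexture() call with the surface pixels, followed by an SDL_RenderCopy() of that texture to the window.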
Also, we would now software-render all of these 256 frames, including the fades. Since software rendering requires the 西方Project image to reside in main memory, it's hard to justify an additional GPU upload just to render the 127 frames surrounding the animation.
Still, we've only eliminated a single copy, and SDL_UpdateTexture() can and will do even more under the hood. Suddenly, SDL having its own shader language seems like the lesser evil, doesn't it?
When writing it out like this, it sure looks as if hardware rendering adds nothing but overhead here. So how about full-on dropping into software rendering and handling the scaling from 640×480 to the window resolution in software as well? This would allow us to cut out steps 2) and d), leaving 1) as our one and only framebuffer.
It sure sounds a lot more efficient. But actually trying this solution revealed that I had a completely wrong idea of the inefficiencies here:
We do want to hardware-render the rest of the game, so we'd need to switch from software to hardware at the end of the logo animation. As it turns out, this switch is a rather expensive operation that would add an awkward ~500 ms pause between logo and title screen.
Most importantly, though: Hardware-accelerating the final scaling step is kind of important these days. SDL's CPU scaling implementation can get really slow if a bilinear filter is involved; on my system, software-scaling 62.5 frames per second by 1.75× to 1120×840 pixels increases CPU usage by ~10%-20% in Release mode, and even drops FPS to 50 in Debug mode.
This was perhaps the biggest lesson in this sudden 25-year jump from optimizing for a PC-98 and suffering under slow DirectDraw and Direct3D wrappers into the present of GPU rendering. Even though some drivers technically don't need these redundant CPU copies, a slight bit of added CPU time is still more than worth it if it means that we get to offload the actually expensive stuff onto the GPU.
But we all know that 4-digit frame rates aren't the main draw of rendering graphics through SDL. Besides cross-platform compatibility, the most useful aspect for Shuusou Gyoku is how SDL greatly simplifies the addition of the scaled window and borderless fullscreen modes you'd expect for retro pixel graphics on modern displays. Of course, allowing all of these settings to be changed in-engine from inside the Graphic options menu is the minimum UX comfort level we would accept here – after all, something like a separate DPI-aware dialog window at startup would be harder to port anyway.
For each setting, we can achieve this level of comfort in one of two ways:
1) We could simply shut down SDL's underlying render driver, close the window, and reopen/reinitialize the window and driver, reloading any game graphics as necessary. This is the simplest way: We can just reuse our backend's full initialization code that runs at startup and don't need any code on top. However, it would feel rather janky and cheap.
2) Or we could use SDL's various setter functions to only apply the single change to the specific setting… and anything that setting depends on. This would feel really smooth to use, but would require additional code with a couple of branches.
pbg's code already geared slightly towards 2) with its feature to seamlessly change the bit depth. And with the amount of budget I'm given these days, it should be obvious what I went with. This definitely wasn't trivial and involved lots of state juggling and careful ordering of these procedural, imperative operations, even at the level of "just" using high-level SDL API calls for everything. It must have undoubtedly been worse for the SDL developers; after all, every new option for a specific parameter multiplies the amount of potential window state transitions.
In the end though, most of it ended up working at our preferred high level of quality, leaving only a few cases where either SDL or the driver API forces us to throw away and recreate the window after all:
When changing rendering APIs, because certain API transitions would fail to initialize properly and only leave a black window,
when changing from borderless fullscreen into exclusive fullscreen on any API. This one is fixed in SDL 3, and they may or may not backport a fix in response to my bug report.
As for the actual settings, I decided on making the windowed-mode scale factor customizable at intervals of 0.25, or 160×120 pixels, up to the taskbar-excluding resolution of the current display the game window is placed on. Sure, restricting the factor to integer values is the idealistically correct thing to do, but 640×480 is a rather large source resolution compared to the retro consoles where integer scaling is typically brought up. Hence, such a limitation would be suboptimal for a large number of displays, most notably any old 720p display or those laptop screens with 1366×768 resolutions.
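As a small worked example of the quarter-step scaling, the sketch below computes the largest scale factor that fits into a given usable desktop area. The function name is hypothetical, the usable areas in the comments are made-up examples, and window decorations are ignored for simplicity.

```cpp
#include <algorithm>
#include <cassert>

constexpr int GAME_W = 640;
constexpr int GAME_H = 480;

// Returns the largest scale factor that fits into the given area, expressed
// in quarter steps (4 = 1.0×, 6 = 1.5×, …). Each step adds 160×120 pixels.
constexpr int max_scale_quarters(int usable_w, int usable_h) {
	const int by_w = ((usable_w * 4) / GAME_W);
	const int by_h = ((usable_h * 4) / GAME_H);
	return std::min(by_w, by_h);
}

// 1366×768 laptop screen with a hypothetical 40-pixel taskbar: 1.5×, or
// 960×720. Integer scaling would be stuck at 1×.
static_assert(max_scale_quarters(1366, 728) == 6);

// 1280×720 display with the same taskbar: 1.25×, or 800×600.
static_assert(max_scale_quarters(1280, 680) == 5);
```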
In the new borderless fullscreen mode, the configurable scaling factor breaks down into all three possible interpretations of "fitting the game window onto the whole screen":
An [Integer] fit that applies the largest possible integer scaling factor and windowboxes the game accordingly,
a [4:3] fit that stretches the game as large as possible while maintaining the original aspect ratio and either pillarboxes the game on landscape displays or letterboxes it on portrait ones,
and the cursed, aspect ratio-ignoring [Stretch] fit that may or may not improve gameplay for someone out there, but definitely evokes nostalgia for stretching Game Boy (Color) games on a Game Boy Advance.
What currently can't be configured is the image filter used for scaling. The game always uses nearest-neighbor at integer scaling factors and bilinear filtering at fractional ones.
The three scaling options available in borderless fullscreen mode as rendered on a 1280×720 display, which is one of the worst display resolutions you could play this game on.
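The three fits come down to a few lines of integer math. The following sketch computes the destination rectangle on a given display and derives the filter from the resulting size; the enum, struct, and function names are assumptions for this example, as is the clamp to 1× on displays smaller than 640×480.

```cpp
#include <algorithm>
#include <cassert>

constexpr int GAME_W = 640;
constexpr int GAME_H = 480;

enum class Fit { Integer, Aspect, Stretch };

struct Rect {
	int x;
	int y;
	int w;
	int h;
};

// Returns the centered on-screen rectangle of the game for the given fit.
constexpr Rect fit_rect(Fit fit, int display_w, int display_h) {
	int w = display_w;
	int h = display_h;
	switch(fit) {
	case Fit::Integer: {
		const int fx = (display_w / GAME_W);
		const int fy = (display_h / GAME_H);
		const int factor = std::max(1, std::min(fx, fy));
		w = (GAME_W * factor);
		h = (GAME_H * factor);
		break;
	}
	case Fit::Aspect:
		if((display_w * GAME_H) > (display_h * GAME_W)) {
			w = ((display_h * GAME_W) / GAME_H); // Pillarbox
		} else {
			h = ((display_w * GAME_H) / GAME_W); // Letterbox
		}
		break;
	case Fit::Stretch:
		break;
	}
	return { ((display_w - w) / 2), ((display_h - h) / 2), w, h };
}

// Nearest-neighbor at integer factors, bilinear at fractional ones.
constexpr bool uses_nearest_neighbor(const Rect& r) {
	return (((r.w % GAME_W) == 0) && ((r.h % GAME_H) == 0));
}
```

On a 1280×720 display, this yields a 640×480 window-boxed image for [Integer], a pillarboxed 960×720 image for [4:3], and the full 1280×720 for [Stretch], with only the first one getting away without bilinear filtering.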
And yes – as the presence of the FullScr[Borderless] option implies, the new build also still supports exclusive, display mode-changing 640×480 boomer fullscreen. 🙌
That ScaleMode, though…
And then, I was looking for one more small optional feature to complete the 9th push and came up with the idea of hotkeys that would allow changing any of these settings at any point. Ember2528 considered it the best one of my ideas, so I went ahead… but little did I know that moving these graphics settings out of the main menu would not only significantly reshape the architecture of my code, but also uncover more bugs in my code and even a replay-related one from the original game. Paraphrasing the release notes:
The original game had three bugs that affected the configured difficulty setting when playing the Extra Stage or watching an Extra Stage replay. When returning to the main menu from an Extra Stage replay, the configured difficulty would be overridden with either
the difficulty selected before the last time the Extra Stage's Weapon Select screen was entered, or
Easy, when watching the replay before having been to the Extra Stage's Weapon Select screen during one run of the program.
Also, closing the game window during the Extra Stage (both self-played and replayed) would override the configured difficulty with Hard (the internal difficulty level of the Extra Stage).
But the award for the greatest annoyance goes to this SDL quirk that would reset a render target's clipping region when returning to raw framebuffer rendering, which causes sprites to suddenly appear in the two black 128-pixel sidebars for the one frame after such a change. As long as graphics settings were only available from the unclipped main menu, this quirk only required a single silly workaround of manually backing up and restoring the clipping region. But once hotkeys allowed these settings to be changed while SDL_Renderer clips all draw calls to the 384×480 playfield region, I had to deploy the same exact workaround in three additional places… 🥲 At least I wrote it in a way that allows it to be easily deleted if we ever update to SDL 3, where the team fixed the underlying issue.
In the end, I'm not at all confident in the resulting jumbled mess of imperative code and conditional branches, but at least it proved itself during the 1½ months this feature has existed on my machine. If it's any indication, the testers in the Seihou development Discord group thought it was fine at the beginning of October when there were still 8 bugs left to be discovered.
As for the mappings themselves: F10 and F11 cycle the window scaling factor or borderless fullscreen fit, F9 toggles the ScaleMode described below, and F8 toggles the frame rate limiter. The latter in particular is very useful not only for benchmarking, but also as a makeshift fast-forward function for replays. Wouldn't rewinding also be cool?
So we've ported everything the game draws, including its most tricky pixel-level effect, and added windowed modes and scaling on top. That only leaves screenshots and then the SDL backend work would be complete. Now that's where we just call SDL_RenderReadPixels() and write the returned pixels into a file, right? We've been scaling the game with the very convenient SDL_RenderSetLogicalSize(), so I'd expect to get back the logical 640×480 image to match the original behavior of the screenshot key…
…except that we don't? Why do we only get back the 640×480 pixels in the top-left corner of the game's scaled output, right before it hits the screen? How unfortunate – if SDL forces us to save screenshots at their scaled output resolution, we'd needlessly multiply the disk space that these uncompressed .BMP files take up. But even if we did compress them, there should be no technical reason to blow up the pixels of these screenshots past the logical size we specified…
Taking a closer look at SDL_RenderSetLogicalSize() explains what's going on there. This function merely calculates a scale factor by comparing the requested logical size with the renderer's output size, as well as a viewport within the game window if it has a different aspect ratio than the logical size. Then, it's up to the SDL_Renderer frontend to multiply and offset the coordinates of each incoming vertex using these values.
Therefore, SDL_RenderReadPixels() can't possibly give us back a 640×480 screenshot because there simply is no 640×480 framebuffer that could be captured. As soon as the draw calls hit the render API and could be captured, their coordinates have already been transformed into the scaled viewport.
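The following sketch models that behavior in isolation. It is a simplified illustration of the description above with hypothetical names, not a copy of SDL's implementation:

```cpp
#include <algorithm>
#include <cassert>

struct Point { float x, y; };

// What a logical size boils down to in this model: one uniform scale factor
// plus the offset of a centered viewport.
struct LogicalTransform {
	float scale;
	float offset_x;
	float offset_y;

	// Maps a vertex from logical coordinates to output coordinates.
	Point apply(Point logical) const {
		return {
			((logical.x * scale) + offset_x),
			((logical.y * scale) + offset_y),
		};
	}
};

LogicalTransform logical_transform(
	int logical_w, int logical_h, int output_w, int output_h
)
{
	assert(logical_w > 0 && logical_h > 0);
	const float scale = std::min(
		(static_cast<float>(output_w) / logical_w),
		(static_cast<float>(output_h) / logical_h)
	);
	return {
		scale,
		((output_w - (logical_w * scale)) / 2.0f),
		((output_h - (logical_h * scale)) / 2.0f),
	};
}
```

At a 1280×960 output, the center of the game at (320, 240) ends up at (640, 480). Reading back the top-left 640×480 pixels of the output therefore only captures the top-left quarter of the game.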
The solution is obvious: Let's just create that 640×480 image ourselves. We'd first render every frame at that resolution into a texture, and then scale that texture to the window size by placing it on a single quad. From a preservation standpoint, this is also the academically correct thing to do, as it ensures that the entire game is still rendered at its original pixel grid. That's why this framebuffer scaling mode is the default, in contrast to the geometry scaling that SDL comes with.
With integer scaling factors and nearest-neighbor filtering, we'd expect the two approaches to deliver exactly identical pixels as far as sprite rendering is concerned. At fractional resolutions though, we can observe the first difference right in the menu. While geometry scaling always renders boxes with sharp edges, it noticeably darkens the text inside the boxes because it separately scales and alpha-blends each shadowed line of text on top of the already scaled pixels below – remember, 📝 the shadow for each line is baked into the same sprite. Framebuffer scaling, on the other hand, doesn't work on layers and always blurs every edge, but consequently also blends together all pixels in a much more natural way:
Look closer, and you can even see texture coordinate glitches at the edges of the individual text line quads.
Surprisingly though, we don't see much of a difference with the circles in the Weapon Select screen. If geometry scaling only multiplies and offsets vertices, shouldn't the lines along the 32-sided polygons still be just one pixel thick? As it turns out, SDL puts in quite a bit of effort here: It never actually uses the API's line primitive when scaling the output, but instead takes the endpoints, rasterizes the line on the CPU, and turns each point on the resulting line into a quad the size of the scale factor. Of course, this completely nullifies pbg's original intent of approximating circles with lines for performance reasons.
The result looks better and better the larger the window is scaled. On low fractional scale factors like 1.25×, however, lines end up looking truly horrid as the complete lack of anti-aliasing causes the 1.25×1.25-pixel point quads to be rasterized as 2 pixels rather than a single one at regular intervals:
Also note how you can either have bright circle colors or bright text colors, but not both.
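The regular pattern can be reproduced with a small model of one axis. This is a sketch that assumes the usual rasterization rule of filling a pixel whenever its center lies inside the quad, with the left edge counting as inside; the function name is made up:

```cpp
#include <cassert>
#include <vector>

// Counts how many output pixels each scaled 1×1 point covers along one axis.
// Works in integer units of 1/`den` pixels to avoid floating-point rounding;
// the scale factor is `num`/`den`.
std::vector<int> covered_pixels_per_point(int num, int den, int points)
{
	assert(num >= den && den > 0);
	std::vector<int> ret;
	for(int i = 0; i < points; i++) {
		const int left = (i * num);        // inclusive edge
		const int right = ((i + 1) * num); // exclusive edge
		int covered = 0;
		// The center of pixel `p` sits at (p + 0.5) pixels. Doubling every
		// coordinate keeps the comparison in integers.
		for(int p = (left / den); (p * den) < right; p++) {
			const int center2 = (((2 * p) + 1) * den);
			if(((2 * left) <= center2) && (center2 < (2 * right))) {
				covered++;
			}
		}
		ret.push_back(covered);
	}
	return ret;
}
```

Calling `covered_pixels_per_point(5, 4, 8)` for a 1.25× scale returns the widths 1, 1, 2, 1, 1, 1, 2, 1 under this model: every fourth point is doubled, which matches the uneven look of the lines.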
But once we move in-game, we can even spot differences at integer resolutions if we look closely at all the shapes and gradients. In contrast to lines, software-rasterizing triangles with different vertex colors would be significantly more expensive as you'd suddenly have to cover a triangle's entire filled area with point quads. But thanks to that filled nature, SDL doesn't have to bother: It can merely scale the vertex coordinates as you'd expect and pass them onto the driver. Thus, the triangles get rasterized at the output resolution and end up as smooth and detailed as the output resolution allows:
Note how the HP gauge, being a gradient, also looks smoother with geometry scaling, whereas the Evade gauge, being 9 additively-blended red boxes with decreasing widths, doesn't differ between the modes.
For an even smoother rendering, enable anti-aliasing in your GPU's control panel; SDL unfortunately doesn't offer an API-independent way of enabling it.
You might now either like geometry scaling for adding these high-res elements on top of the pixelated sprites, or you might hate it for blatantly disrespecting the original game's pixel grid. But the main reasons for implementing and offering both modes are technical: As we've learned earlier when porting the lens ball effect, render-to-texture support is technically not guaranteed in SDL 2, and creating an additional texture is technically a fallible operation. Geometry scaling, on the other hand, will always work, as it's just additional arithmetic.
If geometry scaling does find its fans though, we can use it as a foundation for further high-res improvements. After all, this mode can't ever deliver a pixel-perfect rendition of the original Direct3D output, so we're free to add whatever enhancements we like while any accuracy concerns would remain exclusive to framebuffer scaling.
Just don't use geometry scaling with fractional scaling factors. These look even worse in-game than they do in the menus: The glitching texture coordinates reveal both the boundaries of on-screen tiles as well as the edge pixels of adjacent tiles within the set, and the scaling can even discolor certain dithered transparency effects, what the…?!
That green color is supposed to be the color key of this sprite sheet… 🤨
With both scaling paradigms in place, we now have a screenshot strategy for every possible rendering mode:
Software-rendering (i.e., showing the 西方Project logo)?
This is the optimal case. We've already rendered everything into a system-memory framebuffer anyway, so we can just take that buffer and write it to a file.
Hardware-rendering at unscaled 640×480?
Requires a transfer of the GPU framebuffer to the system-memory buffer we initially allocate for software rendering, but no big deal otherwise.
Hardware-rendering with framebuffer scaling?
As we've seen with the initial solution for the lens ball effect, flagging a texture as a render target thankfully always allows us to read pixels back from the texture, so this is identical to the case above.
Hardware-rendering with geometry scaling?
This is the initial case where we must indeed bite the bullet and save the screenshot at the scaled resolution because that's all we can get back from the GPU. Sure, we could software-scale the resulting image back to 640×480, but:
That would defeat the entire point of geometry scaling as it would throw away all the increased detail displayed in the screenshots above. Maybe that is something you'd like to capture if you deliberately selected this scale mode.
If we scaled back an image rendered at a fractional scaling factor, we'd lose every last trace of sharpness.
The only sort of reasonable alternative: We could respond to the keypress by setting up a parallel 640×480 software renderer, rendering the next frame in both hardware and software in parallel, and delivering the requested screenshot with a 1-frame lag. This might be closer to what players expect, but it would make quite a mess of this already way too stateful graphics backend. And maybe, the lag is even longer than 1 frame because we simultaneously have to recreate all active textures in CPU-accessible memory…
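Condensed into code, the four cases above boil down to a single decision about the screenshot resolution. The enum and function names here are hypothetical and only meant as a summary:

```cpp
#include <cassert>

enum class RenderMode {
	Software,
	HardwareUnscaled,
	HardwareFramebufferScaled,
	HardwareGeometryScaled,
};

struct ScreenshotSize { int w, h; };

// Returns the resolution a screenshot ends up with in each rendering mode.
// `output_w` and `output_h` describe the scaled game image and only matter
// for geometry scaling.
ScreenshotSize screenshot_size(RenderMode mode, int output_w, int output_h)
{
	switch(mode) {
	case RenderMode::Software:                  // system-memory framebuffer
	case RenderMode::HardwareUnscaled:          // GPU framebuffer readback
	case RenderMode::HardwareFramebufferScaled: // render target readback
		return { 640, 480 };
	case RenderMode::HardwareGeometryScaled:
		// No 640×480 image exists anywhere; take what the GPU gives us.
		return { output_w, output_h };
	}
	assert(!"unknown render mode");
	return { 640, 480 };
}
```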
Now that we can take screenshots, let's take a few and compare our 640×480 output to pbg's original Direct3D backend to see how close we got. Certain small details might vary across all the APIs we can use with SDL_Renderer, but at least for Direct3D 9, we'd expect nothing less than a pixel-perfect match if we pass the exact same vertices to the exact same APIs. But something seems to be wrong with the SDL backend at the subpixel level with any triangle-based geometry, regardless of which rendering API we choose…
As if each polygon was shifted slightly up and to the left…
The other, much trickier accuracy issue is the line rendering. We saw earlier that SDL software-rasterizes any lines if we geometry-scale, but we do expect it to use the driver's line primitive if we framebuffer-scale or regularly render at 640×480. And at one point, it did, until the SDL team discovered accuracy bugs in various OpenGL implementations and decided to just always software-rasterize lines by default to achieve identical rendered images regardless of the chosen API. Just like with the half-pixel offset above, this is the correct choice for new code, but the wrong one for accurately porting an existing Direct3D game.
Thankfully, you can opt into the API's native line primitive via SDL's hint system, but the emphasis here is on API. This hint can still only ensure a pixel-perfect match if SDL renders via any version of Direct3D and you either use framebuffer scaling or no scaling at all. OpenGL will draw lines differently, and the software renderer just uses the same point rasterizing algorithm that SDL uses when scaling.
Pixels written into the framebuffer along the accurate outline, as we've covered above. Also note the slightly brighter color compared to the 3D-rendered variants.
The original Direct3D line rendering used in pbg's original code, touching a total of 568 pixels.
OpenGL's line rendering gets close, but still puts 16 pixels into different positions. Still, 97.2% of points are accurate to the original game.
The result of SDL's software line rasterizer, which you'd still see in the P0295 build when using either the software renderer or geometry scaling with any API. Slightly more accurate than OpenGL in this particular case with only 14 diverging pixels, matching 97.5% of the original circle.
As another alternative, SDL also offers a mode that renders each line as two triangles. This method naturally scales to any scale factor, but ends up drawing slightly thicker diagonals. You can opt into this mode via SDL's hint system by setting the environment variable SDL_RENDER_LINE_METHOD to 3.
The triangle method would also fit great with the spirit of geometry scaling, rendering smooth high-res circles analogous to the laser examples we saw earlier. This is what it would look like with the game scaled to 3200×2400… yeah, maybe we do want the point list after all, you can clearly see the 32 corners at this scale.
Replacing circles with point lists, as mentioned earlier, won't solve everything though, because Shuusou Gyoku also has plenty of non-circle lines:
6884 pixels touched by the Direct3D line renderer, a 98.3% match by the OpenGL rasterizer with 119 diverging pixels, and a 97.9% match by the SDL rasterizer with 147 diverging pixels. Looks like OpenGL gets better the longer the lines get, making line render method #2 the better choice even on non-Direct3D drivers.
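For reference, the percentages in these comparisons follow from the number of reference pixels and the number of diverging ones. A minimal sketch with a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Percentage of reference pixels that a candidate rasterizer also hits,
// rounded to one decimal place.
double match_percent(int reference_pixels, int diverging_pixels)
{
	assert(reference_pixels > 0);
	assert(diverging_pixels >= 0 && diverging_pixels <= reference_pixels);
	const double ratio = (
		static_cast<double>(reference_pixels - diverging_pixels) /
		reference_pixels
	);
	return (std::round(ratio * 1000.0) / 10.0);
}
```

`match_percent(568, 16)` and `match_percent(568, 14)` give the 97.2% and 97.5% of the circle, while `match_percent(6884, 119)` and `match_percent(6884, 147)` give the 98.3% and 97.9% of the longer lines.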
So yeah, this one's kind of unfortunate, but also very minor as both OpenGL's and SDL's algorithms are at least 97% accurate to the original game. For now, this does mean that you'll manually have to change SDL_Renderer's driver from the OpenGL default to any of the Direct3D ones to get those last 3% of accuracy. However, I strongly believe that everyone who does care at this level will eventually read this sentence. And if we ever actually want 100% accuracy across every driver, we can always reverse-engineer and reimplement the exact algorithm used by Direct3D as part of our game code.
That completes the SDL renderer port for now! As all the GitHub issue links throughout this post have already indicated, I could have gone even further, but this is a convincing enough state for a first release. And once I've added a Linux-native font rendering backend, removed the few remaining <windows.h> types, and compiled the whole thing with GCC or Clang as a 64-bit binary, this will be up and running on Linux as well.
If we take a step back and look at what I've actually ended up writing during these SDL porting endeavors, we see a piece of almost generic retro game input, audio, window, rendering, and scaling middleware code, on top of SDL 2. After a slight bit of additional decoupling, most of this work should be reusable for not only Kioh Gyoku, but even the eventual cross-platform ports of PC-98 Touhou.
Perhaps surprisingly, I'm actually looking forward to Kioh Gyoku now. That game seems to require raw access to the underlying 3D API due to a few effects that seem to involve a Z coordinate, but all of these are transformed in software just like the few 3D effects in Shuusou Gyoku. Coming from a time when hardware T&L wasn't a ubiquitous standard feature on GPUs yet, both games don't even bother and only ever pass Z coordinates of 0 to the graphics API, thus staying within the scope of SDL_Renderer. The only true additional high-level features that Kioh Gyoku requires from a renderer are sprite rotation and scaling, which SDL_Renderer conveniently supports as well. I remember some of my backers thinking that Kioh Gyoku was going to be a huge mess, but looking at its code and not seeing a separate 8-bit render path makes me rather excited to be facing a fraction of Shuusou Gyoku's complexity. The 3D engine sure seems featureful at the surface, and the hundreds of source files sure feel intimidating, but a lot of the harder-to-port parts remained unused in the final game. Kind of ironic that pbg wrote a largely new engine for this game, but we're closer to porting it back to our own enhanced, now almost fully cross-platform version of the Shuusou Gyoku engine.
Speaking of 8-bit render paths though, you might have noticed that I didn't even bother to port that one to SDL. This is certainly suboptimal from a preservation point of view; after all, pbg specifically highlights in the source code's README how the split between palettized 8-bit and direct-color 16-bit modes was a particularly noteworthy aspect of the period in time when this game was written:
Times have changed though, and SDL_Renderer doesn't even expose the concept of rendering bit depth at the API level. 📝 If we remember the initial motivation for these Shuusou Gyoku mods, Windows ≥8 doesn't even support anything below 32-bit anymore, and neither do most of SDL_Renderer's hardware-accelerated drivers as far as texture formats are concerned. While support for 24-bit textures without an alpha channel is still relatively common, only the Linux DirectFB driver might support 16-bit and 8-bit textures, and you'd have to go back to the PlayStation Vita, PlayStation 2, or the software renderer to find guaranteed 16-bit support.
Therefore, full software rendering would be our only option. And sure enough, SDL_Renderer does have the necessary palette mapping code required for software-rendering onto a palettized 8-bit surface in system memory. That would take care of accurately constraining this render path to its intended 256 colors, but we'd still have to upconvert the resulting image to 32-bit every frame and upload it to GPU for hardware-accelerated scaling. This raises the question of whether it's even worth it to have 8-bit rendering in the SDL port to begin with if it will be undeniably slower than the GPU-accelerated direct-color port. If you think it's still a worthwhile thing to have, here is the issue to invest in.
In the meantime though, there is a much simpler way of continuing to preserve the 8-bit mode. As usual, I've kept pbg's old DirectX graphics code working all the way through the architectural cleanup work, which makes it almost trivial to compile that old backend into a separate binary and continue preserving the 8-bit mode in that way.
This binary is also going to evolve into the upcoming Windows 98 backport, and will be accompanied by its own SDL DLL that throws out the Direct3D 11, 12, OpenGL 2, and WASAPI backends as they don't exist on Windows 98. I've already thrown out the SSE2 and AVX implementations of the BLAKE3 hash function in preparation, which explains the smaller binary size. These Windows 98-compatible binaries will obviously have to remain 32-bit, but I'm undecided on whether I should update the regular Windows build to a 64-bit binary or keep it 32-bit:
Going 64-bit would give Windows users easy access to both builds and could help with testing and debugging rare issues that only occur in either the 64-bit or the 32-bit build, whereas
staying 32-bit would make it less likely for us to actually break the 32-bit Windows build because all Windows users (and developers) would continue using it.
I'm open to strong opinions that sway me in one or the other direction, but I'm not going to do both – unless, of course, someone subscribes for the continued maintenance of three Windows builds. 😛
Speaking of SDL, we'll probably want to update from SDL 2 to SDL 3 somewhere down the line. It's going to be the future, cleans up the API in a few particularly annoying places, and adds a Vulkan driver to SDL_Renderer. Too bad that the documentation still deters me from using the audio subsystem despite the significant improvements it made in other regards…
For now, I'm still staying on SDL 2 for two main reasons:
While SDL 3 is bound to be more available on Linux distributions in the future, that's not the case right now. Everyone is still waiting for its first stable release, and so it currently isn't packaged in any distribution repo outside the AUR from what I can tell. Wide Linux compatibility is the whole point of this port.
The funding for a Windows 98 port of SDL 2 was obviously intended to help with other existing SDL 2 games and not just Shuusou Gyoku.
Finally, I decided against a Japanese translation of the new menu options for now because the help text communicates too much important information. That will have to wait until we make the whole game translatable into other languages.
📝 I promised to recreate the Sound Canvas VA packs once I know about the exact way real hardware handles the 📝 invalid Reverb Macro messages in ZUN's MIDI files, and what better time to keep that promise than to tack it onto the end of an already long overdue delivery. For some reason, Sound Canvas VA exhibited several weird glitches during the re-rendering processes, which prompted some rather extensive research and validation work to ensure that all tracks generally sound like they did in the previous version of the packages. Figuring out why this patch was necessary could have certainly taken a push on its own…
Interestingly enough, all these comparisons of renderings against each other revealed that the fix only makes a difference in a lot fewer than the expected 34 out of 39 MIDIs. Only 19 tracks – 11 in the OST and 8 in the AST – actually sound different depending on the Reverb Macro, because the remaining 15 set the reverb effect's main level to 0 and are therefore unaffected by the fix.
And then, there is the Stage 1 theme, which only activates reverb during a brief portion of its loop:
Thus, this track definitely counts toward the 11 with a distinct echo version. But comparing that version against the no-echo one reveals something truly mind-blowing: The Sound Canvas VA rendering only differs within exactly the 8 bars of the loop, and is bit-by-bit identical anywhere else. 🤯 This is why you use softsynths.
This is the OST version, but it works just as well with the AST.
Since the no-echo and echo BGM packs are aligned in both time and volume, you can reproduce this result – and explore the differences for any other track across both soundtracks – by simply phase-inverting a no-echo variant file and mixing it into the corresponding echo file. Obviously, this works best with the FLAC files. Trying it with the lossy versions gets surprisingly close though, and simultaneously reveals the infamous Vorbis pre-echo on the drums.
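If you'd rather do this comparison in code than in an audio editor, a minimal sketch could look as follows. It assumes both files were already decoded into equally long vectors of integer samples; the decoding itself and all names are left hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Residual {
	bool audible = false;  // `true` if any sample survives the cancellation
	std::size_t first = 0; // index of the first non-silent sample
	std::size_t last = 0;  // index of the last non-silent sample
};

// Mixes a phase-inverted copy of `b` into `a` (i.e., subtracts it sample by
// sample) and reports the range in which the two streams differ.
Residual differing_range(
	const std::vector<std::int32_t>& a, const std::vector<std::int32_t>& b
)
{
	assert(a.size() == b.size()); // Streams must be aligned in time.
	Residual ret;
	for(std::size_t i = 0; i < a.size(); i++) {
		const std::int64_t residual = (std::int64_t{ a[i] } - b[i]);
		if(residual == 0) {
			continue;
		}
		if(!ret.audible) {
			ret.audible = true;
			ret.first = i;
		}
		ret.last = i;
	}
	return ret;
}
```

Two bit-identical streams leave `audible` at `false`. Otherwise, `first` and `last` mark the boundaries of the section that differs.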
So yeah, the fact that ZUN enabled reverb by suddenly increasing the level for just this 8-bar piano solo erases any doubt about the panning delay having been a quirk or accident. There is no way this wasn't done intentionally; whether the SC-88Pro's default reverb is at 0 or 40 barely makes an audible difference with all the notes played in this section, and wouldn't have been worth the unfortunate chore of inserting another GS SysEx message into the sequence. That's enough evidence to relegate the previous no-echo Sound Canvas VA packs to a strictly unofficial status, and only preserve them for reference purposes. If you downloaded the earlier ones, you might want to update… or maybe not if you don't like the echo, it's all about personal preference at the end of the day.
While we're that deep into reproducibility, it makes sense to address another slight issue with the March release. Back then, I rendered 📝 our favorite three MIDI files, the AST versions of the three Extra Stage themes, with their original long setup area and then trimmed the respective samples at the audio level. But since the MIDI-only BGM pack features a shortened setup area at the MIDI level, rendering these modified MIDI files yourself wouldn't give you back the exact waveforms. 📝 As PCM behaves like a lollipop graph, any change to the position of a note at a tempo that isn't an integer factor of the sampling rate will most likely result in completely different samples and thus be uncomparable via simple phase-cancelling.
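To illustrate why shifted notes rarely land on the same samples, here is a small sketch that computes the exact position of a MIDI pulse on the sample grid. It assumes a constant tempo in whole BPM, and the helper names are made up:

```cpp
#include <cassert>
#include <cstdint>

// Exact position of a MIDI pulse on the PCM sample grid, split into a whole
// sample index and the leftover `remainder` / `denominator` fraction.
struct SamplePosition {
	std::int64_t whole;
	std::int64_t remainder;
	std::int64_t denominator;

	bool on_grid() const { return (remainder == 0); }
};

SamplePosition sample_position(
	std::int64_t pulse, std::int64_t bpm, std::int64_t ppqn, std::int64_t rate
)
{
	assert(pulse >= 0 && bpm > 0 && ppqn > 0 && rate > 0);
	// seconds = (pulse / ppqn) beats × (60 / bpm) seconds per beat
	const std::int64_t numerator = (pulse * 60 * rate);
	const std::int64_t denominator = (bpm * ppqn);
	return { (numerator / denominator), (numerator % denominator), denominator };
}
```

With example values of 96 BPM, 480 pulses per quarter note, and a 44,100 Hz sampling rate, a single pulse is 57.421875 samples long. Moving a note by one pulse thus shifts it by a fractional number of samples, and only shifts by multiples of 64 pulses would keep it on the grid.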
In our case though, all three of the tracks in question render with a slightly higher maximum peak amplitude when shortening their MIDI setup area. Normally, I wouldn't bother with such a fluctuation, but remember that シルクロードアリス is by far the loudest piece across both soundtracks, and thus defines the peak volume that every other track gets normalized to.
But wait a moment, doesn't this mean that there's maybe a setup area length that could yield a lower or even much lower peak amplitude?
And so I tested all setup area lengths at regular intervals between our target 2-beat length and ZUN's original lengths, and indeed found a great solution: When manipulating the setup area of the Extra Stage theme to an exact length of 2850 MIDI pulses, the conversion process renders it with a peak amplitude of 1.900, compared to its previous peak amplitude of 2.130 from the March release. That translates to an extra +0.56 dB of volume tricked out of all other tracks in the AST! Yeah, it's not much, but hey, at least it's not worse than what it used to be. The shipped MIDIs of the Extra Stage themes still don't correspond to the rendered files, but now this is at least documented together with the MIDI-level patch to reproduce the exact optimal length of the setup area.
Still, all that testing effort for tracks that, in my subjective opinion, don't even sound all that good… The resulting shrill resonant effects stick out like a sore thumb compared to the more basic General MIDI sound of every other track across both soundtrack variants. Once again, unofficial remixes such as Romantique Tp's one edit to 二色蓮花蝶 ~ Ancients can be the only solution here.
As far as preservation is concerned, this is as good as it gets, and my job here is done.
Then again, now that I've further refined (and actually scripted) the loop construction logic, I'd love to also apply it to Kioh Gyoku's MIDI soundtrack once its codebase is operational. Obviously, there's much less of an incentive for putting SC-88Pro recordings back into that game given that Kioh Gyoku already comes with an official (and, dare I say, significantly more polished) waveform soundtrack. And even if there was an incentive, it might not extend to a separate Sound Canvas VA version: As frustrating as ZUN's sequencing techniques in the final three Shuusou Gyoku Extra Stage arrangements are when dealing with rendered output, the fact that he reserved a lot more setup space to fit the more detailed sound design of each Kioh Gyoku track is a good thing as far as real-hardware playback is concerned. Consequently, the Romantique Tp recordings suffer far less from 📝 the SC-88Pro's processing lag issues, and thus might already constitute all the preservation anyone would ever want.
Once again though, generous MIDI setup space also means that Kioh Gyoku's MIDI soundtrack has lots of long and awkward pauses at the beginning of stages before the music starts. The two worst offenders here are
天鵞絨少女戦 ~ Velvet Battle and 桜花之恋塚 ~ Flower of Japan, with a 3:429s pause each. So, preserving the MIDI soundtrack in its originally intended sound might still be a worthwhile thing to fund if only to get rid of those pauses. After all, we can't ever safely remove these pauses at the MIDI level unless users promise that they use a GS-supporting device.
What we can do as part of the game, however, is hotpatch the original MIDI files from Shuusou Gyoku's MUSIC.DAT with the Reverb Macro fix. This way, the fix is also available for people who want to listen to the OST through their own copy of Sound Canvas VA or a SC-8850 and don't want to download recordings. This isn't necessary for the AST because we can simply bake the fix into the MIDI-only BGM pack, but we can't do this for the OST due to copyright reasons. This hotpatch should be an option just because hotpatching MIDIs is rather insidious in principle, but it's enabled by default due to the evidence we found earlier.
The game currently pauses when it loses focus, which also silences any currently playing MIDI notes. Thus, we can verify the active reverb type by switching between the game and VST windows:
Maximum volume recommended.
Still saying Panning Delay, even though we obviously hear the default reverb. A clear bug in the Sound Canvas VA UI.
Next up: You decide! This delivery has opened up quite a bit of budget, so this would be a good occasion to take a look at something else while we wait for a few more funded pushes to complete the Shuusou Gyoku Linux port. With the previous price increases effectively increasing the monetary value of earlier contributions, it might not always be exactly obvious how much money is needed right now to secure another push. So I took a slight bit out of the Anything funds to add the exact € amount to the crowdfunding log.
In the meantime, I'll see how far I can get with porting all of the previous SDL work back to Windows 98 within one push-equivalent microtransaction, and do some internal website work to address some long-standing pain points.
📝 Over two years since the previous largest delivery, we've now got a new record in every regard: 12 pushes across 5 repos, 215 commits, and a blog post with over 14,000 words and 48 pieces of media. 😱 Who would have thought that the superficially simple task of putting SC-88Pro recordings into Shuusou Gyoku would actually mainly focus on deep research into the underlying MIDI files? I don't typically cover much music-related content because it's a non-issue as far as PC-98 Touhou code is concerned, so it's quite fitting how extensive this one turned out. So here we go, the result of virtually unlimited funding and patience:
So where's the controversy? Romantique Tp obviously made the best and most careful real-hardware SC-88Pro recordings of all of ZUN's old MIDIs, including the original (OST) and arranged (AST) soundtrack of Shuusou Gyoku, right? Surely all I have to do now is to cut them into seamless loops to save a bit of disk space, and then put them into the game? Let's start at the end of the track list with the name registration theme, since it's light on instruments and has an obvious loop point that will be easy to spot in the waveform. But, um… wait a moment, that very first drum note comes a bit late, doesn't it?
At a notated tempo of 96 BPM, these first four beats should take exactly 2.5 seconds, which they do in this seamlessly looping softsynth rendering.
That's… not quite the accuracy and perfection I was expecting. But I think I know what we're seeing and hearing there. Let's look at the first few MIDI events across all channels:
Delta  Pulse  Beat   Channel  Event
+540   960    2:000  1        Controller { CC 0, value 0 }
+0     960    2:000  1        Controller { CC 32, value 0 }
+0     960    2:000  1        ProgramChange { 37 }
+0     960    2:000  2        Controller { CC 0, value 0 }
+0     960    2:000  2        Controller { CC 32, value 0 }
+0     960    2:000  2        ProgramChange { 19 }
+0     960    2:000  3        Controller { CC 0, value 0 }
+0     960    2:000  3        Controller { CC 32, value 0 }
+0     960    2:000  3        ProgramChange { 6 }
+0     960    2:000  4        Controller { CC 0, value 0 }
+0     960    2:000  4        Controller { CC 32, value 0 }
+0     960    2:000  4        ProgramChange { 2 }
Also, the fact that GS doesn't put its drums on a non-general voice bank and instead relies on external channel configuration to differentiate drums from pitched instruments is making this Yamaha kid uncontrollably furious. 🤬
Yup. That's the sound of a vintage hardware synth being slow and taking a two-digit number of milliseconds to process a barrage of simultaneous Program Change messages, playing a MIDI file that doesn't take this reality into account and expects program changes to happen instantly.
I can only speak from my own experience of writing MIDIs for hardware synths here, but having the first note displaced by 50 ms is very much not the way a composer would have intended the music to be heard if the note is clearly notated to occur on the beat. If you had told me about such an issue when playing one of my MIDIs on a certain synth, I would have thanked you for the bug report! And I would have promptly released a fixed version of the MIDI with the Program Change events moved back by a beat or two. In the case of Shuusou Gyoku's MIDIs, this wouldn't even have added any additional delay in-game, as all of these files already start with at least one beat of leading silence to make room for setting Roland-specific synth parameters.
OK, but that's just a single isolated bass drum hit. If we wanted to, we could even fix this issue ourselves by splicing the same note from around the loop end point. Maybe this is just an isolated case and the rest of Romantique Tp's recordings are fine? Well…
By the way, this seamless audio player is what consumed most of the two website pushes this time. The rest went to the slightly redesigned main page, whose progress bars now use the cap bar style and the GitHub badge colors.
This one is even worse. Here, the delay is so long relative to the tempo of the piece that the intended five drum hits pretty much turn into four.
This type of issue doesn't even have to be isolated to the very beginning of a piece. A few of the tracks in both the OST and AST start with an anacrusis on just one or two channels and leave the Program Change event barrage at the beginning of the first full measure. In 幻想科学 ~ Doll's Phantom for example, this creates a flam-like glitch where the bass on channel 2 is pretty much on time, but the crash hit on channel 10 only follows 50 ms later, after the SC-88Pro took its sweet time to process all the Program Change events on the channels between:
This is from the arranged soundtrack for a change. In that one, ZUN at least fixed the issue in the final three MIDIs (シルクロードアリス, 魔女達の舞踏会, and 二色蓮花蝶 ~ Ancients) that closed out this rearranging project in May 2001, which spread out their per-channel setup events over at least a single measure before playing any note.
Sure, all of this is barely noticeable in casual listening, but very noticeable if you're the one who now has to cut these recordings into seamless loops. And these are just the most obvious timing issues that can be easily pinpointed and documented – the actual worst aspects are all the minor tempo and timing fluctuations throughout most of the pieces. With recordings that deviate ever so slightly from the tempo defined in the MIDI files, you can no longer rely on mathematically exact sample positions when cutting loops. Even if those positions do work out from time to time, there'd pretty much always be a discontinuity in the waveform at both ends of the loop, manifesting as a clearly audible click. In the end, the only way of finding good loop points in existing recordings involves straining your ears and listening very, very closely to avoid any audible glitches. 😩
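To make "mathematically exact sample positions" concrete, here is a minimal sketch of the conversion from MIDI pulses to PCM samples. It assumes a constant tempo given as an integer BPM value (real MIDI files store tempo as microseconds per beat and can change it mid-piece), and the function names as well as the 480 pulses per beat used in the usage note are made up for illustration.

```rust
/// Converts a MIDI pulse count into a PCM sample position at a constant tempo.
/// `ppqn` is the number of pulses per beat (quarter note).
/// The result is rounded down to a whole sample.
fn pulses_to_samples(pulses: u64, ppqn: u64, bpm: u64, sample_rate: u64) -> u64 {
    // One beat lasts 60 / bpm seconds and contains `ppqn` pulses, so
    //   samples = pulses / ppqn * 60 / bpm * sample_rate.
    // Multiplying first keeps the only integer division at the very end.
    (pulses * 60 * sample_rate) / (ppqn * bpm)
}

/// Converts a duration in milliseconds into a number of PCM samples.
fn ms_to_samples(ms: u64, sample_rate: u64) -> u64 {
    (ms * sample_rate) / 1000
}
```

With a hypothetical resolution of 480 pulses per beat, `pulses_to_samples(4 * 480, 480, 96, 44_100)` yields 110,250 samples, which matches the 2.5 seconds that four beats take at 96 BPM. A 50 ms delay corresponds to `ms_to_samples(50, 44_100)`, or 2,205 samples, so a recording that drifts by that much is far outside of anything you could cut by calculation alone.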
But if you've taken a look at the second tabs in the clips above, you will have noticed that we don't necessarily have to be stuck with recordings from real hardware. In late 2015, Roland released Sound Canvas VA, a VST plugin that emulates the classic core of Roland's old Sound Canvas lineup, including the SC-88Pro. As long as we run such a software synthesizer through a quality VST host, a purely software-based solution should be way superior for recording looped BGM:
By moving from real-time recording to an offline rendering paradigm, we get perfectly accurate note timing, as it no longer matters how long the synth takes to produce each output sample.
We stay entirely in the digital realm instead of going from digital (SC-88Pro) to analog (RCA cable) to digital (line-in recording) again, removing any chance for noise or distortion to ruin audio quality.
We get to directly render at 44,100 Hz instead of being limited to the 32,000 Hz signal coming out of the SC-88Pro's DAC. This can be easily noticed in the half-speed video above, whose SCVA version retains significantly more sibilant high-frequency content compared to the more muffled sound of Romantique Tp's recording.
Doing that also makes it feasible to preserve loudness differences between the pieces of a soundtrack instead of eradicating them by normalizing the volume of each individual track to the digital maximum.
Finally, it's much more time-efficient. We simply hit foobar2000's Convert button and get all MIDIs rendered within a few seconds each, instead of having to wait the entire length of a piece.
Any drawbacks? For our use case, all of them are found in the abysmal software quality of everything around the synth engine. As is typical for the VST industry, Sound Canvas VA is excessively DRM'd – it takes multiple seconds to start up, and even then only allows a single process to run at any given time, immediately quitting every process beyond the first one with a misleading Parameter File1 Read Error message box. I totally believe anyone who claims that this makes SCVA more annoying than real hardware when composing new music. Retro gamers also dislike how Roland themselves no longer sell the 32-bit builds they used to offer for the first few versions. These old versions are now exclusively available through resellers, or on the seven seas.
But as far as the SC-88Pro emulation is concerned, there don't seem to be any technical reasons against it. There is a long thread over at VOGONS discussing all sorts of issues, but you have to dig quite deep to find any clear descriptions of bugs in SCVA's synth engine. Everything I found either only applies to the SC-55 emulation and not the SC-88Pro, was fixed by Roland in the meantime, or turned out to be a fixable bug in a MIDI file.
But wait, we've already heard one obvious difference between the real SC-88Pro and Sound Canvas VA. Let's listen to the very first clip again:
Ha! You can clearly hear a panning echo in the real-hardware recording that is missing from the Sound Canvas VA rendering. That's an obvious case of a core system effect not being reproduced correctly. If even that's undeniably broken, who knows which other subtle bugs SCVA suffers from, right? Case closed, Romantique Tp was right all along, SCVA is trash, real hardware reigns supreme
Actually, let's look closer into this one. Panning delay effects like this are typically reverb-related, but General MIDI only defines a single controller for setting the per-channel reverb level from 0 to 127. Any specific characteristics of the reverb therefore have to be configured using vendor-specific system-exclusive messages, or SysEx for short.
So it's down to one of the four SysEx messages at the beginning of the MIDI file:
Since these byte strings represent Roland-specific instructions, we can't learn anything from a raw MIDI event dump alone here. No problem though, let's just load these files into some old MIDI sequencer that targeted Roland synths, open its MIDI event list, and then they will be automatically decoded into a human-readable representation…
…or at least that's what I expected. In Yamaha land, XGworks has done that for Yamaha's own XG SysEx messages ever since 1997:
No configuration required. You can even edit the textual Value1 representation and XGworks parses it back into the closest supported value!
But for Roland synths, there's… nothing similar? Seriously? 😶 Roland fanboys, how do you even live?! I mean, they are quick to recommend the typical bloated and sluggish big-name DAWs that take up multiple gigabytes of disk space, but none of the ones I tried seemed to have this feature. They can't have possibly been flinging around raw byte strings for the past 33 years?!
But once you look more into today's MIDI community, it becomes clear that this is exactly what they've been doing. Why else would so many people use the word complicated to describe Roland SysEx, or call it an old school/cryptic communication protocol in hexadecimal format? The latter is particularly hilarious because if you removed the word cryptic, this might as well describe all of MIDI, not just SysEx. Everything about this is a tooling issue, and Yamaha showed how easily it could have been solved. Instead, we get Sound Canvas experts, who should know more about the ecosystem than I do, making the incredible mental leap from "my DAW doesn't decode or easily generate SysEx" to "SysEx is antiquated" to "please just lift up these settings to the VST level and into my proprietary DAW's proprietary project format, that would be so much better"…
Thankfully that's not entirely true. After some more digging and configuration, I found a somewhat workable solution involving a comparatively modern sequencer called Domino:
Open the File → Preferences menu and associate your MIDI output device with a module map. This makes sense for SysEx encoding/generation since it can limit the options in the UI to what's actually available on your target hardware, but it is also required for loading the respective SysEx map into Domino's SysEx decoder. There is no technical reason for this because SC-88Pro SysEx messages can be uniquely identified by the three vendor, device, and model ID bytes that every message starts with, but that would be too easy and user-friendly. The perception of SysEx being a black art must be upheld at all costs.
I've kept the garbled text of the partial translation to emphasize the sheer amount of jank involved in this entire process.
Load a MIDI file and let Domino "analyze" it:
Strangely enough, this will take quite a while – on my system, this analysis step runs at a speed of roughly 4.25 KB/s of MIDI data. Yes, kilobytes.
Unfortunately, "control change macro restoration" also seems to mean that you don't get to see any raw bytes when selecting the respective MIDI track in the UI, but at least we get what we were looking for:
…for the most part?
Alright, that's something we can work with. The GS Reset message is something that every Roland GS MIDI should start with, but it's immediately followed by a message that Domino failed to decode? The two subsequent reverb parameters make sense, but panning delays typically have more parameters than just a reverb level and time.
That unknown SysEx message shares much of the same bytes with the decoded ones though. So let's do what we maybe should have done all along, return to caveman, and check the SC-88Pro manual:
The relevant section from page 194. Counting from 0, we can see how the address and value correspond to bytes 5-7 and 8 in the SysEx messages. Byte 9 is a checksum and byte 10 signals the end of the message.
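This byte layout is simple enough to decode by hand. The following is a minimal sketch of such a decoder for single-value Roland "data set" messages, using zero-based byte offsets. The struct, its field names, and the function names are made up for this example, and it only handles the 11-byte message shape discussed here rather than Roland SysEx in general.

```rust
/// Decoded fields of an 11-byte Roland SysEx message that writes one value.
#[derive(Debug, PartialEq)]
struct RolandWrite {
    device_id: u8,
    model_id: u8,
    address: [u8; 3],
    value: u8,
}

/// Roland checksum: the 7-bit value that brings the sum of all address and
/// data bytes, plus the checksum itself, to a multiple of 128.
fn roland_checksum(address_and_data: &[u8]) -> u8 {
    let sum: u32 = address_and_data.iter().map(|&b| u32::from(b)).sum();
    ((128 - (sum % 128)) % 128) as u8
}

/// Returns `None` for anything that isn't a well-formed single-value write.
fn decode_roland_write(msg: &[u8]) -> Option<RolandWrite> {
    let well_formed = msg.len() == 11
        && msg[0] == 0xF0 // start of SysEx
        && msg[1] == 0x41 // vendor ID: Roland
        && msg[4] == 0x12 // command: data set
        && msg[10] == 0xF7; // end of SysEx
    // Byte 9 must match the checksum over address (5-7) and value (8).
    if !well_formed || roland_checksum(&msg[5..9]) != msg[9] {
        return None;
    }
    Some(RolandWrite {
        device_id: msg[2],
        model_id: msg[3],
        address: [msg[5], msg[6], msg[7]],
        value: msg[8],
    })
}
```

Feeding it the bytes `F0 41 10 42 12 40 01 30 14 7B F7` yields the address `40 01 30` and the value `14h`, and the checksum works out as well: 40h + 01h + 30h + 14h = 133, which leaves a remainder of 5 when divided by 128, and 128 - 5 = 123 = `7Bh`. The message itself is therefore perfectly well-formed. Only the value it carries is out of range.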
And that's where we find what this particular issue boils down to. The missing SysEx message is clearly intended to be a Reverb Macro command, whose value can range from 0 to 7 inclusive on the SC-88Pro, but ZUN tries to specify Reverb Macro #14h, or 20 in decimal. The SC-88Pro manual does not specify what happens if a SysEx message wants to write an invalid value to a valid address, which means that we've firmly entered the territory of undefined behavior. Edit (2024-03-10): Romantique Tp confirmed that the real SC-88Pro clamps these Reverb Macro IDs to the supported range of 0-7. Therefore, the appropriate course of action for guaranteeing the same sound on other Roland synths would be to fix the MIDI file and specify Reverb Macro #7 instead. But since this behavior remains technically undefined, we can still argue about ZUN's intention behind specifying the Reverb Macro like this:
Clearly, ZUN did want to specify a valid Reverb Macro, but made a typo when manually entering the SysEx byte string, as he was forced to do thanks to terrible tooling. He clearly liked the resulting sound though, so the track should still be preserved with the panning reverb intact.
Clearly, the typical behavior for MIDI synths is to ignore invalid and unsupported SysEx messages, because validating user input is an important characteristic of quality software. This is what SCVA does, and what we hear in its rendering is the default hall reverb with ZUN's level and time adjustments. Therefore, SCVA is right, and the fact that we get a panning delay on the real SC-88Pro is a bug in real hardware.
Clearly, ZUN did not care enough about the reverb to specify a valid Reverb Macro. Whether we get the default reverb or a panning delay is an irrelevant performance detail, and intentionally does not matter when it comes to the intended sound of this track – especially since these four SysEx messages are the full extent of Roland GS-specific sound design in this piece, and the rest of it only uses standard MIDI features.
In fact, 32 out of the 39 MIDIs across both of Shuusou Gyoku's soundtracks use this invalid Reverb Macro. The only ones that don't are
both versions of Gates' theme (天空アーミー), which use the equally invalid Reverb Macro #11,
both versions of Milia's theme (プリムローズシヴァ), which use Reverb Macro #0 (Room 1),
and, again, the three arranged MIDIs that ZUN released last (シルクロードアリス, 魔女達の舞踏会, and 二色蓮花蝶 ~ Ancients), which feature a more detailed effect setup with custom chorus and EQ settings. In the case of Reimu's theme, these settings are even commented within the MIDI file.
And that's where this quest seemed to end, until Romantique Tp themselves came in and suggested that I take a closer look at the GS Advanced Editor, or GSAE for short.
Make sure to connect a MIDI input device before starting GSAE, or it will silently crash immediately after this splash screen. At least it accepts any controller, so this might just be a bug instead of the typical user-hostile kind of hardware dongle DRM that is pervasive in today's synth industry. 1999 would seem a bit too early for that, thankfully.
I was aware of this tool, but hadn't initially considered it because it's always described as just a SysEx generator/encoder. In fact, the very existence of such a tool made no sense to me at first, and seemed to prove my point that the usability of GS SysEx was wholly inferior to what I was used to in Yamaha land. Like, why not build at least a tiny and stripped-down MIDI sequencer around this functionality that would allow you to insert SC-88Pro-specific messages at any point within a sequence, and not just the beginning? I can see the need for such a tool in today's world of closed-source DAWs where hardware MIDI modules are niche and retro and are only kept alive by a small community of enthusiasts. But why would its developers guarantee that MIDI composers would have to hop between programs even back in 1997? I can only imagine that they saw how every just slightly advanced MIDI sequencer or DAW back then already used its own project format instead of raw Standard MIDI Files, and assumed that composers would therefore be program-hopping anyway?
However, GSAE does support the import of settings from a MIDI file and features a SysEx history window that decodes every newly processed Roland SysEx byte string, which is all I was looking for. So let's throw in that same MIDI and…
That's the result of sending just the single F0 41 10 42 12 40 01 30 14 7B F7 message at the top.
Now that's some wild numbers. An equally invalid Reverb Character, and Reverb Level and Time values that even exceed their defined range of 0-127? Could it be that GSAE emulates the real-hardware response to invalid Reverb Macros here, and gives us the exact reverb setting we can hear in Romantique Tp's recording? This could even be the reason why GSAE is still used and recommended within today's Roland MIDI sequencing scene, and hasn't been supplanted by some more modern open-source tool written by the community.
In any case, these values have to come from somewhere, so let's reverse-engineer GSAE and figure out the logic behind them. Shoutout to IDR for being a great help with its automatic generation of IDC debug symbols for the Delphi standard library, and even including a few names of application-level widget class methods by reading Delphi-specific type information from the binary. This little sub-project made me also come around to appreciating Ghidra, whose decompiler and data type manager helped a lot and allowed me to find the relevant code section within just a few hours.
A~nd it turns out that the values all come from out-of-bounds accesses into arrays on the stack. If we combine 25, 235, and 132 back into a 32-bit value, we get 0x19EB84, which is the virtual address of the relevant function's stack frame base pointer.
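The arithmetic is quick to verify. This small sketch treats the three displayed values as the bytes of one integer, most significant byte first; the function name is made up.

```rust
/// Combines three byte-sized values, most significant first, into one integer.
fn combine_bytes(bytes: [u8; 3]) -> u32 {
    bytes.iter().fold(0, |acc, &b| (acc << 8) | u32::from(b))
}
```

25 is `19h`, 235 is `EBh`, and 132 is `84h`, so `combine_bytes([25, 235, 132])` returns `0x19EB84`.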
But it gets even more hilarious: If you enable debug text output via Option → Other Options → SMF → Insert text events to setup measures and export these imported settings back into a MIDI file, GSAE not only retains these invalid Reverb Macro IDs, but stringifies them via a simple lookup into a hardcoded string pointer array, again without any bounds checks. The effects of this are roughly what you would expect:
Reverb Macro IDs between 8 and 27 simply insert wrong strings from adjacent string pointer arrays
Reverb Macro 28 crashes GSAE
Reverb Macro 64 causes GSAE to vomit 65,512 bytes of garbage into the MIDI file
In the end, we have Domino not decoding the Reverb Macro message, and GSAE, the premier SysEx tool for Roland synths, responding to it in even more undefined and clearly bugged ways than real hardware apparently does. That's two programs confirming that whatever ZUN intended was never supposed to work reliably. And while we still don't know exactly what these reverb parameters are supposed to be, these observations solve the mystery as far as I'm concerned, and solidify my personal opinion on the matter.
So what do we do now, and which version do we go with? Optimally, I'd offer both versions and turn this controversy into a personal choice so that everybody wins… and Ember2528 agreed and generously provided all the funding to make it happen. 💸
If you haven't picked your favorite yet, here are some final arguments:
The Romantique Tp recordings certainly have something going for them with their provenance of coming from real hardware, and the care that Romantique Tp put into manually recording every single track, warts and all. I wholeheartedly agree that preserving the raw sound of playing the MIDI files into the hardware without thinking about bugs or quirks is an important angle to take when it comes to preservation. It's good that these recordings exist – after all, you wouldn't know which musical elements you'd possibly be missing in an emulation if you have nothing to compare it to. Even the muffled sound in the half-speed clip above can be an argument in their favor, as the SC-88Pro's DAC operates at 32 kHz and you wouldn't expect any meaningful frequency content between 16,000 and 22,050 Hz to begin with. Any frequency content in that range that does remain in Romantique Tp's recording is simply 📝 rolled-off imaging noise added during the ADC's resampling process.
All this is why they are a definite improvement over kaorin's 2007 recordings of only the AST, which used to be the reference recordings within the community. Those had all of the same timing issues and more, in addition to being so excessively volume-boosted that 0.15% of the samples across the entire soundtrack ended up clipped. That's 6.25 seconds out of 68:39 min being lost to pure digital noise.
Most importantly though: ZUN himself said that only the real SC-88Pro will play back these files as he intended them to sound. This quote is likely where the tagline of Romantique Tp's entire recording project came from in the first place:
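> 全てのデエタはSC-88ProもしくはSC-8850(ロオランド社)にて最適に聴けるように調整してあります
> それ以外の音源でも、作者の意図した音ではない場合があります。
> (English: "All of the data has been tuned to sound best on the SC-88Pro or SC-8850 (Roland). On any other sound module, it may not sound the way the author intended.")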
— ZUN on 東方幻想的音楽, his old MIDI page
However. ZUN is not exactly known for accurately and carefully preserving the legacy of his series, or really doing anything beyond parading his old games as unobtainable showpieces at conventions. With all the issues we've seen, preferring real hardware is ultimately just that: an angle, and a preference. This is why I disagree with the heavy and uncritical advertising that is mainly responsible for elevating the Romantique Tp recordings to their current reference status within the community, especially if at least half of the alleged superiority of real hardware is founded on undefined behavior that can easily be fixed in the MIDI files themselves if people only bothered to look.
Here's where I stand: MIDI files are digital sheet music first and foremost, not an inferior version of tracker modules where the samples are sold separately. As such, the specific synth a MIDI file was written for is merely a secondary property of the composition – and even more so if the MIDI file contains little to nothing in terms of sound design and mostly restricts itself to the basic feature set of General MIDI. In turn, synth quirks and bugs are not a defined part of the composition either, unless they are clearly annotated and documented in the file itself. And most importantly: If the MIDI file specifies a certain timing and a recording fails to reproduce that timing, then that recording is not an accurate representation of the MIDI file.
In that regard, Sound Canvas VA is not only the closest alternative to the real thing, as a few people in the MIDI and retrogaming scene do have to admit, but superior to the real thing. I'll gladly take clarity and perfect timing accuracy in exchange for minor differences in effects, especially if the MIDI file does not explicitly and correctly define said effects to begin with. If I want a panning delay as part of the reverb, I add the respective and correct SysEx message to define one – and if I don't, I do not care about the reverb. You might still get a panning delay on a certain synth, and you might even prefer how it sounds, but it's ultimately a rendering artifact and not a consciously intended part of the composition. In that way, it's similar to the individual flavor a musician adds to a performance of a piece of classical music.
And as far as the differences in frequency response and resonant filters are concerned: In Yamaha land, these are exactly the main distinguishing factors between vintage WF-192XG sound cards (resembling the real SC-88Pro in these characteristics) and the S-YXG50 softsynth (resembling SCVA). Once I found out about that softsynth and how much clearer it sounded in comparison, I sold that old PCI sound card soon after.
In the interest of preservation though, there's still one more unexplored solution that could be the ideal middle ground between the two approaches:
Play the MIDIs through a real-hardware SC-88Pro again
Capture the actually observed system-exclusive settings that fall within the synth's supported and documented ranges
Insert them back into the MIDI file, creating a new bugfixed version
Re-record that bugfixed version through Sound Canvas VA
Edit (2024-03-10): And since Romantique Tp has confirmed what exactly happens on real hardware, I'm going to do exactly that. These bugfixed Sound Canvas VA renderings will be a free bonus of the single next Shuusou Gyoku push, and will add another angle to the preservation of these soundtracks. In the meantime though, the Sound Canvas VA packs will sound like they do in the preview videos above.
Just to be clear: I'm not suggesting that Romantique Tp should have been the one to cut their recordings into loops, or even just the one who defined where the loop points are supposed to be. On the surface, this seems to be a non-issue, and you'd just pick a point wherever each track appears to loop, right? But with 39 MIDIs to cut and all the financial support from Ember2528, it made sense to also solve this problem more thoroughly, and algorithmically detect provably correct loop points for all of these files. Who knows, maybe we even find some surprises that make it all worth it?
This is the algorithm I came up with:
At a basic level, we loop over the list of MIDI events and return the earliest and longest subrange that is immediately followed by an identical copy.
MIDI players, however, need loop point definitions that use MIDI pulse units rather than event list indices. This is especially necessary for multi-track/SMF Type 1 sequences, which would otherwise require one loop start/end index pair per track, and then it still wouldn't work because some of the tracks might not even have an event at the loop start/end point. This requires the detection algorithm and the player to agree on how to map event indices to time points and back, and simply going for the first event of each pulse (i.e., any event with a nonzero delta time) makes the most sense here. In turn, we can skip any potential start or end events that have a delta time of 0, speeding up the algorithm significantly for typical compositions with a high degree of polyphony.
Naively considering just the raw MIDI events works for MIDI playback. But as soon as we want to cut a recording based on the detected loop points, we need to account for the fact that MIDI playback is inherently stateful. Each of the 16 channels at the protocol level features at least the 128 continuous controllers (CCs) with a 7-bit state, the 14-bit pitch bend controller, and the 7-bit instrument program value, in addition to the global tempo of the piece. As a result, two ranges of events might look identical, but can still sound different if the events before the first range changed one piece of state which is then only touched again near the end of that range. This requires us to track the full MIDI state at both the start and end of a loop, and reject any potential loop that differs in these states:
In this example, a naive event-level scan would detect a loop between beats 3 and 6 as the same events are immediately repeated between beats 6 and 9. However, the piece starts with the first four notes at a channel volume of 50, which is only set to its later value of 100 on beat 5. Therefore, the actual loop ranges from beat 5 to 8. In turn, the piece needed to be at least 11 beats long to include the full second copy of the looped events and prove the loop as such.
This check can be a bit too strict in some cases, though. A channel might start with one of its CCs at a specific value but then change the same CC to a different value at a later point before playing the first note. In such a case, the detected loop would be delayed to the second CC change even though the initial CC value has no impact on the sound. By filtering these redundant CC changes, we get to move the loop start point of a few tracks (original 夢機械 ~ Innocent Power and arranged 魔法少女十字軍) back by a few seconds, to the position you'd expect.
Finally, we reject any overlong loops that themselves fully consist of multiple successive copies of the first N events.
Shuusou Gyoku's original MIDI files hide the original game's lack of MIDI looping by simply duplicating the looping sections enough times so that a typical player won't notice. The algorithm we have so far, however, would return a much longer loop if a MIDI file contains more than three successive copies of a looping section. The original version of ハーセルヴズ in particular repeats its 8 looping bars a total of 15 times before the MIDI ends, and this condition is necessary to detect the actual 8-bar loop instead of a 56-bar one.
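The event-level part of these steps can be sketched in a few lines of Rust. This is a deliberately naive version under several assumptions: it works on a single flattened track, the `Event` type with its opaque `data` byte stands in for real MIDI messages, all names are made up, and the state tracking from the second step is left out entirely.

```rust
/// One event of a single-track sequence. `data` stands in for the actual
/// MIDI message; both field names are made up for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Event {
    delta: u32, // pulses since the previous event
    data: u8,
}

/// Absolute pulse position of the event at `index`, which must be in bounds.
fn pulse_at(events: &[Event], index: usize) -> u64 {
    events[..=index].iter().map(|e| u64::from(e.delta)).sum()
}

/// True if `range` consists of two or more back-to-back copies of a shorter
/// prefix, which marks it as an overlong loop.
fn is_repetition(range: &[Event]) -> bool {
    (1..=range.len() / 2)
        .filter(|&n| range.len() % n == 0)
        .any(|n| range.chunks(n).all(|chunk| chunk == &range[..n]))
}

/// Returns `(start, len)` of the longest range of events that is immediately
/// followed by an identical copy, preferring the earliest one among ranges of
/// equal length.
fn find_loop(events: &[Event]) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None;
    for start in 0..events.len() {
        // Loops may only start on the first event of a pulse. The loop end
        // is a copy of the start event, so it passes this check as well.
        if events[start].delta == 0 {
            continue;
        }
        // Later candidates only win if they are strictly longer.
        let min_len = best.map_or(1, |(_, len)| len + 1);
        let max_len = (events.len() - start) / 2;
        for len in (min_len..=max_len).rev() {
            let first = &events[start..start + len];
            let second = &events[start + len..start + 2 * len];
            if first == second && !is_repetition(first) {
                best = Some((start, len));
                break;
            }
        }
    }
    best
}
```

For a sequence made of a two-event intro followed by four copies of a three-event section, `find_loop()` returns `Some((2, 3))`: the doubled six-event candidate compares equal as well, but gets rejected by `is_repetition()`. Passing the start index and the end index (`start + len`) to `pulse_at()` then gives the loop points in the pulse units that a player needs.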
Of course, this algorithm isn't perfect and won't work for every MIDI file out there. It doesn't consider things like differently ordered events within the same MIDI pulse, (non-)registered parameter numbers, or the effect that SysEx messages can have on the state of individual channels. The latter would require the general SysEx decoding logic that I would have liked to have for the research above… actually, let's add an issue and add the project to the order form. I'd really like to see a comprehensive open-source cross-vendor SysEx decoder library in my lifetime.
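As for the state check that the algorithm does perform, here is a minimal sketch of the idea, again with made-up types and names. It takes one snapshot just before the loop start event and one just before the loop end event, and only accepts the loop if both are equal. The initial controller values are placeholders, and how events on the same pulse as a loop boundary should be treated is a detail this sketch glosses over.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Msg {
    Note { channel: u8, key: u8 },
    Controller { channel: u8, cc: u8, value: u8 },
    ProgramChange { channel: u8, program: u8 },
    PitchBend { channel: u8, value: u16 },
    Tempo { us_per_beat: u32 },
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct ChannelState {
    cc: [u8; 128],
    pitch_bend: u16,
    program: u8,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct MidiState {
    channels: [ChannelState; 16],
    us_per_beat: u32,
}

impl MidiState {
    fn new() -> Self {
        // Placeholder defaults: all CCs at 0, centered pitch bend, program 0.
        let channel = ChannelState { cc: [0; 128], pitch_bend: 0x2000, program: 0 };
        // 500,000 µs per beat (120 BPM) is the Standard MIDI File default.
        Self { channels: [channel; 16], us_per_beat: 500_000 }
    }

    fn apply(&mut self, msg: Msg) {
        // Masking keeps out-of-range channel and CC numbers from panicking.
        match msg {
            Msg::Note { .. } => {} // notes don't change the tracked state
            Msg::Controller { channel, cc, value } => {
                self.channels[usize::from(channel & 0x0F)].cc[usize::from(cc & 0x7F)] = value;
            }
            Msg::ProgramChange { channel, program } => {
                self.channels[usize::from(channel & 0x0F)].program = program;
            }
            Msg::PitchBend { channel, value } => {
                self.channels[usize::from(channel & 0x0F)].pitch_bend = value;
            }
            Msg::Tempo { us_per_beat } => self.us_per_beat = us_per_beat,
        }
    }
}

/// State after applying all events before `index`.
fn state_before(events: &[Msg], index: usize) -> MidiState {
    let mut state = MidiState::new();
    events[..index].iter().for_each(|&msg| state.apply(msg));
    state
}

/// A loop candidate is only valid if the state at its start matches the
/// state at its end.
fn loop_state_matches(events: &[Msg], start: usize, end: usize) -> bool {
    state_before(events, start) == state_before(events, end)
}
```

Take a sequence that sets the channel volume (CC 7) to 50, and then plays the pattern "note, note, CC 7 = 100, note" three times. The first two copies of the pattern are identical on the event level, but the first one starts at volume 50 and the second one at volume 100, so `loop_state_matches()` rejects that candidate and only accepts the one that starts with the second copy.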
As for the implementation, I was happy to write some Rust again for a change, as it's a great fit for these standalone greenfield command-line tools that don't have to directly interact with the legacy C++ code bases that this project usually deals with. It's even better if the foundational functionality is not just available in a crate, but in four, with the community already having gone through multiple iterations to arrive at a tried and tested winner. Who knows, maybe I even get to rewrite this website in it one day? Just for the sheer meme value of doing so, of course.
I also enjoyed this a lot from a technical point of view:
You might think that Rust's typical safety guarantees don't matter for the problem at hand. But then you accidentally write -= instead of += for a u32 that starts out at 0, and Rust immediately panics (at least in a debug build, where integer overflow checks are enabled by default) instead of silently underflowing to u32::MAX. This must have saved me at least 5 minutes of debugging the resulting logic error.
As it turns out, my loop detection algorithm is embarrassingly parallel. You might initially think about it in a sequential way because we always want the earliest occurrence of the longest repeating section of MIDI events, which means that each new loop candidate further into the track has to be longer than the previous one. But since we always iterate over the entire MIDI, it makes perfect sense to divide and conquer the problem. Let's split the list of possible loop end points into equal chunks, scan them all in parallel for the earliest and longest loop within that chunk, and then pick the earliest and longest loop among those intermediate results as the final one. In Rust, you don't even have to think much about the chunks, as all of that can be easily done by replacing the iteration with Rayon's parallel fold and adding a reduce() with the same condition for the final step. This sped up the algorithm by a factor equal to the number of cores in my system.
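The chunk-and-reduce structure can be sketched without any dependencies. Note that this is not the Rayon-based code described above: it swaps in `std::thread::scope` from the standard library and does the chunking by hand, purely to keep the example self-contained. The `scan` closure is a stand-in for the per-end-point loop search, and all names are made up.

```rust
use std::thread;

/// A loop candidate as `(start index, length)`.
type Loop = (usize, usize);

/// Longer loops win; among equally long ones, the earlier start wins.
fn better(a: Option<Loop>, b: Option<Loop>) -> Option<Loop> {
    match (a, b) {
        (Some(x), Some(y)) => {
            Some(if y.1 > x.1 || (y.1 == x.1 && y.0 < x.0) { y } else { x })
        }
        (x, None) => x,
        (None, y) => y,
    }
}

/// Splits `end_points` into roughly equal chunks, folds each chunk on its
/// own thread, and reduces the per-chunk results with the same condition.
fn find_best_parallel<F>(end_points: &[usize], threads: usize, scan: F) -> Option<Loop>
where
    F: Fn(usize) -> Option<Loop> + Sync,
{
    let threads = threads.max(1);
    let chunk_len = ((end_points.len() + threads - 1) / threads).max(1);
    let scan = &scan;
    thread::scope(|s| {
        let handles: Vec<_> = end_points
            .chunks(chunk_len)
            .map(|chunk| {
                // The "fold" step: best candidate within this chunk.
                s.spawn(move || chunk.iter().fold(None, |best, &end| better(best, scan(end))))
            })
            .collect();
        // The "reduce" step: best candidate among all chunks.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("scan thread panicked"))
            .fold(None, better)
    })
}
```

Because `better()` is used for both the per-chunk fold and the final reduction, the result does not depend on how many chunks the end points are split into, as long as the chunks are visited in order during the reduction.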
This algorithm works well for the long MIDI files of Shuusou Gyoku's OST that all contain multiple duplicates of their loop section, but it quickly reaches its limit with the AST. Following the classic two-loop + fade-out format, that soundtrack was meant to be played back in generic MIDI players, and not to actually be put back into the game in looped form. Since the loop algorithm did, in fact, find inconsistencies even in the OST, two copies of the apparent loop are sometimes not enough to prove cases where the actual loop ends much later than you think it does. In a few cases, it would be enough to simply remove all volume change events from the fade-out to prove the actual loop, but in others, the algorithm would need MIDI event data far past the end of the fade-out.
However, just giving up and not looping any of these tracks would be equally unfortunate. So how about shifting the question, from "what's the best loop in this MIDI file" to "what's the best loop if the MIDI didn't fade out and instead repeated its apparent second loop a third time"? As long as the detected loop in such a pre-processed file ends before the repeated range, it's still a valid loop in terms of the unmodified original.
Ideally, we want to do this pre-processing programmatically with the same Rust library instead of manually editing the MIDI. Many sequencers (and especially XGworks) apply significant changes to a MIDI file's internal structure when saving its internal representation back to a MIDI file, which might even mess with our loop algorithm. So it would be very nice to have a more trustworthy tool that applies only the edit we actually want, and perfectly retains the rest of the MIDI.
And that's how this sub-project turned into a small suite of command-line MIDI operations in the classic Unix filter/pipeline style: Each command reads a MIDI file from stdin, transforms it, and outputs text or the resulting MIDI file on stdout. This way, we gain maximum transparency and reproducibility as I can document the unique pre-processing steps for each AST track by simply providing the command lines. And sure, we're re-encoding and re-decoding the full MIDI sequence at every step along such a pipeline, but computers are fast, Rust and the midly library in particular are ⚡ blazingly fast ⚡, and the usability benefits of this pipeline model far outweigh any theoretical performance drops.
Here's the full list of commands that made it into the resulting mly tool:
cut: Extremely basic removal of MIDI events within a certain range.
dump: Dumps all MIDI events into a textual table. All event lists in this blog post are based on this output.
duration: Shows the duration of a MIDI file in pulses, beats, seconds, and PCM samples.
filter-note: Removes all Note On events within a certain range, retaining all other events. This allows us to generate separate intro and loop MIDIs, whose renderings we can then splice back into a single loopable waveform with no discontinuities, which is not guaranteed when rendering a single MIDI file. This provides the last missing piece needed for rendering perfect, sample-accurate loops through Sound Canvas VA.
loop-find: The loop detection algorithm described above.
loop-unfold: Duplicates MIDI events from a given point to the end of the track. A budget solution for the problem of creating synthetic loops – arbitrary copying of arbitrary subranges to arbitrary destinations would have been undeniably nicer, but also much more complex, and I didn't need that full flexibility for the task at hand.
smf0: Flattening multi-track/SMF Type 1 MIDI sequences into single-track/SMF Type 0 ones. Having this conversion as a distinct operation in our toolset allows other operations to exclusively support SMF Type 0 if a Type 1 implementation would either take significant additional effort or just duplicate the Type 0 flattening algorithm. This group of operations includes loop-find, cut, and even the real-time output for duration because tempo events can theoretically occur on any track.
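To illustrate what the smf0 flattening boils down to, here's a rough C++ sketch under simplifying assumptions: the `Event` struct with its `label` field is a made-up stand-in for a real MIDI message, and the sketch ignores the per-track End of Track meta events that a real conversion would have to collapse into a single one. It is not mly's actual code.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical minimal event representation.
struct Event {
	std::uint32_t delta; // Pulses since the previous event on the same track
	std::string label;   // Stand-in for the actual MIDI message
};
using Track = std::vector<Event>;

Track flatten(const std::vector<Track>& tracks)
{
	struct Timed {
		std::uint64_t time; // Absolute time in pulses
		const Event* event;
	};

	// Step 1: Convert per-track delta times to absolute times.
	std::vector<Timed> timed;
	for(const auto& track : tracks) {
		std::uint64_t time = 0;
		for(const auto& event : track) {
			time += event.delta;
			timed.push_back({ time, &event });
		}
	}

	// Step 2: Merge by absolute time. The stable sort keeps the original
	// order of simultaneous events, with lower track numbers first.
	std::stable_sort(
		timed.begin(), timed.end(), [](const Timed& a, const Timed& b) {
			return (a.time < b.time);
		}
	);

	// Step 3: Convert back to delta times on the single output track.
	Track ret;
	std::uint64_t prev = 0;
	for(const auto& t : timed) {
		ret.push_back({
			static_cast<std::uint32_t>(t.time - prev), t.event->label
		});
		prev = t.time;
	}
	return ret;
}
```

Once every event lives on a single track with a single timeline, operations like loop-find and cut only have to reason about one list of delta times.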
This feature set should strike a good balance between not spending too much of the Shuusou Gyoku budget on tangential problems and still offering a decent solution for the problem at hand. As a counterexample, the obvious killer feature – deserializing a dump back into a Standard MIDI File – would have gone way past the budget. While there are crates that free you from the need to write manual parsing code for basic data structures, they would instead require a lot of attribute boilerplate – and if the library that provided the structures doesn't already come with these attributes, you now have to duplicate all the structures, and convert back and forth between the original structures and your copies. Not to mention that we'd still have to write code for the high-level structure of the dump output…
If we put it all together, this is what we can do:
The best loop found in the raw MIDI file spans 4 events and 200 milliseconds. Clearly, this is not the loop we're looking for.
Let's cut off all events from the start of the fade-out to the end, do a loop-unfold copy of all events from the position during the apparent second loop that corresponds to where the fade-out started, and try looking for a loop in that modified MIDI.
The resulting loop is 1:31m long, which is exactly what we were hoping to find.
The note space loop represents the earliest possible event range with equivalent per-channel controller and pitch bend state at both ends. This loop is only appropriate for MIDI players, as its bounds can fall into the middle of notes that are played with a different channel state at the start and end of the loop. This is why it doesn't show any sample positions.
The recording space loop ensures that this doesn't happen. It's also always placed on a Note On event with non-zero velocity, which eases the splicing of separate filter-note recordings. This way, it's enough to remove leading silence from the loop part and mix it exactly at the indicated sample position.
The detected loop is also nowhere close to the cut point at beat 466, matching our condition for validity. All events within the loop came from ZUN's original composition, and the cut/loop-unfold combo merely provided the remaining 63% of events necessary to prove this loop as such.
So, where are these loop quirks that justify why some of these audio files are longer than you'd think they should be? Just listing them as text wouldn't really communicate just how minor these are. It would be much nicer to visualize them in a way that highlights the exact inconsistencies within a fixed range of MIDI measures. Screenshots of MIDI sequencer or DAW windows won't capture these aspects all too well because these programs are geared toward fine-grained editing of single tracks, not visualization of details across all channels.
REAPER's piano roll nicely snaps to a certain range, but good luck picking out the individual lines from the single volume lane at the bottom of the screen, or spotting a 7-point difference. Not to mention that CC #11 (Expression) makes up an equal part of a channel's final perceived volume, which is the metric we'd actually want to visualize.
Typical MIDI visualizers, however, are on the complete opposite end of the spectrum. In recent years, MIDI visualization has become synonymous with the Synthesia style of YouTube videos with a big keyboard at the bottom, note bars flying in from the top, and optional fancy effects once those notes hit the top of the keyboard. The Black MIDI community has been churning out tons of identical-looking MIDI visualizers that mainly seem to differ in the programming language they're written in, and in how well they can cope with the blackest of black MIDIs.
Thankfully, most of these visualizers are open-source and have small and manageable codebases. The project with the most GitHub stars and the most generic name seemed to be the best starting point for hacking in the missing features, despite using GLSL shaders which I had no prior experience with. It was long overdue that I did something with GLSL though – it added a nice educational aspect to these hacks, and it still was easier than deciphering whatever the fastest and hyper-optimized Rust visualizer is doing.
Still, this visualizer needed a total of 18 small features and bugfixes to be actually usable for demonstrating Shuusou Gyoku's loop quirks. As such, these hacks turned into yet another tangential sub-project that could have easily consumed another two pushes if I cleaned up the code and published the result. But that would have really gone way past the budget for something that people might not even care about. So here's what we're going to do:
I've added this MIDI visualizer as a new goal to the order form. This goal is eligible for microtransactions, so you don't have to fund a full push to see the first changes committed and released.
The upstream project seems to have been abandoned recently, which is the perfect excuse for not even trying to merge in my sweeping changes with a series of pull requests. The code sure needs a lot of cleanup and deduplication, and especially a more build system-friendly way of embedding its shader source code.
Every backer who supports this goal with at least 0.1 pushes or microtransactions will get a Windows binary with my current hacked-in changes as a preview, immediately after the purchase. Shoutout to the MIT license for letting me do this 😛
As usual, once the code is done, the final cleaned-up version will be available for free for everyone, in both source code and binary release form.
Alright then! Here's how to read the visualizations:
The opacity of each note represents its velocity multiplied by the channel volume and expression. To spot volume inconsistencies, you'd compare the opacity of equivalent notes in the two ranges.
The X-axis of these visualizations uses linear/real time, so the width of each measure represents the exact time it takes to be played relative to the other measures in the visualized range. To spot tempo inconsistencies, you'd compare the distance between the bar lines.
Notes that are duplicated on two or more channels may be colored differently in the loop start and end views. These are rendering order inconsistencies and don't communicate anything about the MIDI.
Stage 1 theme (フォルスストロベリー), original and arranged version: The string and harmonica channels are slightly louder on the apparent first loop than on the others.
Apparent loop:
0:01m – 1:31m
Actual loop:
1:04m – 2:34m
Mei and Mai's theme (ディザストラスジェミニ), arranged version: The one and only quirk that's caused by different notes – the first loop has an E♭ on the slap bass channel in measure 32, but the second loop has a G♭ in the corresponding measure 72.
Apparent loop:
0:01m – 1:02m
Actual loop:
0:50m – 1:51m
Stage 3 theme (華の幻想 紅夢の宙), original and arranged version:
The trumpet channel starts out panned to the center of the stereo field (64), before being left-panned by 25% (48) at 1:04m, where it stays for the rest of the track.
Apparent loop:
0:01m – 1:29m
Actual loop:
1:04m – 2:32m
I didn't come up with a good way of visualizing panning in a 2D plane, so you have to trust your ears with this one.
Marie's theme (機械サーカス ~ Reverie), arranged version: Every apparent loop modulates up by a semitone 16 measures before it ends, and remains in that new key at the start of the next loop, so the piece technically doesn't loop at all. The original stays in G♯m throughout.
Stage 5 theme (カナベラルの夢幻少女), original version: The ritardando near the supposed end of the first loop drops from 145 BPM to 118 BPM, but only to 129 BPM in all further loops.
Apparent loop:
0:01m – 1:39m
Actual loop:
1:33m – 3:11m
Yup, that means that the intro part technically makes up almost the entire apparent loop. ZUN replaced the ritardando with instant tempo changes in the arranged version, which moves the loop to its expected place at the start of the track.
The loop start and end points are in the respective next measure past this range.
Stage 6 theme (アンティークテラー), arranged version: The string channel starts out with the maximum expression of 127, but then only goes up to 120 after some fading notes later in the piece, where it stays for the beginning of the second loop.
Apparent loop:
0:01m – 1:53m
Actual loop:
0:13m – 2:05m
Same here.
VIVIT-captured-'s first theme (夢機械 ~ Innocent Power), arranged version: Has a unique ending section that starts in Gm and then modulates through Em and Fm before it fades out on F♯m.
VIVIT-captured-'s second theme (幻想科学 ~ Doll's Phantom), original and arranged version: Another fade-related 127 vs. 120 expression inconsistency, this time on the orange square channel.
Apparent loop:
0:01m – 1:32m
Actual loop:
1:03m – 2:34m
VIVIT-captured-'s third theme (少女神性 ~ Pandora's Box), original and arranged version: Another tempo inconsistency: A slightly differently shaped ritardando before the bell tree hit in the supposed first loop.
Marisa's theme (魔女達の舞踏会), arranged version: Has a unique 8-bar ending section that is first played in Cm and then loops in C♯m while fading out.
Ending theme (ハーセルヴズ), arranged version: Probably the best-known one out of these, and I'm talking of course about the beautiful ending section. I'm making the executive decision to not loop this track in-game, and letting it fade to silence instead.
Before we package up these looped soundtracks, let's take a quick look at how they would be shown off in the Music Room. The Seihou Music Rooms carry over the per-channel keyboards from TH05, add the current per-channel volume, expression, and pan pot values, and top it off with a fake spectrum analyzer. All of these visualizations rely on MIDI data, and the Music Room would feel very dull and boring without them. Just look at Kioh Gyoku, whose Music Room basically turns into a still image in WAVE mode.
Retaining these visualizations even when playing waveform BGM was very important for me, and not just because it would make for a unique high-quality feature that would break new ground. It can also double as proof that the waveform versions are, in fact, in perfect sync with both the MIDIs they are based on, and, by extension, the respective stage scripts.
However, this would require the game to process the MIDIs and update the internal visualization state without simultaneously playing them back through the WinMM / MME / midiOut*() API. And just like graphics and text rendering, Shuusou Gyoku's original code came with zero architectural separation between platform-independent processing logic and platform-specific playback…
So I accidentally rewrote almost the entire MIDI code to achieve said separation. This also provided a great occasion to modernize this code and add some much-needed robustness for potential MIDI mods, while retaining the original code's approach of iterating over raw SMF byte streams. It might all have been very excessive for a delivery that was supposed to be just about waveform BGM support, but on the plus side, MIDI output is now portable to any other system's MIDI API as well.
Surprisingly though, it was Shuusou Gyoku's original MIDI timing that quickly turned out to be rather inaccurate, and not the waveforms. The exact numbers vary depending on the piece, but the game played back every MIDI about 1% slower than notated, adding about 2 or 3 seconds to their total playback time after 5 minutes. Tempo changes in particular were the biggest causes of desynchronizations with the waveforms…
To understand how this can happen to begin with, we have to look closer at how you're supposed to use the midiOut*() API. This API is as low-level as it gets, only covering the transmission of a single MIDI message to the selected output device right now. There is no concept of note timing at this low level, so it's completely up to the program to parse delta times and tempo change events out of the MIDI file and correctly time the calls to this API for each MIDI message. With all the code that runs between the API and the actual renderer of the synth for every single message, the resulting timing can only ever be an approximation of the MIDI file. This doesn't really matter for the timescales and polyphony levels of typical music because, again, computers are fast, but such an API is fundamentally unsuitable for accurately playing back even just a moderately complex million-note Black MIDI.
Shuusou Gyoku handles this required manual timing in the simplest possible way: It runs a MIDI processing function (Mid_Proc() in the code) at an interval of 10 ms, which processes and instantly sends out all MIDI events that have occurred at any point within the last 10 ms, maintaining merely their order. This explains not only why the original game incremented its MIDI TIMER by multiples of 10, but also the infamous missing drums when playing the soundtrack through the Microsoft GS Wavetable Synth:
ZUN reduced all drum notes to the minimum possible length allowed by the 480 PPQN pulse resolution of these MIDI files.
In regular music notation, this corresponds to 1/1920th notes.
While the exact real-time length in purely mathematical terms depends on the tempo of a piece, it only has to be ≥13 BPM for a 1/1920th note to be shorter than 10 ms.
Therefore, the higher the BPM, the higher the chance that both a drum note's Note On and Note Off messages are sent within the same call to Mid_Proc(), with the respective two midiOut*() API calls only being at best a two-digit number of microseconds apart.
So it only makes sense why cheap MIDI synths that don't even respond to reverb or release time messages completely drop any note with such a short length. After all, at a sampling rate of 44,100 Hz, a note would have to be at least 22.7 µs long to be represented by even a single PCM sample.
This also extends to the visualizations above, and was the reason why I chose to render all drum notes as fixed-size diamonds. Otherwise, they would barely be visible.
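The arithmetic behind these numbers is simple enough to check in a few lines of C++. This is just a worked example; the function names are made up for illustration and don't appear anywhere in the game's code:

```cpp
#include <cassert>

// Real-time duration of a single MIDI pulse, in microseconds.
constexpr double pulse_duration_us(double bpm, unsigned int ppqn)
{
	assert((bpm > 0.0) && (ppqn > 0));
	return ((60'000'000.0 / bpm) / ppqn);
}

// Lowest integer BPM at which a single pulse is shorter than `limit_us`.
constexpr unsigned int min_bpm_for_pulse_below(
	double limit_us, unsigned int ppqn
)
{
	unsigned int bpm = 1;
	while(pulse_duration_us(bpm, ppqn) >= limit_us) {
		bpm++;
	}
	return bpm;
}

// Duration of a single PCM sample, in microseconds.
constexpr double sample_duration_us(double sample_rate)
{
	return (1'000'000.0 / sample_rate);
}

// At 480 PPQN, a single pulse drops below Mid_Proc()'s 10 ms interval as
// soon as the tempo reaches 13 BPM.
static_assert(min_bpm_for_pulse_below(10'000.0, 480) == 13);
```

At a more typical 120 BPM, `pulse_duration_us(120, 480)` comes out at roughly 1,042 µs, so a one-pulse drum note is almost ten times shorter than the 10 ms processing interval.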
But while sending MIDI events in such quantized chunks might not be perfect, it can't be the cause behind multi-second playback slowdowns. Instead, this issue has to boil down to the way Shuusou Gyoku times each individual message, and specifically how it converts between MIDI pulse units and real-time (milli)seconds. pbg's original MIDI code chose to do this in an equally confusing and inaccurate way: it kept two counters that tracked the current MIDI pulse before and after the latest tempo change, used the value of the latter counter to decide which events to process, and only added the pulse equivalent of 10 ms to this counter at the end of Mid_Proc() in the then current tempo. The commit message for my rewritten algorithm details the problems with this approach using nice ASCII art in case you're interested, but in short, the main problem lies in how the single final addition can only consider a single tempo change within each call to Mid_Proc(). If a MIDI file contains tempo ramps with less than 10 ms between each different tempo, the original game would only use the last of these tempo values as the basis for converting the entire 10 ms back into MIDI pulses. Not to mention that maybe MIDI pulses aren't the best unit in a game that still 📝 treats the FPU as lava and doesn't use any fixed-point means of increasing the resolution of the 10 ms→pulse division either…
On the contrary, it's much more accurate to immediately convert every encountered MIDI delta time to a real-time quantity and use that unit for event timing, especially if we want to restrict ourselves to integer math. Signed 64-bit integers are enough to fit the product of the slowest possible MIDI tempo ((2²⁴ - 1) µs per quarter note) and the highest possible MIDI delta time (2²⁸ - 1) at nanosecond precision (10³), with one bit to spare. Then, we arrive at a much simpler timing algorithm:
Each simultaneously playing track gets a next event timer, starting out at 0
When looking at the next event, add the converted nanosecond value of its delta time to this timer
Subtract the equivalent of 10 ms from each track's timer at the beginning of the processing function
As long as the timer is ≤0, process and send the next message
The additive nature of this timer not only naturally allows more than one event to happen within a single Mid_Proc() call, but also averages out any minor timing inconsistencies across the length of a track.
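The four steps above can be sketched in C++ roughly like this. It's a simplified model rather than the game's rewritten code: the `Event` struct and `TrackTimer` class are hypothetical, the tempo is fixed at construction, and "sending" a message just appends its ID to a vector. A real implementation would additionally update the tempo whenever it processes a tempo change event.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical minimal event: a delta time plus an ID standing in for the
// actual MIDI message.
struct Event {
	std::uint32_t delta_pulses;
	int id;
};

class TrackTimer {
	std::vector<Event> events;
	std::size_t next = 0;
	std::int64_t timer_ns = 0; // Step 1: the timer starts out at 0
	std::int64_t tempo_us_per_quarter;
	std::int64_t ppqn;

	// (2^28 - 1) * (2^24 - 1) * 1000 still fits into a signed 64-bit integer.
	std::int64_t to_ns(std::uint32_t delta_pulses) const {
		return ((delta_pulses * tempo_us_per_quarter * 1000) / ppqn);
	}

public:
	// 500,000 µs per quarter note is the MIDI default tempo of 120 BPM.
	TrackTimer(
		std::vector<Event> track,
		std::int64_t pulses_per_quarter,
		std::int64_t tempo = 500'000
	) :
		events(std::move(track)),
		tempo_us_per_quarter(tempo),
		ppqn(pulses_per_quarter)
	{
		assert(ppqn > 0);
		if(!events.empty()) {
			// Step 2, for the very first event
			timer_ns += to_ns(events[0].delta_pulses);
		}
	}

	// Call at a regular interval, e.g. with 10'000'000 for 10 ms.
	void process(std::int64_t elapsed_ns, std::vector<int>& sent) {
		if(next >= events.size()) {
			return;
		}
		timer_ns -= elapsed_ns; // Step 3
		while(timer_ns <= 0) {  // Step 4
			sent.push_back(events[next].id);
			next++;
			if(next >= events.size()) {
				break;
			}
			timer_ns += to_ns(events[next].delta_pulses); // Step 2
		}
	}
};
```

Because the timer carries any leftover negative value over into the next event, an event that fired slightly late causes the following one to be scheduled correspondingly earlier, which is where the averaging effect comes from.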
assert(length_of_tempo_message == 3);
uint32_t tempo = 0;
for(int i = 0; i < length_of_tempo_message; i++) {
- tempo += ((tempo << 8) + (*track_data++));
+ tempo = ((tempo << 8) + (*track_data++));
}
Yup – the original code performed two additions per byte, which incorrectly added the interim value at every byte to the final result, and yielded a tempo that is ≈0.8% / ≈1 BPM slower than notated in the MIDI file, matching the number we were looking for. That's why the |/OR operator is the safer one to use in such a bit-twiddling context…
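To put numbers on this, here's a small standalone C++ comparison of both variants. The function names are made up for this example, and the fixed variant uses the | operator as suggested above. The byte sequence 07 A1 20 encodes 500,000 µs per quarter note, or 120 BPM:

```cpp
#include <cassert>
#include <cstdint>

// The original loop: += adds the shifted interim value on top of itself,
// effectively multiplying by 257 instead of 256 at every byte.
std::uint32_t read_tempo_buggy(const std::uint8_t* track_data)
{
	assert(track_data != nullptr);
	std::uint32_t tempo = 0;
	for(int i = 0; i < 3; i++) {
		tempo += ((tempo << 8) + (*track_data++));
	}
	return tempo;
}

// The corrected loop, using | to rule out any accidental carry.
std::uint32_t read_tempo_fixed(const std::uint8_t* track_data)
{
	assert(track_data != nullptr);
	std::uint32_t tempo = 0;
	for(int i = 0; i < 3; i++) {
		tempo = ((tempo << 8) | (*track_data++));
	}
	return tempo;
}
```

For those three bytes, the fixed loop returns 500,000, while the buggy one returns 503,752. That's ≈0.75% more microseconds per quarter note, or ≈119.1 BPM instead of 120.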
But now I'm curious. This is such a tiny bug that is bound to remain unnoticed until someone compares the game's MIDI output to another renderer. It must have certainly made it into other games whose MIDI code is based on Shuusou Gyoku's, or that pbg was involved with. And sure enough, not only did this bug survive Kioh Gyoku's OOP refactoring, but it even traveled into Windows Touhou, where it remained in every single game that supported MIDI playback. Now we know for a fact that pbg's Program Support role in the TH06 credits involved sharing ready-made, finished code with ZUN:
The broken tempo deserialization in the respective latest full versions of TH06 through TH10. And yes, that's TH10 – even though TH09's trial version was the last game to ship MIDI versions of its soundtrack, TH10 still contained all of pbg's MIDI code that originated back in Shuusou Gyoku, before TH11 finally removed it.
Amusingly, ZUN's compiler even started optimizing the combination of left-shifting and addition to a multiplication with 257 for TH09, which even sort of highlights this bug if you're used to reading x86 ASM.
That leaves support for MIDI loop points as the only missing feature for syncing MIDI data with a looping waveform track. While it didn't require all too much code, pbg's original zero-copy approach of iterating over raw MIDI data definitely injected a lot of complexity into the required branches. Multi-track/SMF Type 1 files require quite a bit of extra thought to correctly calculate delta times across loop boundaries that reach past the end of the respective track, while still allowing the real-time delta values to be resynchronized at tempo changes within the loop – and yes, 3 of ZUN's 19 arranged MIDI files actually do use more than one track, so this wasn't just about maximizing MIDI compatibility for mods. I stuck to the original approach mostly as a challenge and to prove that it's possible without first parsing the entire MIDI sequence into a friendlier internal representation, but I absolutely do not recommend this to anyone else.
After hardcoding the loop points detected by mly into the binary, we only need to call Mid_Proc() once per frame in the Music Room and pass the frame delta time instead of the 10 ms constant. And then, we get this:
The MIDI TIMER now shows off the arguably more interesting current MIDI pulse value rather than just formatting the PASSED TIME in milliseconds. Ironically, displaying this value in a constantly counting way takes more effort now – the new nanosecond-based timing code doesn't use any measure of total MIDI pulses anymore, and they don't naturally fall out of the algorithm either. Instead, the code remembers the total pulse value of the last event it processed and adds the real-time duration that has passed since, similar to the original timing algorithm.
This naturally causes the timer to jump from the loop end pulse to the loop start pulse, proving that Mid_Proc() is in fact looping the sequence.
Alright, now we know what to package:
We're going to have 8 BGM packs, one for each permutation of soundtrack (OST / AST), sound source (Romantique Tp / Sound Canvas VA), and codec (FLAC / Vorbis), making up 1.15 GiB of music data in total.
When looking at the package names, you will notice that I don't particularly highlight the FLAC versions as lossless. And for good reason – the Romantique Tp recordings had dithering and noise shaping applied to them, and the Sound Canvas VA versions will necessarily have to be volume-normalized and quantized to 16-bit during the conversion to FLAC. If we wanted a BGM pack with the actual raw Sound Canvas VA output, we'd have to implement WavPack support, which is the only lossless codec that supports 32-bit float – and even that codec could only compress these files down to 14 MiB per minute of music, or 508 MB for the entire original soundtrack. That's 1.4× the size of an equivalent thbgm.dat!
The whole packaging process will be complex enough to warrant a build system. I'd also like to generate an extensive README file for each package, not least to describe the Sound Canvas VA rendering and loop-cutting process in complete detail.
The AST packs need to bundle the MIDI files from ZUN's site for Music Room visualization. We might as well add a 9th MIDI-only AST pack then, as it will naturally fall out of the packaging pipeline anyway. Some people sure love their MIDI synths, after all.
The OST packs can fall back on the original game's MIDI files from MUSIC.DAT for their Music Room visualization, so there's no need to bundle those and infringe copyright. Ironically, the game will still require a MUSIC.DAT even if you use a BGM pack, if only for the one number in that file that says that Shuusou Gyoku's soundtrack consists of 20 tracks in total.
ZUN didn't arrange タイトルドメイド, so we need to copy the OST version recorded with the respective sound source into the AST pack.
Unfortunately, we still haven't reached the end of the complications and weird issues that haunt Shuusou Gyoku's music:
1) The original game reads the in-game track title directly out of the first Sequence Name event of the playing MIDI file. The waveform equivalent would be the Vorbis comment TITLE tag, which therefore should exactly match the original track's title, down to the exact placement of whitespace. As usual, if I emphasize minor things like this, it's not without reason: 幻想科学 ~ Doll's Phantom inconsistently uses halfwidth spaces at both sides of the ~, and wouldn't fit into the Music Room's limited space otherwise.
2) However, the AST MIDI files jam a bunch of other metadata into their Sequence Names, roughly following the format
【 $title 】 from 秋霜玉 for sc88Pro comp.ZUN
The track titles should definitely not appear in this format in-game, but how do we get rid of this format without hardcoding either the names or the magic to parse the names out of this format?
3) The absolute state of GS SysEx tooling rears its ugly head one final time in three of the AST MIDIs, which for some reason are missing the Roland vendor prefix byte in all of their SysEx messages and are therefore undeniably bugged. There even seemed to be another SysEx-related bug which Romantique Tp explained away, but not this one:
The irony of using invalid Reverb Macros within already invalid SysEx messages is not lost on me.
This is something we should fix even before running these files through Sound Canvas VA in order to render these with the reverb settings that ZUN clearly (and, for once, unironically) intended.
4) For perfect preservation of the original BGM/gameplay synchronicity, it makes sense for the waveform versions to retain the leading 1 or 2 beats of silence that the original MIDI files use for their SysEx setup. While some of the AST tracks use a slightly different tempo compared to their OST counterparts, they would still be largely in sync as ZUN didn't rearrange the layout of their setup area… except for, once again, the three tracks used in the Extra Stage. Marisa's and Reimu's boss themes aren't too bad with their 4 beats of setup, but シルクロードアリス takes the cake with a whopping 12 beats of leading silence. That's 5 seconds from the start of the Extra Stage to the first note you'd hear. 🐌
2) and 4) could theoretically be worked around in Shuusou Gyoku's MIDI code, but there's no way around editing the MIDI files themselves as far as 3) is concerned. Thus, it makes sense to apply all of the workarounds to the AST MIDIs as part of the BGM build process – parsing the titles out of the 【brackets】, inserting the Roland vendor prefix byte where necessary, and compressing the setup bars in the Extra Stage themes to match their OST counterparts. Adding any hidden magic to the MIDI code would only have needlessly increased complexity and/or annoyed some modder in the future who would then have to work around it.
Ideally, these edits would involve taking the mly dump output, performing the necessary replacements at a plaintext level, and rebuilding the result back into a MIDI file, bu~t we're unfortunately missing the latter feature. Luckily, someone else had the same idea 13 years ago and wrote a tool in C that does exactly what we need. Getting it to compile in 2024 only required fixing a typical C thing… why are students and boomers defending this antique of a language again? 🙄
The single most glaring issue, however, is the drastic difference in volume between the individual tracks in both soundtracks. While Romantique Tp had to normalize each track to the maximum possible volume individually as a consequence of the recording process, the Sound Canvas VA renderings reveal just how inconsistent the volume levels of these MIDI files really are:
The peak amplitudes of every track in both soundtracks, as rendered by Sound Canvas VA at maximum volume. Looking at these, you might think that kaorin's 2007 recordings were purposely trying to preserve the clipping that would come out of an SC-88Pro if you don't manually adjust the volume knob for each song, but those recordings are still much louder than even these numbers.
So how do we interpret this? Is this a bug, because no one in their right mind would want their music to clip on purpose, and that in turn means that everything about these volume levels is arbitrary and unintentional? Or is this a quirk, and ZUN deliberately chose these volume levels for compositional reasons? It certainly would make sense for the name registration theme.
Once again, the AST version of シルクロードアリス is the worst offender in this regard as well, but it might also provide some evidence for the quirk interpretation. The fact that almost all of its MIDI channels blast away at full volume might have been an accident that could have gone unnoticed if the volume knob of ZUN's SC-88Pro was turned rather low during the time he arranged this piece, but the excessive left-panning must have been deliberate. Even Romantique Tp agrees:
It might have even made compositional sense if Silk Road Alice was supposed to be a "Western-style piece", but it's not.
And that's with the volume already normalized. Because this one channel of this one track is almost twice as loud as anything else in the AST, we would consequently have to bring down the volume of every other arranged track and the right channel of the same track by almost 50% if we wanted to maintain the volume differences between the individual tracks of the AST. In the process, we lose almost one entire bit of dynamic range. At this rate, you might even consider remixing and remastering the entire thing, but that would involve so many creative decisions that it would definitely fall into fanfiction territory…
However, normalizing each track to a peak level of 0 dBFS makes much more sense for in-game playback if you consider how loud Shuusou Gyoku's sound effects are. Once again, the best solution would involve offering both versions, but should we really add two more SCVA BGM packs just to cover volume differences? ReplayGain solves this exact problem for regular music listening in a non-destructive way by writing the per-track and per-album gain levels into an audio file's metadata. Since we need metadata support for titles anyway, we can do something similar, albeit not exactly the same for two reasons:
ReplayGain is specified to target an average loudness (89 dB SPL in the original specification, −18 LUFS in ReplayGain 2.0), whereas we'd like to target a peak volume of 0 dBFS in order to always use the entire available digital scale. We've got some loud sound effects to compete with, after all.
ReplayGain expresses its gain values in dB, which is cumbersome to work with. In the realm of PCM, volume changes don't need to involve more than a simple multiplication, so let's go with a simple scalar GAIN FACTOR.
And so, we hard-apply the album-level gain during the conversion from 32-bit float to FLAC to preserve the volume differences between the tracks, calculate the track-level GAIN FACTOR based on the resulting peak levels, add a volume normalization toggle to the Sound / Config menu, enable it by default, and thus make everyone happy. ✅
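As a rough sketch of the math involved, here's how the album-level gain and the per-track GAIN FACTOR could be derived from 32-bit float samples in C++. All function names are made up for this example and have nothing to do with the actual BGM build pipeline; `db_to_factor()` is only there to show how a dB value would map to such a scalar.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

using Samples = std::vector<float>;

// Highest absolute sample value of a track.
float peak_of(const Samples& samples)
{
	float peak = 0.0f;
	for(const float sample : samples) {
		peak = std::max(peak, std::abs(sample));
	}
	return peak;
}

// Scalar that would raise the given peak to full scale (0 dBFS).
// Silence is left untouched.
float gain_factor_for(float peak)
{
	assert(peak >= 0.0f);
	return ((peak > 0.0f) ? (1.0f / peak) : 1.0f);
}

// Album-level gain: normalizes the loudest track of the album and thereby
// preserves the volume differences between all tracks.
float album_gain_for(const std::vector<Samples>& tracks)
{
	float album_peak = 0.0f;
	for(const auto& track : tracks) {
		album_peak = std::max(album_peak, peak_of(track));
	}
	return gain_factor_for(album_peak);
}

// In the realm of PCM, a volume change is a simple multiplication.
void apply_gain(Samples& samples, float factor)
{
	for(float& sample : samples) {
		sample *= factor;
	}
}

// Conversion from a gain in dB to the equivalent scalar factor.
float db_to_factor(float db)
{
	return std::pow(10.0f, (db / 20.0f));
}
```

After `apply_gain()` with the album-level gain, the loudest track ends up with a GAIN FACTOR of 1, and every quieter track gets a factor above 1 that the game could multiply in at playback time if normalization is enabled.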
The final interesting tidbit in building these packages can be found in the way the Sound Canvas VA recordings are looped. When manually cutting loops, you always have to consider that the intro might end with unique notes that aren't present at the end of the loop, which will still be fading out at the calculated loop start point. This necessitates shifting the loop start point by a few bars until these notes are no longer audible – or you could simply ignore the issue because ZUN's compositions are so frantic that no one would ever notice.
With the separate intro and loop files generated by mly, on the other hand, the reverb/release trails are immediately visible and, after trimming trailing silence, exactly define the number of samples that the calculated loop start point needs to be shifted by. The .loop file then remains always exactly as long, in samples, as the duration of the loop reported by mly. If a piece happens to have a constant tempo whose beat duration corresponds to an integer number of samples, we get some very satisfying, round loop durations out of this process. ☺️
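The arithmetic behind this can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, silence is detected with a simple amplitude threshold, and mono float samples stand in for whatever format the actual conversion pipeline works with.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the length of `samples` once trailing silence is cut off.
std::size_t LengthWithoutTrailingSilence(
    const std::vector<float>& samples, float threshold
) {
    std::size_t len = samples.size();
    while((len > 0) && (std::abs(samples[len - 1]) <= threshold)) {
        len--;
    }
    return len;
}

// The intro recording keeps going past the calculated loop start for as long
// as its reverb/release trail is audible. That overhang is the number of
// samples by which the loop start point has to be shifted.
std::size_t LoopStartShift(
    const std::vector<float>& intro_recording,
    std::size_t calculated_loop_start,
    float threshold
) {
    const auto trimmed = LengthWithoutTrailingSilence(intro_recording, threshold);
    return ((trimmed > calculated_loop_start)
        ? (trimmed - calculated_loop_start)
        : 0
    );
}

// At a constant tempo, one beat lasts (sample_rate * 60 / bpm) samples, which
// is only an integer if the division leaves no remainder.
bool BeatIsWholeSamples(unsigned int sample_rate, unsigned int bpm) {
    assert(bpm > 0);
    return (((sample_rate * 60u) % bpm) == 0);
}
```

At 44,100 Hz and 150 BPM, for example, a beat lasts exactly 17,640 samples, which is the kind of round number mentioned above.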
So let's play it all back in-game… and immediately run into two unexpected miniaudio limitations, what the…?!
miniaudio uses a fixed linear function for its fade-out envelope, and doesn't offer anything else? We might not even want a logarithmic one this time because symmetry with MIDI's simple quadratic curve would be neat, but we sure don't want a linear function – those stay near the original volume for too long, and then turn quiet way too quickly.
There is no way to access FLAC metadata from miniaudio's public API, even though the library bundles the author's own FLAC library which has this feature?
📝 Back when I evaluated miniaudio, I alluded that I consider single-file C libraries to be massively overrated, and this is exactly why: Once they grow as massive as miniaudio (how ironic), they can quickly lead to their authors treating their dependencies as implementation details and melting down the interfaces that would naturally arise. In a regular library, dr_flac would be a separate, proper dependency, and the API would have a way to initialize a stream from an externally loaded drflac object. But since the C community collectively pretends that multi-file libraries are a burden on other developers, miniaudio ended up with dr_flac copy-pasted into its giant single file, with a silly ma_ namespacing prefix added to all its functions. And why? Did we have to move so far in the other direction just because CMake doesn't support globbing? That's a symptom of CMake not actually solving any problem, not a valid architectural decision that libraries should bend around. 🙄
So unless we fork and hack around in miniaudio, there's now no way around depending on a second, regular copy of dr_flac. Which has now led to the same project organization bloat that single-file libraries originally set out to prevent…
Sigh. At this rate, it makes more sense to just copy-paste and adapt the old BGM streaming code I wrote for thcrap in late 2018, which used dr_flac directly, and extend it with metadata support. With the streaming code moved out of the platform layer and into game logic, it also makes much more sense to implement the squared fade-out curve at that same level instead of copy-pasting and adjusting an unhealthy amount of miniaudio's verbose C code.
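For comparison, here is what the two envelope shapes boil down to. This is a minimal sketch with hypothetical function names, not the actual streaming code; it assumes that the fade is tracked as a sample position within a fixed fade duration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Maps a sample position within the fade to a progress value from 0.0 (fade
// has just started) to 1.0 (fade is complete).
float FadeProgress(std::size_t position, std::size_t duration) {
    assert(duration > 0);
    const float ratio = (
        static_cast<float>(position) / static_cast<float>(duration)
    );
    return std::min(1.0f, ratio);
}

// Linear envelope: volume falls at a constant rate.
float FadeVolumeLinear(float progress) {
    return (1.0f - progress);
}

// Squared envelope: volume falls quickly at first and tapers off gently
// toward the end.
float FadeVolumeSquared(float progress) {
    const float linear = FadeVolumeLinear(progress);
    return (linear * linear);
}
```

Halfway through the fade, the linear envelope is still at 50% of the original volume, while the squared one is already down to 25%.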
While I'm doing the same for the old Vorbis streaming code, it would also make sense to rewrite that one to use stb_vorbis instead of the old libogg+libvorbis reference libraries. There's no need to add two more dependencies if miniaudio already comes with stb_vorbis.c, and that library is widely acclaimed. So, integration should be a breeze, right?
Well, surprise, rarely have I seen a C library so actively hostile toward being integrated. Both of its API variants are completely unreasonable:
The pulldata API pulls Vorbis data as needed from either a memory buffer containing the entire Vorbis file, or a C FILE* handle.
Effectively, this either forces you to give up disk streaming completely, or forces your program into C's terrible I/O API with all its buffering slowness and Unicode issues on Windows. The documentation even goes on to suggest just modifying the code if you need anything else, which might be acceptable in the strange world of game development this library originates from, but it sure isn't in the kind of open-source development I do.
The pushdata API expects the caller to gradually feed chunks of Vorbis data. How large do these chunks have to be? Nobody knows – and, even worse, the API doesn't retain any of the data already pushed in. If the buffer you passed is too small, which you don't get to know in advance, you have to pass the same data plus more in the next call. I get that you might want an API like this to avoid dynamic memory allocations, but not only does this API perform plenty of allocations itself, it actively forces its caller to realloc() over and over again. 🙄 The lack of seeking support reveals that this API is geared towards live-streamed audio, and it might very well be acceptable in such a case, but it's nothing we could use for BGM.
What happened to the tried-and-true idea of providing a structure with read, tell, and seek callbacks, and then providing an optional variant for C FILE* handles if you absolutely must? Sure, the whole point of Vorbis is to be small and nobody these days would care about spending a few MB on keeping an entire Vorbis file in memory, but come on. If pulldata made the deliberate and opinionated choice to only support buffers of complete Vorbis streams and argued in the name of simplicity that hand-coded disk streaming isn't worth it in this day and age, I might have even been convinced. And this is from the guy who popularized the concept of single-file C libraries in the first place?
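To make it clear what kind of interface I mean, here's a minimal sketch of such a callback structure, together with an in-memory backend. Every name in it is hypothetical and only serves this illustration; a file-based backend would fill the same three function pointers with wrappers around whatever I/O API the program prefers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical callback set that a decoder could accept instead of a FILE*.
struct StreamCallbacks {
    void* user; // Passed back to every callback
    std::size_t (*read)(void* user, void* buf, std::size_t size);
    bool (*seek)(void* user, std::int64_t absolute_offset);
    std::int64_t (*tell)(void* user);
};

// Example backend: a stream over a buffer that is already in memory.
struct MemoryStream {
    const std::uint8_t* data;
    std::size_t size;
    std::size_t cursor;
};

// Copies up to `size` bytes and returns the number of bytes actually read.
std::size_t MemoryRead(void* user, void* buf, std::size_t size) {
    auto* s = static_cast<MemoryStream*>(user);
    const std::size_t n = std::min(size, (s->size - s->cursor));
    std::memcpy(buf, (s->data + s->cursor), n);
    s->cursor += n;
    return n;
}

// Rejects offsets outside the buffer and leaves the cursor unchanged then.
bool MemorySeek(void* user, std::int64_t absolute_offset) {
    auto* s = static_cast<MemoryStream*>(user);
    if(absolute_offset < 0) {
        return false;
    }
    if(static_cast<std::uint64_t>(absolute_offset) > s->size) {
        return false;
    }
    s->cursor = static_cast<std::size_t>(absolute_offset);
    return true;
}

std::int64_t MemoryTell(void* user) {
    return static_cast<std::int64_t>(static_cast<MemoryStream*>(user)->cursor);
}

StreamCallbacks CallbacksFor(MemoryStream& stream) {
    assert(stream.cursor <= stream.size);
    return { &stream, MemoryRead, MemorySeek, MemoryTell };
}
```

With an interface like this, the decoder never needs to know whether the data comes from memory, a file handle, or an archive.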
Oh well, tupblocks go brrr. libvorbis definitely shows its age with all the old command-line tools in the lib/ directory that they never moved away and that we now have to remove from our glob. But even that just adds a single line to the Tupfile, and then we get to enjoy its much friendlier API. That sure beats the almost 800 lines of code that miniaudio had to write to integrate stb_vorbis… which I can't even link because the file is too big for GitHub. 🤷
At this point, it would have even made sense to upgrade from a 24-year-old lossy codec to an 11-year-old lossy codec and use Opus instead, since the enforced 48,000 Hz sampling rate is a non-issue when you control the entire audio pipeline. But let's keep compatibility with existing thcrap mods for now.
In the end, the Windows build ended up using only a single one of the miniaudio features that DirectSound doesn't have, and that's the ability to use the more modern WASAPI instead of DirectSound. We're still going to use miniaudio for the Linux port, but as far as Windows is concerned, it would be quite nice to backport BGM streaming to the game's original DirectSound backend. The P0275 build is pushing 1 MiB of binary size for a game that originally came in a 220 KiB binary, so this would not only remove a noticeable amount of bloat from GIAN07.EXE, but also allow waveform BGM to work in the Windows 98-compatible i586 build. If that sounds cool to you, this is the issue you want to fund.
That only left some logic and UI busywork to put it all together, which means that we've almost reached the end of things to talk about! Here's what it all looks like:
BGM pack selection is done in-game through a new submenu. The <Download> option will open the BGM pack release page in the system's preferred browser:
This window presented a great occasion for already implementing the generic boilerplate for vertically scrolling windows with an unlimited number of items. That will come in quite handy once we introduce better replay support… 👀
Even with per-track BGM volume normalization, Shuusou Gyoku's sound effects are still a bit too loud in comparison, especially when mixed on top of that excessively and unfixably left-panned AST version of the Extra Stage theme. Adding separate volume controls for BGM and sound effects really was the only sustainable solution here, and conveniently checks an important quality-of-life box the original game lacked. So important that it was the very first issue I added to the GitHub tracker of my fork:
I really wanted to have Japanese help text in these menus, as it makes them look just so much more consistent and polished. Many thanks to Elfin, who responded to my bounty offer, and will most likely also provide localizations for future features.
In-game music titles are now consistently right-aligned. Leading whitespace in 4 of the original MIDI Sequence Names suggests that pbg might have intended these titles to be centered within the 216 maximum pixels that the original code designated for music titles, but none of those 4 had the correct amount of spaces that would have been required for exact centering:
Right-aligned text matches the one certain intention I can read out of the code, and allows us to consistently trim whitespace from both the original MIDI Sequence Names and the TITLE tags in the BGM packs… at the cost of significantly changing the animation. 🤔
Maybe, all this whitespace had the explicit purpose of making the animation look the way it did originally? But hard-padding the title tags in the BGM packs would be so dumb… 😩 Let's keep it like this for now and fix the animation later.
At startup, the game now shows a new screen if any of the game's .DAT files are missing, displaying their expected absolute path. This is bound to be very important on Linux because each distribution might have its own idea of where these files are supposed to be stored. But even on Windows, this allows GIAN07.EXE to at least run and show something if one or more of these files are not present, instead of crashing at the first attempt of loading anything from them. The ¥ instead of \ is, 📝 once again, a font issue. Good luck finding a font not named MS Gothic that looks good when rendered in this game…
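The check itself is simple. The following is just a sketch of the idea using std::filesystem, with a hypothetical function name and signature rather than the game's actual code:

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical startup check: returns the absolute paths of all expected
// files that can't be found inside `dir`.
std::vector<std::filesystem::path> MissingFiles(
    const std::filesystem::path& dir, const std::vector<std::string>& names
) {
    std::vector<std::filesystem::path> ret;
    for(const auto& name : names) {
        assert(!name.empty());
        const auto path = std::filesystem::absolute(dir / name);

        // The error_code overload reports failures through its return value
        // instead of throwing, so inaccessible files count as missing too.
        std::error_code ec;
        if(!std::filesystem::is_regular_file(path, ec)) {
            ret.push_back(path);
        }
    }
    return ret;
}
```

If the returned list is not empty, the game can show these absolute paths on the new screen instead of attempting to load anything.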
On a more unfortunate note, I dropped the i586 build from this release. Visual Studio 2022's CRT implements the new filesystem and threading code using Win32 API functions that are only available on Vista or later and are not covered by the one ready-made KernelEx package I was able to find, so I couldn't easily test such a build on Windows 98 anymore. Resurrecting the i586 build would therefore involve additional platform abstraction layers that we wouldn't need otherwise. Writing them wouldn't be too expensive, but it only makes sense if there's actual demand. Backporting waveform BGM to DirectSound to restore feature parity would also be a good idea here, as it would avoid the need to litter the current code with #ifdefs at any place that references anything related to BGM packs.
After half a year of being bought out way past the cap, I've finally got some small room left for new orders again. If it weren't for this blog post and the required research and web development work, this delivery would have probably come out in early January, taking half the time it ended up taking. So I really have to start factoring the blog posts into the push prices in a better and fairer way.
Meanwhile, the hate toward my day job only keeps growing, but there's little point in looking for a new one as long as ReC98 remains this motivating and complex. It leaves pretty much no cognitive room for any similarly demanding job. Thus, I want 2024 to be the year where ReC98 either becomes profitable enough to be my only full-time job, or where we conclusively find out that it can't, I go look for a better day job, and ReC98 shifts to a slower pace. Here's the plan:
From now on, I will immediately increase the push price whenever we reach 100% of the cap, either directly through new orders or indirectly through existing subscriptions. The price increase will be relative to how long it took to reach that point since the last re-opening.
If the store continues selling out, I will aim for per push by the end of the year.
In exchange, microtransactions (i.e., deliveries containing just code and no blog posts) will now be half the price of regular pushes for the same amount of delivered code. Or in other words: If you want to fund a goal that's eligible for microtransactions, you can now decide whether your fixed amount of money goes to 2× coding work and 0× blogging, or 1× coding work and 1× blogging.
I'll permanently increase the default level of the cap from 8 to 10 pushes. The past 12 months were full of mod releases that raised the bar, and 2024 shows no signs of stopping that trend.
If we ever reach per push, I plan to hire people for some of the contribution-ideas or anything else that might improve this project. (Well-produced YouTube videos about the findings of this project might be a nice idea!) At that point, I will have reached my goal of living decently off this project alone, and it's time for others to make money in this space as well.
With the new price of per push, this means that there's now a small window in which you can get a full push worth of functionality for , until the current cap is filled up again.
Next up: Probably TH02's endings to relax a bit. Maybe we're also getting some new Touhou-related contributions?
And now we're taking this small indie game from the year 2000 and porting
its game window, input, and sound to the industry-standard cross-platform
API with "simple" in its name.
Why did this have to be so complicated?! I expected this to take maybe 1-2
weeks and result in an equally short blog post. Instead, it raised so many
questions that I ended up with the longest blog post so far, by quite a wide
margin. These pushes ended up covering so many aspects that could be
interesting to a general and non-Seihou-adjacent audience, so I think we
need a table of contents for this one:
Before we can start migrating to SDL, we of course have to integrate it into
the build somehow. On Linux, we'd ideally like to just dynamically link to a
distribution's SDL development package, but since there's no such thing on
Windows, we'd like to compile SDL from source there. This allows us to reuse
our debug and release flags and ensures that we get debug information,
without needing to clone build scripts for every
C++ library ever in the process or something.
So let's get my Tup build scripts ready for compiling vendored libraries… or
maybe not? Recently, I've kept hearing about a hot new
technology that not only provides the rare kind of jank-free
cross-compiling build system for C/C++ code, but innovates by even
bundling a C++ compiler into a single 279 MiB package with no
further dependencies. Realistically replacing both Visual Studio and Tup
with a single tool that could target every OS is quite a selling point. The
upcoming Linux port makes for the perfect occasion to evaluate Zig, and to
find out whether Tup is still my favorite build system in 2023.
Even apart from its main selling point, there's a lot to like about Zig:
First and foremost: It's a modern systems programming language with
seamless C interop that we could gradually migrate parts of the codebase to.
The feature set of the core language seems to hit the sweet spot between C
and C++, although I'd have to use it more to be completely sure.
A native, optimized Hello World binary with no string formatting is
4 KiB when compiled for Windows, and 6.4 KiB when cross-compiled
from Windows to Linux. It's so refreshing to see a systems language in 2023
that doesn't bundle a bulky runtime for trivial programs and then defends it
with the old excuse of "but all this runtime code will come in handy the
larger your program gets". With a first impression like this, Zig
managed to realize the "don't pay for what you don't use" mantra that C++
typically claims for itself, but only pulls off maybe half of the time.
You can directly
target specific CPU models, down to even the oldest 386 CPUs?! How
amazing is that?! In contrast, Visual Studio only describes its /arch:IA32
compatibility option in very vague terms, leaving it up to you to figure out
that "legacy 32-bit x86 instruction set without any vector
operations" actually means "i586/P5 Pentium, because the startup code
still includes an unconditional CPUID instruction". In any
case, it means that Zig could also cover the i586 build.
Even better, changing Zig's CPU model setting recompiles both its
bundled C/C++ standard library and Zig's own compiler-rt polyfill
library for that architecture. This ensures that no unsupported
instructions ever show up in the binary, and also removes the need for
any CPUID checks. This is so much better than the Visual
Studio model of linking against a fixed pre-compiled standard library
because you don't have to trust that all these newer instructions
wouldn't actually be executed on older CPUs that don't have them.
I love the auto-formatter. Want to lay out your struct literal into
multiple lines? Just add a trailing comma to the end of the last element.
It's very snappy, and a joy to use.
Like every modern programming language, Zig comes with a test framework
built into the language. While it's not all too important for my grand plan
of having one big test that runs a bunch of replays and compares their game
states against the original binary, small tests could still be useful for
protecting gameplay code against accidental changes. It would be great if I
didn't have to evaluate and choose among
the many testing frameworks for C++ and could just use a language
management is still in its infancy, but it's looking pretty good so far,
resembling Go's decentralized approach of just pointing to a URL but with
specific version selection from the get-go.
However, as a version number of 0.11.0 might already suggest, the whole
experience was then bogged down by quite a lot of issues:
While Zig's C/C++ compilation feature is very
well architected to reuse the C/C++ standard libraries of GCC and MinGW and
thus automatically keeps up with changes to the C++ standard library,
it's ultimately still just a Clang frontend. If you've been working with a
Visual Studio-exclusive codebase – which, as we're going to see below, can
easily happen even if you compile in C++23 mode – you'd now have to
migrate to Clang and Zig in a single step. Obviously, this can't ever
be fixed without Microsoft open-sourcing their C++ compiler. And even then,
supporting a separate set of command-line flags might not be worth it.
The standard library is very poorly documented, especially in the
build-related parts that are meant to attract the C++ audience.
Often, the only documentation is found in blog posts from a few years
ago, with example code written against old Zig versions that doesn't compile
on the newest version anymore. It's all very far from stable.
However, Zig's project generation sub-commands (zig
init-exe and friends) do emit well-documented boilerplate
code? It does make sense for that code to double as a comprehensive example,
but Zig advertises itself as so simple that I didn't even think about
bootstrapping my project with a CLI tool at first – unlike, say, Rust, where
a project always starts with filling out a small form in
There's no progress output for C/C++ compilation? Like, at all?
This hurts especially because compilation times are significantly longer
than they were with Visual Studio. By default, the current Tupfile builds
Shuusou Gyoku in both debug and release configurations simultaneously. If I
fully rebuild everything from a clean cache, Visual Studio finishes such a
build in roughly the same amount of time that Zig takes to compile just a
debug build.
The --global-cache-dir option is only supported by specific
subcommands of the zig CLI rather than being a top-level
setting, and throws an error if used for any other subcommand. Not having a
system-wide way to change it and being forced into writing a wrapper script
for that is fine, but it would be nice if said wrapper script didn't have to
also parse and switch over the subcommand just to figure out whether it is
allowed to append the setting.
compiler-rt still needs a bit of dead code elimination work. As soon as
your program needs a single polyfilled function, you get all of them,
because they get referenced in some exception-related table even if nothing
uses them? Changing the link_eh_frame_hdr option had no effect.
And that was not the only std.Build.Step.Compile option
that did nothing. Worse, if I just tweaked the options and changed nothing
about the code itself, Zig simply copied a previously built executable
out of its build cache into the output directory, as revealed by the
timestamp on the .EXE. While I am willing to believe that Zig correctly
detects that all these settings would just produce the same binary, I do not
like how this behavior inspires distrust and uncertainty in Zig's build
process as a whole. After all, we still live in a world where clearing
the build cache is way too often the solution for weird problems in
software, especially when using CMake. And it makes sense why it would be:
If you develop a complex system and then try solving the infamously hard
problem of cache invalidation on top, the risk of getting cache invalidation
wrong is, by definition, higher than if that was the only thing your system
did. That's the reason why I like Tup so much: It solely focuses on
getting cache invalidation right, and rather errs on the side of caution by
maybe unnecessarily rebuilding certain files every once in a while because
the compiler may have read from an environment variable that has changed in
the meantime. But this is the one job I expect a build system to do, and Tup
has been delivering for years and has become fundamentally more trustworthy
as a result.
Zig activates Clang's UBSan
in debug builds by default, which executes a program-crashing
UD2 instruction whenever the program is about to rely on
undefined C++ behavior. In theory, that's a great help for spotting hidden
portability issues, but it's not helpful at all if these crashes are
seemingly caused by C++ standard library code?! Without any clear info
about the actual cause, this just turned into yet another annoyance on
top of all the others. Especially because I apparently kept searching for
the wrong terms when I first encountered this issue, and only found
out how to deactivate it after I already decided against Zig.
Also, can we get /PDBALTPATH?
Baking absolute paths from the filesystem of the developer's machine into
released binaries is not only cringe in itself, but can also cause potential
privacy or security accidents.
So for the time being, I still prefer Tup. But give it maybe two or three
years, and I'm sure that Zig will eventually become the best tool for
resurrecting legacy C++ codebases. That is, if the proposed divorce of the
core Zig compiler from LLVM isn't an indication that the
productive parts of the Zig community consider the C/C++ building features
to be "good enough", and are about to de-emphasize them to focus more
strongly on the actual Zig language. Gaining adoption for your new systems
language by bundling it with a C/C++ build system is such a great and unique
strategy, and it almost worked in my case. And who knows, maybe Zig will
already be good enough by the time I get to port PC-98 Touhou to modern
(If you came from the Zig
wiki, you can stop reading here.)
A few remnants of the Zig experiment still remain in the final delivery. If
that experiment worked out, I would have had to immediately change the
execution encoding to UTF-8, and decompile a few ASM functions exclusive to
the 8-bit rendering mode which we could have otherwise ignored. While Clang
does support inline assembly with Intel syntax via
-fms-extensions, it has trouble with ; comments
and instructions like REP STOSD, and if I have to touch that
code anyway… (The REP STOSD function translated into a single
call to memcpy(), by the way.)
Another smaller issue was Visual Studio's lack of standard library header
hygiene, where #including some of the high-level STL features also includes
more foundational headers that Clang requires to be included separately, but
I've already known about that. Instead, the biggest shocker was that Visual
Studio accepts invalid syntax for a language feature as recent as C++20
// Defines the interface of a text rendering session class. To simplify this
// example, it only has a single `Print(const char* str)` method.
template <class T> concept Session = requires(T t, const char* str) {
	{ t.Print(str) };
};

// Once the rendering backend has started a new session, it passes the session
// object as a parameter to a user-defined function, which can then freely call
// any of the functions defined in the `Session` concept to render some text.
template <class F, class S> concept UserFunctionForSession = (
	Session<S> && requires(F f, S& s) {
		{ f(s) };
	}
);

// The rendering backend defines a `Prerender()` method that takes the
// aforementioned user-defined function object. Unfortunately, C++ concepts
// don't work like this: The standard doesn't allow `auto` in the parameter
// list of a `requires` expression because it defines another implicit
// template parameter. Nevertheless, Visual Studio compiles this code without
// errors.
template <class T, class S> concept BackendAttempt = requires(
	T t, UserFunctionForSession<S> auto func
) {
	{ t.Prerender(func) };
};

// A syntactically correct definition would use a different constraint term for
// the type of the user-defined function. But this effectively makes the
// resulting concept unusable for actual validation because you are forced to
// specify a type for `F`.
template <class T, class S, class F> concept SyntacticallyFixedBackend = (
	UserFunctionForSession<F, S> && requires(T t, F func) {
		{ t.Prerender(func) };
	}
);

// The solution: Defining a dummy structure that behaves like a lambda as an
// "archetype" for the user-defined function.
struct UserFunctionArchetype {
	void operator ()(Session auto& s) {
	}
};

// Now, the session type disappears from the template parameter list, which
// even allows the concrete session type to be private.
template <class T> concept CorrectBackend = requires(
	T t, UserFunctionArchetype func
) {
	{ t.Prerender(func) };
};
What's this, Visual Studio's infamous delayed template parsing applied to
concepts, because they're templates as well? Didn't
they get rid of that 6 years ago? You would think that we've moved
beyond the age where compilers differed in their interpretation of the core
language, and that opting into a current C++ standard turns off any
remaining antiquated behaviors…
So let's actually get my Tup build scripts ready for compiling
vendored libraries, because the
📝 previous 70 lines of Lua definitely
weren't. For this use case, we'd like to have some notion of distinct build
targets that can have a unique set of compilation and linking flags. We'd
also like to always build them in debug and release versions even if you
only intend to build your actual program in one of those versions – with the
previous system of specifying a single version for all code, Tup would
delete the other one, which forces a time-consuming and ultimately needless
rebuild once you switch to the other version.
The solution I came up with treats the set of compiler command-line options
like a tree whose branches can concatenate new options and/or filter the
versions that are built on this branch. In total, this is my 4th
attempt at writing a compiler abstraction layer for Tup. Since we're
effectively forced to write such layers in Lua, it will always be a
bit janky, but I think I've finally arrived at a solid underlying design
that might also be interesting for others. Hence, I've split off the result
into its own separate
repository and added high-level documentation and a documented example.
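The real building blocks are written in Lua, but the underlying idea is language-agnostic. To stay in the same language as the other examples in this post, here's a rough C++ sketch of such an option tree; the type and method names are invented for this illustration and don't correspond to anything in the actual repository.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Version { Debug, Release };

// One node in the tree: a list of command-line flags, plus the versions that
// get built with them.
struct Config {
    std::vector<std::string> flags;
    std::vector<Version> versions;

    // Returns a child node that appends `extra_flags` to the flags of its
    // parent. The parent itself remains unchanged and can be branched again.
    Config Branch(const std::vector<std::string>& extra_flags) const {
        Config ret = *this;
        ret.flags.insert(
            ret.flags.end(), extra_flags.begin(), extra_flags.end()
        );
        return ret;
    }

    // Returns a child node that only builds the given version.
    Config Only(Version version) const {
        Config ret = { flags, {} };
        for(const auto own : versions) {
            if(own == version) {
                ret.versions.push_back(own);
            }
        }
        // Filtering for a version that the parent doesn't build is a bug.
        assert(!ret.versions.empty());
        return ret;
    }
};
```

A vendored library could then branch off the root with its own warning flags and keep both versions, while the game itself branches off the same root and filters down to the one version you currently care about.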
And yes, that's a Code Nutrition
label! I've wanted to add one of these ever since I first heard about the
idea, since it communicates nicely how seriously such an open-source project
should be taken. Which, in this case, is actually not all too
seriously, especially since development of the core Tup project has all but
stagnated. If Zig does indeed get better and better at being a Clang
frontend/build system, the only niches left for Tup will be Visual
Studio-exclusive projects, or retrocoding with nonstandard toolchains (i.e.,
ReC98). Quite ironic, given Tup's Unix heritage…
Oh, and maybe general Makefile-like tasks where you just want to run
specific programs. Maybe once the general hype swings back around and people
start demanding proper graph-based dependency tracking instead of just a command runner…
Alright, alternatives evaluated, build system ready, time to include SDL!
Once again, I went for Git submodules, but this time they're held together
by a
batch file that ensures that the intended versions are checked out before
starting Tup. Git submodules have a bad rap mainly because of their
usability issues, and such a script should hopefully work around
them? Let's see how this plays out. If it ends up causing issues after all,
I'll just switch to a Zig-like model of downloading and unzipping a source
archive. Since Windows comes with curl and tar
these days, this can even work without any further dependencies.
Compiling SDL from a non-standard build system requires a
bit of globbing to include all the code that is being referenced, as
well as a few linker settings, but it's ultimately not much of a big deal.
I'm quite happy that it was possible at all without pre-configuring a build,
but hey, that's what maintaining a Visual Studio project file does to a
By building SDL with the stock Windows configuration, we then end up with
exactly what the SDL developers want us to use… which is a DLL. You
can statically link SDL, but they really don't want you to do
that. So strongly, in fact, that they not
merely argue how well the textbook advantages of dynamic linking have worked
for them and gamers as a whole, but implemented a whole dynamic API
system that enforces overridable dynamic function loading even in static
builds. Nudging developers to their preferred solution by removing most
advantages from static linking by default… that's certainly a strategy. It
definitely fits with SDL's grassroots marketing, which is very good at
painting SDL as the industry standard and the only reliable way to keep your
game running on all originally supported operating systems. Well, at least
until SDL 3 is so stable that SDL 2 gets deprecated and won't
receive any code for new backends…
However, dynamic linking does make sense if you consider what SDL is.
Offering all those multiple rendering, input, and sound backends is what
sets it apart from its more hip competition, and you want to have all of
them available at any time so that SDL can dynamically select them based on
what works best on a system. As a result, everything in SDL is being
referenced somewhere, so there's no dead code for the linker to eliminate.
Linking SDL statically with link-time code generation just prolongs your
link time for no benefit, even without the dynamic API thwarting any chance
of SDL calls getting inlined.
There's one thing I still don't like about all this, though. The dynamic
API's table references force you to include all of SDL's subsystems in the
DLL even if your game doesn't need some of them. But it does fit with their
intention of having SDL2.dll be swappable: If an older game
stopped working because of an outdated SDL2.dll, it should be
possible for anyone to get that game working again by replacing that DLL
with any newer version that was bundled with any random newer game. And
since that would fail if the newer SDL2.dll was size-optimized
to not include some of the subsystems that the older game required, they
simply removed (or de-prioritized) the possibility altogether.
Maybe that was their train of thought? You can always just use the official Windows
DLL, whose whole point is to include everything, after all. 🤷
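To visualize why such a table keeps everything alive, here's a heavily simplified sketch of the general jump table technique. This is not SDL's actual implementation, and all names are made up; it only shows the principle of routing every public function through a table that can be swapped out at run time.

```cpp
#include <cassert>

// Two stand-ins for the functions of different subsystems.
namespace impl {
    int InitVideo() { return 1; }
    int InitAudio() { return 2; }
}

// One slot per public function of the library.
struct JumpTable {
    int (*InitVideo)();
    int (*InitAudio)();
};

// Since the default table references every single function, the linker has to
// keep all of them, even if the game never calls some of them.
JumpTable g_table = { impl::InitVideo, impl::InitAudio };

// The public API never calls an implementation directly, but always goes
// through the table.
int Lib_InitVideo() {
    assert(g_table.InitVideo != nullptr);
    return g_table.InitVideo();
}

int Lib_InitAudio() {
    assert(g_table.InitAudio != nullptr);
    return g_table.InitAudio();
}

// Swapping the table redirects all later calls to a different implementation,
// such as one loaded from a newer build of the library.
void Lib_OverrideTable(const JumpTable& replacement) {
    g_table = replacement;
}
```

The indirection is also what prevents calls from being inlined: The compiler can't know at build time which implementation a table slot will point to.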
So, what do we get in these 1.5 MiB? There are:
renderer backends for Direct3D 9/11/12, regular OpenGL, OpenGL ES 2.0,
Vulkan, and a software renderer,
and audio backends for WinMM, DirectSound, WASAPI, and direct-to-disk
Unfortunately, SDL 2 also statically references some newer Windows API
functions and therefore doesn't run on Windows 98. Since this build of
Shuusou Gyoku doesn't introduce any new features to the input or sound
interfaces, we can still use pbg's original DirectSound and DirectInput code
for the i586 build to keep it working with the rest of the
platform-independent game logic code, but it will start to lag behind in
features as soon as we add support for SC-88Pro BGM or more sophisticated input
remapping. If we do want to keep this build at the same feature level as
the SDL one, we now have a choice: Do we write new DirectInput and
DirectSound code and get it done quickly but only for Shuusou Gyoku, or do
we port SDL 2 to Windows 98 and benefit all other SDL 2 games as
well? I leave
that for my backers to decide.
Immediately after writing the first bits of actual SDL code to initialize
the library and create the game window, you notice that SDL makes it very
simple to gradually migrate a game. After creating the game window, you can
call SDL_GetWindowWMInfo()
to retrieve HWND and HINSTANCE handles that allow
you to continue using your original DirectDraw, DirectSound, and DirectInput
code and focus on porting one subsystem at a time.
Sadly, D3DWindower can no longer turn SDL's fullscreen mode into a windowed
one, but DxWnd still works, albeit behaving a bit janky and insisting on
minimizing the game whenever its window loses focus. But in exchange, the
game window can surprisingly be moved now! Turns out that the originally
fixed window position had nothing to do with the way the game created its
DirectDraw context, and everything to do with pbg
blocking the Win32 "syscommand" that allows a window to be moved. By
deleting a system menu… seriously?! Now I'm dying to hear the Raymond
Chen explanation for how this behavior dates back to an unfortunate decision
during the Win16 days or something.
As implied by that commit, I immediately backported window movability to the
i586 build.
However, the most important part of Shuusou Gyoku's main loop is its frame
rate limiter, whose Win32 version leaves a bit of room for improvement.
Outside of the uncapped [おまけ] DrawMode, the
original main loop continuously checks whether at least 16 milliseconds have
elapsed since the last simulated (but not necessarily rendered) frame. And
by that I mean continuously, and deliberately without using any of
the Windows system facilities to sleep the process in the meantime, as
evidenced by a commented-out Sleep(1) call. This has two
important effects on the game:
The 60Fps DrawMode actually corresponds to a
frame rate of
(1000 / 16) = 62.5 FPS,
not 60. Since the game didn't account for the missing
2/3 ms to bring the limit down to exactly 60 FPS,
62.5 FPS is Shuusou Gyoku's actual official frame rate in a
non-VSynced setting, which we should also maintain in the SDL port.
Not sleeping the process turns Shuusou Gyoku's frame rate limitation
into a busy-waiting loop, which always uses 100% of a single CPU core just
to wait for the next frame.
Sure, modern computers are fast, but a frame won't ever take an
infinitely fast 0 milliseconds to render. So we still need to take the
current frame time into account.
SDL_Delay()'s documentation says that the wake-up could be
further delayed due to OS scheduling.
To address both of these issues, I went with a base delay time of
15 ms minus the time spent on the current frame, followed by
busy-waiting for the last millisecond to make sure that the next frame
starts on the exact frame boundary. And lo and behold: Even though this
technically still wastes up to 1 ms of CPU time per frame, it dropped CPU
usage into the 0%-2% range during gameplay on my Intel Core i5-8400T CPU,
which is over 5 years old at this point. Your laptop battery will appreciate
this new build quite a bit.
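For illustration, that sleep-then-spin strategy could be sketched roughly like this. It's a minimal sketch that uses std::chrono and std::this_thread in place of the SDL timing calls, and the names FRAME_TIME, BUSY_WAIT_TIME, and wait_for_next_frame() are made up for this example rather than taken from the game's code:

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// 16 ms per frame, which matches the original limit of 62.5 FPS.
constexpr std::chrono::milliseconds FRAME_TIME{ 16 };

// Final part of each frame that is spent spinning instead of sleeping.
constexpr std::chrono::milliseconds BUSY_WAIT_TIME{ 1 };

// Blocks until [frame_start + FRAME_TIME] and returns the start time of the
// next frame.
Clock::time_point wait_for_next_frame(Clock::time_point frame_start)
{
    const auto next_frame = (frame_start + FRAME_TIME);
    const auto sleep_end = (next_frame - BUSY_WAIT_TIME);

    const auto now = Clock::now();
    if(now >= next_frame) {
        // The frame took longer than 16 ms. Restart the timing from here
        // instead of trying to catch up.
        return now;
    }

    // Base delay: 15 ms minus the time already spent on the current frame.
    // The OS is free to wake us up later than requested.
    if(now < sleep_end) {
        std::this_thread::sleep_for(sleep_end - now);
    }

    // Busy-wait for the remainder to hit the frame boundary.
    while(Clock::now() < next_frame) {
    }
    return next_frame;
}
```

The main loop would then call this function once per simulated frame and pass the returned time point back in on the next iteration. Returning the ideal boundary rather than the current time keeps small wake-up delays from accumulating.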
Time to look at audio then, because it sure looks less complicated than
input, doesn't it? Loading sounds from .WAV file buffers, playing a fixed
number of instances of every sound at a given position within the stereo
field and with optional looping… and that's everything already. The
DirectSound implementation is so straightforward that the most complex part
of its code is the .WAV file parser.
Well, the big problem with audio is actually finding a cross-platform
backend that implements these features in a way that seamlessly works with
Shuusou Gyoku's original files. DirectSound really is the perfect sound API
for this game:
It doesn't require the game code to specify any output sample format.
Just load the individual sound effects in their original format, and
playback just works and sounds correctly.
Its final sound stream seems to have a latency of 10 ms, which is
perfectly fine for a game running at 62.5 FPS. Even 15 ms would be
Sound effect looping? Specified by passing the
Stereo panning balancing? One method call.
Playing the same sound multiple times simultaneously from a single
memory buffer? One
method call. (It can fail though, requiring you to copy the data after
Pausing all sounds while the game window is not focused? That's the
default behavior, but it can be equally easily disabled with just
a single per-buffer flag.
Future streaming of waveform BGM? No problem either. Windows Touhou has
always done that, and here's
some code I wrote 12½ years ago that would even work without DirectSound
8's notification feature.
No further binary bloat, because it's part of the operating system.
The last point can't really be an argument against anything, but we'd still
be left with 7 other boxes that a cross-platform alternative would have to
tick. We already picked SDL for our portability needs, so how does its audio
subsystem stack up? Unfortunately, not great:
It's fully DIY. All you get is a single output buffer, and you have to
do all the mixing and effect processing yourself. In other words, it's the
masochistic approach to cross-platform audio.
There are helper functions for resampling and mixing, but the
documentation of the latter is full of FUD. With a disclaimer that so
vehemently discourages the use of this function, what are you supposed to do
if you're newly integrating SDL audio into a game? Hunt for a separate sound
mixing library, even though your only quality goal is parity with stone-age
DirectSound? 🙄
It forces the game to explicitly define the PCM sampling rate, bit
depth, and channel count of the output buffer. You can't
just pass a nullptr to SDL_OpenAudioDevice(),
and if you pass a zeroed SDL_AudioSpec structure, SDL just defaults
to an unacceptable 22,050 Hz sampling rate, regardless of what the
audio device would actually prefer. It took until last year for them to
notice that people would at least like to query the native
format. But of course, this approach requires the backend to actually
provide this information – and since we've seen above that DirectSound
doesn't care, the
DirectSound version of this function has to actually use the more modern
WASAPI, and remains unimplemented if that API is not available.
Standardizing the game on a single sampling rate, bit depth, and channel
count might be a decent choice for games that consistently use a single
format for all their sounds anyway. In that case, you get to do all mixing and
processing in that format, and the audio backend will at most do one final
conversion into the playback device's native format. But in Shuusou Gyoku,
most sound effects use 22,050 Hz, the boss explosion sound effect uses
11,025 Hz, and the future SC-88Pro BGM will obviously use
44,100 Hz. In such a scenario, you would have to pick the highest
sampling rate among all sound sources, and resample any lower-quality sounds
to that rate. But if the audio device uses a different sampling rate, those
lower-quality sounds would get resampled a second time.
I know that this
will be fixed in SDL 3, but that version is still under heavy
Positives? Uh… the callback-based nature means that BGM streaming is
rather trivial, and would even be comparatively less complicated than with
DirectSound. Having a mutex to prevent
writes to your sound instance structures while they're being read by the
audio thread is nice too.
OK, sure, but you're not supposed to use it for anything more than a
single stream of audio. SDL_mixer exists precisely to cover such non-trivial
use cases, and it even supports sound effect looping and panning with just a
single function call! But as far as the rest of the library is concerned, it
manages to be an even bigger disappointment than raw SDL audio:
As it sits on top of SDL's audio subsystem, it still can't just use your
audio device's native sample format.
It only offers a very opinionated system for streaming – and of course,
its opinion is wrong. 😛 The fact that it only supports a single streaming
audio track wouldn't matter all too much if you could switch to another
track at sample precision. But since you can't, you're forced to implement
looping BGM using a single file…
…which brings us to the unfortunate issue of loop point definitions.
And, perhaps most importantly, the complete lack of any way to set them
through the API?! It doesn't take long until you come up with a theory for
why the API only offers a function to retrieve loop points: The
"music" abstraction is so format-agnostic that it even supports MIDI
and tracker formats where a typical loop point in PCM samples doesn't make
sense. Both of these formats already have in-band ways of specifying loop
points in their respective time units. They
might not be standardized, but it's still much better than usual
single-file solutions for PCM streams where the loop point has to be stored
in an out-of-band way – such as in a metadata tag or an entirely separate
Speaking of MIDI, why is it so common among these APIs to not have
any way of specifying the MIDI device? The fact that Windows Vista
removed the Control Panel option for specifying the system-wide default
MIDI output device is no excuse for your API lacking the option as well.
In fact, your MIDI API now needs such a setting more than it was
needed in the Windows XP and 9x days.
Funnily enough, they did once receive a patch for a function to set loop
points which was never upstreamed… and this patch came from
the main developer behind PyTouhou, who needed that feature for obvious
reasons. The world sure is a small place.
As a result, they turned loop points into a property that each
individual format may
or may
not have. Want to loop
MP3 files at sample precision? Tough luck, time to reconvert to another
lossy format. 🙄 This is the exact jank I decided against when I implemented
BGM modding for thcrap back in 2018,
where I concluded that separate intro and
loop files are the way to go.
But OK, we only plan to use FLAC and Ogg Vorbis for the SC-88Pro BGM, for
which SDL_mixer does support loop points in the form of Vorbis comments,
and hey, we can even pass them at sample accuracy. Sure, it's wrong and
everything, but nothing I couldn't work with…
However, the final straw that makes SDL_mixer unsuitable for Shuusou
Gyoku is its core sound mixing paradigm of distributing all sound effects
onto a fixed number of channels, set to 8
by default. Which raises the quite ridiculous question of how many we
would actually need to cover the maximum number of sounds that can
simultaneously be played back in any game situation. The theoretical maximum
would be 41, which is the combined sum of individual sound buffer instances
of all 20 original sound effects. The practical limit would surely be a lot
smaller, but we could only find out that one through experiments, which
honestly is quite a silly proposition.
It makes you wonder why they went with this paradigm in the first
place. And sure enough, they actually
use the aforementioned SDL core function for mixing audio. Yes, the
same function whose current documentation advises against using it for
this exact use case. 🙄 What's the argument here? "Sure, 8 is
significantly more than 2, but any mixing artifacts that will occur for
the next 6 sounds are not worth worrying about, but they get really bad
after the 8th sound, so we're just going to protect you from
This dire situation made me wonder if SDL was the wrong choice for Shuusou
Gyoku to begin with. Looking at other low-level cross-platform game
libraries, you'll quickly notice that all of them come with mostly
equally capable 2D renderers these days, and mainly differentiate themselves
in minute API details that you'd only notice upon a really close look. raylib is another one of those
libraries and has been getting exceptionally popular in recent years, to the
point of even having more than twice as many GitHub stars as SDL. By
restricting itself to OpenGL, it can even offer an
abstraction for shaders, which we'd really like for the 西方Project lens ball effect.
In the case of raylib's audio system, the lack of sound effect looping is
the minute API detail that would make it annoying to use for Shuusou Gyoku.
But it might be worth a look at how raylib implements all this if it doesn't
use SDL… which turned out to be the best look I've taken in a long time,
because raylib builds on top of miniaudio,
which is exactly the kind of audio library I was hoping to find.
Let's check the list from above:
🟢 miniaudio's high-level API initialization defaults to the native
sample format of the playback device. Its internal processing uses 32-bit
floating-point samples and only converts back to the native bit depth as
necessary when writing the final stream into the backend's audio buffer.
WASAPI, for example, never needs any further conversion because it operates
with 32-bit floats as well.
🟢 The final audio stream uses the same 10 ms update period (and
thus, sound effect latency) that I was getting with DirectSound.
🟢 Stereo panning balancing? ma_sound_set_pan(),
although it does require a conversion from Shuusou Gyoku's dB units into a
linear attenuation factor.
🟢 Sound effect looping? ma_sound_set_looping().
🟢 Playing the same sound multiple times simultaneously from a single
memory buffer? Perfectly possible, but requires a bit of digging in the
header to find the best solution. More on that below.
🟢 Future streaming of waveform BGM? Just call
ma_sound_init_from_file() with the
👍 It also comes with a FLAC decoder in the core library and an Ogg
Vorbis one as part of the repo, …
🤩 … and even supports gapless switching between the intro and loop
files via a single declarative call to
(Oh, and it also has ma_data_source_set_loop_point_in_pcm_frames()
for anyone who still believes in obviously and objectively
inferior out-of-band loop points.)
🟢 Pausing all sounds while the game window is not focused? It's not
automatic, but adding new functions to the sound interface and calling
ma_engine_stop() and ma_engine_start() does the
trick, and most importantly doesn't cause any samples to be lost in the
🟡 Sound control is implemented in a lock-free way, allowing your main
game thread to call these at any time without causing glitches on the audio
thread. While that looks nice and optimal on the surface, you now have to
either believe in the soundness (ha) of the implementation, or verify that
atomic structure fields actually are enough to not cause any race
conditions (which I did for the calls that Shuusou Gyoku uses, and I didn't
find any). "It's all lock-free, don't worry about it" might be
easier, but I consider SDL's approach of just providing a mutex to
prevent the output callback from running while you mutate the sound state to
actually be simpler conceptually.
🟡 miniaudio adds 247 KB to the binary in its minimum
configuration, a bit more than expected. Some of that is bloat from effect
code that we never use, but it does include backends for all three Windows
audio subsystems (WASAPI, DirectSound, and WinMM).
✅ But perhaps most importantly: It natively supports all modern
operating systems that one could seriously want to port this game to, and
could be easily ported to any other backend, including
Oh, and it's written by the same developer who also wrote the best FLAC
library back in 2018. And that's despite them being single-file C libraries,
which I consider to be massively overrated…
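The dB conversion mentioned in the panning item of the checklist above is simple enough to sketch. This is only an illustration under two assumptions of mine: that the game-side pan value is a signed attenuation in dB (negative for left, positive for right), and that the mixer scales the quieter channel by (1 - |pan|) for a pan position within [-1, 1]. The function names are hypothetical:

```cpp
#include <cmath>

// Converts a gain in decibels to a linear amplitude factor.
// 0 dB yields 1.0, -20 dB yields 0.1, and about -6.02 dB yields roughly 0.5.
float linear_from_db(float db)
{
    return std::pow(10.0f, (db / 20.0f));
}

// Hypothetical helper: Turns a signed pan value in dB into a pan position
// within [-1, 1]. Assumes that the magnitude describes how far the opposite
// channel gets attenuated, and that the mixer multiplies that channel with
// (1 - |pan|).
float pan_from_db(float pan_db)
{
    const float attenuation = linear_from_db(-std::abs(pan_db));
    const float pan = (1.0f - attenuation);
    return ((pan_db < 0.0f) ? -pan : pan);
}
```

With these assumptions, a pan of -20 dB would leave the right channel at 10% of its amplitude and thus correspond to a pan position of about -0.9.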
The drawback? Similar to Zig, it's only on version 0.11.18, and also focuses
on good high-level documentation at the expense of an API reference. Unlike
Zig though, the three issues I ran into turned out to be actual and fixable
bugs: Two minor
ones related to looping of streamed sounds shorter than 2 seconds which
won't ever actually affect us before we get into BGM modding, and a critical one that
added high-frequency corruption to any mono sound effect during its
expansion to stereo. The latter took days to track down – with symptoms
like these, you'd immediately suspect the bug to lie in the resampler or its
low-pass filter, both of which are so much more of a fickle and configurable
part of the conversion chain here. Compared to that, stereo expansion is so
conceptually simple that you wouldn't imagine anyone getting it wrong.
While the latter PR has been merged, the fix is still only part of the
dev branch and hasn't been properly released yet. Fortunately,
raylib is not affected by this bug: It does currently
ship version 0.11.16 of miniaudio, but its usage of the library predates
miniaudio's high-level API and it therefore uses a different,
non-SSE-optimized code path for its format conversions.
The only slightly tricky part of implementing a miniaudio backend for
Shuusou Gyoku lies in setting up multiple simultaneously playing instances
for each individual sound. The documentation and answers on the issue
tracker heavily push you toward miniaudio's resource manager and its file
abstractions to handle this use case. We surely could turn Shuusou Gyoku's
numeric sound effect IDs into fake file names, but it doesn't really fit the
existing architecture where the sound interface just receives in-memory .WAV
file buffers loaded from the SOUND.DAT packfile.
In that case, this seems to be the best way:
Call ma_decode_memory() to decode from any of the supported
audio formats to a buffer of raw PCM samples. At this point, you can
choose between
decoding into the original format the sound effect is stored in,
which would require it to be converted to the playback format every
time it's played, or
decoding into 32-bit floats (the native bit depth of the miniaudio
engine) and the native sampling rate of the playback device, which
avoids any further resampling and floating-point conversion, but takes
up more memory.
Nowadays, it's not clear at all which of the two approaches is faster.
Does it actually matter if we save the audio thread from doing all those
floating-point operations on every sample? Or is that no longer true these
days because the audio thread is probably running on a different CPU core,
the rest of the game largely doesn't touch the floating-point parts of your
CPU anyway, and you'd rather want to keep sound effects small so that they
can better fit into the CPU cache? That would be an interesting question to
benchmark, but just like the similar text rendering question from the last
blog posts, it doesn't matter for this tiny 2000s retro game. 😌
I went with 2) mainly because it simplified all the debugging I was doing.
At a sampling rate of 48,000 Hz, this increases the memory usage for
all sound effects from 379 KiB to 3.67 MiB. At least I'm not
channel-expanding all sound effects as well here…
We've seen earlier that mono➜stereo expansion
is SSE-optimized, so it's very hard to justify a further doubling of the
memory usage here.
Then, for each instance of the sound, call
ma_audio_buffer_ref_init() to create a reference
buffer with its own playback cursor, and
ma_sound_init_from_data_source() to create a new
high-level sound node that will play back the reference buffer.
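To get a feeling for where the memory increase of the second decoding option comes from, here is a back-of-the-envelope calculation. The helper names are made up for this example, and the one-second mono sounds are hypothetical rather than actual files from SOUND.DAT:

```cpp
#include <cstdint>

// Number of PCM frames after resampling from [rate_in] to [rate_out],
// rounded up.
constexpr uint64_t resampled_frames(
    uint64_t frames, uint32_t rate_in, uint32_t rate_out
)
{
    return (((frames * rate_out) + (rate_in - 1)) / rate_in);
}

// Size of a PCM buffer in bytes.
constexpr uint64_t size_in_bytes(
    uint64_t frames, uint32_t bytes_per_sample, uint32_t channels
)
{
    return (frames * bytes_per_sample * channels);
}

// One second of 8-bit mono audio at 22,050 Hz in its original format…
constexpr auto ORIGINAL_SIZE = size_in_bytes(22050, 1, 1);

// …and after decoding to 32-bit float at 48,000 Hz.
constexpr auto DECODED_SIZE = size_in_bytes(
    resampled_frames(22050, 22050, 48000), 4, 1
);
```

This turns 22,050 bytes into 192,000 bytes, or about 8.7× the original size. The same calculation for an 11,025 Hz sound yields about 17.4×, so an overall increase of roughly 10× across all sound effects is in the expected range.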
As a side effect of hunting that one critical bug in miniaudio, I've now
learned a fair bit about audio resampling in general. You'll probably need
some knowledge about basic
digital signal behavior to follow this section, and that video is still
probably the best introduction to the topic.
So, how could this ever be an issue? The only time I ever consciously
thought about resampling used to be in the context of the Opus codec and its
enforced sampling rate of 48,000 Hz, and how Opus advocates
claim that resampling is a solved problem and nothing to worry about,
especially in the context of a lossy codec. Still, I didn't add Opus to
thcrap's BGM modding feature entirely because the mere thought of having to
downsample to 44,100 Hz in the decoder was off-putting enough. But even
if my worries were unfounded in that specific case: Recording the
Stereo Mix of Shuusou Gyoku's now two audio backends revealed that
apparently not every audio processing chain features an Opus-quality
If we take a look at the material that resamplers actually have to work with
here, it quickly becomes obvious why their results are so varied. As
mentioned above, Shuusou Gyoku's sound effects use rather low sampling rates
that are pretty far away from the 48,000 Hz your audio device is most
definitely outputting. Therefore, any potential imaging noise across the
extended high-frequency range – i.e., from the original Nyquist frequencies
of 11,025 Hz/5,512.5 Hz up to the new limit of 24,000 Hz – is
still within the audible range of most humans and can clearly color the
resulting sound.
But it gets worse if the audio data you put into the resampler is
objectively defective to begin with, which is exactly the problem we're
facing with over half of Shuusou Gyoku's sound effects. Encoding them all as
8-bit PCM is definitely excusable because it was the turn of the millennium
and the resulting noise floor is masked by the BGM anyway, but the blatant
clipping and DC offsets definitely aren't:
Waveforms for all 20 of Shuusou Gyoku's sound effects, in the order they
appear inside SOUND.DAT and with their internal names. We can
see quite an abundance of clipping, as well
as a significant DC
Wait a moment, true peaks? Where do those come from? And, equally
importantly, how can we even observe, measure, and store anything
above the maximum amplitude of a digital signal?
The answer to the first question can be directly derived from the
video I linked above: Digital signals are lollipop graphs, not stairsteps as
commonly depicted in audio editing software. Converting them back to an
analog signal involves constructing a continuous curve that passes through
each sample point, and whose frequency components stay below the Nyquist
frequency. And if the amplitude of that reconstructed wave changes too
strongly and too rapidly, the resulting curve can easily overshoot the
maximum digital amplitude of 0
dBFS even if none of the defined samples are above that limit.
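A textbook example makes this overshoot easy to reproduce. The sketch below is not taken from the game or from any of the resamplers discussed here: It just evaluates a sine wave at a quarter of the sampling rate, phase-shifted by 45° and scaled so that every sample point lands exactly on ±1.0. The function names are my own:

```cpp
#include <cmath>

// A sine wave at 1/4 of the sampling rate, shifted by 45° and scaled so
// that it passes through ±1.0 at every integer sample position.
// [t] is measured in samples; fractional values lie between sample points.
double signal_at(double t)
{
    constexpr double PI = 3.14159265358979323846;
    return (std::sqrt(2.0) * std::sin(((PI / 2.0) * t) + (PI / 4.0)));
}

// Level of an amplitude relative to digital full scale (±1.0), in dB.
double dbfs(double amplitude)
{
    return (20.0 * std::log10(std::abs(amplitude)));
}
```

Every sample of this signal sits at exactly 0 dBFS, but the continuous wave reaches an amplitude of √2 halfway between two samples, which corresponds to a true peak of about +3 dBFS.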
So let's store the resampled output as a FLAC file and load it into Audacity
to visualize the clipped peaks… only to find all of them replaced with the
typical kind of clipping distortion? 😕 Turns out that I've stumbled over
the one case where the FLAC format isn't lossless and there's
actually no alternative to .WAV: FLAC just doesn't support
floating-point samples and simply truncates them to discrete integers during
encoding. When we measured inter-sample peaks above, we weren't only
resampling to a floating-point format to avoid any quantization to discrete
integer values, but also to make it possible to store amplitudes beyond the
0 dBFS point of ±1.0 in the first place. Once we lose that ability,
these amplitudes are clipped to the maximum value of the integer bit depth,
and baked into the waveform with no way to get rid of them again. After all,
the resampled file now uses a higher sampling rate, and the clipping
distortion is now a defined part of what the sound is.
Finally, storing a digital signal with inter-sample peaks in a
floating-point format also makes it possible for you to reduce the
volume, which moves these peaks back into the regular, unclipped amplitude
range. This is especially relevant for Shuusou Gyoku as you'll probably
never listen to sound effects at full volume.
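The difference between the two storage formats can be shown with a single sample. This is a minimal sketch with one possible float-to-integer conversion; real encoders may scale and round differently, and the function names are made up:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Converts a floating-point sample to a 16-bit integer. Anything beyond
// ±1.0 no longer fits and gets clamped to the integer range.
int16_t s16_from_float(float sample)
{
    const float scaled = std::round(sample * 32767.0f);
    return static_cast<int16_t>(std::clamp(scaled, -32768.0f, 32767.0f));
}

// Converts a 16-bit integer sample back to floating-point.
float float_from_s16(int16_t sample)
{
    return (sample / 32767.0f);
}
```

Take an inter-sample peak of 1.4 and play it back at half volume. If the peak stays in floating-point, the result is 1.4 × 0.5 = 0.7, and the shape of the wave survives. If it takes a detour through 16-bit integers first, it gets clamped to 1.0 and then comes out at 0.5, together with every other sample that exceeded ±1.0.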
Now that we understand what's going on there, we can finally compare the
output of various resamplers and pick a suitable one to use with miniaudio.
And immediately, we see how they fall into two categories:
High-quality resamplers are the ones I described earlier: They cleanly
recreate the signal at a higher sampling rate from its raw frequency
representation and thus add no high-frequency noise, but can lead to
inter-sample peaks above 0 dBFS.
Linear resamplers use much simpler math to merely interpolate
between neighboring samples. Since the newly interpolated samples can only
ever stay within 0 dBFS, this approach fully avoids inter-sample
clipping, but at the expense of adding high-frequency imaging noise that has
to then be removed using a low-pass filter.
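The second category is simple enough to fit into a few lines. The following is a bare-bones sketch of linear interpolation without any low-pass filter. It is not miniaudio's or DirectSound's actual implementation, and resample_linear() is a name I made up:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Resamples [in] from [rate_in] to [rate_out] by linearly interpolating
// between neighboring input samples. No low-pass filter is applied.
std::vector<float> resample_linear(
    const std::vector<float>& in, unsigned rate_in, unsigned rate_out
)
{
    std::vector<float> out;
    if(in.empty() || (rate_in == 0) || (rate_out == 0)) {
        return out;
    }
    const auto out_size = static_cast<size_t>(
        (static_cast<unsigned long long>(in.size()) * rate_out) / rate_in
    );
    out.reserve(out_size);
    for(size_t i = 0; i < out_size; i++) {
        // Position of output sample [i] on the time axis of the input.
        const double pos = ((static_cast<double>(i) * rate_in) / rate_out);
        const auto left = static_cast<size_t>(pos);
        const auto right = std::min((left + 1), (in.size() - 1));
        const double frac = (pos - static_cast<double>(left));
        out.push_back(static_cast<float>(
            in[left] + ((in[right] - in[left]) * frac)
        ));
    }
    return out;
}
```

Since every output sample is a weighted average of two input samples, it can never lie outside the range spanned by those two. That's why this approach can't produce inter-sample peaks, and also why it can't recreate the original curve between the samples.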
miniaudio only comes with a linear resampler – but so does DirectSound as it
turns out, so we can actually get pretty close to how the game sounded
All of Shuusou Gyoku's sound effects combined and resampled into a
single 48,000 Hz / 32-bit float .WAV file, using GoldWave's File Merger tool. By
converting to 32-bit float first and then resampling, the
conversion preserved the exact frequency range of the original
22,050 Hz and 11,025 Hz files, even despite clipping. There
are small noise peaks across the entire frequency range, but they
only occur at the exact boundary between individual sound effects. These
are a simple result of the discontinuities that naturally occur in the
waveform when concatenating signals that don't start or end at a 0
As mentioned above, you'll only get this sound out of your DAC at lower
volumes where all of the resampled peaks still fit within 0 dBFS.
But you most likely will have reduced your volume anyway, because these
effects would be ear-splittingly loud otherwise.
The result of converting 1️⃣ into FLAC. The necessary bit depth
conversion from 32-bit float to 16-bit integers clamps any data above
0 dBFS or ±1.0f to the discrete
[-32,768; 32,767] range, the maximum value of such
an integer. The resulting straight lines at maximum amplitude in the
time domain then turn into distortion across the entire 24,000 Hz
frequency domain, which then remains a part of the waveform even at
lower volumes. The locations of the high-frequency noise exactly match
the clipped locations in the time-domain waveform images above.
The resulting additional distortion can be best heard in
BOSSBOMB, where the low source frequency ensures that any
distortion stays firmly within the hearing range of most humans.
All of Shuusou Gyoku's sound effects as played through DirectSound and
recorded through Stereo Mix. DirectSound also seems to use a linear
low-pass filter that leaves quite a bit of high-frequency noise in the
signals, making these effects sound crispier than they should be.
Depending on where you stand, this is either highly inaccurate and
something that should be fixed, or actually good because the sound
effects really benefit from that added high end. I myself am definitely
in the latter camp – and hey, this sound is the result of original game
code, so it is accurate at least in that regard.
All of Shuusou Gyoku's sound effects as converted by miniaudio and
directly saved to a file, with the same low-pass filter setting used in
the P0256 build. This first-order low-pass filter is a decent
approximation of DirectSound's resampler, even though it sounds slightly
crispier as the high-frequency noise is boosted a little further. By
default, miniaudio would use a 4th-order low-pass filter, so
this is the second-lowest resampling quality you can get, short of
disabling the low-pass filter altogether.
Conversion results when using miniaudio's 8th-order low-pass
filter for resampling, the highest quality supported. This is the
closest we can get to the reference conversion without using a custom
resampler. If we do want to go for perfect accuracy though, we might as
well go
for 1️⃣ directly?
These spectrum images were initially created using ffmpeg's -lavfi
showspectrumpic=mode=combined:s=1280x720 filter. The samples
appear in the same order as in the waveform above.
And yes, these are indeed the first videos on this blog to have sound! I
spent another push on preparing the
📝 video conversion pipeline for audio
support, and on adding the highly important volume control to the player.
Web video codecs only support lossy audio, so the sound in these videos will
not exactly match the spectrum image, but the lossless source files do
contain the original audio as uncompressed PCM streams.
Compared to that whole mess of signals and noise, keyboard and joypad input
is indeed much simpler. Thanks to SDL, it's almost trivial, and only
slightly complicated because SDL offers two subsystems with seemingly
identical APIs:
SDL_GameController provides a consistent interface for the typical kind
of modern gamepad with two analog sticks, a D-pad, and at least 4 face and 2
shoulder buttons. This API is implemented by simply combining SDL_Joystick
with a
long list of mappings for specific controllers, and therefore doesn't
work with joypads that don't match this standard.
According to SDL, this is what a "game controller" looks like. Here's
the source of the SVG.
To match Shuusou Gyoku's original WinMM backend, we'd ideally want to keep
the best aspects from both APIs but without being restricted to
SDL_GameController's idea of a controller. The Joy
Pad menu just identifies each button with a numeric ID, so
SDL_Joystick would be a natural fit. But what do we do about directional
controls if SDL_Joystick doesn't tell us which joypad axes correspond to the
X and Y directions, and we don't have the SDL-recommended configuration UI yet?
Doing that right would also mean supporting
POV hats and D-pads, after all… Luckily, all joypads we've tested map
their main X axis to ID 0 and their main Y axis to ID 1, so this seems like
a reasonable default guess.
The necessary consolidation of the game's original input handling uncovered
several minor bugs around the High Score and Game Over screen that I
sufficiently described in the release notes of the new build. But it also
revealed an interesting detail about the Joy Pad
screen: Did you know that Shuusou Gyoku lets you unbind all these
actions by pressing more than one joypad button at the same time? The
original game indicated unbound actions with a [Button
0] label, which is pretty confusing if you have ever programmed
anything because you now no longer know whether the game starts numbering
buttons at 0 or 1. This is now communicated much more clearly.
ESC is not bound to any joypad button in
either screenshot, but it's only really obvious in the P0256
With that, we're finally feature-complete as far as this delivery is
concerned! Let's send a build over to the backers as a quick sanity check…
a~nd they quickly found a bug when running on Linux and Wine. When holding a
button, the game randomly stops registering directional inputs for a short
while on some joypads? Sounds very much like a Wine bug, especially if the
same pad works without issues on Windows.
And indeed, on certain joypads, Wine maps the buttons to completely
different and disconnected IDs, as if it simply invents new buttons or axes
to fill the resulting gaps. Until we can differentiate joypad bindings
per controller, it's therefore unlikely that you can use the same joypad
mapping on both Windows and Linux/Wine without entering the Joy Pad menu and remapping the buttons every time you
switch operating systems.
Still, by itself, this shouldn't cause any issues with my SDL event handling
code… except, of course, if I forget a break; in a switch case.
This completely preventable implicit fallthrough has now caused a few hours
of debugging on my end. I'd better crank up the warning level to keep this
from ever happening again. Opting into this specific warning also revealed
why we haven't been getting it so far: Visual Studio did gain a whole host
of new warnings related to the C++ Core
Guidelines a while ago, including the one I
was looking for, but actually getting the compiler to throw these
requires activating
a separate static analysis mode together with a plugin, which
significantly slows down build times. Therefore I only activate them for
release builds, since these already take long enough.
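To illustrate this class of bug, here's a minimal sketch with a made-up event model. None of these names come from SDL or the actual game code, and the real handler is certainly more involved. A forgotten `break;` lets a button event run into the code of the next case, which is one plausible way for a held button to clobber directional state:

```cpp
#include <cstdint>

// Hypothetical, simplified event model; not SDL's real event structures.
enum class EventType { ButtonDown, AxisMotion };

struct PadState {
    bool button_held = false;
    int16_t axis = 0;
};

// Buggy: without a `break;`, a button press also executes the axis case and
// overwrites the axis with the unrelated payload of the button event.
inline void HandleBuggy(PadState& pad, EventType type, int16_t value)
{
    switch(type) {
    case EventType::ButtonDown:
        pad.button_held = true;
        // <- `break;` is missing here
    case EventType::AxisMotion:
        pad.axis = value;
        break;
    }
}

// Fixed: every case ends in an explicit `break;`.
inline void HandleFixed(PadState& pad, EventType type, int16_t value)
{
    switch(type) {
    case EventType::ButtonDown:
        pad.button_held = true;
        break;
    case EventType::AxisMotion:
        pad.axis = value;
        break;
    }
}
```

Clang and GCC can flag the first version via `-Wimplicit-fallthrough`, and C++17's `[[fallthrough]];` attribute marks the rare cases where falling through is actually intended.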
Since all that input debugging already started a 5th push, I
might as well fill that one by restoring the original screenshot feature.
After all, it's triggered by a key press (and is thus related to the input
backend), reads the contents of the frame buffer (and is thus related to the
graphics backend), and it honestly looks bad to have this disclaimer in the
release notes just because we're one small feature away from 100% parity
with pbg's original binary.
Coincidentally, I had already written code to save a DirectDraw surface to a
.BMP file for all the debugging I did in the last delivery, so we were
basically only missing filename generation. Except that Shuusou
Gyoku's original choice of mapping screenshots to the PrintScreen key did
not age all too well:
And as of Windows 11, the OS takes full control of the key by binding it
to the Snipping Tool by default, complete with a UI that politely steals
focus when hitting that key.
As a result, both Arandui and I independently arrived at the
idea of remapping screenshots to the P key, which is the same screenshot key
used by every Windows Touhou game since TH08.
The rest of the feature remains unchanged from how it was in pbg's original
build and will save every distinct frame rendered by the game (i.e., before
flipping the two framebuffers) to a .BMP file as long as the P key is being
held. At a 32-bit color depth, these screenshots take up 1.2 MB per
frame, which will quickly add up – especially since you'll probably hold the
P key for more than 1/60 of a second and therefore end
up saving multiple frames in a row. We should probably compress
them one day.
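As a back-of-the-envelope check of that number, here's the arithmetic under a few assumptions: a 640×480 frame, 4 bytes per pixel, 60 frames per second, and the classic 54-byte .BMP header (14-byte file header plus 40-byte info header). The files the game actually writes might carry a slightly larger header.

```cpp
#include <cstdint>

// All constants below are assumptions for this estimate.
constexpr uint32_t ScreenW = 640;
constexpr uint32_t ScreenH = 480;
constexpr uint32_t BytesPerPixel = 4;       // 32-bit color depth
constexpr uint32_t BmpHeaderSize = 14 + 40; // file header + info header

constexpr uint32_t PixelBytes = (ScreenW * ScreenH * BytesPerPixel);
constexpr uint32_t FileBytes = (PixelBytes + BmpHeaderSize);

// Bytes written per second while the screenshot key is held.
constexpr uint64_t BytesPerSecondHeld(uint32_t fps)
{
    return (uint64_t{ FileBytes } * fps);
}

static_assert(PixelBytes == 1'228'800, "≈ 1.2 MB of raw pixels per frame");
```

At 60 frames per second, holding the key for a single second would already write roughly 74 MB under these assumptions.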
Since I already translated some of Shuusou Gyoku's ASM code to C++ during
the Zig experiment, it made sense to finish the fifth push by covering the
rest of those functions. The integer math functions are used all throughout
the game logic, and are the main reason why this goal is important for a
Linux port, or any port to a 64-bit architecture for that matter. If you've
ever read a micro-optimization-related blog post, you'll know that hand-written ASM is a great recipe that often results in the finest jank, and the game's square root function definitely delivers in that regard, right out of the gate.
What slightly differentiates this algorithm from the typical definition of
an integer
square root is that it rounds to the nearest integer instead of always
rounding down: In real numbers, √3 is ≈ 1.73, so isqrt(3) returns 2 instead of 1. However, if
the result is always rounded down, you can determine whether you have to
round up by simply squaring the calculated root and comparing it to the radicand. And even that
is only necessary if the difference between the two doesn't naturally fall
out of the algorithm – which is what also happens with Shuusou Gyoku's
original ASM code, but pbg
didn't realize this and squared the result regardless.
That's one suboptimal detail already. Let's call the original ASM function
in a loop over the entire supported range of radicands from 0 to
2³¹ and produce a list of results that I can verify my C++
translation against… and watch as the function's linear time complexity with
regard to the radicand causes the loop to run for over 15 hours on my
system. 🐌 In a way, I've found the literal opposite of Q_rsqrt()
here: Not fast, not inverse, no bit hacks, and surely without the
awe-inspiring kind of WTF.
I really didn't want to run the same loop over a
literal C++ translation of the same algorithm afterward. Calculating
integer square roots is a common problem with lots of solutions, so let's
see if we can go better than linear.
And indeed, Wikipedia
also has a bitwise algorithm that runs in logarithmic time, uses only
additions, subtractions, and bit shifts, and even ends up with an error term
that we can use to round up the result as necessary, without a
multiplication. And this algorithm delivers the exact same results over the
exact same range in… 50 seconds. 🏎️ And that's with the I/O to print
the first value that returns each of the 46,341 different square root
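For reference, this is roughly what such a bitwise algorithm can look like in C++. Treat it as a sketch of the well-known digit-by-digit binary method rather than the exact code that ended up in the game. It assumes that the rounding is meant as rounding to the nearest integer, which matches the comparison against round(sqrt(x)); the function and variable names are made up:

```cpp
#include <cstdint>

// Square root of n, rounded to the nearest integer, using only additions,
// subtractions, comparisons, and bit shifts.
constexpr uint32_t isqrt_rounded(uint32_t n)
{
    uint32_t rem = n;          // ends up as n - root²
    uint32_t root = 0;
    uint32_t bit = (1u << 30); // highest power of 4 that fits into 32 bits

    // Start at the highest power of 4 that is ≤ n.
    while(bit > n) {
        bit >>= 2;
    }
    // Decide one bit of the result per iteration, from high to low.
    while(bit != 0) {
        if(rem >= (root + bit)) {
            rem -= (root + bit);
            root = ((root >> 1) + bit);
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    // `root` is now the rounded-down result and `rem` the error term.
    // √n ≥ root + 0.5  ⇔  n ≥ root² + root + 0.25  ⇔  rem > root,
    // so rounding needs neither a multiplication nor a second pass.
    return ((rem > root) ? (root + 1) : root);
}
```

The loop runs at most 16 times for a 32-bit radicand, which is where the logarithmic time complexity comes from.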
"But wait a moment!", I hear you say. "Why are you bothering with
an integer square root algorithm to begin with? Shouldn't good old
round(sqrt(x)) from <math.h> do the trick
just fine? Our CPUs have had SSE for a long time, and this probably compiles
into the single SQRTSD instruction. All that extra
floating-point hardware might mean that this instruction could even run in
parallel with non-SSE code!"
And yes, all of that is technically true. So I tested it, and my very
synthetic and constructed micro-benchmark did indeed deliver the same
results in… 48 seconds. That's not enough of a
difference to justify breaking the spirit of treating the FPU as lava that
permeates Shuusou Gyoku's code base. Besides, it's not used for that much to
begin with:
pre-calculating the 西方Project lens ball effect
the fade animation when entering and leaving stages
rendering the circular part of stationary lasers
pulling items to the player when bombing
After a quick C++ translation of the RNG function that spells out a 32-bit
multiplication on a 32-bit CPU using 16-bit instructions, we reach the final
pieces of ASM code for the 8-bit atan2() and trapezoid
rendering. These could actually pass for well-written ASM code in how they
express their 64-bit calculations: atan8() prepares its 64-bit
dividend in the combined EDX and EAX registers in
a way that isn't obvious at all from a cursory look at the code, and the
trapezoid functions effectively use Q32.32 subpixels. C++ allows us to
cleanly model all these calculations with 64-bit variables, but
unfortunately compiles the divisions into a call to a comparatively much
more bloated 64-bit/64-bit-division polyfill function. So yeah, we've
actually found a well-optimized piece of inline assembly that even Visual
Studio 2022's optimizer can't compete with. But then again, this is all
about code generation details that are specific to 32-bit code, and it
wouldn't be surprising if that part of the optimizer isn't getting much
attention anymore. Whether that optimization was useful, on the other hand…
Oh well, the new C++ version will be much more efficient in 64-bit builds.
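To make the Q32.32 idea a bit more concrete, here's a small sketch of how an edge of a trapezoid could be stepped per scanline with 64-bit variables. All names are hypothetical and the original functions differ in their details; the point is merely that the upper 32 bits hold the pixel and the lower 32 bits the subpixel fraction, and that the step is a 64-bit division:

```cpp
#include <cstdint>

// Q32.32 fixed-point value: integer pixel in the upper 32 bits, subpixel
// fraction in the lower 32 bits. (Hypothetical names for illustration.)
using q32_32 = int64_t;
constexpr q32_32 Q_ONE = (int64_t{ 1 } << 32);

constexpr q32_32 ToQ(int32_t pixel)
{
    return (pixel * Q_ONE);
}

// Rounds down to whole pixels. Right-shifting negative values is only
// guaranteed to be arithmetic since C++20, so this sketch assumes
// non-negative coordinates.
constexpr int32_t ToPixel(q32_32 v)
{
    return static_cast<int32_t>(v >> 32);
}

// Horizontal distance covered per scanline by an edge that runs from
// `x_top` to `x_bottom` over `height` scanlines. This is the 64-bit
// division that 32-bit compilers tend to turn into a helper function call.
constexpr q32_32 EdgeStep(int32_t x_top, int32_t x_bottom, int32_t height)
{
    return (ToQ(x_bottom - x_top) / height);
}

// X coordinate of that edge on the given scanline (0 = top).
constexpr int32_t EdgeXAt(
    int32_t x_top, int32_t x_bottom, int32_t height, int32_t line
)
{
    return ToPixel(ToQ(x_top) + (EdgeStep(x_top, x_bottom, height) * line));
}
```

For an edge from x = 0 to x = 10 over 4 scanlines, the step is exactly 2.5 pixels, so the edge lands on x = 2, 5, 7, and 10 on the following four scanlines.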
And with that, there's no more ASM code left in Shuusou Gyoku's codebase,
and the original DirectXUTYs directory is slowly getting
emptier and emptier.
Phew! Was that everything for this delivery? I think that was everything.
Here's the new build, which checks off 7 of the 15 remaining portability
Next up: Taking a well-earned break from Shuusou Gyoku and starting with the
preparations for multilingual PC-98 Touhou translatability by looking at
TH04's and TH05's in-game dialog system, and definitely writing a shorter
blog post about all that…
And then I'm even late by yet another two days… For some reason, preparing
Shuusou Gyoku for an OpenGL port has been the most difficult and drawn-out
task I've worked on so far throughout this project. These pushes have been in
development since April, for over two months in total. Tackling a legacy
codebase with such a vague goal while simultaneously wanting to keep
everything running did not do me any favors, and it was pretty hard to
resist the urge to fix everything that had better be fixed to make
this game portable…
📝 2022 ended with Shuusou Gyoku working at full speed on Windows ≥8 by itself, without external tools, for the first
time. However, since it all came down to just one small bugfix, the
resulting build still had several issues:
The game might still start in the slow, mitigated 8-bit or 16-bit
mode if the respective app compatibility flag is still present in the
registry from the earlier 📝 P0217 build. A
player would then have to manually put the game into 32-bit mode via the
Option menu to make it run at its actual intended speed. Bypassing this flag
programmatically would require some rather fiddly .EXE patching techniques.
The 32-bit mode tends to lag significantly if a lot of sprites are
onscreen, for example when canceling the final pattern of the Extra Stage
midboss. (#35)
If the game window lost and regained focus during the ending (for
example via Alt-Tabbing), the game reloads the wrong sprite sheet. (#19)
And, of course, we still have no native windowed mode, or support for
rendering in the higher resolutions you'd want to use on modern high-DPI
displays. (#7)
Now, we could tackle all of these issues one by one, in focused pushes… or
wait for one hero to fund a full-on OpenGL backend as part of the larger
goal of porting this game to Linux. This would take much longer, but fix all
these issues at once while bringing us significantly closer to Shuusou Gyoku
being cross-platform. Which is exactly what Ember2528 did.
Shuusou Gyoku is a very Windows-native codebase. Its usage of types
declared in <windows.h> even extends to core gameplay
code, the rendering code is completely architected around DirectDraw's
features and drawbacks, and text rendering is not abstracted at all. Looks
like it's now my task to write all the abstractions that pbg didn't manage
to write…
Therefore, I chose to stay with DirectDraw for a few more pushes while I
would build these abstractions. In hindsight, this was the least efficient
approach one could possibly imagine for the exact goal of porting the game
to Linux. Suddenly, I had to understand all this DirectDraw and GDI
jank, just to keep the game running at every step along the way. Retaining
Shuusou Gyoku's 8-bit mode in particular was a huge pain, but I didn't want
to remove it because it's currently the only way I can easily debug the game
in windowed mode at a scaled resolution, through DxWnd. In 16-bit or
32-bit mode, DxWnd slows down to a crawl, roughly resembling the performance
drop we used to get with Windows' own compatibility mitigations for the
original build.
The upside, though, is that everything I've built so far still works with
the original 8-bit and 16-bit graphics modes. And with just one compiler flag to disable
any modern x86 instructions, my build can still run on i586/P5 Pentium
CPUs, and only requires KernelEx and its latest
Kstub822 patches to run on Windows 98. And, surprisingly, my core
audience does appreciate this fact. Thus, I will include an i586 build
in all of my upcoming Shuusou Gyoku releases from now on. Once this codebase
can compile into a 64-bit binary (which will obviously be required for a
native Linux build), the i586 build will remain the only 32-bit Windows
build I'll include in my releases.
So, what was DirectDraw? In the shortest way that still describes it
accurately from the point of view of a developer: "A hardware acceleration
layer over Ye Olde Win32 GDI, providing double-buffering and fast blitting
of rectangles." There's the primary double-buffered framebuffer
surface, the offscreen surfaces that you create (which are
comparable to what 3D rendering APIs would call textures), and you
can blit rectangular regions between the two. That's it. Except for
double-buffering, DirectDraw offers no feature that GDI wouldn't also
support, while not covering some of GDI's more complex features. I mean,
DirectDraw can blit rectangles only? How
However, DirectDraw's relative lack of features is not as much of a problem
as it might appear at first. The reason for that lies in what I consider to
be DirectDraw's actual killer feature: compatibility with GDI's device
context (DC) abstraction. By acquiring a DC for a DirectDraw surface,
you can use all existing GDI functions to draw onto the surface, and, in
general, it will all just work. 😮 Most notably, you can use GDI's blitting
functions (i.e., BitBlt() and friends) to transfer pixel data
from a GDI HBITMAP in system memory onto a DirectDraw surface
in video memory, which is the easiest and most straightforward way to, well,
get sprite data onto a DirectDraw surface in the first place.
In theory, you could do that without ever touching GDI by locking the
surface memory and writing the raw bytes yourself. But in practice, you
probably won't, because your game has to run under multiple bit depths and
your data files typically only store one copy of all your sprites in a
single bit depth. And the necessary conversion and palette color matching…
is a mere implementation detail of GDI's blitting functions, using a
supposedly optimized code path for every permutation of source and
destination bit depths.
All in all, DirectDraw doesn't look too bad so far, does it? Fast blitting,
and you can still use the full wealth of GDI functions whenever needed… at
the small cost of potentially losing your surface memory at any time. 🙄
Yup, if a DirectDraw game runs in true resolution-changing fullscreen mode
and you switch to the Windows desktop, all your surface memory is freed and
you have to manually restore it once the game regains focus, followed by
manually copying all intended bitmap data back onto all surfaces. DirectDraw
is where this concept of surface loss originated, which later carried over
to the earlier versions of Direct3D and, infamously,
Direct2D as well.
Looking at it from the point of view of the mid-90s, it does make sense to
let the application handle trashed video memory if that's an unfortunate
reality that your graphics API implementation has to deal with. You don't
want to retain a second copy of each surface in a less volatile part of
memory because you didn't have that much of it. Instead, the application can
now choose the most appropriate way to restore each individual surface. For
procedurally generated surfaces, it could just re-run the generating code,
whereas all the fixed sprite sheets could be reloaded from disk.
In practice though, this well-intentioned freedom turns into a huge pain.
Suddenly, it's no longer enough to load every sprite sheet once before it's
needed, blit its pixel data onto the DirectDraw surface, and forget about
it. Now, the renderer must also be able to refresh the pixel data of every
surface from within itself whenever any of DirectDraw's blitting
functions fails with a DDERR_SURFACELOST error. This fact alone
is enough to push your renderer interface towards central management and
allocation of surfaces. You could maybe avoid the conceptual
SurfaceManager by bundling each surface with a regeneration
callback, but why should you? Any other graphics API would work with
straight-line procedural load-and-forget initialization code, so why slice
that code into little parts just because of some DirectDraw quirk?
So if your surfaces can get trashed at any time, and you already use
GDI to copy them from system memory to DirectDraw-managed video memory,
and your game features at least one procedurally generated surface…
you might as well retain every currently loaded surface in the form of an
additional GDI device-independent bitmap. 🤷 In fact, that's even better
than what Shuusou Gyoku did originally: For all .BMP-sourced surfaces, it
only kept a buffer of the entire decompressed .BMP file data, which means
that it had to recreate said intermediate GDI bitmap every time it needed to
restore a surface. The in-game music title was originally restored
via regeneration callback that re-rendered the intended title directly onto
the DirectDraw surface, but this was handled by an additional "restore hook"
system that remained unused for anything else.
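The resulting pattern can be sketched without any actual DirectDraw or GDI calls. The following mock is not the game's code: `Surface`, `SurfaceManager`, and `BlitResult` are invented names, a `std::vector` stands in for the retained GDI bitmap, and an empty `std::optional` plays the role of lost video memory. What it tries to show is why a central manager falls out of the design: a failed blit restores everything from the retained copies and then simply retries.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

enum class BlitResult { Ok, SurfaceLost };

// Mock surface: `retained` stands in for the system-memory bitmap that is
// kept around, `vram` for the volatile video memory copy.
struct Surface {
    std::vector<uint8_t> retained;
    std::optional<std::vector<uint8_t>> vram;

    explicit Surface(std::vector<uint8_t> pixels) :
        retained(std::move(pixels)), vram(retained)
    {
    }

    void SimulateLoss() { vram.reset(); }
    void Restore() { vram = retained; }

    BlitResult TryBlit(std::vector<uint8_t>& dst) const
    {
        if(!vram) {
            return BlitResult::SurfaceLost;
        }
        dst = *vram;
        return BlitResult::Ok;
    }
};

class SurfaceManager {
    std::vector<Surface> surfaces;

public:
    // Load-and-forget from the caller's point of view.
    size_t Load(std::vector<uint8_t> pixels)
    {
        surfaces.emplace_back(std::move(pixels));
        return (surfaces.size() - 1);
    }

    // Simulates switching away from a fullscreen game.
    void SimulateFocusLoss()
    {
        for(auto& s : surfaces) {
            s.SimulateLoss();
        }
    }

    // On a lost surface, restore all surfaces from their retained copies
    // and retry once.
    bool Blit(size_t id, std::vector<uint8_t>& dst)
    {
        if(surfaces[id].TryBlit(dst) == BlitResult::Ok) {
            return true;
        }
        for(auto& s : surfaces) {
            s.Restore();
        }
        return (surfaces[id].TryBlit(dst) == BlitResult::Ok);
    }
};
```

The calling code never has to know that a loss happened, which is the whole point of keeping the retained copies inside the renderer.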
Anything more involved would be a micro-optimization, especially since the
goal is to get away from DirectDraw here. Not much point in "neatly"
reloading sprite surfaces from disk if the total size of all loaded sprite
sheets barely exceeds the 1 MiB mark. Also, keeping these GDI DIBs loaded
and initialized does speed up getting back into the game… in theory,
at least. After all, the game still runs in fullscreen mode, and resolution
switching already takes longer on modern flat-panel displays than any
surface restoration method we could come up with.
So that was all pretty annoying. But once we start rendering in 8-bit mode,
it gets even worse as we suddenly have to bother with palette management.
Similar to PC-98 Touhou, Shuusou Gyoku
uses way too many different palettes. In fact, it creates
a separate DirectDraw palette to retain the palette embedded into every
loaded .BMP file, and simply sets the palette of the primary surface and the
backbuffer to the one it loaded last. Like, why would you retain
per-surface palettes, and what effect does this even have? What even happens
when you blit between two DirectDraw surfaces that have different palettes?
Might this be the cause of the discolored in-game music title when playing
under DxWnd? 😵 But if we try throwing out those extra palettes, it
only takes until Stage 3 for us to be greeted with… the infamous golf
As you might have guessed, these exact colors come from Gates' face sprite,
whose palette apparently doesn't match the sprite sheets used in Stage 3.
Turns out that 256 colors are not enough for what Shuusou Gyoku would like
to use across the entire stage. In sprite loading order:
Sprite sheet
Additional unique colors
Total unique colors
General system sprites
Stage 3 enemies
Stage 3 map tiles
Wide Shot bomb cut-in
VIVIT's faceset
Unknown face
Gates' faceset
And that's why Shuusou Gyoku does not only have to retain these palettes,
but also contains stage
script commands (!) to switch the current palette back to either the map
or enemy one, after the dialog system enforced the face palette.
But the worst aspects about palettes rear their ugly head at the boundary
between GDI and DirectDraw, when GDI adds its own palettes into the mix.
None of the following points are clearly documented in either ancient or
current MSDN, forcing each new DirectDraw developer to figure them out on
their own:
When calling IDirectDraw::CreateSurface() in 8-bit mode,
DirectDraw automatically sets up the newly created surface with a reference
(not a copy!) to the palette that's currently assigned to the primary
surface.
When locking an 8-bit surface for GDI blitting via
IDirectDrawSurface::GetDC(), DirectDraw is supposed to set the
GDI palette of the returned DC to the current palette of the DirectDraw…
primary surface?! Not the surface you're actually calling
GetDC() on?!
Interestingly, it took until March of this year for DxWnd to discover a
different game that relied on this detail, while DDrawCompat had
implemented it for years. DxWnd version 2.05.95 then introduced the
DirectX(2) → Fix DC palette tweak, and it's this option that would
fix the colors of the in-game music title on any Shuusou Gyoku build older
than P0251.
Make sure to never BitBlt() from a 24-bit RGB GDI
image to a palettized 8-bit DirectDraw offscreen surface. You might be
tempted to just go 24-bit because there's no palette to worry about and you
can retain a single GDI image for every supported bit depth, but the
resulting palette mapping glitches will be much worse than if you just
stayed in 8-bit. If you want to procedurally generate a GDI bitmap for a
DirectDraw surface, for example if you need to render text, just create
a bitmap that's compatible with the DC of DirectDraw's primary or
backbuffer surface. Doing that magically removes all palette woes, and
CreateCompatibleBitmap() is much easier to call anyway.
Ultimately, all of this is why Shuusou Gyoku's original DirectDraw backend
looks the way it does. It might seem redundant and inefficient in places,
but pbg did in fact discover the only way where all the undocumented GDI and
DirectDraw color mapping internals come together to make the game look as
intended. 🧑‍🔬
And what else are you going to do if you want to target old hardware? My
PC-9821Nw133, for example, can only run the original Shuusou Gyoku in 8-bit
mode. For a Windows game on such old hardware, 8-bit DirectDraw looks like
the only viable option. You certainly don't want to use GDI alone, because
that's probably slow and you'd have to worry about even more palette-related
issues. Although people have reported that Shuusou Gyoku does actually
run faster on their old Windows 9x machine if they disable DirectDraw
In that case, it might be worth a try to write a completely new 8-bit
software renderer, employing the same retained VRAM techniques that the
PC-98 Touhou games used to implement their scrolling playfields with a
minimum of redraws. The hardware scrolling feature of the PC-98 GDC would
then be replicated by blitting the playfield in two halves every frame. I
wonder how fast that would be…
Or you go straight back to DOS, and bring your own font renderer and
MIDI/PCM sound driver.
So why did we have to learn about all this? Well, if GDI functions can
directly render onto any kind of DirectDraw surface, this also includes text
rendering functions like TextOut() and DrawText().
If you're really lazy, you can even render your text directly onto
the DirectDraw backbuffer, which probably re-rasterizes all glyphs
every frame!
Which, you guessed it, is exactly how Shuusou Gyoku renders most of its
text. 🐷 Granted, it's not too bad with MS Gothic thanks to its embedded
bitmaps for font
heights between 7 and 22 inclusive, which replace the usual Bézier curve
rasterization for TrueType fonts with a rather quick bitmap lookup. However,
it would become a problem not only hypothetically, if future translations end
up choosing more complex fonts without embedded bitmaps, but also as soon as
we port the game to other systems. Nobody in their right mind would
integrate a cross-platform font renderer directly with a 3D graphics API… right?
Instead, let's refactor the game to render all its existing text to and from
a bitmap,
extending the way the in-game music title is rendered to the rest of the
game. Conceptually, this is also how the Windows Touhou games have always
rendered their text. Since they've always used Direct3D, they've always had
to blit GDI's output onto a texture. Through the definitions in
text.anm, this fixed-size texture is then turned into a sprite
sheet, allowing every rendered line of text to be individually placed on the
screen and animated.
However, the static nature of both the sprite sheet and the texture caused
its fair share of problems for thcrap's translation support. Some of the
sprites, particularly the ones for spell card titles, don't originally take
up the entire width of the playfield, cutting off translations long before
they reach the left edge. Consequently, thcrap's base patch
for the Windows Touhou games has to resize the respective sprites to
make translators happy. Before I added .ANM header
patching in late 2018, this had to be done through a complete modified
copy of text.anm for every game – with possibly additional
variants if ZUN changed the layout of this file between game versions. Not
to mention that it's bound to be quite annoying to manually allocate a
rectangle for every line of text we want to show. After all, I have at least two text-heavy future
features in mind already…
So let's not do exactly that. Since DirectDraw wants us to manage all
surfaces in a central place, we keep the idea of using a single surface for
all text. But instead of predefining anything about the surface layout, we
fully build up the surface at runtime based on whatever rectangles we need,
using a rectangle
packing algorithm… yup, I wouldn't have expected to enter such territory
either. For now, we still hardcode a fixed maximum size for each piece of
text. But once we get translations, nothing is
stopping us from dynamically extending this size to fit even longer strings,
and fitting them onto the fixed screen space via smooth scrolling.
To prevent the surface from arbitrarily growing as the game wants to render
more and more text, we also reset all allocated rectangles whenever the game
state changes. In turn, this will also recreate the text surface to match
the new bounding box of all rectangles before the first prerendering call
with the new layout. And if you remember the first bullet point about
DirectDraw palettes in 8-bit mode, this also means that the text surface
automatically receives the current palette of the primary surface, giving
us correct colors even without requiring DxWnd's DC palette tweak. 🎨
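To give an idea of the moving parts, here's a deliberately naive sketch of such a runtime-built layout. It uses simple shelf packing (filling rows from left to right), which is an assumption on my part for illustration and not necessarily the algorithm used in the game; `ShelfPacker` and all of its members are invented names. The bounding box of all registered rectangles is what would determine the size of the text surface, and `Reset()` corresponds to the game state change:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rect {
    int32_t left, top, w, h;
};

// Naive shelf packer: rectangles are placed from left to right, and a new
// row ("shelf") starts once the next one would exceed `max_width`.
// Assumes that no single rectangle is wider than `max_width`.
class ShelfPacker {
    int32_t max_width;
    int32_t cursor_x = 0;
    int32_t shelf_y = 0;
    int32_t shelf_h = 0;
    int32_t bounds_w = 0;
    std::vector<Rect> rects;

public:
    explicit ShelfPacker(int32_t max_width_) : max_width(max_width_) {}

    // Reserves space for one piece of text and returns its ID.
    size_t Register(int32_t w, int32_t h)
    {
        if((cursor_x + w) > max_width) {
            shelf_y += shelf_h;
            cursor_x = 0;
            shelf_h = 0;
        }
        rects.push_back({ cursor_x, shelf_y, w, h });
        cursor_x += w;
        shelf_h = std::max(shelf_h, h);
        bounds_w = std::max(bounds_w, cursor_x);
        return (rects.size() - 1);
    }

    const Rect& Get(size_t id) const { return rects[id]; }

    // Size of the surface needed to hold all registered rectangles.
    int32_t BoundsW() const { return bounds_w; }
    int32_t BoundsH() const { return (shelf_y + shelf_h); }

    // Called on game state changes; the surface would then be recreated
    // at the new bounding box before the first prerendering call.
    void Reset()
    {
        rects.clear();
        cursor_x = shelf_y = shelf_h = bounds_w = 0;
    }
};
```

A real packer would try harder to minimize wasted space, but with this little text per game state, even something this simple would probably get far.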
In fact, the need to dynamically create surfaces at custom sizes was the
main reason why I had to look into DirectDraw surface management to begin
with. The original game created
all of its surfaces at once, at startup or after changing the bit depth
in the main menu, which was a bad idea for many reasons:
It hardcoded and limited the size of all sprite sheets,
added another rendering-API-specific function that game code should not
need to worry about,
introduced surface IDs that have to be synchronized with the
surface pointers used throughout the rest of the game,
and was the main reason why the game had to distribute the six 320×240
ending pictures across two of the fixed 640×480 surfaces, which ended up
causing the sprite reload
bug in the ending. As implied in the issue, this was a DirectDraw bug
that pretty much had to fix itself before I could port the game to OpenGL,
and was the only bug where this was the case. Check the issue comments for
more details about this specific bug.
In the end, we get four different layouts for the text surface: one each for the
main menu, the Music Room, the in-game portion, and the ending. With,
perhaps surprisingly, not too much text on any of them:
Yes, the ending uses just a single rectangle that takes up the entire screen
space below the pictures and credits.
For the menus, the resulting packed layout reveals how I'm assigning a
separately cached rectangle to each possible option – otherwise, they
couldn't be arranged vertically on screen with this bitmap layout. Right
now, I'm only storing all text for the current menu level, which requires
text to be rendered again when entering or leaving submenus. However, I'm
allocating as many rectangles as required for the submenu with the most
items to at least prevent the single text surface from being
resized while navigating through the menu. As a side effect, this is also
why you can see multiple Exit labels: These simply come from
other submenus with more elements than the currently visited Sound /
Music one.
Still, we're re-rasterizing whole lines of text exactly as they appear on
screen, and are even doing so multiple times to apply any drop shadows.
Isn't that exactly what every text rendering tutorial nowadays advises
against doing? Why not directly go for the classic solution to this problem
and render using a font texture
atlas? Well…
Most of the game text is still in Japanese. If we were to build a font
atlas in advance, we'd have to add a separate build step that collects all
needed codepoints by parsing all text the game would ever print, adding a
build-time dependency on the original game's copyrighted data files. We'd
also have to move all hardcoded strings to a separate file since we surely
don't want to parse C++ manually during said build step. Theoretically, we
would then also give up the idea of modding text at run-time without
re-running that build step, since we'd restrict all text to the glyphs we've
rasterized in the atlas… yeah, that's more than enough reasons for static
atlas generation to be a non-starter.
OK, then let's build the atlas dynamically, adding new glyphs as we
encounter them. Since this game is old, we can even be a bit lazy as far as
the packing is concerned, and don't have to get as fancy as the GIF in the
link above. Just assume a fixed height for each glyph, and fill the atlas
from left to right. We can even clear it periodically to keep it from
getting too big, like before entering the Music Room, the in-game portion,
or the ending, or after switching languages once we have translations.
Should work, right?
Except that most text in Shuusou Gyoku comes with a shadow, realized by
first drawing the same string in a darker color and displaced by a few
pixels. With a 3D renderer, none of this would be an issue because we can
define vertex colors. But we're still using DirectDraw, which has no way of
applying any sort of color formula – again, all it can do is take a
rectangle and blit it somewhere else. So we can't just keep one atlas with
white glyphs and let the renderer recolor it. Extending Shuusou Gyoku's
Direct3D code with support for textured quads is also out of the question
because then we wouldn't have any text in the Direct3D-less 8-bit mode. So
what do we do instead? Throw the atlas away on every color change? Keep
multiple atlases for every color we've seen so far? Turn shadows into a
high-level concept? Outright forgetting the idea seems to be the best choice
For a rather square language like Japanese where one Shift-JIS codepoint
always corresponds to one glyph, a texture atlas can work fine and without
too much effort. But once we support languages with more complex ligatures,
we suddenly need to get a shaping
engine from somewhere, and directly interact with it from our rendering
code. This necessarily involves changing APIs and maybe even bundling the
first cross-platform libraries, which I wanted to avoid in an already packed
and long overdue delivery such as this one. If we continue to render
line-by-line, translations would only need a line break algorithm.
Most importantly though: It's not going to matter anyway. The
game ran fine on early 2000s hardware even though it called
TextOut() every frame, and any approach that caches the result
of this call is going to be faster.
While the Music Room and the ending can be easily migrated to a prerendering
system, it's much harder for the main menu. Technically, all option
strings of the currently active submenu are rewritten every frame, even
though that would only be necessary for the scrolling MIDI device name in
the Sound / Music submenu. And since all this rewriting is done
via a classic sprintf() on fixed-size char
buffers, we'd have to deploy our own change detection before prerendering
can make any performance difference.
In essence, we'd be shifting the text rendering paradigm from the original
immediate approach to a more retained one. If you've ever used any of the
hot new immediate-mode GUI or web frameworks that have become popular over
the last 10 years, your alarm bells are probably already ringing by now.
Adding retained elements is always a step back in terms of code quality, as
it increases complexity by storing UI state in a second place.
Wouldn't it be better if we could just stay with the original immediate
approach then? Absolutely, and we only need a simple cache system to get
there. By remembering the string that was last rendered to every registered
rectangle, the text renderer can offer an immediate API that combines the
distinct Prerender() and Blit() steps into a
single Render() call. There still has to be an initialization
point that registers all rectangles for each game state (which,
surprisingly, was not present for the in-game portion in the original code),
but the rendering code remains architecturally unchanged in how we call the
text renderer every frame. As long as the text doesn't change, the text
renderer just blits whatever it previously rendered to the respective
rectangle. With an API like this, the whole pre-rendering part turns into a
mere implementation detail.
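Stripped of all actual rasterization and blitting, the caching idea boils down to something like the following sketch. The class layout and the two counters are made up for illustration; in the real renderer, the marked spots would be where the GDI text call and the DirectDraw blit happen:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

class TextRenderer {
    struct Slot {
        std::string last_text;
        bool valid = false;
    };
    std::vector<Slot> slots;

public:
    // Counters that stand in for the expensive and the cheap operation.
    unsigned rasterizations = 0;
    unsigned blits = 0;

    // Initialization point: registers one rectangle for the current game
    // state and returns its ID.
    size_t Register()
    {
        slots.emplace_back();
        return (slots.size() - 1);
    }

    // Immediate-mode entry point, called every frame with whatever text
    // should currently be shown.
    void Render(size_t id, std::string_view text)
    {
        Slot& slot = slots[id];
        if(!slot.valid || (slot.last_text != text)) {
            // Former Prerender() step: rasterize into the rectangle.
            ++rasterizations;
            slot.last_text = text;
            slot.valid = true;
        }
        // Former Blit() step: copy the rectangle to the screen.
        ++blits;
    }
};
```

Calling `Render()` with the same string for three frames in a row would rasterize once and blit three times, and a changed string triggers exactly one more rasterization.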
So, how much faster is the result? Since I can only measure non-VSynced
performance in a quite rudimentary way using DxWnd's FPS counter, it highly
depends on the selected renderer. Weirdly enough, even just switching font
creation to the Unicode APIs tripled the FPS inside the Music Room
when rendering with OpenGL? That said, the primary surface renderer
seems to yield the most realistic numbers, as we still stay entirely within
DirectDraw and perform no API wrapping. Using this renderer, I get speedups
of roughly:
~3.5× in the Music Room,
~1.9× during in-game dialog, and
~1.5× in the main menu.
Not bad for something I had to do anyway to port the game away from
DirectDraw! Shuusou Gyoku is rather infamous in the vintage computer
scene for being ridiculously unoptimized, so I should definitely be able to
get some performance gains out of the in-game portion as well.
For a final test of all the new blitting code, I also tried running
outside DxWnd to verify everything against real and unpatched
DirectDraw. Amusingly, this revealed how blitting from the new text surface
seems to reach the color mapping limits of the DWM mitigation in 8-bit mode:
For some reason, my system maps the intended #FFFFFF text
color to #E4E3BB in the main menu?
8-bit mode did render correctly when I ran the same build in a Windows 98
VirtualBox on the same system though, so it's not worth looking into a mode
that the system reports as unsupported to begin with. Let's leave this as
somewhat of a visual reminder for players to select 32-bit mode instead.
Alright, enough about the annoying parts of GDI and DirectDraw for now.
Let's stop looking back and start looking forward, to a time within this
Seihou revolution when we're going to have lots of new options in the main
menu. Due to the nature of delivering individual pushes, we can expect lots
of revisions to the config file format. Therefore, we'd like to have a
backward-compatible system that allows players to upgrade from any older
build, including the original 秋霜玉.exe, to a newer one. The
original game predominantly used single-byte values for all its options, but
we'd like our system to work with variables of any size, including strings
to store things like the
name of the selected MIDI device in a more robust way. Also, it's pure
evil to reset the entire configuration just because someone tried to
hex-edit the config file and didn't keep the checksum in mind.
It didn't take long for me to arrive at a common
Size()/Read()/Write() interface. By
using the same interface for both arrays and individual values, new config
file versions can naturally expand older ones by taking the array of option
references from the previous version and wrapping it into a new array,
together with the new options.
The classic way of implementing this in C++ involves a typical
object-oriented class hierarchy: An Option base class would
define the interface in the form of virtual abstract functions, and the
Value, Array, and ConfigVersion
subclasses would provide different implementations. This works, but
introduces quite a bit of boilerplate, not to mention the runtime bloat from
all the virtual functions which Visual C++ can't inline. Why should we do
any runtime dispatch here? We know the set of configuration options
at compile time, after all…
Let's try looking into the modern C++ toolbox and see if we can do better.
The only real challenge here is that the array type has to support
arbitrarily sized option value types, which sounds like a job for
template parameter packs. If we save these into a
std::tuple, we can then "iterate" over all options with std::apply
and fold
expressions, in a nice functional style.
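As a rough illustration of that idea, here's a minimal C++17 sketch. `Value`, `Array`, and the `Size()`/`Read()`/`Write()` names follow the text above, but the exact signatures, the `MakeArray()` helper, and the choice of storing references are my assumptions here, and the concept that would formally describe the interface is left out to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <tuple>

// A single option of any trivially copyable type. Holds a reference to the
// variable that stores the actual setting.
template <class T> struct Value {
    T& ref;

    std::size_t Size() const { return sizeof(T); }
    void Read(const std::uint8_t* src) const {
        assert(src != nullptr);
        std::memcpy(&ref, src, sizeof(T));
    }
    void Write(std::uint8_t* dst) const {
        assert(dst != nullptr);
        std::memcpy(dst, &ref, sizeof(T));
    }
};
template <class T> Value(T&) -> Value<T>;

// An array of options with the same interface as a single option. Since
// arrays can contain other arrays, a new config file version can wrap the
// previous one.
template <class... Options> struct Array {
    std::tuple<Options...> options;

    std::size_t Size() const {
        return std::apply([](const auto&... o) {
            return (std::size_t{ 0 } + ... + o.Size());
        }, options);
    }
    void Read(const std::uint8_t* src) const {
        std::apply([&src](const auto&... o) {
            ((o.Read(src), src += o.Size()), ...);
        }, options);
    }
    void Write(std::uint8_t* dst) const {
        std::apply([&dst](const auto&... o) {
            ((o.Write(dst), dst += o.Size()), ...);
        }, options);
    }
};

// Hypothetical convenience helper for deducing the option types.
template <class... Options> auto MakeArray(Options... options) {
    return Array<Options...>{ std::tuple<Options...>{ options... } };
}
```

With a hypothetical `Config` structure, a second file version could then be declared as `MakeArray(v0, Value{ cfg.new_option })`, where `v0` is the array of the first version. All calls resolve at compile time, without any virtual dispatch.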
I was amazed by just how clearly the "crazy" modern C++ approach with
template parameter packs, std::apply() over giant
std::tuples, and fold expressions beats a classic polymorphic
hierarchy of abstract virtual functions. With the interface moved into an
even optional concept, the class hierarchy can be completely
flattened, which surprisingly also makes the code easier to both read and
Here's how the new system works from the player's point of view:
The config files now use a kanji-less and explicitly forward-compatible
naming scheme, starting with SSG_V00.CFG in the P0251 build.
The format of this initial version simply includes all values from the
original 秋霜CFG.DAT without padding bytes or a checksum. Once
we release a new build that adds new config options, we go up to
SSG_V01.CFG, and so on.
When loading, the game starts at its newest supported config file
version. If that file doesn't exist, the game retries with each older
version in succession until it reaches the last file in the chain, which is
always the original 秋霜CFG.DAT. This makes it possible to
upgrade from any older Shuusou Gyoku build to a newer one while retaining
all your settings – including, most importantly, which shot types you
unlocked the Extra Stage with. The newly introduced settings will simply
remain at their initial default in this case.
When saving, the game always writes all versions it knows about,
down to and including the original 秋霜CFG.DAT, in the
respective version-specific format. This means that you can change options
in a newer build and they'll show up changed in older builds as well if they
were supported there.
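In code, the load and save rules boil down to two short loops over a version chain. The following sketch replaces real file I/O with an in-memory map so that it stays self-contained. The `Config` fields, the 1-byte and 2-byte file layouts, and the `new_option` setting are purely hypothetical, and the padding and checksum of the original file format are not modeled at all. Falling through to an older version when a file has an unexpected size is also just my assumption here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using FakeDisk = std::map<std::string, Bytes>; // stand-in for real file I/O

struct Config {
    std::uint8_t bit_depth = 16;
    std::uint8_t new_option = 7; // hypothetical option introduced in V01
};

bool LoadV00(Config& c, const Bytes& b) {
    if(b.size() != 1) return false;
    c.bit_depth = b[0];
    return true;
}
Bytes SaveV00(const Config& c) { return Bytes{ c.bit_depth }; }

bool LoadV01(Config& c, const Bytes& b) {
    if(b.size() != 2) return false;
    c.bit_depth = b[0];
    c.new_option = b[1];
    return true;
}
Bytes SaveV01(const Config& c) { return Bytes{ c.bit_depth, c.new_option }; }

// The original build rejects 32-bit, so its file receives 16-bit instead.
Bytes SaveOriginal(const Config& c) {
    std::uint8_t depth = c.bit_depth;
    if(depth == 32) depth = 16;
    return Bytes{ depth };
}

struct Version {
    const char* fn;
    bool (*load)(Config&, const Bytes&);
    Bytes (*save)(const Config&);
};

// Newest version first. The chain always ends with the original file.
const Version VERSIONS[] = {
    { "SSG_V01.CFG", LoadV01, SaveV01 },
    { "SSG_V00.CFG", LoadV00, SaveV00 },
    { "秋霜CFG.DAT", LoadV00, SaveOriginal },
};

// Returns the name of the file that was loaded, or nullptr if none was found.
// Settings not covered by that file keep their defaults.
const char* Load(Config& c, const FakeDisk& disk) {
    for(const auto& v : VERSIONS) {
        assert(v.load && v.save);
        const auto it = disk.find(v.fn);
        if((it != disk.end()) && v.load(c, it->second)) return v.fn;
    }
    return nullptr;
}

// Writes every known version in its own format.
void Save(const Config& c, FakeDisk& disk) {
    for(const auto& v : VERSIONS) disk[v.fn] = v.save(c);
}
```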
And yes, this also means that we can stop writing the unsupported 32-bit bit
depth setting to 秋霜CFG.DAT, which would cause a validation
failure on the original build. This is now avoided by simply turning 32-bit
into 16-bit just for the configuration that gets saved to this file. And
speaking of validation failures…
This per-value validation is also done when my builds load the
original 秋霜CFG.DAT. The checksum is still written for
compatibility with the original build, but my builds ignore it.
With that, we've got more than enough code for a new build:
This build also contains two more fixes that didn't fit into the big
DirectDraw or configuration categories:
The P0226 build had a bug that allowed invalid stages to be selected for
replay recording. If the ReplaySave option was
[O F F], pressing the ⬅️ left arrow key on the
option would overflow its value to 255. The effects of this weren't all too
serious: The game would simply stay on the Weapon Select screen for an
invalid stage number, or launch into the Extra Stage if you scrolled all the
way to 131. Still, it's fixed in this build. Whoops! That one was fully my fault.
The render time for the in-game music title is now roughly cut in half:
Achieved by simply trimming trailing whitespace and using slightly more
efficient GDI functions to draw the gradient. Spending 4 frames on
rendering a gradient is still way too much though. I'll optimize that
further once I actually get to port this effect away from GDI.
These videos also show off how DxWnd's DC palette bug affected the
original game, and how it doesn't affect the P0251 build.
These 6 pushes still left several of Shuusou Gyoku's DirectDraw portability
issues unsolved, but I'd better look at them once I've set up a basic OpenGL
skeleton to avoid any more premature abstraction. Since the ultimate goal is
a Linux port, I might as well already start looking at the current best
platform layer libraries. SDL would be the standard choice here, and while
SDL_ttf looks regrettably misdesigned, the core SDL library seems to cover
all we could possibly want for Shuusou Gyoku, including a 2D renderer… wait,
Yup. Admittedly, I've been living under a rock as far as SDL is concerned,
and thus wasn't aware that SDL 2 introduced its own abstraction for 2D
rendering that just happens to almost exactly cover everything we need
for Shuusou Gyoku. This API even covers all of the game's Direct3D code,
which only draws alpha-blended, untextured, and pre-transformed
vertex-colored triangles and lines. It's the exact abstraction over OpenGL I
thought I had to write myself, and such a perfect match for this game that
it would be foolish to go for a custom OpenGL backend – especially since SDL
will automatically target the ideal graphics API for any given operating system.
Sadly, the one thing SDL_Renderer is missing is something equivalent to
pixel shaders, which we would need to replicate the 西方Project lens ball effect shown at startup. Looks like we have
to drop into a
completely separate, unaccelerated rendering mode and continue to
software-render this one effect before switching to hardware-accelerated
rendering for the rest of the game. But at least we can do that in a
cross-platform way, and don't have to bother with shading languages –
or, perhaps even worse, SDL's own shading language.
If we were extremely pedantic, we'd also have to do the same for the
📝 unused spiral effect that was originally intended for the staff roll.
Software rendering would be even more annoying there, since we don't
just have to software-render these staff sprites, but also the ending
picture and text, complete with their respective fade effects. And while I
typically do go the extra mile to preserve whatever code was present in
these games, keeping this effect would just needlessly drive up the
cost of the SDL backend. Let's just move this one to the museum of unused
code and no longer actively compile it. RIP spiral 🥲 At least you're
still preserved in lossless video form.
Now that SDL has become an integral part of Shuusou Gyoku's portability plan
rather than just being one potential platform layer among many, the optimal
order of tasks has slightly changed. If we stayed within the raw Win32 API
any longer than absolutely necessary, we'd only risk writing more
Win32-native code for things like audio streaming that we'd
then have to throw away and rewrite in SDL later. Next up, therefore:
Staying with Shuusou Gyoku, but continuing in a much more focused manner by
fixing the input system and starting the SDL migration with input and sound.
TH04 PI/RE (Stage 5 star rendering + Stage 6 Yuuka checkerboard + Custom entity structures, part 1/2)
TH04 PI/RE (Custom entity structures, part 2/2 + Thick laser structure + PI false positives + .STD loading)
💰 Funded by:
JonathKane, Blue Bolt, [Anonymous]
Well, well. My original plan was to ship the first step of Shuusou Gyoku
OpenGL support on the next day after this delivery. But unfortunately, the
complications just kept piling up, to a point where the required solutions
definitely blow the current budget for that goal. I'm currently sitting on
over 70 commits that would take at least 5 pushes to deliver as a meaningful
release, and all of that is just rearchitecting work, preparing the
game for a not too Windows-specific OpenGL backend in the first place. I
haven't even written a single line of OpenGL yet… 🥲
This shifts the intended Big Release Month™ to June after all. Now I know
that the next round of Shuusou Gyoku features had better start with the
SC-88Pro recordings, which are much more likely to get done within their
current budget. At least I've already completed the configuration versioning
system required for that goal, which leaves only the actual audio part.
So, TH04 position independence. Thanks to a bit of funding for stage
dialogue RE, non-ASCII translations will soon become viable, which finally
presents a reason to push TH04 to 100% position independence after
📝 TH05 had been there for almost 3 years. I
haven't heard back from Touhou Patch Center about how much they want to be
involved in funding this goal, if at all, but maybe other backers are
interested as well.
And sure, it would be entirely possible to implement non-ASCII translations
in a way that retains the layout of the original binaries and can be easily
compared at a binary level, in case we consider translations to be a
critical piece of infrastructure. This wouldn't even just be an exercise in
needless perfectionism, and we only have to look to Shuusou Gyoku to realize
why: Players expected
that my builds were compatible with existing SpoilerAL SSG files, which
was something I hadn't even considered the need for. I mean, the game is
open-source 📝 and I made it easy to build.
You can just fork the code, implement all the practice features you want in
a much more efficient way, and I'd probably even merge your code into my
builds then?
But I get it – recompiling the game yields just yet another build that can't
be easily compared to the original release. A cheat table is much more
trustworthy in giving players the confidence that they're still practicing
the same original game. And given the current priorities of my backers,
it'll still take a while for me to implement proof by replay validation,
which will ultimately free every part of the community from depending on the
original builds of both Seihou and PC-98 Touhou.
However, such an implementation within the original binary layout would
significantly drive up the budget of non-ASCII translations, and I sure
don't want to constantly maintain this layout during development. So, let's
chase TH04 position independence like it's 2020, and quickly cover a larger
amount of PI-relevant structures and functions at a shallow level. The only
parts I decompiled for now contain calculations whose intent can't be
clearly communicated in ASM. Hitbox visualizations or other more in-depth
research would have to wait until I get to the proper decompilation of these
But even this shallow work left us with a large amount of TH04-exclusive
code that had its worst parts RE'd and could be decompiled fairly quickly.
If you want to see big TH04 finalization% gains, general TH04 progress would
be a very good investment.
The first push went to the often-mentioned stage-specific custom entities
that share a single statically allocated buffer. Back in 2020, I
📝 wrongly claimed that these were a TH05 innovation,
but the system actually originated in TH04. Both games use a 26-byte
structure, but TH04 only allocates a 32-element array rather than TH05's
64-element one. The conclusions from back then still apply, but I also kept
wondering why these games used a static array for these entities to begin
with. You know what they call an area of memory that you can cleanly
repurpose for things? That's right, a heap!
And absolutely no one would mind one additional heap allocation at the start
of a stage, next to the ones for all the sprites and portraits.
However, we are still running in Real Mode with segmented memory. Accessing
anything outside a common data segment involves modifying segment registers,
which has a nonzero CPU cycle cost, and Turbo C++ 4.0J is terrible at
optimizing away the respective instructions. Does this matter? Probably not,
but you don't take "risks" like these if you're in a permanent
micro-optimization mindset…
In TH04, this system is used for:
Kurumi's symmetric bullet spawn rays, fired from her hands towards the left
and right edges of the playfield. These are rather infamous for being the
last thing you see before
📝 the Divide Error crash that can happen in ZUN's original build.
Capped to 6 entities.
The 4 📝 bits used in Marisa's Stage 4 boss
fight. Coincidentally also related to the rare Divide Error
crash in that fight.
Stage 4 Reimu's spinning orbs. Note how the game uses two different sets
of sprites just to have two different outline colors. This was probably
better than messing with the palette, which can easily cause unintended
effects if you only have 16 colors to work with. Heck, I have an entire blog post tag just to highlight
these cases. Capped to the full 32 entities.
The chasing cross bullets, seen in Phase 14 of the same Stage 6 Yuuka
fight. Featuring some smart sprite work, making use of point symmetry to
achieve a fluid animation in just 4 frames. This is
good-code in sprite form. Capped to 31 entities, because the 32nd custom entity during this fight is defined to be…
The single purple pulsating and shrinking safety circle, seen in Phase 4 of
the same fight. The most interesting aspect here is actually still related
to the cross bullets, whose spawn function is wrongly limited to 32 entities
and could theoretically overwrite this circle. This
is strictly landmine territory though:
Yuuka never uses these bullets and the safety circle
She never spawns more than 24 cross bullets
All cross bullets are fast enough to have left the screen by the
time Yuuka restarts the corresponding subpattern
The cross bullets spawn at Yuuka's center position, and assign its
Q12.4 coordinates to structure fields that the safety circle interprets
as raw pixels. The game does try to render the circle afterward, but
since Yuuka's static position during this phase is nowhere near a valid
pixel coordinate, it is immediately clipped.
The flashing lines seen in Phase 5 of the Gengetsu fight,
telegraphing the slightly random bullet columns.
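The unit mismatch between the cross bullets and the safety circle is easy to show with a bit of arithmetic. In the sketch below, the center position of (192, 96) pixels, the 64×64 circle size, and the clipping against the full 640×400 screen are all made-up values for illustration, not the game's actual numbers. The only thing taken from the text above is that a Q12.4 value is 16 times as large as the pixel coordinate it represents.

```cpp
#include <cassert>
#include <cstdint>

// Q12.4 fixed-point: 12 integer bits and 4 fractional bits in 16 bits.
using subpixel_t = std::int16_t;

subpixel_t to_subpixel(int pixels) {
    assert((pixels >= -2048) && (pixels < 2048));
    return static_cast<subpixel_t>(pixels * 16);
}
int to_pixel(subpixel_t sp) { return (sp / 16); }

// Simplified clipping test against a 640×400 screen.
bool visible(int left, int top, int w, int h) {
    return ((left < 640) && (top < 400) && ((left + w) > 0) && ((top + h) > 0));
}
```

With these helpers, `to_subpixel(192)` is 3072 and `to_subpixel(96)` is 1536. Interpreting those raw numbers as pixels puts the circle far outside of any visible area, whereas converting them back with `to_pixel()` first would yield the intended on-screen position.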
These structures only took 1 push to reverse-engineer rather than the 2 I
needed for their TH05 counterparts because they are much simpler in this
game. The "structure" for Gengetsu's lines literally uses just a single X
position, with the remaining 24 bytes being basically padding. The only
minor bug I found on this shallow level concerns Marisa's bits, which are
clipped at the right and bottom edges of the playfield 16 pixels earlier
than you would expect:
The remaining push went to a bunch of smaller structures and functions:
The structure for the up to 2 "thick" (a.k.a. "Master Spark") lasers. Much
saner than the
📝 madness of TH05's laser system while being
equally customizable in width and duration.
The structure for the various monochrome 16×16 shapes in the background of
the Stage 6 Yuuka fight, drawn on top of the checkerboard.
The rendering code for the three falling stars in the background of Stage 5.
The effect here is entirely palette-related: After blitting the stage tiles,
the 📝 1bpp star image is ORed
into only the 4th VRAM plane, which is equivalent to setting the
highest bit in the palette color index of every pixel within the star-shaped
region. This of course raises the question of what the stage would look like
if it was fully illuminated:
The full tile map of TH04's Stage 5, in both dark and fully
illuminated views. Since the illumination effect depends on two
matching sets of palette colors that are distinguished by a single
bit, the illuminated view is limited to only 8 of the 16 colors. The
dark view, on the other hand, can freely use colors from the
illuminated set, since those are unaffected by the OR
Most code that modifies a stage's tile map, and directly specifies tiles via
their top-left offset in VRAM.
Thanks to code alignment reasons, this forced a much longer detour into the
.STD format loader. Nothing all too noteworthy there since we're still
missing the enemy script and spawn structures before we can call .STD
"reverse-engineered", but maybe still helpful if you're looking for an
overview of the format. Also features a buffer overflow landmine if a .STD
file happens to contain more than 32 enemy scripts… you know, the usual
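Coming back to the Stage 5 star effect for a moment: the claim that ORing a 1bpp image into the 4th plane sets the highest bit of each palette index can be checked with a tiny model of planar VRAM. This is a portable sketch that only looks at a single byte per plane, with function names I made up. It doesn't touch real PC-98 memory.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One byte on each of the 4 planes (B, R, G, E) describes the same 8
// horizontally adjacent pixels.
using PlanarByte = std::array<std::uint8_t, 4>;

// Returns the 4-bit palette index of pixel `x` within the byte, with 0 being
// the leftmost pixel, stored in the most significant bit.
int color_at(const PlanarByte& p, int x) {
    assert((x >= 0) && (x < 8));
    const int shift = (7 - x);
    int ret = 0;
    for(int plane = 0; plane < 4; plane++) {
        ret |= (((p[plane] >> shift) & 1) << plane);
    }
    return ret;
}

// ORs 8 pixels of a 1bpp image into only the 4th plane.
void or_into_4th_plane(PlanarByte& p, std::uint8_t mask_1bpp) {
    p[3] |= mask_1bpp;
}
```

Starting with 8 pixels of color 5 and ORing in the mask `0x3C`, the 4 pixels in the middle end up at color 13 (5 + 8), while the 2 pixels on either side stay at color 5.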
To top off the second push, we've got the vertically scrolling checkerboard
background during the Stage 6 Yuuka fight, made up of 32×32 squares. This
one deserves a special highlight just because of its needless complexity.
You'd think that even a performant implementation would be pretty simple:
Set the GRCG to TDW mode
Set the GRCG tile to one of the two square colors
Start with Y as the current scroll offset, and X
as some indicator of which color is currently shown at the start of each row
of squares
Iterate over all lines of the playfield, filling in all pixels that
should be displayed in the current color, skipping over the other ones
Count down Y for each line drawn
If Y reaches 0, reset it to 32 and flip X
At the bottom of the playfield, change the GRCG tile to the other color,
and repeat with the initial value of X flipped
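Translated into portable C++, these steps could look like the following sketch. It is not PC-98 code: the "VRAM" is a plain array with one byte per pixel instead of 8 pixels per byte, the GRCG tile change is represented by a counter, and the 384×368 size matches the 12×11.5 squares of the playfield. All names are my own.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int SQUARE = 32;
constexpr int PLAYFIELD_W = 384; // 12 squares
constexpr int PLAYFIELD_H = 368; // 11.5 squares

struct Framebuffer {
    // 0xFF marks pixels that have not been written yet.
    std::vector<std::uint8_t> pixels =
        std::vector<std::uint8_t>((PLAYFIELD_W * PLAYFIELD_H), 0xFF);
    int tile_changes = 0; // stand-in for the costly GRCG port I/O
};

// One pass: draws all squares of a single color from top to bottom.
void fill_squares_of_color(Framebuffer& fb, std::uint8_t color, int scroll, bool x) {
    fb.tile_changes++; // "Set the GRCG tile", done once per pass
    int y = scroll;    // lines left within the current row of squares
    for(int line = 0; line < PLAYFIELD_H; line++) {
        // Fill every second square, skipping the ones in the other color.
        for(int left = (x ? SQUARE : 0); left < PLAYFIELD_W; left += (SQUARE * 2)) {
            for(int px = left; px < (left + SQUARE); px++) {
                fb.pixels[(line * PLAYFIELD_W) + px] = color;
            }
        }
        if(--y == 0) {
            y = SQUARE;
            x = !x;
        }
    }
}

// `scroll` is the number of visible lines of the topmost row of squares.
Framebuffer render_checkerboard(int scroll, std::uint8_t color_a, std::uint8_t color_b) {
    assert((scroll >= 1) && (scroll <= SQUARE));
    Framebuffer fb;
    fill_squares_of_color(fb, color_a, scroll, false);
    fill_squares_of_color(fb, color_b, scroll, true);
    return fb;
}
```

Every pixel gets written exactly once, and `tile_changes` ends up at 2 per frame no matter how many squares are on screen.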
The most important aspect of this algorithm is how it reduces GRCG state
changes to a minimum, avoiding the costly port I/O that we've identified
time and time again as one of the main bottlenecks in TH01. With just 2
state variables and 3 loops, the resulting code isn't that complex either. A
naive implementation that just drew the squares from top to bottom in a
single pass would barely be simpler, but much slower: By changing the GRCG
tile on every color, such an implementation would burn a low 5-digit number
of CPU cycles per frame for the 12×11.5-square checkerboard used in the
And indeed, ZUN retained all important aspects of this algorithm… but still
implemented it all in ASM, with a ridiculous layer of x86 segment arithmetic
on top? Which blows up the complexity to 4 state
variables, 5 nested loops, and a bunch of constants in unusual units. I'm
not sure what this code is supposed to optimize for, especially with that
rather questionable register allocation that nevertheless leaves one of the
general-purpose registers unused. Fortunately,
the function was still decompilable without too many code generation hacks,
and retains the 5 nested loops in all their goto-connected
glory. If you want to add a checkerboard to your next PC-98
demo, just stick to the algorithm I gave above.
(Using a single XOR for flipping the starting X offset between 32 and 64
pixels is pretty nice though, I have to give him that.)
This makes for a good occasion to talk about the third and final GRCG mode,
completing the series I started with my previous coverage of the
📝 RMW and
📝 TCR modes. The TDW (Tile Data Write) mode
is the simplest of the three and just writes the 8×1 GRCG tile into VRAM
as-is, without applying any alpha bitmask. This makes it perfect for
clearing rectangular areas of pixels – or even all of VRAM by doing a single
// Set up the GRCG in TDW mode.
outportb(0x7C, 0x80);
// Fill the tile register with color #7 (0111 in binary).
outportb(0x7E, 0xFF); // Plane 0: (B): (********)
outportb(0x7E, 0xFF); // Plane 1: (R): (********)
outportb(0x7E, 0xFF); // Plane 2: (G): (********)
outportb(0x7E, 0x00); // Plane 3: (E): (        )
// Set the 32 pixels at the top-left corner of VRAM to the exact contents of
// the tile register, effectively repeating the tile 4 times. In TDW mode, the
// GRCG ignores the CPU-supplied operand, so we might as well just pass the
// contents of a register with the intended width. This eliminates useless load
// instructions in the compiled assembly, and even sort of signals to readers
// of this code that we do not care about the source value.
*reinterpret_cast<uint32_t far *>(MK_FP(0xA800, 0)) = _EAX;
// Fill the entirety of VRAM with the GRCG tile. A simple C one-liner that will
// probably compile into a single `REP STOS` instruction. Unfortunately, Turbo
// C++ 4.0J only ever generates the 16-bit `REP STOSW` here, even when using
// the `__memset__` intrinsic and when compiling in 386 mode. When targeting
// that CPU and above, you'd ideally want `REP STOSD` for twice the speed.
memset(MK_FP(0xA800, 0), _AL, ((640 / 8) * 400));
However, this might make you wonder why TDW mode is even necessary. If it's
functionally equivalent to RMW mode with a CPU-supplied bitmask made up
entirely of 1 bits (i.e., 0xFF, 0xFFFF, or
0xFFFFFFFF), what's the point? The difference lies in the
hardware implementation: If all you need to do is write tile data to
VRAM, you don't need the read and modify parts of RMW mode
which require additional processing time. The PC-9801 Programmers'
Bible claims a speedup of almost 2× when using TDW mode over equivalent
operations in RMW mode.
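The functional equivalence is easy to verify with a small software model of both modes. This sketch only models what ends up in VRAM for a single byte per plane. It ignores timing, which is the whole point of TDW mode on real hardware, as well as any other GRCG state, and the function names are my own.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Tile = std::array<std::uint8_t, 4>;     // 8×1 tile, one byte per plane
using VRAMByte = std::array<std::uint8_t, 4>; // the same 8 pixels on all planes

// TDW: writes the tile as-is and ignores the CPU-supplied operand.
void tdw_write(VRAMByte& vram, const Tile& tile, std::uint8_t /* operand */) {
    vram = tile;
}

// RMW: reads VRAM, replaces only the pixels selected by the CPU-supplied
// bitmask with the tile, and writes the result back.
void rmw_write(VRAMByte& vram, const Tile& tile, std::uint8_t mask) {
    for(std::size_t plane = 0; plane < vram.size(); plane++) {
        vram[plane] = static_cast<std::uint8_t>(
            (vram[plane] & ~mask) | (tile[plane] & mask)
        );
    }
}
```

With a mask of `0xFF`, `rmw_write()` produces the same bytes as `tdw_write()` for any previous VRAM content, whereas any other mask leaves the unselected pixels untouched.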
And that's the only performance claim I found, because none of these old
PC-98 hardware and programming books did any benchmarks. Then again, it's
not too interesting of a question to benchmark either, as the byte-aligned
nature of TDW blitting severely limits its use in a game engine anyway.
Sure, maybe it makes sense to temporarily switch from RMW to TDW mode
if you've identified a large rectangular and byte-aligned section within a
sprite that could be blitted without a bitmask? But the necessary
identification work likely nullifies the performance gained from TDW mode,
I'd say. In any case, that's pretty deep
micro-optimization territory. Just use TDW mode for the
few cases it's good at, and stick to RMW mode for the rest.
So is this all that can be said about the GRCG? Not quite, because there are
4 bits I haven't talked about yet…
And now we're just 5.37% away from 100% position independence for TH04! From
this point, another 2 pushes should be enough to reach this goal. It might
not look like we're that close based on the current estimate, but a
big chunk of the remaining numbers are false positives from the player shot
control functions. Since we've got a very special deadline to hit, I'm going
to cobble these two pushes together from the two current general
subscriptions and the rest of the backlog. But you can, of course, still
invest in this goal to allow the existing contributions to go to something
… Well, if the store was actually open. So I'd better
continue with a quick task to free up some capacity sooner rather than
later. Next up, therefore: Back to TH02, and its item and player systems.
Shouldn't take that long, I'm not expecting any surprises there. (Yeah, I
know, famous last words…)
> "OK, TH03/TH04/TH05 cutscenes done, let's quickly finish the Touhou Patch Center MediaWiki upgrade. Just some scripting and verification left, it will be done so quickly that I don't even have to mention it on this blog"
> Still not done after 3 weeks
> Blocked by one final critical bug that really should be fixed upstream
> Code reviewers are probably on vacation
And so, the year unfortunately ended with yet another slow month. During the
MediaWiki upgrade, I was slowly decompiling the TH05 Sara fight on the side,
but stumbled over one interesting but high-maintenance detail there that
would really enhance her blog post. TH02 would need a lot of attention for
the basic rendering calls as well…
…so let's end the year with Shuusou Gyoku instead, looking at its most
critical issue in particular. As if that were the easy option here…
The game does not run properly on modern Windows systems due to its usage of
the ancient DirectDraw APIs, with issues ranging from unbearable slowdown to
glitched colors to the game not even starting at all. Thankfully, Shuusou
Gyoku is not the only ancient Windows game affected by these issues, and
people have developed a variety of generic DirectDraw wrappers and patches
for playing such games on modern systems. Out of all these, DDrawCompat is one of the
simpler solutions for Shuusou Gyoku in particular: Just drop its
ddraw proxy DLL into the game directory, and the game will run
as it's supposed to.
So let's just bundle that DLL with all my future Shuusou Gyoku releases
then? That would have been the quick and dirty option, coming with
several drawbacks:
Linux users might be annoyed by the potential need to configure a native
DLL override for ddraw.dll. It's not too much of an issue as we
could simply rename the DLL and replace the import with the new name.
However, doing that reproducibly would already involve changes to either the
DDrawCompat or Shuusou Gyoku build process.
Win32 API hooking is another potential point of failure in general,
requiring continual maintenance for new Windows versions. This is not even a
hypothetical concern: DDrawCompat does rely on particularly volatile Win32
API details, to the point that the recent Windows 11 22H2 update completely
broke it, causing a hang at startup that required a workaround.
But sure, it's still just a single third-party component. Keeping it up to
date doesn't sound too bad by itself…
…if DDrawCompat weren't evolving way beyond what we need to keep Shuusou
Gyoku running. Being a typical DirectDraw wrapper, it has always aimed to
solve all sorts of issues in old DirectDraw games. However, the latest
version, 0.4.0, has gone above and beyond in this regard, adding lots of
configuration options with default settings that actually
break Shuusou Gyoku.
To get a glimpse of how this is likely to play out, we only have to look at
the more mature DxWnd
project. In its expert mode, DxWnd features three rows of tabs, each packed
with checkboxes that toggle individual hacks, and most of these are
related to something that Shuusou Gyoku could be affected by. Imagine
checking a precise permutation of a three-digit number of checkboxes just to
keep an old game running at full speed on modern systems…
Finally, aesthetic and bloat considerations. If
📝 C++ fstreams were already too embarrassing
with the ~100 KB of bloat they add to the binary, a 565 KiB DLL is
even worse. And that's the old version 0.3.2 – version 0.4.0 comes in
at 2.43 MiB.
Fortunately, I had the budget to dig a bit deeper and figure out what
exactly DDrawCompat does to make Shuusou Gyoku work properly. Turns
out that among all the hooks and patches, the game only needs the most
central one: Enforcing a 32-bit display mode regardless of whatever lower
bit depth the game requests natively, combined with converting the game's
pixel buffer to 32-bit on the fly.
So does this mean that adding 32-bit to the game's list of supported bit
depths is everything we have to do?
Interestingly, Shuusou Gyoku already saved the DirectDraw enumeration flag
that indicates support for 32-bit display modes. The official version just
did nothing with it.
Well, almost everything. Initially, this surprised me as well: With
all the if statements checking for precise bit depths, you
would think that supporting one more bit depth would be way harder in this
code base. As it turned out though, these conditional branches are not
really about 8-bit or 16-bit color for the most part, but instead
differentiate between two very distinct rendering approaches:
"8-bit" is a pure 2D mode with palettized colors,
while "16-bit" is a hybrid 2D/3D mode that uses Direct3D 2 on top of DirectDraw, with
3-channel RGB colors.
Consequently, most of these branches deal with differences between these two
approaches that couldn't be nicely abstracted away in pbg's renderer
interface: Specific palette changes that are exclusive to "8-bit" mode, or
certain entities and effects whose Direct3D draw calls in "16-bit" mode
require tailor-made approximations for the "8-bit" mode. Since our new
32-bit mode is equivalent to the 16-bit mode in all of these branches, I
only needed to replace the raw number comparisons with more meaningful
method calls.
That only left a very small number of 2D raster effects that directly write
to or read from DirectDraw surface memory, and therefore do need to know the
bit size of each pixel. Thanks to std::variant and
std::visit(), adding 32-bit support becomes trivial here: By
rewriting the code in a generic manner that derives all offsets from the
template type, you only have to say hey,
I'd like to have 32-bit as well, and C++ will automatically
instantiate correct 32-bit variants of all bit depth-dependent code
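Here's a minimal sketch of that pattern, using a generic pixel read as the example. The function names and the idea of keying the variant on an empty pixel value are assumptions on my part and don't mirror the game's actual code. The important bit is that the byte offset is derived from `sizeof(Pixel)`, so adding `std::uint32_t` to the variant is all it takes to get a 32-bit instantiation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <variant>

// One alternative per supported bit depth.
using PixelFormat = std::variant<std::uint8_t, std::uint16_t, std::uint32_t>;

// Hypothetical helper that picks the alternative for a bit depth.
PixelFormat format_for(int bpp) {
    switch(bpp) {
    case 8:  return PixelFormat{ std::in_place_type<std::uint8_t> };
    case 16: return PixelFormat{ std::in_place_type<std::uint16_t> };
    default:
        assert(bpp == 32);
        return PixelFormat{ std::in_place_type<std::uint32_t> };
    }
}

// Generic raster code: all offsets are derived from the template type.
template <class Pixel> std::uint32_t read_pixel_as(
    const std::uint8_t* surface, std::size_t pitch, std::size_t x, std::size_t y
) {
    Pixel ret;
    std::memcpy(&ret, (surface + (y * pitch) + (x * sizeof(Pixel))), sizeof(Pixel));
    return ret;
}

// Runtime entry point: std::visit() selects the matching instantiation.
std::uint32_t read_pixel(
    const PixelFormat& format,
    const std::uint8_t* surface,
    std::size_t pitch,
    std::size_t x,
    std::size_t y
) {
    return std::visit([&](auto pixel) {
        return read_pixel_as<decltype(pixel)>(surface, pitch, x, y);
    }, format);
}
```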
There are only three features in the entire game that access pixel buffers
this way: a color key retrieval function, the lens ball animation on the
logo screen, and… the ending staff roll? Sure, the text sprites fade in and
out, but so does the picture next to it, using Direct3D alpha blending or
palette color ramping depending on the current rendering mode. Instead, the
only reason why these sprites directly access their pixel buffer is… an
unused and pretty wild spiral effect. 😮 It's still part of the code, and
only doesn't show up because the
parameters that control its timing were commented out before release:
They probably considered it too wild for the mood of this
The main ending text was the only remaining issue of mojibake present in my
previous Shuusou Gyoku builds, and is now fixed as well. Windows can
render Shift-JIS text via GDI even outside a Japanese locale, but only when
explicitly selecting a font that supports the SHIFTJIS_CHARSET,
and the game simply didn't select any font for rendering this text.
Thus, GDI fell back onto its default font, which obviously is only
guaranteed to support the SHIFTJIS_CHARSET if your system
locale is set to Japanese. This is why the font in the original game might
look different between systems.
For my build, I chose the font that would appear on a clean Windows
installation – a basic 400-weighted MS Gothic at font size 16, which is
already used all throughout the game.
Alright, 32-bit mode complete, let's set it as the default if possible… and
break compatibility with the original 秋霜CFG.DAT format in the
process? When validating this file, the original game only allows the
originally supported 8-bit or 16-bit modes. Setting the
BitDepth field to any other value causes the entire file
to be reset to its defaults, re-locking the Extra Stage in the process.
Introducing a backward-compatible version
system for 秋霜CFG.DAT was beyond the scope of this push.
Changing the validation to a per-field approach was a good small first step
to take though. The new build no longer validates the BitDepth
field against a fixed list, but against the actually supported bit depths on
your system, picking a different supported one if necessary. With the
original approach, this would have caused your entire configuration to fail
the validation check. Instead, you can now safely update to the new build
without losing your option settings, or your previously unlocked access to
the Extra Stage.
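As a rough sketch of the difference, per-field validation only touches the offending field and leaves everything else alone. The Config struct and FixBitDepth() below are hypothetical stand-ins, not the real 秋霜CFG.DAT layout or the code from the new build, and falling back to the first supported depth is just one possible policy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical subset of the configuration, for illustration only.
struct Config {
	std::uint8_t bit_depth;
	std::uint8_t extra_stage_flags;
};

// Replaces an unsupported bit depth with the first supported one and
// leaves every other field untouched. Returns true if a fix was needed.
bool FixBitDepth(Config& cfg, const std::vector<std::uint8_t>& supported)
{
	assert(!supported.empty());
	const auto it = std::find(supported.begin(), supported.end(), cfg.bit_depth);
	if(it != supported.end()) {
		return false;
	}
	cfg.bit_depth = supported.front();
	return true;
}
```

A config that asks for 16-bit on a system that only reports 32-bit would come out of this with a changed bit depth, but with its Extra Stage flags intact.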
Side note: The validation limit for starting bombs is off by one, and the
one for starting lives is off by two. By modifying
秋霜CFG.DAT, you could theoretically get new games to start with
7 lives and 3 bombs… if you then calculate a correct checksum for your
hacked config file, that is. 🧑‍💻
Interestingly, DirectDraw doesn't even indicate support for 8-bit or 16-bit
color on systems that are affected by the initially mentioned issues.
Therefore, these issues are not the fault of DirectDraw, but of
Shuusou Gyoku, as the original release requested a bit depth that it has
even verified to be unsupported. Unfortunately, Windows sides with
Sim City Shuusou Gyoku here: If you previously experimented with the
Windows app compatibility settings, you might have ended up with the
DWM8And16BitMitigation flag assigned to the full file path of
your Shuusou Gyoku executable in either
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Layers, or
As the term mitigation suggests, these modes are (poorly) emulated,
which is exactly what causes the issues with this game in the first place.
Sure, this might be the lesser evil from the point of view of an operating
system: If you don't have the budget for a full-blown DDrawCompat-style
DirectDraw wrapper, you might consider it better for users to have the game
run poorly than have it fail at startup due to incorrect API usage.
Controlling this with a flag that sticks around for future runs of a binary
is definitely suboptimal though, especially given how hard it
is to programmatically remove this flag within the binary itself. It
only adds additional complexity to the ideal clean upgrade path.
So, make sure to check your registry and manually remove these flags for the
time being. Without them, the new Config → Graphic menu will
correctly prevent you from selecting anything else but 32-bit on modern
After all that, there was just enough time left in this push to implement
basic locale independence, as requested by the Seihou development
Discord group, without looking into automatic fixes for previous mojibake
filenames yet. Combining std::filesystem::path with the native
Win32 API should be straightforward and bloat-free, especially with all the
abstractions I've been building, right?
Well, turns out that std::filesystem::path does not
actually meet my expectations. At least as long as it's not
constexpr-enabled, because you still get the unfortunate
conversion from narrow to wide encoding at runtime, even for globals with
static storage duration. That brings us back to writing our path abstraction
in terms of the regular std::string and
std::wstring containers, which at least allow us to enforce the
respective encoding at compile time. Even std::string_view only
adds to the complexity here, as its strings are never inherently
null-terminated, which is required by both the POSIX and Win32 APIs. Not to
mention dynamic filenames: C++20's std::format() would be the
obvious idiomatic choice here, but using it almost doubles the size
of the compiled binary… 🤮
In the end, the most bloat-free way of implementing C++ file I/O in 2023 is
still the same as it was 30 years ago: Call system APIs, roll a custom
abstraction that conditionally uses the L prefix, and pass
around raw pointers. And if you need a dynamic filename, just write the
dynamic characters into arrays at fixed positions. Just as PC-98 Touhou used
to do…
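For the record, such an abstraction can be sketched in a few lines. Everything named here (PATH_LITERAL, path_char, ReplayFilename, and the ASCII REPLAY_#.DAT template) is made up for illustration and does not match the actual code or the game's real filenames.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names. On Windows, literals get the L prefix and end up as
// UTF-16 wchar_t strings; everywhere else, they stay narrow.
#ifdef _WIN32
	#define PATH_LITERAL(s) L##s
	using path_char = wchar_t;
#else
	#define PATH_LITERAL(s) s
	using path_char = char;
#endif

// Made-up template with '#' marking the single dynamic character.
constexpr path_char REPLAY_TEMPLATE[] = PATH_LITERAL("REPLAY_#.DAT");
constexpr std::size_t REPLAY_LENGTH = (sizeof(REPLAY_TEMPLATE) / sizeof(path_char));
constexpr std::size_t STAGE_POS = 7; // Index of '#' in the template

struct ReplayFilename {
	path_char buf[REPLAY_LENGTH];

	explicit ReplayFilename(unsigned stage) {
		assert(stage <= 9);
		for(std::size_t i = 0; i < REPLAY_LENGTH; i++) {
			buf[i] = REPLAY_TEMPLATE[i];
		}
		// Write the dynamic character at its fixed position.
		buf[STAGE_POS] = static_cast<path_char>(PATH_LITERAL('0') + stage);
	}

	// Null-terminated, and therefore ready to be passed to a system API.
	const path_char* c_str() const {
		return buf;
	}
};
```

No heap allocation, no formatting library, and the encoding is fixed at compile time.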
Oh, and the game's window also uses a Unicode title bar now.
And that's it for this push! Make sure to rename your configuration
(秋霜CFG.DAT), score (秋霜SC.DAT), and replay
(秋霜りぷ*.DAT) filenames if you were previously running the
game on a non-Japanese locale, and then grab the new build:
Next up: Starting the new year with all my plans hopefully working out for
once. TH05 Sara very soon, ZMBV code review afterward, low-hanging fruit of
the TH01 Anniversary Edition after that, and then kicking off TH02 with a
bunch of low-level blitting code.
Thanks to handlerug for
implementing and PR'ing the feature in a very clean way. That makes at least
two people I know who wanted to see feed support, so there are probably
a few more out there.
So, Shuusou Gyoku. pbg released the original source code for the first two
Seihou games back in February 2019, but notably removed the crucial
decompression code for the original packfiles due to… various unspecified
reasons, considerations, and implications. This vague
language and subsequent rejection of a pull request
to add these features back in were probably the main reasons why no one
has publicly done anything with this codebase since.
The only other fork I know about is Priw8's private fork from 2020, but only
because WishMakers
informed me about it shortly after this push was funded. Both of them
might also contribute some features to my fork in the future if their time
allows it.
In this fork, Priw8 replaced packfile decompression with raw reads from
directories with the pre-extracted contents of all the .DAT files. This
works for playing the game, but there are actually two more things that
require the original packfile code:
High scores are stored as a bitstream with every variable separated by
an alternating 0 or 1 bit, using the same bit-level access functions as the
packfile reader. That's a quite… unique form of obfuscation: It requires way
too much code to read and write the format, and doesn't even obfuscate the
data that well because you can still see clear patterns when opening
these scorefiles in a hex editor.
Replays are 2-"file" archives compressed using the same algorithm as the
packfile. The first "file" contains metadata like the shot type, stage, and
RNG seed, and the second one contains the input state for every frame.
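To get a feeling for the first point, here is a small sketch of a bitstream where every variable is followed by a separator bit that alternates between 0 and 1. The field widths, the MSB-first bit order, and the choice of 0 as the first separator are all assumptions for illustration and not the actual score file layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Field {
	std::uint32_t value;
	unsigned bits;
};

class BitWriter {
	std::vector<std::uint8_t> bytes;
	std::size_t cursor = 0; // in bits

public:
	void Put(std::uint32_t value, unsigned bits) {
		assert(bits <= 32);
		for(unsigned i = bits; i-- > 0; cursor++) {
			if((cursor % 8) == 0) {
				bytes.push_back(0);
			}
			if((value >> i) & 1) {
				bytes.back() |= static_cast<std::uint8_t>(0x80 >> (cursor % 8));
			}
		}
	}
	const std::vector<std::uint8_t>& Bytes() const {
		return bytes;
	}
};

std::uint32_t BitsGet(
	const std::vector<std::uint8_t>& bytes, std::size_t& cursor, unsigned bits
) {
	std::uint32_t ret = 0;
	for(unsigned i = 0; i < bits; i++, cursor++) {
		assert((cursor / 8) < bytes.size());
		ret = ((ret << 1) | ((bytes[cursor / 8] >> (7 - (cursor % 8))) & 1u));
	}
	return ret;
}

std::vector<std::uint8_t> WriteSeparated(const std::vector<Field>& fields) {
	BitWriter writer;
	std::uint32_t separator = 0;
	for(const auto& field : fields) {
		writer.Put(field.value, field.bits);
		writer.Put(separator, 1);
		separator ^= 1;
	}
	return writer.Bytes();
}

// Fills in the values for the given field widths. Returns false if a
// separator bit doesn't have the expected value.
bool ReadSeparated(std::vector<Field>& fields, const std::vector<std::uint8_t>& bytes) {
	std::size_t cursor = 0;
	std::uint32_t separator = 0;
	for(auto& field : fields) {
		field.value = BitsGet(bytes, cursor, field.bits);
		if(BitsGet(bytes, cursor, 1) != separator) {
			return false;
		}
		separator ^= 1;
	}
	return true;
}
```

Even in this toy version, byte-aligned fields keep most of their structure after being shifted by a few bits, which is why patterns remain visible in a hex editor.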
We can surely implement our own simple and uncompressed formats for these
things, but it's not the best idea to build all future Shuusou Gyoku
features on top of a replay-incompatible fork. So, what do we do? On the one
hand, pbg expressed the clear wish to not include data reverse-engineered
from the original binary. On the other hand, he released the code under the
MIT license, which allows us to modify the code and distribute the results
in any way we wish.
So, let's meet in the middle, and go for a clean-room implementation of the
missing features as indicated by their usage, without looking at either the
original binary or wangqr's reverse-engineered code.
With incremental rebuilds being broken in the latest Visual Studio project
files as well, it made sense to start from scratch on pbg's last commit. Of
course, I can't pass up a chance to use
📝 Tup, my favorite build system for every
project I'm the main developer of. It might not fit Shuusou Gyoku as well as
it fits ReC98, but let's see whether it would be reasonable at all…
… and it's actually not too bad! Modern Visual Studio makes this a bit
harder than it should be with all the intermediate build artifacts you have
to keep track of. In the end though, it's still only 70
lines of Lua to have a nice abstraction for both Debug and Release
builds. With this layer underneath, the actual
Shuusou Gyoku-specific part can be expressed as succinctly as in any
other modern build system, while still making every compiler flag explicit.
It might be slightly slower than a traditional .vcxproj build
due to launching
one cl.exe process per translation unit, but the result is
way more reliable and trustworthy compared to anything that involves Visual
Studio project files. This simplicity paves the way for expanding the build
process to multiple steps, and doing all the static checking on translation
strings that I never got to do for thcrap-based patches. Heck, I might even
compile all future translations directly into the binary…
Every C++ build system will invariably be hated by someone, so I'd
say that your goal should always be to simplify the actually important parts
of your build enough to allow everyone else to easily adapt it to their
favorite system. This Tupfile definitely does a better job there than your
average .vcxproj file – but if you still want such a thing (or,
gasp, 🤮 CMake project files 🤮) for better Visual Studio IDE
integration, you should have no problem generating them for yourself.
There might still be a point in doing that because that's the one part that
unfortunately sucks about this approach. Visual Studio is horribly broken
for any nonstandard C++ project even in 2022:
Makefile projects can be nicely integrated with Debug and Release
configurations, but setting a later C++ language standard requires dumb
.vcxproj hacks that don't even work properly anymore.
Folder projects are ridiculously ugly: The Build toolbar is permanently
grayed out even if you configured a build task. For some reason,
configuring these tasks merely adds one additional element to a 9-element
context menu in the Solution Explorer. Also, why does the big IDE use a
different JSON schema than the perfectly functional and adequate one from
Visual Studio Code?
In both cases, IntelliSense doesn't work properly at all even if it
appears to be configured correctly, and Tup's dependency tracking appeared
to be weirdly cut off for the very final .PDB file. Interestingly though,
using the big Visual Studio IDE for just debugging a binary via
devenv bin/GIAN07.exe suddenly eliminates all the IntelliSense
issues. Looks like there's a lot of essential information stored in the .PDB
files that Visual Studio just refuses to read in any other context.
But now compare that to Visual Studio Code: Open it from the x64_x86
Cross Tools Command Prompt via code ., launch a build or
debug task, or browse the code with perfect IntelliSense. Three small
configuration files and everything just works – heck, you even get the Tup
progress bar in the terminal. It might be Electron bloatware and horribly
slow at times, but Visual Studio Code has long outperformed regular Visual
Studio in terms of non-debug functionality.
On to the compression algorithm then… and it's just textbook LZSS,
with 13 bits for the offset of a back-reference and 4 bits for its length?
Hardly a trade secret there. The hard parts all come from unexpected
inefficiencies in the bitstream format:
Encoding back-references as offsets into an 8 KiB ring buffer dictionary
means that the most straightforward implementation actually needs an 8 KiB
array for the LZSS sliding window. This could have easily been done with
zero additional memory if the offset was encoded as the difference to the
current byte instead.
The packfile format stores the uncompressed size of every file in its
header, which is a good thing because you want to know in advance how much
heap memory to allocate for a specific file. Nevertheless, the original game
only stops reading bits from the packfile once it has encountered a
back-reference with an offset of 0. This means that the compressor not only
has to write this technically unneeded back-reference to the end of the
compressed bitstream, but also ignore any potential other longest
back-reference with an offset of 0 within the file. The latter can
easily happen with a ring buffer dictionary.
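Put together, a decoder for such a format can be sketched as follows. Only the 13-bit offset, the 4-bit length, the 8 KiB ring buffer, and the offset-0 terminator come from the description above. The MSB-first bit order, the flag bit (1 = literal, 0 = back-reference), the minimum match length of 3, and the initial dictionary position of 1 are assumptions that I picked to get a self-contained example, and not necessarily what the original packfiles use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr unsigned OFFSET_BITS = 13;
constexpr unsigned LENGTH_BITS = 4;
constexpr std::size_t DICT_SIZE = (1u << OFFSET_BITS); // 8 KiB
constexpr std::size_t DICT_MASK = (DICT_SIZE - 1);
constexpr unsigned MIN_MATCH = 3; // Assumption

class BitReader {
	const std::vector<std::uint8_t>& bytes;
	std::size_t cursor = 0; // in bits

public:
	explicit BitReader(const std::vector<std::uint8_t>& bytes) : bytes(bytes) {
	}

	// Bits past the end of the buffer read as 0, which this sketch then
	// interprets as the terminating back-reference.
	std::uint32_t Get(unsigned count) {
		assert(count <= 32);
		std::uint32_t ret = 0;
		for(unsigned i = 0; i < count; i++, cursor++) {
			const std::size_t byte = (cursor / 8);
			const std::uint32_t bit = ((byte < bytes.size())
				? ((bytes[byte] >> (7 - (cursor % 8))) & 1u)
				: 0u
			);
			ret = ((ret << 1) | bit);
		}
		return ret;
	}
};

std::vector<std::uint8_t> LZSSDecode(const std::vector<std::uint8_t>& packed)
{
	BitReader bits(packed);
	std::array<std::uint8_t, DICT_SIZE> dict{}; // The 8 KiB sliding window
	std::size_t dict_pos = 1; // Assumption
	std::vector<std::uint8_t> out;

	const auto emit = [&](std::uint8_t byte) {
		out.push_back(byte);
		dict[dict_pos] = byte;
		dict_pos = ((dict_pos + 1) & DICT_MASK);
	};

	for(;;) {
		if(bits.Get(1)) {
			emit(static_cast<std::uint8_t>(bits.Get(8)));
			continue;
		}
		const std::size_t offset = bits.Get(OFFSET_BITS);
		if(offset == 0) {
			break; // Terminator
		}
		const unsigned length = (bits.Get(LENGTH_BITS) + MIN_MATCH);
		for(unsigned i = 0; i < length; i++) {
			emit(dict[(offset + i) & DICT_MASK]);
		}
	}
	return out;
}
```

Under these assumptions, the 8 bytes A0 D0 A8 60 00 80 00 00 encode three literals (ABC), a back-reference to dictionary offset 1 with a length of 3, and the terminator, and therefore decode to ABCABC. A real implementation would additionally allocate the output buffer up front based on the uncompressed size from the packfile header.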
The original game used a single BIT_DEVICE class with mode
flags for every combination of reading and writing memory buffers and
on-disk files. Since that would have necessitated a lot of error checking
for all (pseudo-)methods of this class, I wrote one dedicated small class
for each one of these permutations instead. To further emphasize the
clean-room property of this code, these use modern C++ memory ownership
features: std::unique_ptr for the fixed-size read-only buffers
we get from packfiles, std::vector for the newly compressed
buffers where we don't know the size in advance, and std::span
for a borrowed reference to an immutable region of memory that we want to
treat as a bitstream. Definitely better than using the native Win32
LocalAlloc() and LocalFree() allocator, especially
if we want to port the game away from Windows one day.
One feature I didn't use though: C++ fstreams, because those are trash.
These days, they would seem to be the natural
choice with the new std::filesystem::path type from C++17:
Correctly constructed, you can pass that type to an fstream constructor and
gain both locale independence on Windows and portability to
everything else, without writing any Windows-specific UTF-16 code. But even
in a Release build, fstreams add ~100 KB of locale-related bloat to the .EXE
which adds no value for just reading binary files. That's just too
embarrassing if you look at how much space the rest of the game takes up.
Writing your own platform layer that calls the Win32
CreateFileW(), ReadFile(), and
WriteFile() API functions is apparently still the way to go
even in 2022. And with std::filesystem::path still being a
welcome addition to C++, it's not too much code to write either.
This gets us file format compatibility with the original release… and a
crash as soon as the ending starts, but only in Release mode? As it turns
out, this crash is caused by an out-of-bounds
access bug that was present even in the original game, and only turned
into a crash now because the optimizer in modern Visual Studio versions
reorders static data. As a result, the 6-element pFontInfo
array got placed in front of an ECL-related counter variable that then got
corrupted by the write to the 7th element, which subsequently
crashed the game with a read access to previously deallocated danmaku script
data. That just goes to show that these technical bugs are important
and worth fixing even if they don't cause issues in the original game. Who
knows how many of these will turn into crashes once we get to porting PC-98
So here we go, a new build of Shuusou Gyoku, compiled with Visual Studio
2022, and compatible with all original data formats:
Inside the regular Shuusou Gyoku installation directory, this binary works
as a full-fledged drop-in replacement for the original
秋霜玉.exe. It still has all of the original binary's problems:
Separate Japanese locale emulation is still needed to correctly refer to
the original names of the configuration (秋霜CFG.DAT), score
(秋霜SC.DAT), and replay (秋霜りぷ*.DAT) files.
It's also required for the ending text to not render as mojibake.
Running the game at full speed and without graphical glitches on modern
Windows still requires a separate DirectDraw patch such as DDrawCompat. To
eliminate any remaining flickering, configure the game to use 16-bit
graphics in the Config → Graphic menu.
As well as some of its own:
The original screenshot feature is still missing, as it also wasn't part
of pbg's released source code.
So all in all, it's a strict downgrade at this point in time.
And more of a symbol that we can now start
doing actual work on this game. Seihou has been a fun change of pace, and I
hope that I get to do more work on the series. There is quite a lot to be
done with Shuusou Gyoku alone, and the 21 GitHub issues I've opened
are probably only scratching the surface.
However, all the required research for this one consumed more like 1⅔
pushes. Despite just one push being funded, it wouldn't have made sense to
release the commits or this binary in any earlier state. To repay this debt,
I'm going to put the next for Seihou towards the
small code maintenance and performance tasks that I usually do for free,
before doing any more feature and bugfix work. Next up: Improving video
playback on the blog, and maybe delivering some microtransaction work on the