And now we're taking this small indie game from the year 2000 and porting
its game window, input, and sound to the industry-standard cross-platform
API with "simple" in its name.
Why did this have to be so complicated?! I expected this to take maybe 1-2
weeks and result in an equally short blog post. Instead, it raised so many
questions that I ended up with the longest blog post so far, by quite a wide
margin. These pushes ended up covering so many aspects that could be
interesting to a general and non-Seihou-adjacent audience, so I think we
need a table of contents for this one:
Before we can start migrating to SDL, we of course have to integrate it into
the build somehow. On Linux, we'd ideally like to just dynamically link to a
distribution's SDL development package, but since there's no such thing on
Windows, we'd like to compile SDL from source there. This allows us to reuse
our debug and release flags and ensures that we get debug information,
without needing to clone build scripts for every
C++ library ever in the process or something.
So let's get my Tup build scripts ready for compiling vendored libraries… or
maybe not? Recently, I've kept hearing about a hot new
technology that not only provides the rare kind of jank-free
cross-compiling build system for C/C++ code, but innovates by even
bundling a C++ compiler into a single 279 MiB package with no
further dependencies. Realistically replacing both Visual Studio and Tup
with a single tool that could target every OS is quite a selling point. The
upcoming Linux port makes for the perfect occasion to evaluate Zig, and to
find out whether Tup is still my favorite build system in 2023.
Even apart from its main selling point, there's a lot to like about Zig:
First and foremost: It's a modern systems programming language with
seamless C interop that we could gradually migrate parts of the codebase to.
The feature set of the core language seems to hit the sweet spot between C
and C++, although I'd have to use it more to be completely sure.
A native, optimized Hello World binary with no string formatting is
4 KiB when compiled for Windows, and 6.4 KiB when cross-compiled
from Windows to Linux. It's so refreshing to see a systems language in 2023
that doesn't bundle a bulky runtime for trivial programs and then defends it
with the old excuse of "but all this runtime code will come in handy the
larger your program gets". With a first impression like this, Zig
managed to realize the "don't pay for what you don't use" mantra that C++
typically claims for itself, but only pulls off maybe half of the time.
You can directly
target specific CPU models, down to even the oldest 386 CPUs?! How
amazing is that?! In contrast, Visual Studio only describes its /arch:IA32
compatibility option in very vague terms, leaving it up to you to figure out
that "legacy 32-bit x86 instruction set without any vector
operations" actually means "i586/P5 Pentium, because the startup code
still includes an unconditional CPUID instruction". In any
case, it means that Zig could also cover the i586 build.
Even better, changing Zig's CPU model setting recompiles both its
bundled C/C++ standard library and Zig's own compiler-rt polyfill
library for that architecture. This ensures that no unsupported
instructions ever show up in the binary, and also removes the need for
any CPUID checks. This is so much better than the Visual
Studio model of linking against a fixed pre-compiled standard library
because you don't have to trust that all these newer instructions
wouldn't actually be executed on older CPUs that don't have them.
I love the auto-formatter. Want to lay out your struct literal into
multiple lines? Just add a trailing comma to the end of the last element.
It's very snappy, and a joy to use.
Like every modern programming language, Zig comes with a test framework
built into the language. While it's not all too important for my grand plan
of having one big test that runs a bunch of replays and compares their game
states against the original binary, small tests could still be useful for
protecting gameplay code against accidental changes. It would be great if I
didn't have to evaluate and choose among
the many testing frameworks for C++ and could just use a language
The standard library is very poorly documented, especially in the
build-related parts that are meant to attract the C++ audience.
Often, the only documentation is found in blog posts from a few years
ago, with example code written against old Zig versions that doesn't compile
on the newest version anymore. It's all very far from stable.
However, Zig's project generation sub-commands (zig
init-exe and friends) do emit well-documented boilerplate
code? It does make sense for that code to double as a comprehensive example,
but Zig advertises itself as so simple that I didn't even think about
bootstrapping my project with a CLI tool at first – unlike, say, Rust, where
a project always starts with filling out a small form in
There's no progress output for C/C++ compilation? Like, at all?
This hurts especially because compilation times are significantly longer
than they were with Visual Studio. By default, the current Tupfile builds
Shuusou Gyoku in both debug and release configurations simultaneously. If I
fully rebuild everything from a clean cache, Visual Studio finishes such a
build in roughly the same amount of time that Zig takes to compile just a
The --global-cache-dir option is only supported by specific
subcommands of the zig CLI rather than being a top-level
setting, and throws an error if used for any other subcommand. Not having a
system-wide way to change it and being forced into writing a wrapper script
for that is fine, but it would be nice if said wrapper script didn't have to
also parse and switch over the subcommand just to figure out whether it is
allowed to append the setting.
compiler-rt still needs a bit of dead code elimination work. As soon as
your program needs a single polyfilled function, you get all of them,
because they get referenced in some exception-related table even if nothing
uses them? Changing the link_eh_frame_hdr option had no effect.
And that was not the only std.Build.Step.Compile option
that did nothing. Worse, if I just tweaked the options and changed nothing
about the code itself, Zig simply copied a previously built executable
out of its build cache into the output directory, as revealed by the
timestamp on the .EXE. While I am willing to believe that Zig correctly
detects that all these settings would just produce the same binary, I do not
like how this behavior inspires distrust and uncertainty in Zig's build
process as a whole. After all, we still live in a world where clearing
the build cache is way too often the solution for weird problems in
software, especially when using CMake. And it makes sense why it would be:
If you develop a complex system and then try solving the infamously hard
problem of cache invalidation on top, the risk of getting cache invalidation
wrong is, by definition, higher than if that was the only thing your system
did. That's the reason why I like Tup so much: It solely focuses on
getting cache invalidation right, and rather errs on the side of caution by
maybe unnecessarily rebuilding certain files every once in a while because
the compiler may have read from an environment variable that has changed in
the meantime. But this is the one job I expect a build system to do, and Tup
has been delivering for years and has become fundamentally more trustworthy
as a result.
Zig activates Clang's UBSan
in debug builds by default, which executes a program-crashing
UD2 instruction whenever the program is about to rely on
undefined C++ behavior. In theory, that's a great help for spotting hidden
portability issues, but it's not helpful at all if these crashes are
seemingly caused by C++ standard library code?! Without any clear info
about the actual cause, this just turned into yet another annoyance on
top of all the others. Especially because I apparently kept searching for
the wrong terms when I first encountered this issue, and only found
out how to deactivate it after I already decided against Zig.
Also, can we get /PDBALTPATH?
Baking absolute paths from the filesystem of the developer's machine into
released binaries is not only cringe in itself, but can also cause potential
privacy or security accidents.
So for the time being, I still prefer Tup. But give it maybe two or three
years, and I'm sure that Zig will eventually become the best tool for
resurrecting legacy C++ codebases. That is, if the proposed divorce of the
core Zig compiler from LLVM isn't an indication that the
productive parts of the Zig community consider the C/C++ building features
to be "good enough", and are about to de-emphasize them to focus more
strongly on the actual Zig language. Gaining adoption for your new systems
language by bundling it with a C/C++ build system is such a great and unique
strategy, and it almost worked in my case. And who knows, maybe Zig will
already be good enough by the time I get to port PC-98 Touhou to modern
(If you came from the Zig
wiki, you can stop reading here.)
A few remnants of the Zig experiment still remain in the final delivery. Had
that experiment worked out, I would have had to immediately change the
execution encoding to UTF-8, and decompile a few ASM functions exclusive to
the 8-bit rendering mode which we could have otherwise ignored. While Clang
does support inline assembly with Intel syntax via
-fms-extensions, it has trouble with ; comments
and instructions like REP STOSD, and if I have to touch that
code anyway… (The REP STOSD function translated into a single
call to memset(), by the way.)
Another smaller issue was Visual Studio's lack of standard library header
hygiene, where #including some of the high-level STL features also includes
more foundational headers that Clang requires to be included separately, but
I've already known about that. Instead, the biggest shocker was that Visual
Studio accepts invalid syntax for a language feature as recent as C++20
What's this, Visual Studio's infamous delayed template parsing applied to
concepts, because they're templates as well? Didn't
they get rid of that 6 years ago? You would think that we've moved
beyond the age where compilers differed in their interpretation of the core
language, and that opting into a current C++ standard turns off any
remaining antiquated behaviors…
So let's actually get my Tup build scripts ready for compiling
vendored libraries, because the
📝 previous 70 lines of Lua definitely
weren't. For this use case, we'd like to have some notion of distinct build
targets that can have a unique set of compilation and linking flags. We'd
also like to always build them in debug and release versions even if you
only intend to build your actual program in one of those versions – with the
previous system of specifying a single version for all code, Tup would
delete the other one, which forces a time-consuming and ultimately needless
rebuild once you switch to the other version.
The solution I came up with treats the set of compiler command-line options
like a tree whose branches can concatenate new options and/or filter the
versions that are built on this branch. In total, this is my 4th
attempt at writing a compiler abstraction layer for Tup. Since we're
effectively forced to write such layers in Lua, it will always be a
bit janky, but I think I've finally arrived at a solid underlying design
that might also be interesting for others. Hence, I've split off the result
into its own separate
repository and added high-level documentation and a documented example.
And yes, that's a Code Nutrition
label! I've wanted to add one of these ever since I first heard about the
idea, since it communicates nicely how seriously such an open-source project
should be taken. Which, in this case, is actually not all too
seriously, especially since development of the core Tup project has all but
stagnated. If Zig does indeed get better and better at being a Clang
frontend/build system, the only niches left for Tup will be Visual
Studio-exclusive projects, or retrocoding with nonstandard toolchains (i.e.,
ReC98). Quite ironic, given Tup's Unix heritage…
Oh, and maybe general Makefile-like tasks where you just want to run
specific programs. Maybe once the general hype swings back around and people
start demanding proper graph-based dependency tracking instead of just a command runner…
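To make the option tree idea from above a bit more concrete, here is a minimal sketch of such a data structure. It is written in C++ purely for illustration; the actual Tup layer is written in Lua, and the names `OptionNode`, `branch()`, and `filter()` are hypothetical and not taken from that repository.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Flags = std::vector<std::string>;

// One node of the option tree: the accumulated command-line flags for every
// version (e.g. "debug" and "release") that is still built on this branch.
struct OptionNode {
    std::map<std::string, Flags> versions;

    // Returns a child branch that appends `common` to every version, followed
    // by any version-specific flags found in `per_version`.
    OptionNode branch(
        const Flags& common,
        const std::map<std::string, Flags>& per_version = {}
    ) const
    {
        OptionNode child = *this;
        for(auto& [name, flags] : child.versions) {
            flags.insert(flags.end(), common.begin(), common.end());
            const auto extra = per_version.find(name);
            if(extra != per_version.end()) {
                flags.insert(
                    flags.end(), extra->second.begin(), extra->second.end()
                );
            }
        }
        return child;
    }

    // Returns a child branch that only builds the versions named in `keep`.
    OptionNode filter(const std::set<std::string>& keep) const
    {
        OptionNode child;
        for(const auto& [name, flags] : versions) {
            if(keep.count(name)) {
                child.versions.emplace(name, flags);
            }
        }
        return child;
    }
};
```

Because every call returns a new node and leaves its parent untouched, a vendored library can branch off the root with its own flags and always keep both versions, while the main program filters its branch down to the one version that was requested.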
Alright, alternatives evaluated, build system ready, time to include SDL!
Once again, I went for Git submodules, but this time they're held together
by a batch file that ensures that the intended versions are checked out before
starting Tup. Git submodules have a bad rap mainly because of their
usability issues, and such a script should hopefully work around
them? Let's see how this plays out. If it ends up causing issues after all,
I'll just switch to a Zig-like model of downloading and unzipping a source
archive. Since Windows comes with curl and tar
these days, this can even work without any further dependencies, and will
also remove all the test code bloat.
However, dynamic linking does make sense if you consider what SDL is.
Offering all those multiple rendering, input, and sound backends is what
sets it apart from its more hip competition, and you want to have all of
them available at any time so that SDL can dynamically select them based on
what works best on a system. As a result, everything in SDL is being
referenced somewhere, so there's no dead code for the linker to eliminate.
Linking SDL statically with link-time code generation just prolongs your
link time for no benefit, even without the dynamic API thwarting any chance
of SDL calls getting inlined.
There's one thing I still don't like about all this, though. The dynamic
API's table references force you to include all of SDL's subsystems in the
DLL even if your game doesn't need some of them. But it does fit with their
intention of having SDL2.dll be swappable: If an older game
stopped working because of an outdated SDL2.dll, it should be
possible for anyone to get that game working again by replacing that DLL
with any newer version that was bundled with any random newer game. And
since that would fail if the newer SDL2.dll was size-optimized
to not include some of the subsystems that the older game required, they
simply removed (or de-prioritized) the possibility altogether.
Maybe that was their train of thought? You can always just use the official Windows
DLL, whose whole point is to include everything, after all. 🤷
So, what do we get in these 1.5 MiB? There are:
renderer backends for Direct3D 9/11/12, regular OpenGL, OpenGL ES 2.0,
Vulkan, and a software renderer,
and audio backends for WinMM, DirectSound, WASAPI, and direct-to-disk
Unfortunately, SDL 2 also statically references some newer Windows API
functions and therefore doesn't run on Windows 98. Since this build of
Shuusou Gyoku doesn't introduce any new features to the input or sound
interfaces, we can still use pbg's original DirectSound and DirectInput code
for the i586 build to keep it working with the rest of the
platform-independent game logic code, but it will start to lag behind in
features as soon as we add support for SC-88Pro BGM or more sophisticated input
remapping. If we do want to keep this build at the same feature level as
the SDL one, we now have a choice: Do we write new DirectInput and
DirectSound code and get it done quickly but only for Shuusou Gyoku, or do
we port SDL 2 to Windows 98 and benefit all other SDL 2 games as
well? I leave
that for my backers to decide.
Immediately after writing the first bits of actual SDL code to initialize
the library and create the game window, you notice that SDL makes it very
simple to gradually migrate a game. After creating the game window, you can
retrieve HWND and HINSTANCE handles that allow
you to continue using your original DirectDraw, DirectSound, and DirectInput
code and focus on porting one subsystem at a time.
Sadly, D3DWindower can no longer turn SDL's fullscreen mode into a windowed
one, but DxWnd still works, albeit behaving a bit janky and insisting on
minimizing the game whenever its window loses focus. But in exchange, the
game window can surprisingly be moved now! Turns out that the originally
fixed window position had nothing to do with the way the game created its
DirectDraw context, and everything to do with pbg
blocking the Win32 "syscommand" that allows a window to be moved. By
deleting a system menu… seriously?! Now I'm dying to hear the Raymond
Chen explanation for how this behavior dates back to an unfortunate decision
during the Win16 days or something.
As implied by that commit, I immediately backported window movability to the
However, the most important part of Shuusou Gyoku's main loop is its frame
rate limiter, whose Win32 version leaves a bit of room for improvement.
Outside of the uncapped [おまけ] DrawMode, the
original main loop continuously checks whether at least 16 milliseconds have
elapsed since the last simulated (but not necessarily rendered) frame. And
by that I mean continuously, and deliberately without using any of
the Windows system facilities to sleep the process in the meantime, as
evidenced by a commented-out Sleep(1) call. This has two
important effects on the game:
The 60Fps DrawMode actually corresponds to a
frame rate of
(1000 / 16) = 62.5 FPS,
not 60. Since the game didn't account for the missing
2/3 ms to bring the limit down to exactly 60 FPS,
62.5 FPS is Shuusou Gyoku's actual official frame rate in a
non-VSynced setting, which we should also maintain in the SDL port.
Not sleeping the process turns Shuusou Gyoku's frame rate limitation
into a busy-waiting loop, which always uses 100% of a single CPU core just
to wait for the next frame.
Sure, modern computers are fast, but a frame won't ever take an
infinitely fast 0 milliseconds to render. So we still need to take the
current frame time into account.
SDL_Delay()'s documentation says that the wake-up could be
further delayed due to OS scheduling.
To address both of these issues, I went with a base delay time of
15 ms minus the time spent on the current frame, followed by
busy-waiting for the last millisecond to make sure that the next frame
starts on the exact frame boundary. And lo and behold: Even though this
still technically wastes up to 1 ms of CPU time, it still dropped CPU
usage into the 0%-2% range during gameplay on my Intel Core i5-8400T CPU,
which is over 5 years old at this point. Your laptop battery will appreciate
this new build quite a bit.
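The delay scheme described above can be sketched roughly as follows. This is a minimal illustration that uses the C++ standard library as a stand-in for SDL's timing functions, not the code from the actual port; `FRAME_TIME`, `BUSY_WAIT_TIME`, `sleep_budget()`, and `wait_for_next_frame()` are hypothetical names.

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;
using std::chrono::microseconds;

// Frame time of the original limiter: 1000 ms / 16 ms = 62.5 FPS.
constexpr microseconds FRAME_TIME{16000};

// Final part of every frame that is spent busy-waiting rather than sleeping,
// in order to absorb wake-up delays caused by OS scheduling.
constexpr microseconds BUSY_WAIT_TIME{1000};

// Time we can sleep after `spent` has elapsed on the current frame:
// a base delay of 15 ms minus the time already spent, clamped to zero.
constexpr microseconds sleep_budget(microseconds spent)
{
    return ((spent < (FRAME_TIME - BUSY_WAIT_TIME))
        ? ((FRAME_TIME - BUSY_WAIT_TIME) - spent)
        : microseconds::zero());
}

// Blocks until one frame time has passed since `frame_start`, and returns the
// exact frame boundary, which then serves as the start of the next frame.
Clock::time_point wait_for_next_frame(Clock::time_point frame_start)
{
    const Clock::time_point deadline = (
        frame_start + std::chrono::duration_cast<Clock::duration>(FRAME_TIME)
    );
    const auto spent = std::chrono::duration_cast<microseconds>(
        Clock::now() - frame_start
    );
    std::this_thread::sleep_for(sleep_budget(spent));
    while(Clock::now() < deadline) {
        // Busy-wait for the rest of the frame.
    }
    return deadline;
}
```

Returning the deadline instead of the current time keeps a late wake-up from accumulating as drift across frames, which is one possible design choice rather than a confirmed detail of the port.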
Time to look at audio then, because it sure looks less complicated than
input, doesn't it? Loading sounds from .WAV file buffers, playing a fixed
number of instances of every sound at a given position within the stereo
field and with optional looping… and that's everything already. The
DirectSound implementation is so straightforward that the most complex part
of its code is the .WAV file parser.
Well, the big problem with audio is actually finding a cross-platform
backend that implements these features in a way that seamlessly works with
Shuusou Gyoku's original files. DirectSound really is the perfect sound API
for this game:
It doesn't require the game code to specify any output sample format.
Just load the individual sound effects in their original format, and
playback just works and sounds correctly.
Its final sound stream seems to have a latency of 10 ms, which is
perfectly fine for a game running at 62.5 FPS. Even 15 ms would be
Sound effect looping? Specified by passing the
DSBPLAY_LOOPING flag to
Stereo panning balancing? One method call.
Playing the same sound multiple times simultaneously from a single
memory buffer? One
method call. (It can fail though, requiring you to copy the data after
Pausing all sounds while the game window is not focused? That's the
default behavior, but it can be equally easily disabled with just
a single per-buffer flag.
Future streaming of waveform BGM? No problem either. Windows Touhou has
always done that, and here's
some code I wrote 12½ years ago that would even work without DirectSound
8's notification feature.
No further binary bloat, because it's part of the operating system.
The last point can't really be an argument against anything, but we'd still
be left with 7 other boxes that a cross-platform alternative would have to
tick. We already picked SDL for our portability needs, so how does its audio
subsystem stack up? Unfortunately, not great:
It's fully DIY. All you get is a single output buffer, and you have to
do all the mixing and effect processing yourself. In other words, it's the
masochistic approach to cross-platform audio.
There are helper functions for resampling and mixing, but the
documentation of the latter is full of FUD. With a disclaimer that so
vehemently discourages the use of this function, what are you supposed to do
if you're newly integrating SDL audio into a game? Hunt for a separate sound
mixing library, even though your only quality goal is parity with stone-age
Positives? Uh… the callback-based nature means that BGM streaming is
rather trivial, and would even be comparatively less complicated than with
DirectSound. Having a mutex to prevent
writes to your sound instance structures while they're being read by the
audio thread is nice too.
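For a sense of what "doing all the mixing yourself" means in practice, here is a minimal sketch of the loop an audio callback would have to run for every output buffer. It assumes mono sound effects that are already available as floating-point samples at the output sampling rate, and deliberately ignores resampling and locking. `SoundInstance` and `mix()` are hypothetical names, and no SDL function is involved.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical playback state for one instance of a mono sound effect.
struct SoundInstance {
    const float* samples = nullptr; // Mono source samples within [-1, 1]
    std::size_t length = 0;         // Number of source samples
    std::size_t cursor = 0;         // Next source sample to play
    float gain_left = 1.0f;         // Linear per-channel gains for balancing
    float gain_right = 1.0f;
    bool looping = false;
    bool playing = false;
};

// Fills `out` (interleaved stereo, `frames` frames) with the sum of all
// currently playing instances and advances their playback cursors.
void mix(float* out, std::size_t frames, std::vector<SoundInstance>& instances)
{
    std::fill(out, (out + (frames * 2)), 0.0f);
    for(auto& inst : instances) {
        if(inst.length == 0) {
            inst.playing = false;
            continue;
        }
        for(std::size_t i = 0; ((i < frames) && inst.playing); i++) {
            const float sample = inst.samples[inst.cursor];
            out[(i * 2) + 0] += (sample * inst.gain_left);
            out[(i * 2) + 1] += (sample * inst.gain_right);
            if(++inst.cursor == inst.length) {
                // Either wrap around or stop at the end of the sound.
                inst.cursor = 0;
                inst.playing = inst.looping;
            }
        }
    }
    // Hard-clip the sum to the valid sample range.
    for(std::size_t i = 0; i < (frames * 2); i++) {
        out[i] = std::max(-1.0f, std::min(out[i], 1.0f));
    }
}
```

Even this simplified version has to handle looping, balancing, and clipping by hand, and a real implementation would additionally need sample format conversion and resampling for every sound effect.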
OK, sure, but you're not supposed to use it for anything more than a
single stream of audio. SDL_mixer exists precisely to cover such non-trivial
use cases, and it even supports sound effect looping and panning with just a
single function call! But as far as the rest of the library is concerned, it
manages to be an even bigger disappointment than raw SDL audio:
As it sits on top of SDL's audio subsystem, it still can't just use your
audio device's native sample format.
It only offers a very opinionated system for streaming – and of course,
its opinion is wrong. 😛 The fact that it only supports a single streaming
audio track wouldn't matter all too much if you could switch to another
track at sample precision. But since you can't, you're forced to implement
looping BGM using a single file…
…which brings us to the unfortunate issue of loop point definitions.
And, perhaps most importantly, the complete lack of any way to set them
through the API?! It doesn't take long until you come up with a theory for
why the API only offers a function to retrieve loop points: The
"music" abstraction is so format-agnostic that it even supports MIDI
and tracker formats where a typical loop point in PCM samples doesn't make
sense. Both of these formats already have in-band ways of specifying loop
points in their respective time units. They
might not be standardized, but it's still much better than usual
single-file solutions for PCM streams where the loop point has to be stored
in an out-of-band way – such as in a metadata tag or an entirely separate
Speaking of MIDI, why is it so common among these APIs to not have
any way of specifying the MIDI device? The fact that Windows Vista
removed the Control Panel option for specifying the system-wide default
MIDI output device is no excuse for your API lacking the option as well.
In fact, your MIDI API now needs such a setting more than it was
needed in the Windows XP and 9x days.
Funnily enough, they did once receive a patch for a function to set loop
points which was never upstreamed… and this patch came from
the main developer behind PyTouhou, who needed that feature for obvious
reasons. The world sure is a small place.
As a result, they turned loop points into a property that each
individual format may
not have. Want to loop
MP3 files at sample precision? Tough luck, time to reconvert to another
lossy format. 🙄 This is the exact jank I decided against when I implemented
BGM modding for thcrap back in 2018,
where I concluded that separate intro and
loop files are the way to go.
But OK, we only plan to use FLAC and Ogg Vorbis for the SC-88Pro BGM, for
which SDL_mixer does support loop points in the form of Vorbis comments,
and hey, we can even pass them at sample accuracy. Sure, it's wrong and
everything, but nothing I couldn't work with…
However, the final straw that makes SDL_mixer unsuitable for Shuusou
Gyoku is its core sound mixing paradigm of distributing all sound effects
onto a fixed number of channels, set to 8
by default. Which raises the quite ridiculous question of how many we
would actually need to cover the maximum amount of sounds that can
simultaneously be played back in any game situation. The theoretic maximum
would be 41, which is the combined sum of individual sound buffer instances
of all 20 original sound effects. The practical limit would surely be a lot
smaller, but we could only find out that one through experiments, which
honestly is quite a silly proposition.
It makes you wonder why they went with this paradigm in the first
place. And sure enough, they actually
use the aforementioned SDL core function for mixing audio. Yes, the
same function whose current documentation advises against using it for
this exact use case. 🙄 What's the argument here? "Sure, 8 is
significantly more than 2, but any mixing artifacts that will occur for
the next 6 sounds are not worth worrying about, but they get really bad
after the 8th sound, so we're just going to protect you from
This dire situation made me wonder if SDL was the wrong choice for Shuusou
Gyoku to begin with. Looking at other low-level cross-platform game
libraries, you'll quickly notice that all of them come with mostly
equally capable 2D renderers these days, and mainly differentiate themselves
in minute API details that you'd only notice upon a really close look. raylib is another one of those
libraries and has been getting exceptionally popular in recent years, to the
point of even having more than twice as many GitHub stars as SDL. By
restricting itself to OpenGL, it can even offer an
abstraction for shaders, which we'd really like for the 西方Ｐｒｏｊｅｃｔ lens ball effect.
In the case of raylib's audio system, the lack of sound effect looping is
the minute API detail that would make it annoying to use for Shuusou Gyoku.
But it might be worth a look at how raylib implements all this if it doesn't
use SDL… which turned out to be the best look I've taken in a long time,
because raylib builds on top of miniaudio
which is exactly the kind of audio library I was hoping to find.
Let's check the list from above:
🟢 miniaudio's high-level API initialization defaults to the native
sample format of the playback device. Its internal processing uses 32-bit
floating-point samples and only converts back to the native bit depth as
necessary when writing the final stream into the backend's audio buffer.
WASAPI, for example, never needs any further conversion because it operates
with 32-bit floats as well.
🟢 The final audio stream uses the same 10 ms update period (and
thus, sound effect latency) that I was getting with DirectSound.
🟢 Stereo panning balancing? ma_sound_set_pan(),
although it does require a conversion from Shuusou Gyoku's dB units into a
linear attenuation factor.
🟢 Sound effect looping? ma_sound_set_looping().
🟢 Playing the same sound multiple times simultaneously from a single
memory buffer? Perfectly possible, but requires a bit of digging in the
header to find the best solution. More on that below.
🟢 Future streaming of waveform BGM? Just call
ma_sound_init_from_file() with the
👍 It also comes with a FLAC decoder in the core library and an Ogg
Vorbis one as part of the repo, …
🤩 … and even supports gapless switching between the intro and loop
files via a single declarative call to
(Oh, and it also has ma_data_source_set_loop_point_in_pcm_frames()
for anyone who still believes in obviously and objectively
inferior out-of-band loop points.)
🟢 Pausing all sounds while the game window is not focused? It's not
automatic, but adding new functions to the sound interface and calling
ma_engine_stop() and ma_engine_start() does the
trick, and most importantly doesn't cause any samples to be lost in the
🟡 Sound control is implemented in a lock-free way, allowing your main
game thread to call these at any time without causing glitches on the audio
thread. While that looks nice and optimal on the surface, you now have to
either believe in the soundness (ha) of the implementation, or verify that
atomic structure fields actually are enough to not cause any race
conditions (which I did for the calls that Shuusou Gyoku uses, and I didn't
find any). "It's all lock-free, don't worry about it" might be
easier, but I consider SDL's approach of just providing a mutex to
prevent the output callback from running while you mutate the sound state to
actually be simpler conceptually.
🟡 miniaudio adds 247 KB to the binary in its minimum
configuration, a bit more than expected. Some of that is bloat from effect
code that we never use, but it does include backends for all three Windows
audio subsystems (WASAPI, DirectSound, and WinMM).
✅ But perhaps most importantly: It natively supports all modern
operating systems that one could seriously want to port this game to, and
could be easily ported to any other backend, including
Oh, and it's written by the same developer who also wrote the best FLAC
library back in 2018. And that's despite them being single-file C libraries,
which I consider to be massively overrated…
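The conversion from dB units into a linear factor that the ma_sound_set_pan() item in the list above mentions could look roughly like this. The sketch makes two assumptions: that Shuusou Gyoku passes pan values in DirectSound's convention of hundredths of a decibel (-10000 for fully left, +10000 for fully right, with the opposite channel being attenuated), and that miniaudio runs in its default balance mode, where a pan value of p attenuates the opposite channel by a linear factor of (1 - |p|). Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>

// Converts a gain in hundredths of a decibel into a linear amplitude factor:
// factor = 10^(dB / 20) = 10^(mB / 2000)
inline float linear_from_millibels(int mb)
{
    return std::pow(10.0f, (mb / 2000.0f));
}

// Maps a DirectSound-style pan value to a pan value within [-1, 1].
// Negative values attenuate the right channel, positive ones the left.
inline float pan_from_millibels(int pan_mb)
{
    const int clamped = std::max(-10000, std::min(pan_mb, 10000));
    const float attenuation = linear_from_millibels(-std::abs(clamped));
    const float pan = (1.0f - attenuation);
    return ((clamped < 0) ? -pan : pan);
}
```

Under these assumptions, a value of -602 (about -6 dB on the right channel) ends up close to a pan value of -0.5, since 6 dB roughly correspond to halving the amplitude.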
The only slightly tricky part of implementing a miniaudio backend for
Shuusou Gyoku lies in setting up multiple simultaneously playing instances
for each individual sound. The documentation and answers on the issue
tracker heavily push you toward miniaudio's resource manager and its file
abstractions to handle this use case. We surely could turn Shuusou Gyoku's
numeric sound effect IDs into fake file names, but it doesn't really fit the
existing architecture where the sound interface just receives in-memory .WAV
file buffers loaded from the SOUND.DAT packfile.
In that case, this seems to be the best way:
Call ma_decode_memory() to decode from any of the supported
audio formats to a buffer of raw PCM samples. At this point, you can
choose between
1) decoding into the original format the sound effect is stored in,
which would require it to be converted to the playback format every
time it's played, or
2) decoding into 32-bit floats (the native bit depth of the miniaudio
engine) and the native sampling rate of the playback device, which
avoids any further resampling and floating-point conversion, but takes
up more memory.
Nowadays, it's not clear at all which of the two approaches is faster.
Does it actually matter if we save the audio thread from doing all those
floating-point operations on every sample? Or is that no longer true these
days because the audio thread is probably running on a different CPU core,
the rest of the game largely doesn't touch the floating-point parts of your
CPU anyway, and you'd rather want to keep sound effects small so that they
can better fit into the CPU cache? That would be an interesting question to
benchmark, but just like the similar text rendering question from the last
blog posts, it doesn't matter for this tiny 2000s retro game. 😌
I went with 2) mainly because it simplified all the debugging I was doing.
At a sampling rate of 48,000 Hz, this increases the memory usage for
all sound effects from 379 KiB to 3.67 MiB. At least I'm not
channel-expanding all sound effects as well here…
We've seen earlier that mono➜stereo expansion
is SSE-optimized, so it's very hard to justify a further doubling of the
memory usage here.
Then, for each instance of the sound, call
ma_audio_buffer_ref_init() to create a reference
buffer with its own playback cursor, and
ma_sound_init_from_data_source() to create a new
high-level sound node that will play back the reference buffer.
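To put the memory cost of option 2) into perspective, the following sketch estimates how much a sound effect grows when it is decoded to 32-bit floats at the device's sampling rate. The struct and function names are hypothetical. The example rates of 22,050 Hz and 11,025 Hz are not stated directly but follow from the Nyquist frequencies quoted further below, and the sounds are assumed to be 8-bit mono.

```cpp
#include <cstdint>

struct PcmFormat {
    std::uint32_t sample_rate;      // in Hz
    std::uint32_t channels;
    std::uint32_t bytes_per_sample;
};

// Size of `frames` PCM frames in their original storage format.
constexpr std::uint64_t stored_bytes(
    std::uint64_t frames, const PcmFormat& format
)
{
    return (frames * format.channels * format.bytes_per_sample);
}

// Size of the same sound after resampling to `device_rate` and converting to
// 32-bit floats, while keeping the original channel count. The frame count is
// rounded up.
constexpr std::uint64_t decoded_bytes(
    std::uint64_t frames, const PcmFormat& format, std::uint32_t device_rate
)
{
    return (
        (((frames * device_rate) + (format.sample_rate - 1)) / format.sample_rate) *
        format.channels * sizeof(float)
    );
}
```

One second of 8-bit mono audio takes up 22,050 bytes at 22,050 Hz and 11,025 bytes at 11,025 Hz, but 192,000 bytes in both cases once decoded for a 48,000 Hz device. That amounts to growth factors of roughly 8.7× and 17.4×, which brackets the overall factor of roughly 10× quoted above.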
As a side effect of hunting that one critical bug in miniaudio, I've now
learned a fair bit about audio resampling in general. You'll probably need
some knowledge about basic
digital signal behavior to follow this section, and that video is still
probably the best introduction to the topic.
So, how could this ever be an issue? The only time I ever consciously
thought about resampling used to be in the context of the Opus codec and its
enforced sampling rate of 48,000 Hz, and how Opus advocates
claim that resampling is a solved problem and nothing to worry about,
especially in the context of a lossy codec. Still, I didn't add Opus to
thcrap's BGM modding feature entirely because the mere thought of having to
downsample to 44,100 Hz in the decoder was off-putting enough. But even
if my worries were unfounded in that specific case: Recording the
Stereo Mix of Shuusou Gyoku's now two audio backends revealed that
apparently not every audio processing chain features an Opus-quality
resampler…
If we take a look at the material that resamplers actually have to work with
here, it quickly becomes obvious why their results are so varied. As
mentioned above, Shuusou Gyoku's sound effects use rather low sampling rates
that are pretty far away from the 48,000 Hz your audio device is most
definitely outputting. Therefore, any potential imaging noise across the
extended high-frequency range – i.e., from the original Nyquist frequencies
of 11,025 Hz/5,512.5 Hz up to the new limit of 24,000 Hz – is
still within the audible range of most humans and can clearly color the
sound.
But it gets worse if the audio data you put into the resampler is
objectively defective to begin with, which is exactly the problem we're
facing with over half of Shuusou Gyoku's sound effects. Encoding them all as
8-bit PCM is definitely excusable because it was the turn of the millennium
and the resulting noise floor is masked by the BGM anyway, but the blatant
clipping and DC offsets definitely aren't:
Waveforms for all 20 of Shuusou Gyoku's sound effects, in the order they
appear inside SOUND.DAT and with their internal names. We can
see quite an abundance of clipping, as well as a significant DC offset in
WARNING, BUZZ, JOINT, SBBOMB, and BOSSBOMB.
Wait a moment, true peaks? Where do those come from? And, equally
importantly, how can we even observe, measure, and store anything
above the maximum amplitude of a digital signal?
The answer to the first question can be directly derived from the Xiph.org
video I linked above: Digital signals are lollipop graphs, not stairsteps as
commonly depicted in audio editing software. Converting them back to an
analog signal involves constructing a continuous curve that passes through
each sample point, and whose frequency components stay below the Nyquist
frequency. And if the amplitude of that reconstructed wave changes too
strongly and too rapidly, the resulting curve can easily overshoot the
maximum digital amplitude of 0 dBFS even if none of the defined samples are
above that limit.
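The textbook case of such an overshoot can be sketched in a few lines. The following assumes a pure sine wave at exactly ¼ of the sampling rate with a 45° phase offset, so that every sample misses the crest of the wave. Since a pure sine below the Nyquist frequency reconstructs to itself, the code simply evaluates the sine between the sample points instead of implementing an actual sinc interpolator. All names are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

constexpr double PI = 3.14159265358979323846;

// A sine at 1/4 of the sampling rate, phase-shifted by 45°.
// `t` is measured in sample periods, so integer values of `t` are the
// positions of the stored samples.
double wave(double t, double amplitude) {
    return (amplitude * std::sin(((PI / 2.0) * t) + (PI / 4.0)));
}

struct Peaks {
    double sample_peak; // highest absolute value among the stored samples
    double true_peak;   // highest absolute value of the continuous curve
};

Peaks measure(double amplitude, int sample_count, int oversampling) {
    assert((sample_count > 0) && (oversampling > 0));
    Peaks ret = { 0.0, 0.0 };
    for (int i = 0; i < (sample_count * oversampling); i++) {
        const double t = (static_cast<double>(i) / oversampling);
        const double v = std::abs(wave(t, amplitude));
        if ((i % oversampling) == 0) {
            ret.sample_peak = std::max(ret.sample_peak, v);
        }
        ret.true_peak = std::max(ret.true_peak, v);
    }
    return ret;
}
```

With an amplitude of √2, every stored sample lands at ±1.0 and thus exactly at 0 dBFS, while the curve between the samples reaches ±1.414, or roughly +3 dB.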
So let's store the resampled output as a FLAC file and load it into Audacity
to visualize the clipped peaks… only to find all of them replaced with the
typical kind of clipping distortion? 😕 Turns out that I've stumbled over
the one case where the FLAC format isn't lossless and there's
actually no alternative to .WAV: FLAC just doesn't support
floating-point samples and simply truncates them to discrete integers during
encoding. When we measured inter-sample peaks above, we weren't only
resampling to a floating-point format to avoid any quantization to discrete
integer values, but also to make it possible to store amplitudes beyond the
0 dBFS point of ±1.0 in the first place. Once we lose that ability,
these amplitudes are clipped to the maximum value of the integer bit depth,
and baked into the waveform with no way to get rid of them again. After all,
the resampled file now uses a higher sampling rate, and the clipping
distortion is now a defined part of what the sound is.
Finally, storing a digital signal with inter-sample peaks in a
floating-point format also makes it possible for you to reduce the
volume, which moves these peaks back into the regular, unclipped amplitude
range. This is especially relevant for Shuusou Gyoku as you'll probably
never listen to sound effects at full volume.
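The difference between the two storage formats can be sketched as follows. This is only an illustration: the scaling by 32767 is one common convention for converting floats to 16-bit PCM and not necessarily what any specific FLAC encoder does, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantizes a float sample to 16-bit PCM. Anything beyond ±1.0 (0 dBFS)
// has no integer representation and gets clamped to the maximum value.
int16_t to_s16(float sample) {
    const float clamped = std::max(-1.0f, std::min(1.0f, sample));
    return static_cast<int16_t>(std::lround(clamped * 32767.0f));
}

float from_s16(int16_t sample) {
    return (sample / 32767.0f);
}

float apply_volume(float sample, float volume) {
    assert(volume >= 0.0f);
    return (sample * volume);
}
```

For an inter-sample peak of 1.4, reducing the volume to 50% in floating-point yields a perfectly valid 0.7. After a round trip through `to_s16()`, the same peak has been flattened to 1.0, and the volume reduction only turns it into a quieter, but still clipped, 0.5.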
Now that we understand what's going on there, we can finally compare the
output of various resamplers and pick a suitable one to use with miniaudio.
And immediately, we see how they fall into two categories:
High-quality resamplers are the ones I described earlier: They cleanly
recreate the signal at a higher sampling rate from its raw frequency
representation and thus add no high-frequency noise, but can lead to
inter-sample peaks above 0 dBFS.
Linear resamplers use much simpler math to merely interpolate
between neighboring samples. Since the newly interpolated samples can only
ever stay within 0 dBFS, this approach fully avoids inter-sample
clipping, but at the expense of adding high-frequency imaging noise that has
to then be removed using a low-pass filter.
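For reference, the second category boils down to something like the following sketch. This is not miniaudio's or DirectSound's actual implementation, just a minimal textbook version of linear interpolation without the subsequent low-pass filter; `resample_linear()` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Resamples `in` from `in_rate` to `out_rate` by linearly interpolating
// between the two input samples closest to each output position.
std::vector<float> resample_linear(
    const std::vector<float>& in, unsigned in_rate, unsigned out_rate
) {
    assert((in_rate > 0) && (out_rate > 0));
    if (in.empty()) {
        return {};
    }
    const size_t out_count = static_cast<size_t>(
        (static_cast<unsigned long long>(in.size()) * out_rate) / in_rate
    );
    std::vector<float> out(out_count);
    for (size_t i = 0; i < out_count; i++) {
        // Position of output sample `i` on the input timeline.
        const double pos = ((static_cast<double>(i) * in_rate) / out_rate);
        const size_t left = static_cast<size_t>(pos);
        // Past the final input sample, just hold its value.
        const size_t right = (((left + 1) < in.size()) ? (left + 1) : left);
        const double frac = (pos - static_cast<double>(left));
        out[i] = static_cast<float>(
            (in[left] * (1.0 - frac)) + (in[right] * frac)
        );
    }
    return out;
}
```

Every output sample is a weighted average of two input samples with weights that add up to 1, which is why the result can never leave the amplitude range of the input.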
miniaudio only comes with a linear resampler – but so does DirectSound as it
turns out, so we can actually get pretty close to how the game sounded.
And yes, these are indeed the first videos on this blog to have sound! I
spent another push on preparing the
📝 video conversion pipeline for audio
support, and on adding the highly important volume control to the player.
Web video codecs only support lossy audio, so the sound in these videos will
not exactly match the spectrum image, but the lossless source files do
contain the original audio as uncompressed PCM streams.
Compared to that whole mess of signals and noise, keyboard and joypad input
is indeed much simpler. Thanks to SDL, it's almost trivial, and only
slightly complicated because SDL offers two subsystems with seemingly
SDL_GameController provides a consistent interface for the typical kind
of modern gamepad with two analog sticks, a D-pad, and at least 4 face and 2
shoulder buttons. This API is implemented by simply combining SDL_Joystick
with a long list of mappings for specific controllers, and therefore doesn't
work with joypads that don't match this standard.
To match Shuusou Gyoku's original WinMM backend, we'd ideally want to keep
the best aspects from both APIs but without being restricted to
SDL_GameController's idea of a controller. The Joy
Pad menu just identifies each button with a numeric ID, so
SDL_Joystick would be a natural fit. But what do we do about directional
controls if SDL_Joystick doesn't tell us which joypad axes correspond to the
X and Y directions, and we don't have the SDL-recommended configuration UI yet?
Doing that right would also mean supporting
POV hats and D-pads, after all… Luckily, all joypads we've tested map
their main X axis to ID 0 and their main Y axis to ID 1, so this seems like
a reasonable default guess.
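A minimal sketch of that default guess, assuming that the raw values come from SDL_JoystickGetAxis() for axis IDs 0 and 1, with their usual signed 16-bit range and negative Y meaning "up". The direction flags and the dead zone value are made up for this example and don't reflect the game's actual input bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical direction flags.
enum Direction : uint8_t {
    DIR_NONE = 0,
    DIR_LEFT = 1,
    DIR_RIGHT = 2,
    DIR_UP = 4,
    DIR_DOWN = 8,
};

// Assumed dead zone; analog sticks rarely rest at exactly 0.
constexpr int16_t DEADZONE = 8000;

// `x` and `y` are the raw values of axis 0 and axis 1, respectively.
uint8_t directions_from_axes(int16_t x, int16_t y, int16_t deadzone = DEADZONE) {
    assert(deadzone >= 0);
    uint8_t ret = DIR_NONE;
    if (x < -deadzone) {
        ret |= DIR_LEFT;
    } else if (x > deadzone) {
        ret |= DIR_RIGHT;
    }
    if (y < -deadzone) {
        ret |= DIR_UP;
    } else if (y > deadzone) {
        ret |= DIR_DOWN;
    }
    return ret;
}
```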
The necessary consolidation of the game's original input handling uncovered
several minor bugs around the High Score and Game Over screen that I
sufficiently described in the release notes of the new build. But it also
revealed an interesting detail about the Joy Pad
screen: Did you know that Shuusou Gyoku lets you unbind all these
actions by pressing more than one joypad button at the same time? The
original game indicated unbound actions with a [Button
0] label, which is pretty confusing if you have ever programmed
anything because you now no longer know whether the game starts numbering
buttons at 0 or 1. This is now communicated much more clearly.
With that, we're finally feature-complete as far as this delivery is
concerned! Let's send a build over to the backers as a quick sanity check…
a~nd they quickly found a bug when running on Linux and Wine. When holding a
button, the game randomly stops registering directional inputs for a short
while on some joypads? Sounds very much like a Wine bug, especially if the
same pad works without issues on Windows.
And indeed, on certain joypads, Wine maps the buttons to completely
different and disconnected IDs, as if it simply invents new buttons or axes
to fill the resulting gaps. Until we can differentiate joypad bindings
per controller, it's therefore unlikely that you can use the same joypad
mapping on both Windows and Linux/Wine without entering the Joy Pad menu and remapping the buttons every time you
switch operating systems.
Still, by itself, this shouldn't cause any issues with my SDL event handling
code… except, of course, if I forget a break; in a switch case.
This completely preventable implicit fallthrough has now caused a few hours
of debugging on my end. I'd better crank up the warning level to keep this
from ever happening again. Opting into this specific warning also revealed
why we haven't been getting it so far: Visual Studio did gain a whole host
of new warnings related to the C++ Core
Guidelines a while ago, including the one I
was looking for, but actually getting the compiler to throw these requires
a separate static analysis mode together with a plugin, which
significantly slows down build times. Therefore I only activate them for
release builds, since these already take long enough.
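For illustration, this is the general shape of such a bug. The event types and the state structure are hypothetical stand-ins and not the actual SDL event handling code from the game; the comment marks the kind of spot where a forgotten break; silently changes behavior.

```cpp
#include <cassert>

// Hypothetical event types, standing in for SDL's event union.
enum class EventType { ButtonDown, AxisMotion, ButtonUp };

struct PadState {
    bool button_held = false;
    int axis_value = 0;
};

void handle_event(PadState& state, EventType type, int value) {
    switch (type) {
    case EventType::ButtonDown:
        state.button_held = true;
        // Without this `break`, execution would continue into the
        // AxisMotion case and overwrite the axis with the button's value.
        break;
    case EventType::AxisMotion:
        state.axis_value = value;
        break;
    case EventType::ButtonUp:
        state.button_held = false;
        break;
    default:
        assert(!"unhandled event type");
        break;
    }
}
```

Standard C++17 also offers the [[fallthrough]] attribute to mark the rare intentional cases, and GCC and Clang can flag all unmarked ones via -Wimplicit-fallthrough.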
Since all that input debugging already started a 5th push, I
might as well fill that one by restoring the original screenshot feature.
After all, it's triggered by a key press (and is thus related to the input
backend), reads the contents of the frame buffer (and is thus related to the
graphics backend), and it honestly looks bad to have this disclaimer in the
release notes just because we're one small feature away from 100% parity
with pbg's original binary.
Coincidentally, I had already written code to save a DirectDraw surface to a
.BMP file for all the debugging I did in the last delivery, so we were
basically only missing filename generation. Except that Shuusou
Gyoku's original choice of mapping screenshots to the PrintScreen key did
not age all too well:
And as of Windows 11, the OS takes full control of the key by binding it
to the Snipping Tool by default, complete with a UI that politely steals
focus when hitting that key.
As a result, both Arandui and I independently arrived at the
idea of remapping screenshots to the P key, which is the same screenshot key
used by every Windows Touhou game since TH08.
The rest of the feature remains unchanged from how it was in pbg's original
build and will save every distinct frame rendered by the game (i.e., before
flipping the two framebuffers) to a .BMP file as long as the P key is being
held. At a 32-bit color depth, these screenshots take up 1.2 MB per
frame, which will quickly add up – especially since you'll probably hold the
P key for more than 1/60 of a second and therefore end
up saving multiple frames in a row. We should probably compress
them one day.
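To put a number on "quickly add up", here is the arithmetic as a small sketch. It assumes the game's 640×480 resolution and 60 frames per second, and only counts raw pixel data without the comparatively tiny .BMP headers. Both function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Raw pixel payload of a single frame, in bytes.
uint64_t frame_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel) {
    assert((bits_per_pixel % 8) == 0);
    return (static_cast<uint64_t>(width) * height * (bits_per_pixel / 8));
}

// Amount of data written for every second the screenshot key is held.
uint64_t bytes_per_second(uint64_t frame_size, uint32_t frames_per_second) {
    return (frame_size * frames_per_second);
}
```

frame_bytes(640, 480, 32) comes out at 1,228,800 bytes, matching the 1.2 MB from above. Holding the key for one full second at 60 FPS would then write 73,728,000 bytes to disk.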
Since I already translated some of Shuusou Gyoku's ASM code to C++ during
the Zig experiment, it made sense to finish the fifth push by covering the
rest of those functions. The integer math functions are used all throughout
the game logic, and are the main reason why this goal is important for a
Linux port, or any port to a 64-bit architecture for that matter. If you've
ever read a micro-optimization-related blog post, you'll
know that hand-written ASM is a great recipe that often results in the
finest jank, and the game's square root function definitely delivers in that
regard, right out of the gate.
What slightly differentiates this algorithm from the typical definition of
square root is that it rounds up: In real numbers, √3 is
≈ 1.73, so isqrt(3) returns 2 instead of 1. However, if
the result is always rounded down, you can determine whether you have to
round up by simply squaring the calculated root and comparing it to the radicand. And even that
is only necessary if the difference between the two doesn't naturally fall
out of the algorithm – which is what also happens with Shuusou Gyoku's
original ASM code, but pbg
didn't realize this and squared the result regardless.
That's one suboptimal detail already. Let's call the original ASM function
in a loop over the entire supported range of radicands from 0 to
2³¹ and produce a list of results that I can verify my C++
translation against… and watch as the function's linear time complexity with
regard to the radicand causes the loop to run for over 15 hours on my
system. 🐌 In a way, I've found the literal opposite of Q_rsqrt()
here: Not fast, not inverse, no bit hacks, and surely without the
awe-inspiring kind of WTF.
I really didn't want to run the same loop over a
literal C++ translation of the same algorithm afterward. Calculating
integer square roots is a common problem with lots of solutions, so let's
see if we can go better than linear.
And indeed, Wikipedia
also has a bitwise algorithm that runs in logarithmic time, uses only
additions, subtractions, and bit shifts, and even ends up with an error term
that we can use to round up the result as necessary, without a
multiplication. And this algorithm delivers the exact same results over the
exact same range in… 50 seconds. 🏎️ And that's with the I/O to print
the first value that returns each of the 46,341 different square root
results.
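For reference, here's a sketch of how such a bitwise algorithm can look in C++. It follows the well-known digit-by-digit binary method and is not necessarily identical to the function that ended up in the game. It also assumes that the rounding is meant as rounding to the nearest integer, which the √3 example suggests. The name isqrt_rounded() is made up.

```cpp
#include <cassert>
#include <cstdint>

// Integer square root, rounded to the nearest integer. Only uses additions,
// subtractions, comparisons, and bit shifts.
uint32_t isqrt_rounded(uint32_t radicand) {
    assert(radicand <= (1u << 31)); // the range tested in the text

    uint32_t remainder = radicand;
    uint32_t root = 0;

    // Start with the highest power of 4 that is <= the radicand.
    uint32_t bit = (1u << 30);
    while (bit > radicand) {
        bit >>= 2;
    }

    // Determine one binary digit of the root per iteration.
    while (bit != 0) {
        if (remainder >= (root + bit)) {
            remainder -= (root + bit);
            root = ((root >> 1) + bit);
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }

    // Now, root = floor(sqrt(radicand)) and remainder = radicand - root².
    // Since (root + 0.5)² = root² + root + 0.25, the real square root is
    // closer to (root + 1) exactly if the remainder is larger than the root.
    return (root + ((remainder > root) ? 1 : 0));
}
```

The loop runs at most 16 times for a 32-bit radicand, and the leftover remainder is the error term that decides the rounding without any multiplication.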
"But wait a moment!", I hear you say. "Why are you bothering with
an integer square root algorithm to begin with? Shouldn't good old
round(sqrt(x)) from <math.h> do the trick
just fine? Our CPUs have had SSE for a long time, and this probably compiles
into the single SQRTSD instruction. All that extra
floating-point hardware might mean that this instruction could even run in
parallel with non-SSE code!"
And yes, all of that is technically true. So I tested it, and my very
synthetic and constructed micro-benchmark did indeed deliver the same
results in… 48 seconds. That's not enough of a
difference to justify breaking the spirit of treating the FPU as lava that
permeates Shuusou Gyoku's code base. Besides, it's not used for that much to
begin with:
pre-calculating the 西方Ｐｒｏｊｅｃｔ lens ball effect
the fade animation when entering and leaving stages
rendering the circular part of stationary lasers
pulling items to the player when bombing
After a quick C++ translation of the RNG function that spells out a 32-bit
multiplication on a 32-bit CPU using 16-bit instructions, we reach the final
pieces of ASM code for the 8-bit atan2() and trapezoid
rendering. These could actually pass for well-written ASM code in how they
express their 64-bit calculations: atan8() prepares its 64-bit
dividend in the combined EDX and EAX registers in
a way that isn't obvious at all from a cursory look at the code, and the
trapezoid functions effectively use Q32.32 subpixels. C++ allows us to
cleanly model all these calculations with 64-bit variables, but
unfortunately compiles the divisions into a call to a comparatively much
more bloated 64-bit/64-bit-division polyfill function. So yeah, we've
actually found a well-optimized piece of inline assembly that even Visual
Studio 2022's optimizer can't compete with. But then again, this is all
about code generation details that are specific to 32-bit code, and it
wouldn't be surprising if that part of the optimizer isn't getting much
attention anymore. Whether that optimization was useful, on the other hand…
Oh well, the new C++ version will be much more efficient in 64-bit builds.
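To illustrate what Q32.32 subpixels look like when modeled with 64-bit variables, here's a minimal sketch of stepping along one edge of a trapezoid. It's not a translation of the game's trapezoid functions; all names are made up, and the right shift of negative values is assumed to be arithmetic, which is what all relevant compilers do and what C++20 guarantees.

```cpp
#include <cassert>
#include <cstdint>

// Fixed-point position with 32 integer and 32 fractional (subpixel) bits.
using q32_32 = int64_t;

constexpr q32_32 to_q32_32(int32_t pixels) {
    return (static_cast<q32_32>(pixels) * (1LL << 32));
}

// Whole-pixel part of a position, rounded down.
constexpr int32_t to_pixels(q32_32 v) {
    return static_cast<int32_t>(v >> 32);
}

// Amount by which the X position of an edge changes per scanline if it runs
// from `x_top` to `x_bottom` across `height` scanlines.
q32_32 edge_step(int32_t x_top, int32_t x_bottom, int32_t height) {
    assert(height > 0);
    // This 64-bit division is the kind of operation that turns into a call
    // to a helper function in 32-bit builds.
    return ((to_q32_32(x_bottom) - to_q32_32(x_top)) / height);
}
```

An edge from X = 0 to X = 10 across 4 scanlines gets a step of exactly 2.5 pixels, so accumulating the step yields the pixel positions 2, 5, 7, and 10.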
And with that, there's no more ASM code left in Shuusou Gyoku's codebase,
and the original DirectXUTYs directory is slowly getting
emptier and emptier.
Phew! Was that everything for this delivery? I think that was everything.
Here's the new build, which checks off 7 of the 15 remaining portability
Next up: Taking a well-earned break from Shuusou Gyoku and starting with the
preparations for multilingual PC-98 Touhou translatability by looking at
TH04's and TH05's in-game dialog system, and definitely writing a shorter
blog post about all that…
And then I'm even late by yet another two days… For some reason, preparing
Shuusou Gyoku for an OpenGL port has been the most difficult and drawn-out
task I've worked on so far throughout this project. These pushes were in
development since April, and over two months in total. Tackling a legacy
codebase with such a rather vague goal while simultaneously wanting to keep
everything running did not do me any favors, and it was pretty hard to
resist the urge to fix everything that had better be fixed to make
this game portable…
📝 2022 ended with Shuusou Gyoku working at full speed on Windows ≥8 by
itself, without external tools, for the first
time. However, since it all came down to just one small bugfix, the
resulting build still had several issues:
The game might still start in the slow, mitigated 8-bit or 16-bit
mode if the respective app compatibility flag is still present in the
registry from the earlier 📝 P0217 build. A
player would then have to manually put the game into 32-bit mode via the
Option menu to make it run at its actual intended speed. Bypassing this flag
programmatically would require some rather fiddly .EXE patching techniques.
The 32-bit mode tends to lag significantly if a lot of sprites are
onscreen, for example when canceling the final pattern of the Extra Stage
If the game window lost and regained focus during the ending (for
example via Alt-Tabbing), the game reloads the wrong sprite sheet. (#19)
And, of course, we still have no native windowed mode, or support for
rendering in the higher resolutions you'd want to use on modern high-DPI
displays.
Now, we could tackle all of these issues one by one, in focused pushes… or
wait for one hero to fund a full-on OpenGL backend as part of the larger
goal of porting this game to Linux. This would take much longer, but fix all
these issues at once while bringing us significantly closer to Shuusou Gyoku
being cross-platform. Which is exactly what Ember2528 did.
Shuusou Gyoku is a very Windows-native codebase. Its usage of types
declared in <windows.h> even extends to core gameplay
code, the rendering code is completely architected around DirectDraw's
features and drawbacks, and text rendering is not abstracted at all. Looks
like it's now my task to write all the abstractions that pbg didn't manage
Therefore, I chose to stay with DirectDraw for a few more pushes while I
would build these abstractions. In hindsight, this was the least efficient
approach one could possibly imagine for the exact goal of porting the game
to Linux. Suddenly, I had to understand all this DirectDraw and GDI
jank, just to keep the game running at every step along the way. Retaining
Shuusou Gyoku's 8-bit mode in particular was a huge pain, but I didn't want
to remove it because it's currently the only way I can easily debug the game
in windowed mode at a scaled resolution, through DxWnd. In 16-bit or
32-bit mode, DxWnd slows down to a crawl, roughly resembling the performance
drop we used to get with Windows' own compatibility mitigations for the
The upside, though, is that everything I've built so far still works with
the original 8-bit and 16-bit graphics modes. And with just one compiler flag to disable
any modern x86 instructions, my build can still run on i586/P5 Pentium
CPUs, and only requires KernelEx and its latest
Kstub822 patches to run on Windows 98. And, surprisingly, my core
audience does appreciate this fact. Thus, I will include an i586 build
in all of my upcoming Shuusou Gyoku releases from now on. Once this codebase
can compile into a 64-bit binary (which will obviously be required for a
native Linux build), the i586 build will remain the only 32-bit Windows
build I'll include in my releases.
So, what was DirectDraw? In the shortest way that still describes it
accurately from the point of view of a developer: "A hardware acceleration
layer over Ye Olde Win32 GDI, providing double-buffering and fast blitting
of rectangles." There's the primary double-buffered framebuffer
surface, the offscreen surfaces that you create (which are
comparable to what 3D rendering APIs would call textures), and you
can blit rectangular regions between the two. That's it. Except for
double-buffering, DirectDraw offers no feature that GDI wouldn't also
support, while not covering some of GDI's more complex features. I mean,
DirectDraw can blit rectangles only? How
However, DirectDraw's relative lack of features is not as much of a problem
as it might appear at first. The reason for that lies in what I consider to
be DirectDraw's actual killer feature: compatibility with GDI's device
context (DC) abstraction. By acquiring a DC for a DirectDraw surface,
you can use all existing GDI functions to draw onto the surface, and, in
general, it will all just work. 😮 Most notably, you can use GDI's blitting
functions (i.e., BitBlt() and friends) to transfer pixel data
from a GDI HBITMAP in system memory onto a DirectDraw surface
in video memory, which is the easiest and most straightforward way to, well,
get sprite data onto a DirectDraw surface in the first place.
In theory, you could do that without ever touching GDI by locking the
surface memory and writing the raw bytes yourself. But in practice, you
probably won't, because your game has to run under multiple bit depths and
your data files typically only store one copy of all your sprites in a
single bit depth. And the necessary conversion and palette color matching…
is a mere implementation detail of GDI's blitting functions, using a
supposedly optimized code path for every permutation of source and
destination bit depths.
All in all, DirectDraw doesn't look too bad so far, does it? Fast blitting,
and you can still use the full wealth of GDI functions whenever needed… at
the small cost of potentially losing your surface memory at any time. 🙄
Yup, if a DirectDraw game runs in true resolution-changing fullscreen mode
and you switch to the Windows desktop, all your surface memory is freed and
you have to manually restore it once the game regains focus, followed by
manually copying all intended bitmap data back onto all surfaces. DirectDraw
is where this concept of surface loss originated, which later carried over
to the earlier versions of Direct3D and, infamously,
Direct2D as well.
Looking at it from the point of view of the mid-90s, it does make sense to
let the application handle trashed video memory if that's an unfortunate
reality that your graphics API implementation has to deal with. You don't
want to retain a second copy of each surface in a less volatile part of
memory because you didn't have that much of it. Instead, the application can
now choose the most appropriate way to restore each individual surface. For
procedurally generated surfaces, it could just re-run the generating code,
whereas all the fixed sprite sheets could be reloaded from disk.
In practice though, this well-intentioned freedom turns into a huge pain.
Suddenly, it's no longer enough to load every sprite sheet once before it's
needed, blit its pixel data onto the DirectDraw surface, and forget about
it. Now, the renderer must also be able to refresh the pixel data of every
surface from within itself whenever any of DirectDraw's blitting
functions fails with a DDERR_SURFACELOST error. This fact alone
is enough to push your renderer interface towards central management and
allocation of surfaces. You could maybe avoid the conceptual
SurfaceManager by bundling each surface with a regeneration
callback, but why should you? Any other graphics API would work with
straight-line procedural load-and-forget initialization code, so why slice
that code into little parts just because of some DirectDraw quirk?
So if your surfaces can get trashed at any time, and you already use
GDI to copy them from system memory to DirectDraw-managed video memory,
and your game features at least one procedurally generated surface…
you might as well retain every currently loaded surface in the form of an
additional GDI device-independent bitmap. 🤷 In fact, that's even better
than what Shuusou Gyoku did originally: For all .BMP-sourced surfaces, it
only kept a buffer of the entire decompressed .BMP file data, which means
that it had to recreate said intermediate GDI bitmap every time it needed to
restore a surface. The in-game music title was originally restored
via regeneration callback that re-rendered the intended title directly onto
the DirectDraw surface, but this was handled by an additional "restore hook"
system that remained unused for anything else.
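The retained-copy idea can be modeled without any DirectDraw or GDI calls. In the following sketch, FakeVideoSurface stands in for a DirectDraw surface whose contents can vanish at any time, and the retained vector plays the role of the GDI DIB in system memory. In real code, the `lost` flag would correspond to a blitting call failing with DDERR_SURFACELOST, followed by a call to the surface's Restore() method before copying the pixels back. All class and method names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Pixels = std::vector<uint8_t>;

// Stand-in for a surface in volatile video memory.
struct FakeVideoSurface {
    Pixels vram;
    bool lost = false;
};

class SurfaceManager {
    struct Entry {
        Pixels retained; // system-memory copy that survives surface loss
        FakeVideoSurface surface;
    };
    std::vector<Entry> entries;

public:
    // Uploads `pixels` to a new surface, retains a copy, and returns an ID.
    size_t load(Pixels pixels) {
        Entry entry;
        entry.surface.vram = pixels;
        entry.retained = std::move(pixels);
        entries.push_back(std::move(entry));
        return (entries.size() - 1);
    }

    // Simulates the user switching away from the fullscreen game.
    void lose_all() {
        for (auto& entry : entries) {
            entry.surface.vram.clear();
            entry.surface.lost = true;
        }
    }

    // Returns the surface contents for blitting, restoring them from the
    // retained copy first if the surface was lost.
    const Pixels& pixels_for_blit(size_t id) {
        assert(id < entries.size());
        Entry& entry = entries[id];
        if (entry.surface.lost) {
            entry.surface.vram = entry.retained;
            entry.surface.lost = false;
        }
        return entry.surface.vram;
    }
};
```

Game code only ever calls load() once per sprite sheet and can then forget about the pixel data, while the restoration remains an internal detail of the renderer.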
Anything more involved would be a micro-optimization, especially since the
goal is to get away from DirectDraw here. Not much point in "neatly"
reloading sprite surfaces from disk if the total size of all loaded sprite
sheets barely exceeds the 1 MiB mark. Also, keeping these GDI DIBs loaded
and initialized does speed up getting back into the game… in theory,
at least. After all, the game still runs in fullscreen mode, and resolution
switching already takes longer on modern flat-panel displays than any
surface restoration method we could come up with.
As you might have guessed, these exact colors come from Gates' face sprite,
whose palette apparently doesn't match the sprite sheets used in Stage 3.
Turns out that 256 colors are not enough for what Shuusou Gyoku would like
to use across the entire stage. In sprite loading order:
Additional unique colors
Total unique colors
General system sprites
Stage 3 enemies
Stage 3 map tiles
Wide Shot bomb cut-in
And that's why Shuusou Gyoku does not only have to retain these palettes,
but also contains stage
script commands (!) to switch the current palette back to either the map
or enemy one, after the dialog system enforced the face palette.
But the worst aspects about palettes rear their ugly head at the boundary
between GDI and DirectDraw, when GDI adds its own palettes into the mix.
None of the following points are clearly documented in either ancient or
current MSDN, forcing each new DirectDraw developer to figure them out on
their own:
When calling IDirectDraw::CreateSurface() in 8-bit mode,
DirectDraw automatically sets up the newly created surface with a reference
(not a copy!) to the palette that's currently assigned to the primary
surface.
When locking an 8-bit surface for GDI blitting via
IDirectDrawSurface::GetDC(), DirectDraw is supposed to set the
GDI palette of the returned DC to the current palette of the DirectDraw…
primary surface?! Not the surface you're actually calling
Interestingly, it took until March of this year for DxWnd to discover a
different game that relied on this detail, while DDrawCompat had
implemented it for years. DxWnd version 2.05.95 then introduced the
DirectX(2) → Fix DC palette tweak, and it's this option that would
fix the colors of the in-game music title on any Shuusou Gyoku build older
Make sure to never BitBlt() from a 24-bit RGB GDI
image to a palettized 8-bit DirectDraw offscreen surface. You might be
tempted to just go 24-bit because there's no palette to worry about and you
can retain a single GDI image for every supported bit depth, but the
resulting palette mapping glitches will be much worse than if you just
stayed in 8-bit. If you want to procedurally generate a GDI bitmap for a
DirectDraw surface, for example if you need to render text, just create
a bitmap that's compatible with the DC of DirectDraw's primary or
backbuffer surface. Doing that magically removes all palette woes, and
CreateCompatibleBitmap() is much easier to call anyway.
Ultimately, all of this is why Shuusou Gyoku's original DirectDraw backend
looks the way it does. It might seem redundant and inefficient in places,
but pbg did in fact discover the only way where all the undocumented GDI and
DirectDraw color mapping internals come together to make the game look as
And what else are you going to do if you want to target old hardware? My
PC-9821Nw133, for example, can only run the original Shuusou Gyoku in 8-bit
mode. For a Windows game on such old hardware, 8-bit DirectDraw looks like
the only viable option. You certainly don't want to use GDI alone, because
that's probably slow and you'd have to worry about even more palette-related
issues. Although people have reported that Shuusou Gyoku does actually
run faster on their old Windows 9x machine if they disable DirectDraw
In that case, it might be worth a try to write a completely new 8-bit
software renderer, employing the same retained VRAM techniques that the
PC-98 Touhou games used to implement their scrolling playfields with a
minimum of redraws. The hardware scrolling feature of the PC-98 GDC would
then be replicated by blitting the playfield in two halves every frame. I
wonder how fast that would be…
Or you go straight back to DOS, and bring your own font renderer and
MIDI/PCM sound driver.
So why did we have to learn about all this? Well, if GDI functions can
directly render onto any kind of DirectDraw surface, this also includes text
rendering functions like TextOut() and DrawText().
If you're really lazy, you can even render your text directly onto
the DirectDraw backbuffer, which probably re-rasterizes all glyphs
every frame.
Which, you guessed it, is exactly how Shuusou Gyoku renders most of its
text. 🐷 Granted, it's not too bad with MS Gothic thanks to its embedded
bitmaps for font
heights between 7 and 22 inclusive, which replace the usual Bézier curve
rasterization for TrueType fonts with a rather quick bitmap lookup. However,
it would not only become a hypothetical problem if future translations end
up choosing more complex fonts without embedded bitmaps, but also as soon as
we port the game to other systems. Nobody in their right mind would
integrate a cross-platform font renderer directly with a 3D graphics API… right?
Instead, let's refactor the game to render all its existing text to and from
extending the way the in-game music title is rendered to the rest of the
game. Conceptually, this is also how the Windows Touhou games have always
rendered their text. Since they've always used Direct3D, they've always had
to blit GDI's output onto a texture. Through the definitions in
text.anm, this fixed-size texture is then turned into a sprite
sheet, allowing every rendered line of text to be individually placed on the
screen and animated.
However, the static nature of both the sprite sheet and the texture caused
its fair share of problems for thcrap's translation support. Some of the
sprites, particularly the ones for spell card titles, don't originally take
up the entire width of the playfield, cutting off translations long before
they reach the left edge. Consequently, thcrap's base patch
for the Windows Touhou games has to resize the respective sprites to
make translators happy. Before I added .ANM header
patching in late 2018, this had to be done through a complete modified
copy of text.anm for every game – with possibly additional
variants if ZUN changed the layout of this file between game versions. Not
to mention that it's bound to be quite annoying to manually allocate a
rectangle for every line of text we want to show. After all, I have at least two text-heavy future
features in mind already…
So let's not do exactly that. Since DirectDraw wants us to manage all
surfaces in a central place, we keep the idea of using a single surface for
all text. But instead of predefining anything about the surface layout, we
fully build up the surface at runtime based on whatever rectangles we need,
using a rectangle
packing algorithm… yup, I wouldn't have expected to enter such territory
either. For now, we still hardcode a fixed size that each piece of text is
allowed to maximally take up. But once we get translations, nothing is
stopping us from dynamically extending this size to fit even longer strings,
and fitting them onto the fixed screen space via smooth scrolling.
To prevent the surface from arbitrarily growing as the game wants to render
more and more text, we also reset all allocated rectangles whenever the game
state changes. In turn, this will also recreate the text surface to match
the new bounding box of all rectangles before the first prerendering call
with the new layout. And if you remember the first bullet point about
DirectDraw palettes in 8-bit mode, this also means that the text surface
automatically receives the current palette of the primary surface, giving
us correct colors even without requiring DxWnd's DC palette tweak. 🎨
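To illustrate the idea of building the surface layout at runtime, here is a minimal sketch of a shelf-style rectangle packer. It is not the algorithm used in the game: the names `TextSurfaceLayout`, `Register()`, and `Clear()` are made up for this example, and the sketch only assumes that rectangles are registered once per game state and that the surface is then created at the size of the resulting bounding box.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Rect {
	int left, top, w, h;
};

// Hypothetical shelf packer: fills rows ("shelves") from left to right and
// opens a new shelf below once the current one is full.
class TextSurfaceLayout {
	int max_w;
	int shelf_top = 0; // Y position of the current shelf
	int shelf_h = 0;   // Height of the tallest rectangle on the shelf
	int cursor_x = 0;  // Next free X position on the shelf
	int bounds_w = 0;  // Width of the bounding box of all rectangles
	std::vector<Rect> rects;

public:
	explicit TextSurfaceLayout(int max_width) : max_w(max_width) {}

	// Allocates a new rectangle and returns its ID.
	std::size_t Register(int w, int h) {
		assert((w > 0) && (h > 0) && (w <= max_w));
		if((cursor_x + w) > max_w) {
			shelf_top += shelf_h;
			shelf_h = 0;
			cursor_x = 0;
		}
		rects.push_back({ cursor_x, shelf_top, w, h });
		cursor_x += w;
		shelf_h = std::max(shelf_h, h);
		bounds_w = std::max(bounds_w, cursor_x);
		return (rects.size() - 1);
	}

	const Rect& operator[](std::size_t id) const { return rects[id]; }

	// Size of the surface that would have to be (re)created.
	int Width() const { return bounds_w; }
	int Height() const { return (shelf_top + shelf_h); }

	// Meant to be called on every game state change.
	void Clear() {
		rects.clear();
		shelf_top = shelf_h = cursor_x = bounds_w = 0;
	}
};
```

Calling `Clear()` on a state change and re-registering the rectangles of the new state keeps the bounding box, and therefore the surface, from growing indefinitely.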
In fact, the need to dynamically create surfaces at custom sizes was the
main reason why I had to look into DirectDraw surface management to begin
with. The original game created
all of its surfaces at once, at startup or after changing the bit depth
in the main menu, which was a bad idea for many reasons:
It hardcoded and limited the size of all sprite sheets,
added another rendering-API-specific function that game code should not
need to worry about,
introduced surface IDs that have to be synchronized with the
surface pointers used throughout the rest of the game,
and was the main reason why the game had to distribute the six 320×240
ending pictures across two of the fixed 640×480 surfaces, which ended up
causing the sprite reload
bug in the ending. As implied in the issue, this was a DirectDraw bug
that pretty much had to fix itself before I could port the game to OpenGL,
and was the only bug where this was the case. Check the issue comments for
more details about this specific bug.
In the end, we get four different layouts for the text surface: One each for
the main menu, the Music Room, the in-game portion, and the ending. With,
perhaps surprisingly, not too much text on any of them:
Still, we're re-rasterizing whole lines of text exactly as they appear on
screen, and are even doing so multiple times to apply any drop shadows.
Isn't that exactly what every text rendering tutorial nowadays advises
against doing? Why not directly go for the classic solution to this problem
and render using a font texture atlas?
Most of the game text is still in Japanese. If we were to build a font
atlas in advance, we'd have to add a separate build step that collects all
needed codepoints by parsing all text the game would ever print, adding a
build-time dependency on the original game's copyrighted data files. We'd
also have to move all hardcoded strings to a separate file since we surely
don't want to parse C++ manually during said build step. Theoretically, we
would then also give up the idea of modding text at run-time without
re-running that build step, since we'd restrict all text to the glyphs we've
rasterized in the atlas… yeah, that's more than enough reasons for static
atlas generation to be a non-starter.
OK, then let's build the atlas dynamically, adding new glyphs as we
encounter them. Since this game is old, we can even be a bit lazy as far as
the packing is concerned, and don't have to get as fancy as the GIF in the
link above. Just assume a fixed height for each glyph, and fill the atlas
from left to right. We can even clear it periodically to keep it from
getting too big, like before entering the Music Room, the in-game portion,
or the ending, or after switching languages once we have translations.
Should work, right?
Except that most text in Shuusou Gyoku comes with a shadow, realized by
first drawing the same string in a darker color and displaced by a few
pixels. With a 3D renderer, none of this would be an issue because we can
define vertex colors. But we're still using DirectDraw, which has no way of
applying any sort of color formula – again, all it can do is take a
rectangle and blit it somewhere else. So we can't just keep one atlas with
white glyphs and let the renderer recolor it. Extending Shuusou Gyoku's
Direct3D code with support for textured quads is also out of the question
because then we wouldn't have any text in the Direct3D-less 8-bit mode. So
what do we do instead? Throw the atlas away on every color change? Keep
multiple atlases for every color we've seen so far? Turn shadows into a
high-level concept? Outright forgetting the idea seems to be the best choice.
For a rather square language like Japanese where one Shift-JIS codepoint
always corresponds to one glyph, a texture atlas can work fine and without
too much effort. But once we support languages with more complex ligatures,
we suddenly need to get a shaping
engine from somewhere, and directly interact with it from our rendering
code. This necessarily involves changing APIs and maybe even bundling the
first cross-platform libraries, which I wanted to avoid in an already packed
and long overdue delivery such as this one. If we continue to render
line-by-line, translations would only need a line break algorithm.
Most importantly though: It's not going to matter anyway. The
game ran fine on early 2000s hardware even though it called
TextOut() every frame, and any approach that caches the result
of this call is going to be faster.
While the Music Room and the ending can be easily migrated to a prerendering
system, it's much harder for the main menu. Technically, all option
strings of the currently active submenu are rewritten every frame, even
though that would only be necessary for the scrolling MIDI device name in
the Sound / Music submenu. And since all this rewriting is done
via a classic sprintf() on fixed-size char
buffers, we'd have to deploy our own change detection before prerendering
can make any performance difference.
In essence, we'd be shifting the text rendering paradigm from the original
immediate approach to a more retained one. If you've ever used any of the
hot new immediate-mode GUI or web frameworks that have become popular over
the last 10 years, your alarm bells are probably already ringing by now.
Adding retained elements is always a step back in terms of code quality, as
it increases complexity by storing UI state in a second place.
Wouldn't it be better if we could just stay with the original immediate
approach then? Absolutely, and we only need a simple cache system to get
there. By remembering the string that was last rendered to every registered
rectangle, the text renderer can offer an immediate API that combines the
distinct Prerender() and Blit() steps into a
single Render() call. There still has to be an initialization
point that registers all rectangles for each game state (which,
surprisingly, was not present for the in-game portion in the original code),
but the rendering code remains architecturally unchanged in how we call the
text renderer every frame. As long as the text doesn't change, the text
renderer just blits whatever it previously rendered to the respective
rectangle. With an API like this, the whole pre-rendering part turns into a
mere implementation detail.
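A minimal sketch of such a cache, under a few assumptions: `CachedTextRenderer`, `CountingBackend`, and all signatures are invented for this example, and the backend merely counts calls where the real code would rasterize via GDI and blit via DirectDraw.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the expensive rasterization and the cheap blit.
// A real backend would draw pixels; this one only counts calls.
struct CountingBackend {
	int prerenders = 0;
	int blits = 0;
	void Prerender(std::size_t, std::string_view) { ++prerenders; }
	void Blit(std::size_t, int, int) { ++blits; }
};

template <class Backend> class CachedTextRenderer {
	Backend& backend;
	std::vector<std::string> last_rendered; // One entry per rectangle
	std::vector<bool> valid;

public:
	explicit CachedTextRenderer(Backend& b) : backend(b) {}

	// Initialization point: registers one rectangle for the game state.
	std::size_t Register() {
		last_rendered.emplace_back();
		valid.push_back(false);
		return (last_rendered.size() - 1);
	}

	// Immediate-style call for every frame. Only rasterizes if the string
	// differs from what the rectangle currently shows.
	void Render(std::size_t id, std::string_view str, int x, int y) {
		assert(id < last_rendered.size());
		if(!valid[id] || (last_rendered[id] != str)) {
			backend.Prerender(id, str);
			last_rendered[id] = str;
			valid[id] = true;
		}
		backend.Blit(id, x, y);
	}
};
```

Game code keeps calling `Render()` every frame with whatever its `sprintf()` produced. The string comparison is the only change detection needed, and it lives entirely inside the renderer.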
So, how much faster is the result? Since I can only measure non-VSynced
performance in a quite rudimentary way using DxWnd's FPS counter, it highly
depends on the selected renderer. Weirdly enough, even just switching font
creation to the Unicode APIs tripled the FPS inside the Music Room
when rendering with OpenGL? That said, the primary surface renderer
seems to yield the most realistic numbers, as we still stay entirely within
DirectDraw and perform no API wrapping. Using this renderer, I get speedups of
~3.5× in the Music Room,
~1.9× during in-game dialog, and
~1.5× in the main menu.
Not bad for something I had to do anyway to port the game away from
DirectDraw! Shuusou Gyoku is rather infamous in the vintage computer
scene for being ridiculously unoptimized, so I should definitely be able to
get some performance gains out of the in-game portion as well.
For a final test of all the new blitting code, I also tried running
outside DxWnd to verify everything against real and unpatched
DirectDraw. Amusingly, this revealed how blitting from the new text surface
seems to reach the color mapping limits of the DWM mitigation in 8-bit mode:
8-bit mode did render correctly when I ran the same build in a Windows 98
VirtualBox on the same system though, so it's not worth looking into a mode
that the system reports as unsupported to begin with. Let's leave this as
somewhat of a visual reminder for players to select 32-bit mode instead.
Alright, enough about the annoying parts of GDI and DirectDraw for now.
Let's stop looking back and start looking forward, to a time within this
Seihou revolution when we're going to have lots of new options in the main
menu. Due to the nature of delivering individual pushes, we can expect lots
of revisions to the config file format. Therefore, we'd like to have a
backward-compatible system that allows players to upgrade from any older
build, including the original 秋霜玉.exe, to a newer one. The
original game predominantly used single-byte values for all its options, but
we'd like our system to work with variables of any size, including strings
to store things like the
name of the selected MIDI device in a more robust way. Also, it's pure
evil to reset the entire configuration just because someone tried to
hex-edit the config file and didn't keep the checksum in mind.
It didn't take long for me to arrive at a common
Size()/Read()/Write() interface. By
using the same interface for both arrays and individual values, new config
file versions can naturally expand older ones by taking the array of option
references from the previous version and wrapping it into a new array,
together with the new options.
The classic way of implementing this in C++ involves a typical
object-oriented class hierarchy: An Option base class would
define the interface in the form of virtual abstract functions, and the
Value, Array, and ConfigVersion
subclasses would provide different implementations. This works, but
introduces quite a bit of boilerplate, not to mention the runtime bloat from
all the virtual functions which Visual C++ can't inline. Why should we do
any runtime dispatch here? We know the set of configuration options
at compile time, after all…
Let's try looking into the modern C++ toolbox and see if we can do better.
The only real challenge here is that the array type has to support
arbitrarily sized option value types, which sounds like a job for
template parameter packs. If we save these into a
std::tuple, we can then "iterate" over all options with std::apply() and fold
expressions, in a nice functional style.
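The following sketch shows the general shape of this technique and is not the game's actual code. `Value` is borrowed from the class names mentioned above, while `Group` and all member signatures are assumptions made for this example. It requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <tuple>

// A single option that refers to a variable of any trivially copyable type.
template <class T> struct Value {
	T& ref;

	constexpr std::size_t Size() const { return sizeof(T); }
	void Write(std::uint8_t*& p) const {
		std::memcpy(p, &ref, sizeof(T));
		p += sizeof(T);
	}
	void Read(const std::uint8_t*& p) const {
		std::memcpy(&ref, p, sizeof(T));
		p += sizeof(T);
	}
};
template <class T> Value(T&) -> Value<T>;

// A group of options that offers the same interface. Since a group is an
// option itself, a newer config version can wrap the group of the previous
// version together with its new options.
template <class... Options> struct Group {
	std::tuple<Options...> options;

	constexpr Group(Options... opts) : options(opts...) {}

	constexpr std::size_t Size() const {
		return std::apply([](const auto&... o) {
			return (std::size_t{ 0 } + ... + o.Size());
		}, options);
	}
	void Write(std::uint8_t*& p) const {
		std::apply([&p](const auto&... o) { (o.Write(p), ...); }, options);
	}
	void Read(const std::uint8_t*& p) const {
		std::apply([&p](const auto&... o) { (o.Read(p), ...); }, options);
	}
};
```

All dispatch happens at compile time here: the fold expressions expand into one direct call per option, in declaration order, without any virtual functions.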
I was amazed by just how clearly the "crazy" modern C++ approach with
template parameter packs, std::apply() over giant
std::tuples, and fold expressions beats a classic polymorphic
hierarchy of abstract virtual functions. With the interface moved into an
even optional concept, the class hierarchy can be completely
flattened, which surprisingly also makes the code easier to both read and
Here's how the new system works from the player's point of view:
The config files now use a kanji-less and explicitly forward-compatible
naming scheme, starting with SSG_V00.CFG in the P0251 build.
The format of this initial version simply includes all values from the
original 秋霜CFG.DAT without padding bytes or a checksum. Once
we release a new build that adds new config options, we go up to
SSG_V01.CFG, and so on.
When loading, the game starts at its newest supported config file
version. If that file doesn't exist, the game retries with each older
version in succession until it reaches the last file in the chain, which is
always the original 秋霜CFG.DAT. This makes it possible to
upgrade from any older Shuusou Gyoku build to a newer one while retaining
all your settings – including, most importantly, which shot types you
unlocked the Extra Stage with. The newly introduced settings will simply
remain at their initial default in this case.
When saving, the game always writes all versions it knows about,
down to and including the original 秋霜CFG.DAT, in the
respective version-specific format. This means that you can change options
in a newer build and they'll show up changed in older builds as well if they
were supported there.
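The load and save rules above boil down to two loops over a list of file names. Here is a minimal sketch, assuming hypothetical `LoadNewest()` and `SaveAll()` helpers that receive the actual file I/O as callbacks. `SSG_V01.CFG` stands in for a future version that does not exist as of the P0251 build.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string_view>

// Newest version first. The last entry is always the original game's file.
constexpr std::array<std::string_view, 3> CONFIG_FILES = {
	"SSG_V01.CFG", // Hypothetical future version
	"SSG_V00.CFG",
	"秋霜CFG.DAT",
};

// Tries each version from newest to oldest and returns the index of the
// first one that `try_load` accepts, or nothing if no file could be loaded.
// `try_load` is a placeholder for the version-specific file reader.
inline std::optional<std::size_t> LoadNewest(
	const std::function<bool(std::size_t)>& try_load
) {
	for(std::size_t i = 0; i < CONFIG_FILES.size(); i++) {
		if(try_load(i)) {
			return i;
		}
	}
	return std::nullopt;
}

// Writes every known version so that older builds see changed options too.
inline void SaveAll(const std::function<void(std::size_t)>& save) {
	for(std::size_t i = 0; i < CONFIG_FILES.size(); i++) {
		save(i);
	}
}

// The original build fails validation on 32-bit, so the value written to
// its file is turned into 16-bit.
constexpr std::uint8_t BitDepthForOriginalFile(std::uint8_t bit_depth) {
	return ((bit_depth == 32) ? std::uint8_t{ 16 } : bit_depth);
}
```

Any option that is missing from the file that ends up being loaded simply keeps its default value.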
And yes, this also means that we can stop writing the unsupported 32-bit bit
depth setting to 秋霜CFG.DAT, which would cause a validation
failure on the original build. This is now avoided by simply turning 32-bit
into 16-bit just for the configuration that gets saved to this file. And
speaking of validation failures…
This build also contains two more fixes that didn't fit into the big
DirectDraw or configuration categories:
The P0226 build had a bug that allowed invalid stages to be selected for
replay recording. If the ReplaySave option was
[O F F], pressing the ⬅️ left arrow key on the
option would overflow its value to 255. The effects of this weren't all too
serious: The game would simply stay on the Weapon Select screen for an
invalid stage number, or launch into the Extra Stage if you scrolled all the
way to 131. Still, it's fixed in this build.
The render time for the in-game music title is now roughly cut in half:
These 6 pushes still left several of Shuusou Gyoku's DirectDraw portability
issues unsolved, but I'd better look at them once I've set up a basic OpenGL
skeleton to avoid any more premature abstraction. Since the ultimate goal is
a Linux port, I might as well already start looking at the current best
platform layer libraries. SDL would be the standard choice here, and while
SDL_ttf looks regrettably misdesigned, the core SDL library seems to cover
all we could possibly want for Shuusou Gyoku, including a 2D renderer… wait,
Yup. Admittedly, I've been living under a rock as far as SDL is concerned,
and thus wasn't aware that SDL 2 introduced its own abstraction for 2D
rendering that just happens to almost exactly cover everything we need
for Shuusou Gyoku. This API even covers all of the game's Direct3D code,
which only draws alpha-blended, untextured, and pre-transformed
vertex-colored triangles and lines. It's the exact abstraction over OpenGL I
thought I had to write myself, and such a perfect match for this game that
it would be foolish to go for a custom OpenGL backend – especially since SDL
will automatically target the ideal graphics API for any given operating system.
Sadly, the one thing SDL_Renderer is missing is something equivalent to
pixel shaders, which we would need to replicate the 西方
Ｐｒｏｊｅｃｔ lens ball effect shown at startup. Looks like we have
to drop into a
completely separate, unaccelerated rendering mode and continue to
software-render this one effect before switching to hardware-accelerated
rendering for the rest of the game. But at least we can do that in a
cross-platform way, and don't have to bother with shading languages –
or, perhaps even worse, SDL's own shading
If we were extremely pedantic, we'd also have to do the same for the
📝 unused spiral effect that was originally intended for the staff roll.
Software rendering would be even more annoying there, since we don't
just have to software-render these staff sprites, but also the ending
picture and text, complete with their respective fade effects. And while I
typically do go the extra mile to preserve whatever code was present in
these games, keeping this effect would just needlessly drive up the
cost of the SDL backend. Let's just move this one to the museum of unused
code and no longer actively compile it. RIP spiral 🥲 At least you're
still preserved in lossless video form.
Now that SDL has become an integral part of Shuusou Gyoku's portability plan
rather than just being one potential platform layer among many, the optimal
order of tasks has slightly changed. If we stayed within the raw Win32 API
any longer than absolutely necessary, we'd only risk writing more
Win32-native code for things like audio streaming that we'd
then have to throw away and rewrite in SDL later. Next up, therefore:
Staying with Shuusou Gyoku, but continuing in a much more focused manner by
fixing the input system and starting the SDL migration with input and sound.
Well, well. My original plan was to ship the first step of Shuusou Gyoku
OpenGL support on the next day after this delivery. But unfortunately, the
complications just kept piling up, to a point where the required solutions
definitely blow the current budget for that goal. I'm currently sitting on
over 70 commits that would take at least 5 pushes to deliver as a meaningful
release, and all of that is just rearchitecting work, preparing the
game for a not too Windows-specific OpenGL backend in the first place. I
haven't even written a single line of OpenGL yet… 🥲
This shifts the intended Big Release Month™ to June after all. Now I know
that the next round of Shuusou Gyoku features had better start with the
SC-88Pro recordings, which are much more likely to get done within their
current budget. At least I've already completed the configuration versioning
system required for that goal, which leaves only the actual audio part.
So, TH04 position independence. Thanks to a bit of funding for stage
dialogue RE, non-ASCII translations will soon become viable, which finally
presents a reason to push TH04 to 100% position independence after
📝 TH05 had been there for almost 3 years. I
haven't heard back from Touhou Patch Center about how much they want to be
involved in funding this goal, if at all, but maybe other backers are
interested as well.
And sure, it would be entirely possible to implement non-ASCII translations
in a way that retains the layout of the original binaries and can be easily
compared at a binary level, in case we consider translations to be a
critical piece of infrastructure. This wouldn't even just be an exercise in
needless perfectionism, and we only have to look to Shuusou Gyoku to realize
why: Players expected
that my builds were compatible with existing SpoilerAL SSG files, which
was something I hadn't even considered the need for. I mean, the game is
open-source 📝 and I made it easy to build.
You can just fork the code, implement all the practice features you want in
a much more efficient way, and I'd probably even merge your code into my
But I get it – recompiling the game yields just yet another build that can't
be easily compared to the original release. A cheat table is much more
trustworthy in giving players the confidence that they're still practicing
the same original game. And given the current priorities of my backers,
it'll still take a while for me to implement proof by replay validation,
which will ultimately free every part of the community from depending on the
original builds of both Seihou and PC-98 Touhou.
However, such an implementation within the original binary layout would
significantly drive up the budget of non-ASCII translations, and I sure
don't want to constantly maintain this layout during development. So, let's
chase TH04 position independence like it's 2020, and quickly cover a larger
amount of PI-relevant structures and functions at a shallow level. The only
parts I decompiled for now contain calculations whose intent can't be
clearly communicated in ASM. Hitbox visualizations or other more in-depth
research would have to wait until I get to the proper decompilation of these
But even this shallow work left us with a large amount of TH04-exclusive
code that had its worst parts RE'd and could be decompiled fairly quickly.
If you want to see big TH04 finalization% gains, general TH04 progress would
be a very good investment.
The first push went to the often-mentioned stage-specific custom entities
that share a single statically allocated buffer. Back in 2020, I
📝 wrongly claimed that these were a TH05 innovation,
but the system actually originated in TH04. Both games use a 26-byte
structure, but TH04 only allocates a 32-element array rather than TH05's
64-element one. The conclusions from back then still apply, but I also kept
wondering why these games used a static array for these entities to begin
with. You know what they call an area of memory that you can cleanly
repurpose for things? That's right, a heap!
And absolutely no one would mind one additional heap allocation at the start
of a stage, next to the ones for all the sprites and portraits.
However, we are still running in Real Mode with segmented memory. Accessing
anything outside a common data segment involves modifying segment registers,
which has a nonzero CPU cycle cost, and Turbo C++ 4.0J is terrible at
optimizing away the respective instructions. Does this matter? Probably not,
but you don't take "risks" like these if you're in a permanent
The 4 📝 bits used in Marisa's Stage 4 boss
fight. Coincidentally also related to the rare Divide Error
crash in that fight.
Stage 4 Reimu's spinning orbs. Note how the game uses two different sets
of sprites just to have two different outline colors. This was probably
better than messing with the palette, which can easily cause unintended
effects if you only have 16 colors to work with. Heck, I have an entire blog post tag just to highlight
these cases. Capped to the full 32 entities.
The chasing cross bullets, seen in Phase 14 of the same Stage 6 Yuuka
fight. Featuring some smart sprite work, making use of point symmetry to
achieve a fluid animation in just 4 frames. This is
good-code in sprite form. Capped to 31 entities, because
the 32nd custom entity during this fight is defined to be…
The single purple pulsating and shrinking safety circle, seen in Phase 4 of
the same fight. The most interesting aspect here is actually still related
to the cross bullets, whose spawn function is wrongly limited to 32 entities
and could theoretically overwrite this circle. This
is strictly landmine territory though:
Yuuka never uses these bullets and the safety circle at the same time
She never spawns more than 24 cross bullets
All cross bullets are fast enough to have left the screen by the
time Yuuka restarts the corresponding subpattern
The cross bullets spawn at Yuuka's center position, and assign its
Q12.4 coordinates to structure fields that the safety circle interprets
as raw pixels. The game does try to render the circle afterward, but
since Yuuka's static position during this phase is nowhere near a valid
pixel coordinate, it is immediately clipped.
The flashing lines seen in Phase 5 of the Gengetsu fight,
telegraphing the slightly random bullet columns.
These structures only took 1 push to reverse-engineer rather than the 2 I
needed for their TH05 counterparts because they are much simpler in this
game. The "structure" for Gengetsu's lines literally uses just a single X
position, with the remaining 24 bytes being basically padding. The only
minor bug I found on this shallow level concerns Marisa's bits, which are
clipped at the right and bottom edges of the playfield 16 pixels earlier
than you would expect:
The remaining push went to a bunch of smaller structures and functions:
The structure for the up to 2 "thick" (a.k.a. "Master Spark") lasers. Much
saner than the
📝 madness of TH05's laser system while being
equally customizable in width and duration.
The structure for the various monochrome 16×16 shapes in the background of
the Stage 6 Yuuka fight, drawn on top of the checkerboard.
The rendering code for the three falling stars in the background of Stage 5.
The effect here is entirely palette-related: After blitting the stage tiles,
the 📝 1bpp star image is ORed
into only the 4th VRAM plane, which is equivalent to setting the
highest bit in the palette color index of every pixel within the star-shaped
region. This of course raises the question of what the stage would look like
if it was fully illuminated:
Most code that modifies a stage's tile map, and directly specifies tiles via
their top-left offset in VRAM.
Thanks to code alignment reasons, this forced a much longer detour into the
.STD format loader. Nothing all too noteworthy there since we're still
missing the enemy script and spawn structures before we can call .STD
"reverse-engineered", but maybe still helpful if you're looking for an
overview of the format. Also features a buffer overflow landmine if a .STD
file happens to contain more than 32 enemy scripts… you know, the usual
To top off the second push, we've got the vertically scrolling checkerboard
background during the Stage 6 Yuuka fight, made up of 32×32 squares. This
one deserves a special highlight just because of its needless complexity.
You'd think that even a performant implementation would be pretty simple:
Set the GRCG to TDW mode
Set the GRCG tile to one of the two square colors
Start with Y as the current scroll offset, and X
as some indicator of which color is currently shown at the start of each row
Iterate over all lines of the playfield, filling in all pixels that
should be displayed in the current color, skipping over the other ones
Count down Y for each line drawn
If Y reaches 0, reset it to 32 and flip X
At the bottom of the playfield, change the GRCG tile to the other color,
and repeat with the initial value of X flipped
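The steps above can be sketched in portable C++ by simulating the playfield as a plain pixel buffer. This is only a model of the algorithm, not PC-98 code: the `Checkerboard` type is made up for this example, X flips between 0 and 32 here, and the GRCG tile change is represented by a mere counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int PLAYFIELD_W = 384; // 12 squares
constexpr int PLAYFIELD_H = 368; // 11.5 squares
constexpr int SQUARE = 32;

struct Checkerboard {
	std::vector<std::uint8_t> pixels; // One color index per pixel
	int tile_changes = 0;             // Simulated GRCG tile register writes

	Checkerboard() : pixels((PLAYFIELD_W * PLAYFIELD_H), 0xFF) {}

	// `scroll` is the number of lines until the first square boundary, in
	// the range [1, SQUARE].
	void Render(int scroll, std::uint8_t color_a, std::uint8_t color_b) {
		assert((scroll >= 1) && (scroll <= SQUARE));
		const std::uint8_t colors[2] = { color_a, color_b };
		for(int pass = 0; pass < 2; pass++) {
			++tile_changes; // The only state change within this pass
			int y = scroll;
			// X offset of the first square in the current color. The second
			// pass starts with the flipped value.
			int x = (pass * SQUARE);
			for(int line = 0; line < PLAYFIELD_H; line++) {
				// Fill every second square on this line, skip the others.
				for(int left = x; left < PLAYFIELD_W; left += (SQUARE * 2)) {
					for(int i = 0; i < SQUARE; i++) {
						const std::size_t offset = static_cast<std::size_t>(
							(line * PLAYFIELD_W) + left + i
						);
						pixels[offset] = colors[pass];
					}
				}
				if(--y == 0) {
					y = SQUARE;
					x ^= SQUARE; // Flip between 0 and SQUARE
				}
			}
		}
	}
};
```

Every pixel is written exactly once, and the tile register is only touched twice per frame regardless of how many squares are visible.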
The most important aspect of this algorithm is how it reduces GRCG state
changes to a minimum, avoiding the costly port I/O that we've identified
time and time again as one of the main bottlenecks in TH01. With just 2
state variables and 3 loops, the resulting code isn't that complex either. A
naive implementation that just drew the squares from top to bottom in a
single pass would barely be simpler, but much slower: By changing the GRCG
tile on every color, such an implementation would burn a low 5-digit number
of CPU cycles per frame for the 12×11.5-square checkerboard used in the
And indeed, ZUN retained all important aspects of this algorithm… but still
implemented it all in ASM, with a ridiculous layer of x86 segment arithmetic
on top? Which blows up the complexity to 4 state
variables, 5 nested loops, and a bunch of constants in unusual units. I'm
not sure what this code is supposed to optimize for, especially with that
rather questionable register allocation that nevertheless leaves one of the
general-purpose registers unused. Fortunately,
the function was still decompilable without too many code generation hacks,
and retains the 5 nested loops in all their goto-connected
glory. If you want to add a checkerboard to your next PC-98
demo, just stick to the algorithm I gave above.
(Using a single XOR for flipping the starting X offset between 32 and 64
pixels is pretty nice though, I have to give him that.)
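For reference, the general form of this trick looks as follows. `Flip()` is a name made up for this example, and the original ASM is not reproduced here. XORing with the XOR of both values toggles between them, as long as the variable holds one of the two.

```cpp
#include <cassert>

// Flips `x` between the two values `a` and `b` with a single XOR. Only
// valid if `x` is one of these two values.
constexpr int Flip(int x, int a, int b) {
	return (x ^ (a ^ b));
}

static_assert(Flip(32, 32, 64) == 64);
static_assert(Flip(64, 32, 64) == 32);
```

With constant `a` and `b`, the inner XOR folds into a single immediate (96 for 32 and 64), leaving one instruction at runtime.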
This makes for a good occasion to talk about the third and final GRCG mode,
completing the series I started with my previous coverage of the
📝 RMW and
📝 TCR modes. The TDW (Tile Data Write) mode
is the simplest of the three and just writes the 8×1 GRCG tile into VRAM
as-is, without applying any alpha bitmask. This makes it perfect for
clearing rectangular areas of pixels – or even all of VRAM by doing a single
// Set up the GRCG in TDW mode.
outportb(0x7C, 0x80);
// Fill the tile register with color #7 (0111 in binary).
outportb(0x7E, 0xFF); // Plane 0: (B): (********)
outportb(0x7E, 0xFF); // Plane 1: (R): (********)
outportb(0x7E, 0xFF); // Plane 2: (G): (********)
outportb(0x7E, 0x00); // Plane 3: (E): (        )
// Set the 32 pixels at the top-left corner of VRAM to the exact contents of
// the tile register, effectively repeating the tile 4 times. In TDW mode, the
// GRCG ignores the CPU-supplied operand, so we might as well just pass the
// contents of a register with the intended width. This eliminates useless load
// instructions in the compiled assembly, and even sort of signals to readers
// of this code that we do not care about the source value.
*reinterpret_cast<uint32_t far *>(MK_FP(0xA800, 0)) = _EAX;
// Fill the entirety of VRAM with the GRCG tile. A simple C one-liner that will
// probably compile into a single `REP STOS` instruction. Unfortunately, Turbo
// C++ 4.0J only ever generates the 16-bit `REP STOSW` here, even when using
// the `__memset__` intrinsic and when compiling in 386 mode. When targeting
// that CPU and above, you'd ideally want `REP STOSD` for twice the speed.
memset(MK_FP(0xA800, 0), _AL, ((640 / 8) * 400));
However, this might make you wonder why TDW mode is even necessary. If it's
functionally equivalent to RMW mode with a CPU-supplied bitmask made up
entirely of 1 bits (i.e., 0xFF, 0xFFFF, or
0xFFFFFFFF), what's the point? The difference lies in the
hardware implementation: If all you need to do is write tile data to
VRAM, you don't need the read and modify parts of RMW mode
which require additional processing time. The PC-9801 Programmers'
Bible claims a speedup of almost 2× when using TDW mode over equivalent
operations in RMW mode.
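The functional equivalence can be checked with a small software model of the two modes. This is a simplified sketch that looks at a single byte position across the four bitplanes. `Planes`, `WriteRMW()`, and `WriteTDW()` are names invented for this example, and the model says nothing about timing.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One VRAM byte position across the four bitplanes (B, R, G, E). The same
// type is used for the 8×1 GRCG tile.
using Planes = std::array<std::uint8_t, 4>;

// RMW mode: Bits set in the CPU-supplied mask are replaced with the
// corresponding tile bits, all other bits keep their previous VRAM value.
// This requires reading VRAM before writing.
inline Planes WriteRMW(Planes vram, const Planes& tile, std::uint8_t mask) {
	for(std::size_t p = 0; p < vram.size(); p++) {
		vram[p] = static_cast<std::uint8_t>(
			(vram[p] & ~mask) | (tile[p] & mask)
		);
	}
	return vram;
}

// TDW mode: The tile is written as-is. Both the previous VRAM contents and
// the CPU-supplied value are ignored.
inline Planes WriteTDW(const Planes&, const Planes& tile, std::uint8_t) {
	return tile;
}
```

With a mask of `0xFF`, `WriteRMW()` yields the same result as `WriteTDW()`, but only the former depends on the old VRAM contents, which is where the extra read comes from.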
And that's the only performance claim I found, because none of these old
PC-98 hardware and programming books did any benchmarks. Then again, it's
not too interesting of a question to benchmark either, as the byte-aligned
nature of TDW blitting severely limits its use in a game engine anyway.
Sure, maybe it makes sense to temporarily switch from RMW to TDW mode
if you've identified a large rectangular and byte-aligned section within a
sprite that could be blitted without a bitmask? But the necessary
identification work likely nullifies the performance gained from TDW mode,
I'd say. In any case, that's pretty deep
micro-optimization territory. Just use TDW mode for the
few cases it's good at, and stick to RMW mode for the rest.
So is this all that can be said about the GRCG? Not quite, because there are
4 bits I haven't talked about yet…
And now we're just 5.37% away from 100% position independence for TH04! From
this point, another 2 pushes should be enough to reach this goal. It might
not look like we're that close based on the current estimate, but a
big chunk of the remaining numbers are false positives from the player shot
control functions. Since we've got a very special deadline to hit, I'm going
to cobble these two pushes together from the two current general
subscriptions and the rest of the backlog. But you can, of course, still
invest in this goal to allow the existing contributions to go to something
… Well, if the store was actually open. So I'd better
continue with a quick task to free up some capacity sooner rather than
later. Next up, therefore: Back to TH02, and its item and player systems.
Shouldn't take that long, I'm not expecting any surprises there. (Yeah, I
know, famous last words…)
And so, the year unfortunately ended with yet another slow month. During the
MediaWiki upgrade, I was slowly decompiling the TH05 Sara fight on the side,
but stumbled over one interesting but high-maintenance detail there that
would really enhance her blog post. TH02 would need a lot of attention for
the basic rendering calls as well…
…so let's end the year with Shuusou Gyoku instead, looking at its most
critical issue in particular. As if that were the easy option here…
The game does not run properly on modern Windows systems due to its usage of
the ancient DirectDraw APIs, with issues ranging from unbearable slowdown to
glitched colors to the game not even starting at all. Thankfully, Shuusou
Gyoku is not the only ancient Windows game affected by these issues, and
people have developed a variety of generic DirectDraw wrappers and patches
for playing such games on modern systems. Out of all these, DDrawCompat is one of the
simpler solutions for Shuusou Gyoku in particular: Just drop its
ddraw proxy DLL into the game directory, and the game will run
as it's supposed to.
So let's just bundle that DLL with all my future Shuusou Gyoku releases
then? That would have been the quick and dirty option, coming with
Linux users might be annoyed by the potential need to configure a native
DLL override for ddraw.dll. It's not too much of an issue as we
could simply rename the DLL and replace the import with the new name.
However, doing that reproducibly would already involve changes to either the
DDrawCompat or Shuusou Gyoku build process.
Win32 API hooking is another potential point of failure in general,
requiring continual maintenance for new Windows versions. This is not even a
hypothetical concern: DDrawCompat does rely on particularly volatile Win32
API details, to the point that the recent Windows 11 22H2 update completely
broke it, causing a hang at startup that required a workaround.
But sure, it's still just a single third-party component. Keeping it up to
date doesn't sound too bad by itself…
…if DDrawCompat weren't evolving way beyond what we need to keep Shuusou
Gyoku running. Being a typical DirectDraw wrapper, it has always aimed to
solve all sorts of issues in old DirectDraw games. However, the latest
version, 0.4.0, has gone above and beyond in this regard, adding lots of
configuration options with default settings that actually
break Shuusou Gyoku.
To get a glimpse of how this is likely to play out, we only have to look at
the more mature DxWnd
project. In its expert mode, DxWnd features three rows of tabs, each packed
with checkboxes that toggle individual hacks, and most of these are
related to something that Shuusou Gyoku could be affected by. Imagine
checking a precise permutation of a three-digit number of checkboxes just to
keep an old game running at full speed on modern systems…
Finally, aesthetic and bloat considerations. If
📝 C++ fstreams were already too embarrassing
with the ~100 KB of bloat they add to the binary, a 565 KiB DLL is
even worse. And that's the old version 0.3.2 – version 0.4.0 comes in
at 2.43 MiB.
Fortunately, I had the budget to dig a bit deeper and figure out what
exactly DDrawCompat does to make Shuusou Gyoku work properly. Turns
out that among all the hooks and patches, the game only needs the most
central one: Enforcing a 32-bit display mode regardless of whatever lower
bit depth the game requests natively, combined with converting the game's
pixel buffer to 32-bit on the fly.
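To give a rough idea of what such an on-the-fly conversion involves, here is
a minimal sketch. It is not DDrawCompat's actual code: The function names are
made up, and it assumes that the game renders RGB565 pixels and that the
32-bit target uses an XRGB8888 layout.

```cpp
#include <cstdint>
#include <vector>

// Expands one RGB565 pixel to XRGB8888 (both layouts are assumptions). The
// high bits of each channel are replicated into the low bits, so that a
// full-intensity channel maps to 0xFF rather than 0xF8 or 0xFC.
constexpr uint32_t rgb565_to_xrgb8888(uint16_t p)
{
	const uint32_t r5 = ((p >> 11) & 0x1F);
	const uint32_t g6 = ((p >> 5) & 0x3F);
	const uint32_t b5 = (p & 0x1F);
	const uint32_t r8 = ((r5 << 3) | (r5 >> 2));
	const uint32_t g8 = ((g6 << 2) | (g6 >> 4));
	const uint32_t b8 = ((b5 << 3) | (b5 >> 2));
	return ((r8 << 16) | (g8 << 8) | b8);
}

// Converts a whole frame. A real wrapper would do this on every present.
std::vector<uint32_t> convert_frame(const std::vector<uint16_t>& src)
{
	std::vector<uint32_t> dst;
	dst.reserve(src.size());
	for(const uint16_t p : src) {
		dst.push_back(rgb565_to_xrgb8888(p));
	}
	return dst;
}
```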
So does this mean that adding 32-bit to the game's list of supported bit
depths is everything we have to do?
Well, almost everything. Initially, this surprised me as well: With
all the if statements checking for precise bit depths, you
would think that supporting one more bit depth would be way harder in this
code base. As it turned out though, these conditional branches are not
really about 8-bit or 16-bit color for the most part, but instead
differentiate between two very distinct rendering approaches:
"8-bit" is a pure 2D mode with palettized colors,
while "16-bit" is a hybrid 2D/3D mode that uses Direct3D 2 on top of DirectDraw, with
3-channel RGB colors.
Consequently, most of these branches deal with differences between these two
approaches that couldn't be nicely abstracted away in pbg's renderer
interface: Specific palette changes that are exclusive to "8-bit" mode, or
certain entities and effects whose Direct3D draw calls in "16-bit" mode
require tailor-made approximations for the "8-bit" mode. Since our new
32-bit mode is equivalent to the 16-bit mode in all of these branches, I
only needed to replace the raw number comparisons with more meaningful
That only left a very small number of 2D raster effects that directly write
to or read from DirectDraw surface memory, and therefore do need to know the
bit size of each pixel. Thanks to std::variant and
std::visit(), adding 32-bit support becomes trivial here: By
rewriting the code in a generic manner that derives all offsets from the
template type, you only have to say "hey, I'd like to have 32-bit as
well", and C++ will automatically instantiate correct 32-bit variants of
all bit depth-dependent code.
There are only three features in the entire game that access pixel buffers
this way: a color key retrieval function, the lens ball animation on the
logo screen, and… the ending staff roll? Sure, the text sprites fade in and
out, but so does the picture next to it, using Direct3D alpha blending or
palette color ramping depending on the current rendering mode. Instead, the
only reason why these sprites directly access their pixel buffer is… an
unused and pretty wild spiral effect. 😮 It's still part of the code, and
only doesn't show up because the
parameters that control its timing were commented out before release:
Alright, 32-bit mode complete, let's set it as the default if possible… and
break compatibility to the original 秋霜CFG.DAT format in the
process? When validating this file, the original game only allows the
originally supported 8-bit or 16-bit modes. Setting the
BitDepth field to any other value causes the entire file
to be reset to its defaults, re-locking the Extra Stage in the process.
Introducing a backward-compatible version
system for 秋霜CFG.DAT was beyond the scope of this push.
Changing the validation to a per-field approach was a good small first step
to take though. The new build no longer validates the BitDepth
field against a fixed list, but against the actually supported bit depths on
your system, picking a different supported one if necessary. With the
original approach, this would have caused your entire configuration to fail
the validation check. Instead, you can now safely update to the new build
without losing your option settings, or your previously unlocked access to
the Extra Stage.
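The difference between the two approaches can be sketched like this. The
struct is a hypothetical two-field subset of the configuration, not the real
file layout, and falling back to the first supported depth is just one
possible policy:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical subset of the configuration; field names are assumptions.
struct Config {
	uint8_t bit_depth;
	uint8_t extra_stage_flags;
};

// Per-field validation: Only an unsupported bit depth gets replaced, and all
// other fields survive. `supported_depths` would come from enumerating the
// display modes of the current system.
Config validated(Config cfg, const std::vector<uint8_t>& supported_depths)
{
	const bool depth_ok = (std::find(
		supported_depths.begin(), supported_depths.end(), cfg.bit_depth
	) != supported_depths.end());
	if(!depth_ok && !supported_depths.empty()) {
		cfg.bit_depth = supported_depths.front();
	}
	return cfg;
}
```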
Side note: The validation limit for starting bombs is off by one, and the
one for starting lives is off by two. By modifying
秋霜CFG.DAT, you could theoretically get new games to start with
7 lives and 3 bombs… if you then calculate a correct checksum for your
hacked config file, that is. 🧑💻
Interestingly, DirectDraw doesn't even indicate support for 8-bit or 16-bit
color on systems that are affected by the initially mentioned issues.
Therefore, these issues are not the fault of DirectDraw, but of
Shuusou Gyoku, as the original release requested a bit depth that it has
even verified to be unsupported. Unfortunately, Windows sides with
Shuusou Gyoku here, as it once did with Sim City: If you previously
experimented with the
Windows app compatibility settings, you might have ended up with the
DWM8And16BitMitigation flag assigned to the full file path of
your Shuusou Gyoku executable in either
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Layers, or
As the term mitigation suggests, these modes are (poorly) emulated,
which is exactly what causes the issues with this game in the first place.
Sure, this might be the lesser evil from the point of view of an operating
system: If you don't have the budget for a full-blown DDrawCompat-style
DirectDraw wrapper, you might consider it better for users to have the game
run poorly than have it fail at startup due to incorrect API usage.
Controlling this with a flag that sticks around for future runs of a binary
is definitely suboptimal though, especially given how hard it
is to programmatically remove this flag within the binary itself. It
only adds additional complexity to the ideal clean upgrade path.
So, make sure to check your registry and manually remove these flags for the
time being. Without them, the new Config → Graphic menu will
correctly prevent you from selecting anything else but 32-bit on modern
After all that, there was just enough time left in this push to implement
basic locale independence, as requested by the Seihou development
Discord group, without looking into automatic fixes for previous mojibake
filenames yet. Combining std::filesystem::path with the native
Win32 API should be straightforward and bloat-free, especially with all the
abstractions I've been building, right?
Well, turns out that std::filesystem::path does not
actually meet my expectations. At least as long as it's not
constexpr-enabled, because you still get the unfortunate
conversion from narrow to wide encoding at runtime, even for globals with
static storage duration. That brings us back to writing our path abstraction
in terms of the regular std::string and
std::wstring containers, which at least allow us to enforce the
respective encoding at compile time. Even std::string_view only
adds to the complexity here, as its strings are never inherently
null-terminated, which is required by both the POSIX and Win32 APIs. Not to
mention dynamic filenames: C++20's std::format() would be the
obvious idiomatic choice here, but using it almost doubles the size
of the compiled binary… 🤮
In the end, the most bloat-free way of implementing C++ file I/O in 2023 is
still the same as it was 30 years ago: Call system APIs, roll a custom
abstraction that conditionally uses the L prefix, and pass
around raw pointers. And if you need a dynamic filename, just write the
dynamic characters into arrays at fixed positions. Just as PC-98 Touhou used
Oh, and the game's window also uses a Unicode title bar now.
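To illustrate the conditional L prefix and the fixed-position
approach to dynamic filenames from above, here is a minimal sketch. The
PATH_IS_WIDE switch, the PATH_LITERAL macro, and the ASCII
filename are all made up for this example and don't match the actual code:

```cpp
#include <cstddef>

// `PATH_IS_WIDE` is a hypothetical switch that a Windows build would define
// to get UTF-16 literals. Everything else stays narrow.
#ifdef PATH_IS_WIDE
	using pathchar = wchar_t;
	#define PATH_LITERAL(s) L##s
#else
	using pathchar = char;
	#define PATH_LITERAL(s) s
#endif

// Filename template with a placeholder digit at a known position.
constexpr pathchar REPLAY_FN_TEMPLATE[] = PATH_LITERAL("REPLAY_0.DAT");
constexpr size_t REPLAY_FN_LENGTH = (
	sizeof(REPLAY_FN_TEMPLATE) / sizeof(REPLAY_FN_TEMPLATE[0])
);
constexpr size_t REPLAY_FN_DIGIT_POS = 7;

struct ReplayFilename {
	pathchar buf[REPLAY_FN_LENGTH]; // includes the null terminator

	explicit ReplayFilename(unsigned int slot) {
		for(size_t i = 0; i < REPLAY_FN_LENGTH; i++) {
			buf[i] = REPLAY_FN_TEMPLATE[i];
		}
		// Overwrite the placeholder with the last decimal digit of [slot].
		buf[REPLAY_FN_DIGIT_POS] = static_cast<pathchar>(
			PATH_LITERAL('0') + (slot % 10)
		);
	}

	// Null-terminated, and therefore usable with both POSIX and Win32 APIs.
	const pathchar* c_str() const {
		return buf;
	}
};
```

No heap allocation, no formatting library, and the encoding is decided at
compile time.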
And that's it for this push! Make sure to rename your configuration
(秋霜CFG.DAT), score (秋霜SC.DAT), and replay
(秋霜りぷ*.DAT) filenames if you were previously running the
game on a non-Japanese locale, and then grab the new build:
Next up: Starting the new year with all my plans hopefully working out for
once. TH05 Sara very soon, ZMBV code review afterward, low-hanging fruit of
the TH01 Anniversary Edition after that, and then kicking off TH02 with a
bunch of low-level blitting code.
Thanks to handlerug for
implementing and PR'ing the feature in a very clean way. That makes at least
two people I know who wanted to see feed support, so there are probably
a few more out there.
So, Shuusou Gyoku. pbg released the original source code for the first two
Seihou games back in February 2019, but notably removed the crucial
decompression code for the original packfiles due to… various unspecified
reasons, considerations, and implications. This vague
language and subsequent rejection of a pull request
to add these features back in were probably the main reasons why no one
has publicly done anything with this codebase since.
The only other fork I know about is Priw8's private fork from 2020, but only
informed me about it shortly after this push was funded. Both of them
might also contribute some features to my fork in the future if their time
In this fork, Priw8 replaced packfile decompression with raw reads from
directories with the pre-extracted contents of all the .DAT files. This
works for playing the game, but there are actually two more things that
require the original packfile code:
High scores are stored as a bitstream with every variable separated by
an alternating 0 or 1 bit, using the same bit-level access functions as the
packfile reader. That's a quite… unique form of obfuscation: It requires way
too much code to read and write the format, and doesn't even obfuscate the
data that well because you can still see clear patterns when opening
these scorefiles in a hex editor.
Replays are 2-"file" archives compressed using the same algorithm as the
packfile. The first "file" contains metadata like the shot type, stage, and
RNG seed, and the second one contains the input state for every frame.
We can surely implement our own simple and uncompressed formats for these
things, but it's not the best idea to build all future Shuusou Gyoku
features on top of a replay-incompatible fork. So, what do we do? On the one
hand, pbg expressed the clear wish to not include data reverse-engineered
from the original binary. On the other hand, he released the code under the
MIT license, which allows us to modify the code and distribute the results
in any way we wish.
So, let's meet in the middle, and go for a clean-room implementation of the
missing features as indicated by their usage, without looking at either the
original binary or wangqr's reverse-engineered code.
With incremental rebuilds being broken in the latest Visual Studio project
files as well, it made sense to start from scratch on pbg's last commit. Of
course, I can't pass up a chance to use
📝 Tup, my favorite build system for every
project I'm the main developer of. It might not fit Shuusou Gyoku as well as
it fits ReC98, but let's see whether it would be reasonable at all…
… and it's actually not too bad! Modern Visual Studio makes this a bit
harder than it should be with all the intermediate build artifacts you have
to keep track of. In the end though, it's still only 70
lines of Lua to have a nice abstraction for both Debug and Release
builds. With this layer underneath, the actual
Shuusou Gyoku-specific part can be expressed as succinctly as in any
other modern build system, while still making every compiler flag explicit.
It might be slightly slower than a traditional .vcxproj build
due to launching
one cl.exe process per translation unit, but the result is
way more reliable and trustworthy compared to anything that involves Visual
Studio project files. This simplicity paves the way for expanding the build
process to multiple steps, and doing all the static checking on translation
strings that I never got to do for thcrap-based patches. Heck, I might even
compile all future translations directly into the binary…
Every C++ build system will invariably be hated by someone, so I'd
say that your goal should always be to simplify the actually important parts
of your build enough to allow everyone else to easily adapt it to their
favorite system. This Tupfile definitely does a better job there than your
average .vcxproj file – but if you still want such a thing (or,
gasp, 🤮 CMake project files 🤮) for better Visual Studio IDE
integration, you should have no problem generating them for yourself.
There might still be a point in doing that because that's the one part that
unfortunately sucks about this approach. Visual Studio is horribly broken
for any nonstandard C++ project even in 2022:
Makefile projects can be nicely integrated with Debug and Release
configurations, but setting a later C++ language standard requires dumb
.vcxproj hacks that don't even work properly anymore.
Folder projects are ridiculously ugly: The Build toolbar is permanently
grayed out even if you configured a build task. For some reason,
configuring these tasks merely adds one additional element to a 9-element
context menu in the Solution Explorer. Also, why does the big IDE use a
different JSON schema than the perfectly functional and adequate one from
Visual Studio Code?
In both cases, IntelliSense doesn't work properly at all even if it
appears to be configured correctly, and Tup's dependency tracking appeared
to be weirdly cut off for the very final .PDB file. Interestingly though,
using the big Visual Studio IDE for just debugging a binary via
devenv bin/GIAN07.exe suddenly eliminates all the IntelliSense
issues. Looks like there's a lot of essential information stored in the .PDB
files that Visual Studio just refuses to read in any other context.
But now compare that to Visual Studio Code: Open it from the x64_x86
Cross Tools Command Prompt via code ., launch a build or
debug task, or browse the code with perfect IntelliSense. Three small
configuration files and everything just works – heck, you even get the Tup
progress bar in the terminal. It might be Electron bloatware and horribly
slow at times, but Visual Studio Code has long outperformed regular Visual
Studio in terms of non-debug functionality.
On to the compression algorithm then… and it's just textbook LZSS,
with 13 bits for the offset of a back-reference and 4 bits for its length?
Hardly a trade secret there. The hard parts all come from unexpected
inefficiencies in the bitstream format:
Encoding back-references as offsets into an 8 KiB ring buffer dictionary
means that the most straightforward implementation actually needs an 8 KiB
array for the LZSS sliding window. This could have easily been done with
zero additional memory if the offset had been encoded as the difference to the
current byte instead.
The packfile format stores the uncompressed size of every file in its
header, which is a good thing because you want to know in advance how much
heap memory to allocate for a specific file. Nevertheless, the original game
only stops reading bits from the packfile once it has encountered a
back-reference with an offset of 0. This means that the compressor not only
has to write this technically unneeded back-reference to the end of the
compressed bitstream, but also ignore any potential other longest
back-reference with an offset of 0 within the file. The latter can
easily happen with a ring buffer dictionary.
The original game used a single BIT_DEVICE class with mode
flags for every combination of reading and writing memory buffers and
on-disk files. Since that would have necessitated a lot of error checking
for all (pseudo-)methods of this class, I wrote one dedicated small class
for each one of these permutations instead. To further emphasize the
clean-room property of this code, these use modern C++ memory ownership
features: std::unique_ptr for the fixed-size read-only buffers
we get from packfiles, std::vector for the newly compressed
buffers where we don't know the size in advance, and std::span
for a borrowed reference to an immutable region of memory that we want to
treat as a bitstream. Definitely better than using the native Win32
LocalAlloc() and LocalFree() allocator, especially
if we want to port the game away from Windows one day.
One feature I didn't use though: C++ fstreams, because those are trash.
These days, they would seem to be the natural
choice with the new std::filesystem::path type from C++17:
Correctly constructed, you can pass that type to an fstream constructor and
gain both locale independence on Windows and portability to
everything else, without writing any Windows-specific UTF-16 code. But even
in a Release build, fstreams add ~100 KB of locale-related bloat to the .EXE,
which adds no value for just reading binary files. That's just too
embarrassing if you look at how much space the rest of the game takes up.
Writing your own platform layer that calls the Win32
CreateFileW(), ReadFile(), and
WriteFile() API functions is apparently still the way to go
even in 2022. And with std::filesystem::path still being a
welcome addition to C++, it's not too much code to write either.
This gets us file format compatibility with the original release… and a
crash as soon as the ending starts, but only in Release mode? As it turns
out, this crash is caused by an out-of-bounds
access bug that was present even in the original game, and only turned
into a crash now because the optimizer in modern Visual Studio versions
reorders static data. As a result, the 6-element pFontInfo
array got placed in front of an ECL-related counter variable that then got
corrupted by the write to the 7th element, which subsequently
crashed the game with a read access to previously deallocated danmaku script
data. That just goes to show that these technical bugs are important
and worth fixing even if they don't cause issues in the original game. Who
knows how many of these will turn into crashes once we get to porting PC-98
So here we go, a new build of Shuusou Gyoku, compiled with Visual Studio
2022, and compatible with all original data formats:
Inside the regular Shuusou Gyoku installation directory, this binary works
as a full-fledged drop-in replacement for the original
秋霜玉.exe. It still has all of the original binary's problems:
Separate Japanese locale emulation is still needed to correctly refer to
the original names of the configuration (秋霜CFG.DAT), score
(秋霜SC.DAT), and replay (秋霜りぷ*.DAT) files.
It's also required for the ending text to not render as mojibake.
Running the game at full speed and without graphical glitches on modern
Windows still requires a separate DirectDraw patch such as DDrawCompat. To
eliminate any remaining flickering, configure the game to use 16-bit
graphics in the Config → Graphic menu.
As well as some of its own:
The original screenshot feature is still missing, as it also wasn't part
of pbg's released source code.
So all in all, it's a strict downgrade at this point in time.
And more of a symbol that we can now start
doing actual work on this game. Seihou has been a fun change of pace, and I
hope that I get to do more work on the series. There is quite a lot to be
done with Shuusou Gyoku alone, and the 21 GitHub issues I've opened
are probably only scratching the surface.
However, all the required research for this one consumed more like 1⅔
pushes. Despite just one push being funded, it wouldn't have made sense to
release the commits or this binary in any earlier state. To repay this debt,
I'm going to put the next for Seihou towards the
small code maintenance and performance tasks that I usually do for free,
before doing any more feature and bugfix work. Next up: Improving video
playback on the blog, and maybe delivering some microtransaction work on the