
💰 Funded by: Ember2528, [Anonymous]

And then, the Shuusou Gyoku renderer rewrite escalated to another 10-push monster that delayed the planned Seihou Summer™ straight into mid-fall. Guess that's just how things go these days at my current level of quality. Testing and polish made up half of the development time of this new build, which probably doesn't surprise anyone who has ever dealt with GPUs and drivers…

  1. Codebase cleanup and portability
  2. Rolling out C++ Standard Library Modules
  3. Pearls uncovered by static analysis (← start here if you don't particularly care about C++)
  4. Finishing the new graphics architecture (and discovering every remaining 8-bit/16-bit inconsistency)
  5. Starting the SDL port with the simple things
  6. Porting the lens ball effect
  7. Adding scaled window and borderless fullscreen modes
  8. Screenshots, and how they force us to rethink scaling
  9. Achieving subpixel accuracy
  10. Future work
  11. Recreating the Sound Canvas VA BGM packs (now with panning delay)

But first, let's finally deploy C++23 Standard Library Modules! I've been waiting for the promised compile-time improvements of modules for 4 years now, so I was bound to jump at the very first possible opportunity to use them in a project. Unfortunately, MSVC further complicates such a migration by adding one particularly annoying proprietary requirement:

… which means we're now faced with hundreds of little warnings and C++ Core Guidelines violations from pbg's code. Sure, we could just disable all warnings when compiling pbg's source files and get on with rolling out modules, because they would still count as "statically analyzed" in this case. :tannedcirno: But that's silly. As development continues and we write more of our own modern code, more and more of it will invariably end up within pbg's files, merging and intertwining with the original game code. Therefore, not analyzing these files is bound to leave more and more potential issues undetected. Heck, I had already committed a static initialization order fiasco by accident that only turned into an actual crash halfway through the development of these 10 pushes. Static analysis would have caught that issue.
So let's meet in the middle. Focus on a sensible subset of warnings that we would appreciate in our own code or that could reveal bugs or portability issues in pbg's code, but disable anything that would lead to giant and dangerous refactors or that won't apply to our own code. For example, it would sure be nice to rewrite certain instances of goto spaghetti into something more structured, but since we ourselves won't use goto, it's not worth worrying about within a porting project.

After deduplicating lots of code to reduce the sheer number of warnings, the single biggest remaining group of issues were the C-style casts littered throughout the code. These combine the unconstrained unsafety of C with the fact that most of them use the classic uppercase integer types from <windows.h>, adding a further portability aspect to this class of issues.
Perhaps the biggest problem with them, however, is that a cast is a unary operator with its own place in the precedence hierarchy. If you don't surround casts with even more brackets to indicate the exact order of operations, you can confuse and mislead the hell out of anyone trying to read your code. This is how we end up with the single most devious piece of arithmetic I've found in this game so far:

BYTE d = (BYTE)(t->d+4)/8;	// 修正 8 ごとで 32 分割だからズラシは 4 (adjustment: 32 divisions of 8 each, so the offset is 4)
t->d is a BYTE as well.

If you don't look at vintage C code all day, this cast looks redundant at first glance. Why would you separately cast the result of this expression to the type of the receiving variable? However, casting has higher precedence than division, so the code actually downcasts the dividend, (t->d+4), not the result of the division. And why would pbg do that? Because the regular, untyped 4 is implicitly an int, C promotes t->d to int as well, thus avoiding the intended 8-bit overflow. If t->d is 252, removing the cast would therefore result in ((int{ 252 } + int{ 4 }) / 8) = 256 / 8 = 32, not the 0 we wanted to have. And since this line is part of the sprite selection for VIVIT-captured-'s feather bullets, omitting the cast has a visible effect on the game:

A circle of VIVIT-captured-'s feather bullets as shown shortly after the beginning of her final form, but with one feather bullet turned into a "YZ" sprite due to a removed downcast of the dividend in the angle calculation
A circle of VIVIT-captured-'s feather bullets as shown shortly after the beginning of her final form, rendered correctly
The first file in GRAPH.DAT explains what we're seeing here.

So let's add brackets and replace the C-style cast with a C++ static_cast to make this more readable:

const auto d = (static_cast<uint8_t>(t->d + 4) / 8);

But that only addresses the precedence pitfall and doesn't tell us why we need that cast in the first place. Can we be more explicit?

const auto d = (((t->d + 4) & 0xFF) / 8);

That might be better, but still assumes familiarity with integer promotion for that mask to not appear redundant. What's the strongest way we could scream integer promotion to anyone trying to touch this code?

const auto d = (Cast::down_sign<uint8_t>(t->d + 4) / 8);
Of course, I also added a lengthy comment above this line.

Now we're talking! Cast::down_sign() uses static_asserts to enforce that its argument must be both larger and differently signed than the target type inside the angle brackets. This unmistakably clarifies that we want to truncate a promoted integer addition because the code wouldn't even compile if the argument was already a uint8_t. As such, this new set of casts I came up with goes even further in terms of clarifying intent than the gsl::narrow_cast() proposed by the C++ Core Guidelines, which is purely informational.
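
For illustration, here's a minimal sketch of how such a cast could be built; the actual Cast::down_sign() in the codebase may well be spelled differently:

namespace Cast {
	// Sketch only: refuse to compile unless the argument is both larger and
	// differently signed than the target type, then truncate. Assumes
	// `import std;` (or <type_traits>) for the type traits.
	template <class To, class From> constexpr To down_sign(From v) noexcept {
		static_assert(sizeof(To) < sizeof(From), "not a narrowing cast");
		static_assert(std::is_signed_v<To> != std::is_signed_v<From>, "not a sign-changing cast");
		return static_cast<To>(v);
	}
}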

OK, so replacing C-style casts is better for readability, but why care about it during a porting project? Wouldn't it be more efficient to just typedef the <windows.h> types for the Linux code and be done with it? Well, the ECL and SCL interpreters provide another good reason not to do that:

case(ECL_SETUP): // 敵の初期化 (enemy initialization)
	e->hp    = *(DWORD *)(&cmd[1]);
	e->score = *(DWORD *)(&cmd[1+4]);

In these instances, the DWORD type communicates that this codebase originally targeted Windows, and implies that the cmd buffer stores these 32-bit values in little-endian format. Therefore, replacing DWORD with the seemingly more portable uint32_t would actually be worse as it no longer communicates the endianness assumption. Instead, let's make the endianness explicit:

case(ECL_SETUP): // 敵の初期化 (enemy initialization)
+	e->hp    = U32LEAt(&cmd[1 + 0]);
+	e->score = U32LEAt(&cmd[1 + 4]);
No surprises once we port this game to a big-endian system – and much fewer characters than a pedantic reinterpret_cast, too.
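
In case you're wondering what such a read function might look like: assembling the value from individual bytes is endian-independent by construction. This is only a sketch under that assumption; the helper in the actual codebase might be implemented differently.

// Reads a little-endian 32-bit value from a byte buffer, regardless of the
// endianness of the system we're running on. Assumes <cstdint> / `import std;`.
constexpr uint32_t U32LEAt(const uint8_t *p) noexcept {
	return (
		(uint32_t{ p[0] } <<  0) |
		(uint32_t{ p[1] } <<  8) |
		(uint32_t{ p[2] } << 16) |
		(uint32_t{ p[3] } << 24)
	);
}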

With that and another pile of improvements for my Tup building blocks, we finally get to deploy import std; across the codebase, and improve our build times by…
…not exactly the mid-three-digit percentages I was hoping for. Previously, a full parallel compilation of the Debug build took roughly 23.9s on my 6-year-old 6-core Intel Core i5-8400T. With modules, we now need to compile the C++ standard library a single time on every from-scratch rebuild or after a compiler version update, which adds an unparallelizable ~5.8s to the build time. After that though, all C++ code compiles within ~12.4s, yielding a still decent 92% speedup for regular development. 🎉 Let's look more closely into these numbers and the resulting state of the codebase:

But in the end, the halved compile times during regular development are well worth sacrificing IntelliSense for the time being… especially given that I am the only one who has to live in this codebase. 🧠 And besides, modules bring their own set of productivity boosts to further offset this loss: We can now freely use modern C++ standard library features at a minuscule fraction of their usual compile time cost, and get to cut down the number of necessary #include directives. Once you've experienced the simplicity of import std;, headers and their associated micro-optimization of #include costs immediately feel archaic. Try the equally undocumented /d1reportTime flag to get an idea of the compile time impact of function definitions and template instantiations inside headers… I've definitely been moving quite a few of those to .cpp files within these 10 pushes.

However, it still felt like the earliest possible point in time where doing this was feasible at all. Without LSP support, modules still feel way too bleeding-edge for a feature that was added to the C++ standard 4 years ago. This is why I only chose to use them for covering the C++ standard library for now, as we have yet to see how well GCC or Clang handle it all for the Linux port. If we run into any issues, it makes sense to polyfill any workarounds as part of the Tup building blocks instead of bloating the code with all the standard library header inclusions I'm so glad to have gotten rid of.
Well, almost all of them: We still have to #include <assert.h> and <stdlib.h>, because modules can't expose preprocessor macros and C++23 offers no macro-less alternative to assert() and offsetof(). 🤦 [[assume()]] exists, but it's the exact opposite of assert(). How disappointing.


As expected, static analysis also brought a small number of pbg code pearls into focus. This list would have fit better into the static analysis section, but I figured that my audience might not necessarily care about C++ all that much, so here it is:


Alright, on to graphics! With font rendering and surface management mostly taken care of last year, the main focus for this final stretch was on all the geometric shapes and color gradients. pbg placed a bunch of rather game-specific code in the platform layer directly next to the Direct3D API calls, including point generation for circles and even the colors of gradient rectangles, gradient polygons, and the Music Room's spectrum analyzer. We don't want to duplicate any of this as part of the new SDL graphics layer, so I moved it all into a new game-level geometry system. By placing the 8-bit and 16-bit implementations right next to each other, this new system also draws more attention to the different approaches used at each bit depth.
So far, so boring. Said differences themselves are rather interesting though, as this refactor uncovered all of the remaining inconsistencies between the two modes:

  1. In 8-bit mode, the game draws circles by writing pixels along the accurate outline into the framebuffer. The hardware-accelerated equivalent for the 16-bit mode would be a large unwieldy point list, so the game instead approximates circles by drawing straight lines along a regular 32-sided polygon:
    Screenshot of Shuusou Gyoku's circle drawing in 8-bit mode as shown during the Marisa boss fight
    Screenshot of Shuusou Gyoku's circle drawing in 16-bit mode as shown during the Marisa boss fight
    It's not like the APIs prevent the 16-bit mode from taking the same approach as the 8-bit mode, so I suppose that pbg profiled this and concluded that lines offloaded to the GPU performed better than locking the framebuffer and writing pixels? Then again, given Shuusou Gyoku's comparatively high system requirements…

    For preservation and consistency reasons, the SDL backend will also go with the approximation, although we could provide the accurate rendering of 8-bit mode via point lists if there's interest.

  2. There's an off-by-one error in the playfield clipping region for Direct3D-rendered shapes, which ends at (511, 479) instead of (512, 480):
    Screenshot of Shuusou Gyoku's circle drawing in 8-bit mode as shown during the Marisa boss fight
    Screenshot of Shuusou Gyoku's circle drawing in 16-bit mode as shown during the Marisa boss fight
    The fix is obvious.
  3. There's an off-by-one error in the 8-bit rendering code for opaque rectangles that causes them to appear 1 pixel wider than in 16-bit mode. The red backgrounds behind the currently entered score are the only such boxes in the entire game; the transparent rectangles used everywhere else are drawn with the same width in both modes.
    Screenshot of Shuusou Gyoku's High Score entry menu in 8-bit mode, highlighting 1st place with red backgrounds that are 1 pixel wider than in 16-bit mode
    Screenshot of Shuusou Gyoku's High Score entry menu in 16-bit mode, highlighting 1st place with red backgrounds at the correct size passed by game code
    The game code also clearly asks for 400 and 14 pixels, respectively.
  4. If we move the nice and accurate 8-bit circle outlines closer to the edge of the playfield, we discover, you guessed it, yet another off-by-one error:
    Screenshot of Shuusou Gyoku's circle drawing in 8-bit mode as shown during the Marisa boss fight, this time with the circles closer to the right edge of the playfield to highlight the off-by-one error in their clipping condition
    Screenshot of Shuusou Gyoku's circle drawing in 16-bit mode as shown during the Marisa boss fight and with a fixed playfield clipping region for Direct3D shapes, this time with the circles closer to the right edge of the playfield to compare their correct clipping against the off-by-one error in 8-bit mode
    No circle pixels at the right edge of the playfield. Obviously, I had to fix bug #2 in order for the line approximation to not also get clipped at the same coordinate.
  5. The final off-by-one clipping error can be found in the filled circle part of homing lasers in 8-bit mode, but it's so minor that it doesn't deserve its own screenshot.
  6. Also, how about 16-bit circles being off by one full rotation? pbg's code originally generated and rendered 720° worth of points, thus unnecessarily duplicating the number of lines rendered.

Now that all of the more complex geometry is generated as part of game code, I could simplify most of the engine's graphics layer down to the classic immediate primitives of early 3D rendering: Line strips, triangle strips, and triangle fans, although I'm retaining pbg's dedicated functions for filled boxes and single gradient lines in case a backend can or needs to use special abstractions for these. (Hint, hint…)


So, let's add an SDL graphics backend! With all the earlier preparation work, most of the SDL-specific sprite and geometry code turned out as a very thin wrapper around the, for once, truly simple function calls of the DirectMedia layer. Texture loading from the original color-keyed BMP files, for example, turned into a sequence of 7 straight-line function calls, with most of the work done by SDL_LoadBMP_RW(), SDL_SetColorKey(), and SDL_CreateTextureFromSurface(). And although SDL_LoadBMP_RW() definitely has its fair share of unnecessary allocations and copies, the whole sequence still loads textures ~300 µs faster than the old GDI and DirectDraw backend.
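
As a rough idea of what that sequence looks like – error handling omitted, and both the memory-based file access and the black color key are assumptions on my part rather than the actual code:

SDL_RWops *rw = SDL_RWFromConstMem(bmp_data, bmp_size);  // `bmp_data`/`bmp_size`: placeholders
SDL_Surface *surface = SDL_LoadBMP_RW(rw, 1);             // 1 = also close `rw` for us
SDL_SetColorKey(surface, SDL_TRUE, SDL_MapRGB(surface->format, 0x00, 0x00, 0x00));
SDL_Texture *texture = SDL_CreateTextureFromSurface(renderer, surface);
SDL_FreeSurface(surface);                                 // the GPU copy is all we need from now on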

Being more modern than our immediate geometry primitives, SDL's triangle renderer only either renders vertex buffers as triangle lists or requires a corresponding index buffer to realize triangle strips and fans. On paper, this would require an additional memory allocation for each rendered shape. But since we know that Shuusou Gyoku never passes more than 66 vertices at once to the backend, we can be fancy and compute two constant index buffers at compile time. 🧠 SDL_RenderGeometryRaw() is the true star of the show here: Not only does it allow us to decouple position and color data compared to SDL's default packed vertex structure, but it even allows the neat size optimization of 8-bit index buffers instead of enforcing 32-bit ones.
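
Here's a sketch of that compile-time idea for the triangle-fan case, assuming the 66-vertex limit from above and `import std;` for std::array; the variable names and the exact vertex layout are placeholders rather than the actual code:

// A triangle fan with N vertices decomposes into (N - 2) triangles {0, i+1, i+2}.
constexpr size_t VERTEX_MAX = 66;
constexpr auto FAN_INDICES = [] {
	std::array<uint8_t, ((VERTEX_MAX - 2) * 3)> ret{};
	for(size_t tri = 0; tri < (VERTEX_MAX - 2); tri++) {
		ret[(tri * 3) + 0] = 0;
		ret[(tri * 3) + 1] = static_cast<uint8_t>(tri + 1);
		ret[(tri * 3) + 2] = static_cast<uint8_t>(tri + 2);
	}
	return ret;
}();

// Rendering a fan with [n] vertices then just uses the first ((n - 2) * 3) indices,
// with separate position and color streams:
SDL_RenderGeometryRaw(
	renderer, texture,
	&xy[0].x, sizeof(xy[0]),       // position stream
	&colors[0], sizeof(colors[0]), // color stream (SDL_Color)
	&uv[0].x, sizeof(uv[0]),       // texture coordinate stream
	n,
	FAN_INDICES.data(), ((n - 2) * 3), sizeof(uint8_t) // 8-bit indices
);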

By far the funniest porting solution can be found in the Music Room's spectrum analyzer, which calls for 144 1-pixel gradient lines of varying heights. SDL_Renderer has no API for rendering lines with multiple colors… which means that we have to render them as 144 quads with a width of 1 pixel. :onricdennat:

The spectrum analyzer in Shuusou Gyoku's Music Room, at 6× magnification
The spectrum analyzer in Shuusou Gyoku's Music Room, with a wireframe render of the spectrum overlaid to demonstrate how the new SDL backend renders these gradient lines, at 6× magnification
The wireframe was generated via a raw glPolygonMode(GL_FRONT_AND_BACK, GL_LINE);

But all these simple abstractions have to be implemented somehow, and this is where we get to perhaps the biggest technical advantage of SDL_Renderer over pbg's old graphics backend. We're no longer locked into just a single underlying graphics API like Direct3D 2, but can choose any of the APIs that the team implemented the high-level renderer abstraction for. We can even switch between them at runtime!
On Windows, we have the choice between 3 Direct3D versions, 2 OpenGL versions, and the software renderer. And as we're going to see, all we should do here is define a sensible default and then allow players to override it in a dedicated menu:

Screenshot of the new API configuration submenu in the P0295 build of Shuusou Gyoku, highlighting the default OpenGL API
Huh, we default to OpenGL 2.1? Aren't we still on Windows? :thonk:

Since such a menu is pretty much asking for people to try every GPU ever with every one of these APIs, there are bound to be bugs with certain combinations. To prevent the potentially infinite workload, these bugs are exempt from my usual free bugfix policy as long as we can get the game working on at least one API without issues. The new initialization code should be resilient enough to automatically fall back on one of SDL's other driver APIs in case the default OpenGL 2.1 fails to initialize for whatever reason, and we can still fight about the best default API.
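
The fallback logic itself isn't anything special. In sketch form – this is not the actual initialization code – it boils down to asking for the preferred driver first and then trying everything else SDL knows about:

SDL_Renderer *CreateRendererWithFallback(SDL_Window *window, const char *preferred)
{
	// Hinting a driver name and passing -1 makes SDL pick that driver if it can.
	SDL_SetHint(SDL_HINT_RENDER_DRIVER, preferred);
	if(auto *ret = SDL_CreateRenderer(window, -1, 0)) {
		return ret;
	}
	for(int i = 0; i < SDL_GetNumRenderDrivers(); i++) {
		SDL_RendererInfo info;
		SDL_GetRenderDriverInfo(i, &info);
		if(!SDL_strcmp(info.name, preferred)) {
			continue; // already failed above
		}
		if(auto *ret = SDL_CreateRenderer(window, i, 0)) {
			return ret;
		}
	}
	return nullptr; // nothing works on this system
}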

But let's assume the hopefully usual case of a functional GPU with at least decently written drivers where most of the APIs will work without visible issues. Which of them is the most performant/power-saving one on any given system? With every API having a slightly different idea about 3D rendering, there are bound to be some performance differences, and maybe these even differ between GPUs. But just how large would they be?
The answer is yes:

Stage 6 replay:

| System | Resolution | API | Lowest FPS | Median FPS |
| --- | --- | --- | ---: | ---: |
| Intel Core i5-2520M (2011), Intel HD Graphics 3000 (2011) | 1120×840 | Direct3D 9 | 66 | 190 |
| | | OpenGL 2.1 | 150 | 288 |
| | | Software | 3 | 29 |
| Intel Core i5-8400T (2018), Intel UHD Graphics 630 (2018) | 1280×960 | Direct3D 9 | 217 | 485 |
| | | Direct3D 11 | 104 | 331 |
| | | Direct3D 12 | 59 | 143 |
| | | OpenGL 2.1 | 438 | 569 |
| | | Software | 5 | 27 |
| Intel Core i7-7700HQ (2017), NVIDIA GeForce GTX 1070 Max Q, thermal-throttled (2017) | 1280×960 | Direct3D 9 | 141 | 445 |
| | | Direct3D 11 | 155 | 468 |
| | | Direct3D 12 | 218 | 520 |
| | | OpenGL 2.1 | 528 | 682 |
| | | OpenGL ES 2.0 | 652 | 721 |
| | | Software | 4 | 44 |
| Intel i7-4790 (2017), NVIDIA GeForce GTX 1660 SUPER (2019) | 640×480, scaled to 1760×1320 | Direct3D 9 | 2004 | 2172 |
| | | OpenGL 2.1 | 1142 | 1249 |
| AMD Ryzen 5800X (2020), Radeon RX 7900XTX (2022) | 2560×1920 | Direct3D 9 | 555 | 1219 |
| | | OpenGL 2.1 | 750 | 1440 |

Extra Stage replay:

| System | Resolution | API | Lowest FPS | Median FPS |
| --- | --- | --- | ---: | ---: |
| Intel Core i5-2520M (2011), Intel HD Graphics 3000 (2011) | 1120×840 | Direct3D 9 | 33 | 90 |
| | | OpenGL 2.1 | 63 | 176 |
| | | Software | 4 | 53 |
| Intel Core i5-8400T (2018), Intel UHD Graphics 630 (2018) | 1280×960 | Direct3D 9 | 133 | 379 |
| | | Direct3D 11 | 58 | 72 |
| | | Direct3D 12 | 33 | 105 |
| | | OpenGL 2.1 | 367 | 503 |
| | | Software | 7 | 83 |
| Intel Core i7-7700HQ (2017), NVIDIA GeForce GTX 1070 Max Q, thermal-throttled (2017) | 1280×960 | Direct3D 9 | 19 | 202 |
| | | Direct3D 11 | 134 | 330 |
| | | Direct3D 12 | 218 | 390 |
| | | OpenGL 2.1 | 510 | 618 |
| | | OpenGL ES 2.0 | 339 | 675 |
| | | Software | 8 | 84 |
| Intel i7-4790 (2017), NVIDIA GeForce GTX 1660 SUPER (2019) | 640×480, scaled to 1760×1320 | Direct3D 9 | 1636 | 1990 |
| | | OpenGL 2.1 | 675 | 1047 |
| AMD Ryzen 5800X (2020), Radeon RX 7900XTX (2022) | 2560×1920 | Direct3D 9 | 276 | 487 |
| | | OpenGL 2.1 | 681 | 1011 |
Computed using pbg's original per-second debugging algorithm. Except for the Intel i7-4790 test, all of these use SDL's default geometry scaling mode as explained further below. The GeForce GTX 1070 could probably be twice as fast if it weren't inside a laptop that thermal-throttles after about 10 seconds of unlimited rendering.
The two tested replays decently represent the entire game: In Stage 6, the software renderer frequently drops into low 1-digit FPS numbers as it struggles with the blending effects used by the Laser shot type's bomb, whereas GPUs enjoy the absence of background tiles. In the Extra Stage, it's the other way round: The tiled background and a certain large bullet cancel emphasize the inefficiency of unbatched rendering on GPUs, but the software renderer has a comparatively much easier time.

And that's why I picked OpenGL as the default. It's either the best or close to the best choice everywhere, and in the one case where it isn't, it doesn't matter because the GPU is powerful enough for the game anyway.

If those numbers still look way too low for what Shuusou Gyoku is (because they kind of do), you can try enabling SDL's draw call batching by setting the environment variable SDL_RENDER_BATCHING to 1. This at least doubles the FPS for all hardware-accelerated APIs on the Intel UHD 630 in the Extra Stage, and astonishingly turns Direct3D 11 from the slowest API into by far the fastest one, speeding it up by 22× for a median FPS of 1617. The only reason I didn't activate batching by default is that it causes stability issues with OpenGL ES 2.0 on the same system. But honestly, if even a mid-range laptop from 13 years ago manages a stable 60 FPS on the default OpenGL driver while still scaling the game, there's no real need to spend budget on performance improvements.
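
If you'd rather not fiddle with environment variables, the same switch is also available through SDL's hint API; a port could flip it with a single call before creating its renderer:

// Equivalent to setting the SDL_RENDER_BATCHING environment variable to 1.
SDL_SetHint(SDL_HINT_RENDER_BATCHING, "1");
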
If anything, these numbers justify my choice of not focusing on any single one of these APIs when coding retro games. Few fields target a wider range of systems with their software than retrogaming, and as we've seen, each of SDL's supported APIs could be the optimal choice on some system out there.

The replays we used for testing: 2024-10-22-SH01-Stage-6-and-Extra-replays.zip


📝 Last year, it seemed as if the 西方Project logo screen's lens ball effect would be one of the trickier things to port to SDL_Renderer, and that impression was definitely accurate.

The effect works by capturing the original 140×140 pixels under the moving lens ball from the framebuffer into a temporary buffer and then overwriting the framebuffer pixels by shifting and stretching the captured ones according to a pre-calculated table. With DirectDraw, this is no big deal because you can simply lock the framebuffer for read and write access. If it weren't for the fact that you need to either generate or hand-write different code for every supported bit depth, this would be one of the most natural effects you could implement with such an API. Modern graphics APIs, however, don't offer this luxury because it didn't take long for this feature to become a liability. Even 20 years ago, you'd rather have written this sort of effect as a pixel shader that would directly run on the GPU in a much more accelerated way. Which is a non-starter for us – we sure ain't breaking SDL's abstractions to write a separate shader for every one of SDL_Renderer's supported APIs just for a single effect in the logo screen.
As such, SDL_Renderer doesn't even begin to provide framebuffer locking. We can only get close by splitting the two operations:

Within these API limitations, we can now cobble together a first solution:

  1. Rely on render-to-texture being supported. This is the case for all APIs that are currently implemented for SDL 2's renderer and SDL 3 even made support mandatory, but who knows if we ever get our hands on one of the elusive SDL 2 console ports under NDA and encounter one of them that doesn't support it… :thonk:
  2. Create a 640×480 texture that serves as our editable framebuffer.
  3. Create a 140×140 buffer in main memory, serving as the input and output buffer for the effect. We don't need the full 640×480 here because the effect only modifies the pixels below the magnified 140×140 area and doesn't push them further outside.
  4. Retain the original main-memory 140×140 buffer from the DirectDraw implementation that captures the current frame's pixels under the lens ball before we modify the pixels.
  5. Each frame, we then
    1. render the scene onto 2),
    2. capture the magnified area using SDL_RenderReadPixels(), reading from 2) and writing to 3),
    3. copy 3) to 4) using a regular memcpy(),
    4. apply the lens effect by shifting around pixels, reading from 4) and writing to 3),
    5. write 3) back to 2), and finally
    6. use 2) as the texture for a quad that scales the texture to the size of the window.

Compared to the DirectDraw approach, this adds the technical uncertainty of render-to-texture support, one additional texture, one additional fullscreen blit, at least one additional buffer, and two additional copies that comprise a round-trip from GPU to CPU and back. It surely would have worked, but the warnings in the documentation and the horror stories surrounding SDL_RenderReadPixels() put me off even trying that approach. Also, it would turn out to clash with an implementation detail we're going to look at later.
However, our scene merely consists of a 320×42 image on top of a black background. If we need the resulting pixels in CPU-accessible memory anyway, there's little point in hardware-rendering such a simple scene to begin with, especially if SDL lets you create independent software renderers that support the same draw calls but explicitly write pixels to buffers in regular system memory under your full control.
This simplifies our solution to the following:

  1. Create a 640×480 surface in main memory, acting as the target surface for SDL_CreateSoftwareRenderer(). But since the potentially hardware-accelerated renderer drivers can't render pixels from such surfaces, we still have to
  2. create an additional 640×480 texture in write-only GPU memory.
  3. Retain the original main-memory 140×140 buffer from the DirectDraw implementation that captures the current frame's pixels under the lens ball before we modify the pixels.
  4. Each frame, we then
    1. software-render the scene onto 1),
    2. capture the magnified area using a regular memcpy(), reading from 1) and writing to 3),
    3. apply the lens effect by shifting around pixels, reading from 3) and writing to 1),
    4. upload all of 1) onto 2), and finally
    5. use 2) as the texture for a quad that scales the texture to the size of the window.

This cuts out the GPU→CPU pixel transfer and replaces the second lens pixel buffer with a software-rendered surface that we can freely manipulate. This seems to require more memory at first, but this memory would actually come in handy for screenshots later on. It also requires the game to enter and leave the new dedicated software rendering mode to ensure that the 西方Project image gets loaded as a system-memory "texture" instead of a GPU-memory one, but that's just two additional calls in the logo and title loading functions.
Also, we would now software-render all of these 256 frames, including the fades. Since software rendering requires the 西方Project image to reside in main memory, it's hard to justify an additional GPU upload just to render the 127 frames surrounding the animation.
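
Put into (heavily abridged) code, the per-frame flow of this solution looks roughly like the following sketch; every function that isn't an SDL call is a placeholder name rather than something from the actual codebase:

// Setup, once:
SDL_Surface *fb = SDL_CreateRGBSurfaceWithFormat(   // the 640×480 main-memory surface
	0, 640, 480, 32, SDL_PIXELFORMAT_ARGB8888
);
SDL_Renderer *soft = SDL_CreateSoftwareRenderer(fb);
SDL_Texture *tex = SDL_CreateTexture(               // the write-only GPU texture
	hw_renderer, SDL_PIXELFORMAT_ARGB8888, SDL_TEXTUREACCESS_STREAMING, 640, 480
);

// Every frame:
RenderLogoScene(soft);                                  // software-render the scene
CaptureLensArea(lens_backup, fb);                       // memcpy() the 140×140 area
ApplyLensEffect(fb, lens_backup);                       // shift pixels back into the surface
SDL_UpdateTexture(tex, nullptr, fb->pixels, fb->pitch); // upload the surface to the texture
SDL_RenderCopy(hw_renderer, tex, nullptr, nullptr);     // scale the texture onto the window
SDL_RenderPresent(hw_renderer);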

Still, we've only eliminated a single copy, and SDL_UpdateTexture() can and will do even more under the hood. Suddenly, SDL having its own shader language seems like the lesser evil, doesn't it? When writing it out like this, it sure looks as if hardware rendering adds nothing but overhead here. So how about full-on dropping into software rendering and handling the scaling from 640×480 to the window resolution in software as well? This would allow us to cut out step 2) and the upload in sub-step 4), leaving 1) as our one and only framebuffer.
It sure sounds a lot more efficient. But actually trying this solution revealed that I had a completely wrong idea of the inefficiencies here:

  1. We do want to hardware-render the rest of the game, so we'd need to switch from software to hardware at the end of the logo animation. As it turns out, this switch is a rather expensive operation that would add an awkward ~500 ms pause between logo and title screen.
  2. Most importantly, though: Hardware-accelerating the final scaling step is kind of important these days. SDL's CPU scaling implementation can get really slow if a bilinear filter is involved; on my system, software-scaling 62.5 frames per second by 1.75× to 1120×840 pixels increases CPU usage by ~10%-20% in Release mode, and even drops FPS to 50 in Debug mode.

This was perhaps the biggest lesson in this sudden 25-year jump from optimizing for a PC-98 and suffering under slow DirectDraw and Direct3D wrappers into the present of GPU rendering. Even though some drivers technically don't need these redundant CPU copies, a slight bit of added CPU time is still more than worth it if it means that we get to offload the actually expensive stuff onto the GPU.


But we all know that 4-digit frame rates aren't the main draw of rendering graphics through SDL. Besides cross-platform compatibility, the most useful aspect for Shuusou Gyoku is how SDL greatly simplifies the addition of the scaled window and borderless fullscreen modes you'd expect for retro pixel graphics on modern displays. Of course, allowing all of these settings to be changed in-engine from inside the Graphic options menu is the minimum UX comfort level we would accept here – after all, something like a separate DPI-aware dialog window at startup would be harder to port anyway.
For each setting, we can achieve this level of comfort in one of two ways:

  1. We could simply shut down SDL's underlying render driver, close the window, and reopen/reinitialize the window and driver, reloading any game graphics as necessary. This is the simplest way: We can just reuse our backend's full initialization code that runs at startup and don't need any code on top. However, it would feel rather janky and cheap.
  2. Or we could use SDL's various setter functions to only apply the single change to the specific setting… and anything that setting depends on. This would feel really smooth to use, but would require additional code with a couple of branches.

pbg's code was already geared slightly towards 2) with its feature to seamlessly change the bit depth. And with the amount of budget I'm given these days, it should be obvious what I went with. This definitely wasn't trivial and involved lots of state juggling and careful ordering of these procedural, imperative operations, even at the level of "just" using high-level SDL API calls for everything. It must have undoubtedly been worse for the SDL developers; after all, every new option for a specific parameter multiplies the number of potential window state transitions.
In the end though, most of it ended up working at our preferred high level of quality, leaving only a few cases where either SDL or the driver API forces us to throw away and recreate the window after all:

As for the actual settings, I decided on making the windowed-mode scale factor customizable at intervals of 0.25, or 160×120 pixels, up to the taskbar-excluding resolution of the current display the game window is placed on. Sure, restricting the factor to integer values is the idealistically correct thing to do, but 640×480 is a rather large source resolution compared to the retro consoles where integer scaling is typically brought up. Hence, such a limitation would be suboptimal for a large number of displays, most notably any old 720p display or those laptop screens with 1366×768 resolutions.
In the new borderless fullscreen mode, the configurable scaling factor breaks down into all three possible interpretations of "fitting the game window onto the whole screen":

  • Integer fit: the largest integer scale factor that fits the screen, preserving the original pixel grid,
  • 4:3 fit: the largest 4:3 area on the screen, which typically means a fractional scale factor, and
  • Stretch fit: the entire screen, ignoring the aspect ratio.

What currently can't be configured is the image filter used for scaling. The game always uses nearest-neighbor at integer scaling factors and bilinear filtering at fractional ones.

Screenshot of the Integer fit option in the borderless fullscreen mode of the P0295 Shuusou Gyoku build, as captured on a 1280×720 display
Screenshot of the 4:3 fit option in the borderless fullscreen mode of the P0295 Shuusou Gyoku build, as captured on a 1280×720 display
Screenshot of the Stretch fit option in the borderless fullscreen mode of the P0295 Shuusou Gyoku build, as captured on a 1280×720 display
The three scaling options available in borderless fullscreen mode as rendered on a 1280×720 display, which is one of the worst display resolutions you could play this game on.
And yes – as the presence of the FullScr[Borderless] option implies, the new build also still supports exclusive, display mode-changing 640×480 boomer fullscreen. 🙌
That ScaleMode, though… :thonk:

And then, I was looking for one more small optional feature to complete the 9th push and came up with the idea of hotkeys that would allow changing any of these settings at any point. Ember2528 considered it the best one of my ideas, so I went ahead… but little did I know that moving these graphics settings out of the main menu would not only significantly reshape the architecture of my code, but also uncover more bugs in my code and even a replay-related one from the original game. Paraphrasing the release notes:

The original game had three bugs that affected the configured difficulty setting when playing the Extra Stage or watching an Extra Stage replay. When returning to the main menu from an Extra Stage replay, the configured difficulty would be overridden with either
  1. the difficulty selected before the last time the Extra Stage's Weapon Select screen was entered, or
  2. Easy, when watching the replay before having been to the Extra Stage's Weapon Select screen during one run of the program.
  3. Also, closing the game window during the Extra Stage (both self-played and replayed) would override the configured difficulty with Hard (the internal difficulty level of the Extra Stage).
This had always been slightly annoying during development as I'd been closing the game window quite a bit during the Extra Stage. But the true nature of this bug only became obvious once these hotkeys allowed graphics settings to be changed during the Extra Stage: pbg had been creating a copy of the full configuration structure just because lives, bombs, the difficulty level, and the input flags need to be overwritten with their values from the replay file, only to then recover the user's values by restoring the full structure to its state from before the replay.

But the award for the greatest annoyance goes to this SDL quirk that would reset a render target's clipping region when returning to raw framebuffer rendering, which causes sprites to suddenly appear in the two black 128-pixel sidebars for the one frame after such a change. As long as graphics settings were only available from the unclipped main menu, this quirk only required a single silly workaround of manually backing up and restoring the clipping region. But once hotkeys allowed these settings to be changed while SDL_Renderer clips all draw calls to the 384×480 playfield region, I had to deploy the same exact workaround in three additional places… 🥲 At least I wrote it in a way that allows it to be easily deleted if we ever update to SDL 3, where the team fixed the underlying issue.
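
The workaround itself is tiny – roughly the following, although the real code is structured differently:

// Back up the clipping region around anything that changes the render target,
// then re-apply it afterwards, because SDL 2 will have forgotten it.
SDL_Rect clip;
SDL_RenderGetClipRect(renderer, &clip);
const bool clip_enabled = (SDL_RenderIsClipEnabled(renderer) == SDL_TRUE);

SDL_SetRenderTarget(renderer, nullptr); // the clipping region gets lost here…

SDL_RenderSetClipRect(renderer, (clip_enabled ? &clip : nullptr)); // …so, restore it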

In the end, I'm not at all confident in the resulting jumbled mess of imperative code and conditional branches, but at least it proved itself during the 1½ months this feature has existed on my machine. If it's any indication, the testers in the Seihou development Discord group thought it was fine at the beginning of October when there were still 8 bugs left to be discovered. :onricdennat:
As for the mappings themselves: F10 and F11 cycle the window scaling factor or borderless fullscreen fit, F9 toggles the ScaleMode described below, and F8 toggles the frame rate limiter. The latter in particular is very useful for not only benchmarking, but also as a makeshift fast-forward function for replays. Wouldn't rewinding also be cool?


So we've ported everything the game draws, including its most tricky pixel-level effect, and added windowed modes and scaling on top. That only leaves screenshots and then the SDL backend work would be complete. Now that's where we just call SDL_RenderReadPixels() and write the returned pixels into a file, right? We've been scaling the game with the very convenient SDL_RenderSetLogicalSize(), so I'd expect to get back the logical 640×480 image to match the original behavior of the screenshot key…
…except that we don't? Why do we only get back the 640×480 pixels in the top-left corner of the game's scaled output, right before it hits the screen? How unfortunate – if SDL forces us to save screenshots at their scaled output resolution, we'd needlessly multiply the disk space that these uncompressed .BMP files take up. But even if we did compress them, there should be no technical reason to blow up the pixels of these screenshots past the logical size we specified…

Taking a closer look at SDL_RenderSetLogicalSize() explains what's going on there. This function merely calculates a scale factor by comparing the requested logical size with the renderer's output size, as well as a viewport within the game window if it has a different aspect ratio than the logical size. Then, it's up to the SDL_Renderer frontend to multiply and offset the coordinates of each incoming vertex using these values.
Therefore, SDL_RenderReadPixels() can't possibly give us back a 640×480 screenshot because there simply is no 640×480 framebuffer that could be captured. As soon as the draw calls hit the render API and could be captured, their coordinates have already been transformed into the scaled viewport.

The solution is obvious: Let's just create that 640×480 image ourselves. We'd first render every frame at that resolution into a texture, and then scale that texture to the window size by placing it on a single quad. From a preservation standpoint, this is also the academically correct thing to do, as it ensures that the entire game is still rendered at its original pixel grid. That's why this framebuffer scaling mode is the default, in contrast to the geometry scaling that SDL comes with.
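
In SDL terms, framebuffer scaling therefore boils down to a render target texture and one final scaled blit per frame. A sketch with placeholder names:

// Setup, once:
SDL_Texture *framebuffer = SDL_CreateTexture(
	renderer, SDL_PIXELFORMAT_ARGB8888, SDL_TEXTUREACCESS_TARGET, 640, 480
);

// Every frame:
SDL_SetRenderTarget(renderer, framebuffer); // draw the game at its original 640×480…
RenderGameFrame(renderer);                  // (stands in for all regular draw calls)
SDL_SetRenderTarget(renderer, nullptr);
SDL_RenderCopy(renderer, framebuffer, nullptr, &scaled_rect); // …then scale onto the window
SDL_RenderPresent(renderer);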

With integer scaling factors and nearest-neighbor filtering, we'd expect the two approaches to deliver exactly identical pixels as far as sprite rendering is concerned. At fractional resolutions though, we can observe the first difference right in the menu. While geometry scaling always renders boxes with sharp edges, it noticeably darkens the text inside the boxes because it separately scales and alpha-blends each shadowed line of text on top of the already scaled pixels below – remember, 📝 the shadow for each line is baked into the same sprite. Framebuffer scaling, on the other hand, doesn't work on layers and always blurs every edge, but consequently also blends together all pixels in a much more natural way:

Look closer, and you can even see texture coordinate glitches at the edges of the individual text line quads.
Screenshot of the new Graphic option menu in the Shuusou Gyoku P0295 build, as rendered at a scale factor of 3.75× using geometry scaling, showing off both the sharp edges on boxes and the darker, individually-blended lines of text
Screenshot of the new Graphic option menu in the Shuusou Gyoku P0295 build, as rendered at a scale factor of 3.75× using framebuffer scaling, showing off the more natural bilinearly-filtered look that blends the entire screen together and results in brighter text

Surprisingly though, we don't see much of a difference with the circles in the Weapon Select screen. If geometry scaling only multiplies and offsets vertices, shouldn't the lines along the 32-sided polygons still be just one pixel thick? As it turns out, SDL puts in quite a bit of effort here: It never actually uses the API's line primitive when scaling the output, but instead takes the endpoints, rasterizes the line on the CPU, and turns each point on the resulting line into a quad the size of the scale factor. Of course, this completely nullifies pbg's original intent of approximating circles with lines for performance reasons. :onricdennat:
The result looks better and better the larger the window is scaled. On low fractional scale factors like 1.25×, however, lines end up looking truly horrid as the complete lack of anti-aliasing causes the 1.25×1.25-pixel point quads to be rasterized as 2 pixels rather than a single one at regular intervals:

Screenshot of the Wide Shot selection in Shuusou Gyoku's Weapon Select screen, as rendered at a scale factor of 1.25× using geometry scaling, showing off the inconsistent rasterization of point quads without anti-aliasing
Screenshot of the Wide Shot selection in Shuusou Gyoku's Weapon Select screen, as rendered at a scale factor of 1.25× using framebuffer scaling
Also note how you can either have bright circle colors or bright text colors, but not both.
But once we move in-game, we can even spot differences at integer resolutions if we look closely at all the shapes and gradients. In contrast to lines, software-rasterizing triangles with different vertex colors would be significantly more expensive as you'd suddenly have to cover a triangle's entire filled area with point quads. But thanks to that filled nature, SDL doesn't have to bother: It can merely scale the vertex coordinates as you'd expect and pass them onto the driver. Thus, the triangles get rasterized at the output resolution and end up as smooth and detailed as the output resolution allows:

Screenshot of a long laser used by Shuusou Gyoku's Extra Stage midboss, rendered at a scale factor of 3× and using geometry scaling for smooth edges at the scaled resolution
Screenshot of a long laser used by Shuusou Gyoku's Extra Stage midboss, rendered at 640×480 and framebuffer-scaled to 3× of the original resolution
Note how the HP gauge, being a gradient, also looks smoother with geometry scaling, whereas the Evade gauge, being 9 additively-blended red boxes with decreasing widths, doesn't differ between the modes.
For an even smoother rendering, enable anti-aliasing in your GPU's control panel; SDL unfortunately doesn't offer an API-independent way of enabling it.

You might now either like geometry scaling for adding these high-res elements on top of the pixelated sprites, or you might hate it for blatantly disrespecting the original game's pixel grid. But the main reasons for implementing and offering both modes are technical: As we've learned earlier when porting the lens ball effect, render-to-texture support is technically not guaranteed in SDL 2, and creating an additional texture is technically a fallible operation. Geometry scaling, on the other hand, will always work, as it's just additional arithmetic.
If geometry scaling does find its fans though, we can use it as a foundation for further high-res improvements. After all, this mode can't ever deliver a pixel-perfect rendition of the original Direct3D output, so we're free to add whatever enhancements we like while any accuracy concerns would remain exclusive to framebuffer scaling.

Just don't use geometry scaling with fractional scaling factors. :tannedcirno: These look even worse in-game than they do in the menus: The glitching texture coordinates reveal both the boundaries of on-screen tiles as well as the edge pixels of adjacent tiles within the set, and the scaling can even discolor certain dithered transparency effects, what the…?!

That green color is supposed to be the color key of this sprite sheet… 🤨
Screenshot of a frame of Shuusou Gyoku's Stage 3 intro, rendered at a scale factor of 1.5× using geometry scaling, showing off tons of tilemap texture coordinate glitches and the usually half-transparent shadow of Gates' plane showing up in a strange green color
Screenshot of a frame of Shuusou Gyoku's Stage 3 intro, rendered at a scale factor of 1.5× using framebuffer scaling, with no graphical glitches

With both scaling paradigms in place, we now have a screenshot strategy for every possible rendering mode:

  1. Software-rendering (i.e., showing the 西方Project logo)?
    This is the optimal case. We've already rendered everything into a system-memory framebuffer anyway, so we can just take that buffer and write it to a file.

  2. Hardware-rendering at unscaled 640×480?
    Requires a transfer of the GPU framebuffer to the system-memory buffer we initially allocate for software rendering, but no big deal otherwise. (See the sketch after this list.)

  3. Hardware-rendering with framebuffer scaling?
    As we've seen with the initial solution for the lens ball effect, flagging a texture as a render target thankfully always allows us to read pixels back from the texture, so this is identical to the case above.

  4. Hardware-rendering with geometry scaling?
    This is the initial case where we must indeed bite the bullet and save the screenshot at the scaled resolution because that's all we can get back from the GPU. Sure, we could software-scale the resulting image back to 640×480, but:

    • That would defeat the entire point of geometry scaling as it would throw away all the increased detail displayed in the screenshots above. Maybe that is something you'd like to capture if you deliberately selected this scale mode.
    • If we scaled back an image rendered at a fractional scaling factor, we'd lose every last trace of sharpness.

    The only sort of reasonable alternative: We could respond to the keypress by setting up a parallel 640×480 software renderer, rendering the next frame in both hardware and software in parallel, and delivering the requested screenshot with a 1-frame lag. This might be closer to what players expect, but it would make quite a mess of this already way too stateful graphics backend. And maybe, the lag is even longer than 1 frame because we simultaneously have to recreate all active textures in CPU-accessible memory…
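
For the hardware-rendered cases 2) and 3), the capture then amounts to little more than the following sketch, with placeholder names once again; in case 3), the read has to happen while the 640×480 texture is still the active render target:

// Read the current render target back into the system-memory surface that the
// software-rendering path already allocates, then write that surface to disk.
SDL_RenderReadPixels(
	renderer, nullptr, fb_surface->format->format, fb_surface->pixels, fb_surface->pitch
);
SDL_SaveBMP(fb_surface, screenshot_filename);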


Now that we can take screenshots, let's take a few and compare our 640×480 output to pbg's original Direct3D backend to see how close we got. Certain small details might vary across all the APIs we can use with SDL_Renderer, but at least for Direct3D 9, we'd expect nothing less than a pixel-perfect match if we pass the exact same vertices to the exact same APIs. But something seems to be wrong with the SDL backend at the subpixel level with any triangle-based geometry, regardless of which rendering API we choose…

Screenshot of the laser-heavy pattern of VIVIT-captured-'s first form, as rendered by pbg's original Direct3D backend
Screenshot of the laser-heavy pattern of VIVIT-captured-'s first form, as rendered by the initial state of the SDL graphics backend, showing slightly displaced vertices compared to pbg's original Direct3D backend
As if each polygon was shifted slightly up and to the left…

The culprit is found quickly: SDL's Direct3D 9 driver displaces each vertex by (-0.5, -0.5) after scaling, which is necessary for Direct3D to perfectly match the output of what OpenGL, Direct3D 11, Direct3D 12, and the software renderer would display without this offset. SDL is probably right here, but still, pbg's code doesn't do this. So it's up to us to counteract this displacement by adding (1.0 / (2.0 * scale)) to every vertex. 🤷
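
In code, the counter-offset is a single addition per vertex, with the division by the scale factor putting the half-pixel into pre-scale coordinates. (A sketch; `scale` and the vertex layout are placeholders.)

// Cancel out the half-pixel displacement so that the output matches the
// original game's vertices again.
const float offset_x = (1.0f / (2.0f * scale.x));
const float offset_y = (1.0f / (2.0f * scale.y));
for(auto& vertex : vertices) {
	vertex.x += offset_x;
	vertex.y += offset_y;
}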

The other, much trickier accuracy issue is the line rendering. We saw earlier that SDL software-rasterizes any lines if we geometry-scale, but we do expect it to use the driver's line primitive if we framebuffer-scale or regularly render at 640×480. And at one point, it did, until the SDL team discovered accuracy bugs in various OpenGL implementations and decided to just always software-rasterize lines by default to achieve identical rendered images regardless of the chosen API. Just like with the half-pixel offset above, this is the correct choice for new code, but the wrong one for accurately porting an existing Direct3D game.
Thankfully, you can opt into the API's native line primitive via SDL's hint system, but the emphasis here is on API. This hint can still only ensure a pixel-perfect match if SDL renders via any version of Direct3D and you either use framebuffer scaling or no scaling at all. OpenGL will draw lines differently, and the software renderer just uses the same point rasterizing algorithm that SDL uses when scaling.
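
In code, opting into native lines is once again a single hint, set before creating the renderer; a value of 3 would select the triangle method covered below:

// Same effect as setting the SDL_RENDER_LINE_METHOD environment variable to 2.
SDL_SetHint(SDL_HINT_RENDER_LINE_METHOD, "2");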

Pixels written into the framebuffer along the accurate outline, as we've covered above. Also note the slightly brighter color compared to the 3D-rendered variants.
The original Direct3D line rendering used in pbg's original code, touching a total of 568 pixels.
OpenGL's line rendering gets close, but still puts 16 pixels into different positions. Still, 97.2% of points are accurate to the original game.
The result of SDL's software line rasterizer, which you'd still see in the P0295 build when using either the software renderer or geometry scaling with any API. Slightly more accurate than OpenGL in this particular case with only 14 diverging pixels, matching 97.5% of the original circle.
As another alternative, SDL also offers a mode that renders each line as two triangles. This method naturally scales to any scale factor, but ends up drawing slightly thicker diagonals. You can opt into this mode via SDL's hint system by setting the environment variable SDL_RENDER_LINE_METHOD to 3.
The triangle method would also fit great with the spirit of geometry scaling, rendering smooth high-res circles analogous to the laser examples we saw earlier. This is how it would look with the game scaled to 3200×2400… yeah, maybe we do want the point list after all; you can clearly see the 32 corners at this scale.
Screenshot of Shuusou Gyoku's Homing Missile weapon option with the circular selection cursor as rendered by pbg's original 8-bit drawing code
Screenshot of Shuusou Gyoku's Homing Missile weapon option with the circular selection cursor as rendered by Direct3D's line drawing algorithm
Screenshot of Shuusou Gyoku's Homing Missile weapon option with the circular selection cursor as rendered by OpenGL's line drawing algorithm
Screenshot of Shuusou Gyoku's Homing Missile weapon option with the circular selection cursor as rendered by SDL's point-based line drawing algorithm
Screenshot of Shuusou Gyoku's Homing Missile weapon option with the circular selection cursor as rendered by SDL's triangle-based line drawing algorithm at a scale factor of 1×
Screenshot of Shuusou Gyoku's Homing Missile weapon option with the circular selection cursor as rendered by SDL's triangle-based line drawing algorithm at a geometry scale factor of 5×

Replacing circles with point lists, as mentioned earlier, won't solve everything though, because Shuusou Gyoku also has plenty of non-circle lines:

The final 5-instance frame of Shuusou Gyoku's pre-boss WARNING wireframe animation against a black background, as rendered by pbg's original 8-bit drawing code
The final 5-instance frame of Shuusou Gyoku's pre-boss WARNING wireframe animation against a black background, as rendered by Direct3D's line drawing algorithm
The final 5-instance frame of Shuusou Gyoku's pre-boss WARNING wireframe animation against a black background, as rendered by OpenGL's line drawing algorithm
The final 5-instance frame of Shuusou Gyoku's pre-boss WARNING wireframe animation against a black background, as rendered by SDL's point-based line drawing algorithm
The final 5-instance frame of Shuusou Gyoku's pre-boss WARNING wireframe animation against a black background, as rendered by SDL's triangle-based line drawing algorithm at a scale factor of 1×
The final 5-instance frame of Shuusou Gyoku's pre-boss WARNING wireframe animation against a black background, as rendered by SDL's triangle-based line drawing algorithm at a geometry scale factor of 5×
6884 pixels touched by the Direct3D line renderer, a 98.3% match by the OpenGL rasterizer with 119 diverging pixels, and a 97.9% match by the SDL rasterizer with 147 diverging pixels. Looks like OpenGL gets better the longer the lines get, making line render method #2 the better choice even on non-Direct3D drivers.

So yeah, this one's kind of unfortunate, but also very minor as both OpenGL's and SDL's algorithms are at least 97% accurate to the original game. For now, this does mean that you'll manually have to change SDL_Renderer's driver from the OpenGL default to any of the Direct3D ones to get those last 3% of accuracy. However, I strongly believe that everyone who does care at this level will eventually read this sentence. And if we ever actually want 100% accuracy across every driver, we can always reverse-engineer and reimplement the exact algorithm used by Direct3D as part of our game code.


That completes the SDL renderer port for now! As all the GitHub issue links throughout this post have already indicated, I could have gone even further, but this is a convincing enough state for a first release. And once I've added a Linux-native font rendering backend, removed the few remaining <windows.h> types, and compiled the whole thing with GCC or Clang as a 64-bit binary, this will be up and running on Linux as well.

If we take a step back and look at what I've actually ended up writing during these SDL porting endeavors, we see a piece of almost generic retro game input, audio, window, rendering, and scaling middleware code, on top of SDL 2. After a slight bit of additional decoupling, most of this work should be reusable for not only Kioh Gyoku, but even the eventual cross-platform ports of PC-98 Touhou.
Perhaps surprisingly, I'm actually looking forward to Kioh Gyoku now. That game seems to require raw access to the underlying 3D API due to a few effects that seem to involve a Z coordinate, but all of these are transformed in software just like the few 3D effects in Shuusou Gyoku. Coming from a time when hardware T&L wasn't a ubiquitous standard feature on GPUs yet, both games don't even bother and only ever pass Z coordinates of 0 to the graphics API, thus staying within the scope of SDL_Renderer. The only true additional high-level features that Kioh Gyoku requires from a renderer are sprite rotation and scaling, which SDL_Renderer conveniently supports as well. I remember some of my backers thinking that Kioh Gyoku was going to be a huge mess, but looking at its code and not seeing a separate 8-bit render path makes me rather excited to be facing a fraction of Shuusou Gyoku's complexity. The 3D engine sure seems featureful on the surface, and the hundreds of source files sure feel intimidating, but a lot of the harder-to-port parts remained unused in the final game. Kind of ironic that pbg wrote a largely new engine for this game, but we're closer to porting it back to our own enhanced, now almost fully cross-platform version of the Shuusou Gyoku engine.

Speaking of 8-bit render paths though, you might have noticed that I didn't even bother to port that one to SDL. This is certainly suboptimal from a preservation point of view; after all, pbg specifically highlights in the source code's README how the split between palettized 8-bit and direct-color 16-bit modes was a particularly noteworthy aspect of the period in time when this game was written:

8bit/16bitカラーの混在、MIDI再生関連、浮動小数点数演算を避ける、あたりが懐かしポイントになるかと思います。
(I think the mix of 8-bit/16-bit color, the MIDI playback-related parts, and the avoidance of floating-point arithmetic are going to be the nostalgic points.)

Times have changed though, and SDL_Renderer doesn't even expose the concept of rendering bit depth at the API level. 📝 If we remember the initial motivation for these Shuusou Gyoku mods, Windows ≥8 doesn't even support anything below 32-bit anymore, and neither do most of SDL_Renderer's hardware-accelerated drivers as far as texture formats are concerned. While support for 24-bit textures without an alpha channel is still relatively common, only the Linux DirectFB driver might support 16-bit and 8-bit textures, and you'd have to go back to the PlayStation Vita, PlayStation 2, or the software renderer to find guaranteed 16-bit support.

Therefore, full software rendering would be our only option. And sure enough, SDL_Renderer does have the necessary palette mapping code required for software-rendering onto a palettized 8-bit surface in system memory. That would take care of accurately constraining this render path to its intended 256 colors, but we'd still have to upconvert the resulting image to 32-bit every frame and upload it to GPU for hardware-accelerated scaling. This raises the question of whether it's even worth it to have 8-bit rendering in the SDL port to begin with if it will be undeniably slower than the GPU-accelerated direct-color port. If you think it's still a worthwhile thing to have, here is the issue to invest in.

In the meantime though, there is a much simpler way of continuing to preserve the 8-bit mode. As usual, I've kept pbg's old DirectX graphics code working all the way through the architectural cleanup work, which makes it almost trivial to compile that old backend into a separate binary and continue preserving the 8-bit mode in that way.
This binary is also going to evolve into the upcoming Windows 98 backport, and will be accompanied by its own SDL DLL that throws out the Direct3D 11, Direct3D 12, OpenGL 2, and WASAPI backends, as none of them exist on Windows 98. I've already thrown out the SSE2 and AVX implementations of the BLAKE3 hash function in preparation, which explains the smaller binary size. These Windows 98-compatible binaries will obviously have to remain 32-bit, but I'm undecided on whether I should update the regular Windows build to a 64-bit binary or keep it 32-bit:

I'm open to strong opinions that sway me in one or the other direction, but I'm not going to do both – unless, of course, someone subscribes for the continued maintenance of three Windows builds. 😛

Speaking of SDL, we'll probably want to update from SDL 2 to SDL 3 somewhere down the line. It's going to be the future, it cleans up the API in a few particularly annoying places, and it adds a Vulkan driver to SDL_Renderer. Too bad that the documentation still deters me from using the audio subsystem despite the significant improvements it received in other regards…
For now, I'm still staying on SDL 2 for two main reasons:

Finally, I decided against a Japanese translation of the new menu options for now because the help text communicates too much important information. That will have to wait until we make the whole game translatable into other languages.


📝 I promised to recreate the Sound Canvas VA packs once I know about the exact way real hardware handles the 📝 invalid Reverb Macro messages in ZUN's MIDI files, and what better time to keep that promise than to tack it onto the end of an already long overdue delivery. For some reason, Sound Canvas VA exhibited several weird glitches during the re-rendering process, which prompted some rather extensive research and validation work to ensure that all tracks still generally sound like they did in the previous version of the packs. Figuring out why this patch was necessary could have certainly taken a push of its own…

Interestingly enough, all these comparisons of renderings against each other revealed that the fix makes a difference in far fewer MIDIs than the expected 34 out of 39. Only 19 tracks – 11 in the OST and 8 in the AST – actually sound different depending on the Reverb Macro, because the remaining 15 set the reverb effect's main level to 0 and are therefore unaffected by the fix.
And then, there is the Stage 1 theme, which only activates reverb during a brief portion of its loop:

[Two screenshots: all SysEx messages appearing in the original and in the SysEx-fixed MIDI file of the OST version of フォルスストロベリー, Shuusou Gyoku's Stage 1 theme, as decoded by Domino]
As visualized by 📝 Domino.

Thus, this track definitely counts toward the 11 with a distinct echo version. But comparing that version against the no-echo one reveals something truly mind-blowing: The Sound Canvas VA rendering only differs within exactly the 8 reverb-activating bars of the loop, and is bit-by-bit identical everywhere else. 🤯 This is why you use softsynths.

This is the OST version, but it works just as well with the AST.
Since the no-echo and echo BGM packs are aligned in both time and volume, you can reproduce this result – and explore the differences for any other track across both soundtracks – by simply phase-inverting a no-echo variant file and mixing it into the corresponding echo file. Obviously, this works best with the FLAC files. Trying it with the lossy versions gets surprisingly close though, and simultaneously reveals the infamous Vorbis pre-echo on the drums.
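In case the audio-editor terminology sounds fancier than it is: phase inversion followed by mixing is nothing more than a per-sample subtraction. A minimal sketch on plain float sample buffers, with the FLAC decoding into such buffers left out:

#include <cstddef>
#include <vector>

// Returns (echo - no_echo); every sample the two renderings have in common
// cancels to digital silence, leaving only the parts that actually differ.
std::vector<float> PhaseCancel(const std::vector<float>& echo, const std::vector<float>& no_echo)
{
	const size_t common = (echo.size() < no_echo.size()) ? echo.size() : no_echo.size();
	std::vector<float> diff(common);
	for (size_t i = 0; i < common; i++) {
		diff[i] = echo[i] - no_echo[i]; // inverting and mixing = subtracting
	}
	return diff;
}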

So yeah, the fact that ZUN enabled reverb by suddenly increasing the level for just this 8-bar piano solo erases any doubt about the panning delay having been a quirk or accident. There is no way this wasn't done intentionally; whether the SC-88Pro's default reverb is at 0 or 40 barely makes an audible difference with all the notes played in this section, and wouldn't have been worth the unfortunate chore of inserting another GS SysEx message into the sequence. That's enough evidence to relegate the previous no-echo Sound Canvas VA packs to a strictly unofficial status, and to only preserve them for reference purposes. If you downloaded the earlier ones, you might want to update… or maybe not if you don't like the echo; it's all about personal preference at the end of the day.

While we're that deep into reproducibility, it makes sense to address another slight issue with the March release. Back then, I rendered 📝 our favorite three MIDI files, the AST versions of the three Extra Stage themes, with their original long setup area and then trimmed the respective samples at the audio level. But since the MIDI-only BGM pack features a shortened setup area at the MIDI level, rendering these modified MIDI files yourself wouldn't give you back the exact waveforms. 📝 As PCM behaves like a lollipop graph, any change to the position of a note at a tempo that isn't an integer factor of the sampling rate will most likely result in completely different samples, which can then no longer be compared via simple phase-cancelling.
In our case though, all three of the tracks in question render with a slightly higher maximum peak amplitude when shortening their MIDI setup area. Normally, I wouldn't bother with such a fluctuation, but remember that シルクロードアリス is by far the loudest piece across both soundtracks, and thus defines the peak volume that every other track gets normalized to.
But wait a moment, doesn't this mean that there might be a setup area length that yields a lower or even much lower peak amplitude? :thonk:

And so I tested all setup area lengths at regular intervals between our target 2-beat length and ZUN's original lengths, and indeed found a great solution: When manipulating the setup area of the Extra Stage theme to an exact length of 2850 MIDI pulses, the conversion process renders it with a peak amplitude of 1.900, compared to its previous peak amplitude of 2.130 from the March release. That translates to an extra +0.56 dB of volume tricked out of all other tracks in the AST! :onricdennat: Yeah, it's not much, but hey, at least it's not worse than what it used to be. The shipped MIDIs of the Extra Stage themes still don't correspond to the rendered files, but now this is at least documented together with the MIDI-level patch to reproduce the exact optimal length of the setup area.
Still, all that testing effort for tracks that, in my subjective opinion, don't even sound all that good… The resulting shrill resonant effects stick out like a sore thumb compared to the more basic General MIDI sound of every other track across both soundtrack variants. Once again, unofficial remixes such as Romantique Tp's one edit to 二色蓮花蝶 ~ Ancients remain the only solution here. As far as preservation is concerned, this is as good as it gets, and my job here is done.

Then again, now that I've further refined (and actually scripted) the loop construction logic, I'd love to also apply it to Kioh Gyoku's MIDI soundtrack once its codebase is operational. Obviously, there's much less of an incentive for putting SC-88Pro recordings back into that game given that Kioh Gyoku already comes with an official (and, dare I say, significantly more polished) waveform soundtrack. And even if there was an incentive, it might not extend to a separate Sound Canvas VA version: As frustrating as ZUN's sequencing techniques in the final three Shuusou Gyoku Extra Stage arrangements are when dealing with rendered output, the fact that he reserved a lot more setup space to fit the more detailed sound design of each Kioh Gyoku track is a good thing as far as real-hardware playback is concerned. Consequently, the Romantique Tp recordings suffer far less from 📝 the SC-88Pro's processing lag issues, and thus might already constitute all the preservation anyone would ever want.
Once again though, generous MIDI setup space also means that Kioh Gyoku's MIDI soundtrack has lots of long and awkward pauses at the beginning of stages before the music starts. The two worst offenders here are 天鵞絨少女戦 ~ Velvet Battle and 桜花之恋塚 ~ Flower of Japan, with a 3.429s pause each. So, preserving the MIDI soundtrack in its originally intended sound might still be a worthwhile thing to fund, if only to get rid of those pauses. After all, we can't ever safely remove these pauses at the MIDI level unless users promise that they use a GS-supporting device.

What we can do as part of the game, however, is hotpatch the original MIDI files from Shuusou Gyoku's MUSIC.DAT with the Reverb Macro fix. This way, the fix is also available for people who want to listen to the OST through their own copy of Sound Canvas VA or an SC-8850 and don't want to download recordings. This isn't necessary for the AST because we can simply bake the fix into the MIDI-only BGM pack, but we can't do the same for the OST for copyright reasons. This hotpatch should remain optional simply because hotpatching MIDIs is rather insidious in principle, but it's enabled by default given the evidence we found earlier.
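For reference – and without claiming to reproduce the exact bytes in ZUN's files or the exact shape of the hotpatch – this is the documented structure of a Roland GS Reverb Macro message, including the checksum that decides whether such a message is valid in the first place:

#include <array>
#include <cstdint>

// Roland checksum: the 7-bit sum of all address and data bytes plus the
// checksum itself must be a multiple of 128.
constexpr uint8_t RolandChecksum(uint8_t addr_hi, uint8_t addr_mid, uint8_t addr_lo, uint8_t data)
{
	return ((128 - ((addr_hi + addr_mid + addr_lo + data) % 128)) % 128);
}

// Complete GS DT1 SysEx message that sets the Reverb Macro (address 40 01 30).
// A `macro` value of 0x07 selects Panning Delay, the type these soundtracks rely on.
constexpr std::array<uint8_t, 11> ReverbMacroSysEx(uint8_t macro)
{
	return {
		0xF0, 0x41, 0x10, 0x42, 0x12, // SysEx start, Roland, device ID 10H, GS, DT1
		0x40, 0x01, 0x30,             // address: Reverb Macro
		macro,                        // 0x00–0x07
		RolandChecksum(0x40, 0x01, 0x30, macro),
		0xF7,                         // SysEx end
	};
}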
The game currently pauses when it loses focus, which also silences any currently playing MIDI notes. Thus, we can verify the active reverb type by switching between the game and VST windows:

Maximum volume recommended.
Still saying Panning Delay, even though we obviously hear the default reverb. A clear bug in the Sound Canvas VA UI.

Anything I forgot? Oh, right, download links:

Next up: You decide! This delivery has opened up quite a bit of budget, so this would be a good occasion to take a look at something else while we wait for a few more funded pushes to complete the Shuusou Gyoku Linux port. With the previous price increases having effectively raised the monetary value of earlier contributions, it might not always be obvious how much money is needed right now to secure another push. So I took a small bit out of the Anything funds to add the exact € amount to the crowdfunding log.
In the meantime, I'll see how far I can get with porting all of the previous SDL work back to Windows 98 within one push-equivalent microtransaction, and do some internal website work to address some long-standing pain points.