🚚 Summary of: P0266, P0267, P0268, P0269, P0270, P0271, P0272, P0273, P0274, P0275, P0276, P0277
Commits: (mly) eb2b0c8...b356884, (mly) b356884...1c70db0, (mly) 1c70db0...db0c195, (BGM packs) 2f9bce5...45087c2, (Seihou) P0256...a9ca081, (Seihou) a9ca081...8db918f, (Seihou) 8db918f...3de48ab, (Seihou) 3de48ab...9467705, (Seihou) 9467705...241a6c9, (Seihou) 241a6c9...P0275, (Seihou) dbc369f...883ac40, (Seihou) 883ac40...6ac72f3
💰 Funded by: Ember2528, [Anonymous]

📝 Over two years since the previous largest delivery, we've now got a new record in every regard: 12 pushes across 5 repos, 215 commits, and a blog post with over 14,000 words and 48 pieces of media. 😱 Who would have thought that the superficially simple task of putting SC-88Pro recordings into Shuusou Gyoku would actually mainly focus on deep research into the underlying MIDI files? I don't typically cover much music-related content because it's a non-issue as far as PC-98 Touhou code is concerned, so it's quite fitting how extensive this one turned out. So here we go, the result of virtually unlimited funding and patience:

  1. The SC-88Pro recording controversy
  2. Undefined SysEx behavior
  3. Resolving the controversy, and making a choice (contains personal opinion)
  4. A Unix-style command-line MIDI filter (in Rust BTW)
  5. Visualizing MIDI files (for science, and not for playing them on a keyboard)
  6. Shuusou Gyoku's individual loop quirks 🎺
  7. Rewriting pbg's MIDI code
  8. Putting together the BGM packs
  9. Outgrowing miniaudio (and raging about single-file C libraries for a while)
  10. Remaining implementation details
  11. Pricing changes (and no, not everything's getting more expensive)

So where's the controversy? Romantique Tp obviously made the best and most careful real-hardware SC-88Pro recordings of all of ZUN's old MIDIs, including the original (OST) and arranged (AST) soundtrack of Shuusou Gyoku, right? Surely all I have to do now is to cut them into seamless loops to save a bit of disk space, and then put them into the game? Let's start at the end of the track list with the name registration theme, since it's light on instruments and has an obvious loop point that will be easy to spot in the waveform. But, um… wait a moment, that very first drum note comes a bit late, doesn't it?

This can also be heard in Romantique Tp's YouTube upload.
At a notated tempo of 96 BPM, these first four beats should take exactly 2.5 seconds (4 beats × 60/96 s per beat), which they do in this seamlessly looping softsynth rendering.

That's… not quite the accuracy and perfection I was expecting. :thonk: But I think I know what we're seeing and hearing there. Let's look at the first few MIDI events on the drum channel:

Delta	Pulse	 Beat	Channel	Event
 +540	   960	  2:000	      1	Controller { CC   0, value   0 }
   +0	   960	  2:000	      1	Controller { CC  32, value   0 }
   +0	   960	  2:000	      1	ProgramChange {  37 }
[…]
   +0	   960	  2:000	      2	Controller { CC   0, value   0 }
   +0	   960	  2:000	      2	Controller { CC  32, value   0 }
   +0	   960	  2:000	      2	ProgramChange {  19 }
[…]
   +0	   960	  2:000	      3	Controller { CC   0, value   0 }
   +0	   960	  2:000	      3	Controller { CC  32, value   0 }
   +0	   960	  2:000	      3	ProgramChange {   6 }
[…]
   +0	   960	  2:000	      4	Controller { CC   0, value   0 }
   +0	   960	  2:000	      4	Controller { CC  32, value   0 }
   +0	   960	  2:000	      4	ProgramChange {   2 }
[…]
Delta	Pulse	 Beat	Channel	Event
   +0	  960	2:000	     10	Controller { CC   0, value   0 }
   +0	  960	2:000	     10	Controller { CC  32, value   0 }
   +0	  960	2:000	     10	ProgramChange {  25 }
   +0	  960	2:000	     10	Controller { CC   7, value 127 }
   +0	  960	2:000	     10	Controller { CC  11, value 127 }
   +0	  960	2:000	     10	Controller { CC  10, value  64 }
   +0	  960	2:000	     10	Controller { CC  91, value  80 }
   +0	  960	2:000	     10	Controller { CC  93, value  40 }
   +0	  960	2:000	     10	NoteOn { Key  42, Vel.  94 }
   +0	  960	2:000	     10	NoteOn { Key  36, Vel. 110 }
   +1	  961	2:001	     10	NoteOn { Key  42, Vel.   0 }
   +0	  961	2:001	     10	NoteOn { Key  36, Vel.   0 }
 +119	 1080	2:120	     10	NoteOn { Key  42, Vel.  34 }
   +1	 1081	2:121	     10	NoteOn { Key  42, Vel.   0 }
 +119	 1200	2:240	     10	NoteOn { Key  42, Vel.  64 }
   +0	 1200	2:240	     10	NoteOn { Key  36, Vel.  64 }
Also, the fact that GS doesn't put its drums on a non-general voice bank and instead relies on external channel configuration to differentiate drums from pitched instruments is making this Yamaha kid uncontrollably furious. 🤬

Yup. That's the sound of a vintage hardware synth being slow and taking a two-digit number of milliseconds to process a barrage of simultaneous Program Change messages, playing a MIDI file that doesn't take this reality into account and expects program changes to happen instantly.
I can only speak from my own experience of writing MIDIs for hardware synths here, but having the first note displaced by 50 ms is very much not the way a composer would have intended the music to be heard if the note is clearly notated to occur on the beat. If you had told me about such an issue when playing one of my MIDIs on a certain synth, I would have thanked you for the bug report! And I would have promptly released a fixed version of the MIDI with the Program Change events moved back by a beat or two. In the case of Shuusou Gyoku's MIDIs, this wouldn't even have added any additional delay in-game, as all of these files already start with at least one beat of leading silence to make room for setting Roland-specific synth parameters.

OK, but that's just a single isolated bass drum hit. If we wanted to, we could even fix this issue ourselves by splicing the same note from around the loop end point. Maybe this is just an isolated case and the rest of Romantique Tp's recordings are fine? Well…

Again, check Romantique Tp's YouTube upload for proof.
By the way, this seamless audio player is what consumed most of the two website pushes this time. The rest went to the slightly redesigned main page, whose progress bars now use the cap bar style and the GitHub badge colors.

This one is even worse. Here, the delay is so long relative to the tempo of the piece that the intended five drum hits pretty much turn into four.

This type of issue doesn't even have to be isolated to the very beginning of a piece. A few of the tracks in both the OST and AST start with an anacrusis on just one or two channels and leave the Program Change event barrage at the beginning of the first full measure. In 幻想科学 ~ Doll's Phantom for example, this creates a flam-like glitch where the bass on channel 2 is pretty much on time, but the crash hit on channel 10 only follows 50 ms later, after the SC-88Pro took its sweet time to process all the Program Change events on the channels between:

This is from the arranged soundtrack for a change. In that one, ZUN at least fixed the issue in the final three MIDIs (シルクロードアリス, 魔女達の舞踏会, and 二色蓮花蝶 ~ Ancients) that closed out this rearranging project in May 2001, which spread out their per-channel setup events over at least a single measure before playing any note.

Let's listen to that at half speed:

Romantique Tp's YouTube upload.
Still on point.

Sure, all of this is barely noticeable in casual listening, but very noticeable if you're the one who now has to cut these recordings into seamless loops. And these are just the most obvious timing issues that can be easily pinpointed and documented – the actual worst aspects are all the minor tempo and timing fluctuations throughout most of the pieces. With recordings that deviate ever so slightly from the tempo defined in the MIDI files, you can no longer rely on mathematically exact sample positions when cutting loops. Even if those positions do work out from time to time, there'd pretty much always be a discontinuity in the waveform at both ends of the loop, manifesting as a clearly audible click. In the end, the only way of finding good loop points in existing recordings involves straining your ears and listening very, very closely to avoid any audible glitches. 😩

But if you've taken a look at the second tabs in the clips above, you will have noticed that we don't necessarily have to be stuck with recordings from real hardware. In late 2015, Roland released Sound Canvas VA, a VST plugin that emulates the classic core of Roland's old Sound Canvas lineup, including the SC-88Pro. As long as we run such a software synthesizer through a quality VST host, a purely software-based solution should be way superior for recording looped BGM:

Any drawbacks? For our use case, all of them are found in the abysmal software quality of everything around the synth engine. As is typical for the VST industry, Sound Canvas VA is excessively DRM'd – it takes multiple seconds to start up, and even then only allows a single process to run at any given time, immediately quitting every process beyond the first one with a misleading Parameter File1 Read Error message box. I totally believe anyone who claims that this makes SCVA more annoying than real hardware when composing new music. Retro gamers also dislike how Roland themselves no longer sell the 32-bit builds they used to offer for the first few versions. These old versions are now exclusively available through resellers, or on the seven seas.
But as far as the SC-88Pro emulation is concerned, there don't seem to be any technical reasons against it. There is a long thread over at VOGONS discussing all sorts of issues, but you have to dig quite deep to find any clear descriptions of bugs in SCVA's synth engine. Everything I found either only applies to the SC-55 emulation and not the SC-88Pro, was fixed by Roland in the meantime, or turned out to be a fixable bug in a MIDI file.

Nevertheless, Romantique Tp has a very negative opinion about SCVA, getting quite angry and defensive in this instance where someone favorably compared SCVA to their recordings. Edit (2024-03-10): These days, Romantique Tp has a much more favorable opinion on SCVA as well.
8 years after their release, however, the community unanimously accepts the Romantique Tp recordings as the intended way to listen to ZUN's old MIDIs, so choosing Sound Canvas VA for our Shuusou Gyoku builds might be a bad idea purely for PR reasons. At best, people would slightly wonder why I intentionally went with the opposite of the accepted reference recordings, but at worst, this entire project could face a violent backlash…


But wait, we've already heard one obvious difference between the real SC-88Pro and Sound Canvas VA. Let's listen to the very first clip again:

Ha! You can clearly hear a panning echo in the real-hardware recording that is missing from the Sound Canvas VA rendering. That's an obvious case of a core system effect not being reproduced correctly. If even that's undeniably broken, who knows which other subtle bugs SCVA suffers from, right? Case closed, Romantique Tp was right all along, SCVA is trash, real hardware reigns supreme :godzun:

Actually, let's look closer into this one. Panning delay effects like this are typically reverb-related, but General MIDI only specifies a single controller to specify the per-channel reverb level from 0 to 127. Any specific characteristics of the reverb therefore have to be configured using vendor-specific system-exclusive messages, or SysEx for short.
So it's down to one of the four SysEx messages at the beginning of the MIDI file:

Delta	Pulse	 Beat	Event
   +0	    0	0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)
 +240	  240	0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)
 +120	  360	0:360	SysEx(41 10 42 12 40 01 33 0F 7D F7)
  +60	  420	0:420	SysEx(41 10 42 12 40 01 34 30 5B F7)

Since these byte strings represent Roland-specific instructions, we can't learn anything from a raw MIDI event dump alone here. No problem though, let's just load these files into some old MIDI sequencer that targeted Roland synths, open its MIDI event list, and then they will be automatically decoded into a human-readable representation…
…or at least that's what I expected. In Yamaha land, XGworks has done that for Yamaha's own XG SysEx messages ever since 1997:

Screenshot of the MIDI Event Viewer in Yamaha's XGworks, showing off its automatic XG SysEx decoding feature.
No configuration required. You can even edit the textual Value1 representation and XGworks parses it back into the closest supported value!

But for Roland synths, there's… nothing similar? Seriously? 😶 Roland fanboys, how do you even live?! I mean, they are quick to recommend the typical bloated and sluggish big-name DAWs that take up multiple gigabytes of disk space, but none of the ones I tried seemed to have this feature. They can't have possibly been flinging around raw byte strings for the past 33 years?!
But once you look more into today's MIDI community, it becomes clear that this is exactly what they've been doing. Why else would so many people use the word complicated to describe Roland SysEx, or call it an old school/cryptic communication protocol in hexadecimal format? The latter is particularly hilarious because if you removed the word cryptic, this might as well describe all of MIDI, not just SysEx. :tannedcirno: Everything about this is a tooling issue, and Yamaha showed how easily it could have been solved. Instead, we get Sound Canvas experts, who should know more about the ecosystem than I do, making the incredible mental leap from "my DAW doesn't decode or easily generate SysEx" to "SysEx is antiquated" to "please just lift up these settings to the VST level and into my proprietary DAW's proprietary project format, that would be so much better"

Thankfully that's not entirely true. After some more digging and configuration, I found a somewhat workable solution involving a comparatively modern sequencer called Domino:

  1. Download either Domino's original Japanese version or the partial English translation. The .zip file on the release page contains a full standalone build.
  2. Open the File → Preferences menu and associate your MIDI output device with a module map. This makes sense for SysEx encoding/generation since it can limit the options in the UI to what's actually available on your target hardware, but it's also required for loading the respective SysEx map into Domino's SysEx decoder. There is no technical reason for this requirement, as SC-88Pro SysEx messages can be uniquely identified by the three vendor, device, and model ID bytes that every message starts with – but that would have been too easy and user-friendly. The perception of SysEx being a black art must be upheld at all costs.
    Screenshot of Domino's MIDI-OUT window, complete with garbled text
    I've kept the garbled text of the partial translation to emphasize the sheer amount of jank involved in this entire process.
  3. Load a MIDI file and let Domino "analyze" it:
    Screenshot of Domino's analysis message box
  4. Strangely enough, this will take quite a while – on my system, this analysis step runs at a speed of roughly 4.25 KB/s of MIDI data. Yes, kilobytes.
  5. Unfortunately, "control change macro restoration" also seems to mean that you don't get to see any raw bytes when selecting the respective MIDI track in the UI, but at least we get what we were looking for:
    Screenshot of the four SysEx messages of タイトルドメイド, Shuusou Gyoku's name registration theme, as decoded by Domino
    …for the most part?
    Pulse	Event
        0	SysEx(41 10 42 12 40 00 7F 00 41 F7)
      240	SysEx(41 10 42 12 40 01 30 14 7B F7)
      360	SysEx(41 10 42 12 40 01 33 0F 7D F7)
      420	SysEx(41 10 42 12 40 01 34 30 5B F7)

Alright, that's something we can work with. The GS Reset message is something that every Roland GS MIDI should start with, but it's immediately followed by a message that Domino failed to decode? The two subsequent reverb parameters make sense, but panning delays typically have more parameters than just a reverb level and time.
That unknown SysEx message shares much of the same bytes with the decoded ones though. So let's do what we maybe should have done all along, return to caveman, and check the SC-88Pro manual:

The relevant section from page 194. We can see how the address and value correspond to bytes 5-7 and 8 in the SysEx messages. Byte 9 is a checksum and byte 10 signals the end of the message.
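
Since byte 9 is only an arithmetic checksum over the address and value bytes, we can verify it ourselves. Here's a quick Rust sketch of the scheme described in the manual (the function name is mine):

/// Verifies the checksum of a Roland DT1 SysEx message, given the bytes
/// after the F0 status byte as they appear in the dumps above, e.g.
/// [41 10 42 12 40 01 30 14 7B F7]. Only the three address bytes and the
/// data byte(s) count toward the checksum.
fn roland_checksum_ok(msg: &[u8]) -> bool {
    let address_and_data = &msg[4..(msg.len() - 2)]; // skip IDs, checksum, F7
    let sum: u32 = address_and_data.iter().map(|&b| u32::from(b)).sum();
    ((128 - (sum % 128)) % 128) == u32::from(msg[msg.len() - 2])
}

fn main() {
    // ZUN's undecodable message: address 40 01 30, value 14h.
    // 0x40 + 0x01 + 0x30 + 0x14 = 0x85 → 128 - (0x85 % 128) = 0x7B ✓
    let msg = [0x41, 0x10, 0x42, 0x12, 0x40, 0x01, 0x30, 0x14, 0x7B, 0xF7];
    assert!(roland_checksum_ok(&msg));
}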

And that's where we find what this particular issue boils down to. The missing SysEx message is clearly intended to be a Reverb Macro command, whose value can range from 0 to 7 inclusive on the SC-88Pro, but ZUN tries to specify Reverb Macro #14h, or 20 in decimal. The SC-88Pro manual does not specify what happens if a SysEx message wants to write an invalid value to a valid address, which means that we've firmly entered the territory of undefined behavior.
Edit (2024-03-10): Romantique Tp confirmed that the real SC-88Pro clamps these Reverb Macro IDs to the supported range of 0-7. Therefore, the appropriate course of action for guaranteeing the same sound on other Roland synths would be to fix the MIDI file and specify Reverb Macro #7 instead. But since this behavior remains technically undefined, we can still argue about ZUN's intention behind specifying the Reverb Macro like this:

In fact, 32 out of the 39 MIDIs across both of Shuusou Gyoku's soundtracks use this invalid Reverb Macro. The only ones that don't are


And that's where this quest seemed to end, until Romantique Tp themselves came in and suggested that I take a closer look at the GS Advanced Editor, or GSAE for short.

The splash screen of GSAE version 4.01e.
Make sure to connect a MIDI input device before starting GSAE, or it will silently crash immediately after this splash screen. At least it accepts any controller, so this might just be a bug instead of the typical user-hostile kind of hardware dongle DRM that is pervasive in today's synth industry. 1999 would seem a bit too early for that, thankfully.

I was aware of this tool, but hadn't initially considered it because it's always described as just a SysEx generator/encoder. In fact, the very existence of such a tool made no sense to me at first, and seemed to prove my point that the usability of GS SysEx was wholly inferior to what I was used to in Yamaha land. Like, why not build at least a tiny and stripped-down MIDI sequencer around this functionality that would allow you to insert SC-88Pro-specific messages at any point within a sequence, and not just the beginning? I can see the need for such a tool in today's world of closed-source DAWs where hardware MIDI modules are niche and retro and are only kept alive by a small community of enthusiasts. But why would its developers guarantee that MIDI composers would have to hop between programs even back in 1997? I can only imagine that they saw how every just slightly advanced MIDI sequencer or DAW back then already used its own project format instead of raw Standard MIDI Files, and assumed that composers would therefore be program-hopping anyway?
However, GSAE does support the import of settings from a MIDI file and features a SysEx history window that decodes every newly processed Roland SysEx byte string, which is all I was looking for. So let's throw in that same MIDI and…

Screenshot of GSAE's SysEx history window, showing the results of sending a GS Reverb Macro #20 message
That's the result of sending just the single F0 41 10 42 12 40 01 30 14 7B F7 message at the top.

Now that's some wild numbers. An equally invalid Reverb Character, and Reverb Level and Time values that even exceed their defined range of 0-127? Could it be that GSAE emulates the real-hardware response to invalid Reverb Macros here, and gives us the exact reverb setting we can hear in Romantique Tp's recording? This could even be the reason why GSAE is still used and recommended within today's Roland MIDI sequencing scene, and hasn't been supplanted by some more modern open-source tool written by the community.

In any case, these values have to come from somewhere, so let's reverse-engineer GSAE and figure out the logic behind them. Shoutout to IDR for being a great help with its automatic generation of IDC debug symbols for the Delphi standard library, and even including a few names of application-level widget class methods by reading Delphi-specific type information from the binary. This little sub-project made me also come around to appreciating Ghidra, whose decompiler and data type manager helped a lot and allowed me to find the relevant code section within just a few hours.
A~nd it turns out that the values all come from out-of-bounds accesses into arrays on the stack. :onricdennat: If we combine 25, 235, and 132 back into a 32-bit value, we get 0x19EB84, which is the virtual address of the relevant function's stack frame base pointer.
But it gets even more hilarious: If you enable debug text output via Option → Other Options → SMF → Insert text events to setup measures and export these imported settings back into a MIDI file, GSAE not only retains these invalid Reverb Macro IDs, but stringifies them via a simple lookup into a hardcoded string pointer array, again without any bounds checks. The effects of this are roughly what you would expect:

In the end, we have Domino not decoding the Reverb Macro message, and GSAE, the premier SysEx tool for Roland synths, responding to it in even more undefined and clearly bugged ways than real hardware apparently does. That's two programs confirming that whatever ZUN intended was never supposed to work reliably. And while we still don't know exactly what these reverb parameters are supposed to be, these observations solve the mystery as far as I'm concerned, and solidify my personal opinion on the matter.


So what do we do now, and which version do we go with? Optimally, I'd offer both versions and turn this controversy into a personal choice so that everybody wins… and Ember2528 agreed and generously provided all the funding to make it happen. 💸
If you haven't picked your favorite yet, here are some final arguments:

The Romantique Tp recordings certainly have something going for them with their provenance of coming from real hardware, and the care that Romantique Tp put into manually recording every single track, warts and all. I wholeheartedly agree that preserving the raw sound of playing the MIDI files into the hardware without thinking about bugs or quirks is an important angle to take when it comes to preservation. It's good that these recordings exist – after all, you wouldn't know which musical elements you'd possibly be missing in an emulation if you have nothing to compare it to. Even the muffled sound in the half-speed clip above can be an argument in their favor, as the SC-88Pro's DAC operates at 32 kHz and you wouldn't expect any meaningful frequency content between 16,000 and 22,050 Hz to begin with. Any frequency content in that range that does remain in Romantique Tp's recording is simply 📝 rolled-off imaging noise added during the ADC's resampling process.
All this is why they are a definite improvement over kaorin's 2007 recordings of only the AST, which used to be the previous reference recordings within the community. Those had all of the same timing issues and more, in addition to being so excessively volume-boosted that 0.15% of the samples across the entire soundtrack ended up clipped. That's 6.25 seconds out of 68:39m being lost to pure digital noise.

Most importantly though: ZUN himself said that only the real SC-88Pro will play back these files as he intended them to sound. This quote is likely where the tagline of Romantique Tp's entire recording project came from in the first place:

> 全てのデエタはSC-88ProもしくはSC-8850(ロオランド社)にて最適に聴けるように調整してあります
> それ以外の音源でも、作者の意図した音ではない場合があります。
>
> (All data has been adjusted to be optimally listened to on an SC-88Pro or SC-8850 (Roland). On other sound sources, it may not be the sound that the composer intended.)
>
> — ZUN on 東方幻想的音楽, his old MIDI page

However. ZUN is not exactly known for accurately and carefully preserving the legacy of his series, or really doing anything beyond parading his old games as unobtainable showpieces at conventions. With all the issues we've seen, preferring real hardware is ultimately just that: an angle, and a preference. This is why I disagree with the heavy and uncritical advertising that is mainly responsible for elevating the Romantique Tp recordings to their current reference status within the community, especially if at least half of the alleged superiority of real hardware is founded on undefined behavior that can easily be fixed in the MIDI files themselves if people only bothered to look.

Here's where I stand: MIDI files are digital sheet music first and foremost, not an inferior version of tracker modules where the samples are sold separately. As such, the specific synth a MIDI file was written for is merely a secondary property of the composition – and even more so if the MIDI file contains little to nothing in terms of sound design and mostly restricts itself to the basic feature set of General MIDI. In turn, synth quirks and bugs are not a defined part of the composition either, unless they are clearly annotated and documented in the file itself. And most importantly: If the MIDI file specifies a certain timing and a recording fails to reproduce that timing, then that recording is not an accurate representation of the MIDI file.
In that regard, Sound Canvas VA is not only the closest alternative to the real thing, as a few people in the MIDI and retrogaming scene do have to admit, but superior to the real thing. I'll gladly take clarity and perfect timing accuracy in exchange for minor differences in effects, especially if the MIDI file does not explicitly and correctly define said effects to begin with. If I want a panning delay as part of the reverb, I add the respective and correct SysEx message to define one – and if I don't, I do not care about the reverb. You might still get a panning delay on a certain synth, and you might even prefer how it sounds, but it's ultimately a rendering artifact and not a consciously intended part of the composition. In that way, it's similar to the individual flavor a musician adds to a performance of a piece of classical music.
And as far as the differences in frequency response and resonant filters are concerned: In Yamaha land, these are exactly the main distinguishing factors between vintage WF-192XG sound cards (resembling the real SC-88Pro in these characteristics) and the S-YXG50 softsynth (resembling SCVA). Once I found out about that softsynth and how much clearer it sounded in comparison, I sold that old PCI sound card soon after.

In the interest of preservation though, there's still one more unexplored solution that could be the ideal middle ground between the two approaches:

  1. Play the MIDIs through a real-hardware SC-88Pro again
  2. Capture the actually observed system-exclusive settings that fall within the synth's supported and documented ranges
  3. Insert them back into the MIDI file, creating a new bugfixed version
  4. Re-record that bugfixed version through Sound Canvas VA

Edit (2024-03-10): And since Romantique Tp has confirmed what exactly happens on real hardware, I'm going to do exactly that. These bugfixed Sound Canvas VA renderings will be a free bonus of the single next Shuusou Gyoku push, and will add another angle to the preservation of these soundtracks. In the meantime though, the Sound Canvas VA packs will sound like they do in the preview videos above.

Or, you know… Maybe none of this actually matters. Here's beatMARIO streaming some Shuusou Gyoku gameplay using what looks like a real-hardware SC-8850, which plays these MIDIs with occasionally noticeably different instrument patches and no panning delay in the name registration theme, and he still enjoyed every second of it. Imagine undefined SysEx behavior not even being consistent within the same family of Roland synths… nah, I'm done arguing, let's get back to the actual work and cut some loops.


Just to be clear: I'm not suggesting that Romantique Tp should have been the one to cut their recordings into loops, or even just the one who defined where the loop points are supposed to be. On the surface, this seems to be a non-issue, and you'd just pick a point wherever each track appears to loop, right? But with 39 MIDIs to cut and all the financial support from Ember2528, it made sense to also solve this problem more thoroughly, and algorithmically detect provably correct loop points for all of these files. Who knows, maybe we even find some surprises that make it all worth it?
This is the algorithm I came up with:
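
Condensed to its core idea: a loop is proven by finding two adjacent, equally long event ranges that are exactly identical – just like the #[117, 121[ and [121, 125[ ranges in the example output further below. A rough Rust sketch of that test (my simplification; the real implementation additionally distinguishes note space from recording space):

/// Returns (start, length) of the longest event range that is immediately
/// followed by an identical copy of itself. `E` stands for whatever
/// equality-comparable event representation is used, e.g. a pair of
/// (delta time, message).
fn find_loop<E: PartialEq>(events: &[E]) -> Option<(usize, usize)> {
    let mut best = None;
    for len in 1..=(events.len() / 2) {
        for start in 0..=(events.len() - (2 * len)) {
            let first = &events[start..(start + len)];
            let second = &events[(start + len)..(start + (2 * len))];
            if first == second && best.map_or(true, |(_, l)| len > l) {
                best = Some((start, len));
            }
        }
    }
    best
}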

Of course, this algorithm isn't perfect and won't work for every MIDI file out there. It doesn't consider things like differently ordered events within the same MIDI pulse, (non-)registered parameter numbers, or the effect that SysEx messages can have on the state of individual channels. The latter would require the general SysEx decoding logic that I would have liked to have for the research above… actually, let's file an issue and add the project to the order form. I'd really like to see a comprehensive open-source cross-vendor SysEx decoder library in my lifetime.

As for the implementation, I was happy to write some Rust again for a change, as it's a great fit for these standalone greenfield command-line tools that don't have to directly interact with the legacy C++ code bases that this project usually deals with. It's even better if the foundational functionality is not just available in a crate, but in four, with the community already having gone through multiple iterations to arrive at a tried and tested winner. Who knows, maybe I even get to rewrite this website in it one day? Just for the sheer meme value of doing so, of course.
I also enjoyed this a lot from a technical point of view:

This algorithm works well for the long MIDI files of Shuusou Gyoku's OST that all contain multiple duplicates of their loop section, but it quickly reaches its limit with the AST. Following the classic two-loop + fade-out format, that soundtrack was meant to be played back in generic MIDI players, and not to actually be put back into the game in looped form. Since the loop algorithm did, in fact, find inconsistencies even in the OST, two copies of the apparent loop are sometimes not enough to prove cases where the actual loop ends much later than you think it does. In a few cases, it would be enough to simply remove all volume change events from the fade-out to prove the actual loop, but in others, the algorithm would need MIDI event data far past the end of the fade-out.

However, just giving up and not looping any of these tracks would be equally unfortunate. So how about shifting the question, from what's the best loop in this MIDI file to what's the best loop if the MIDI didn't fade out and instead repeated its apparent second loop a third time? As long as the detected loop in such a pre-processed file ends before the repeated range, it's still a valid loop in terms of the unmodified original.
Ideally, we want to do this pre-processing programmatically with the same Rust library instead of manually editing the MIDI. Many sequencers (and especially XGworks) apply significant changes to a MIDI file's internal structure when saving its internal representation back to a MIDI file, which might even mess with our loop algorithm. So it would be very nice to have a more trustworthy tool that applies only the edit we actually want, and perfectly retains the rest of the MIDI.

And that's how this sub-project turned into a small suite of command-line MIDI operations in the classic Unix filter/pipeline style: Each command reads a MIDI file from stdin, transforms it, and outputs text or the resulting MIDI file on stdout. This way, we gain maximum transparency and reproducibility as I can document the unique pre-processing steps for each AST track by simply providing the command lines. And sure, we're re-encoding and re-decoding the full MIDI sequence at every step along such a pipeline, but computers are fast, Rust and the midly library in particular are ⚡ blazingly fast ⚡, and the usability benefits of this pipeline model far outweigh any theoretical performance drops.
Here's the full list of commands that made it into the resulting mly tool:

This feature set should strike a good balance between not spending too much of the Shuusou Gyoku budget on tangential problems, but still offering a decent solution for the problem at hand. As a counterexample, the obvious killer feature – deserializing a dump back into a Standard MIDI File – would have gone way past the budget. While there are crates that free you from the need to write manual parsing code for basic data structures, they would instead require a lot of attribute boilerplate – and if the library that provided the structures doesn't already come with these attributes, you now have to duplicate all the structures, and convert back and forth between the original structures and your copies. Not to mention that we'd still have to write code for the high-level structure of the dump output…

If we put it all together, this is what we can do:

$ <ssg_02.mid mly loop-find
Best loop in note space: 4 events (between event #[117, 121[ and [121, 125[)
First note: event    71 / pulse    960 / beat   2:000 / 0:00:800m
Loop start: event   117 / pulse   1680 / beat   3:240 / 0:01:400m
  Loop end: event   121 / pulse   1920 / beat   4:000 / 0:01:600m

$ <ssg_02.mid mly cut 466: | mly loop-unfold 240: | mly -r 44100 loop-find
Track #0: Removing events #[16439, 19881[
Track #0: Repeating events #[8344, 16439[ at the end of the sequence
Best loop in note space: 8095 events (between event #[5625, 13720[ and [13720, 21815[)
First note: event    71 / pulse    960 / beat   2:000 / 0:00:800m
Loop start: event  5625 / pulse  75361 / beat 157:001 / 1:03:531m
  Loop end: event 13720 / pulse 183841 / beat 383:001 / 2:34:726m

Best loop in recording space:  8095 events (between event #[5709, 13804[ and [13804, 21899[)
First note: event    71 / pulse    960 / beat   2:000 / 0:00:800m / sample    35280.00
Loop start: event  5709 / pulse  77280 / beat 161:000 / 1:05:163m / sample  2873667.66
  Loop end: event 13804 / pulse 185760 / beat 387:000 / 2:36:358m / sample  6895375.27

Translation:


So, where are these loop quirks that justify why some of these audio files are longer than you'd think they should be? Just listing them as text wouldn't really communicate just how minor these are. It would be much nicer to visualize them in a way that highlights the exact inconsistencies within a fixed range of MIDI measures. Screenshots of MIDI sequencer or DAW windows won't capture these aspects all too well because these programs are geared toward fine-grained editing of single tracks, not visualization of details across all channels.

Screenshot of the first 8 measures of Shuusou Gyoku's Stage 1 theme (フォルスストロベリー) in its OST version, as visualized by REAPER's piano roll
REAPER's piano roll nicely snaps to a certain range, but good luck picking out the individual lines from the single volume lane at the bottom of the screen, or spotting a 7-point difference. Not to mention that CC #11 (Expression) makes up an equal part of a channel's final perceived volume, which is the metric we'd actually want to visualize.

Typical MIDI visualizers, however, are on the complete opposite end of the spectrum. In recent years, MIDI visualization has become synonymous with the typical Synthesia style of YouTube videos with a big keyboard at the bottom, note bars flying in from the top, and optional fancy effects once those notes hit the top of the keyboard. The Black MIDI community in particular has been churning out tons of identical-looking MIDI visualizers that mainly seem to differ in the programming language they're written in, and in how well they can cope with the blackest of black MIDIs.
Thankfully, most of these visualizers are open-source and have small and manageable codebases. The project with the most GitHub stars and the most generic name seemed to be the best starting point for hacking in the missing features, despite using GLSL shaders which I had no prior experience with. It was long overdue that I did something with GLSL though – it added a nice educational aspect to these hacks, and it still was easier than deciphering whatever the fastest and hyper-optimized Rust visualizer is doing.
Still, this visualizer needed a total of 18 small features and bugfixes to be actually usable for demonstrating Shuusou Gyoku's loop quirks. As such, these hacks turned into yet another tangential sub-project that could have easily consumed another two pushes if I cleaned up the code and published the result. But that would have really gone way past the budget for something that people might not even care about. So here's what we're going to do:


Alright then! Here's how to read the visualizations:



Before we package up these looped soundtracks, let's take a quick look at how they would be shown off in the Music Room. The Seihou Music Rooms carry over the per-channel keyboards from TH05, add the current per-channel volume, expression, and pan pot values, and top it off with a fake spectrum analyzer. All of these visualizations rely on MIDI data, and the Music Room would feel very dull and boring without them. Just look at Kioh Gyoku, whose Music Room basically turns into a still image in WAVE mode.
Retaining these visualizations even when playing waveform BGM was very important for me, and not just because it would make for a unique high-quality feature that would break new ground. It can also double as proof that the waveform versions are, in fact, in perfect sync with both the MIDIs they are based on, and, by extension, the respective stage scripts.
However, this would require the game to process the MIDIs and update the internal visualization state without simultaneously playing them back through the WinMM / MME / midiOut*() API. And just like graphics and text rendering, Shuusou Gyoku's original code came with zero architectural separation between platform-independent processing logic and platform-specific playback…

So I accidentally rewrote almost the entire MIDI code to achieve said separation. :tannedcirno: This also provided a great occasion to modernize this code and add some much-needed robustness for potential MIDI mods, while retaining the original code's approach of iterating over raw SMF byte streams. It might all have been very excessive for a delivery that was supposed to be just about waveform BGM support, but on the plus side, MIDI output is now portable to any other system's MIDI API as well.

Surprisingly though, it was Shuusou Gyoku's original MIDI timing that quickly turned out to be rather inaccurate, and not the waveforms. The exact numbers vary depending on the piece, but the game played back every MIDI about 1% slower than notated, adding about 2 or 3 seconds to their total playback time after 5 minutes. Tempo changes in particular were the biggest causes of desynchronizations with the waveforms… :thonk:
To understand how this can happen to begin with, we have to look closer at how you're supposed to use the midiOut*() API. This API is as low-level as it gets, only covering the transmission of a single MIDI message to the selected output device right now. There is no concept of note timing at this low level, so it's completely up to the program to parse delta times and tempo change events out of the MIDI file and correctly time the calls to this API for each MIDI message. With all the code that runs between the API and the actual renderer of the synth for every single message, the resulting timing can only ever be an approximation of the MIDI file. This doesn't really matter for the timescales and polyphony levels of typical music because, again, computers are fast, but such an API is fundamentally unsuitable for accurately playing back even just a moderately complex million-note Black MIDI. :onricdennat:

Shuusou Gyoku handles this required manual timing in the simplest possible way: It runs a MIDI processing function (Mid_Proc() in the code) at an interval of 10 ms, which processes and instantly sends out all MIDI events that have occurred at any point within the last 10 ms, maintaining merely their order. This explains not only why the original game incremented its MIDI TIMER by multiples of 10, but also the infamous missing drums when playing the soundtrack through the Microsoft GS Wavetable Synth:

But while sending MIDI events in such quantized chunks might not be perfect, it can't be the cause behind multi-second playback slowdowns. Instead, this issue has to boil down to the way Shuusou Gyoku times each individual message, and specifically how it converts between MIDI pulse units and real-time (milli)seconds. pbg's original MIDI code chose to do this in an equally confusing and inaccurate way: it kept two counters that tracked the current MIDI pulse before and after the latest tempo change, used the value of the latter counter to decide which events to process, and only added the pulse equivalent of 10 ms to this counter at the end of Mid_Proc() in the then current tempo. The commit message for my rewritten algorithm details the problems with this approach using nice ASCII art in case you're interested, but in short, the main problem lies in how the single final addition can only consider a single tempo change within each call to Mid_Proc(). If a MIDI file contains tempo ramps with less than 10 ms between each different tempo, the original game would only use the last of these tempo values as the basis for converting the entire 10 ms back into MIDI pulses. Not to mention that maybe MIDI pulses aren't the best unit in a game that still 📝 treats the FPU as lava and doesn't use any fixed-point means of increasing the resolution of the 10 ms→pulse division either…

On the contrary, it's much more accurate to immediately convert every encountered MIDI delta time to a real-time quantity and use that unit for event timing, especially if we want to restrict ourselves to integer math. Signed 64-bit integers are enough to fit the product of the slowest possible MIDI tempo ((2²⁴ - 1) µs per quarter note) and the highest possible MIDI delta time (2²⁸ - 1) at nanosecond precision (×10³), with one bit to spare. Then, we arrive at a much simpler timing algorithm:
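
Sketched in Rust (my reconstruction with invented names, not the actual rewritten code): every delta time gets converted to nanoseconds at the tempo in effect, and each Mid_Proc() call adds its elapsed real time to a single signed budget that due events are then subtracted from:

/// (2²⁴ - 1) µs per quarter note × (2²⁸ - 1) delta pulses × 1,000 fits
/// into an i64, so the conversion gets by with pure integer math.
fn delta_to_ns(delta_pulses: i64, tempo_us_per_quarter: i64, ppqn: i64) -> i64 {
    (delta_pulses * tempo_us_per_quarter * 1_000) / ppqn
}

/// One Mid_Proc()-style call. `pending` holds (delta time, µs per quarter
/// note in effect) pairs; a real implementation would update the tempo
/// while iterating, so every delta is converted at the correct tempo.
fn process(budget_ns: &mut i64, elapsed_ns: i64, pending: &mut Vec<(i64, i64)>, ppqn: i64) {
    *budget_ns += elapsed_ns;
    while let Some(&(delta, tempo)) = pending.first() {
        let due_ns = delta_to_ns(delta, tempo, ppqn);
        if *budget_ns < due_ns {
            break; // not due yet; the remainder carries over to the next call
        }
        *budget_ns -= due_ns;
        pending.remove(0); // send_event() would go right here
    }
}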

The additive nature of this timer not only naturally allows more than one event to happen within a single Mid_Proc() call, but also averages out any minor timing inconsistencies across the length of a track.

This new algorithm did improve the overall timing accuracy, but only barely, shaving off just ≈100 ms of the total duration. Turns out that the main source behind the slowness was hiding somewhere else entirely, in the single line that deserializes tempo values from MIDI's big-endian representation into the native integer format:

assert(length_of_tempo_message == 3);
uint32_t tempo = 0;
for(int i = 0; i < length_of_tempo_message; i++) {
-	tempo += ((tempo << 8) + (*track_data++));
+	tempo  = ((tempo << 8) + (*track_data++));
}

Yup – the original code performed two additions per byte, which incorrectly added the interim value at every byte to the final result, and yielded a tempo that is ≈0.8% / ≈1 BPM slower than notated in the MIDI file, matching the number we were looking for. That's why the |/OR operator is the safer one to use in such a bit-twiddling context…
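
To see the exact magnitude, take the standard tempo message for 120 BPM (the bytes 07 A1 20, i.e. 500,000 µs per quarter note). The buggy loop effectively computes tempo = (tempo × 257) + byte on every iteration:

fn main() {
    let tempo_bytes = [0x07_u32, 0xA1, 0x20]; // 500,000 µs per quarter note
    let (mut buggy, mut fixed) = (0_u32, 0_u32);
    for byte in tempo_bytes {
        buggy += (buggy << 8) + byte; // pbg: tempo = (tempo * 257) + byte
        fixed = (fixed << 8) + byte; // correct big-endian deserialization
    }
    assert_eq!(fixed, 500_000); // exactly 120 BPM
    assert_eq!(buggy, 503_752); // ≈0.75% longer quarter notes, ≈119.1 BPM
}
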
But now I'm curious. This is such a tiny bug that is bound to remain unnoticed until someone compares the game's MIDI output to another renderer. It must have certainly made it into other games whose MIDI code is based on Shuusou Gyoku's, or that pbg was involved with. And sure enough, not only did this bug survive Kioh Gyoku's OOP refactoring, but it even traveled into Windows Touhou, where it remained in every single game that supported MIDI playback. Now we know for a fact that pbg's Program Support role in the TH06 credits involved sharing ready-made, finished code with ZUN:

Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH06, TH07, TH08, TH09, and TH10.
The broken tempo deserialization in the respective latest full versions of TH06 through TH10. And yes, that's TH10 – even though TH09's trial version was the last game to ship MIDI versions of its soundtrack, TH10 still contained all of pbg's MIDI code that originated back in Shuusou Gyoku, before TH11 finally removed it.
Amusingly, ZUN's compiler even started optimizing the combination of left-shifting and addition to a multiplication with 257 for TH09, which even sort of highlights this bug if you're used to reading x86 ASM.

That leaves support for MIDI loop points as the only missing feature for syncing MIDI data with a looping waveform track. While it didn't require all too much code, pbg's original zero-copy approach of iterating over raw MIDI data definitely injected a lot of complexity into the required branches. Multi-track/SMF Type 1 files require quite a bit of extra thought to correctly calculate delta times across loop boundaries that reach past the end of the respective track, while still allowing the real-time delta values to be resynchronized at tempo changes within the loop – and yes, 3 of ZUN's 19 arranged MIDI files actually do use more than one track, so this wasn't just about maximizing MIDI compatibility for mods. I stuck to the original approach mostly as a challenge and to prove that it's possible without first parsing the entire MIDI sequence into a friendlier internal representation, but I absolutely do not recommend this to anyone else. :tannedcirno:

After hardcoding the loop points detected by mly into the binary, we only need to call Mid_Proc() once per frame in the Music Room and pass the frame delta time instead of the 10 ms constant. And then, we get this:

The MIDI TIMER now shows off the arguably more interesting current MIDI pulse value rather than just formatting the PASSED TIME in milliseconds. Ironically, displaying this value in a constantly counting way takes more effort now – the new nanosecond-based timing code doesn't use any measure of total MIDI pulses anymore, and they don't naturally fall out of the algorithm either. Instead, the code remembers the total pulse value of the last event it processed and adds the real-time duration that has passed since, similar to the original timing algorithm.
This naturally causes the timer to jump from the loop end pulse to the loop start pulse, proving that Mid_Proc() is in fact looping the sequence.
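
That display value boils down to the inverse of the delta time→nanosecond conversion; a sketch with hypothetical names:

/// Reconstructs a constantly counting pulse value for the MIDI TIMER out
/// of the nanosecond-based state: the pulse of the last processed event,
/// plus the real time passed since then, converted at the current tempo.
fn display_pulse(last_event_pulse: i64, ns_since: i64, tempo_us_per_quarter: i64, ppqn: i64) -> i64 {
    last_event_pulse + ((ns_since * ppqn) / (tempo_us_per_quarter * 1_000))
}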

Alright, now we know what to package:

Unfortunately, we still haven't reached the end of the complications and weird issues that haunt Shuusou Gyoku's music:

  1. The original game reads the in-game track title directly out of the first Sequence Name event of the playing MIDI file. The waveform equivalent would be the Vorbis comment TITLE tag, which therefore should exactly match the original track's title, down to the exact placement of whitespace. As usual, if I emphasize minor things like this, it's not without reason: 幻想科学 ~ Doll's Phantom inconsistently uses halfwidth spaces at both sides of the ~, and wouldn't fit into the Music Room's limited space otherwise.

  2. However, the AST MIDI files jam a bunch of other metadata into their Sequence Names, roughly following the format
    【 $title 】 from 秋霜玉  for sc88Pro comp.ZUN
    The track titles should definitely not appear in this format in-game, but how do we get rid of this format without hardcoding either the names or the magic to parse the names out of this format? :thonk:
  3. The absolute state of GS SysEx tooling rears its ugly head one final time in three of the AST MIDIs, which for some reason are missing the Roland vendor prefix byte in all of their SysEx messages and are therefore undeniably bugged. There even seemed to be another SysEx-related bug which Romantique Tp explained away, but not this one:

    ssg_04.mid

    0:000	SysEx(   10 42 12 40 00 7F 00 41 F7)
    0:240	SysEx(   10 42 12 40 01 30 14 7B F7)
    0:360	SysEx(   10 42 12 40 01 33 14 78 F7)
    0:420	SysEx(   10 42 12 40 01 34 50 3B F7)

    ssg_05.mid

    0:000	SysEx(   10 42 12 40 00 7F 00 41 F7)
    0:240	SysEx(   10 42 12 40 01 30 14 7B F7)
    0:360	SysEx(   10 42 12 40 01 33 00 0C F7)
    0:420	SysEx(   10 42 12 40 01 34 14 77 F7)

    ssg_10.mid

    0:000	SysEx(   10 42 12 40 00 7F 00 41 F7)
    0:240	SysEx(   10 42 12 40 01 30 14 7B F7)
    0:360	SysEx(   10 42 12 40 01 33 00 0C F7)
    0:420	SysEx(   10 42 12 40 01 34 60 2B F7)

    The same messages, with the Roland vendor prefix byte restored and decoded:

    ssg_04.mid

    0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)	GS Reset
    0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)	Reverb Macro #20
    0:360	SysEx(41 10 42 12 40 01 33 14 78 F7)	Reverb Level 20
    0:420	SysEx(41 10 42 12 40 01 34 50 3B F7)	Reverb Time 80

    ssg_05.mid

    0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)	GS Reset
    0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)	Reverb Macro #20
    0:360	SysEx(41 10 42 12 40 01 33 00 0C F7)	Reverb Level 0
    0:420	SysEx(41 10 42 12 40 01 34 14 77 F7)	Reverb Time 20

    ssg_10.mid

    0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)	GS Reset
    0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)	Reverb Macro #20
    0:360	SysEx(41 10 42 12 40 01 33 00 0C F7)	Reverb Level 0
    0:420	SysEx(41 10 42 12 40 01 34 60 2B F7)	Reverb Time 96
    The irony of using invalid Reverb Macros within already invalid SysEx messages is not lost on me.

    This is something we should fix even before running these files through Sound Canvas VA in order to render these with the reverb settings that ZUN clearly (and, for once, unironically) intended.

  4. For perfect preservation of the original BGM/gameplay synchronicity, it makes sense for the waveform versions to retain the leading 1 or 2 beats of silence that the original MIDI files use for their SysEx setup. While some of the AST tracks use a slightly different tempo compared to their OST counterparts, they would still be largely in sync as ZUN didn't rearrange the layout of their setup area… except for, once again, the three tracks used in the Extra Stage. :zunpet: Marisa's and Reimu's boss themes aren't too bad with their 4 beats of setup, but シルクロードアリス takes the cake with a whopping 12 beats of leading silence. That's 5 seconds from the start of the Extra Stage to the first note you'd hear. 🐌

2) and 4) could theoretically be worked around in Shuusou Gyoku's MIDI code, but there's no way around editing the MIDI files themselves as far as 3) is concerned. Thus, it makes sense to apply all of the workarounds to the AST MIDIs as part of the BGM build process – parsing the titles out of the 【brackets】, inserting the Roland vendor prefix byte where necessary, and compressing the setup bars in the Extra Stage themes to match their OST counterparts. Adding any hidden magic to the MIDI code would only have needlessly increased complexity and/or annoyed some modder in the future who would then have to work around it.
Ideally, these edits would involve taking the mly dump output, performing the necessary replacements at a plaintext level, and rebuilding the result back into a MIDI file, bu~t we're unfortunately missing the latter feature. Luckily, someone else had the same idea 13 years ago and wrote a tool in C that does exactly what we need. Getting it to compile in 2024 only required fixing a typical C thing… why are students and boomers defending this antique of a language again? 🙄
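
For illustration, the two MIDI-level edits boil down to very little code. A Rust sketch of both (names are mine, and a real tool would operate on the mly dump as described above):

/// Parses the in-game title out of the AST Sequence Name format, e.g.
/// 【 $title 】 from 秋霜玉  for sc88Pro comp.ZUN → $title.
fn ast_title(sequence_name: &str) -> Option<&str> {
    let inner = sequence_name.split('【').nth(1)?.split('】').next()?;
    Some(inner.trim())
}

/// Restores the Roland vendor ID in a SysEx payload that is missing it.
/// The GS checksum only covers the address and data bytes, so it stays
/// valid; only the event's variable-length size field grows by one byte.
fn fix_roland_sysex(payload: &mut Vec<u8>) {
    const ROLAND_ID: u8 = 0x41;
    if payload.first() != Some(&ROLAND_ID) {
        payload.insert(0, ROLAND_ID);
    }
}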

The single most glaring issue, however, is the drastic difference in volume between the individual tracks in both soundtracks. While Romantique Tp had to normalize each track to the maximum possible volume individually as a consequence of the recording process, the Sound Canvas VA renderings reveal just how inconsistent the volume levels of these MIDI files really are:

The peak amplitudes of every track in both soundtracks, as rendered by Sound Canvas VA at maximum volume. Looking at these, you might think that kaorin's 2007 recordings were purposely trying to preserve the clipping that would come out of an SC-88Pro if you don't manually adjust the volume knob for each song, but those recordings are still much louder than even these numbers.

So how do we interpret this? Is this a bug, because no one in their right mind would want their music to clip on purpose, and that in turn means that everything about these volume levels is arbitrary and unintentional? Or is this a quirk, and ZUN deliberately chose these volume levels for compositional reasons? It certainly would make sense for the name registration theme.
Once again, the AST version of シルクロードアリス is the worst offender in this regard as well, but it might also provide some evidence for the quirk interpretation. The fact that almost all of its MIDI channels blast away at full volume might have been an accident that could have gone unnoticed if the volume knob of ZUN's SC-88Pro was turned rather low during the time he arranged this piece, but the excessive left-panning must have been deliberate. Even Romantique Tp agrees:

Stereo waveform of the Sound Canvas VA rendering of Shuusou Gyoku's Extra Stage theme (シルクロードアリス), highlighting the excessive left-panningStereo waveform of Romantique Tp's recording of Shuusou Gyoku's Extra Stage theme (シルクロードアリス), highlighting the excessive left-panning
It might have even made compositional sense if Silk Road Alice was supposed to be a "Western-style piece", but it's not. :zunpet:

And that's with the volume already normalized. Because this one channel of this one track is almost twice as loud as anything else in the AST, we would consequently have to bring down the volume of every other arranged track and the right channel of the same track by almost 50% if we wanted to maintain the volume differences between the individual tracks of the AST. In the process, we lose almost one entire bit of dynamic range. At this rate, you might even consider remixing and remastering the entire thing, but that would involve so many creative decisions to definitely fall into fanfiction territory…

However, normalizing each track to a peak level of 0 dBFS makes much more sense for in-game playback if you consider how loud Shuusou Gyoku's sound effects are. Once again, the best solution would involve offering both versions, but should we really add two more SCVA BGM packs just to cover volume differences? :thonk:
ReplayGain solves this exact problem for regular music listening in a non-destructive way by writing the per-track and per-album gain levels into an audio file's metadata. Since we need metadata support for titles anyway, we can do something similar, albeit not exactly the same for two reasons:

And so, we hard-apply the volume-level gain during the conversion from 32-bit float to FLAC to preserve the volume differences between the tracks, calculate the track-level GAIN FACTOR based on the resulting peak levels, add a volume normalization toggle to the Sound / Config menu, enable it by default, and thus make everyone happy. ✅
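
Conceptually, both gain levels are simple peak reciprocals; a sketch of the idea (names are mine):

/// Hard-applied to every track during the 32-bit float → FLAC conversion,
/// preserving the volume differences between the tracks:
fn album_gain(loudest_peak_of_any_track: f32) -> f32 {
    1.0 / loudest_peak_of_any_track
}

/// Stored per track as the GAIN FACTOR and only applied at playback time
/// if the volume normalization toggle in the Sound / Config menu is on:
fn track_gain_factor(track_peak_after_album_gain: f32) -> f32 {
    1.0 / track_peak_after_album_gain // rescales this track's peak to 0 dBFS
}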

The final interesting tidbit in building these packages can be found in the way the Sound Canvas VA recordings are looped. When manually cutting loops, you always have to consider that the intro might end with unique notes that aren't present at the end of the loop, which will still be fading out at the calculated loop start point. This necessitates shifting the loop start point by a few bars until these notes are no longer audible – or you could simply ignore the issue because ZUN's compositions are so frantic that no one would ever notice. :onricdennat:
With the separate intro and loop files generated by mly, on the other hand, the reverb/release trails are immediately visible and, after trimming trailing silence, exactly define the number of samples that the calculated loop start point needs to be shifted by. The .loop file then always remains exactly as long, in samples, as the duration of the loop reported by mly. If a piece happens to have a constant tempo whose beat duration corresponds to an integer number of samples (at 44,100 Hz, a constant 150 BPM works out to exactly 17,640 samples per beat, for example), we get some very satisfying, round loop durations out of this process. ☺️


So let's play it all back in-game… and immediately run into two unexpected miniaudio limitations, what the…?!

  1. miniaudio uses a fixed linear function for its fade-out envelope, and doesn't offer anything else? We might not even want a logarithmic one this time because symmetry with MIDI's simple quadratic curve would be neat, but we sure don't want a linear function – those stay near the original volume for too long, and then turn quiet way too quickly (see the sketch after this list).
  2. There is no way to access FLAC metadata from miniaudio's public API, even though the library bundles the author's own FLAC library which has this feature?

📝 Back when I evaluated miniaudio, I alluded that I consider single-file C libraries to be massively overrated, and this is exactly why: Once they grow as massive as miniaudio (how ironic), they can quickly lead to their authors treating their dependencies as implementation details and melting down the interfaces that would naturally arise. In a regular library, dr_flac would be a separate, proper dependency, and the API would have a way to initialize a stream from an externally loaded drflac object. But since the C community collectively pretends that multi-file libraries are a burden on other developers, miniaudio ended up with dr_flac copy-pasted into its giant single file, with a silly ma_ namespacing prefix added to all its functions. And why? Did we have to move so far in the other direction just because CMake doesn't support globbing? That's a symptom of CMake not actually solving any problem, not a valid architectural decision that libraries should bend around. 🙄
So unless we fork and hack around in miniaudio, there's now no way around depending on a second, regular copy of dr_flac. Which has now led to the same project organization bloat that single-file libraries originally set out to prevent…

Sigh. At this rate, it makes more sense to just copy-paste and adapt the old BGM streaming code I wrote for thcrap in late 2018, which used dr_flac directly, and extend it with metadata support. With the streaming code moved out of the platform layer and into game logic, it also makes much more sense to implement the squared fade-out curve at that same level instead of copy-pasting and adjusting an unhealthy amount of miniaudio's verbose C code.
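A quadratic fade-out is barely any code either. A minimal sketch of that curve applied at the game-logic level, with made-up names:

#include <stddef.h>

// Quadratic fade-out gain, mirroring MIDI's simple quadratic curve.
// [t] is the linear fade progress, from 0.0 (fade start) to 1.0 (silence).
inline float SquaredFadeGain(float t) {
	const float remaining = (1.0f - t);
	return (remaining * remaining);
}

// Applies the fade to a buffer of decoded float samples, advancing the
// progress value by [t_step] per sample.
void ApplyFadeOut(float* samples, size_t count, float t, float t_step) {
	for (size_t i = 0; i < count; i++) {
		samples[i] *= SquaredFadeGain(t);
		t += t_step;
	}
}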
While I'm doing the same for the old Vorbis streaming code, it would also make sense to rewrite that one to use stb_vorbis instead of the old libogg+libvorbis reference libraries. There's no need to add two more dependencies if miniaudio already comes with stb_vorbis.c, and that library is widely acclaimed. So, integration should be a breeze, right?
Well, surprise, rarely have I seen a C library so actively hostile toward being integrated. Both of its API variants are completely unreasonable:

What happened to the tried-and-true idea of providing a structure with read, tell, and seek callbacks, and then providing an optional variant for C FILE* handles if you absolutely must? Sure, the whole point of Vorbis is to be small and nobody these days would care about spending a few MB on keeping an entire Vorbis file in memory, but come on. If pulldata made the deliberate and opinionated choice to only support buffers of complete Vorbis streams and argued in the name of simplicity that hand-coded disk streaming isn't worth it in this day and age, I might have even been convinced. And this is from the guy who popularized the concept of single-file C libraries in the first place? :thonk:
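For reference, the kind of interface this paragraph is wishing for can be as simple as three function pointers – a hedged sketch, not any existing library's actual API:

#include <stddef.h>

// Any decoder built on top of these callbacks can stream from plain files,
// memory buffers, packfiles, or archives alike.
struct StreamCallbacks {
	// Reads up to [size] bytes into [buf]; returns the number of bytes read.
	size_t (*read)(void* user, void* buf, size_t size);

	// Returns the current byte position within the stream.
	size_t (*tell)(void* user);

	// Seeks to the given absolute byte position; returns false on failure.
	bool (*seek)(void* user, size_t offset);

	void* user; // opaque pointer to the underlying file or buffer state
};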

Oh well, tupblocks go brrr. libvorbis definitely shows its age with all the old command-line tools in the lib/ directory that were never moved out of there and that we now have to remove from our glob. But even that just adds a single line to the Tupfile, and then we get to enjoy its much friendlier API. That sure beats the almost 800 lines of code that miniaudio had to write to integrate stb_vorbis… which I can't even link because the file is too big for GitHub. 🤷
At this point, it would have even made sense to upgrade from a 24-year-old lossy codec to an 11-year-old lossy codec and use Opus instead, since the enforced 48,000 Hz sampling rate is a non-issue when you control the entire audio pipeline. But let's keep compatibility with existing thcrap mods for now.

The last time I added dependencies, 📝 I wondered whether just downloading and extracting official Windows binary builds might be superior to pasting batch script duct tape over the usability issues of Git submodules. However, I first wanted to try out Git's sparse checkout feature in an attempt to remove all the unneeded bloat… and as it turned out, this might just be the idealistic and perfect nirvana of vendoring libraries in C++ projects. I particularly like how the limitations of its default mode (always checking out all files within each directory level that shows up in a filter) can be turned into a guideline about how to structure a repository: All non-essential stuff that consumers of your code might not need – tests, high-level documentation, or optional features – should go into a subdirectory where it can be easily filtered.
And that's how the size of our libs/ directory went down from 82.7 MiB in the P0256 build to 30.4 MiB in the P0275 build, despite adding 4 more libraries in the latter. Now if only this didn't require even more duct tape to actually set up shallow clones correctly

In the end, the Windows build ended up using only a single one of the miniaudio features that DirectSound doesn't have, and that's the ability to use the more modern WASAPI instead of DirectSound. We're still going to use miniaudio for the Linux port, but as far as Windows is concerned, it would be quite nice to backport BGM streaming to the game's original DirectSound backend. The P0275 build is pushing 1 MiB of binary size for a game that originally came in a 220 KiB binary, so it would remove a noticeable amount of bloat from GIAN07.EXE, but it would also allow waveform BGM to work in the Windows 98-compatible i586 build. If that sounds cool to you, this is the issue you want to fund.


That only left some logic and UI busywork to put it all together, which means that we've almost reached the end of things to talk about! Here's what it all looks like:



After half a year of being bought out way past the cap, I've finally got some small room left for new orders again. If it weren't for this blog post and the required research and web development work, this delivery would have probably come out in early January, taking half the time it ended up taking. So I really have to start factoring the blog posts into the push prices in a better and fairer way.
Meanwhile, the hate toward my day job only keeps growing, but there's little point in looking for a new one as long as ReC98 remains this motivating and complex. It leaves pretty much no cognitive room for any similarly demanding job. Thus, I want 2024 to be the year where ReC98 either becomes profitable enough to be my only full-time job, or where we conclusively find out that it can't, I go look for a better day job, and ReC98 shifts to a slower pace. Here's the plan:

With the new price of per push, this means that there's now a small window in which you can get a full push worth of functionality for , until the current cap is filled up again.

Next up: Probably TH02's endings to relax a bit. Maybe we're also getting some new Touhou-related contributions?

📝 Posted:
🚚 Summary of:
P0252, P0253, P0254, P0255, P0256, P0257
Commits:
(Seihou) P0251...e98feef, (Seihou) e98feef...24df71c, (Seihou) 24df71c...b7b863b, (Seihou) b7b863b...2b8218e, (Seihou) 2b8218e...P0256, (Website) b79c667...e2ba49b
💰 Funded by:
Arandui, Ember2528, [Anonymous]
🏷 Tags:

And now we're taking this small indie game from the year 2000 and porting its game window, input, and sound to the industry-standard cross-platform API with "simple" in its name.

Why did this have to be so complicated?! I expected this to take maybe 1-2 weeks and result in an equally short blog post. Instead, it raised so many questions that I ended up with the longest blog post so far, by quite a wide margin. These pushes ended up covering so many aspects that could be interesting to a general and non-Seihou-adjacent audience, so I think we need a table of contents for this one:

  1. Evaluating Zig
  2. Visual Studio doesn't implement concepts correctly?
  3. Reusable building blocks for Tup
  4. Compiling SDL 2
  5. The new frame rate limiter
  6. Audio via SDL or SDL_mixer? (Nope, neither)
  7. miniaudio
  8. Resampling defective sound effects (including FLAC not always being lossless)
  9. Joypad input with SDL
  10. Restoring the original screenshot feature
  11. Integer math in hand-written ASM

Before we can start migrating to SDL, we of course have to integrate it into the build somehow. On Linux, we'd ideally like to just dynamically link to a distribution's SDL development package, but since there's no such thing on Windows, we'd like to compile SDL from source there. This allows us to reuse our debug and release flags and ensures that we get debug information, without needing to clone build scripts for every C++ library ever in the process or something.
So let's get my Tup build scripts ready for compiling vendored libraries… or maybe not? Recently, I've kept hearing about a hot new technology that not only provides the rare kind of jank-free cross-compiling build system for C/C++ code, but innovates by even bundling a C++ compiler into a single 279 MiB package with no further dependencies. Realistically replacing both Visual Studio and Tup with a single tool that could target every OS is quite a selling point. The upcoming Linux port makes for the perfect occasion to evaluate Zig, and to find out whether Tup is still my favorite build system in 2023.

Even apart from its main selling point, there's a lot to like about Zig:

However, as a version number of 0.11.0 might already suggest, the whole experience was then bogged down by quite a lot of issues:

So for the time being, I still prefer Tup. But give it maybe two or three years, and I'm sure that Zig will eventually become the best tool for resurrecting legacy C++ codebases. That is, if the proposed divorce of the core Zig compiler from LLVM isn't an indication that the productive parts of the Zig community consider the C/C++ building features to be "good enough", and are about to de-emphasize them to focus more strongly on the actual Zig language. Gaining adoption for your new systems language by bundling it with a C/C++ build system is such a great and unique strategy, and it almost worked in my case. And who knows, maybe Zig will already be good enough by the time I get to port PC-98 Touhou to modern systems.

(If you came from the Zig wiki, you can stop reading here.)


A few remnants of the Zig experiment still remain in the final delivery. Had that experiment worked out, I would have had to immediately change the execution encoding to UTF-8, and decompile a few ASM functions exclusive to the 8-bit rendering mode which we could have otherwise ignored. While Clang does support inline assembly with Intel syntax via -fms-extensions, it has trouble with ; comments and instructions like REP STOSD, and if I have to touch that code anyway… (The REP STOSD function translated into a single call to memcpy(), by the way.)

Another smaller issue was Visual Studio's lack of standard library header hygiene, where #including some of the high-level STL features also includes more foundational headers that Clang requires to be included separately, but I've already known about that. Instead, the biggest shocker was that Visual Studio accepts invalid syntax for a language feature as recent as C++20 concepts:

// Defines the interface of a text rendering session class. To simplify this
// example, it only has a single `Print(const char* str)` method.
template <class T> concept Session = requires(T t, const char* str) {
	t.Print(str);
};

// Once the rendering backend has started a new session, it passes the session
// object as a parameter to a user-defined function, which can then freely call
// any of the functions defined in the `Session` concept to render some text.
template <class F, class S> concept UserFunctionForSession = (
	Session<S> && requires(F f, S& s) {
		{ f(s) };
	}
);

// The rendering backend defines a `Prerender()` method that takes the
// aforementioned user-defined function object. Unfortunately, C++ concepts
// don't work like this: The standard doesn't allow `auto` in the parameter
// list of a `requires` expression because it defines another implicit
// template parameter. Nevertheless, Visual Studio compiles this code without
// errors.
template <class T, class S> concept BackendAttempt = requires(
	T t, UserFunctionForSession<S> auto func
) {
	t.Prerender(func);
};

// A syntactically correct definition would use a different constraint term for
// the type of the user-defined function. But this effectively makes the
// resulting concept unusable for actual validation because you are forced to
// specify a type for `F`.
template <class T, class S, class F> concept SyntacticallyFixedBackend = (
	UserFunctionForSession<F, S> && requires(T t, F func) {
		t.Prerender(func);
	}
);

// The solution: Defining a dummy structure that behaves like a lambda as an
// "archetype" for the user-defined function.
struct UserFunctionArchetype {
	void operator ()(Session auto& s) {
	}
};

// Now, the session type disappears from the template parameter list, which
// even allows the concrete session type to be private.
template <class T> concept CorrectBackend = requires(
	T t, UserFunctionArchetype func
) {
	t.Prerender(func);
};
Here's a Godbolt link, configured with both Visual Studio and Clang compilers.

What's this, Visual Studio's infamous delayed template parsing applied to concepts, because they're templates as well? Didn't they get rid of that 6 years ago? You would think that we've moved beyond the age where compilers differed in their interpretation of the core language, and that opting into a current C++ standard turns off any remaining antiquated behaviors…


So let's actually get my Tup build scripts ready for compiling vendored libraries, because the 📝 previous 70 lines of Lua definitely weren't. For this use case, we'd like to have some notion of distinct build targets that can have a unique set of compilation and linking flags. We'd also like to always build them in debug and release versions even if you only intend to build your actual program in one of those versions – with the previous system of specifying a single version for all code, Tup would delete the other one, which forces a time-consuming and ultimately needless rebuild once you switch to the other version.

The solution I came up with treats the set of compiler command-line options like a tree whose branches can concatenate new options and/or filter the versions that are built on this branch. In total, this is my 4th attempt at writing a compiler abstraction layer for Tup. Since we're effectively forced to write such layers in Lua, it will always be a bit janky, but I think I've finally arrived at a solid underlying design that might also be interesting for others. Hence, I've split off the result into its own separate repository and added high-level documentation and a documented example. And yes, that's a Code Nutrition label! I've wanted to add one of these ever since I first heard about the idea, since it communicates nicely how seriously such an open-source project should be taken. Which, in this case, is actually not all too seriously, especially since development of the core Tup project has all but stagnated. If Zig does indeed get better and better at being a Clang frontend/build system, the only niches left for Tup will be Visual Studio-exclusive projects, or retrocoding with nonstandard toolchains (i.e., ReC98). Quite ironic, given Tup's Unix heritage…
Oh, and maybe general Makefile-like tasks where you just want to run specific programs. Maybe once the general hype swings back around and people start demanding proper graph-based dependency tracking instead of just a command runner


Alright, alternatives evaluated, build system ready, time to include SDL! Once again, I went for Git submodules, but this time they're held together by a batch file that ensures that the intended versions are checked out before starting Tup. Git submodules have a bad rap mainly because of their usability issues, and such a script should hopefully work around them? Let's see how this plays out. If it ends up causing issues after all, I'll just switch to a Zig-like model of downloading and unzipping a source archive. Since Windows comes with curl and tar these days, this can even work without any further dependencies, and will also remove all the test code bloat.

Compiling SDL from a non-standard build system requires a bit of globbing to include all the code that is being referenced, as well as a few linker settings, but it's ultimately not that big of a deal. I'm quite happy that it was possible at all without pre-configuring a build, but hey, that's what maintaining a Visual Studio project file does to a project. :tannedcirno:
By building SDL with the stock Windows configuration, we then end up with exactly what the SDL developers want us to use… which is a DLL. You can statically link SDL, but they really don't want you to do that. So strongly, in fact, that they not merely argue how well the textbook advantages of dynamic linking have worked for them and gamers as a whole, but implemented a whole dynamic API system that enforces overridable dynamic function loading even in static builds. Nudging developers to their preferred solution by removing most advantages from static linking by default… that's certainly a strategy. It definitely fits with SDL's grassroots marketing, which is very good at painting SDL as the industry standard and the only reliable way to keep your game running on all originally supported operating systems. Well, at least until SDL 3 is so stable that SDL 2 gets deprecated and won't receive any code for new backends…

However, dynamic linking does make sense if you consider what SDL is. Offering all those multiple rendering, input, and sound backends is what sets it apart from its more hip competition, and you want to have all of them available at any time so that SDL can dynamically select them based on what works best on a system. As a result, everything in SDL is being referenced somewhere, so there's no dead code for the linker to eliminate. Linking SDL statically with link-time code generation just prolongs your link time for no benefit, even without the dynamic API thwarting any chance of SDL calls getting inlined.
There's one thing I still don't like about all this, though. The dynamic API's table references force you to include all of SDL's subsystems in the DLL even if your game doesn't need some of them. But it does fit with their intention of having SDL2.dll be swappable: If an older game stopped working because of an outdated SDL2.dll, it should be possible for anyone to get that game working again by replacing that DLL with any newer version that was bundled with any random newer game. And since that would fail if the newer SDL2.dll was size-optimized to not include some of the subsystems that the older game required, they simply removed (or de-prioritized) the possibility altogether. Maybe that was their train of thought? You can always just use the official Windows DLL, whose whole point is to include everything, after all. 🤷

So, what do we get in these 1.5 MiB? There are:

Unfortunately, SDL 2 also statically references some newer Windows API functions and therefore doesn't run on Windows 98. Since this build of Shuusou Gyoku doesn't introduce any new features to the input or sound interfaces, we can still use pbg's original DirectSound and DirectInput code for the i586 build to keep it working with the rest of the platform-independent game logic code, but it will start to lag behind in features as soon as we add support for SC-88Pro BGM or more sophisticated input remapping. If we do want to keep this build at the same feature level as the SDL one, we now have a choice: Do we write new DirectInput and DirectSound code and get it done quickly but only for Shuusou Gyoku, or do we port SDL 2 to Windows 98 and benefit all other SDL 2 games as well? I leave that for my backers to decide.


Immediately after writing the first bits of actual SDL code to initialize the library and create the game window, you notice that SDL makes it very simple to gradually migrate a game. After creating the game window, you can call SDL_GetWindowWMInfo() to retrieve HWND and HINSTANCE handles that allow you to continue using your original DirectDraw, DirectSound, and DirectInput code and focus on porting one subsystem at a time.
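That handoff boils down to very little code. A hedged sketch, assuming SDL 2's SDL_syswm.h interface:

#include <SDL.h>
#include <SDL_syswm.h>

bool GetNativeHandles(SDL_Window* window, HWND& hwnd, HINSTANCE& hinstance) {
	SDL_SysWMinfo info;
	SDL_VERSION(&info.version); // must be filled in before the call
	if (!SDL_GetWindowWMInfo(window, &info) ||
		(info.subsystem != SDL_SYSWM_WINDOWS)) {
		return false;
	}
	hwnd = info.info.win.window;         // for DirectDraw/DirectSound/DirectInput
	hinstance = info.info.win.hinstance; // for any remaining Win32 calls
	return true;
}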
Sadly, D3DWindower can no longer turn SDL's fullscreen mode into a windowed one, but DxWnd still works, albeit a bit jankily, insisting on minimizing the game whenever its window loses focus. But in exchange, the game window can surprisingly be moved now! Turns out that the originally fixed window position had nothing to do with the way the game created its DirectDraw context, and everything to do with pbg blocking the Win32 "syscommand" that allows a window to be moved. By deleting a system menu… seriously?! Now I'm dying to hear the Raymond Chen explanation for how this behavior dates back to an unfortunate decision during the Win16 days or something.
As implied by that commit, I immediately backported window movability to the i586 build.

However, the most important part of Shuusou Gyoku's main loop is its frame rate limiter, whose Win32 version leaves a bit of room for improvement. Outside of the uncapped [おまけ] DrawMode, the original main loop continuously checks whether at least 16 milliseconds have elapsed since the last simulated (but not necessarily rendered) frame. And by that I mean continuously, and deliberately without using any of the Windows system facilities to sleep the process in the meantime, as evidenced by a commented-out Sleep(1) call. This has two important effects on the game:

Unsurprisingly, SDL features a delay function that properly sleeps the process for a given number of milliseconds. But just specifying 16 here is not exactly what we want:

  1. Sure, modern computers are fast, but a frame won't ever take an infinitely fast 0 milliseconds to render. So we still need to take the current frame time into account.
  2. SDL_Delay()'s documentation says that the wake-up could be further delayed due to OS scheduling.

To address both of these issues, I went with a base delay time of 15 ms minus the time spent on the current frame, followed by busy-waiting for the last millisecond to make sure that the next frame starts on the exact frame boundary. And lo and behold: Even though this still technically wastes up to 1 ms of CPU time, it still dropped CPU usage into the 0%-2% range during gameplay on my Intel Core i5-8400T CPU, which is over 5 years old at this point. Your laptop battery will appreciate this new build quite a bit.
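In code, the new limiter roughly looks like this – a sketch using SDL's millisecond-resolution timing functions, with names that don't match the actual codebase:

#include <SDL.h>
#include <stdint.h>

// Called at the end of every frame. [frame_start_ms] is the SDL_GetTicks()
// value sampled at the beginning of the current frame.
void WaitForFrameBoundary(uint32_t frame_start_ms) {
	const uint32_t FRAME_MS = 16;

	// Sleep for all but the last millisecond of the remaining frame time…
	const uint32_t elapsed = (SDL_GetTicks() - frame_start_ms);
	if ((elapsed + 1) < FRAME_MS) {
		SDL_Delay(FRAME_MS - 1 - elapsed);
	}
	// …then busy-wait up to the exact frame boundary to absorb any OS
	// scheduling inaccuracy, wasting at most ~1 ms of CPU time.
	while ((SDL_GetTicks() - frame_start_ms) < FRAME_MS) {
	}
}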


Time to look at audio then, because it sure looks less complicated than input, doesn't it? Loading sounds from .WAV file buffers, playing a fixed number of instances of every sound at a given position within the stereo field and with optional looping… and that's everything already. The DirectSound implementation is so straightforward that the most complex part of its code is the .WAV file parser.
Well, the big problem with audio is actually finding a cross-platform backend that implements these features in a way that seamlessly works with Shuusou Gyoku's original files. DirectSound really is the perfect sound API for this game:

The last point can't really be an argument against anything, but we'd still be left with 7 other boxes that a cross-platform alternative would have to tick. We already picked SDL for our portability needs, so how does its audio subsystem stack up? Unfortunately, not great:

OK, sure, but you're not supposed to use it for anything more than a single stream of audio. SDL_mixer exists precisely to cover such non-trivial use cases, and it even supports sound effect looping and panning with just a single function call! But as far as the rest of the library is concerned, it manages to be an even bigger disappointment than raw SDL audio:

There is a fork that does add support for an arbitrary number of music streams, but the rest of its features leave me questioning the priorities and focus of this project. Because surely, when I think about missing features in an audio backend, I immediately think about support for a vast array of chiptune file formats… 🤪
And wait, what, they merged this piece of bloat back into the official SDL_mixer library?! Thanks for opening up a vast attack surface for potential security vulnerabilities in code that would never run for the majority of users, just to cover some niche formats that nobody would seriously expect in a general audio library. And that's coming from someone who loves listening to that stuff!
At this rate, I'm expecting SDL_mixer to gain a mail client by the end of the decade. Hmm, what's the closest audio thing to a mail client… oh, right, WebRTC! Yeah, let's just casually drop a giant part of the Chromium codebase into SDL_mixer, what could possibly go wrong?

This dire situation made me wonder if SDL was the wrong choice for Shuusou Gyoku to begin with. Looking at other low-level cross-platform game libraries, you'll quickly notice that all of them come with mostly equally capable 2D renderers these days, and mainly differentiate themselves in minute API details that you'd only notice upon a really close look.
raylib is another one of those libraries and has been getting exceptionally popular in recent years, to the point of even having more than twice as many GitHub stars as SDL. By restricting itself to OpenGL, it can even offer an abstraction for shaders, which we'd really like for the 西方Project lens ball effect.
In the case of raylib's audio system, the lack of sound effect looping is the minute API detail that would make it annoying to use for Shuusou Gyoku. But it might be worth a look at how raylib implements all this if it doesn't use SDL… which turned out to be the best look I've taken in a long time, because raylib builds on top of miniaudio which is exactly the kind of audio library I was hoping to find. Let's check the list from above:

Oh, and it's written by the same developer who also wrote the best FLAC library back in 2018. And that's despite them being single-file C libraries, which I consider to be massively overrated…

The drawback? Similar to Zig, it's only on version 0.11.18, and also focuses on good high-level documentation at the expense of an API reference. Unlike Zig though, the three issues I ran into turned out to be actual and fixable bugs: Two minor ones related to looping of streamed sounds shorter than 2 seconds which won't ever actually affect us before we get into BGM modding, and a critical one that added high-frequency corruption to any mono sound effect during its expansion to stereo. The latter took days to track down – with symptoms like these, you'd immediately suspect the bug to lie in the resampler or its low-pass filter, both of which are so much more of a fickle and configurable part of the conversion chain here. Compared to that, stereo expansion is so conceptually simple that you wouldn't imagine anyone getting it wrong.
While the latter PR has been merged, the fix is still only part of the dev branch and hasn't been properly released yet. Fortunately, raylib is not affected by this bug: It does currently ship version 0.11.16 of miniaudio, but its usage of the library predates miniaudio's high-level API and it therefore uses a different, non-SSE-optimized code path for its format conversions.

The only slightly tricky part of implementing a miniaudio backend for Shuusou Gyoku lies in setting up multiple simultaneously playing instances for each individual sound. The documentation and answers on the issue tracker heavily push you toward miniaudio's resource manager and its file abstractions to handle this use case. We surely could turn Shuusou Gyoku's numeric sound effect IDs into fake file names, but it doesn't really fit the existing architecture where the sound interface just receives in-memory .WAV file buffers loaded from the SOUND.DAT packfile.
In that case, this seems to be the best way:


As a side effect of hunting that one critical bug in miniaudio, I've now learned a fair bit about audio resampling in general. You'll probably need some knowledge about basic digital signal behavior to follow this section, and that video is still probably the best introduction to the topic.

So, how could this ever be an issue? The only time I ever consciously thought about resampling used to be in the context of the Opus codec and its enforced sampling rate of 48,000 Hz, and how Opus advocates claim that resampling is a solved problem and nothing to worry about, especially in the context of a lossy codec. Still, I didn't add Opus to thcrap's BGM modding feature entirely because the mere thought of having to downsample to 44,100 Hz in the decoder was off-putting enough. But even if my worries were unfounded in that specific case: Recording the Stereo Mix of Shuusou Gyoku's now two audio backends revealed that apparently not every audio processing chain features an Opus-quality resampler…

If we take a look at the material that resamplers actually have to work with here, it quickly becomes obvious why their results are so varied. As mentioned above, Shuusou Gyoku's sound effects use rather low sampling rates that are pretty far away from the 48,000 Hz your audio device is most definitely outputting. Therefore, any potential imaging noise across the extended high-frequency range – i.e., from the original Nyquist frequencies of 11,025 Hz/5,512.5 Hz up to the new limit of 24,000 Hz – is still within the audible range of most humans and can clearly color the resulting sound.
But it gets worse if the audio data you put into the resampler is objectively defective to begin with, which is exactly the problem we're facing with over half of Shuusou Gyoku's sound effects. Encoding them all as 8-bit PCM is definitely excusable because it was the turn of the millennium and the resulting noise floor is masked by the BGM anyway, but the blatant clipping and DC offsets definitely aren't:

KEBARI TAME LASER LASER2 BOMB SELECT HIT CANCEL WARNING SBLASER BUZZ MISSILE JOINT DEAD SBBOMB BOSSBOMB ENEMYSHOT HLASER TAMEFAST WARP
Waveforms for all 20 of Shuusou Gyoku's sound effects, in the order they appear inside SOUND.DAT and with their internal names. We can see quite an abundance of clipping, as well as a significant DC offset in WARNING, BUZZ, JOINT, SBBOMB, and BOSSBOMB.

Wait a moment, true peaks? Where do those come from? And, equally importantly, how can we even observe, measure, and store anything above the maximum amplitude of a digital signal?

The answer to the first question can be directly derived from the Xiph.org video I linked above: Digital signals are lollipop graphs, not stairsteps as commonly depicted in audio editing software. Converting them back to an analog signal involves constructing a continuous curve that passes through each sample point, and whose frequency components stay below the Nyquist frequency. And if the amplitude of that reconstructed wave changes too strongly and too rapidly, the resulting curve can easily overshoot the maximum digital amplitude of 0 dBFS even if none of the defined samples are above that limit.

But I can assure you that I did not create the waveform images above by recording the analog output of some speakers or headphones and then matching the levels to the original files, so how did I end up with that image? It's not an Audacity feature either because the development team argues that there is no "true waveform" to be visualized as every DAC behaves differently. While this is correct in theory, we'd be happy just to get a rough approximation here.
ffmpeg's ebur128 filter has a parameter to measure the true peak of a waveform and fairly understandable source code, and once I looked at it, all the pieces suddenly started to make sense: For our purpose of only looking at digital signals, 💡 resampling to a floating-point signal with an infinite sampling rate is equivalent to a DAC. And that's exactly what this filter does: It picks 192,000 Hz and 64-bit float as a format that's close enough to the ideal of "analog infinity" for all practical purposes that involve digital audio, and then simply converts each incoming 100 ms of audio and keeps the sample with the largest floating-point value.
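If you want to reproduce such a measurement yourself, a single command along these lines should do it, printing the true peak as part of the filter's final summary. (Hedged from the filter documentation; the exact output format varies across ffmpeg versions.)

ffmpeg -nostats -i SOUND.wav -filter_complex ebur128=peak=true -f null -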

So let's store the resampled output as a FLAC file and load it into Audacity to visualize the clipped peaks… only to find all of them replaced with the typical kind of clipping distortion? 😕 Turns out that I've stumbled over the one case where the FLAC format isn't lossless and there's actually no alternative to .WAV: FLAC just doesn't support floating-point samples and simply truncates them to discrete integers during encoding. When we measured inter-sample peaks above, we weren't only resampling to a floating-point format to avoid any quantization to discrete integer values, but also to make it possible to store amplitudes beyond the 0 dBFS point of ±1.0 in the first place. Once we lose that ability, these amplitudes are clipped to the maximum value of the integer bit depth, and baked into the waveform with no way to get rid of them again. After all, the resampled file now uses a higher sampling rate, and the clipping distortion is now a defined part of what the sound is.
Finally, storing a digital signal with inter-sample peaks in a floating-point format also makes it possible for you to reduce the volume, which moves these peaks back into the regular, unclipped amplitude range. This is especially relevant for Shuusou Gyoku as you'll probably never listen to sound effects at full volume.

Now that we understand what's going on there, we can finally compare the output of various resamplers and pick a suitable one to use with miniaudio. And immediately, we see how they fall into two categories:

miniaudio only comes with a linear resampler – but so does DirectSound as it turns out, so we can get actually pretty close to how the game sounded originally:

All of Shuusou Gyoku's sound effects combined and resampled into a single 48,000 Hz / 32-bit float .WAV file, using GoldWave's File Merger tool. By converting to 32-bit float first and then resampling, the conversion preserved the exact frequency range of the original 22,050 Hz and 11,025 Hz files, even despite clipping. There are small noise peaks across the entire frequency range, but they only occur at the exact boundary between individual sound effects. These are a simple result of the discontinuities that naturally occur in the waveform when concatenating signals that don't start or end at a 0 sample.
As mentioned above, you'll only get this sound out of your DAC at lower volumes where all of the resampled peaks still fit within 0 dBFS. But you most likely will have reduced your volume anyway, because these effects would be ear-splittingly loud otherwise.
The result of converting 1️⃣ into FLAC. The necessary bit depth conversion from 32-bit float to 16-bit integers clamps any data beyond 0 dBFS (i.e., outside ±1.0f) to the discrete minimum and maximum values of such an integer, −32,768 and +32,767. The resulting straight lines at maximum amplitude in the time domain then turn into distortion across the entire 24,000 Hz frequency domain, which then remains a part of the waveform even at lower volumes. The locations of the high-frequency noise exactly match the clipped locations in the time-domain waveform images above.
The resulting additional distortion can be best heard in BOSSBOMB, where the low source frequency ensures that any distortion stays firmly within the hearing range of most humans.
All of Shuusou Gyoku's sound effects as played through DirectSound and recorded through Stereo Mix. DirectSound also seems to use a linear low-pass filter that leaves quite a bit of high-frequency noise in the signals, making these effects sound crispier than they should be. Depending on where you stand, this is either highly inaccurate and something that should be fixed, or actually good because the sound effects really benefit from that added high end. I myself am definitely in the latter camp – and hey, this sound is the result of original game code, so it is accurate at least in that regard. :tannedcirno:
All of Shuusou Gyoku's sound effects as converted by miniaudio and directly saved to a file, with the same low-pass filter setting used in the P0256 build. This first-order low-pass filter is a decent approximation of DirectSound's resampler, even though it sounds slightly crispier as the high-frequency noise is boosted a little further. By default, miniaudio would use a 4th-order low-pass filter, so this is the second-lowest resampling quality you can get, short of disabling the low-pass filter altogether.
Conversion results when using miniaudio's 8th-order low-pass filter for resampling, the highest quality supported. This is the closest we can get to the reference conversion without using a custom resampler. If we do want to go for perfect accuracy though, we might as well go for 1️⃣ directly?

These spectrum images were initially created using ffmpeg's -lavfi showspectrumpic=mode=combined:s=1280x720 filter. The samples appear in the same order as in the waveform above.

And yes, these are indeed the first videos on this blog to have sound! I spent another push on preparing the 📝 video conversion pipeline for audio support, and on adding the highly important volume control to the player. Web video codecs only support lossy audio, so the sound in these videos will not exactly match the spectrum image, but the lossless source files do contain the original audio as uncompressed PCM streams.


Compared to that whole mess of signals and noise, keyboard and joypad input is indeed much simpler. Thanks to SDL, it's almost trivial, and only slightly complicated because SDL offers two subsystems with seemingly identical APIs:

To match Shuusou Gyoku's original WinMM backend, we'd ideally want to keep the best aspects from both APIs but without being restricted to SDL_GameController's idea of a controller. The Joy Pad menu just identifies each button with a numeric ID, so SDL_Joystick would be a natural fit. But what do we do about directional controls if SDL_Joystick doesn't tell us which joypad axes correspond to the X and Y directions, and we don't have the SDL-recommended configuration UI yet? Doing that right would also mean supporting POV hats and D-pads, after all… Luckily, all joypads we've tested map their main X axis to ID 0 and their main Y axis to ID 1, so this seems like a reasonable default guess.

Fortunately, there is a solution for our exact issue. We can still try to open a joypad via SDL_GameController, and if that succeeds, we can use a function to retrieve the SDL_Joystick ID for the main X and Y axis, close the SDL_GameController instance, and keep using SDL_Joystick for the rest of the game.
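A hedged sketch of that dance – presumably, SDL_GameControllerGetBindForAxis() is the function in question:

#include <SDL.h>

// Returns the SDL_Joystick axis IDs for the main X and Y directions of the
// given device, falling back to the defaults observed on all tested pads.
void ProbeMainAxes(int device_index, int& axis_x, int& axis_y) {
	axis_x = 0;
	axis_y = 1;
	SDL_GameController* gc = SDL_GameControllerOpen(device_index);
	if (gc == nullptr) {
		return; // not a recognized controller – keep the default guess
	}
	const auto bind_x = SDL_GameControllerGetBindForAxis(gc, SDL_CONTROLLER_AXIS_LEFTX);
	const auto bind_y = SDL_GameControllerGetBindForAxis(gc, SDL_CONTROLLER_AXIS_LEFTY);
	if (bind_x.bindType == SDL_CONTROLLER_BINDTYPE_AXIS) {
		axis_x = bind_x.value.axis;
	}
	if (bind_y.bindType == SDL_CONTROLLER_BINDTYPE_AXIS) {
		axis_y = bind_y.value.axis;
	}
	// From here on, the game exclusively uses the SDL_Joystick API.
	SDL_GameControllerClose(gc);
}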
And with that, the SDL build no longer needs DirectInput 7, certain antivirus scanners will no longer complain about its low-level keyboard hook, and I turned the original game's single-joypad hot-plugging into multi-joypad hot-plugging with barely any code. 🎮

The necessary consolidation of the game's original input handling uncovered several minor bugs around the High Score and Game Over screen that I sufficiently described in the release notes of the new build. But it also revealed an interesting detail about the Joy Pad screen: Did you know that Shuusou Gyoku lets you unbind all these actions by pressing more than one joypad button at the same time? The original game indicated unbound actions with a [Button 0] label, which is pretty confusing if you have ever programmed anything because you now no longer know whether the game starts numbering buttons at 0 or 1. This is now communicated much more clearly.

Joypad button unbinding in the original version of Shuusou Gyoku, indicated by a rather confusing [Button 0] label
Joypad button unbinding in the P0256 build of Shuusou Gyoku, using a much clearer [--------] label
ESC is not bound to any joypad button in either screenshot, but it's only really obvious in the P0256 build.

With that, we're finally feature-complete as far as this delivery is concerned! Let's send a build over to the backers as a quick sanity check… a~nd they quickly found a bug when running on Linux and Wine. When holding a button, the game randomly stops registering directional inputs for a short while on some joypads? Sounds very much like a Wine bug, especially if the same pad works without issues on Windows.
And indeed, on certain joypads, Wine maps the buttons to completely different and disconnected IDs, as if it simply invents new buttons or axes to fill the resulting gaps. Until we can differentiate joypad bindings per controller, it's therefore unlikely that you can use the same joypad mapping on both Windows and Linux/Wine without entering the Joy Pad menu and remapping the buttons every time you switch operating systems.

Still, by itself, this shouldn't cause any issues with my SDL event handling code… except, of course, if I forget a break; in a switch case. 🫠
This completely preventable implicit fallthrough has now caused a few hours of debugging on my end. I'd better crank up the warning level to keep this from ever happening again. Opting into this specific warning also revealed why we haven't been getting it so far: Visual Studio did gain a whole host of new warnings related to the C++ Core Guidelines a while ago, including the one I was looking for, but actually getting the compiler to throw these requires activating a separate static analysis mode together with a plugin, which significantly slows down build times. Therefore I only activate them for release builds, since these already take long enough. :onricdennat:
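For the record, this is the class of bug in question, with hypothetical handler names; the specific warning is C26819, "Unannotated fallthrough between switch labels":

#include <SDL.h>

void OpenJoypad(Sint32 device_index); // hypothetical
void CloseJoypad(Sint32 instance_id); // hypothetical

void HandleJoypadEvent(const SDL_Event& event) {
	switch (event.type) {
	case SDL_JOYDEVICEADDED:
		OpenJoypad(event.jdevice.which);
		break; // ← the single statement whose absence cost those hours
	case SDL_JOYDEVICEREMOVED:
		CloseJoypad(event.jdevice.which);
		break;
	}
}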

But that wasn't the only step I took as a result of this blunder. In addition, I now offer free fixes for regressions in my mod releases if anyone else reports an issue before I find it myself. I've already been following this policy 📝 earlier this year when mu021 reported the unblitting bug in the initial release of the TH01 Anniversary Edition, and merely made it official now. If I was the one who broke a thing, I'll fix it for free.


Since all that input debugging already started a 5th push, I might as well fill that one by restoring the original screenshot feature. After all, it's triggered by a key press (and is thus related to the input backend), reads the contents of the frame buffer (and is thus related to the graphics backend), and it honestly looks bad to have this disclaimer in the release notes just because we're one small feature away from 100% parity with pbg's original binary.
Coincidentally, I had already written code to save a DirectDraw surface to a .BMP file for all the debugging I did in the last delivery, so we were basically only missing filename generation. Except that Shuusou Gyoku's original choice of mapping screenshots to the PrintScreen key did not age all too well:

As a result, both Arandui and I independently arrived at the idea of remapping screenshots to the P key, which is the same screenshot key used by every Windows Touhou game since TH08.

The rest of the feature remains unchanged from how it was in pbg's original build and will save every distinct frame rendered by the game (i.e., before flipping the two framebuffers) to a .BMP file as long as the P key is being held. At a 32-bit color depth, these screenshots take up 1.2 MB per frame, which will quickly add up – especially since you'll probably hold the P key for more than 1/60 of a second and therefore end up saving multiple frames in a row. We should probably compress them one day.
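That per-frame figure follows directly from the game's 640×480 resolution: 640 × 480 pixels at 4 bytes each add up to 1,228,800 bytes per .BMP, plus a few bytes of header.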


Since I already translated some of Shuusou Gyoku's ASM code to C++ during the Zig experiment, it made sense to finish the fifth push by covering the rest of those functions. The integer math functions are used all throughout the game logic, and are the main reason why this goal is important for a Linux port, or any port to a 64-bit architecture for that matter. If you've ever read a micro-optimization-related blog post, you'll know that hand-written ASM is a great recipe that often results in the finest jank, and the game's square root function definitely delivers in that regard, right out of the gate.
What slightly differentiates this algorithm from the typical definition of an integer square root is that it rounds up: In real numbers, √3 ≈ 1.73, so isqrt(3) returns 2 instead of 1. However, if the result is always rounded down, you can determine whether you have to round up by simply squaring the calculated root and comparing it to the radicand. And even that is only necessary if the difference between the two doesn't naturally fall out of the algorithm – which it actually does in Shuusou Gyoku's original ASM code, but pbg didn't realize this and squared the result regardless. :tannedcirno:

That's one suboptimal detail already. Let's call the original ASM function in a loop over the entire supported range of radicands from 0 to 2³¹ and produce a list of results that I can verify my C++ translation against… and watch as the function's linear time complexity with regard to the radicand causes the loop to run for over 15 hours on my system. 🐌 In a way, I've found the literal opposite of Q_rsqrt() here: Not fast, not inverse, no bit hacks, and surely without the awe-inspiring kind of WTF.
I really didn't want to run the same loop over a literal C++ translation of the same algorithm afterward. Calculating integer square roots is a common problem with lots of solutions, so let's see if we can go better than linear.

And indeed, Wikipedia also has a bitwise algorithm that runs in logarithmic time, uses only additions, subtractions, and bit shifts, and even ends up with an error term that we can use to round up the result as necessary, without a multiplication. And this algorithm delivers the exact same results over the exact same range in… 50 seconds. 🏎️ And that's with the I/O to print the first value that returns each of the 46,341 different square root results.
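Here's roughly what that looks like – a sketch of the Wikipedia algorithm with the round-up behavior bolted on, not the exact code that landed in the repository:

#include <stdint.h>

uint32_t SqrtCeil(uint32_t s) {
	uint32_t root = 0;
	uint32_t bit = (1u << 30); // the highest power of 4 that fits into 32 bits

	while (bit > s) {
		bit >>= 2;
	}
	while (bit != 0) {
		if (s >= (root + bit)) {
			s -= (root + bit);
			root = ((root >> 1) + bit);
		} else {
			root >>= 1;
		}
		bit >>= 2;
	}
	// [s] now holds the error term (radicand − root²). If it's nonzero, the
	// exact square root had a fractional part, so round up – no squaring
	// multiplication required.
	return (root + ((s != 0) ? 1 : 0));
}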

"But wait a moment!", I hear you say. "Why are you bothering with an integer square root algorithm to begin with? Shouldn't good old round(sqrt(x)) from <math.h> do the trick just fine? Our CPUs have had SSE for a long time, and this probably compiles into the single SQRTSD instruction. All that extra floating-point hardware might mean that this instruction could even run in parallel with non-SSE code!"
And yes, all of that is technically true. So I tested it, and my very synthetic and constructed micro-benchmark did indeed deliver the same results in… 48 seconds. :thonk: That's not enough of a difference to justify breaking the spirit of treating the FPU as lava that permeates Shuusou Gyoku's code base. Besides, it's not used for that much to begin with:

After a quick C++ translation of the RNG function that spells out a 32-bit multiplication on a 32-bit CPU using 16-bit instructions, we reach the final pieces of ASM code for the 8-bit atan2() and trapezoid rendering. These could actually pass for well-written ASM code in how they express their 64-bit calculations: atan8() prepares its 64-bit dividend in the combined EDX and EAX registers in a way that isn't obvious at all from a cursory look at the code, and the trapezoid functions effectively use Q32.32 subpixels. C++ allows us to cleanly model all these calculations with 64-bit variables, but unfortunately compiles the divisions into a call to a comparatively much more bloated 64-bit/64-bit-division polyfill function. So yeah, we've actually found a well-optimized piece of inline assembly that even Visual Studio 2022's optimizer can't compete with. But then again, this is all about code generation details that are specific to 32-bit code, and it wouldn't be surprising if that part of the optimizer isn't getting much attention anymore. Whether that optimization was useful, on the other hand… Oh well, the new C++ version will be much more efficient in 64-bit builds.

And with that, there's no more ASM code left in Shuusou Gyoku's codebase, and the original DirectXUTYs directory is slowly getting emptier and emptier.


Phew! Was that everything for this delivery? I think that was everything. Here's the new build, which checks off 7 of the 15 remaining portability boxes:

:sh01: Shuusou Gyoku P0256

Next up: Taking a well-earned break from Shuusou Gyoku and starting with the preparations for multilingual PC-98 Touhou translatability by looking at TH04's and TH05's in-game dialog system, and definitely writing a shorter blog post about all that…

📝 Posted:
🚚 Summary of:
P0217
Commits:
(Seihou) pbg...P0217
💰 Funded by:
Arandui
🏷 Tags:

First of all: This blog is now available as a web feed, in three different formats!

Thanks to handlerug for implementing and PR'ing the feature in a very clean way. That makes at least two people I know who wanted to see feed support, so there are probably a few more out there.


So, Shuusou Gyoku. pbg released the original source code for the first two Seihou games back in February 2019, but notably removed the crucial decompression code for the original packfiles due to… various unspecified reasons, considerations, and implications. :thonk: This vague language and subsequent rejection of a pull request to add these features back in were probably the main reasons why no one has publicly done anything with this codebase since.

The only other fork I know about is Priw8's private fork from 2020, but only because WishMakers informed me about it shortly after this push was funded. Both of them might also contribute some features to my fork in the future if their time allows it.
In this fork, Priw8 replaced packfile decompression with raw reads from directories with the pre-extracted contents of all the .DAT files. This works for playing the game, but there are actually two more things that require the original packfile code:

We can surely implement our own simple and uncompressed formats for these things, but it's not the best idea to build all future Shuusou Gyoku features on top of a replay-incompatible fork. So, what do we do? On the one hand, pbg expressed the clear wish to not include data reverse-engineered from the original binary. On the other hand, he released the code under the MIT license, which allows us to modify the code and distribute the results in any way we wish.
So, let's meet in the middle, and go for a clean-room implementation of the missing features as indicated by their usage, without looking at either the original binary or wangqr's reverse-engineered code.


With incremental rebuilds being broken in the latest Visual Studio project files as well, it made sense to start from scratch on pbg's last commit. Of course, I can't pass up a chance to use 📝 Tup, my favorite build system for every project I'm the main developer of. It might not fit Shuusou Gyoku as well as it fits ReC98, but let's see whether it would be reasonable at all…

… and it's actually not too bad! Modern Visual Studio makes this a bit harder than it should be with all the intermediate build artifacts you have to keep track of. In the end though, it's still only 70 lines of Lua to have a nice abstraction for both Debug and Release builds. With this layer underneath, the actual Shuusou Gyoku-specific part can be expressed as succinctly as in any other modern build system, while still making every compiler flag explicit. It might be slightly slower than a traditional .vcxproj build due to launching one cl.exe process per translation unit, but the result is way more reliable and trustworthy compared to anything that involves Visual Studio project files. This simplicity paves the way for expanding the build process to multiple steps, and doing all the static checking on translation strings that I never got to do for thcrap-based patches. Heck, I might even compile all future translations directly into the binary…

Every C++ build system will invariably be hated by someone, so I'd say that your goal should always be to simplify the actually important parts of your build enough to allow everyone else to easily adapt it to their favorite system. This Tupfile definitely does a better job there than your average .vcxproj file – but if you still want such a thing (or, gasp, 🤮 CMake project files 🤮) for better Visual Studio IDE integration, you should have no problem generating them for yourself.
There might still be a point in doing that because that's the one part that unfortunately sucks about this approach. Visual Studio is horribly broken for any nonstandard C++ project even in 2022:

In both cases, IntelliSense doesn't work properly at all even if it appears to be configured correctly, and Tup's dependency tracking appeared to be weirdly cut off for the very final .PDB file. Interestingly though, using the big Visual Studio IDE for just debugging a binary via devenv bin/GIAN07.exe suddenly eliminates all the IntelliSense issues. Looks like there's a lot of essential information stored in the .PDB files that Visual Studio just refuses to read in any other context. :thonk:

But now compare that to Visual Studio Code: Open it from the x64_x86 Cross Tools Command Prompt via code ., launch a build or debug task, or browse the code with perfect IntelliSense. Three small configuration files and everything just works – heck, you even get the Tup progress bar in the terminal. It might be Electron bloatware and horribly slow at times, but Visual Studio Code has long outperformed regular Visual Studio in terms of non-debug functionality.


On to the compression algorithm then… and it's just textbook LZSS, with 13 bits for the offset of a back-reference and 4 bits for its length? Hardly a trade secret there. The hard parts all come from unexpected inefficiencies in the bitstream format, both of which show up in the decompression sketch after this list:

  1. Encoding back-references as offsets into an 8 KiB ring buffer dictionary means that the most straightforward implementation actually needs an 8 KiB array for the LZSS sliding window. This could have easily been done with zero additional memory if the offset was encoded as the difference to the current byte instead.
  2. The packfile format stores the uncompressed size of every file in its header, which is a good thing because you want to know in advance how much heap memory to allocate for a specific file. Nevertheless, the original game only stops reading bits from the packfile once it encountered a back-reference with an offset of 0. This means that the compressor not only has to write this technically unneeded back-reference to the end of the compressed bitstream, but also ignore any potential other longest back-reference with an offset of 0 within the file. The latter can easily happen with a ring buffer dictionary.
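With both of these quirks in mind, here's a hedged sketch of the whole decompression loop. The flag-bit polarity, the length bias, and the bit-reading helper are assumptions for illustration, not the actual clean-room code:

#include <stddef.h>
#include <stdint.h>
#include <vector>

// Minimal MSB-first bit reader over a borrowed buffer (hypothetical helper).
struct BitReader {
	const uint8_t* buf;
	size_t size;
	size_t bit = 0;

	unsigned Read(unsigned count) {
		unsigned ret = 0;
		while (count--) {
			const size_t byte_pos = (bit / 8);
			ret <<= 1;
			if (byte_pos < size) {
				ret |= ((buf[byte_pos] >> (7 - (bit % 8))) & 1);
			}
			bit++;
		}
		return ret;
	}
};

std::vector<uint8_t> Decompress(BitReader& bits, size_t uncompressed_size) {
	std::vector<uint8_t> ret;
	ret.reserve(uncompressed_size); // known in advance from the packfile header
	uint8_t dict[8192] = { 0 }; // the 8 KiB ring buffer dictionary
	unsigned dict_pos = 0;

	for (;;) {
		if (bits.Read(1)) { // assumed: 1 = literal byte follows
			const auto literal = static_cast<uint8_t>(bits.Read(8));
			ret.push_back(literal);
			dict[dict_pos++ & 8191] = literal;
		} else {
			const unsigned offset = bits.Read(13); // a ring buffer *position*
			if (offset == 0) {
				break; // the in-band end-of-stream marker
			}
			const unsigned length = (bits.Read(4) + 3); // assumed bias
			for (unsigned i = 0; i < length; i++) {
				const uint8_t b = dict[(offset + i) & 8191];
				ret.push_back(b);
				dict[dict_pos++ & 8191] = b;
			}
		}
	}
	return ret;
}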

The original game used a single BIT_DEVICE class with mode flags for every combination of reading and writing memory buffers and on-disk files. Since that would have necessitated a lot of error checking for all (pseudo-)methods of this class, I wrote one dedicated small class for each one of these permutations instead. To further emphasize the clean-room property of this code, these use modern C++ memory ownership features: std::unique_ptr for the fixed-size read-only buffers we get from packfiles, std::vector for the newly compressed buffers where we don't know the size in advance, and std::span for a borrowed reference to an immutable region of memory that we want to treat as a bitstream. Definitely better than using the native Win32 LocalAlloc() and LocalFree() allocator, especially if we want to port the game away from Windows one day.

One feature I didn't use though: C++ fstreams, because those are trash. :tannedcirno: These days, they would seem to be the natural choice with the new std::filesystem::path type from C++17: Correctly constructed, you can pass that type to an fstream constructor and gain both locale independence on Windows and portability to everything else, without writing any Windows-specific UTF-16 code. But even in a Release build, fstreams add ~100 KB of locale-related bloat to the .EXE which adds no value for just reading binary files. That's just too embarrassing if you look at how much space the rest of the game takes up. Writing your own platform layer that calls the Win32 CreateFileW(), ReadFile(), and WriteFile() API functions is apparently still the way to go even in 2022. And with std::filesystem::path still being a welcome addition to C++, it's not too much code to write either.
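Such a platform layer doesn't have to be long, either. A hedged sketch of a whole-file read on top of these API calls; LoadFile() is a made-up name, and files larger than 4 GiB are out of scope:

#include <windows.h>
#include <stdint.h>
#include <filesystem>
#include <optional>
#include <vector>

std::optional<std::vector<uint8_t>> LoadFile(const std::filesystem::path& fn) {
	// On Windows, path::c_str() already returns a UTF-16 wchar_t* – no
	// manual conversion code required.
	HANDLE handle = CreateFileW(fn.c_str(), GENERIC_READ, FILE_SHARE_READ,
		nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
	if (handle == INVALID_HANDLE_VALUE) {
		return std::nullopt;
	}
	std::optional<std::vector<uint8_t>> ret;
	LARGE_INTEGER size;
	if (GetFileSizeEx(handle, &size) && (size.HighPart == 0)) {
		std::vector<uint8_t> buf(size.LowPart);
		DWORD bytes_read = 0;
		if (ReadFile(handle, buf.data(), size.LowPart, &bytes_read, nullptr)
			&& (bytes_read == size.LowPart)) {
			ret = std::move(buf);
		}
	}
	CloseHandle(handle);
	return ret;
}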

This gets us file format compatibility with the original release… and a crash as soon as the ending starts, but only in Release mode? As it turns out, this crash is caused by an out-of-bounds array access bug that was present even in the original game, and only turned into a crash now because the optimizer in modern Visual Studio versions reorders static data. As a result, the 6-element pFontInfo array got placed in front of an ECL-related counter variable that then got corrupted by the write to the 7th element, which subsequently crashed the game with a read access to previously deallocated danmaku script data. That just goes to show that these technical bugs are important and worth fixing even if they don't cause issues in the original game. Who knows how many of these will turn into crashes once we get to porting PC-98 Touhou?
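
This kind of bug is easy to demonstrate in isolation. A contrived sketch with made-up data – the real code obviously looks different:

#include <stdio.h>

struct FontInfo {
	int x, y;
};

static FontInfo pFontInfo[6]; // valid indices: 0 to 5
static int ecl_counter = 42;  // might be placed directly after the array

int main(void)
{
	for(int i = 0; i <= 6; i++) { // off-by-one: i == 6 writes out of bounds
		pFontInfo[i].x = 0;
		pFontInfo[i].y = 0;
	}
	// Undefined behavior: depending on where the compiler and linker placed
	// the two variables, this may print 42, print 0, or crash outright.
	printf("%d\n", ecl_counter);
	return 0;
}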


So here we go, a new build of Shuusou Gyoku, compiled with Visual Studio 2022, and compatible with all original data formats:

:sh01: Shuusou Gyoku P0217

Inside the regular Shuusou Gyoku installation directory, this binary works as a full-fledged drop-in replacement for the original 秋霜玉.exe. It still has all of the original binary's problems though:

As well as some of its own:

So all in all, it's a strict downgrade at this point in time. :onricdennat: And more of a symbol that we can now start doing actual work on this game. Seihou has been a fun change of pace, and I hope that I get to do more work on the series. There is quite a lot to be done with Shuusou Gyoku alone, and the 21 GitHub issues I've opened are probably only scratching the surface.
However, all the required research for this one consumed more like 1⅔ pushes. Despite just one push being funded, it wouldn't have made sense to release the commits or this binary in any earlier state. To repay this debt, I'm going to put the next push for Seihou towards the small code maintenance and performance tasks that I usually do for free, before doing any more feature and bugfix work. Next up: Improving video playback on the blog, and maybe delivering some microtransaction work on the side?

📝 Posted:
🚚 Summary of:
P0137
Commits:
07bfcf2...8d953dc
💰 Funded by:
[Anonymous]
🏷 Tags:

Whoops, the build was broken again? Since P0127 from mid-November 2020, on TASM32 version 5.3, which also happens to be the one in the DevKit… That version changed the alignment for the default segments of certain memory models when requesting .386 support. And since redefining segment alignment apparently is highly illegal and absolutely has to be a build error, some of the stand-alone .ASM translation units didn't assemble anymore on this version. I only spotted this on my own because I casually compiled ReC98 somewhere else – on my development system, I happened to have TASM32 version 5.0 in the PATH during all this time.
At least this was a good occasion to get rid of some weird segment alignment workarounds from 2015, and replace them with the superior convention of using the USE16 modifier for the .MODEL directive.

ReC98 would greatly benefit from a build server – both to immediately spot issues like this one, and as a service for modders. Even more so than the usual open-source project of its size, I would say. But that might be exactly because it doesn't seem like something you can trivially outsource to one of the big CI providers for open-source projects and quickly set up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular 64-bit Windows 10 and DOSBox running the exact build tools from the DevKit. Ideally, though, such a server should really run the optimal configuration of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step to run natively, which already is something that no popular CI service out there offers. Then, we'd optimally expand to Linux, every other Windows version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd be a lot. An experimental project all on its own, with additional hosting costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest there is once the store reopens (which will be at the beginning of May, at the latest). That aside, it would 📝 also be a great project for outside contributors!


So, technical debt, part 8… and right away, we're faced with TH03's low-level input function, which 📝 once 📝 again 📝 insists on being word-aligned in a way we can't fake without duplicating translation units. Being undecompilable isn't exactly the best property for a function that has been interesting to modders in the past: In 2018, spaztron64 created an ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human multiplayer mode:

2021-04-04-TH03-WASD-2player.zip

However, this remapping attempt remained quite limited, since we hadn't (and still haven't) reached full position independence for TH03 yet. There's quite some potential for size optimizations in this function, which would allow more BIOS key groups to be used right now, but it's not all that obvious to modders who aren't intimately familiar with x86 ASM. Therefore, I really wouldn't want to keep such a long and important function in ASM if we don't absolutely have to…

… and apparently, that's all the motivation I needed? So I took the risk, and spent the first half of this push on reverse-engineering TCC.EXE, to hopefully find a way to get word-aligned code segments out of Turbo C++ after all.

And there is! The -WX option, used for creating DPMI applications, messes up all sorts of code generation aspects in weird ways, but does in fact mark the code segment as word-aligned. We can consider ourselves quite lucky that we get to use Turbo C++ 4.0, because this feature isn't available in any previous version of Borland's C++ compilers.
That allowed us to restore all the decompilations I previously threw away… well, two of the three – that lookup table generator was too much of a mess in C. :tannedcirno: But what an abuse this is. The subtly different code generation has basically required one creative workaround per usage of -WX. For example, enabling that option causes the regular PUSH BP and POP BP prolog and epilog instructions to be wrapped with INC BP and DEC BP, for some reason:

a_function_compiled_with_wx proc
	inc 	bp    	; ???
	push	bp
	mov 	bp, sp
	    	      	; [… function code …]
	pop 	bp
	dec 	bp    	; ???
	ret
a_function_compiled_with_wx endp

Luckily again, none of the functions that currently require -WX set up a stack frame or take any parameters.
While this hasn't directly been an issue so far, it's been pretty close: snd_se_reset(void) is one of the functions that require word alignment. Previously, it shared a translation unit with the immediately following snd_se_play(int new_se), which does take a parameter, and therefore would have had its prolog and epilog code messed up by -WX. Since the latter function has a consistent (and thus, fakeable) alignment, I simply split that code segment into two, with a new -WX translation unit for just snd_se_reset(void). Problem solved – after all, two C++ translation units are still better than one ASM translation unit. :onricdennat: Especially with all the previous #include improvements.

The rest was more of the usual, getting us 74% done with repaying the technical debt in the SHARED segment. A lot of the remaining 26% is TH04 needing to catch up with TH03 and TH05, which takes comparatively little time. With some good luck, we might get this done within the next push… that is, if we aren't confronted with too many more disgusting decompilations, like the two functions that ended this push. If we are, we might need 10 pushes to complete this after all, but that piece of research was definitely worth the delay. Next up: One more of these.

📝 Posted:
🚚 Summary of:
P0113, P0114
Commits:
150d2c6...6204fdd, 6204fdd...967bb8b
💰 Funded by:
Lmocinemod
🏷 Tags:

Alright, tooling and technical debt. Shouldn't really be much to talk about… oh, wait, this is still ReC98 :tannedcirno:

For the tooling part, I finished up the remaining ergonomics and error handling for the 📝 sprite converter that Jonathan Campbell contributed two months ago. While familiarizing myself with the tool, I actually ran into some unreported errors myself, so this was sort of important to me. There's still no command-line help, but the error messages can now probably do that job even better, since we would have had to write them anyway.

So, what's up with the technical debt then? Well, by now we've accumulated quite a number of 📝 ASM code slices that need to be either decompiled or clearly marked as undecompilable. Since we define those slices as "already reverse-engineered", that decision won't affect the numbers on the front page at all. But for a complete decompilation, we'd still have to do this someday. So, rather than incorporating this work into pushes that were purchased with the expectation of measurable progress in a certain area, let's take the "anything goes" pushes, and focus entirely on that during them.

The second code segment seemed like the best place to start with this, since it affects the largest number of games simultaneously. Starting with TH02, this segment contains a set of random "core" functions needed by the binary. Image formats, sounds, input, math, it's all there in some capacity. You could maybe call it all "libzun" or something like that? But for the time being, I simply went with the obvious name, seg2. Maybe I'll come up with something more convincing in the future.


Oh, but wait, why were we assembling all the previous undecompilable ASM translation units in the 16-bit build part? By moving those to the 32-bit part, we don't even need a 16-bit TASM in our list of dependencies, as long as our build process is not fully 16-bit.
And with that, ReC98 now also builds on Windows 95, and thus, every 32-bit Windows version. 🎉 Which is certainly the most user-visible improvement in all of these two pushes. :onricdennat:


Back in 2015, I already decompiled all of TH02's seg2 functions. As suggested by the Borland compiler, I tried to follow a "one translation unit per segment" layout, bundling the binary-specific contents via #include. In the end, it required two translation units – and that was even after manually inserting the original padding bytes via #pragma codestring… yuck. But it worked, compiled, and kept the linker's job (and, by extension, segmentation worries) to a minimum. And as long as it all matched the original binaries, it still counted as a valid reconstruction of ZUN's code. :zunpet:

However, that idea ultimately falls apart once TH03 starts mixing undecompilable ASM code in between C functions. Now, we officially have no choice but to use multiple C and ASM translation units, with maybe just one or two #includes in them…

…or we finally start reconstructing the actual seg2 library, turning every sequence of related functions into its own translation unit. This way, we can simply reuse the once-compiled .OBJ files for all the binaries those functions appear in, without requiring that additional layer of translation units mirroring the original segmentation.
The best example for this is TH03's almost undecompilable function that generates a lookup table for horizontally flipping 8 1bpp pixels. It's part of every binary since TH03, but only used in that game. With the previous approach, we would have had to add 9 C translation units, which would all have just #included that one file. Now, we simply put the .OBJ file into the correct place on the linker command line, as soon as we can.

💡 And suddenly, the linker just inserts the correct padding bytes itself.
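
As for that flip function: For the curious, here's roughly what it computes, reconstructed as hypothetical C code – a 256-entry table that maps every byte (= 8 1bpp pixels) to its horizontal mirror image. The actual ASM function arrives at an equivalent table in its own, almost undecompilable way:

#include <stdint.h>

uint8_t pixel_flip_lut[256];

void pixel_flip_lut_generate(void)
{
	for(int byte = 0; byte < 256; byte++) {
		uint8_t flipped = 0;
		for(int bit = 0; bit < 8; bit++) {
			if(byte & (1 << bit)) {
				flipped |= (0x80 >> bit); // bit 0 ↔ bit 7, bit 1 ↔ bit 6, …
			}
		}
		pixel_flip_lut[byte] = flipped;
	}
}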

The most immediate gains there also happened to come from TH03, which is also where we did get some tiny RE% and PI% gains out of this after all, by reverse-engineering some of its sprite blitting setup code. Sure, I should have done even more RE here to also cover those 5 functions at the end of TH03's MAIN.EXE code segment #2 that sit in front of a number of library functions I already covered in this push. But let's leave that to an actual RE push 😛


All in all though, I was just getting started with this; the real gains in terms of removed ASM files are still to come. But in the meantime, the funding situation has become even better in terms of allowing me to focus on things nobody asked for. 🙂 So here's a slightly better idea: Instead of spending two more pushes on this, let's shoot for TH05 MAINE.EXE position independence next. If I manage to get it done, we'll have a 100% position-independent TH05 by the time -Tom- finishes his MAIN.EXE PI demo, rather than the 94% we'd get from just MAIN.EXE. That's bound to make a much better impression on all the people who will then (re-)discover the project.

📝 Posted:
🚚 Summary of:
P0001
Commits:
e447a2d...150d2c6
💰 Funded by:
GhostPhanom
🏷 Tags:

(tl;dr: ReC98 has switched to Tup for the 32-bit build. You probably want to get 💾 this build of Tup, and put it somewhere in your PATH. It's optional, and always will be, but highly recommended.)


P0001! Reserved for the delivery of the very first financial contribution I've ever received for ReC98, back in January 2018. GhostPhanom requested the exact opposite of immediate results, which motivated me to go on quite a passionate quest for the perfect ReC98 build system. A quest that went way beyond the crowdfunding…

Makefiles are a decent idea in theory: Specify the targets to generate, the source files these targets depend on and are generated from, and the rules to do the generating, with some helpful shorthand syntax. Then, you have a build dependency graph, and your make tool of choice can provide minimal rebuilds of only the targets whose sources changed since the last make call. But, uh… wait, this is C/C++ we're talking about, and doesn't pretty much every source file come with a second set of dependent source files, namely, every single #include in the source file itself? Do we really have to duplicate all these inside the Makefile, and keep it in sync with the source file? 🙄

This fact alone means that Makefiles are inherently unsuited for any language with an #include feature… that is, pretty much every language out there. Not to mention other aspects like changes to the compilation command lines, or the build rules themselves, all of which require metadata of the previous build to be persistently stored in some way. I have no idea why such a trash technology is even touted as a viable build tool for code.

But wait! Most make implementations, including Borland's, do support the notion of auto-dependency information, emitted by the compiler in a specific format, to provide make with the additional list of #includes. Sure, this should be a basic feature of any self-respecting build tool, and not something you have to add as an extension, but let's just set our idealism aside for a moment. Well, too bad that Borland's implementation only works if you spell out both the source➜object and the object➜binary rules, which loses the performance gained from compiling multiple translation units in a single BCC or TCC process. And even then, it tends to break in that DOS VM you're probably using. Not to mention, again, all the other aspects that still remain unsolved.

So, I decided to just write my own build system, tailor-made for the needs of ReC98's 16-bit build process, and combining a number of experimental ideas. Which is still not quite bug-free and ready for public use, given that the entire past year has kept me busy with actual tangible RE and PI progress. What did finally become ready, however, is the improvement for the 32-bit build part, and that's what we've got here.


💭 Now, if only there was a build system that would perfectly track dependencies of any compiler it calls, by injecting code and hooking file opening syscalls. It'd be completely unrealistic for it to also run on DOS (and we probably don't want to traverse a graph database in a cycle-limited DOSBox), but it would be perfect for our 32-bit build part, as long as that one still exists.

Turns out Tup is exactly that system. In practice, its low-level nature as a make replacement does limit its general usefulness, which is why you probably haven't seen it used in a lot of projects. But for something like ReC98, with its reliance on outdated compilers that aren't supported by any decent high-level tool, it's exactly the right tool for the job. Also, it's completely beyond me how Ninja, the most popular make replacement these days, was inspired by Tup, yet went a step back to parsing the specific dependency information produced by gcc, Clang, and Visual Studio – and only those.

Sure, it might seem really minor to worry about not unconditionally rebuilding all 32-bit .asm files, which just takes a couple of seconds anyway. But minimal rebuilds in the 32-bit part also provide the foundation for minimal rebuilds in the 16-bit part – and those TLINK invocations do take quite some time after all.

Using Tup for ReC98 was an idea that dated back to January 2017. Back then, I already opened the pull request with a fix to allow Tup to work together with 32-bit TASM. As much as I love Tup though, the fact that it only worked on 64-bit Windows ≥Vista would have meant exchanging the ability to build on 32-bit and older Windows versions at all for perfect dependency tracking. For a project that relies on DOS compilers, this would have been exactly the wrong trade-off to make.

What's worse though: TLINK fails to run on modern 32-bit Windows with Loader error (0000) : Unrecognized Error. Therefore, the set of systems that Tup runs on, and the set of systems that can actually compile ReC98's 16-bit build part natively, would have been exactly disjoint, with no OS getting to use both at the same time.
So I've kept using Tup for only my own development, but indefinitely shelved the idea of making it the official build system, due to those drawbacks. Recently though, it all came together:

As I'm writing this post, the pull request has unfortunately not yet been merged. So, here's my own custom build instead:

💾 Download Tup for 32-bit Windows (optimized build at this commit)

I've also added it to the DevKit, for any newcomers to ReC98.


After the switch to Tup and the fallback option, I extensively tested building ReC98 on all operating systems I had lying around. And holy cow, so much in that build was broken beyond belief. In the end, the solution involved just fully rebuilding the entire 16-bit part by default. :tannedcirno: Which, of course, nullifies any of the advantages we might have gotten from a Makefile in the first place, due to just how unreliable they are. If you had problems building ReC98 in the past, try again now!

And sure, it would certainly be possible to also get Tup working on Windows ≤XP, or 9x even. But I leave that to all those tinkerers out there who are actually motivated to keep those OSes alive. My work here is done – we now have a build process that is optimal on 32-bit Windows ≥Vista, and still functional and reliable on 64-bit Windows, Linux, and everything down to Windows 98 SE, and therefore also real PC-98 hardware. Pretty good, I'd say.

(If it weren't for that weird crash of the 16-bit TASM.EXE in that Windows 95 command prompt I've tried it in, it would also work on that OS. Probably just a misconfiguration on my part?)

Now, it might look like a waste of time to improve a 32-bit build part that won't even exist anymore once this project is done. However, a fully 16-bit DOS build will only make sense after

And who knows whether this project will get funded for that long. So yeah, the 32-bit build part will stay with us for quite some more time, and for all upcoming PI milestones. And with the current build process, it's pretty much the most minor among all the minor issues I can think of. Let's all enjoy the performance of a 32-bit build while we can 🙂

Next up: Paying some technical debt while keeping the RE% and PI% in place.

📝 Posted:
🏷 Tags:

TH01 pellets are coming up next, and for the first time, we'll have the chance to move hardcoded sprite data from ASM land to C land. As it would turn out, bad luck with the 2-byte alignment at the end of REIIDEN.EXE's data segment pretty much forces us to declare TH01's pellet sprites in C if we want to decompile the final few pellet functions without ugly workarounds for the float literals there. And while I could have just converted them into a C array and called it a day, it did raise the question of when we are going to do this The Right And Moddable Way, by auto-converting actual image files into ASM or C arrays during the build process. These arrays are even more annoying to edit in C, after all – unlike TASM, the old C++ we have to work with doesn't support binary number literals, only hexadecimal or, gasp, octal.
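
To illustrate with a made-up sprite row (the actual pellet sprites look different):

// Two rows of 8 1bpp pixels each. Without binary literals, the bit
// pattern has to live in a comment to remain editable at all.
static const unsigned char sEXAMPLE_SPRITE[2] = {
	0x3C, // ░░████░░ – would simply be 0b00111100 in a modern compiler
	0x7E, // ░██████░ – would simply be 0b01111110
};
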
Without the explicit funding for such a converter, I reached out to GitHub, asking backers and outside contributors whether they'd be in favor of it. As something that requires no RE skills and collides with nothing else, it would be a perfect task for C/C++ coders who want to support ReC98 with something other than money.

And surprisingly, those still exist! Jonathan Campbell, of DOSBox-X fame, went ahead and implemented all the required functionality, within just a few days. Thanks again! The result is probably a lot more portable than it would have been if I had written it. Which is pretty relevant for future port authors – any additional tooling we write ourselves should not add to the list of problems they'll have to worry about.

Right now, all of the sprites are #included from the big ASM dump files, which means that they have to be converted before those files are assembled during the 32-bit build part. We could have introduced a third distinct build step there, perhaps even a 16-bit one so that we can use Turbo C++ 4.0J to also compile the converter… However, the more reasonable option was to do this at the beginning of the 32-bit build step, and add a 32-bit Windows C++ compiler to the list of tools required for ReC98's build process.
And the best choice for ReC98 is, in fact… 🥁… the 20-year-old Borland C++ 5.5 freeware release. See the README for a lengthy justification, as well as download links.

So yes, all sprites mentioned in the GitHub issue can now be modded by simply editing .BMP files, using an image editor of your choice. 🖌
And now that that's dealt with, it's finally time for more actual progress! TH01 pellets coming tomorrow.