Blog

⮞ List of tags

📝 Posted:
🚚 Summary of:
P0266, P0267, P0268, P0269, P0270, P0271, P0272, P0273, P0274, P0275, P0276, P0277
Commits:
(mly) eb2b0c8...b356884, (mly) b356884...1c70db0, (mly) 1c70db0...db0c195, (BGM packs) 2f9bce5...45087c2, (Seihou) P0256...a9ca081, (Seihou) a9ca081...8db918f, (Seihou) 8db918f...3de48ab, (Seihou) 3de48ab...9467705, (Seihou) 9467705...241a6c9, (Seihou) 241a6c9...P0275, (Seihou) dbc369f...883ac40, (Seihou) 883ac40...6ac72f3
💰 Funded by:
Ember2528, [Anonymous]
🏷 Tags:

📝 Over two years since the previous largest delivery, we've now got a new record in every regard: 12 pushes across 5 repos, 215 commits, and a blog post with over 14,000 words and 48 pieces of media. 😱 Who would have thought that the superficially simple task of putting SC-88Pro recordings into Shuusou Gyoku would actually mainly focus on deep research into the underlying MIDI files? I don't typically cover much music-related content because it's a non-issue as far as PC-98 Touhou code is concerned, so it's quite fitting how extensive this one turned out. So here we go, the result of virtually unlimited funding and patience:

  1. The SC-88Pro recording controversy
  2. Undefined SysEx behavior
  3. Resolving the controversy, and making a choice (contains personal opinion)
  4. A Unix-style command-line MIDI filter (in Rust BTW)
  5. Visualizing MIDI files (for science, and not for playing them on a keyboard)
  6. Shuusou Gyoku's individual loop quirks 🎺
  7. Rewriting pbg's MIDI code
  8. Putting together the BGM packs
  9. Outgrowing miniaudio (and raging about single-file C libraries for a while)
  10. Remaining implementation details
  11. Pricing changes (and no, not everything's getting more expensive)

So where's the controversy? Romantique Tp obviously made the best and most careful real-hardware SC-88Pro recordings of all of ZUN's old MIDIs, including the original (OST) and arranged (AST) soundtrack of Shuusou Gyoku, right? Surely all I have to do now is to cut them into seamless loops to save a bit of disk space, and then put them into the game? Let's start at the end of the track list with the name registration theme, since it's light on instruments and has an obvious loop point that will be easy to spot in the waveform. But, um… wait a moment, that very first drum note comes a bit late, doesn't it?

This can also be heard in Romantique Tp's YouTube upload.
At a notated tempo of 96 BPM, these first four beats should take exactly 2.5 seconds, which they do in this seamlessly looping softsynth rendering.

That's… not quite the accuracy and perfection I was expecting. :thonk: But I think I know what we're seeing and hearing there. Let's look at the first few MIDI events on the drum channel:

Delta	Pulse	 Beat	Channel	Event
 +540	   960	  2:000	      1	Controller { CC   0, value   0 }
   +0	   960	  2:000	      1	Controller { CC  32, value   0 }
   +0	   960	  2:000	      1	ProgramChange {  37 }
[…]
   +0	   960	  2:000	      2	Controller { CC   0, value   0 }
   +0	   960	  2:000	      2	Controller { CC  32, value   0 }
   +0	   960	  2:000	      2	ProgramChange {  19 }
[…]
   +0	   960	  2:000	      3	Controller { CC   0, value   0 }
   +0	   960	  2:000	      3	Controller { CC  32, value   0 }
   +0	   960	  2:000	      3	ProgramChange {   6 }
[…]
   +0	   960	  2:000	      4	Controller { CC   0, value   0 }
   +0	   960	  2:000	      4	Controller { CC  32, value   0 }
   +0	   960	  2:000	      4	ProgramChange {   2 }
[…]
Delta	Pulse	 Beat	Channel	Event
   +0	  960	2:000	     10	Controller { CC   0, value   0 }
   +0	  960	2:000	     10	Controller { CC  32, value   0 }
   +0	  960	2:000	     10	ProgramChange {  25 }
   +0	  960	2:000	     10	Controller { CC   7, value 127 }
   +0	  960	2:000	     10	Controller { CC  11, value 127 }
   +0	  960	2:000	     10	Controller { CC  10, value  64 }
   +0	  960	2:000	     10	Controller { CC  91, value  80 }
   +0	  960	2:000	     10	Controller { CC  93, value  40 }
   +0	  960	2:000	     10	NoteOn { Key  42, Vel.  94 }
   +0	  960	2:000	     10	NoteOn { Key  36, Vel. 110 }
   +1	  961	2:001	     10	NoteOn { Key  42, Vel.   0 }
   +0	  961	2:001	     10	NoteOn { Key  36, Vel.   0 }
 +119	 1080	2:120	     10	NoteOn { Key  42, Vel.  34 }
   +1	 1081	2:121	     10	NoteOn { Key  42, Vel.   0 }
 +119	 1200	2:240	     10	NoteOn { Key  42, Vel.  64 }
   +0	 1200	2:240	     10	NoteOn { Key  36, Vel.  64 }
Also, the fact that GS doesn't put its drums on a non-general voice bank and instead relies on external channel configuration to differentiate drums from pitched instruments is making this Yamaha kid uncontrollably furious. 🤬

Yup. That's the sound of a vintage hardware synth being slow and taking a two-digit number of milliseconds to process a barrage of simultaneous Program Change messages, playing a MIDI file that doesn't take this reality into account and expects program changes to happen instantly.
I can only speak from my own experience of writing MIDIs for hardware synths here, but having the first note displaced by 50 ms is very much not the way a composer would have intended the music to be heard if the note is clearly notated to occur on the beat. If you had told me about such an issue when playing one of my MIDIs on a certain synth, I would have thanked you for the bug report! And I would have promptly released a fixed version of the MIDI with the Program Change events moved back by a beat or two. In the case of Shuusou Gyoku's MIDIs, this wouldn't even have added any additional delay in-game, as all of these files already start with at least one beat of leading silence to make room for setting Roland-specific synth parameters.

OK, but that's just a single isolated bass drum hit. If we wanted to, we could even fix this issue ourselves by splicing the same note from around the loop end point. Maybe this is just an isolated case and the rest of Romantique Tp's recordings are fine? Well…

Again, check Romantique Tp's YouTube upload for proof.
By the way, this seamless audio player is what consumed most of the two website pushes this time. The rest went to the slightly redesigned main page, whose progress bars now use the cap bar style and the GitHub badge colors.

This one is even worse. Here, the delay is so long relative to the tempo of the piece that the intended five drum hits pretty much turn into four.

This type of issue doesn't even have to be isolated to the very beginning of a piece. A few of the tracks in both the OST and AST start with an anacrusis on just one or two channels and leave the Program Change event barrage at the beginning of the first full measure. In 幻想科学 ~ Doll's Phantom for example, this creates a flam-like glitch where the bass on channel 2 is pretty much on time, but the crash hit on channel 10 only follows 50 ms later, after the SC-88Pro took its sweet time to process all the Program Change events on the channels between:

This is from the arranged soundtrack for a change. In that one, ZUN at least fixed the issue in the final three MIDIs (シルクロードアリス, 魔女達の舞踏会, and 二色蓮花蝶 ~ Ancients) that closed out this rearranging project in May 2001, which spread out their per-channel setup events over at least a single measure before playing any note.

Let's listen to that at half speed:

Romantique Tp's YouTube upload.
Still on point.

Sure, all of this is barely noticeable in casual listening, but very noticeable if you're the one who now has to cut these recordings into seamless loops. And these are just the most obvious timing issues that can be easily pinpointed and documented – the actual worst aspects are all the minor tempo and timing fluctuations throughout most of the pieces. With recordings that deviate ever so slightly from the tempo defined in the MIDI files, you can no longer rely on mathematically exact sample positions when cutting loops. Even if those positions do work out from time to time, there'd pretty much always be a discontinuity in the waveform at both ends of the loop, manifesting as a clearly audible click. In the end, the only way of finding good loop points in existing recordings involves straining your ears and listening very, very closely to avoid any audible glitches. 😩

But if you've taken a look at the second tabs in the clips above, you will have noticed that we don't necessarily have to be stuck with recordings from real hardware. In late 2015, Roland released Sound Canvas VA, a VST plugin that emulates the classic core of Roland's old Sound Canvas lineup, including the SC-88Pro. As long as we run such a software synthesizer through a quality VST host, a purely software-based solution should be way superior for recording looped BGM:

Any drawbacks? For our use case, all of them are found in the abysmal software quality of everything around the synth engine. As it's typical for the VST industry, Sound Canvas VA is excessively DRM'd – it takes multiple seconds to start up, and even then only allows a single process to run at any given time, immediately quitting every process beyond the first one with a misleading Parameter File1 Read Error message box. I totally believe anyone who claims that this makes SCVA more annoying than real hardware when composing new music. Retro gamers also dislike how Roland themselves no longer sells the 32-bit builds they used to offer for the first few versions. These old versions are now exclusively available through resellers, or on the seven seas.
But as far as the SC-88Pro emulation is concerned, there don't seem to be any technical reasons against it. There is a long thread over at VOGONS discussing all sorts of issues, but you have to dig quite deep to find any clear descriptions of bugs in SCVA's synth engine. Everything I found either only applies to the SC-55 emulation and not the SC-88Pro, was fixed by Roland in the meantime, or turned out to be a fixable bug in a MIDI file.

Nevertheless, Romantique Tp has a very negative opinion about SCVA, getting quite angry and defensive in this instance where someone favorably compared SCVA to their recordings. Edit (2024-03-10): These days, Romantique Tp has a much more favorable opinion on SCVA as well.
8 years after their release, however, the community unanimously accepts the Romantique Tp recordings as the intended way to listen to ZUN's old MIDIs, so choosing Sound Canvas VA for our Shuusou Gyoku builds might be a bad idea purely for PR reasons. At best, people would slightly wonder why I intentionally went with the opposite of the accepted reference recordings, but at worst, this entire project could face a violent backlash…


But wait, we've already heard one obvious difference between the real SC-88Pro and Sound Canvas VA. Let's listen to the very first clip again:

Ha! You can clearly hear a panning echo in the real-hardware recording that is missing from the Sound Canvas VA rendering. That's an obvious case of a core system effect not being reproduced correctly. If even that's undeniably broken, who knows which other subtle bugs SCVA suffers from, right? Case closed, Romantique Tp was right all along, SCVA is trash, real hardware reigns supreme :godzun:

Actually, let's look closer into this one. Panning delay effects like this are typically reverb-related, but General MIDI only specifies a single controller to specify the per-channel reverb level from 0 to 127. Any specific characteristics of the reverb therefore have to be configured using vendor-specific system-exclusive messages, or SysEx for short.
So it's down to one of the four SysEx messages at the beginning of the MIDI file:

Delta	Pulse	 Beat	Event
   +0	    0	0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)
 +240	  240	0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)
 +120	  360	0:360	SysEx(41 10 42 12 40 01 33 0F 7D F7)
  +60	  420	0:420	SysEx(41 10 42 12 40 01 34 30 5B F7)

Since these byte strings represent Roland-specific instructions, we can't learn anything from a raw MIDI event dump alone here. No problem though, let's just load these files into some old MIDI sequencer that targeted Roland synths, open its MIDI event list, and then they will be automatically decoded into a human-readable representation…
…or at least that's what I expected. In Yamaha land, XGworks has done that for Yamaha's own XG SysEx messages ever since 1997:

Screenshot of the MIDI Event Viewer in Yamaha's XGworks, showing off its automatic XG SysEx decoding feature.
No configuration required. You can even edit the textual Value1 representation and XGworks parses it back into the closest supported value!

But for Roland synths, there's… nothing similar? Seriously? 😶 Roland fanboys, how do you even live?! I mean, they are quick to recommend the typical bloated and sluggish big-name DAWs that take up multiple gigabytes of disk space, but none of the ones I tried seemed to have this feature. They can't have possibly been flinging around raw byte strings for the past 33 years?!
But once you look more into today's MIDI community, it becomes clear that this is exactly what they've been doing. Why else would so many people use the word complicated to describe Roland SysEx, or call it an old school/cryptic communication protocol in hexadecimal format? The latter is particularly hilarious because if you removed the word cryptic, this might as well describe all of MIDI, not just SysEx. :tannedcirno: Everything about this is a tooling issue, and Yamaha showed how easily it could have been solved. Instead, we get Sound Canvas experts, who should know more about the ecosystem than I do, making the incredible mental leap from "my DAW doesn't decode or easily generate SysEx" to "SysEx is antiquated" to "please just lift up these settings to the VST level and into my proprietary DAW's proprietary project format, that would be so much better"

Thankfully that's not entirely true. After some more digging and configuration, I found a somewhat workable solution involving a comparatively modern sequencer called Domino:

  1. Download either Domino's original Japanese version or the partial English translation. The .zip file on the release page contains a full standalone build.
  2. Open the File → Preferences menu and associate your MIDI output device with a module map. This makes sense for SysEx encoding/generation since it can limit the options in the UI to what's actually available on your target hardware, but is also required for selecting the respective SysEx map into Domino's SysEx decoder. There is no technical reason for this because SC-88Pro SysEx messages can be uniquely identified by the three vendor, device, and model ID bytes that every message starts with, but would be too easy and user-friendly. The perception of SysEx being a black art must be upheld at all costs.
    Screenshot of Domino's MIDI-OUT window, complete with garbled text
    I've kept the garbled text of the partial translation to emphasize the sheer amount of jank involved in this entire process.
  3. Load a MIDI file and let Domino "analyze" it:
    Screenshot of Domino's analysis message box
  4. Strangely enough, this will take quite a while – on my system, this analysis step runs at a speed of roughly 4.25 KB/s of MIDI data. Yes, kilobytes.
  5. Unfortunately, "control change macro restoration" also seems to mean that you don't get to see any raw bytes when selecting the respective MIDI track in the UI, but at least we get what we were looking for:
    Screenshot of the four SysEx messages of タイトルドメイド, Shuusou Gyoku's name registration theme, as decoded by Domino
    …for the most part?
    Pulse	Event
        0	SysEx(41 10 42 12 40 00 7F 00 41 F7)
      240	SysEx(41 10 42 12 40 01 30 14 7B F7)
      360	SysEx(41 10 42 12 40 01 33 0F 7D F7)
      420	SysEx(41 10 42 12 40 01 34 30 5B F7)

Alright, that's something we can work with. The GS Reset message is something that every Roland GS MIDI should start with, but it's immediately followed by a message that Domino failed to decode? The two subsequent reverb parameters make sense, but panning delays typically have more parameters than just a reverb level and time.
That unknown SysEx message shares much of the same bytes with the decoded ones though. So let's do what we maybe should have done all along, return to caveman, and check the SC-88Pro manual:

The relevant section from page 194. We can see how the address and value correspond to bytes 5-7 and 8 in the SysEx messages. Byte 9 is a checksum and byte 10 signals the end of the message.

And that's where we find what this particular issue boils down to. The missing SysEx message is clearly intended to be a Reverb Macro command, whose value can range from 0 to 7 inclusive on the SC-88Pro, but ZUN tries to specify Reverb Macro #14h, or 20 in decimal. The SC-88Pro manual does not specify what happens if a SysEx message wants to write an invalid value to a valid address, which means that we've firmly entered the territory of undefined behavior.
Edit (2024-03-10): Romantique Tp confirmed that the real SC-88Pro clamps these Reverb Macro IDs to the supported range of 0-7. Therefore, the appropriate course of action for guaranteeing the same sound on other Roland synths would be to fix the MIDI file and specify Reverb Macro #7 instead. But since this behavior remains technically undefined, we can still argue about ZUN's intention behind specifying the Reverb Macro like this:

In fact, 32 out of the 39 MIDIs across both of Shuusou Gyoku's soundtrack use this invalid Reverb Macro. The only ones that don't are


And that's where this quest seemed to end, until Romantique Tp themselves came in and suggested that I take a closer look at the GS Advanced Editor, or GSAE for short.

The splash screen of GSAE version 4.01e.
Make sure to connect a MIDI input device before starting GSAE, or it will silently crash immediately after this splash screen. At least it accepts any controller, so this might just be a bug instead of the typical user-hostile kind of hardware dongle DRM that is pervasive in today's synth industry. 1999 would seem a bit too early for that, thankfully.

I was aware of this tool, but hadn't initially considered it because it's always described as just a SysEx generator/encoder. In fact, the very existence of such a tool made no sense to me at first, and seemed to prove my point that the usability of GS SysEx was wholly inferior to what I was used to in Yamaha land. Like, why not build at least a tiny and stripped-down MIDI sequencer around this functionality that would allow you to insert SC-88Pro-specific messages at any point within a sequence, and not just the beginning? I can see the need for such a tool in today's world of closed-source DAWs where hardware MIDI modules are niche and retro and are only kept alive by a small community of enthusiasts. But why would its developers guarantee that MIDI composers would have to hop between programs even back in 1997? I can only imagine that they saw how every just slightly advanced MIDI sequencer or DAW back then already used its own project format instead of raw Standard MIDI Files, and assumed that composers would therefore be program-hopping anyway?
However, GSAE does support the import of settings from a MIDI file and features a SysEx history window that decodes every newly processed Roland SysEx byte string, which is all I was looking for. So let's throw in that same MIDI and…

Screenshot of GSAE's SysEx history window,showing the results of sending a GS Reverb Macro #20 message
That's the result of sending just the single F0 41 10 42 12 40 01 30 14 7B F7 message at the top.

Now that's some wild numbers. An equally invalid Reverb Character, and Reverb Level and Time values that even exceed their defined range of 0-127? Could it be that GSAE emulates the real-hardware response to invalid Reverb Macros here, and gives us the exact reverb setting we can hear in Romantique Tp's recording? This could even be the reason why GSAE is still used and recommended within today's Roland MIDI sequencing scene, and hasn't been supplanted by some more modern open-source tool written by the community.

In any case, these values have to come from somewhere, so let's reverse-engineer GSAE and figure out the logic behind them. Shoutout to IDR for being a great help with its automatic generation of IDC debug symbols for the Delphi standard library, and even including a few names of application-level widget class methods by reading Delphi-specific type information from the binary. This little sub-project made me also come around to appreciating Ghidra, whose decompiler and data type manager helped a lot and allowed me to find the relevant code section within just a few hours.
A~nd it turns out that the values all come from out-of-bounds accesses into arrays on the stack. :onricdennat: If we combine 25, 235, and 132 back into a 32-bit value, we get 0x19EB84, which is the virtual address of the relevant function's stack frame base pointer.
But it gets even more hilarious: If you enable debug text output via Option → Other Options → SMF → Insert text events to setup measures and export these imported settings back into a MIDI file, GSAE not only retains these invalid Reverb Macro IDs, but stringifies them via a simple lookup into a hardcoded string pointer array, again without any bounds checks. The effects of this are roughly what you would expect:

In the end, we have Domino not decoding the Reverb Macro message, and GSAE, the premier SysEx tool for Roland synths, responding to it in even more undefined and clearly bugged ways than real hardware apparently does. That's two programs confirming that whatever ZUN intended was never supposed to work reliably. And while we still don't know exactly what these reverb parameters are supposed to be, these observations solve the mystery as far as I'm concerned, and solidify my personal opinion on the matter.


So what do we do now, and which version do we go with? Optimally, I'd offer both versions and turn this controversy into a personal choice so that everybody wins… and Ember2528 agreed and generously provided all the funding to make it happen. 💸
If you haven't picked your favorite yet, here are some final arguments:

The Romantique Tp recordings certainly have something going for them with their provenance of coming from real hardware, and the care that Romantique Tp put into manually recording every single track, warts and all. I wholeheartedly agree that preserving the raw sound of playing the MIDI files into the hardware without thinking about bugs or quirks is an important angle to take when it comes to preservation. It's good that these recordings exist – after all, you wouldn't know which musical elements you'd possibly be missing in an emulation if you have nothing to compare it to. Even the muffled sound in the half-speed clip above can be an argument in their favor, as the SC-88Pro's DAC operates at 32 kHz and you wouldn't expect any meaningful frequency content between 16,000 and 22,050 Hz to begin with. Any frequency content in that range that does remain in Romantique Tp's recording is simply 📝 rolled-off imaging noise added during the ADC's resampling process.
All this is why they are a definite improvement over kaorin's 2007 recordings of only the AST, which used to be the previous reference recordings within the community. Those had all of the same timing issues and more, in addition to being so excessively volume-boosted that 0.15% of the samples across the entire soundtrack ended up clipped. That's 6.25 seconds out of 68:39m being lost to pure digital noise.

Most importantly though: ZUN himself said that only the real SC-88Pro will play back these files as he intended them to sound. This quote is likely where the tagline of Romantique Tp's entire recording project came from in the first place:

> 全てのデエタはSC-88ProもしくはSC-8850(ロオランド社)にて最適に聴けるように調整してあります > それ以外の音源でも、作者の意図した音ではない場合があります。 — ZUN on 東方幻想的音楽, his old MIDI page

However. ZUN is not exactly known for accurately and carefully preserving the legacy of his series, or really doing anything beyond parading his old games as unobtainable showpieces at conventions. With all the issues we've seen, preferring real hardware is ultimately just that: an angle, and a preference. This is why I disagree with the heavy and uncritical advertising that is mainly responsible for elevating the Romantique Tp recordings to their current reference status within the community, especially if at least half of the alleged superiority of real hardware is founded on undefined behavior that can easily be fixed in the MIDI files themselves if people only bothered to look.

Here's where I stand: MIDI files are digital sheet music first and foremost, not an inferior version of tracker modules where the samples are sold separately. As such, the specific synth a MIDI file was written for is merely a secondary property of the composition – and even more so if the MIDI file contains little to nothing in terms of sound design and mostly restricts itself to the basic feature set of General MIDI. In turn, synth quirks and bugs are not a defined part of the composition either, unless they are clearly annotated and documented in the file itself. And most importantly: If the MIDI file specifies a certain timing and a recording fails to reproduce that timing, then that recording is not an accurate representation of the MIDI file.
In that regard, Sound Canvas VA is not only the closest alternative to the real thing, as a few people in the MIDI and retrogaming scene do have to admit, but superior to the real thing. I'll gladly take clarity and perfect timing accuracy in exchange for minor differences in effects, especially if the MIDI file does not explicitly and correctly define said effects to begin with. If I want a panning delay as part of the reverb, I add the respective and correct SysEx message to define one – and if I don't, I do not care about the reverb. You might still get a panning delay on a certain synth, and you might even prefer how it sounds, but it's ultimately a rendering artifact and not a consciously intended part of the composition. In that way, it's similar to the individual flavor a musician adds to a performance of a piece of classical music.
And as far as the differences in frequency response and resonant filters are concerned: In Yamaha land, these are exactly the main distinguishing factors between vintage WF-192XG sound cards (resembling the real SC-88Pro in these characteristics) and the S-YXG50 softsynth (resembling SCVA). Once I found out about that softsynth and how much clearer it sounded in comparison, I sold that old PCI sound card soon after.

In the interest of preservation though, there's still one more unexplored solution that could be the ideal middle ground between the two approaches:

  1. Play the MIDIs through a real-hardware SC-88Pro again
  2. Capture the actually observed system-exclusive settings that fall within the synth's supported and documented ranges
  3. Insert them back into the MIDI file, creating a new bugfixed version
  4. Re-record that bugfixed version through Sound Canvas VA

Edit (2024-03-10): And since Romantique Tp has confirmed what exactly happens on real hardware, I'm going to do exactly that. These bugfixed Sound Canvas VA renderings will be a free bonus of the single next Shuusou Gyoku push, and will add another angle to the preservation of these soundtracks. In the meantime though, the Sound Canvas VA packs will sound like they do in the preview videos above.

Or, you know… Maybe none of this actually matters. Here's beatMARIO streaming some Shuusou Gyoku gameplay using what looks like a real-hardware SC-8850, which plays these MIDIs with occasionally noticeably different instrument patches and no panning delay in the name registration theme, and he still enjoyed every second of it. Imagine undefined SysEx behavior not even being consistent within the same family of Roland synths… nah, I'm done arguing, let's get back to the actual work and cut some loops.


Just to be clear: I'm not suggesting that Romantique Tp should have been the one to cut their recordings into loops, or even just the one who defined where the loop points are supposed to be. On the surface, this seems to be a non-issue, and you'd just pick a point wherever each track appears to loop, right? But with 39 MIDIs to cut and all the financial support from Ember2528, it made sense to also solve this problem more thoroughly, and algorithmically detect provably correct loop points for all of these files. Who knows, maybe we even find some surprises that make it all worth it?
This is the algorithm I came up with:

Of course, this algorithm isn't perfect and won't work for every MIDI file out there. It doesn't consider things like differently ordered events within the same MIDI pulse, (non-)registered parameter numbers, or the effect that SysEx messages can have on the state of individual channels. The latter would require the general SysEx decoding logic that I would have liked to have for the research above… actually, let's add an issue and add the project to the order form. I'd really like to see a comprehensive open-source cross-vendor SysEx decoder library in my lifetime.

As for the implementation, I was happy to write some Rust again for a change, as it's a great fit for these standalone greenfield command-line tools that don't have to directly interact with the legacy C++ code bases that this project usually deals with. It's even better if the foundational functionality is not just available in a crate, but in four, with the community already having gone through multiple iterations to arrive at a tried and tested winner. Who knows, maybe I even get to rewrite this website in it one day? Just for the sheer meme value of doing so, of course.
I also enjoyed this a lot from a technical point of view:

This algorithm works well for the long MIDI files of Shuusou Gyoku's OST that all contain multiple duplicates of their loop section, but it quickly reaches its limit with the AST. Following the classic two-loop + fade-out format, that soundtrack was meant to be played back in generic MIDI players, and not to actually be put back into the game in looped form. Since the loop algorithm did, in fact, find inconsistencies even in the OST, two copies of the apparent loop are sometimes not enough to prove cases where the actual loop ends much later than you think it does. In a few cases, it would be enough to simply remove all volume change events from the fade-out to prove the actual loop, but in others, the algorithm would need MIDI event data far past the end of the fade-out.

However, just giving up and not looping any of these tracks would be equally unfortunate. So how about shifting the question, from what's the best loop in this MIDI file to what's the best loop if the MIDI didn't fade out and instead repeated its apparent second loop a third time? As long as the detected loop in such a pre-processed file ends before the repeated range, it's still a valid loop in terms of the unmodified original.
Ideally, we want to do this pre-processing programmatically with the same Rust library instead of manually editing the MIDI. Many sequencers (and especially XGworks) apply significant changes to a MIDI file's internal structure when saving its internal representation back to a MIDI file, which might even mess with our loop algorithm. So it would be very nice to have a more trustworthy tool that applies only the edit we actually want, and perfectly retains the rest of the MIDI.

And that's how this sub-project turned into a small suite of command-line MIDI operations in the classic Unix filter/pipeline style: Each command reads a MIDI file from stdin, transforms it, and outputs text or the resulting MIDI file on stdout. This way, we gain maximum transparency and reproducibility as I can document the unique pre-processing steps for each AST track by simply providing the command lines. And sure, we're re-encoding and re-decoding the full MIDI sequence at every step along such a pipeline, but computers are fast, Rust and the midly library in particular are ⚡ blazingly fast ⚡, and the usability benefits of this pipeline model far outweigh any theoretical performance drops.
Here's the full list of commands that made it into the resulting mly tool:

This feature set should strike a good balance between not spending too much of the Shuusou Gyoku budget on tangential problems, but still offering a decent solution for the problem at hand. As a counterexample, the obvious killer feature – deserializing a dump back into a Standard MIDI File – would have gone way past the budget. While there are crates that free you from the need to write manual parsing code for basic data structures, they would instead require a lot of attribute boilerplate – and if the library that provided the structures doesn't already come with these attributes, you now have to duplicate all the structures, and convert back and forth between the original structures and your copies. Not to mention that we'd still have to write code for the high-level structure of the dump output…

If we put it all together, this is what we can do:

$ <ssg_02.mid mly loop-find
Best loop in note space: 4 events (between event #[117, 121[ and [121, 125[)
First note: event    71 / pulse    960 / beat   2:000 / 0:00:800m
Loop start: event   117 / pulse   1680 / beat   3:240 / 0:01:400m
  Loop end: event   121 / pulse   1920 / beat   4:000 / 0:01:600m

$ <ssg_02.mid mly cut 466: | mly loop-unfold 240: | mly -r 44100 loop-find
Track #0: Removing events #[16439, 19881[
Track #0: Repeating events #[8344, 16439[ at the end of the sequence
Best loop in note space: 8095 events (between event #[5625, 13720[ and [13720, 21815[)
First note: event    71 / pulse    960 / beat   2:000 / 0:00:800m
Loop start: event  5625 / pulse  75361 / beat 157:001 / 1:03:531m
  Loop end: event 13720 / pulse 183841 / beat 383:001 / 2:34:726m

Best loop in recording space:  8095 events (between event #[5709, 13804[ and [13804, 21899[)
First note: event    71 / pulse    960 / beat   2:000 / 0:00:800m / sample    35280.00
Loop start: event  5709 / pulse  77280 / beat 161:000 / 1:05:163m / sample  2873667.66
  Loop end: event 13804 / pulse 185760 / beat 387:000 / 2:36:358m / sample  6895375.27

Translation:


So, where are these loop quirks that justify why some of these audio files are longer than you'd think they should be? Just listing them as text wouldn't really communicate just how minor these are. It would be much nicer to visualize them in a way that highlights the exact inconsistencies within a fixed range of MIDI measures. Screenshots of MIDI sequencer or DAW windows won't capture these aspects all too well because these programs are geared toward fine-grained editing of single tracks, not visualization of details across all channels.

Screenshot of the first 8 measures of Shuusou Gyoku's Stage 1 theme (フォルスストロベリー) in its OST version, as visualized by REAPER's piano roll
REAPER's piano roll nicely snaps to a certain range, but good luck picking out the individual lines from the single volume lane at the bottom of the screen, or spotting a 7-point difference. Not to mention that CC #11 (Expression) makes up an equal part of a channel's final perceived volume, which is the metric we'd actually want to visualize.

Typical MIDI visualizers, however, are on the complete opposite end of the spectrum. In recent years, MIDI visualization has become synonymous with the typical Synthesia style of YouTube videos with a big keyboard at the bottom, note bars flying in from the top, and optional fancy effects once those notes hit the top of the keyboard. The Black MIDI community has been churning out tons of identically looking MIDI visualizers in recent years that mainly seem to differ in the programming language they're written in, and in how well they can cope with the blackest of black MIDIs.
Thankfully, most of these visualizers are open-source and have small and manageable codebases. The project with the most GitHub stars and the most generic name seemed to be the best starting point for hacking in the missing features, despite using GLSL shaders which I had no prior experience with. It was long overdue that I did something with GLSL though – it added a nice educational aspect to these hacks, and it still was easier than deciphering whatever the fastest and hyper-optimized Rust visualizer is doing.
Still, this visualizer needed a total of 18 small features and bugfixes to be actually usable for demonstrating Shuusou Gyoku's loop quirks. As such, these hacks turned into yet another tangential sub-project that could have easily consumed another two pushes if I cleaned up the code and published the result. But that would have really gone way past the budget for something that people might not even care about. So here's what we're going to do:


Alright then! Here's how to read the visualizations:



Before we package up these looped soundtracks, let's take a quick look at how they would be shown off in the Music Room. The Seihou Music Rooms carry over the per-channel keyboards from TH05, add the current per-channel volume, expression, and pan pot values, and top it off with a fake spectrum analyzer. All of these visualizations rely on MIDI data, and the Music Room would feel very dull and boring without them. Just look at Kioh Gyoku, whose Music Room basically turns into a still image in WAVE mode.
Retaining these visualizations even when playing waveform BGM was very important for me, and not just because it would make for a unique high-quality feature that would break new ground. It can also double as proof that the waveform versions are, in fact, in perfect sync with both the MIDIs they are based on, and, by extension, the respective stage scripts.
However, this would require the game to process the MIDIs and update the internal visualization state without simultaneously playing them back through the WinMM / MME / midiOut*() API. And just like graphics and text rendering, Shuusou Gyoku's original code came with zero architectural separation between platform-independent processing logic and platform-specific playback…

So I accidentally rewrote almost the entire MIDI code to achieve said separation. :tannedcirno: This also provided a great occasion to modernize this code and add some much-needed robustness for potential MIDI mods, while retaining the original code's approach of iterating over raw SMF byte streams. It might all have been very excessive for a delivery that was supposed to be just about waveform BGM support, but on the plus side, MIDI output is now portable to any other system's MIDI API as well.

Surprisingly though, it was Shuusou Gyoku's original MIDI timing that quickly turned out to be rather inaccurate, and not the waveforms. The exact numbers vary depending on the piece, but the game played back every MIDI about 1% slower than notated, adding about 2 or 3 seconds to their total playback time after 5 minutes. Tempo changes in particular were the biggest causes of desynchronizations with the waveforms… :thonk:
To understand how this can happen to begin with, we have to look closer at how you're supposed to use the midiOut*() API. This API is as low-level as it gets, only covering the transmission of a single MIDI message to the selected output device right now. There is no concept of note timing at this low level, so it's completely up to the program to parse delta times and tempo change events out of the MIDI file and correctly time the calls to this API for each MIDI message. With all the code that runs between the API and the actual renderer of the synth for every single message, the resulting timing can only ever be an approximation of the MIDI file. This doesn't really matter for the timescales and polyphony levels of typical music because, again, computers are fast, but such an API is fundamentally unsuitable for accurately playing back even just a moderately complex million-note Black MIDI. :onricdennat:

Shuusou Gyoku handles this required manual timing in the simplest possible way: It runs a MIDI processing function (Mid_Proc() in the code) at an interval of 10 ms, which processes and instantly sends out all MIDI events that have occurred at any point within the last 10 ms, maintaining merely their order. This explains not only why the original game incremented its MIDI TIMER by multiples of 10, but also the infamous missing drums when playing the soundtrack through the Microsoft GS Wavetable Synth:

But while sending MIDI events in such quantized chunks might not be perfect, it can't be the cause behind multi-second playback slowdowns. Instead, this issue has to boil down to the way Shuusou Gyoku times each individual message, and specifically how it converts between MIDI pulse units and real-time (milli)seconds. pbg's original MIDI code chose to do this in an equally confusing and inaccurate way: it kept two counters that tracked the current MIDI pulse before and after the latest tempo change, used the value of the latter counter to decide which events to process, and only added the pulse equivalent of 10 ms to this counter at the end of Mid_Proc() in the then current tempo. The commit message for my rewritten algorithm details the problems with this approach using nice ASCII art in case you're interested, but in short, the main problem lies in how the single final addition can only consider a single tempo change within each call to Mid_Proc(). If a MIDI file contains tempo ramps with less than 10 ms between each different tempo, the original game would only use the last of these tempo values as the basis for converting the entire 10 ms back into MIDI pulses. Not to mention that maybe MIDI pulses aren't the best unit in a game that still 📝 treats the FPU as lava and doesn't use any fixed-point means of increasing the resolution of the 10 ms→pulse division either…

On the contrary, it's much more accurate to immediately convert every encountered MIDI delta time to a real-time quantity and use that unit for event timing, especially if we want to restrict ourselves to integer math. Signed 64-bit integers are enough to fit the product of the slowest possible MIDI tempo ((224 - 1) µs per quarter note) and the highest possible MIDI delta time (228 - 1) at nanosecond precision (103), with one bit to spare. Then, we arrive at a much simpler timing algorithm:

The additive nature of this timer not only naturally allows more than one event to happen within a single Mid_Proc() call, but also averages out any minor timing inconsistencies across the length of a track.

This new algorithm did improve the overall timing accuracy, but only barely, shaving off just ≈100 ms of the total duration. Turns out that the main source behind the slowness was hiding somewhere else entirely, in the single line that deserializes tempo values from MIDI's big-endian representation into the native integer format:

assert(length_of_tempo_message == 3);
uint32_t tempo = 0;
for(int i = 0; i < length_of_tempo_message; i++) {
-	tempo += ((tempo << 8) + (*track_data++));
+	tempo  = ((tempo << 8) + (*track_data++));
}

Yup – the original code performed two additions per byte, which incorrectly added the interim value at every byte to the final result, and yielded a tempo that is ≈0.8% / ≈1 BPM slower than notated in the MIDI file, matching the number we were looking for. That's why the |/OR operator is the safer one to use in such a bit-twiddling context…
But now I'm curious. This is such a tiny bug that is bound to remain unnoticed until someone compares the game's MIDI output to another renderer. It must have certainly made it into other games whose MIDI code is based on Shuusou Gyoku's, or that pbg was involved with. And sure enough, not only did this bug survive Kioh Gyoku's OOP refactoring, but it even traveled into Windows Touhou, where it remained in every single game that supported MIDI playback. Now we know for a fact that pbg's Program Support role in the TH06 credits involved sharing ready-made, finished code with ZUN:

Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH06Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH07Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH08Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH09Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH10
The broken tempo deserialization in the respective latest full versions of TH06 through TH10. And yes, that's TH10 – even though TH09's trial version was the last game to ship MIDI versions of its soundtrack, TH10 still contained all of pbg's MIDI code that originated back in Shuusou Gyoku, before TH11 finally removed it.
Amusingly, ZUN's compiler even started optimizing the combination of left-shifting and addition to a multiplication with 257 for TH09, which even sort of highlights this bug if you're used to reading x86 ASM.

That leaves support for MIDI loop points as the only missing feature for syncing MIDI data with a looping waveform track. While it didn't require all too much code, pbg's original zero-copy approach of iterating over raw MIDI data definitely injected a lot of complexity into the required branches. Multi-track/SMF Type 1 files require quite a bit of extra thought to correctly calculate delta times across loop boundaries that reach past the end of the respective track, while still allowing the real-time delta values to be resynchronized at tempo changes within the loop – and yes, 3 of ZUN's 19 arranged MIDI files actually do use more than one track, so this wasn't just about maximizing MIDI compatibility for mods. I stuck to the original approach mostly as a challenge and to prove that it's possible without first parsing the entire MIDI sequence into a friendlier internal representation, but I absolutely do not recommend this to anyone else. :tannedcirno:

After hardcoding the loop points detected by mly into the binary, we only need to call Mid_Proc() once per frame in the Music Room and pass the frame delta time instead of the 10 ms constant. And then, we get this:

The MIDI TIMER now shows off the arguably more interesting current MIDI pulse value rather than just formatting the PASSED TIME in milliseconds. Ironically, displaying this value in a constantly counting way takes more effort now – the new nanosecond-based timing code doesn't use any measure of total MIDI pulses anymore, and they don't naturally fall out of the algorithm either. Instead, the code remembers the total pulse value of the last event it processed and adds the real-time duration that has passed since, similar to the original timing algorithm.
This naturally causes the timer to jump from the loop end pulse to the loop start pulse, proving that Mid_Proc() is in fact looping the sequence.

Alright, now we know what to package:

Unfortunately, we still haven't reached the end of the complications and weird issues that haunt Shuusou Gyoku's music:

  1. The original game reads the in-game track title directly out of the first Sequence Name event of the playing MIDI file. The waveform equivalent would be the Vorbis comment TITLE tag, which therefore should exactly match the original track's title, down to the exact placement of whitespace. As usual, if I emphasize minor things like this, it's not without reason: 幻想科学 ~ Doll's Phantom inconsistently uses halfwidth spaces at both sides of the , and wouldn't fit into the Music Room's limited space otherwise.

  2. However, the AST MIDI files jam a bunch of other metadata into their Sequence Names, roughly following the format
    【 $title 】 from 秋霜玉  for sc88Pro comp.ZUN
    The track titles should definitely not appear in this format in-game, but how do we get rid of this format without hardcoding either the names or the magic to parse the names out of this format? :thonk:
  3. The absolute state of GS SysEx tooling rears its ugly head one final time in three of the AST MIDIs, which for some reason are missing the Roland vendor prefix byte in all of their SysEx messages and are therefore undeniably bugged. There even seemed to be another SysEx-related bug which Romantique Tp explained away, but not this one:

    ssg_04.mid

    0:000	SysEx(   10 42 12 40 00 7F 00 41 F7)
    0:240	SysEx(   10 42 12 40 01 30 14 7B F7)
    0:360	SysEx(   10 42 12 40 01 33 14 78 F7)
    0:420	SysEx(   10 42 12 40 01 34 50 3B F7)

    ssg_05.mid

    0:000	SysEx(   10 42 12 40 00 7F 00 41 F7)
    0:240	SysEx(   10 42 12 40 01 30 14 7B F7)
    0:360	SysEx(   10 42 12 40 01 33 00 0C F7)
    0:420	SysEx(   10 42 12 40 01 34 14 77 F7)

    ssg_10.mid

    0:000	SysEx(   10 42 12 40 00 7F 00 41 F7)
    0:240	SysEx(   10 42 12 40 01 30 14 7B F7)
    0:360	SysEx(   10 42 12 40 01 33 00 0C F7)
    0:420	SysEx(   10 42 12 40 01 34 60 2B F7)

    ssg_04.mid

    0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)	GS Reset
    0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)	Reverb Macro #20
    0:360	SysEx(41 10 42 12 40 01 33 14 78 F7)	Reverb Level 20
    0:420	SysEx(41 10 42 12 40 01 34 50 3B F7)	Reverb Time 80

    ssg_05.mid

    0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)	GS Reset
    0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)	Reverb Macro #20
    0:360	SysEx(41 10 42 12 40 01 33 00 0C F7)	Reverb Level 0
    0:420	SysEx(41 10 42 12 40 01 34 14 77 F7)	Reverb Time 20

    ssg_10.mid

    0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)	GS Reset
    0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)	Reverb Macro #20
    0:360	SysEx(41 10 42 12 40 01 33 00 0C F7)	Reverb Level 0
    0:420	SysEx(41 10 42 12 40 01 34 60 2B F7)	Reverb Time 96
    The irony of using invalid Reverb Macros within already invalid SysEx messages is not lost on me.

    This is something we should fix even before running these files through Sound Canvas VA in order to render these with the reverb settings that ZUN clearly (and, for once, unironically) intended.

  4. For perfect preservation of the original BGM/gameplay synchronicity, it makes sense for the waveform versions to retain the leading 1 or 2 beats of silence that the original MIDI files use for their SysEx setup. While some of the AST tracks use a slightly different tempo compared to their OST counterparts, they would still be largely in sync as ZUN didn't rearrange the layout of their setup area… except for, once again, the three tracks used in the Extra Stage. :zunpet: Marisa's and Reimu's boss themes aren't too bad with their 4 beats of setup, but シルクロードアリス takes the cake with a whopping 12 beats of leading silence. That's 5 seconds from the start of the Extra Stage to the first note you'd hear. 🐌

2) and 4) could theoretically be worked around in Shuusou Gyoku's MIDI code, but there's no way around editing the MIDI files themselves as far as 3) is concerned. Thus, it makes sense to apply all of the workarounds to the AST MIDIs as part of the BGM build process – parsing the titles out of the 【brackets】, inserting the Roland vendor prefix byte where necessary, and compressing the setup bars in the Extra Stage themes to match their OST counterparts. Adding any hidden magic to the MIDI code would only have needlessly increased complexity and/or annoyed some modder in the future who would then have to work around it.
Ideally, these edits would involve taking the mly dump output, performing the necessary replacements at a plaintext level, and rebuilding the result back into a MIDI file, bu~t we're unfortunately missing the latter feature. Luckily, someone else had the same idea 13 years ago and wrote a tool in C that does exactly what we need. Getting it to compile in 2024 only required fixing a typical C thing… why are students and boomers defending this antique of a language again? 🙄

The single most glaring issue, however, is the drastic difference in volume between the individual tracks in both soundtracks. While Romantique Tp had to normalize each track to the maximum possible volume individually as a consequence of the recording process, the Sound Canvas VA renderings reveal just how inconsistent the volume levels of these MIDI files really are:

The peak amplitudes of every track in both soundtracks, as rendered by Sound Canvas VA at maximum volume. Looking at these, you might think that kaorin's 2007 recordings were purposely trying to preserve the clipping that would come out of an SC-88Pro if you don't manually adjust the volume knob for each song, but those recordings are still much louder than even these numbers.

So how do we interpret this? Is this a bug, because no one in their right mind would want their music to clip on purpose, and that in turn means that everything about these volume levels is arbitrary and unintentional? Or is this a quirk, and ZUN deliberately chose these volume levels for compositional reasons? It certainly would make sense for the name registration theme.
Once again, the AST version of シルクロードアリス is the worst offender in this regard as well, but it might also provide some evidence for the quirk interpretation. The fact that almost all of its MIDI channels blast away at full volume might have been an accident that could have gone unnoticed if the volume knob of ZUN's SC-88Pro was turned rather low during the time he arranged this piece, but the excessive left-panning must have been deliberate. Even Romantique Tp agrees:

Stereo waveform of the Sound Canvas VA rendering of Shuusou Gyoku's Extra Stage theme (シルクロードアリス), highlighting the excessive left-panningStereo waveform of Romantique Tp's recording of Shuusou Gyoku's Extra Stage theme (シルクロードアリス), highlighting the excessive left-panning
It might have even made compositional sense if Silk Road Alice was supposed to be a "Western-style piece", but it's not. :zunpet:

And that's with the volume already normalized. Because this one channel of this one track is almost twice as loud as anything else in the AST, we would consequently have to bring down the volume of every other arranged track and the right channel of the same track by almost 50% if we wanted to maintain the volume differences between the individual tracks of the AST. In the process, we lose almost one entire bit of dynamic range. At this rate, you might even consider remixing and remastering the entire thing, but that would involve so many creative decisions to definitely fall into fanfiction territory…

However, normalizing each track to a peak level of 0 dBFS makes much more sense for in-game playback if you consider how loud Shuusou Gyoku's sound effects are. Once again, the best solution would involve offering both versions, but should we really add two more SCVA BGM packs just to cover volume differences? :thonk:
ReplayGain solves this exact problem for regular music listening in a non-destructive way by writing the per-track and per-album gain levels into an audio file's metadata. Since we need metadata support for titles anyway, we can do something similar, albeit not exactly the same for two reasons:

And so, we hard-apply the volume-level gain during the conversion from 32-bit float to FLAC to preserve the volume differences between the tracks, calculate the track-level GAIN FACTOR based on the resulting peak levels, add a volume normalization toggle to the Sound / Config menu, enable it by default, and thus make everyone happy. ✅

The final interesting tidbit in building these packages can be found in the way the Sound Canvas VA recordings are looped. When manually cutting loops, you always have to consider that the intro might end with unique notes that aren't present at the end of the loop, which will still be fading out at the calculated loop start point. This necessitates shifting the loop start point by a few bars until these notes are no longer audible – or you could simply ignore the issue because ZUN's compositions are so frantic that no one would ever notice. :onricdennat:
With the separate intro and loop files generated by mly, on the other hand, the reverb/release trails are immediately visible and, after trimming trailing silence, exactly define the number of samples that the calculated loop start point needs to be shifted by. The .loop file then remains always exactly as long, in samples, as the duration of the loop reported by mly. If a piece happens to have a constant tempo whose beat duration corresponds to an integer number of samples, we get some very satisfying, round loop durations out of this process. ☺️


So let's play it all back in-game… and immediately run into two unexpected miniaudio limitations, what the…?!

  1. miniaudio uses a fixed linear function for its fade-out envelope, and doesn't offer anything else? We might not even want a logarithmic one this time because symmetry with MIDI's simple quadratic curve would be neat, but we sure don't want a linear function – those stay near the original volume for too long, and then turn quiet way too quickly.
  2. There is no way to access FLAC metadata from miniaudio's public API, even though the library bundles the author's own FLAC library which has this feature?

📝 Back when I evaluated miniaudio, I alluded that I consider single-file C libraries to be massively overrated, and this is exactly why: Once they grow as massive as miniaudio (how ironic), they can quickly lead to their authors treating their dependencies as implementation details and melting down the interfaces that would naturally arise. In a regular library, dr_flac would be a separate, proper dependency, and the API would have a way to initialize a stream from an externally loaded drflac object. But since the C community collectively pretends that multi-file libraries are a burden on other developers, miniaudio ended up with dr_flac copy-pasted into its giant single file, with a silly ma_ namespacing prefix added to all its functions. And why? Did we have to move so far in the other direction just because CMake doesn't support globbing? That's a symptom of CMake not actually solving any problem, not a valid architectural decision that libraries should bend around. 🙄
So unless we fork and hack around in miniaudio, there's now no way around depending on a second, regular copy of dr_flac. Which has now led to the same project organization bloat that single-file libraries originally set out to prevent…

Sigh. At this rate, it makes more sense to just copy-paste and adapt the old BGM streaming code I wrote for thcrap in late 2018, which used dr_flac directly, and extend it with metadata support. With the streaming code moved out of the platform layer and into game logic, it also makes much more sense to implement the squared fade-out curve at that same level instead of copy-pasting and adjusting an unhealthy amount of miniaudio's verbose C code.
While I'm doing the same for the old Vorbis streaming code, it would also make sense to rewrite that one to use stb_vorbis instead of the old libogg+libvorbis reference libraries. There's no need to add two more dependencies if miniaudio already comes with stb_vorbis.c, and that library is widely acclaimed. So, integration should be a breeze, right?
Well, surprise, rarely have I seen a C library so actively hostile toward being integrated. Both of its API variants are completely unreasonable:

What happened to the tried-and-true idea of providing a structure with read, tell, and seek callbacks, and then providing an optional variant for C FILE* handles if you absolutely must? Sure, the whole point of Vorbis is to be small and nobody these days would care about spending a few MB on keeping an entire Vorbis file in memory, but come on. If pulldata made the deliberate and opinionated choice to only support buffers of complete Vorbis streams and argued in the name of simplicity that hand-coded disk streaming isn't worth it in this day and age, I might have even been convinced. And this is from the guy who popularized the concept of single-file C libraries in the first place? :thonk:

Oh well, tupblocks go brrr. libvorbis definitely shows its age with all the old command-line tools in the lib/ directory that they never moved away and that we now have to remove from our glob. But even that just adds a single line to the Tupfile, and then we get to enjoy its much friendlier API. That sure beats the almost 800 lines of code that miniaudio had to write to integrate stb_vorbis… which I can't even link because the file is too big for GitHub. 🤷
At this point, it would have even made sense to upgrade from a 24-year-old lossy codec to an 11-year-old lossy codec and use Opus instead, since the enforced 48,000 Hz sampling rate is a non-issue when you control the entire audio pipeline. But let's keep compatibility with existing thcrap mods for now.

The last time I added dependencies, 📝 I wondered whether just downloading and extracting official Windows binary builds might be superior to pasting batch script duct tape over the usability issues of Git submodules. However, I still wanted to try out Git's sparse checkout feature before, in an attempt to remove all the unneeded bloat… and as it turned out, this might just be the idealistic and perfect nirvana of vendoring libraries in C++ projects. I particularly like how the limitations of its default mode (always checking out all files within each directory level that shows up in a filter) can be turned into a guideline about how to structure a repository: All non-essential stuff that consumers of your code might not need – tests, high-level documentation, or optional features – should go into a subdirectory where it can be easily filtered.
And that's how the size of our libs/ directory went down from 82.7 MiB in the P0256 build to 30.4 MiB in the P0275 build, despite adding 4 more libraries in the latter. Now if only this didn't require even more duct tape to actually set up shallow clones correctly

In the end, the Windows build ended up using only a single one of the miniaudio features that DirectSound doesn't have, and that's the ability to use the more modern WASAPI instead of DirectSound. We're still going to use miniaudio for the Linux port, but as far as Windows is concerned, it would be quite nice to backport BGM streaming to the game's original DirectSound backend. The P0275 build is pushing 1 MiB of binary size for a game that originally came in a 220 KiB binary, so it would remove a noticeable amount of bloat from GIAN07.EXE, but it would also allow waveform BGM to work in the Windows 98-compatible i586 build. If that sounds cool to you, this is the issue you want to fund.


That only left some logic and UI busywork to put it all together, which means that we've almost reached the end of things to talk about! Here's what it all looks like:



After half a year of being bought out way past the cap, I've finally got some small room left for new orders again. If it weren't for this blog post and the required research and web development work, this delivery would have probably come out in early January, taking half the time it ended up taking. So I really have to start factoring the blog posts into the push prices in a better and fairer way.
Meanwhile, the hate toward my day job only keeps growing, but there's little point in looking for a new one as long as ReC98 remains this motivating and complex. It leaves pretty much no cognitive room for any similarly demanding job. Thus, I want 2024 to be the year where ReC98 either becomes profitable enough to be my only full-time job, or where we conclusively find out that it can't, I go look for a better day job, and ReC98 shifts to a slower pace. Here's the plan:

With the new price of per push, this means that there's now a small window in which you can get a full push worth of functionality for , until the current cap is filled up again.

Next up: Probably TH02's endings to relax a bit. Maybe we're also getting some new Touhou-related contributions?

📝 Posted:
🚚 Summary of:
P0264, P0265
Commits:
46cd6e7...78728f6, 78728f6...ff19bed
💰 Funded by:
Blue Bolt, [Anonymous], iruleatgames
🏷 Tags:

Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.

But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!

  1. The Music Room background masking effect
  2. The GRCG's plane disabling flags
  3. Text color restrictions
  4. The entire messy rest of the Music Room code
  5. TH04's partially consistent congratulation picture on Easy Mode
  6. TH02's boss position and damage variables

As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.

The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:

TH02's Music Room background in its on-disk state TH03's Music Room background in its on-disk state TH04's Music Room background in its on-disk state TH05's Music Room background in its on-disk state
Let's call this "the blank image".

That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right? :thonk:
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:

TH02's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH03's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH04's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH05's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom
The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.

On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.


Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:

// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);

// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);

// Regular GRCG blitting code to any VRAM segment…
pokeb(0xA8000, offset, …);

// We're done, turn off the GRCG.
outportb(0x7C, 0x00);

This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck.

Three overlapping Music Room polygons rendered using master.lib's grcg_polygon_c() function with a disabled GRCGThree overlapping Music Room polygons rendered as in the original game, with the GRCG enabled
An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.

As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:

The contents of nopoly_B with each game's first track selected.

Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:

And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier… :onricdennat:


For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, Try to Normal Rank!! could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".

With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌


The remaining 1/6th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.

Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there! Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.

Screenshot of TH02's Five Magic Stones, with the first two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the second two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the last one (both internally and in the order you fight them in) alive and activated

This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!

These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just 1/6th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.

Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.

📝 Posted:
🚚 Summary of:
P0262, P0263
Commits:
ae2fc28...741d889, 741d889...46cd6e7
💰 Funded by:
Blue Bolt, [Anonymous]
🏷 Tags:

And once again, the Shuusou Gyoku task was too complex to be satisfyingly solved within a single month. Even just finding provably correct loop sections in both the original and arranged MIDI files required some rather involved detection algorithms. I could have just defined what sounded like correct loops, but the results of these algorithms were quite surprising indeed. Turns out that not even Seihou is safe from ZUN quirks, and some tracks technically loop much later than you'd think they do, or don't loop at all. And since I then wanted to put these MIDI loops back into the game to ensure perfect synchronization between the recordings and MIDI versions, I ended up rewriting basically all the MIDI code in a cross-platform way. This rewrite also uncovered a pbg bug that has traveled from Shuusou Gyoku into Windows Touhou, where it survived until ZUN ultimately removed all MIDI code in TH11 (!)

Fortunately, the backlog still had enough general PC-98 Touhou funds that I could spend on picking some soon-important low-hanging fruit, giving me something to deliver for the end of the month after all. TH04 and TH05 use almost identical code for their main/option menus, so decompiling it would make number go up quite significantly and the associated blog post won't be that long…

Wait, what's this, a bug report from touhou-memories concerning the website?

  1. Tab switchers tended to break on certain Firefox versions, and
  2. video playback didn't work on Microsoft Edge at all?

Those are definitely some high-priority bugs that demand immediate attention.

  1. Microsoft Edge's anti-support of AV1
  2. TH04/TH05's main/option menu
  3. TH04/TH05's first-launch sound setup menu
  4. TH05's title animation ☯️

The tab switcher issue was easily fixed by replacing the previous z-index trickery with a more robust solution involving the hidden attribute. The second one, however, is much more aggravating, because video playback on Edge has been broken ever since I 📝 switched the preferred video codec to AV1.
This goes so far beyond not supporting a specific codec. Usually, unsupported codecs aren't supposed to be an issue: As soon as you start using the HTML <video> tag, you'll learn that not every browser supports all codecs. And so you set up an encoding pipeline to serve each video in a mix of new and ancient formats, put the <source> tag of the most preferred codec first, and rest assured that browsers will fall back on the best-supported option as necessary. Except that Edge doesn't even try, and insists on staying on a non-playing AV1 video. 🙄

The codecs parameter for the <source> type attribute was the first potential solution I came across. Specifying the video codec down to the finest encoding details right in the HTML markup sounds like a good idea, similar to specifying sizes of images and videos to prevent layout reflows on long pages during the initial page load. So why was this the first time I heard of this feature? The fact that there isn't a simple ffprobe -show_html_codecs_string command to retrieve this string might already give a clue about how useful it is in practice. Instead, you have to manually piece the string together by grepping your way through all of a video's metadata
…and then it still doesn't change anything about Edge's behavior, even when also specifying the string for the VP9 and VP8 sources. Calling the infamously ridiculous HTMLMediaElement.canPlayType() method with a representative parameter of "video/webm; codecs=av01.1.04M.08.0.000.01.13.00.0" explains why: Both the AV1-supporting Chrome and Edge return "probably", but only the former can actually play this format. 🤦

But wait, there is an AV1 video extension in the Microsoft Store that would add support to any unspecified favorite video app. Except that it stopped working inside Edge as of version 116. And even if it did: If you can't query the presence of this extension via JavaScript, it might as well not exist at all.
Not to mention that the favorite video app part is obviously a lie as a lot of widely preferred Windows video apps are bundled with their own codecs, and have probably long supported AV1.

In the end, there's no way around the utter desperation move of removing the AV1 <source> for Edge users. Serving each video in two other formats means that we can at least do something here – try visiting the GitHub release page of the P0234-1 TH01 Anniversary Edition build in Edge and you also don't get to see anything, because that video uses AV1 and GitHub understandably doesn't re-encode every uploaded video into a variety of old formats.
Just for comparison, I tried both that page and the ReC98 blog on an old Android 6 phone from 2014, and even that phone picked and played the AV1 videos with the latest available Chrome and Firefox versions. This was the phone whose available Firefox version didn't support VP9 in 2019, which was my initial reason for adding the VP8 versions. Looks like it's finally time to drop those… 🤔 Maybe in the far future once I start running out of space on this server.

Removing the <source> tags can be done in one of two places:

  1. server-side, detecting Edge via the User-Agent header, or
  2. client-side, using navigator.userAgentData.brands.

I went with 2) because more dynamic server-side code would only move us further away from static site generation, which would make a lot of sense as the next evolutionary step in the architecture of this website. The client-side solution is much simpler too, and we can defer the deletion until a user actually hovers over a specific video.
And while we're at it, let's also add a popup complaining about this whole state of affairs. Edge is heavily marketed inside Windows as "the modern browser recommended by Microsoft", and you sure wouldn't expect low-quality chroma-subsampled VP9 from such a tagline. With such a level of anti-support for AV1, Edge users deserve to know exactly what's going on, especially since this post also explains what they will encounter on other websites.

A popup on top of a ReC98 blog video, showing the caption "⚠️ Edge does not support AV1, falling back on low-quality video…"
That's the polite way of putting it.

Alright, where was I? For TH01, the main menu was the last thing I decompiled before the 100% finalization mark, so it's rather anticlimactic to already cover the TH04/TH05 one now, with both of the games still being very far away from 100%, just because people will soon want to translate the description text in the bottom-right corner of the screen. But then again, the ZUN Soft logo animation would make for an even nicer final piece of decompiled code, especially since the bouncing-ball logo from TH01, TH02, and TH03 was the very first decompilation I did, all the way back in 2015.

The code quality of ZUN's VRAM-based menus has barely increased between TH01 and TH05. Both the top-level and option menu still need to know the bounding rectangle of the other one to unblit the right pixels when switching between the two. And since ZUN sure loved hardcoded and copy-pasted numbers in the PC-98 days, the coordinates both tend to be excessively large, and excessively wrong. :zunpet: Luckily, each menu item comes with its own correct unblitting rectangle, which avoids any graphical glitches that would otherwise occur.
As for actual observable quirks and bugs, these menus only contain one of each, and both are exclusive to TH04:

And yes, these videos do have a frame rate of 2 FPS.

Now that 100% finalization of their OP.EXE binaries is within reach, all this bloat made me think about the viability of a 📝 single-executable build for TH04's and TH05's debloated and anniversary versions. It would be really nice to have such a build ready before I start working on the non-ASCII translations – not just because they will be based on the anniversary branch by default, but also because it would significantly help their development if there are 4 fewer executables to worry about.
However, it's not as simple for these games as it was for TH01. The unique code in their OP.EXE and MAINE.EXE binaries is much larger than Borland's easily removed C++ exception handler, so I'd have to remove a lot more bloat to keep the resulting single binary at or below the size of the original MAIN.EXE. But I'm sure going to try.


Speaking of code that can be debloated for great effect: The second push of this delivery focused on the first-launch sound setup menu, whose BGM and sound effect submenus are almost complete code duplicates of each other. The debloated branch could easily remove more than half of the code in there, yielding another ≈800 bytes in case we need them.
If hex-editing MIKO.CFG is more convenient for you than deleting that file, you can set its first byte to FF to re-trigger this menu. Decompiling this screen was not only relevant now because it contains text rendered with font ROM glyphs and it would help dig our way towards more important strings in the data segment, but also because of its visual style. I can imagine many potential mods that might want to use the same backgrounds and box graphics for their menus.

TH04's first-launch sound setup menu, showing the BGM mode selectionTH05's first-launch sound setup menu, showing the sound effect mode selection
How about an initial language selection menu in the same style?

With the two submenus being shown in a fixed sequence, there's not a lot of room for the code to do anything wrong, and it's even more identical between the two games than the main menu already was. Thankfully, ZUN just reblits the respective options in the new color when moving the cursor, with no 📝 palette tricks. TH04's background image only uses 7 colors, so he could have easily reserved 3 colors for that. In exchange, the TH05 image gets to use the full 16 colors with no change to the code.


Rounding out this delivery, we also got TH05's rolling Yin-Yang Orb animation before the title screen… and it's just more bloat and landmines on a smaller scale that might be noticeable on slower PC-98 models. In total, there are three unnecessary inter-page copies of the entire VRAM that can easily insert lag frames, and two minor page-switching landmines that can potentially lead to tearing on the first frame of the roll or fade animation. Clearly, ZUN did not have smoothness or code quality in mind there, as evidenced by the fact that this animation simply displays 8 .PI files in sequence. But hey, a short animation like this is 📝 another perfectly appropriate place for a quick-and-dirty solution if you develop with a deadline.
And that's 1.30% of all PC-98 Touhou code finalized in two pushes! We're slowly running out of these big shared pieces of ASM code…

I've been neglecting TH03's OP.EXE quite a bit since it simply doesn't contain any translatable plaintext outside the Music Room. All menu labels are gaiji, and even the character selection menu displays its monochrome character names using the 4-plane sprites from CHNAME.BFT. Splitting off half of its data into a separate .ASM file was more akin to getting out a jackhammer to free up the room in front of the third remaining Music Room, but now we're there, and I can decompile all three of them in a natural way, with all referenced data.
Next up, therefore: Doing just that, securing another important piece of text for the upcoming non-ASCII translations and delivering another big piece of easily finalized code. I'm going to work full-time on ReC98 for almost all of December, and delivering that and the Shuusou Gyoku SC-88Pro recording BGM back-to-back should free up about half of the slightly higher cap for this month.

📝 Posted:
🚚 Summary of:
P0258, P0259, P0260, P0261
Commits:
5876755...e8a0b3e, e8a0b3e...dfaa3c6, dfaa3c6...ed9ee93, ed9ee93...ae2fc28
💰 Funded by:
Blue Bolt, [Anonymous], Yanga, Splashman
🏷 Tags:

And we're back to PC-98 Touhou for a brief interruption of the ongoing Shuusou Gyoku Linux port. Let's clear some of the Touhou-related progress from the backlog, and use the unconstrained nature of these contributions to prepare the 📝 upcoming non-ASCII translations commissioned by Touhou Patch Center. The current budget won't cover all of my ambitions, but it would at least be nice if all text in these games was feasibly translatable by the time I officially start working on that project.

At a little over 3 pushes, it might be surprising to see that this took longer than the 📝 TH03/TH04/TH05 cutscene system. It's obvious that TH02 started out with a different system for in-game dialog, but while TH04 and TH05 look identical on the surface, they only actually share 30% of their dialog code. So this felt more like decompiling 2.4 distinct systems, as opposed to one identical base with tons of game-specific differences on top.

The table of contents was pretty popular last time around, so let's have another one:

  1. Overview of TH04's dialog system
  2. Changes introduced in TH05
  3. Command reference for the TH04 and TH05 systems
  4. Overview of TH02's dialog system
  5. TH02's face portrait images
  6. Bugs during TH02's dialog box slide-in animation
  7. Bugs and quirks in Mima's defeat dialog (might be lore-relevant)
  8. TH03 win messages

Let's start with the ones from TH04 and TH05, since they are not that broken. For TH04, ZUN started out by copy-pasting the cutscene system, causing the result to inherit many of the caveats I already described in the cutscene blog post:

Then, however, he greatly simplified the system. Mainly, this was done by moving text rendering from the PC-98 graphics chip to the text chip, which avoids the need for any text-related unblitting code, but ZUN also added a bunch of smaller changes:

While it would seem that TH05 has no issues with ASCII 0x20 spaces, the text as a whole is still blindly processed two bytes at a time, and any commands can only appear at even byte positions within a line. I dimmed the VRAM pixels to 25% of their original brightness to make the text easier to read.
The same text backported to TH04, additionally demonstrating how that game's dialog system inherited the whitespace skipping behavior of TH03's cutscene system. Just like there, ASCII 0x20 spaces only work at odd byte positions because the game treats them as the trailing byte of a full-width Shift-JIS codepoint. I don't know how large the budget for the upcoming non-ASCII translations will be, but I'm going to fix this even in the very basic fully static variant. I dimmed the VRAM pixels to 25% of their original brightness to make the text easier to read.
Demonstrating the lack of automatic line or box breaks in TH05's dialog systemDemonstrating the lack of automatic line or box breaks in TH04's dialog system, in addition to its lack of support for ASCII 0x20 spaces carried over from TH03's cutscene system

TH05 then moved from TH04's plaintext scripts to the binary .TX2 format while removing all the unused commands copy-pasted from the cutscene system. Except for a single additional command intended to clear a text box, TH05's dialog system only supports a strict subset of the features of TH04's system.
This change also introduced the following differences compared to TH04:

Writing the 0x02 byte to text RAM results in an SX character, which is simply the PC-98 font ROM's glyph for that Shift-JIS codepoint.
Also note how each face change is now preceded by two frames of delay.
No problem in TH04. Note how the dialog also runs a bit faster – TH04 only adds the aforementioned one frame of delay to each face change, and has fewer two-byte chunks of text to display overall.

For modding these files, you probably want to use TXDEF from -Tom-'s MysticTK. It decodes these files into a text representation, and its encoder then takes care of the character-specific byte offsets in the 10-byte header. This text representation simplifies the format a lot by avoiding all corner cases and landmines you'd experience during hex-editing – most notably by interpreting the box-starting 0x0D as a command to show text that takes a string parameter, avoiding the broken calls to script commands in the middle of text. However, you'd still have to manually ensure an even number of bytes on every line of text.

In the entry function of TH05's dialog loop, we also encounter the hack that is responsible for properly handling 📝 ZUN's hidden Extra Stage replay. Since the dialog loop doesn't access the replay inputs but still requires key presses to advance through the boxes, ZUN chose to just skip the dialog altogether in the specific case of the Extra Stage replay being active, and replicated all sprite management commands from the dialog script by just hardcoding them.
And you know what? Not only do I not mind this hack, but I would have preferred it over the actual dialog system! The aforementioned sprite management commands effectively boil down to manual memory management, deallocating all stage enemy and midboss sprites and thus ensuring that the boss sprites end up at specific master.lib sprite IDs (patnums). The hardcoded boss rendering function then expects these sprites to be available at these exact IDs… which means that the otherwise hardcoded bosses can't render properly without the dialog script running before them. :zunpet:
There is absolutely no excuse for the game to burden dialog scripts with this functionality. Sure, delayed deallocation would allow them to blit stage-specific sprites, but the original games don't do that; probably because none of the two games feature an unblitting command. And even if they did, it would have still been cleaner to expose the boss-specific sprite setup as a single script command that can then also be called from game code if the script didn't do so. Commands like these just are a recipe for crashes, especially with parsers that expect fullwidth Shift-JIS text and where misaligned ASCII text can easily cause these commands to be skipped.

But then again, it does make for funny screenshot material if you accidentally the deallocation and then see bosses being turned into stage enemies:

TH04's dialog before the Stage 4 Marisa fight without deallocating the stage sprites inside the script, causing Marisa to be turned into one of the stage enemiesTH04's dialog before the Stage 6 Yuuka fight without deallocating the stage sprites inside the script, causing Yuuka to be turned into two different cels of the same stage enemyTH05's dialog before the Louise fight without deallocating the stage sprites inside the script, causing Louise to be turned into one of the ice enemies from TH05's Stage 2TH05's dialog before the Louise fight without deallocating the stage sprites inside the script, causing Mai and Yuki to be turned into a windmill and fairy/demon enemy, respectively
Some of the more amusing consequences of not calling the sprite-deallocating :th04: \c /  :th05: 0x04 command inside a dialog script.
In the case of 4️⃣, the game then even crashes on this frame at the end of the dialog, in a way that resembles the infamous 📝 TH04 crash before Stage 5 Yuuka if no EMS driver is loaded. Both the stage- and boss-specific BFNT sprites are loaded into memory at this point, leaving no room for the 256×256-pixel background image on the size-limited master.lib heap.

With all the general details out of the way, here's the command reference:

:th04: :th05:
0
1
0x00
0x01
Selects either the player character (0) or the boss (1) as the currently speaking character, and moves the cursor to the beginning of the text box. In TH04, this command also directly starts the new dialog box, which is probably why it's not prefixed with a \ as it only makes sense outside of text. TH05 requires a separate 0x0D command to do the same.
\=1 0x02 0x!! Replaces the face portrait of the currently active speaking character with image #1 within her .CD2 file.
\=255 0x02 0xFF Removes the face portrait from the currently active text box.
\l,filename 0x03 filename 0x00 Calls master.lib's super_entry_bfnt() function, which loads sprites from a BFNT file to consecutive IDs starting at the current patnum write cursor.
\c 0x04 Deallocates all stage-specific BFNT sprites (i.e., stage enemies and midbosses), freeing up conventional RAM for the boss sprites and ensuring that master.lib's patnum write cursor ends up at :th04: 128 / :th05: 180.
In TH05's Extra Stage, this command also replaces 📝 the sprites loaded from MIKO16.BFT with the ones from ST06_16.BFT.
\d Deallocates all face portrait images.
The game automatically does this at the end of each dialog sequence. However, ZUN wanted to load Stage 6 Yuuka's 76 KiB of additional animations inside the script via \l, and would have once again run up against the master.lib heap size limit without that extra free memory.
\m,filename 0x05 filename 0x00 Stops the currently playing BGM, loads a new one from the given file, and starts playback.
\m$ 0x05 $ 0x00 Stops the currently playing BGM.
Note that TH05 interprets $ as a null-terminated filename as well.
\m* Restarts playback of the currently loaded BGM from the beginning.
\b0,0,0 0x06 0x!!!! 0x!!!! 0x!! Blits the master.lib patnum with the ID indicated by the third parameter to the current VRAM page at the top-left screen position indicated by the first two parameters.
\e0 Plays the sound effect with the given ID.
\t100 Sets palette brightness via master.lib's palette_settone() to any value from 0 (fully black) to 200 (fully white). 100 corresponds to the palette's original colors.
\fo1
\fi1
Calls master.lib's palette_black_out() or palette_black_in() to play a hardware palette fade animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
\wo1
\wi1
0x09 0x!!
0x0A 0x!!
Calls master.lib's palette_white_out() or palette_white_in() to play a hardware palette fade animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
The TH05 version of 0x09 also clears the text in both boxes before the animation.
\n 0x0B Starts a new line by resetting the X coordinate of the TRAM cursor to the left edge of the text area and incrementing the Y coordinate.
The new line will always be the next one below the last one that was properly started, regardless of whether the text previously wrapped to the next TRAM row at the edge of the screen.
\g8 Plays a blocking 8-frame screen shake animation. Copy-pasted from the cutscene parser, but actually used right at the end of the dialog shown before TH04's Bad Ending.
\ga0 0x0C 0x!! Shows the gaiji with the given ID from 0 to 255 at the current cursor position, ignoring the per-glyph delay.
\k0 Waits 0 frames (0 = forever) for any key to be pressed before continuing script execution.
0x0D Starts a new dialog box with the previously selected speaker. All text until the next 0xFF command will appear on screen.
Inside dialogs, this is a no-op.
0x0E Takes the current dialog cursor as the top-left corner of a 240×48-pixel rectangle, and replaces all text RAM characters within that rectangle with whitespace.
This is only used to clear the player character's text box before Shinki's final いくよ‼ box. Shinki has two consecutive text boxes in all 4 scripts here, and ZUN probably wanted to clear the otherwise blue text to imply a dramatic pause before Shinki's final sentence. Nice touch.
(You could, however, also use it after a box-ending 0xFF command to mess with text RAM in general.)
\# Quits the currently running loop. This returns from either the text loop to the command loop, or it ends the dialog sequence by returning from the command loop back to gameplay. If this stage of the game later starts another dialog sequence, it will start at the next script byte.
\$ Like \#, but first waits for any key to be pressed.
0xFF Behaves like TH04's \$ in the text loop, and like \# in the command loop. Hence, it's not possible in TH05 to automatically end a text box and advance to the next one without waiting for a key press.
Unused commands are in gray.

At the end of the day, you might criticize the system for how its landmines make it annoying to mod in ASCII text, but it all works and does what it's supposed to. ZUN could have written the cleanest single and central Shift-JIS iterator that properly chunks a byte buffer into halfwidth and fullwidth codepoints, and I'd still be throwing it out for the upcoming non-ASCII translations in favor of something that either also supports UTF-8 or performs dictionary lookups with a full box of text.
The only actual bug can be found in the input detection, which once again doesn't correctly handle the infamous key up/key down scancode quirk of PC-98 keyboards. All it takes is one wrongly placed input polling call, and suddenly you have to think about how the update cycle behind the PC-98 keyboard state bytes might cause the game to run the regular 2-frame delay for a single 2-byte chunk of text before it shows the full text of a box after all… But even this bug is highly theoretical and could probably only be observed very, very rarely, and exclusively on real hardware.


The same can't be said about TH02 though, but more on that later. Let's first take a look at its data, which started out much simpler in that game. The STAGE?.TXT files contain just raw Shift-JIS text with no trace of commands or structure. Turning on the whitespace display feature in your editor reveals how the dialog system even assumes a fixed byte length for each box: 36 bytes per line which will appear on screen, followed by 4 bytes of padding, which the original files conveniently use to visually split the lines via a CR/LF newline sequence. Make sure to disable trimming of trailing whitespace in your editor to not ruin the file when modding the text… :onricdennat:

靈夢:あんた、まだ名前も聞いてないの··
······に覚えられないわよ。・・・・・··
里香:あたいは、里香よ。覚えときなさ··
・・・い。・・・・・・················
Two boxes from TH02's STAGE5.TXT with visualized whitespace. These also demonstrate how the CR/LF newlines only make up 2 of the 4 padding bytes, and require each line to be padded with two more bytes; you could not use these trailing spaces for actual text. Also note how the exquisite mixture of fullwidth and halfwidth spaces demands the text to be viewed with only the most metrically consistent monospace fonts to preserve the intended alignment. 🍷 It appears quite misaligned on my phone.

Consequently, everything else is hardcoded – every effect shown between text boxes, the face portrait shown for each box, and even how many boxes are part of each dialog sequence. Which means that the source code now contains a long hardcoded list of face IDs for most of the text boxes in the game, with the rest being part of the dedicated hardcoded dialog scripts for 2/3 of the game's stages.
Without the restriction to a fixed set of scripting commands, TH02 naturally gravitated to having the most varied dialog sequences of all PC-98 Touhou games. This flexibility certainly facilitated Mima's grand entrance animation in Stage 4, or the different lines in Stage 4 and 5 depending on whether you already used a continue or not. Marisa's post-boss dialog even inserts the number of continues into the text itself – by, you guessed it, writing to hardcoded byte offsets inside the dialog text before printing it to the screen. :godzun: But once again, I have nothing to criticize here – not even the fact that the alternate dialog scripts have to mutate the "box cursor" to jump to the intended boxes within the file. I know that some people in my audience like VMs, but I would have considered it more bloated if ZUN had implemented a full-blown scripting language just to handle all these special cases.


Another unique aspect of TH02 is the way it stores its face portraits, which are infamous for how hard they are to find in the original data files. These sprites are actually map tiles, stored in MIKO_K.MPN, and drawn using the same functions used to blit the regular map tiles to the 📝 tile source area in VRAM. We can only guess why ZUN chose this one out of the three graphics formats he used in TH02:

TH02's MIKO_K.PTN, arranged into a 16×16-tile layout that reveals how these tiles are combined into face portraits.
MPNDEF from -Tom-'s MysticTK conveniently uses this exact layout in its .BMP output. Earlier MPNDEF versions crashed when converting this file as its 256 tiles led to an 8-bit overflow bug, so make sure you've updated to the current version from the end of October 2023 if you want to convert this file yourself. The format stores the 4 bitplanes of each 16×16 tile in order, so good luck finding a different planar image viewer that would support both such a tiled layout and a custom palette. Sometimes, a weird internal format is the best type of obfuscation. :tannedcirno:
TH02's MIKO_K.PTN with the 16×16 tile grid overlaid

And since you're certainly wondering about all these black tiles at the edges: Yes, these are not only part of the file and pad it from the required 240×192 pixels to 256×256, but also kept in memory during a stage, wasting 9.5 KiB of conventional RAM. That's 172 seconds of potential input replay data, just for those people who might still think that we need EMS for replays.


Alright, we've got the text, we've got the faces, let's slide in the box and display it all on screen. Apparently though, we also have to blit the player and option sprites using raw, low-level master.lib function calls in the process? :thonk: This can't be right, especially because ZUN always blits the option sprite associated with the Reimu-A shot type, regardless of which one the player actually selected. And if you keep moving above the box area before the dialog starts, you get to see exactly how wrong this is:

Let's look closer at Reimu's sprite during the slide-in animation, and in the two frames before:

Zoomed-in area around Reimu's sprite from frame 35 of the video aboveZoomed-in area around Reimu's sprite from frame 36 of the video aboveZoomed-in area around Reimu's sprite from frame 37 of the video above

This one image shows off no less than 4 bugs:

  1. ZUN blits the stationary player sprite here, regardless of whether the player was previously moving left or right. This is a nice way of indicating that Reimu stops moving once the dialog starts, but maybe ZUN should have unblitted the old sprite so that the new one wouldn't have appeared on top. The game only unblits the 384×64 pixels covered by the dialog box on every frame of the slide-in animation, so Reimu would only appear correctly if her sprite happened to be entirely located within that area.
  2. All sprites are shifted up by 1 pixel in frame 2️⃣. This one is not a bug in the dialog system, but in the main game loop. The game runs the relevant actions in the following order:

    1. Invalidate any map tiles covered by entities
    2. Redraw invalidated tiles
    3. Decrement the Y coordinate at the top of VRAM according to the scroll speed
    4. Update and render all game entities
    5. Scroll in new tiles as necessary according to the scroll speed, and report whether the game has scrolled one pixel past the end of the map
    6. If that happened, pretend it didn't by incrementing the value calculated in #3 for all further frames and skipping to #8.
    7. Issue a GDC SCROLL command to reflect the line calculated in #3 on the display
    8. Wait for VSync
    9. Flip VRAM pages
    10. Start boss if we're past the end of the map

    The problem here: Once the dialog starts, the game has already rendered an entire new frame, with all sprites being offset by a new Y scroll offset, without adjusting the graphics GDC's scroll registers to compensate. Hence, the Y position in 3️⃣ is the correct one, and the whole existence of frame 2️⃣ is a bug in itself. (Well… OK, probably a quirk because speedrunning exists, and it would be pretty annoying to synchronize any video regression tests of the future TH02 Anniversary Edition if it renders one fewer frame in the middle of a stage.)

  3. ZUN blits the option sprites to their position from frame 1️⃣. This brings us back to 📝 TH02's special way of retaining the previous and current position in a two-element array, indexed with a VRAM page ID. Normally, this would be equivalent to using dedicated prev and cur structure fields and you'd just index it with the back page for every rendering call. But if you then decide to go single-buffered for dialogs and render them onto the front page instead… :zunpet:
    Note that fixing bug #2 would not cancel out this one – the sprites would then simply be rendered to their position in the frame before 1️⃣.

  4. And of course, the fixed option sprite ID also counts as a bug.

As for the boxes themselves, it's yet another loop that prints 2-byte chunks of Shift-JIS text at an even slower fixed interval of 3 frames. In an interesting quirk though, ZUN assumes that every box starts with the name of the speaking character in its first two fullwidth Shift-JIS characters, followed by a fullwidth colon. These 6 bytes are displayed immediately at the start of every box, without the usual delay. The resulting alignment looks rather janky with Genjii, whose single right-padded kanji looks quite awkward with the fullwidth space between the name and the colon. Kind of makes you wonder why ZUN just didn't spell out his proper name, 玄爺, instead, but I get the stylistic difference.
In Stage 4, the two-kanji assumption then breaks with Marisa's three-kanji name, which causes the full-width colon to be printed as the first delayed character in each of her boxes:


That's all the issues and quirks in the system itself. The scripts themselves don't leave much room for bugs as they basically just loop over the hardcoded face ID array at this level… until we reach the end of the game. Previously, the slide-in animation could simply use the tile invalidation and re-rendering system to unblit the box on each frame, which also explained why Reimu had to be separately rendered on top. But this no longer works with a custom-rendered boss background, and so the game just chooses to flood-fill the area with graphics chip color #0:

Then again, transferring pixels from the back page would be just as wrong as they lag one frame behind. No way around capturing these 384×64 pixels to main memory here… Oh well, this flood-fill at least adds even more legibility on top of the already half-transparent text box. A property that the following dialog sequence unfortunately lacks…

For Mima's final defeat dialog though, ZUN chose to not even show the box. He might have realized the issue by that point, or simply preferred the more dramatic effect this had on the lines. The resulting issues, however, might even have ramifications for such un-technical things as lore and character dynamics. :zunpet: As it turns out, the code for this dialog sequence does in fact render Mima's smiling face for all boxes?! You only don't see it in the original game because it's rendered to the other VRAM page that remains invisible during the dialog sequence:

Caution, flashing lights.

Here's how I interpret the situation:

So, the future TH02 Anniversary Edition will fix the bug by showing the back page, but retain the quirk by rewriting the dialog code to not blit the face.


And with that, we've secured all in-game dialog for the upcoming non-ASCII translations! The remaining 2/3 of the last push made for a good occasion to also decompile the small amount of code related to TH03's win messages, stored in the @0?TX.TXT files. Similar to TH02's dialog format, these files are also split into fixed-size blocks of 3×60 bytes. But this time, TH03 loads all 60 bytes of a line, including the CR/LF line breaking codepoints in the original files, into the statically allocated buffer that it renders from. These control characters are then only filtered to whitespace by ZUN's graph_putsa_fx() function. If you remove the line breaks, you get to use the full 60 bytes on every line.
The final commits went to the MIKO.CFG loading and saving functions used in TH04's and TH05's OP.EXE, as well as TH04's game startup code to finally catch up with 📝 TH05's counterpart from over 3 years ago. This brought us right in front of the main menu rendering code in both TH04 and TH05, which is identical in both games and will be tackled in the next PC-98 Touhou delivery.

Next up, though: Returning to Shuusou Gyoku, and adding support for SC-88Pro recordings as BGM. Which may or may not come with a slight controversy…

📝 Posted:
🚚 Summary of:
P0252, P0253, P0254, P0255, P0256, P0257
Commits:
(Seihou) P0251...e98feef, (Seihou) e98feef...24df71c, (Seihou) 24df71c...b7b863b, (Seihou) b7b863b...2b8218e, (Seihou) 2b8218e...P0256, (Website) b79c667...e2ba49b
💰 Funded by:
Arandui, Ember2528, [Anonymous]
🏷 Tags:

And now we're taking this small indie game from the year 2000 and porting its game window, input, and sound to the industry-standard cross-platform API with "simple" in its name.

Why did this have to be so complicated?! I expected this to take maybe 1-2 weeks and result in an equally short blog post. Instead, it raised so many questions that I ended up with the longest blog post so far, by quite a wide margin. These pushes ended up covering so many aspects that could be interesting to a general and non-Seihou-adjacent audience, so I think we need a table of contents for this one:

  1. Evaluating Zig
  2. Visual Studio doesn't implement concepts correctly?
  3. Reusable building blocks for Tup
  4. Compiling SDL 2
  5. The new frame rate limiter
  6. Audio via SDL or SDL_mixer? (Nope, neither)
  7. miniaudio
  8. Resampling defective sound effects (including FLAC not always being lossless)
  9. Joypad input with SDL
  10. Restoring the original screenshot feature
  11. Integer math in hand-written ASM

Before we can start migrating to SDL, we of course have to integrate it into the build somehow. On Linux, we'd ideally like to just dynamically link to a distribution's SDL development package, but since there's no such thing on Windows, we'd like to compile SDL from source there. This allows us to reuse our debug and release flags and ensures that we get debug information, without needing to clone build scripts for every C++ library ever in the process or something.
So let's get my Tup build scripts ready for compiling vendored libraries… or maybe not? Recently, I've kept hearing about a hot new technology that not only provides the rare kind of jank-free cross-compiling build system for C/C++ code, but innovates by even bundling a C++ compiler into a single 279 MiB package with no further dependencies. Realistically replacing both Visual Studio and Tup with a single tool that could target every OS is quite a selling point. The upcoming Linux port makes for the perfect occasion to evaluate Zig, and to find out whether Tup is still my favorite build system in 2023.

Even apart from its main selling point, there's a lot to like about Zig:

However, as a version number of 0.11.0 might already suggest, the whole experience was then bogged down by quite a lot of issues:

So for the time being, I still prefer Tup. But give it maybe two or three years, and I'm sure that Zig will eventually become the best tool for resurrecting legacy C++ codebases. That is, if the proposed divorce of the core Zig compiler from LLVM isn't an indication that the productive parts of the Zig community consider the C/C++ building features to be "good enough", and are about to de-emphasize them to focus more strongly on the actual Zig language. Gaining adoption for your new systems language by bundling it with a C/C++ build system is such a great and unique strategy, and it almost worked in my case. And who knows, maybe Zig will already be good enough by the time I get to port PC-98 Touhou to modern systems.

(If you came from the Zig wiki, you can stop reading here.)


A few remnants of the Zig experiment still remain in the final delivery. If that experiment worked out, I would have had to immediately change the execution encoding to UTF-8, and decompile a few ASM functions exclusive to the 8-bit rendering mode which we could have otherwise ignored. While Clang does support inline assembly with Intel syntax via -fms-extensions, it has trouble with ; comments and instructions like REP STOSD, and if I have to touch that code anyway… (The REP STOSD function translated into a single call to memcpy(), by the way.)

Another smaller issue was Visual Studio's lack of standard library header hygiene, where #including some of the high-level STL features also includes more foundational headers that Clang requires to be included separately, but I've already known about that. Instead, the biggest shocker was that Visual Studio accepts invalid syntax for a language feature as recent as C++20 concepts:

// Defines the interface of a text rendering session class. To simplify this
// example, it only has a single `Print(const char* str)` method.
template <class T> concept Session = requires(T t, const char* str) {
	t.Print(str);
};

// Once the rendering backend has started a new session, it passes the session
// object as a parameter to a user-defined function, which can then freely call
// any of the functions defined in the `Session` concept to render some text.
template <class F, class S> concept UserFunctionForSession = (
	Session<S> && requires(F f, S& s) {
		{ f(s) };
	}
);

// The rendering backend defines a `Prerender()` method that takes the
// aforementioned user-defined function object. Unfortunately, C++ concepts
// don't work like this: The standard doesn't allow `auto` in the parameter
// list of a `requires` expression because it defines another implicit
// template parameter. Nevertheless, Visual Studio compiles this code without
// errors.
template <class T, class S> concept BackendAttempt = requires(
	T t, UserFunctionForSession<S> auto func
) {
	t.Prerender(func);
};

// A syntactically correct definition would use a different constraint term for
// the type of the user-defined function. But this effectively makes the
// resulting concept unusable for actual validation because you are forced to
// specify a type for `F`.
template <class T, class S, class F> concept SyntacticallyFixedBackend = (
	UserFunctionForSession<F, S> && requires(T t, F func) {
		t.Prerender(func);
	}
);

// The solution: Defining a dummy structure that behaves like a lambda as an
// "archetype" for the user-defined function.
struct UserFunctionArchetype {
	void operator ()(Session auto& s) {
	}
};

// Now, the session type disappears from the template parameter list, which
// even allows the concrete session type to be private.
template <class T> concept CorrectBackend = requires(
	T t, UserFunctionArchetype func
) {
	t.Prerender(func);
};
Here's a Godbolt link, configured with both Visual Studio and Clang compilers.

What's this, Visual Studio's infamous delayed template parsing applied to concepts, because they're templates as well? Didn't they get rid of that 6 years ago? You would think that we've moved beyond the age where compilers differed in their interpretation of the core language, and that opting into a current C++ standard turns off any remaining antiquated behaviors…


So let's actually get my Tup build scripts ready for compiling vendored libraries, because the 📝 previous 70 lines of Lua definitely weren't. For this use case, we'd like to have some notion of distinct build targets that can have a unique set of compilation and linking flags. We'd also like to always build them in debug and release versions even if you only intend to build your actual program in one of those versions – with the previous system of specifying a single version for all code, Tup would delete the other one, which forces a time-consuming and ultimately needless rebuild once you switch to the other version.

The solution I came up with treats the set of compiler command-line options like a tree whose branches can concatenate new options and/or filter the versions that are built on this branch. In total, this is my 4th attempt at writing a compiler abstraction layer for Tup. Since we're effectively forced to write such layers in Lua, it will always be a bit janky, but I think I've finally arrived at a solid underlying design that might also be interesting for others. Hence, I've split off the result into its own separate repository and added high-level documentation and a documented example. And yes, that's a Code Nutrition label! I've wanted to add one of these ever since I first heard about the idea, since it communicates nicely how seriously such an open-source project should be taken. Which, in this case, is actually not all too seriously, especially since development of the core Tup project has all but stagnated. If Zig does indeed get better and better at being a Clang frontend/build system, the only niches left for Tup will be Visual Studio-exclusive projects, or retrocoding with nonstandard toolchains (i.e., ReC98). Quite ironic, given Tup's Unix heritage…
Oh, and maybe general Makefile-like tasks where you just want to run specific programs. Maybe once the general hype swings back around and people start demanding proper graph-based dependency tracking instead of just a command runner


Alright, alternatives evaluated, build system ready, time to include SDL! Once again, I went for Git submodules, but this time they're held together by a batch file that ensures that the intended versions are checked out before starting Tup. Git submodules have a bad rap mainly because of their usability issues, and such a script should hopefully work around them? Let's see how this plays out. If it ends up causing issues after all, I'll just switch to a Zig-like model of downloading and unzipping a source archive. Since Windows comes with curl and tar these days, this can even work without any further dependencies, and will also remove all the test code bloat.

Compiling SDL from a non-standard build system requires a bit of globbing to include all the code that is being referenced, as well as a few linker settings, but it's ultimately not much of a big deal. I'm quite happy that it was possible at all without pre-configuring a build, but hey, that's what maintaining a Visual Studio project file does to a project. :tannedcirno:
By building SDL with the stock Windows configuration, we then end up with exactly what the SDL developers want us to use… which is a DLL. You can statically link SDL, but they really don't want you to do that. So strongly, in fact, that they not merely argue how well the textbook advantages of dynamic linking have worked for them and gamers as a whole, but implemented a whole dynamic API system that enforces overridable dynamic function loading even in static builds. Nudging developers to their preferred solution by removing most advantages from static linking by default… that's certainly a strategy. It definitely fits with SDL's grassroots marketing, which is very good at painting SDL as the industry standard and the only reliable way to keep your game running on all originally supported operating systems. Well, at least until SDL 3 is so stable that SDL 2 gets deprecated and won't receive any code for new backends…

However, dynamic linking does make sense if you consider what SDL is. Offering all those multiple rendering, input, and sound backends is what sets it apart from its more hip competition, and you want to have all of them available at any time so that SDL can dynamically select them based on what works best on a system. As a result, everything in SDL is being referenced somewhere, so there's no dead code for the linker to eliminate. Linking SDL statically with link-time code generation just prolongs your link time for no benefit, even without the dynamic API thwarting any chance of SDL calls getting inlined.
There's one thing I still don't like about all this, though. The dynamic API's table references force you to include all of SDL's subsystems in the DLL even if your game doesn't need some of them. But it does fit with their intention of having SDL2.dll be swappable: If an older game stopped working because of an outdated SDL2.dll, it should be possible for anyone to get that game working again by replacing that DLL with any newer version that was bundled with any random newer game. And since that would fail if the newer SDL2.dll was size-optimized to not include some of the subsystems that the older game required, they simply removed (or de-prioritized) the possibility altogether. Maybe that was their train of thought? You can always just use the official Windows DLL, whose whole point is to include everything, after all. 🤷

So, what do we get in these 1.5 MiB? There are:

Unfortunately, SDL 2 also statically references some newer Windows API functions and therefore doesn't run on Windows 98. Since this build of Shuusou Gyoku doesn't introduce any new features to the input or sound interfaces, we can still use pbg's original DirectSound and DirectInput code for the i586 build to keep it working with the rest of the platform-independent game logic code, but it will start to lag behind in features as soon as we add support for SC-88Pro BGM or more sophisticated input remapping. If we do want to keep this build at the same feature level as the SDL one, we now have a choice: Do we write new DirectInput and DirectSound code and get it done quickly but only for Shuusou Gyoku, or do we port SDL 2 to Windows 98 and benefit all other SDL 2 games as well? I leave that for my backers to decide.


Immediately after writing the first bits of actual SDL code to initialize the library and create the game window, you notice that SDL makes it very simple to gradually migrate a game. After creating the game window, you can call SDL_GetWindowWMInfo() to retrieve HWND and HINSTANCE handles that allow you to continue using your original DirectDraw, DirectSound, and DirectInput code and focus on porting one subsystem at a time.
Sadly, D3DWindower can no longer turn SDL's fullscreen mode into a windowed one, but DxWnd still works, albeit behaving a bit janky and insisting on minimizing the game whenever its window loses focus. But in exchange, the game window can surprisingly be moved now! Turns out that the originally fixed window position had nothing to do with the way the game created its DirectDraw context, and everything to do with pbg blocking the Win32 "syscommand" that allows a window to be moved. By deleting a system menu… seriously?! Now I'm dying to hear the Raymond Chen explanation for how this behavior dates back to an unfortunate decision during the Win16 days or something.
As implied by that commit, I immediately backported window movability to the i586 build.

However, the most important part of Shuusou Gyoku's main loop is its frame rate limiter, whose Win32 version leaves a bit of room for improvement. Outside of the uncapped [おまけ] DrawMode, the original main loop continuously checks whether at least 16 milliseconds have elapsed since the last simulated (but not necessarily rendered) frame. And by that I mean continuously, and deliberately without using any of the Windows system facilities to sleep the process in the meantime, as evidenced by a commented-out Sleep(1) call. This has two important effects on the game:

Unsurprisingly, SDL features a delay function that properly sleeps the process for a given number of milliseconds. But just specifying 16 here is not exactly what we want:

  1. Sure, modern computers are fast, but a frame won't ever take an infinitely fast 0 milliseconds to render. So we still need to take the current frame time into account.
  2. SDL_Delay()'s documentation says that the wake-up could be further delayed due to OS scheduling.

To address both of these issues, I went with a base delay time of 15 ms minus the time spent on the current frame, followed by busy-waiting for the last millisecond to make sure that the next frame starts on the exact frame boundary. And lo and behold: Even though this still technically wastes up to 1 ms of CPU time, it still dropped CPU usage into the 0%-2% range during gameplay on my Intel Core i5-8400T CPU, which is over 5 years old at this point. Your laptop battery will appreciate this new build quite a bit.


Time to look at audio then, because it sure looks less complicated than input, doesn't it? Loading sounds from .WAV file buffers, playing a fixed number of instances of every sound at a given position within the stereo field and with optional looping… and that's everything already. The DirectSound implementation is so straightforward that the most complex part of its code is the .WAV file parser.
Well, the big problem with audio is actually finding a cross-platform backend that implements these features in a way that seamlessly works with Shuusou Gyoku's original files. DirectSound really is the perfect sound API for this game:

The last point can't really be an argument against anything, but we'd still be left with 7 other boxes that a cross-platform alternative would have to tick. We already picked SDL for our portability needs, so how does its audio subsystem stack up? Unfortunately, not great:

OK, sure, but you're not supposed to use it for anything more than a single stream of audio. SDL_mixer exists precisely to cover such non-trivial use cases, and it even supports sound effect looping and panning with just a single function call! But as far as the rest of the library is concerned, it manages to be an even bigger disappointment than raw SDL audio:

There is a fork that does add support for an arbitrary number of music streams, but the rest of its features leave me questioning the priorities and focus of this project. Because surely, when I think about missing features in an audio backend, I immediately think about support for a vast array of chiptune file formats… 🤪
And wait, what, they merged this piece of bloat back into the official SDL_mixer library?! Thanks for opening up a vast attack surface for potential security vulnerabilities in code that would never run for the majority of users, just to cover some niche formats that nobody would seriously expect in a general audio library. And that's coming from someone who loves listening to that stuff!
At this rate, I'm expecting SDL_mixer to gain a mail client by the end of the decade. Hmm, what's the closest audio thing to a mail client… oh, right, WebRTC! Yeah, let's just casually drop a giant part of the Chromium codebase into SDL_mixer, what could possibly go wrong?

This dire situation made me wonder if SDL was the wrong choice for Shuusou Gyoku to begin with. Looking at other low-level cross-platform game libraries, you'll quickly notice that all of them come with mostly equally capable 2D renderers these days, and mainly differentiate themselves in minute API details that you'd only notice upon a really close look.
raylib is another one of those libraries and has been getting exceptionally popular in recent years, to the point of even having more than twice as many GitHub stars as SDL. By restricting itself to OpenGL, it can even offer an abstraction for shaders, which we'd really like for the 西方Project lens ball effect.
In the case of raylib's audio system, the lack of sound effect looping is the minute API detail that would make it annoying to use for Shuusou Gyoku. But it might be worth a look at how raylib implements all this if it doesn't use SDL… which turned out to be the best look I've taken in a long time, because raylib builds on top of miniaudio which is exactly the kind of audio library I was hoping to find. Let's check the list from above:

Oh, and it's written by the same developer who also wrote the best FLAC library back in 2018. And that's despite them being single-file C libraries, which I consider to be massively overrated…

The drawback? Similar to Zig, it's only on version 0.11.18, and also focuses on good high-level documentation at the expense of an API reference. Unlike Zig though, the three issues I ran into turned out to be actual and fixable bugs: Two minor ones related to looping of streamed sounds shorter than 2 seconds which won't ever actually affect us before we get into BGM modding, and a critical one that added high-frequency corruption to any mono sound effect during its expansion to stereo. The latter took days to track down – with symptoms like these, you'd immediately suspect the bug to lie in the resampler or its low-pass filter, both of which are so much more of a fickle and configurable part of the conversion chain here. Compared to that, stereo expansion is so conceptually simple that you wouldn't imagine anyone getting it wrong.
While the latter PR has been merged, the fix is still only part of the dev branch and hasn't been properly released yet. Fortunately, raylib is not affected by this bug: It does currently ship version 0.11.16 of miniaudio, but its usage of the library predates miniaudio's high-level API and it therefore uses a different, non-SSE-optimized code path for its format conversions.

The only slightly tricky part of implementing a miniaudio backend for Shuusou Gyoku lies in setting up multiple simultaneously playing instances for each individual sound. The documentation and answers on the issue tracker heavily push you toward miniaudio's resource manager and its file abstractions to handle this use case. We surely could turn Shuusou Gyoku's numeric sound effect IDs into fake file names, but it doesn't really fit the existing architecture where the sound interface just receives in-memory .WAV file buffers loaded from the SOUND.DAT packfile.
In that case, this seems to be the best way:


As a side effect of hunting that one critical bug in miniaudio, I've now learned a fair bit about audio resampling in general. You'll probably need some knowledge about basic digital signal behavior to follow this section, and that video is still probably the best introduction to the topic.

So, how could this ever be an issue? The only time I ever consciously thought about resampling used to be in the context of the Opus codec and its enforced sampling rate of 48,000 Hz, and how Opus advocates claim that resampling is a solved problem and nothing to worry about, especially in the context of a lossy codec. Still, I didn't add Opus to thcrap's BGM modding feature entirely because the mere thought of having to downsample to 44,100 Hz in the decoder was off-putting enough. But even if my worries were unfounded in that specific case: Recording the Stereo Mix of Shuusou Gyoku's now two audio backends revealed that apparently not every audio processing chain features an Opus-quality resampler…

If we take a look at the material that resamplers actually have to work with here, it quickly becomes obvious why their results are so varied. As mentioned above, Shuusou Gyoku's sound effects use rather low sampling rates that are pretty far away from the 48,000 Hz your audio device is most definitely outputting. Therefore, any potential imaging noise across the extended high-frequency range – i.e., from the original Nyquist frequencies of 11,025 Hz/5,512.5 Hz up to the new limit of 24,000 Hz – is still within the audible range of most humans and can clearly color the resulting sound.
But it gets worse if the audio data you put into the resampler is objectively defective to begin with, which is exactly the problem we're facing with over half of Shuusou Gyoku's sound effects. Encoding them all as 8-bit PCM is definitely excusable because it was the turn of the millennium and the resulting noise floor is masked by the BGM anyway, but the blatant clipping and DC offsets definitely aren't:

KEBARI TAME LASER LASER2 BOMB SELECT HIT CANCEL WARNING SBLASER BUZZ MISSILE JOINT DEAD SBBOMB BOSSBOMB ENEMYSHOT HLASER TAMEFAST WARP
<code>SOUND.DAT</code>, file 1/20<code>SOUND.DAT</code>, file 2/20<code>SOUND.DAT</code>, file 3/20<code>SOUND.DAT</code>, file 4/20<code>SOUND.DAT</code>, file 5/20<code>SOUND.DAT</code>, file 6/20<code>SOUND.DAT</code>, file 7/20<code>SOUND.DAT</code>, file 8/20<code>SOUND.DAT</code>, file 9/20<code>SOUND.DAT</code>, file 10/20<code>SOUND.DAT</code>, file 11/20<code>SOUND.DAT</code>, file 12/20<code>SOUND.DAT</code>, file 13/20<code>SOUND.DAT</code>, file 14/20<code>SOUND.DAT</code>, file 15/20<code>SOUND.DAT</code>, file 16/20<code>SOUND.DAT</code>, file 17/20<code>SOUND.DAT</code>, file 18/20<code>SOUND.DAT</code>, file 19/20<code>SOUND.DAT</code>, file 20/20
Waveforms for all 20 of Shuusou Gyoku's sound effects, in the order they appear inside SOUND.DAT and with their internal names. We can see quite an abundance of clipping, as well as a significant DC offset in WARNING, BUZZ, JOINT, SBBOMB, and BOSSBOMB.

Wait a moment, true peaks? Where do those come from? And, equally importantly, how can we even observe, measure, and store anything above the maximum amplitude of a digital signal?

The answer to the first question can be directly derived from the Xiph.org video I linked above: Digital signals are lollipop graphs, not stairsteps as commonly depicted in audio editing software. Converting them back to an analog signal involves constructing a continuous curve that passes through each sample point, and whose frequency components stay below the Nyquist frequency. And if the amplitude of that reconstructed wave changes too strongly and too rapidly, the resulting curve can easily overshoot the maximum digital amplitude of 0 dBFS even if none of the defined samples are above that limit.

But I can assure you that I did not create the waveform images above by recording the analog output of some speakers or headphones and then matching the levels to the original files, so how did I end up with that image? It's not an Audacity feature either because the development team argues that there is no "true waveform" to be visualized as every DAC behaves differently. While this is correct in theory, we'd be happy just to get a rough approximation here.
ffmpeg's ebur128 filter has a parameter to measure the true peak of a waveform and fairly understandable source code, and once I looked at it, all the pieces suddenly started to make sense: For our purpose of only looking at digital signals, 💡 resampling to a floating-point signal with an infinite sampling rate is equivalent to a DAC. And that's exactly what this filter does: It picks 192,000 Hz and 64-bit float as a format that's close enough to the ideal of "analog infinity" for all practical purposes that involve digital audio, and then simply converts each incoming 100 ms of audio and keeps the sample with the largest floating-point value.

So let's store the resampled output as a FLAC file and load it into Audacity to visualize the clipped peaks… only to find all of them replaced with the typical kind of clipping distortion? 😕 Turns out that I've stumbled over the one case where the FLAC format isn't lossless and there's actually no alternative to .WAV: FLAC just doesn't support floating-point samples and simply truncates them to discrete integers during encoding. When we measured inter-sample peaks above, we weren't only resampling to a floating-point format to avoid any quantization to discrete integer values, but also to make it possible to store amplitudes beyond the 0 dBFS point of ±1.0 in the first place. Once we lose that ability, these amplitudes are clipped to the maximum value of the integer bit depth, and baked into the waveform with no way to get rid of them again. After all, the resampled file now uses a higher sampling rate, and the clipping distortion is now a defined part of what the sound is.
Finally, storing a digital signal with inter-sample peaks in a floating-point format also makes it possible for you to reduce the volume, which moves these peaks back into the regular, unclipped amplitude range. This is especially relevant for Shuusou Gyoku as you'll probably never listen to sound effects at full volume.

Now that we understand what's going on there, we can finally compare the output of various resamplers and pick a suitable one to use with miniaudio. And immediately, we see how they fall into two categories:

miniaudio only comes with a linear resampler – but so does DirectSound as it turns out, so we can get actually pretty close to how the game sounded originally:

All of Shuusou Gyoku's sound effects combined and resampled into a single 48,000 Hz / 32-bit float .WAV file, using GoldWave's File Merger tool. By converting to 32-bit float first and then resampling, the conversion preserved the exact frequency range of the original 22,050 Hz and 11,025 Hz files, even despite clipping. There are small noise peaks across the entire frequency range, but they only occur at the exact boundary between individual sound effects. These are a simple result of the discontinuities that naturally occur in the waveform when concatenating signals that don't start or end at a 0 sample.
As mentioned above, you'll only get this sound out of your DAC at lower volumes where all of the resampled peaks still fit within 0 dBFS. But you most likely will have reduced your volume anyway, because these effects would be ear-splittingly loud otherwise.
The result of converting 1️⃣ into FLAC. The necessary bit depth conversion from 32-bit float to 16-bit integers clamps any data above 0 dBFS or ±1.0f to the discrete -32,678 32,767, the maximum value of such an integer. The resulting straight lines at maximum amplitude in the time domain then turn into distortion across the entire 24,000 Hz frequency domain, which then remains a part of the waveform even at lower volumes. The locations of the high-frequency noise exactly match the clipped locations in the time-domain waveform images above.
The resulting additional distortion can be best heard in BOSSBOMB, where the low source frequency ensures that any distortion stays firmly within the hearing range of most humans.
All of Shuusou Gyoku's sound effects as played through DirectSound and recorded through Stereo Mix. DirectSound also seems to use a linear low-pass filter that leaves quite a bit of high-frequency noise in the signals, making these effects sound crispier than they should be. Depending on where you stand, this is either highly inaccurate and something that should be fixed, or actually good because the sound effects really benefit from that added high end. I myself am definitely in the latter camp – and hey, this sound is the result of original game code, so it is accurate at least in that regard. :tannedcirno:
All of Shuusou Gyoku's sound effects as converted by miniaudio and directly saved to a file, with the same low-pass filter setting used in the P0256 build. This first-order low-pass filter is a decent approximation of DirectSound's resampler, even though it sounds slightly crispier as the high-frequency noise is boosted a little further. By default, miniaudio would use a 4th-order low-pass filter, so this is the second-lowest resampling quality you can get, short of disabling the low-pass filter altogether.
Conversion results when using miniaudio's 8th-order low-pass filter for resampling, the highest quality supported. This is the closest we can get to the reference conversion without using a custom resampler. If we do want to go for perfect accuracy though, we might as well go for 1️⃣ directly?

These spectrum images were initially created using ffmpeg's -lavfi showspectrumpic=mode=combined:s=1280x720 filter. The samples appear in the same order as in the waveform above.

And yes, these are indeed the first videos on this blog to have sound! I spent another push on preparing the 📝 video conversion pipeline for audio support, and on adding the highly important volume control to the player. Web video codecs only support lossy audio, so the sound in these videos will not exactly match the spectrum image, but the lossless source files do contain the original audio as uncompressed PCM streams.


Compared to that whole mess of signals and noise, keyboard and joypad input is indeed much simpler. Thanks to SDL, it's almost trivial, and only slightly complicated because SDL offers two subsystems with seemingly identical APIs:

To match Shuusou Gyoku's original WinMM backend, we'd ideally want to keep the best aspects from both APIs but without being restricted to SDL_GameController's idea of a controller. The Joy Pad menu just identifies each button with a numeric ID, so SDL_Joystick would be a natural fit. But what do we do about directional controls if SDL_Joystick doesn't tell us which joypad axes correspond to the X and Y directions, and we don't have the SDL-recommended configuration UI yet? Doing that right would also mean supporting POV hats and D-pads, after all… Luckily, all joypads we've tested map their main X axis to ID 0 and their main Y axis to ID 1, so this seems like a reasonable default guess.

Fortunately, there is a solution for our exact issue. We can still try to open a joypad via SDL_GameController, and if that succeeds, we can use a function to retrieve the SDL_Joystick ID for the main X and Y axis, close the SDL_GameController instance, and keep using SDL_Joystick for the rest of the game.
And with that, the SDL build no longer needs DirectInput 7, certain antivirus scanners will no longer complain about its low-level keyboard hook, and I turned the original game's single-joypad hot-plugging into multi-joypad hot-plugging with barely any code. 🎮

The necessary consolidation of the game's original input handling uncovered several minor bugs around the High Score and Game Over screen that I sufficiently described in the release notes of the new build. But it also revealed an interesting detail about the Joy Pad screen: Did you know that Shuusou Gyoku lets you unbind all these actions by pressing more than one joypad button at the same time? The original game indicated unbound actions with a [Button 0] label, which is pretty confusing if you have ever programmed anything because you now no longer know whether the game starts numbering buttons at 0 or 1. This is now communicated much more clearly.

Joypad button unbinding in the original version of Shuusou Gyoku, indicated by a rather confusing [Button 0] labelJoypad button unbinding in the P0256 build of Shuusou Gyoku, using a much clearer [--------] label
ESC is not bound to any joypad button in either screenshot, but it's only really obvious in the P0256 build.

With that, we're finally feature-complete as far as this delivery is concerned! Let's send a build over to the backers as a quick sanity check… a~nd they quickly found a bug when running on Linux and Wine. When holding a button, the game randomly stops registering directional inputs for a short while on some joypads? Sounds very much like a Wine bug, especially if the same pad works without issues on Windows.
And indeed, on certain joypads, Wine maps the buttons to completely different and disconnected IDs, as if it simply invents new buttons or axes to fill the resulting gaps. Until we can differentiate joypad bindings per controller, it's therefore unlikely that you can use the same joypad mapping on both Windows and Linux/Wine without entering the Joy Pad menu and remapping the buttons every time you switch operating systems.

Still, by itself, this shouldn't cause any issues with my SDL event handling code… except, of course, if I forget a break; in a switch case. 🫠
This completely preventable implicit fallthrough has now caused a few hours of debugging on my end. I'd better crank up the warning level to keep this from ever happening again. Opting into this specific warning also revealed why we haven't been getting it so far: Visual Studio did gain a whole host of new warnings related to the C++ Core Guidelines a while ago, including the one I was looking for, but actually getting the compiler to throw these requires activating a separate static analysis mode together with a plugin, which significantly slows down build times. Therefore I only activate them for release builds, since these already take long enough. :onricdennat:

But that wasn't the only step I took as a result of this blunder. In addition, I now offer free fixes for regressions in my mod releases if anyone else reports an issue before I find it myself. I've already been following this policy 📝 earlier this year when mu021 reported the unblitting bug in the initial release of the TH01 Anniversary Edition, and merely made it official now. If I was the one who broke a thing, I'll fix it for free.


Since all that input debugging already started a 5th push, I might as well fill that one by restoring the original screenshot feature. After all, it's triggered by a key press (and is thus related to the input backend), reads the contents of the frame buffer (and is thus related to the graphics backend), and it honestly looks bad to have this disclaimer in the release notes just because we're one small feature away from 100% parity with pbg's original binary.
Coincidentally, I had already written code to save a DirectDraw surface to a .BMP file for all the debugging I did in the last delivery, so we were basically only missing filename generation. Except that Shuusou Gyoku's original choice of mapping screenshots to the PrintScreen key did not age all too well:

As a result, both Arandui and I independently arrived at the idea of remapping screenshots to the P key, which is the same screenshot key used by every Windows Touhou game since TH08.

The rest of the feature remains unchanged from how it was in pbg's original build and will save every distinct frame rendered by the game (i.e., before flipping the two framebuffers) to a .BMP file as long as the P key is being held. At a 32-bit color depth, these screenshots take up 1.2 MB per frame, which will quickly add up – especially since you'll probably hold the P key for more than 1/60 of a second and therefore end up saving multiple frames in a row. We should probably compress them one day.


Since I already translated some of Shuusou Gyoku's ASM code to C++ during the Zig experiment, it made sense to finish the fifth push by covering the rest of those functions. The integer math functions are used all throughout the game logic, and are the main reason why this goal is important for a Linux port, or any port to a 64-bit architecture for that matter. If you've ever read a micro-optimization-related blog post, you'll know that hand-written ASM is a great recipe that often results in the finest jank, and the game's square root function definitely delivers in that regard, right out of the gate.
What slightly differentiates this algorithm from the typical definition of an integer square root is that it rounds up: In real numbers, √3 is ≈ 1.73, so isqrt(3) returns 2 instead of 1. However, if the result is always rounded down, you can determine whether you have to round up by simply squaring the calculated root and comparing it to the radicand. And even that is only necessary if the difference between the two doesn't naturally fall out of the algorithm – which is what also happens with Shuusou Gyoku's original ASM code, but pbg didn't realize this and squared the result regardless. :tannedcirno:

That's one suboptimal detail already. Let's call the original ASM function in a loop over the entire supported range of radicands from 0 to 231 and produce a list of results that I can verify my C++ translation against… and watch as the function's linear time complexity with regard to the radicand causes the loop to run for over 15 hours on my system. 🐌 In a way, I've found the literal opposite of Q_rsqrt() here: Not fast, not inverse, no bit hacks, and surely without the awe-inspiring kind of WTF.
I really didn't want to run the same loop over a literal C++ translation of the same algorithm afterward. Calculating integer square roots is a common problem with lots of solutions, so let's see if we can go better than linear.

And indeed, Wikipedia also has a bitwise algorithm that runs in logarithmic time, uses only additions, subtractions, and bit shifts, and even ends up with an error term that we can use to round up the result as necessary, without a multiplication. And this algorithm delivers the exact same results over the exact same range in… 50 seconds. 🏎️ And that's with the I/O to print the first value that returns each of the 46,341 different square root results.

"But wait a moment!", I hear you say. "Why are you bothering with an integer square root algorithm to begin with? Shouldn't good old round(sqrt(x)) from <math.h> do the trick just fine? Our CPUs have had SSE for a long time, and this probably compiles into the single SQRTSD instruction. All that extra floating-point hardware might mean that this instruction could even run in parallel with non-SSE code!"
And yes, all of that is technically true. So I tested it, and my very synthetic and constructed micro-benchmark did indeed deliver the same results in… 48 seconds. :thonk: That's not enough of a difference to justify breaking the spirit of treating the FPU as lava that permeates Shuusou Gyoku's code base. Besides, it's not used for that much to begin with:

After a quick C++ translation of the RNG function that spells out a 32-bit multiplication on a 32-bit CPU using 16-bit instructions, we reach the final pieces of ASM code for the 8-bit atan2() and trapezoid rendering. These could actually pass for well-written ASM code in how they express their 64-bit calculations: atan8() prepares its 64-bit dividend in the combined EDX and EAX registers in a way that isn't obvious at all from a cursory look at the code, and the trapezoid functions effectively use Q32.32 subpixels. C++ allows us to cleanly model all these calculations with 64-bit variables, but unfortunately compiles the divisions into a call to a comparatively much more bloated 64-bit/64-bit-division polyfill function. So yeah, we've actually found a well-optimized piece of inline assembly that even Visual Studio 2022's optimizer can't compete with. But then again, this is all about code generation details that are specific to 32-bit code, and it wouldn't be surprising if that part of the optimizer isn't getting much attention anymore. Whether that optimization was useful, on the other hand… Oh well, the new C++ version will be much more efficient in 64-bit builds.

And with that, there's no more ASM code left in Shuusou Gyoku's codebase, and the original DirectXUTYs directory is slowly getting emptier and emptier.


Phew! Was that everything for this delivery? I think that was everything. Here's the new build, which checks off 7 of the 15 remaining portability boxes:

:sh01: Shuusou Gyoku P0256

Next up: Taking a well-earned break from Shuusou Gyoku and starting with the preparations for multilingual PC-98 Touhou translatability by looking at TH04's and TH05's in-game dialog system, and definitely writing a shorter blog post about all that…

📝 Posted:
🚚 Summary of:
P0246, P0247, P0248, P0249, P0250, P0251
Commits:
(Seihou) P0226...152ad74, (Seihou) 152ad74...54c3c4e, (Seihou) 54c3c4e...62ff407, (Seihou) 62ff407...a1f80a3, (Seihou) a1f80a3...629ddd8, (Seihou) 629ddd8...P0251
💰 Funded by:
Ember2528, Arandui, alp-bib
🏷 Tags:

And then I'm even late by yet another two days… For some reason, preparing Shuusou Gyoku for an OpenGL port has been the most difficult and drawn-out task I've worked on so far throughout this project. These pushes were in development since April, and over two months in total. Tackling a legacy codebase with such a rather vague goal while simultaneously wanting to keep everything running did not do me any favors, and it was pretty hard to resist the urge to fix everything that had better be fixed to make this game portable…
📝 2022 ended with Shuusou Gyoku working at full speed on Windows ≥8 by itself, without external tools, for the first time. However, since it all came down to just one small bugfix, the resulting build still had several issues:

Now, we could tackle all of these issues one by one, in focused pushes… or wait for one hero to fund a full-on OpenGL backend as part of the larger goal of porting this game to Linux. This would take much longer, but fix all these issues at once while bringing us significantly closer to Shuusou Gyoku being cross-platform. Which is exactly what Ember2528 did.


Shuusou Gyoku is a very Windows-native codebase. Its usage of types declared in <windows.h> even extends to core gameplay code, the rendering code is completely architected around DirectDraw's features and drawbacks, and text rendering is not abstracted at all. Looks like it's now my task to write all the abstractions that pbg didn't manage to write…
Therefore, I chose to stay with DirectDraw for a few more pushes while I would build these abstractions. In hindsight, this was the least efficient approach one could possibly imagine for the exact goal of porting the game to Linux. Suddenly, I had to understand all this DirectDraw and GDI jank, just to keep the game running at every step along the way. Retaining Shuusou Gyoku's 8-bit mode in particular was a huge pain, but I didn't want to remove it because it's currently the only way I can easily debug the game in windowed mode at a scaled resolution, through DxWnd. In 16-bit or 32-bit mode, DxWnd slows down to a crawl, roughly resembling the performance drop we used to get with Windows' own compatibility mitigations for the original build.
The upside, though, is that everything I've built so far still works with the original 8-bit and 16-bit graphics modes. And with just one compiler flag to disable any modern x86 instructions, my build can still run on i586/P5 Pentium CPUs, and only requires KernelEx and its latest Kstub822 patches to run on Windows 98. And, surprisingly, my core audience does appreciate this fact. Thus, I will include an i586 build in all of my upcoming Shuusou Gyoku releases from now on. Once this codebase can compile into a 64-bit binary (which will obviously be required for a native Linux build), the i586 build will remain the only 32-bit Windows build I'll include in my releases.


So, what was DirectDraw? In the shortest way that still describes it accurately from the point of view of a developer: "A hardware acceleration layer over Ye Olde Win32 GDI, providing double-buffering and fast blitting of rectangles." There's the primary double-buffered framebuffer surface, the offscreen surfaces that you create (which are comparable to what 3D rendering APIs would call textures), and you can blit rectangular regions between the two. That's it. Except for double-buffering, DirectDraw offers no feature that GDI wouldn't also support, while not covering some of GDI's more complex features. I mean, DirectDraw can blit rectangles only? How lame. :tannedcirno:

However, DirectDraw's relative lack of features is not as much of a problem as it might appear at first. The reason for that lies in what I consider to be DirectDraw's actual killer feature: compatibility with GDI's device context (DC) abstraction. By acquiring a DC for a DirectDraw surface, you can use all existing GDI functions to draw onto the surface, and, in general, it will all just work. 😮 Most notably, you can use GDI's blitting functions (i.e., BitBlt() and friends) to transfer pixel data from a GDI HBITMAP in system memory onto a DirectDraw surface in video memory, which is the easiest and most straightforward way to, well, get sprite data onto a DirectDraw surface in the first place.
In theory, you could do that without ever touching GDI by locking the surface memory and writing the raw bytes yourself. But in practice, you probably won't, because your game has to run under multiple bit depths and your data files typically only store one copy of all your sprites in a single bit depth. And the necessary conversion and palette color matching… is a mere implementation detail of GDI's blitting functions, using a supposedly optimized code path for every permutation of source and destination bit depths.

All in all, DirectDraw doesn't look too bad so far, does it? Fast blitting, and you can still use the full wealth of GDI functions whenever needed… at the small cost of potentially losing your surface memory at any time. 🙄 Yup, if a DirectDraw game runs in true resolution-changing fullscreen mode and you switch to the Windows desktop, all your surface memory is freed and you have to manually restore it once the game regains focus, followed by manually copying all intended bitmap data back onto all surfaces. DirectDraw is where this concept of surface loss originated, which later carried over to the earlier versions of Direct3D and, infamously, Direct2D as well.
Looking at it from the point of view of the mid-90s, it does make sense to let the application handle trashed video memory if that's an unfortunate reality that your graphics API implementation has to deal with. You don't want to retain a second copy of each surface in a less volatile part of memory because you didn't have that much of it. Instead, the application can now choose the most appropriate way to restore each individual surface. For procedurally generated surfaces, it could just re-run the generating code, whereas all the fixed sprite sheets could be reloaded from disk.

In practice though, this well-intentioned freedom turns into a huge pain. Suddenly, it's no longer enough to load every sprite sheet once before it's needed, blit its pixel data onto the DirectDraw surface, and forget about it. Now, the renderer must also be able to refresh the pixel data of every surface from within itself whenever any of DirectDraw's blitting functions fails with a DDERR_SURFACELOST error. This fact alone is enough to push your renderer interface towards central management and allocation of surfaces. You could maybe avoid the conceptual SurfaceManager by bundling each surface with a regeneration callback, but why should you? Any other graphics API would work with straight-line procedural load-and-forget initialization code, so why slice that code into little parts just because of some DirectDraw quirk?

So if your surfaces can get trashed at any time, and you already use GDI to copy them from system memory to DirectDraw-managed video memory, and your game features at least one procedurally generated surface… you might as well retain every currently loaded surface in the form of an additional GDI device-independent bitmap. 🤷 In fact, that's even better than what Shuusou Gyoku did originally: For all .BMP-sourced surfaces, it only kept a buffer of the entire decompressed .BMP file data, which means that it had to recreate said intermediate GDI bitmap every time it needed to restore a surface. The in-game music title was originally restored via regeneration callback that re-rendered the intended title directly onto the DirectDraw surface, but this was handled by an additional "restore hook" system that remained unused for anything else.
Anything more involved would be a micro-optimization, especially since the goal is to get away from DirectDraw here. Not much point in "neatly" reloading sprite surfaces from disk if the total size of all loaded sprite sheets barely exceeds the 1 MiB mark. Also, keeping these GDI DIBs loaded and initialized does speed up getting back into the game… in theory, at least. After all, the game still runs in fullscreen mode, and resolution switching already takes longer on modern flat-panel displays than any surface restoration method we could come up with. :tannedcirno:


So that was all pretty annoying. But once we start rendering in 8-bit mode, it gets even worse as we suddenly have to bother with palette management. Similar to PC-98 Touhou, Shuusou Gyoku uses way too many different palettes. In fact, it creates a separate DirectDraw palette to retain the palette embedded into every loaded .BMP file, and simply sets the palette of the primary surface and the backbuffer to the one it loaded last. Like, why would you retain per-surface palettes, and what effect does this even have? What even happens when you blit between two DirectDraw surfaces that have different palettes? Might this be the cause of the discolored in-game music title when playing under DxWnd? 😵
But if we try throwing out those extra palettes, it only takes until Stage 3 for us to be greeted with… the infamous golf course:

Shuusou Gyoku's Stage 3 if it only used the palette it loaded last
Looks familiar? You might remember these colors from your attempts to run the original build using D3DWindower.

As you might have guessed, these exact colors come from Gates' face sprite, whose palette apparently doesn't match the sprite sheets used in Stage 3. Turns out that 256 colors are not enough for what Shuusou Gyoku would like to use across the entire stage. In sprite loading order:

Sprite sheet GRAPH.DAT file Additional unique colors Total unique colors
General system sprites #0 +96 96
Stage 3 enemies #3 +42 138
Stage 3 map tiles #9 +40 178
Wide Shot bomb cut-in #26 +3 181
VIVIT's faceset #13 +40 221
Unknown face #14 +35 256
Gates' faceset #17 +40 296

And that's why Shuusou Gyoku does not only have to retain these palettes, but also contains stage script commands (!) to switch the current palette back to either the map or enemy one, after the dialog system enforced the face palette.

But the worst aspects about palettes rear their ugly head at the boundary between GDI and DirectDraw, when GDI adds its own palettes into the mix. None of the following points are clearly documented in either ancient or current MSDN, forcing each new DirectDraw developer to figure them out on their own:

Ultimately, all of this is why Shuusou Gyoku's original DirectDraw backend looks the way it does. It might seem redundant and inefficient in places, but pbg did in fact discover the only way where all the undocumented GDI and DirectDraw color mapping internals come together to make the game look as intended. 🧑‍🔬
And what else are you going to do if you want to target old hardware? My PC-9821Nw133, for example, can only run the original Shuusou Gyoku in 8-bit mode. For a Windows game on such old hardware, 8-bit DirectDraw looks like the only viable option. You certainly don't want to use GDI alone, because that's probably slow and you'd have to worry about even more palette-related issues. Although people have reported that Shuusou Gyoku does actually run faster on their old Windows 9x machine if they disable DirectDraw acceleration…?
In that case, it might be worth a try to write a completely new 8-bit software renderer, employing the same retained VRAM techniques that the PC-98 Touhou games used to implement their scrolling playfields with a minimum of redraws. The hardware scrolling feature of the PC-98 GDC would then be replicated by blitting the playfield in two halves every frame. I wonder how fast that would be…
Or you go straight back to DOS, and bring your own font renderer and MIDI/PCM sound driver. :thonk:


So why did we have to learn about all this? Well, if GDI functions can directly render onto any kind of DirectDraw surface, this also includes text rendering functions like TextOut() and DrawText(). If you're really lazy, you can even render your text directly onto the DirectDraw backbuffer, which probably re-rasterizes all glyphs every frame!
Which, you guessed it, is exactly how Shuusou Gyoku renders most of its text. 🐷 Granted, it's not too bad with MS Gothic thanks to its embedded bitmaps for font heights between 7 and 22 inclusive, which replace the usual Bézier curve rasterization for TrueType fonts with a rather quick bitmap lookup. However, it would not only become a hypothetical problem if future translations end up choosing more complex fonts without embedded bitmaps, but also as soon as we port the game to other systems. Nobody in their right mind would integrate a cross-platform font renderer directly with a 3D graphics API… right? :onricdennat:

Instead, let's refactor the game to render all its existing text to and from a bitmap, extending the way the in-game music title is rendered to the rest of the game. Conceptually, this is also how the Windows Touhou games have always rendered their text. Since they've always used Direct3D, they've always had to blit GDI's output onto a texture. Through the definitions in text.anm, this fixed-size texture is then turned into a sprite sheet, allowing every rendered line of text to be individually placed on the screen and animated.
However, the static nature of both the sprite sheet and the texture caused its fair share of problems for thcrap's translation support. Some of the sprites, particularly the ones for spell card titles, don't originally take up the entire width of the playfield, cutting off translations long before they reach the left edge. Consequently, thcrap's base patch for the Windows Touhou games has to resize the respective sprites to make translators happy. Before I added .ANM header patching in late 2018, this had to be done through a complete modified copy of text.anm for every game – with possibly additional variants if ZUN changed the layout of this file between game versions. Not to mention that it's bound to be quite annoying to manually allocate a rectangle for every line of text we want to show. After all, I have at least two text-heavy future features in mind already…

So let's not do exactly that. Since DirectDraw wants us to manage all surfaces in a central place, we keep the idea of using a single surface for all text. But instead of predefining anything about the surface layout, we fully build up the surface at runtime based on whatever rectangles we need, using a rectangle packing algorithm… yup, I wouldn't have expected to enter such territory either. For now, we still hardcode a fixed size that each piece of text is allowed to maximally take up. But once we get translations, nothing is stopping us from dynamically extending this size to fit even longer strings, and fitting them onto the fixed screen space via smooth scrolling.
To prevent the surface from arbitrarily growing as the game wants to render more and more text, we also reset all allocated rectangles whenever the game state changes. In turn, this will also recreate the text surface to match the new bounding box of all rectangles before the first prerendering call with the new layout. And if you remember the first bullet point about DirectDraw palettes in 8-bit mode, this also means that the text surface automatically receives the current palette of the primary surface, giving us correct colors even without requiring DxWnd's DC palette tweak. 🎨

In fact, the need to dynamically create surfaces at custom sizes was the main reason why I had to look into DirectDraw surface management to begin with. The original game created all of its surfaces at once, at startup or after changing the bit depth in the main menu, which was a bad idea for many reasons:

In the end, we get four different layouts for the text surface: One for the main menu, the Music Room, the in-game portion, and the ending. With, perhaps surprisingly, not too much text on either of them:

The font-rendered text from Shuusou Gyoku's sound option menu, packed into a texture. The font-rendered text from Shuusou Gyoku's Music Room, packed into a texture. The font-rendered text from Shuusou Gyoku's Stage 1, packed into a texture. The font-rendered text from Shuusou Gyoku's ending, packed into a texture.
Yes, the ending uses just a single rectangle that takes up the entire screen space below the pictures and credits.
For the menus, the resulting packed layout reveals how I'm assigning a separately cached rectangle to each possible option – otherwise, they couldn't be arranged vertically on screen with this bitmap layout. Right now, I'm only storing all text for the current menu level, which requires text to be rendered again when entering or leaving submenus. However, I'm allocating as many rectangles as required for the submenu with the most amount of items to at least prevent the single text surface from being resized while navigating through the menu. As a side effect, this is also why you can see multiple Exit labels: These simply come from other submenus with more elements than the currently visited Sound / Music one.

Still, we're re-rasterizing whole lines of text exactly as they appear on screen, and are even doing so multiple times to apply any drop shadows. Isn't that exactly what every text rendering tutorial nowadays advises against doing? Why not directly go for the classic solution to this problem and render using a font texture atlas? Well…

While the Music Room and the ending can be easily migrated to a prerendering system, it's much harder for the main menu. Technically, all option strings of the currently active submenu are rewritten every frame, even though that would only be necessary for the scrolling MIDI device name in the Sound / Music submenu. And since all this rewriting is done via a classic sprintf() on fixed-size char buffers, we'd have to deploy our own change detection before prerendering can have any performance difference.
In essence, we'd be shifting the text rendering paradigm from the original immediate approach to a more retained one. If you've ever used any of the hot new immediate-mode GUI or web frameworks that have become popular over the last 10 years, your alarm bells are probably already ringing by now. Adding retained elements is always a step back in terms of code quality, as it increases complexity by storing UI state in a second place.

Wouldn't it be better if we could just stay with the original immediate approach then? Absolutely, and we only need a simple cache system to get there. By remembering the string that was last rendered to every registered rectangle, the text renderer can offer an immediate API that combines the distinct Prerender() and Blit() steps into a single Render() call. There still has to be an initialization point that registers all rectangles for each game state (which, surprisingly, was not present for the in-game portion in the original code), but the rendering code remains architecturally unchanged in how we call the text renderer every frame. As long as the text doesn't change, the text renderer just blits whatever it previously rendered to the respective rectangle. With an API like this, the whole pre-rendering part turns into a mere implementation detail.

So, how much faster is the result? Since I can only measure non-VSynced performance in a quite rudimentary way using DxWnd's FPS counter, it highly depends on the selected renderer. Weirdly enough, even just switching font creation to the Unicode APIs tripled the FPS inside the Music Room when rendering with OpenGL? That said, the primary surface renderer seems to yield the most realistic numbers, as we still stay entirely within DirectDraw and perform no API wrapping. Using this renderer, I get speedups of roughly:

Not bad for something I had to do anyway to port the game away from DirectDraw! Shuusou Gyoku is rather infamous among the vintage computer scene for being ridiculously unoptimized, so I should definitely be able to get some performance gains out of the in-game portion as well.

For a final test of all the new blitting code, I also tried running outside DxWnd to verify everything against real and unpatched DirectDraw. Amusingly, this revealed how blitting from the new text surface seems to reach the color mapping limits of the DWM mitigation in 8-bit mode:

For some reason, my system maps the intended #FFFFFF text color to #E4E3BB in the main menu?

8-bit mode does render correctly when I ran the same build in a Windows 98 VirtualBox on the same system though, so it's not worth looking into a mode that the system reports as unsupported to begin with. Let's leave this as somewhat of a visual reminder for players to select 32-bit mode instead.


Alright, enough about the annoying parts of GDI and DirectDraw for now. Let's stop looking back and start looking forward, to a time within this Seihou revolution when we're going to have lots of new options in the main menu. Due to the nature of delivering individual pushes, we can expect lots of revisions to the config file format. Therefore, we'd like to have a backward-compatible system that allows players to upgrade from any older build, including the original 秋霜玉.exe, to a newer one. The original game predominantly used single-byte values for all its options, but we'd like our system to work with variables of any size, including strings to store things like the name of the selected MIDI device in a more robust way. Also, it's pure evil to reset the entire configuration just because someone tried to hex-edit the config file and didn't keep the checksum in mind.

It didn't take long for me to arrive at a common Size()/Read()/Write() interface. By using the same interface for both arrays and individual values, new config file versions can naturally expand older ones by taking the array of option references from the previous version and wrapping it into a new array, together with the new options.
The classic way of implementing this in C++ involves a typical object-oriented class hierarchy: An Option base class would define the interface in the form of virtual abstract functions, and the Value, Array, and ConfigVersion subclasses would provide different implementations. This works, but introduces quite a bit of boilerplate, not to mention the runtime bloat from all the virtual functions which Visual C++ can't inline. Why should we do any runtime dispatch here? We know the set of configuration options at compile time, after all… :thonk:

Let's try looking into the modern C++ toolbox and see if we can do better. The only real challenge here is that the array type has to support arbitrarily sized option value types, which sounds like a job for template parameter packs. If we save these into a std::tuple, we can then "iterate" over all options with std::apply and fold expressions, in a nice functional style.
I was amazed by just how clearly the "crazy" modern C++ approach with template parameter packs, std::apply() over giant std::tuples, and fold expressions beats a classic polymorphic hierarchy of abstract virtual functions. With the interface moved into an even optional concept, the class hierarchy can be completely flattened, which surprisingly also makes the code easier to both read and write.

Here's how the new system works from the player's point of view:


With that, we've got more than enough code for a new build:

:sh01: Shuusou Gyoku P0251

This build also contains two more fixes that didn't fit into the big DirectDraw or configuration categories:

These 6 pushes still left several of Shuusou Gyoku's DirectDraw portability issues unsolved, but I'd better look at them once I've set up a basic OpenGL skeleton to avoid any more premature abstraction. Since the ultimate goal is a Linux port, I might as well already start looking at the current best platform layer libraries. SDL would be the standard choice here, and while SDL_ttf looks regrettably misdesigned, the core SDL library seems to cover all we could possibly want for Shuusou Gyoku, including a 2D renderer… wait, what?!

Yup. Admittedly, I've been living under a rock as far as SDL is concerned, and thus wasn't aware that SDL 2 introduced its own abstraction for 2D rendering that just happens to almost exactly cover everything we need for Shuusou Gyoku. This API even covers all of the game's Direct3D code, which only draws alpha-blended, untextured, and pre-transformed vertex-colored triangles and lines. It's the exact abstraction over OpenGL I thought I had to write myself, and such a perfect match for this game that it would be foolish to go for a custom OpenGL backend – especially since SDL will automatically target the ideal graphics API for any given operating system.

Sadly, the one thing SDL_Renderer is missing is something equivalent to pixel shaders, which we would need to replicate the 西方 Project lens ball effect shown at startup. Looks like we have to drop into a completely separate, unaccelerated rendering mode and continue to software-render this one effect before switching to hardware-accelerated rendering for the rest of the game. But at least we can do that in a cross-platform way, and don't have to bother with shading languages – or, perhaps even worse, SDL's own shading language.
If we were extremely pedantic, we'd also have to do the same for the 📝 unused spiral effect that was originally intended for the staff roll. Software rendering would be even more annoying there, since we don't just have to software-render these staff sprites, but also the ending picture and text, complete with their respective fade effects. And while I typically do go the extra mile to preserve whatever code was present in these games, keeping this effect would just needlessly drive up the cost of the SDL backend. Let's just move this one to the museum of unused code and no longer actively compile it. RIP spiral 🥲 At least you're still preserved in lossless video form.

Now that SDL has become an integral part of Shuusou Gyoku's portability plan rather than just being one potential platform layer among many, the optimal order of tasks has slightly changed. If we stayed within the raw Win32 API any longer than absolutely necessary, we'd only risk writing more Win32-native code for things like audio streaming that we'd then have to throw away and rewrite in SDL later. Next up, therefore: Staying with Shuusou Gyoku, but continuing in a much more focused manner by fixing the input system and starting the SDL migration with input and sound.

📝 Posted:
🏷 Tags:

Yet another small interruption before we get to Shuusou Gyoku, but only because I've got a big announcement to make! Touhou Patch Center has just commissioned the basic feature set that would allow PC-98 Touhou to be translated into non-ASCII languages. 💰 And we're in fact doing it on PC-98, and don't wait for the games to be ported to other systems first.

How is this going to work?

This project will start sometime after I've completed the current big project of porting Shuusou Gyoku to Linux, so probably in the last few months of 2023. Similar to the previous MediaWiki update, this will bypass the ReC98 push and cap model: Touhou Patch Center is going to guarantee a minimum budget out of their Open Collective funds, which can be increased with further donations from the community, and I'm going to send an invoice once I'm done. In addition, I'm also going to keep in contact with all interested translators and backers via a Discord room throughout the process for additional technical quality control.
Since ReC98 hasn't really focused on text so far, the amount of moddable text-related code would be fairly limited if I started working on it right away. As of 2023-11-30, it only includes the following:

Therefore, I'll put any general and unconstrained reverse-engineering, position independence, or anything contributions that come in during the next few months towards covering everything that's still missing there:

In total, that's the next 6 general pushes that will go towards ensuring translatability of most of PC-98 Touhou, if possible. If you'd like your contribution (or existing subscription) to go to gameplay code instead, be sure to tell me!

What's the minimum guaranteed set of features?

The main feature will be a custom renderer for a subsetted, monospaced Unicode bitmap font, and its integration into any translatable part of the game. For the script files, this means UTF-8 support with Shift-JIS fallback. For the glyphs, I'll use GNU Unifont by default, but we could also use any other freely licensed bitmap font with 8×16 or 16×16 glyphs for alphabets of certain languages. Everything about this will be the real deal: The system will potentially support all of Unicode without font ROM hacks so that the translations will work on real hardware, and there will be no shortcuts for just a few Latin characters. And if someone wants to translate this game into a language with more complex shaping rules, I'll make sure that they look pretty as well if there's some budget left.
This will allow translation teams to build static translation patches into any language by editing the original script files, and using -Tom-'s existing tools for any images. Modifications of hardcoded strings would still require recompiling the binary, and each group would have to distribute and advertise the result on their own.

🌐 Which languages are we getting?

As of 2023-10-10, the following translators and teams have expressed interest:

Wait, Arabic?! On my PC-98?! What's the plan there?

The two challenges with Arabic scripts are transforming a text to use the codepoints for contextual glyph forms (shaping), and right-to-left rendering. Shaping requires not too much code, which is easily added to the font subsetting build step. Right-to-left rendering, on the other hand, must be a feature of the new PC-98-native text renderer, because there are several places in PC-98 Touhou where text is gradually typed character-by-character. So it will require a bit of dedicated budget, but not all too much from what I can tell. Bidirectional text would add a great deal of complexity here, but we most likely won't need to implement it – I'll simply pick a direction based on the first codepoint on a line, and ask translators to manually reverse any Latin-script runs of text in the middle of an Arabic-script line.

How much better could it all be?

You might remember most of this from 📝 my initial pitch back in November, but I did have quite a bit of additional ideas since then.

These features are mostly independent of each other, and it will be up to Touhou Patch Center to pick a priority order. That's also where all of you could come in and influence this order with your donations. So it's closer to a traditional crowdfunding campaign with stretch goals, where the sky is the limit, than it is to the usual ReC98 model. And while there can be no fixed prices for any of the goals, you can be sure that anything you invest will improve the quality of the final product.

:opencollective: Touhou Patch Center on Open Collective

From now on, this will be the only way of funding any translation-related goals; I've removed the respective options from the ReC98 order form. Looking forward to how many of these additional ideas I get to implement – but, as always, please invest responsibly.

Shuusou Gyoku finally coming this weekend.

📝 Posted:
🚚 Summary of:
P0245
Commits:
97f0c3b...5876755
💰 Funded by:
Blue Bolt, Ember2528, [Anonymous], Yanga
🏷 Tags:

And then, the supposed boilerplate code revealed yet another confusing issue that quickly forced me back to serial work, leading to no parallel progress made with Shuusou Gyoku after all. 🥲 The list of functions I put together for the first ½ of this push seemed so boring at first, and I was so sure that there was almost nothing I could possibly talk about:

That's three instances of ZUN removing sprites way earlier than you'd want to, intentionally deciding against those sprites flying smoothly in and out of the playfield. Clearly, there has to be a system and a reason behind it.

Turns out that it can be almost completely blamed on master.lib. None of the super_*() sprite blitting functions can clip the rendered sprite to the edges of VRAM, and much less to the custom playfield rectangle we would actually want here. This is exactly the wrong choice to make for a game engine: Not only is the game developer now stuck with either rendering the sprite in full or not at all, but they're also left with the burden of manually calculating when not to display a sprite.
However, strictly limiting the top-left screen-space coordinate to (0, 0) and the bottom-right one to (640, 400) would actually stop rendering some of the sprites much earlier than the clipping conditions we encounter in these games. So what's going on there?

The answer is a combination of playfield borders, hardware scrolling, and master.lib needing to provide at least some help to support the latter. Hardware scrolling on PC-98 works by dividing VRAM into two vertical partitions along the Y-axis and telling the GDC to display one of them at the top of the screen and the other one below. The contents of VRAM remain unmodified throughout, which raises the interesting question of how to deal with sprites that reach the vertical edges of VRAM. If the top VRAM row that starts at offset 0x0000 ends up being displayed below the bottom row of VRAM that starts at offset 0x7CB0 for 399 of the 400 possible scrolling positions, wouldn't we then need to vertically wrap most of the rendered sprites?
For this reason, master.lib provides the super_roll_*() functions, which unconditionally perform exactly this vertical wrapping. But this creates a new problem: If these functions still can't clip, and don't even know which VRAM rows currently correspond to the top and bottom row of the screen (since master.lib's graph_scrollup() function doesn't retain this information), won't we also see sprites wrapping around the actual edges of the screen? That's something we certainly wouldn't want in a vertically scrolling game…
The answer is yes, and master.lib offers no solution for this issue. But this is where the playfield borders come in, and helpfully cover 16 pixels at the top and 16 pixels at the bottom of the screen. As a result, they can hide up to 32 rows of potentially wrapped sprite pixels below them:


The earliest possible frame that TH05 can start rendering the Stage 5 midboss on. Hiding the text layer reveals how master.lib did in fact "blindly" render the top part of her sprite to the bottom of the playfield. That's where her sprite starts before it is correctly wrapped around to the top of VRAM.
If we scrolled VRAM by another 200 pixels (and faked an equally shifted TRAM for demonstration purposes), we get an equally valid game scene that points out why a vertically scrolling PC-98 game must wrap all sprites at the vertical edges of VRAM to begin with.
Also, note how the HP bar has filled up quite a bit before the midboss can actually appear on screen.
VRAM contents of the first possible frame that TH05's Stage 5 midboss can appear on, at their original scrolling position. Also featuring the 64×64 bounding box of the midboss sprite.VRAM contents of the first possible frame that TH05's Stage 5 midboss can appear on, scrolled down by a further 200 pixels. Also featuring the 64×64 bounding box of the midboss sprite.

And that's how the lowest possible top Y coordinate for sprites blitted using the master.lib super_roll_*() functions during the scrolling portions of TH02, TH04, and TH05 is not 0, but -16. Any lower, and you would actually see some of the sprite's upper pixels at the bottom of the playfield, as there are no more opaque black text cells to cover them. Theoretically, you could lower this number for some animation frames that start with multiple rows of transparent pixels, but I thankfully haven't found any instance of ZUN using such a hack. So far, at least… :godzun:
Visualized like that, it all looks quite simple and logical, but for days, I did not realize that these sprites were rendered to a scrolling VRAM. This led to a much more complicated initial explanation involving the invisible extra space of VRAM between offsets 0x7D00 and 0x7FFF that effectively grant a hidden additional 9.6 lines below the playfield. Or even above, since PC-98 hardware ignores the highest bit of any offset into a VRAM bitplane segment (& 0x7FFF), which prevents blitting operations from accidentally reaching into a different bitplane. Together with the aforementioned rows of transparent pixels at the top of these midboss sprites, the math would have almost worked out exactly. :tannedcirno:

The need for manual clipping also applies to the X-axis. Due to the lack of scrolling in this dimension, the boundaries there are much more straightforward though. The minimum left coordinate of a sprite can't fall below 0 because any smaller coordinate would wrap around into the 📝 tile source area and overwrite some of the pixels there, which we obviously don't want to re-blit every frame. Similarly, the right coordinate must not extend into the HUD, which starts at 448 pixels.
The last part might be surprising if you aren't familiar with the PC-98 text chip. Contrary to the CGA and VGA text modes of IBM-compatibles, PC-98 text cells can only use a single color for either their foreground or background, with the other pixels being transparent and always revealing the pixels in VRAM below. If you look closely at the HUD in the images above, you can see how the background of cells with gaiji glyphs is slightly brighter (◼ #100) than the opaque black cells (◼ #000) surrounding them. This rather custom color clearly implies that those pixels must have been rendered by the graphics GDC. If any other sprite was rendered below the HUD, you would equally see it below the glyphs.

So in the end, I did find the clear and logical system I was looking for, and managed to reduce the new clipping conditions down to a set of basic rules for each edge. Unfortunately, we also need a second macro for each edge to differentiate between sprites that are smaller or larger than the playfield border, which is treated as either 32×32 (for super_roll_*()) or 32×16 (for non-"rolling" super_*() functions). Since smaller sprites can be fully contained within this border, the games can stop rendering them as soon as their bottom-right coordinate is no longer seen within the playfield, by comparing against the clipping boundaries with <= and >=. For example, a 16×16 sprite would be completely invisible once it reaches (16, 0), so it would still be rendered at (17, 1). A larger sprite during the scrolling part of a stage, like, say, the 64×64 midbosses, would still be rendered if their top-left coordinate was (0, -16), so ZUN used < and > comparisons to at least get an additional pixel before having to stop rendering such a sprite. Turbo C++ 4.0J sadly can't constant-fold away such a difference in comparison operators.

And for the most part, ZUN did follow this system consistently. Except for, of course, the typical mistakes you make when faced with such manual decisions, like how he treated TH04's Stage 4 midboss as a "small" sprite below 32×32 pixels (it's 64×64), losing that precious one extra pixel. Or how the entire rendering code for the 48×48 boss explosion sprite pretends that it's actually 64×64 pixels large, which causes even the initial transformation into screen space to be misaligned from the get-go. :zunpet: But these are additional bugs on top of the single one that led to all this research.
Because that's what this is, a bug. 🐞 Every resulting pixel boundary is a systematic result of master.lib's unfortunate lack of clipping. It's as much of a bug as TH01's byte-aligned rendering of entities whose internal position is not byte-aligned. In both cases, the entities are alive, simulated, and partake in collision detection, but their rendered appearance doesn't accurately reflect their internal position.
Initially, I classified 📝 the sudden pop-in of TH05's Stage 5 midboss as a quirk because we had no conclusive evidence that this wasn't intentional, but now we do. There have been multiple explanations for why ZUN put borders around the playfield, but master.lib's lack of sprite clipping might be the biggest reason.

And just like byte-aligned rendering, the clipping conditions can easily be removed when porting the game away from PC-98 hardware. That's also what uth05win chose to do: By using OpenGL and not having to rely on hardware scrolling, it can simply place every sprite as a textured quad at its exact position in screen space, and then draw the black playfield borders on top in the end to clip everything in a single draw call. This way, the Stage 5 midboss can smoothly fly into the playfield, just as defined by its movement code:

The entire smooth Stage 5 midboss entrance animation as shown in uth05win. If the simultaneous appearance of the Enemy!! label doesn't lend further proof to this having been ZUN's actual intention, I don't know what will.

Meanwhile, I designed the interface of the 📝 generic blitter used in the TH01 Anniversary Edition entirely around clipping the blitted sprite at any explicit combination of VRAM edges. This was nothing I tacked on in the end, but a core aspect that informed the architecture of the code from the very beginning. You really want to have one and only one place where sprite clipping is done right – and only once per sprite, regardless of how many bitplanes you want to write to.


Which brings us to the goal that the final ¼ of this push went toward. I thought I was going to start cleaning up the 📝 player movement and rendering code, but that turned out too complicated for that amount of time – especially if you want to start with just cleanup, preserving all original bugs for the time being.
Fixing and smoothening player and Orb movement would be the next big task in Anniversary Edition development, needing about 3 pushes. It would start with more performance research into runtime-shifting of larger sprites, followed by extending my generic blitter according to the results, writing new optimized loaders for the original image formats, and finally rewriting all rendering code accordingly. With that code in place, we can then start cleaning up and fixing the unique code for each boss, one by one.

Until that's funded, the code still contains a few smaller and easier pieces of code that are equally related to rendering bugs, but could be dealt with in a more incremental way. Line rendering is one of those, and first needs some refactoring of every call site, including 📝 the rotating squares around Mima and 📝 YuugenMagan's pentagram. So far, I managed to remove another 1,360 bytes from the binary within this final ¼ of a push, but there's still quite a bit to do in that regard.
This is the perfect kind of feature for smaller (micro-)transactions. Which means that we've now got meaningful TH01 code cleanup and Anniversary Edition subtasks at every price range, no matter whether you want to invest a lot or just a little into this goal.

If you can, because Ember2528 revealed the plan behind his Shuusou Gyoku contributions: A full-on Linux port of the game, which will be receiving all the funding it needs to happen. 🐧 Next up, therefore: Turning this into my main project within ReC98 for the next couple of months, and getting started by shipping the long-awaited first step towards that goal.
I've raised the cap to avoid the potential of rounding errors, which might prevent the last needed Shuusou Gyoku push from being correctly funded. I already had to pick the larger one of the two pending TH02 transactions for this push, because we would have mathematically ended up 1/25500 short of a full push with the smaller transaction. :onricdennat: And if I'm already at it, I might as well free up enough capacity to potentially ship the complete OpenGL backend in a single delivery, which is currently estimated to cost 7 pushes in total.

📝 Posted:
🚚 Summary of:
P0244
Commits:
ac33bd2...97f0c3b
💰 Funded by:
Blue Bolt, [Anonymous]
🏷 Tags:

🎉 After almost 3 years, TH04 finally caught up to TH05 and is now 100% position-independent as well! 🎉

For a refresher on what this means and does not mean, check the announcements from back in 2019 and 2020 when we chased the goal for TH05's 📝 OP.EXE and 📝 the rest of the game. These also feature some demo videos that show off the kind of mods you were able to efficiently code back then. With the occasional reverse-engineering attention it received over the years, TH04's code should now be slightly easier to work with than TH05's was back in the day. Although not by much – TH04 has remained relatively unpopular among backers, and only received more than the funded attention because it shares most of its core code with the more popular TH05. Which, coincidentally, ended up becoming 📝 the reason for getting this done now.
Not that it matters a lot. Ever since we reached 100% PI for TH05, community and backer interest in position independence has dropped to near zero. We just didn't end up seeing the expected large amount of community-made mods that PI was meant to facilitate, and even the 📝 100% decompilation of TH01 changed nothing about that. But that's OK; after all, I do appreciate the business of continually getting commissioned for all the 📝 large-scale mods. Not focusing on PI is also the correct choice for everyone who likes reading these blog posts, as it often means that I can't go that much into detail due to cutting corners and piling up technical debt left and right.

Surprisingly, this only took 1.25 pushes, almost twice as fast as expected. As that's closer to 1 push than it is to 2, I'm OK with releasing it like this – especially since it was originally meant to come out three days ago. 🍋 Unfortunately, it was delayed thanks to surprising website bugs and a certain piece of code that was way more difficult to document than it was to decompile… The next push will have slightly less content in exchange, though.


📝 P0240 and P0241 already covered the final remaining structures, so I only needed to do some superficial RE to prove the remaining numeric literals as either constants or memory addresses. For example, I initially thought I'd have to decompile the dissolve animations in the staff roll, but I only needed to identify a single function pointer type to prove all false positives as screen coordinates there. Now, the TH04 staff roll would be another fast and cheap decompilation, similar to the custom entity types of TH04. (And TH05 as well!)

The one piece of code I did have to decompile was Stage 4's carpet lighting animation, thanks to hex literals that were way too complicated to leave in ASM. And this one probably takes the crown for TH04's worst set of landmines and bloat that still somehow results in no observable bugs or quirks.
This animation starts at frame 1664, roughly 29.5 seconds into the stage, and quickly turns the stage background into a repeated row of dark-red plaid carpet tiles by moving out from the center of the playfield towards the edges. Afterward, the animation repeats with a brighter set of tiles that is then used for the rest of the stage. As I explained 📝 a while ago in the context of TH02, the stage tile and map formats in PC-98 Touhou can't express animations, so all of this needed to be hardcoded in the binary.

A row of the carpet tiles from TH04's Stage 4, at the lowest light levelA row of the carpet tiles from TH04's Stage 4, at the medium light levelA row of the carpet tiles from TH04's Stage 4, at the highest light level
The repeating 384×16 row of carpet tiles at the beginning of TH04's Stage 4 in all three light levels, shown twice for better visibility.

And ZUN did start out making the right decision by only using fully-lit carpet tiles for all tile sections defined in ST03.MAP. This way, the animation can simply disable itself after it completed, letting the rest of the stage render normally and use new tile sections that are only defined for the final light level. This means that the "initial" dark version of the carpet is as much a result of hardcoded tile manipulation as the animation itself.
But then, ZUN proceeded to implement it all by directly manipulating the ring buffer of on-screen tiles. This is the lowest level before the tiles are rendered, and rather detached from the defined content of the 📝 .MAP tile sections. Which leads to a whole lot of problems:

  1. If you decide to do this kind of tile ring modification, it should ideally happen at a very specific point: after scrolling in new tiles into the ring buffer, but before blitting any scrolled or invalidated tiles to VRAM based on the ring buffer. Which is not where ZUN chose to put it, as he placed the call to the stage-specific render function after both of those operations. :zunpet: By the time the function is called, the tile renderer has already blitted a few lines of the fully-lit carpet tiles from the defined .MAP tile section, matching the scroll speed. Fortunately, these are hidden behind the black TRAM cells above and below the playfield…

  2. Still, the code needs to get rid of them before they would become visible. ZUN uses the regular tile invalidation function for this, which will only cause actual redraws on the next frame. Again, the tile rendering call has already happened by the time the Stage 4-specific rendering function gets called.
    But wait, this game also flips VRAM pages between frames to provide a tear-free gameplay experience. This means that the intended redraw of the new tiles actually hits the wrong VRAM page. :tannedcirno: And sure, the code does attempt to invalidate these newly blitted lines every frame – but only relative to the current VRAM Y coordinate that represents the top of the hardware-scrolled screen. Once we're back on the original VRAM page on the next frame, the lines we initially set out to remove could have already scrolled past that point, making it impossible to ever catch up with them in this way.
    The only real "solution": Defining the height of the tile invalidation rectangle at 3× the scroll speed, which ensures that each invalidation call covers 3 frames worth of newly scrolled-in lines. This is not intuitive at all, and requires an understanding of everything I have just written to even arrive at this conclusion. Needless to say that ZUN didn't comprehend it either, and just hardcoded an invalidation height that happened to be enough for the small scroll speeds defined in ST03.STD for the first 30 seconds of the stage.

  3. The effect must consistently modify the tile ring buffer to "fix" any new tiles, overriding them with the intended light level. During the animation, the code not only needs to set the old light level for any tiles that are still waiting to be replaced, but also the new light level for any tiles that were replaced – and ZUN forgot the second part. :zunpet: As a result, newly scrolled-in tiles within the already animated area will "remain" untouched at light level 2 if the scroll speed is fast enough during the transition from light level 0 to 1.

All that means that we only have to raise the scroll speed for the effect to fall apart. Let's try, say, 4 pixels per frame rather than the original 0.25:

By hiding the text RAM layer and revealing what's below the usually opaque black cells above and below the playfield, we can observe all three landmines – 1) and 2) throughout light level 0, and 3) during the transition from level 0 to 1.

All of this could have been so much simpler and actually stable if ZUN applied the tile changes directly onto the .MAP. This is a much more intuitive way of expressing what is supposed to happen to the map, and would have reduced the code to the actually necessary tile changes for the first frame and each individual frame of the animation. It would have still required a way to force these changes into the tile ring buffer, but ZUN could have just used his existing full-playfield redraw functions for that. In any case, there would have been no need for any per-frame tile fixing and redrawing. The CPU cycles saved this way could have then maybe been put towards writing the tile-replacing part of the animation in C++ rather than ASM…


Wow, that was an unreasonable amount of research into a feature that superficially works fine, just because its decompiled code didn't make sense. :onricdennat: To end on a more positive note, here are some minor new discoveries that might actually matter to someone:

Next up: ¾ of a push filled with random boilerplate, finalization, and TH01 code cleanup work, while I finish the preparations for Shuusou Gyoku's OpenGL backend. This month, everything should finally work out as intended: I'll complete both tasks in parallel, ship the former to free up the cap, and then ship the latter once its 5th push is fully funded.

📝 Posted:
🚚 Summary of:
P0242, P0243
Commits:
08352a5...dfa758d, dfa758d...ac33bd2
💰 Funded by:
Yanga
🏷 Tags:

OK, let's decompile TH02's HUD code first, gain a solid understanding of how increasing the score works, and then look at the item system of this game. Should be no big deal, no surprises expected, let's go!

…Yeah, right, that's never how things end up in ReC98 land. :godzun: And so, we get the usual host of newly discovered oddities in addition to the expected insights into the item mechanics. Let's start with the latter:


Onto score tracking then, which only took a single commit to raise another big research question. It's widely known that TH02 grants extra lives upon reaching a score of 1, 2, 3, 5, or 8 million points. But what hasn't been documented is the fact that the game does not stop at the end of the hardcoded extend score array. ZUN merely ends it with a sentinel value of 999,999,990 points, but if the score ever increased beyond this value, the game will interpret adjacent memory as signed 32-bit score values and continue giving out extra lives based on whatever thresholds it ends up finding there. Since the following bytes happen to turn into a negative number, the next extra life would be awarded right after gaining another 10 points at exactly 1,000,000,000 points, and the threshold after that would be 11,114,905,600 points. Without an explicit counterstop, the number of score-based extra lives is theoretically unlimited, and would even continue after the signed 32-bit value overflowed into the negative range. Although we certainly have bigger problems once scores ever reach that point… :tannedcirno:
That said, it seems impossible that any of this could ever happen legitimately. The current high scores of 42,942,800 points on Lunatic and 42,603,800 points on Extra don't even reach 1/20 of ZUN's sentinel value. Without either a graze or a bullet cancel system, the scoring potential in this game is fairly limited, making it unlikely for high scores to ever increase by that additional order of magnitude to end up anywhere near the 1 billion mark.
But can we really be sure? Is this a landmine because it's impossible to ever reach such high scores, or is it a quirk because these extends could be observed under rare conditions, perhaps as the result of other quirks? And if it's the latter, how many of these adjacent bytes do we need to preserve in cleaned-up versions and ports? We'd pretty much need to know the upper bound of high scores within the original stage and boss scripts to tell. This value should be rather easy to calculate in a game with such a simple scoring system, but doing that only makes sense after we RE'd all scoring-related code and could efficiently run such simulations. It's definitely something we'd need to look at before working on this game's debloated version in the far future, which is when the difference between quirks and landmines will become relevant. Still, all that uncertainty just because ZUN didn't restrict a loop to the size of the extend threshold array…


TH02 marks a pivotal point in how the PC-98 Touhou games handle the current score. It's the last game to use a 32-bit variable before the later games would regrettably start using arrays of binary-coded decimals. More importantly though, TH02 is also the first game to introduce the delayed score counting animation, where the displayed score intentionally lags behind and gradually counts towards the real one over multiple frames. This could be implemented in one of two ways:

  1. Keep the displayed score as a separate variable inside the presentation layer, and let it gradually count up to the real score value passed in from the logic layer
  2. Burden the game logic with this presentation detail, and split the score into two variables: One for the displayed score, and another for the delta between that score and the actual one. Newly gained points are first added to the delta variable, and then gradually subtracted from there and added to the real score before being displayed.

And by now, we can all tell which option ZUN picked for the rest of the PC-98 games, even if you don't remember 📝 me mentioning this system last year. 📝 Once again, TH02 immortalized ZUN's initial attempt at the concept, which lacks the abstraction boundaries you'd want for managing this one piece of state across two variables, and messes up the abstractions it does have. In addition to the regular score transfer/render function, the codebase therefore has

And – you guessed it – I wouldn't have mentioned any of this if it didn't result in one bug and one quirk in TH02. The bug resulting from 1) is pretty minor: The function is called when losing a life, and simply stops any active score-counting animation at the value rendered on the frame where the player got hit. This one is only a rendering issue – no points are lost, and you just need to gain 10 more for the rendered value to jump back up to its actual value. You'll probably never notice this one because you're likely busy collecting the single 5-power spawned around Reimu when losing a life, which always awards at least 10 points.

The quirk resulting from 2) is more intriguing though. Without a separate reset of the score delta, the function effectively awards the current delta value as a one-time point bonus, since the same delta will still be regularly transferred to the score on further game frames.
This function is called at the start of every dialog sequence. However, TH02 stops running the regular game loop between the post-boss dialog and the next stage where the delta is reset, so we can only observe this quirk for the pre-boss sequences and the dialog before Mima's form change. Unfortunately, it's not all too exploitable in either case: Each of the pre-boss dialog sequences is preceded by an ungrazeable pellet pattern and followed by multiple seconds of flying over an empty playfield with zero scoring opportunities. By the time the sequence starts, the game will have long transferred any big score delta from max-valued point items. It's slightly better with Mima since you can at least shoot her and use a bomb to keep the delta at a nonzero value, but without a health bar, there is little indication of when the dialog starts, and it'd be long after Mima gave out her last bonus items in any case.
But two of the bosses – that is, Rika, and the Five Magic Stones – are scrolled onto the playfield as part of the stage script, and can also be hit with player shots and bombs for a few seconds before their dialog starts. While I'll only get to cover shot types and bomb damage within the next few TH02 pushes, there is an obvious initial strategy for maximizing the effect of this quirk: Spreading out the A-Type / Wide / High Mobility shot to land as many hits as possible on all Five Magic Stones, while firing off a bomb.

Turns out that the infamous button-mashing mechanics of the player shot are also more complicated than simply pressing and releasing the Shot key at alternating frames. Even this result took way too many takes.

Wow, a grand total of 1,750 extra points! Totally worth wasting a bomb for… yeah, probably not. :onricdennat: But at the very least, it's something that a TAS score run would want to keep in mind. And all that just because ZUN "forgot" a single score_delta = 0; assignment at the end of one function…

And that brings TH02 over the 30% RE mark! Next up: 100% position independence for TH04. If anyone wants to grab the that have now been freed up in the cap: Any small Touhou-related task would be perfect to round out that upcoming TH04 PI delivery.

📝 Posted:
🚚 Summary of:
P0240, P0241
Commits:
be69ab6...40c900f, 40c900f...08352a5
💰 Funded by:
JonathKane, Blue Bolt, [Anonymous]
🏷 Tags:

Well, well. My original plan was to ship the first step of Shuusou Gyoku OpenGL support on the next day after this delivery. But unfortunately, the complications just kept piling up, to a point where the required solutions definitely blow the current budget for that goal. I'm currently sitting on over 70 commits that would take at least 5 pushes to deliver as a meaningful release, and all of that is just rearchitecting work, preparing the game for a not too Windows-specific OpenGL backend in the first place. I haven't even written a single line of OpenGL yet… 🥲
This shifts the intended Big Release Month™ to June after all. Now I know that the next round of Shuusou Gyoku features should better start with the SC-88Pro recordings, which are much more likely to get done within their current budget. At least I've already completed the configuration versioning system required for that goal, which leaves only the actual audio part.

So, TH04 position independence. Thanks to a bit of funding for stage dialogue RE, non-ASCII translations will soon become viable, which finally presents a reason to push TH04 to 100% position independence after 📝 TH05 had been there for almost 3 years. I haven't heard back from Touhou Patch Center about how much they want to be involved in funding this goal, if at all, but maybe other backers are interested as well.
And sure, it would be entirely possible to implement non-ASCII translations in a way that retains the layout of the original binaries and can be easily compared at a binary level, in case we consider translations to be a critical piece of infrastructure. This wouldn't even just be an exercise in needless perfectionism, and we only have to look to Shuusou Gyoku to realize why: Players expected that my builds were compatible with existing SpoilerAL SSG files, which was something I hadn't even considered the need for. I mean, the game is open-source 📝 and I made it easy to build. You can just fork the code, implement all the practice features you want in a much more efficient way, and I'd probably even merge your code into my builds then?
But I get it – recompiling the game yields just yet another build that can't be easily compared to the original release. A cheat table is much more trustworthy in giving players the confidence that they're still practicing the same original game. And given the current priorities of my backers, it'll still take a while for me to implement proof by replay validation, which will ultimately free every part of the community from depending on the original builds of both Seihou and PC-98 Touhou.

However, such an implementation within the original binary layout would significantly drive up the budget of non-ASCII translations, and I sure don't want to constantly maintain this layout during development. So, let's chase TH04 position independence like it's 2020, and quickly cover a larger amount of PI-relevant structures and functions at a shallow level. The only parts I decompiled for now contain calculations whose intent can't be clearly communicated in ASM. Hitbox visualizations or other more in-depth research would have to wait until I get to the proper decompilation of these features.
But even this shallow work left us with a large amount of TH04-exclusive code that had its worst parts RE'd and could be decompiled fairly quickly. If you want to see big TH04 finalization% gains, general TH04 progress would be a very good investment.


The first push went to the often-mentioned stage-specific custom entities that share a single statically allocated buffer. Back in 2020, I 📝 wrongly claimed that these were a TH05 innovation, but the system actually originated in TH04. Both games use a 26-byte structure, but TH04 only allocates a 32-element array rather than TH05's 64-element one. The conclusions from back then still apply, but I also kept wondering why these games used a static array for these entities to begin with. You know what they call an area of memory that you can cleanly repurpose for things? That's right, a heap! :tannedcirno: And absolutely no one would mind one additional heap allocation at the start of a stage, next to the ones for all the sprites and portraits.
However, we are still running in Real Mode with segmented memory. Accessing anything outside a common data segment involves modifying segment registers, which has a nonzero CPU cycle cost, and Turbo C++ 4.0J is terrible at optimizing away the respective instructions. Does this matter? Probably not, but you don't take "risks" like these if you're in a permanent micro-optimization mindset… :godzun:

In TH04, this system is used for:

  1. Kurumi's symmetric bullet spawn rays, fired from her hands towards the left and right edges of the playfield. These are rather infamous for being the last thing you see before 📝 the Divide Error crash that can happen in ZUN's original build. Capped to 6 entities.

  2. The 4 📝 bits used in Marisa's Stage 4 boss fight. Coincidentally also related to the rare Divide Error crash in that fight.

  3. Stage 4 Reimu's spinning orbs. Note how the game uses two different sets of sprites just to have two different outline colors. This was probably better than messing with the palette, which can easily cause unintended effects if you only have 16 colors to work with. Heck, I have an entire blog post tag just to highlight these cases. Capped to the full 32 entities.

  4. The chasing cross bullets, seen in Phase 14 of the same Stage 6 Yuuka fight. Featuring some smart sprite work, making use of point symmetry to achieve a fluid animation in just 4 frames. This is good-code in sprite form. Capped to 31 entities, because the 32nd custom entity during this fight is defined to be…

  5. The single purple pulsating and shrinking safety circle, seen in Phase 4 of the same fight. The most interesting aspect here is actually still related to the cross bullets, whose spawn function is wrongly limited to 32 entities and could theoretically overwrite this circle. :zunpet: This is strictly landmine territory though:

    • Yuuka never uses these bullets and the safety circle simultaneously
    • She never spawns more than 24 cross bullets
    • All cross bullets are fast enough to have left the screen by the time Yuuka restarts the corresponding subpattern
    • The cross bullets spawn at Yuuka's center position, and assign its Q12.4 coordinates to structure fields that the safety circle interprets as raw pixels. The game does try to render the circle afterward, but since Yuuka's static position during this phase is nowhere near a valid pixel coordinate, it is immediately clipped.

  6. The flashing lines seen in Phase 5 of the Gengetsu fight, telegraphing the slightly random bullet columns.

    The spawn column lines in the TH05 Gengetsu fight, in the first of their two flashing colors.The spawn column lines in the TH05 Gengetsu fight, in the second of their two flashing colors.

These structures only took 1 push to reverse-engineer rather than the 2 I needed for their TH05 counterparts because they are much simpler in this game. The "structure" for Gengetsu's lines literally uses just a single X position, with the remaining 24 bytes being basically padding. The only minor bug I found on this shallow level concerns Marisa's bits, which are clipped at the right and bottom edges of the playfield 16 pixels earlier than you would expect:


The remaining push went to a bunch of smaller structures and functions:


To top off the second push, we've got the vertically scrolling checkerboard background during the Stage 6 Yuuka fight, made up of 32×32 squares. This one deserves a special highlight just because of its needless complexity. You'd think that even a performant implementation would be pretty simple:

  1. Set the GRCG to TDW mode
  2. Set the GRCG tile to one of the two square colors
  3. Start with Y as the current scroll offset, and X as some indicator of which color is currently shown at the start of each row of squares
  4. Iterate over all lines of the playfield, filling in all pixels that should be displayed in the current color, skipping over the other ones
  5. Count down Y for each line drawn
  6. If Y reaches 0, reset it to 32 and flip X
  7. At the bottom of the playfield, change the GRCG tile to the other color, and repeat with the initial value of X flipped

The most important aspect of this algorithm is how it reduces GRCG state changes to a minimum, avoiding the costly port I/O that we've identified time and time again as one of the main bottlenecks in TH01. With just 2 state variables and 3 loops, the resulting code isn't that complex either. A naive implementation that just drew the squares from top to bottom in a single pass would barely be simpler, but much slower: By changing the GRCG tile on every color, such an implementation would burn a low 5-digit number of CPU cycles per frame for the 12×11.5-square checkerboard used in the game.
And indeed, ZUN retained all important aspects of this algorithm… but still implemented it all in ASM, with a ridiculous layer of x86 segment arithmetic on top? :zunpet: Which blows up the complexity to 4 state variables, 5 nested loops, and a bunch of constants in unusual units. I'm not sure what this code is supposed to optimize for, especially with that rather questionable register allocation that nevertheless leaves one of the general-purpose registers unused. :onricdennat: Fortunately, the function was still decompilable without too many code generation hacks, and retains the 5 nested loops in all their goto-connected glory. If you want to add a checkerboard to your next PC-98 demo, just stick to the algorithm I gave above.
(Using a single XOR for flipping the starting X offset between 32 and 64 pixels is pretty nice though, I have to give him that.)


This makes for a good occasion to talk about the third and final GRCG mode, completing the series I started with my previous coverage of the 📝 RMW and 📝 TCR modes. The TDW (Tile Data Write) mode is the simplest of the three and just writes the 8×1 GRCG tile into VRAM as-is, without applying any alpha bitmask. This makes it perfect for clearing rectangular areas of pixels – or even all of VRAM by doing a single memset():

// Set up the GRCG in TDW mode.
outportb(0x7C, 0x80);

// Fill the tile register with color #7 (0111 in binary).
outportb(0x7E, 0xFF); // Plane 0: (B): (********)
outportb(0x7E, 0xFF); // Plane 1: (R): (********)
outportb(0x7E, 0xFF); // Plane 2: (G): (********)
outportb(0x7E, 0x00); // Plane 3: (E): (        )

// Set the 32 pixels at the top-left corner of VRAM to the exact contents of
// the tile register, effectively repeating the tile 4 times. In TDW mode, the
// GRCG ignores the CPU-supplied operand, so we might as well just pass the
// contents of a register with the intended width. This eliminates useless load
// instructions in the compiled assembly, and even sort of signals to readers
// of this code that we do not care about the source value.
*reinterpret_cast<uint32_t far *>(MK_FP(0xA800, 0)) = _EAX;

// Fill the entirety of VRAM with the GRCG tile. A simple C one-liner that will
// probably compile into a single `REP STOS` instruction. Unfortunately, Turbo
// C++ 4.0J only ever generates the 16-bit `REP STOSW` here, even when using
// the `__memset__` intrinsic and when compiling in 386 mode. When targeting
// that CPU and above, you'd ideally want `REP STOSD` for twice the speed.
memset(MK_FP(0xA800, 0), _AL, ((640 / 8) * 400));

However, this might make you wonder why TDW mode is even necessary. If it's functionally equivalent to RMW mode with a CPU-supplied bitmask made up entirely of 1 bits (i.e., 0xFF, 0xFFFF, or 0xFFFFFFFF), what's the point? The difference lies in the hardware implementation: If all you need to do is write tile data to VRAM, you don't need the read and modify parts of RMW mode which require additional processing time. The PC-9801 Programmers' Bible claims a speedup of almost 2× when using TDW mode over equivalent operations in RMW mode.
And that's the only performance claim I found, because none of these old PC-98 hardware and programming books did any benchmarks. Then again, it's not too interesting of a question to benchmark either, as the byte-aligned nature of TDW blitting severely limits its use in a game engine anyway. Sure, maybe it makes sense to temporarily switch from RMW to TDW mode if you've identified a large rectangular and byte-aligned section within a sprite that could be blitted without a bitmask? But the necessary identification work likely nullifies the performance gained from TDW mode, I'd say. In any case, that's pretty deep micro-optimization territory. Just use TDW mode for the few cases it's good at, and stick to RMW mode for the rest.

So is this all that can be said about the GRCG? Not quite, because there are 4 bits I haven't talked about yet…


And now we're just 5.37% away from 100% position independence for TH04! From this point, another 2 pushes should be enough to reach this goal. It might not look like we're that close based on the current estimate, but a big chunk of the remaining numbers are false positives from the player shot control functions. Since we've got a very special deadline to hit, I'm going to cobble these two pushes together from the two current general subscriptions and the rest of the backlog. But you can, of course, still invest in this goal to allow the existing contributions to go to something else.
… Well, if the store was actually open. :thonk: So I'd better continue with a quick task to free up some capacity sooner rather than later. Next up, therefore: Back to TH02, and its item and player systems. Shouldn't take that long, I'm not expecting any surprises there. (Yeah, I know, famous last words…)

📝 Posted:
🚚 Summary of:
P0238, P0239
Commits:
(Website) 4698397...edf2926, c5e51e6...P0239
💰 Funded by:
Ember2528
🏷 Tags:

:stripe: Stripe is now properly integrated into this website as an alternative to PayPal! Now, you can also financially support the project if PayPal doesn't work for you, or if you prefer using a provider out of Stripe's greater variety. It's unfortunate that I had to ship this integration while the store is still sold out, but the Shuusou Gyoku OpenGL backend has turned out way too complicated to be finished next to these two pushes within a month. It will take quite a while until the store reopens and you all can start using Stripe, so I'll just link back to this blog post when it happens.

Integrating Stripe wasn't the simplest task in the world either. At first, the Checkout API seems pretty friendly to developers: The entire payment flow is handled on the backend, in the server language of your choice, and requires no frontend JavaScript except for the UI feedback code you choose to write. Your backend API endpoint initiates the Stripe Checkout session, answers with a redirect to Stripe, and Stripe then sends a redirect back to your server if the customer completed the payment. Superficially, this server-based approach seems much more GDPR-friendly than PayPal, because there are no remote scripts to obtain consent for. In reality though, Stripe shares much more potential personal data about your credit card or bank account with a merchant, compared to PayPal's almost bare minimum of necessary data. :thonk:
It's also rather annoying how the backend has to persist the order form information throughout the entire Checkout session, because it would otherwise be lost if the server restarts while a customer is still busy entering data into Stripe's Checkout form. Compare that to the PayPal JavaScript SDK, which only POSTs back to your server after the customer completed a payment. In Stripe's case, more JavaScript actually only makes the integration harder: If you trigger the initial payment HTTP request from JavaScript, you will have to improvise a bit to avoid the CORS error when redirecting away to a different domain.

But sure, it's all not too bad… for regular orders at least. With subscriptions, however, things get much worse. Unlike PayPal, Stripe kind of wants to stay out of the way of the payment process as much as possible, and just be a wrapper around its supported payment methods. So if customers aren't really meant to register with Stripe, how would they cancel their subscriptions? :thonk:
Answer: Through the… merchant? Which I quite dislike in principle, because why should you have to trust me to actually cancel your subscription after you requested it? It also means that I probably should add some sort of UI for self-canceling a Stripe subscription, ideally without adding full-blown user accounts. Not that this solves the underlying trust issue, but it's more convenient than contacting me via email or, worse, going through your bank somehow. Here is how my solution works:

I might have gone a bit overboard with the crypto there, but I liked the idea of not storing any of the Stripe session IDs in the server database. It's not like that makes the system more complex anyway, and it's nice to have a separate confirmation step before canceling a subscription.

But even that wasn't everything I had to keep in mind here. Once you switch from test to production mode for the final tests, you'll notice that certain SEPA-based payment providers take their sweet time to process and activate new subscriptions. The Checkout session object even informs you about that, by including a payment status field. Which initially seems just like another field that could indicate hacking attempts, but treating it as such and rejecting any unpaid session can also reject perfectly valid subscriptions. I don't want all this control… 🥲
Instead, all I can do in this case is to tell you about it. In my test, the Stripe dashboard said that it might take days or even weeks for the initial subscription transaction to be confirmed. In such a case, the respective fraction of the cap will unfortunately need to remain red for that entire time.

And that was 1½ pushes just to replicate the basic functionality of a simple PayPal integration with the simplest type of Stripe integration. On the architectural site, all the necessary refactoring work made me finally upgrade my frontend code to TypeScript at least, using the amazing esbuild to handle transpilation inside the server binary. Let's see how long it will now take for me to upgrade to SCSS…


With the new payment options, it makes sense to go for another slight price increase, from up to per push. The amount of taxes I have to pay on this income is slowly becoming significant, and the store has been selling out almost immediately for the last few months anyway. If demand remains at the current level or even increases, I plan to gradually go up to by the end of the year.
📝 As 📝 usual, I'm going to deliver existing orders in the backlog at the value they were originally purchased at. Due to the way the cap has to be calculated, these contributions now appear to have increased in value by a rather awkward 13.33%.


This left ½ of a push for some more work on the TH01 Anniversary Edition. Unfortunately, this was too little time for the grand issue of removing byte-aligned rendering of bigger sprites, which will need some additional blitting performance research. Instead, I went for a bunch of smaller bugfixes:

The final point, however, raised the question of what we're now going to do about 📝 a certain issue in the 地獄/Jigoku Bad Ending. ZUN's original expensive way of switching the accessed VRAM page was the main reason behind the lag frames on slower PC-98 systems, and search-replacing the respective function calls would immediately get us to the optimized version shown in that blog post. But is this something we actually want? If we wanted to retain the lag, we could surely preserve that function just for this one instance…
The discovery of this issue predates the clear distinction between bloat, quirks, and bugs, so it makes sense to first classify what this issue even is. The distinction comes all down to observability, which I defined as changes to rendered frames between explicitly defined frame boundaries. That alone would be enough to categorize any cause behind lag frames as bloat, but it can't hurt to be more explicit here.

Therefore, I now officially judge observability in terms of an infinitely fast PC-98 that can instantly render everything between two explicitly defined frames, and will never add additional lag frames. If we plan to port the games to faster architectures that aren't bottlenecked by disappointing blitter chips, this is the only reasonable assumption to make, in my opinion: The minimum system requirements in the games' README files are minimums, after all, not recommendations. Chasing the exact frame drop behavior that ZUN must have experienced during the time he developed these games can only be a guessing game at best, because how can we know which PC-98 model ZUN actually developed the games on? There might even be more than one model, especially when it comes to TH01 which had been in development for at least two years before ZUN first sold it. It's also not like any current PC-98 emulator even claims to emulate the specific timing of any existing model, and I sure hope that nobody expects me to import a bunch of bulky obsolete hardware just to count dropped frames.

That leaves the tearing, where it's much more obvious how it's a bug. On an infinitely fast PC-98, the ドカーン frame would never be visible, and thus falls into the same category as the 📝 two unused animations in the Sariel fight. With only a single unconditional 2-frame delay inside the animation loop, it becomes clear that ZUN intended both frames of the animation to be displayed for 2 frames each:

No tearing, and 34 frames in total for the first of the two instances of this animation.

:th01: TH01 Anniversary Edition, version P0239 2023-05-01-th01-anniv.zip

Next up: Taking the oldest still undelivered push and working towards TH04 position independence in preparation for multilingual translations. The Shuusou Gyoku OpenGL backend shouldn't take that much longer either, so I should have lots of stuff coming up in May afterward.

📝 Posted:
🚚 Summary of:
P0235, P0236, P0237
Commits:
e7a9262...62c4b7f, 62c4b7f...7fa9038, 7fa9038...c5e51e6
💰 Funded by:
Ember2528, Yanga
🏷 Tags:

So, TH02! Being the only game whose main binary hadn't seen any dedicated attention ever, we get to start the TH02-related blog posts at the very beginning with the most foundational pieces of code. The stage tile system is the best place to start here: It not only blocks every entity that is rendered on top of these tiles, but is curiously placed right next to master.lib code in TH02, and would need to be separated out into its own translation unit before we can do the same with all the master.lib functions.

In late 2018, I already RE'd 📝 TH04's and TH05's stage tile implementation, but haven't properly documented it on this blog yet, so this post is also going to include the details that are unique to those games. On a high level, the stage tile system works identically in all three games:

The differences between the three games can best be summarized in a table:

:th02: TH02 :th04: TH04 :th05: TH05
Tile image file extension .MPN
Tile section format .MAP
Tile section order defined as part of .DT1 .STD
Tile section index format 0-based ID 0-based ID × 2
Tile image index format Index between 0 and 100, 1 byte VRAM offset in tile source area, 2 bytes
Scroll speed control Hardcoded Part of the .STD format, defined per referenced tile section
Redraw granularity Full tiles (16×16) Half tiles (16×8)
Rows per tile section 8 5
Maximum number of tile sections 16 32
Lowest number of tile sections used 5 (Stage 3 / Extra) 8 (Stage 6) 11 (Stage 2 / 4)
Highest number of tile sections used 13 (Stage 4) 19 (Extra) 24 (Stage 3)
Maximum length of a map 320 sections (static buffer) 256 sections (format limitation)
Shortest map 14 sections (Stage 5) 20 sections (Stage 5) 15 sections (Stage 2)
Longest map 143 sections (Stage 4) 95 sections (Stage 4) 40 sections (Stage 1 / 4 / Extra)

The most interesting part about stage tiles is probably the fact that some of the .MAP files contain unused tile sections. 👀 Many of these are empty, duplicates, or don't really make sense, but a few are unique, fit naturally into their respective stage, and might have been part of the map during development. In TH02, we can find three unused sections in Stage 5:

Section 0 of TH02's STAGE4.MAPSection 1 of TH02's STAGE4.MAPSection 2 of TH02's STAGE4.MAPSection 3 of TH02's STAGE4.MAPSection 4 of TH02's STAGE4.MAPSection 5 of TH02's STAGE4.MAPSection 6 of TH02's STAGE4.MAPSection 7 of TH02's STAGE4.MAP
The non-empty tile sections defined in TH02's STAGE4.MAP, showing off three unused ones.
These unused tile sections are much more common in the later games though, where we can find them in TH04's Stage 3, 4, and 5, and TH05's Stage 1, 2, and 4. I'll document those once I get to finalize the tile rendering code of these games, to leave some more content for that blog post. TH04/TH05 tile code would be quite an effective investment of your money in general, as most of it is identical across both games. Or how about going for a full-on PC-98 Touhou map viewer and editor GUI?


Compared to TH04 and TH05, TH02's stage tile code definitely feels like ZUN was just starting to understand how to pull off smooth vertical scrolling on a PC-98. As such, it comes with a few inefficiencies and suboptimal implementation choices:

Even though this was ZUN's first attempt at scrolling tiles, he already saw it fit to write most of the code in assembly. This was probably a reaction to all of TH01's performance issues, and the frame rate reduction workarounds he implemented to keep the game from slowing down too much in busy places. "If TH01 was all C++ and slow, TH02 better contain more ASM code, and then it will be fast, right?" :zunpet:
Another reason for going with ASM might be found in the kind of documentation that may have been available to ZUN. Last year, the PC-98 community discovered and scanned two new game programming tutorial books from 1991 (1, 2). Their example code is not only entirely written in assembly, but restricts itself to the bare minimum of x86 instructions that were available on the 8086 CPU used by the original PC-9801 model 9 years earlier. Such code is not only suboptimal on the 486, but can often be actually worse than what your C++ compiler would generate. TH02 is where the trend of bad hand-written ASM code started, and it 📝 only intensified in ZUN's later games. So, don't copy code from these books unless you absolutely want to target the earlier 8086 and 286 models. Which, 📝 as we've gathered from the recent blitting benchmark results, are not all too common among current real-hardware owners.
That said, all that ASM code really only impacts readability and maintainability. Apart from the aforementioned issues, the algorithms themselves are mostly fine – especially since most EGC and GRCG operations are decently batched this time around, in contrast to TH01.


Luckily, the tile functions merely use inline assembly within a typical C function and can therefore be at least part of a C++ source file, even if the result is pretty ugly. This time, we can actually be sure that they weren't written directly in a .ASM file, because they feature x86 instruction encodings that can only be generated with Turbo C++ 4.0J's inline assembler, not with TASM. The same can't unfortunately be said about the following function in the same segment, which marks the tiles covered by the spark sprites for redrawing. In this one, it took just one dumb hand-written ASM inconsistency in the function's epilog to make the entire function undecompilable.
The standard x86 instruction sequence to set up a stack frame in a function prolog looks like this:

PUSH	BP
MOV 	BP, SP
SUB 	SP, ?? ; if the function needs the stack for local variables
When compiling without optimizations, Turbo C++ 4.0J will replace this sequence with a single ENTER instruction. That one is two bytes smaller, but much slower on every x86 CPU except for the 80186 where it was introduced.
In functions without local variables, BP and SP remain identical, and a single POP BP is all that's needed in the epilog to tear down such a stack frame before returning from the function. Otherwise, the function needs an additional MOV SP, BP instruction to pop all local variables. With x86 being the helpful CISC architecture that it is, the 80186 also introduced the LEAVE instruction to perform both tasks. Unlike ENTER, this single instruction is faster than the raw two instructions on a lot of x86 CPUs (and even current ones!), and it's always smaller, taking up just 1 byte instead of 3.
So what if you use LEAVE even if your function doesn't use local variables? :thonk: The fact that the instruction first does the equivalent of MOV SP, BP doesn't matter if these registers are identical, and who cares about the additional CPU cycles of LEAVE compared to just POP BP, right? So that's definitely something you could theoretically do, but not something that any compiler would ever generate.

And so, TH02 MAIN.EXE decompilation already hits the first brick wall after two pushes. Awesome! :godzun: Theoretically, we could slowly mash through this wall using the 📝 code generator. But having such an inconsistency in the function epilog would mean that we'd have to keep Turbo C++ 4.0J from emitting any epilog or prolog code so that we can write our own. This means that we'd once again have to hide any use of the SI and DI registers from the compiler… and doing that requires code generation macros for 22 of the 49 instructions of the function in question, almost none of which we currently have. So, this gets quite silly quite fast, especially if we only need to do it for one single byte.

Instead, wouldn't it be much better if we had a separate build step between compile and link time that allowed us to replicate mistakes like these by just patching the compiled .OBJ files? These files still contain the names of exported functions for linking, which would allow us to look up the code of a function in a robust manner, navigate to specific instructions using a disassembler, replace them, and write the modified .OBJ back to disk before linking. Such a system could then naturally expand to cover all other decompilation issues, culminating in a full-on optimizer that could even recreate ZUN's self-modifying code. At that point, we would have sealed away all of ZUN's ugly ASM code within a separate build step, and could finally decompile everything into readable C++.

Pulling that off would require a significant tooling investment though. Patching that one byte in TH02's spark invalidation function could be done within 1 or 2 pushes, but that's just one issue, and we currently have 32 other .ASM files with undecompilable code. Also, note that this is fundamentally different from what we're doing with the debloated branch and the Anniversary Editions. Mistake patching would purely be about having readable code on master that compiles into ZUN's exact binaries, without fixing weird code. The Anniversary Editions go much further and rewrite such code in a much more fundamental way, improving it further than mistake patching ever could.
Right now, the Anniversary Editions seem much more popular, which suggests that people just want 100% RE as fast as possible so that I can start working on them. In that case, why bother with such undecompilable functions, and not just leave them in raw and unreadable x86 opcode form if necessary… :tannedcirno: But let's first see how much backer support there actually is for mistake patching before falling back on that.

The best part though: Once we've made a decision and then covered TH02's spark and particle systems, that was it, and we will have already RE'd all ZUN-written PC-98-specific blitting code in this game. Every further sprite or shape is rendered via master.lib, and is thus decently abstracted. Guess I'll need to update 📝 the assessment of which PC-98 Touhou game is the easiest to port, because it sure isn't TH01, as we've seen with all the work required for the first Anniversary Edition build.


Until then, there are still enough parts of the game that don't use any of the remaining few functions in the _TEXT segment. Previously, I mentioned in the 📝 status overview blog post that TH02 had a seemingly weird sprite system, but the spark and point popup (〇一二三四五六七八九十×÷) structures showed that the game just stores the current and previous position of its entities in a slightly different way compared to the rest of PC-98 Touhou. Instead of having dedicated structure fields, TH02 uses two-element arrays indexed with the active VRAM page. Same thing, and such a pattern even helps during RE since it's easy to spot once you know what to look for.
There's not much to criticize about the point popup system, except for maybe a landmine that causes sprite glitches when trying to display more than 99,990 points. Sadly, the final push in this delivery was rounded out by yet another piece of code at the opposite end of the quality spectrum. The particle and smear effects for Reimu's bomb animations consist almost entirely of assembly bloat, which would just be replaced with generic calls to the generic blitter in this game's future Anniversary Edition.

If I continue to decompile TH02 while avoiding the brick wall, items would be next, but they probably require two pushes. Next up, therefore: Integrating Stripe as an alternative payment provider into the order form. There have been at least three people who reported issues with PayPal, and Stripe has been working much better in tests. In the meantime, here's a temporary Stripe order link for everyone. This one is not connected to the cap yet, so please make sure to stay within whatever value is currently shown on the front page – I will treat any excess money as donations. :onricdennat: If there's some time left afterward, I might also add some small improvements to the TH01 Anniversary Edition.

📝 Posted:
🏷 Tags:

Turns out I was not quite done with the TH01 Anniversary Edition yet. You might have noticed some white streaks at the beginning of Sariel's second form, which are in fact a bug that I accidentally added to the initial release. :tannedcirno:
These can be traced back to a quirk I wasn't aware of, and hadn't documented so far. When defeating Sariel's first form during a pattern that spawns pellets, it's likely for the second form to start with additional pellets that resemble the previous pattern, but come out of seemingly nowhere. This shouldn't really happen if you look at the code: Nothing outside the typical pattern code spawns new pellets, and all existing ones are reset before the form transition…

Except if they're currently showing the 10-frame delay cloud animation , activated for all pellets during the symmetrical radial 2-ring pattern in Phase 2 and left activated for the rest of the fight. These pellets will continue their animation after the transition to the second form, and turn into regular pellets you have to dodge once their animation completed.

By itself, this is just one more quirk to keep in mind during refactoring. It only turned into a bug in the Anniversary Edition because the game tracks the number of living pellets in a separate counter variable. After resetting all pellets, this counter is simply set to 0, regardless of any delay cloud pellets that may still be alive, and it's merely incremented or decremented when pellets are spawned or leave the playfield. :zunpet:
In the original game, this counter is only used as an optimization to skip spawning new pellets once the cap is reached. But with batched EGC-accelerated unblitting, it also makes sense to skip the rather costly setup and shutdown of the EGC if no pellets are active anyway. Except if the counter you use to check for that case can be 0 even if there are pellets alive, which consequently don't get unblitted… :onricdennat:
There is an optimal fix though: Instead of unconditionally resetting the living pellet counter to 0, we decrement it for every pellet that does get reset. This preserves the quirk and gives us a consistently correct counter, allowing us to still skip every unnecessary loop over the pellet array.

Cutting out the lengthy defeat animation makes it easier to see where the additional pellets come from.
Cutting out the lengthy defeat animation makes it easier to see where the additional pellets come from. Also, note how regular unblitting resumes once the first pellet gets clipped at the top of the playfield – the living pellet counter then gets decremented to -1, and who uses <= rather than == on a seemingly unsigned counter, right?
Cutting out the lengthy defeat animation makes it easier to see where the additional pellets come from.

Ultimately, this was a harmless bug that didn't affect gameplay, but it's still something that players would have probably reported a few more times. So here's a free bugfix:

:th01: TH01 Anniversary Edition, version P0234-1 2023-03-14-th01-anniv.zip

Thanks to mu021 for reporting this issue and providing helpful videos to identify the cause!

📝 Posted:
🚚 Summary of:
P0229, P0230, P0231, P0232, P0233, P0234
Commits:
6370f96...d535d87, d535d87...ca523b4, ca523b4...05a49b9, f7ef7f8...abeaf85, abeaf85...dbc5b51, dd2265c...12f29c6
💰 Funded by:
Ember2528, [Anonymous]
🏷 Tags:

128 commits! Who would have thought that the ideal first release of the TH01 Anniversary Edition would involve so much maintenance, and raise so many research questions? It's almost as if the real work only starts after the 100% finalization mark… Once again, I had to steal some funding from the reserved JIS trail word pushes to cover everything I liked to research, which means that the next towards the anything goal will repay this debt. Luckily, this doesn't affect any immediate plans, as I'll be spending March with tasks that are already fully funded.

So, how did this end up so massive? The list of things I originally set out to do was pretty short:

  1. Build entire game into single executable
  2. Fix rendering issues in the one or two most important parts of the game for a good initial impression

But even the first point already started with tons of little cleanup commits. A part of them can definitely be blamed on the rush to hit the 100% decompilation mark before the 25th anniversary last August. However, all the structural changes that I can't commit to master reveal how much of a mess the TH01 codebase actually is.
Merging the executables is mainly difficult because of all the inconsistencies between REIIDEN.EXE and FUUIN.EXE. The worst parts can be found in the REYHI*.DAT format code and the High Score menu, but the little things are just as annoying, like how the current score is an unsigned variable in REIIDEN.EXE, but a signed one in FUUIN.EXE. :zunpet: If it takes me this long and this many commits just to sort out all of these issues, it's no wonder that the only thing I've seen being done with this codebase since TH01's 100% decompilation was a single porting attempt that ended in a rather quick ragequit.
So why are we merging the executables in preparation for the Anniversary Edition, and not waiting with it until we start doing ports?

The game actually is so bloated that the combined binary ended up smaller than the original REIIDEN.EXE. If all you see are the file sizes of the original three executables, this might look like a pretty impressive feat. Like, how can we possibly get 407,812 bytes into less than 238,612 bytes, without using compression?
If you've ever looked at the linker map though, it's not at all surprising. Excluding the aforementioned inconsistencies that are hard to quantify, OP.EXE and FUUIN.EXE only feature 5,767 and 6,475 bytes of unique code and data, respectively. All other code in these binaries is already part of REIIDEN.EXE, with more than half of the size coming from the Borland C++ runtime. The single worst offender here is the C++ exception handler that Borland forces onto every non-.COM binary by default, which alone adds 20,512 bytes even if your binary doesn't use C++ exceptions.
On a more hilarious note, this single line is responsible for pulling another unnecessary 14,242 bytes into OP.EXE and FUUIN.EXE. This floating-point multiplication is completely unnecessary in this context because all possible parameters are integers, but it's enough for Turbo C++ and TLINK to pull in the entire x87 FPU emulation machinery. These two binaries don't even draw lines, but since this function is part of the general graphics code translation unit and contains other functions that these binaries do need, TLINK links in the entire thing. Maybe, multiple executables aren't the best choice either if you use a linker that can't do dead code elimination…

Since the 📝 Orb's physics do turn the entire precision of a double variable into gameplay effects, it's not feasible to ever get rid of all FPU code in TH01. The exception handler, however, can be removed, which easily brings the combined binary below the size of the original REIIDEN.EXE. Compiling all code with a single set of compiler optimization flags, including the more x86-friendly pascal calling convention, then gets us a few more KB on top. As does, of course, removing unused code: The only remaining purpose of features such as 📝 resident palettes is to potentially make porting more difficult for anyone who doesn't immediately realize that nothing in the game uses these functions.
Technically, all unused code would be bloat, but for now, I'm keeping the parts that may tell stories about the game's development history (such as unused effects or the 📝 mouse cursor), or that might help with debugging. Even with that in mind, I've only scratched the surface when it comes to bloat removal, and the binary is only going to get smaller from here. A lot smaller.

If only we now could start MDRV98 from this new combined binary, we wouldn't need a second batch file either…


Which brings us to the first big research question of this delivery. Using the C spawn() function works fine on this compiler, so spawn("MDRV98.COM") would be all we need to do, right? Except that the game crashes very soon after that subprocess returned. :thonk:
So it's not going to be that easy if the spawned process is a TSR. But why should this be a problem? Let's take a look at the DOS heap, and how DOS lays out processes in conventional memory if we launch the game regularly through GAME.BAT:

The rough layout of the DOS heap when launching TH01 from GAME.BAT.

The batch file starts MDRV98 first, which will therefore end up below the game in conventional memory. This is perfect for a TSR: The program can resize itself arbitrarily before returning to DOS, and the rest of memory will be left over for the game. If we assume such a layout, a DOS program can implement a custom memory allocator in a very simple way, as it only has to search for free memory in one direction – and this is exactly how Borland implemented the C heap for functions like malloc() and free(), and the C++ new and delete operators.
But if we spawn MDRV98 after starting TH01, well…

MDRV98 will spawn in the next free memory location, allocate itself, return to TH01… which suddenly finds its C heap blocked from growing. As a result, the next big allocation will immediately fail with a rather misleading "out of memory" error.

So, what can we do about this? Still in a bloat removal mindset, my gut reaction was to just throw out Borland's C heap implementation, and replace it with a very thin wrapper around the DOS heap as managed by INT 21h, AH=48h/49h/4Ah. Like, why did these DOS compilers even bother with a custom allocator in the first place if DOS already comes with a perfectly fine native one? Using the native allocator would completely erase the distinction between TSR memory and game memory, and inherently allow the game to allocate beyond MDRV98.
I did in fact implement this, and noticed even more benefits:

Ultimately though, the drawbacks became too significant. Most of them are related to the PC-98 Touhou games only ever creating a single DOS process, even though they contain multiple executables. Switching executables is done via exec(), which resizes a program's main allocation to match the new binary and then overwrites the old program image with the new one. If you've ever wondered why DOSBox-X only ever shows OP as the active process name in the title bar, you now know why. As far as DOS is concerned, it's still the same OP.EXE process rooted at the same segment, and exec() doesn't bother rewriting the name either. Most importantly though, this is how REIIDEN.EXE can launch into another REIIDEN.EXE process even if there are less than 238,612 bytes free when exec() is called, and without consuming more memory for every successive binary.
For now, ANNIV.EXE still re-exec()s itself at every point where the original game did, as ZUN's original code really depends on being reinitialized at boss and scene boundaries. The resulting accidental semi-hot reloading is also a useful property to retain during development.
So why is the DOS heap a bad idea for regular game allocation after all?

I could release this DOS heap wrapper in unused form for another push if anyone's interested, but for now, I'm pretty happy with not actually using it in the games. Instead, let's stay with the Borland C heap, and find a way to push MDRV98 to the very top of conventional RAM. Like this:

Which is much easier said than done. It would be nice if we could just use the last fit allocation strategy here, but .COM executables always receive all free memory by default anyway, which eliminates any difference between the strategies.
But we can still change memory itself. So let's temporarily claim all remaining free memory, minus the exact amount we need for MDRV98, for our process. Then, the only remaining free space to spawn MDRV98 is at the exact place where we want it to be:

Obviously, we release all the additional memory after spawning MDRV98.

Now we only need to know how much memory to not temporarily allocate. First, we need to replicate the assumption that MDRV98's -M7 command-line parameter corresponds to a resident size of 23,552 bytes. This is not as bad as it seems, because the -M parameter explicitly has a KiB unit, and we can nicely abstract it away for the API.
The (env.) block though? Its minimum size equals the combined length of all environment variables passed to the process, but its maximum size is… not limited at all?! As in, DOS implementations can add and have historically added more free space because some programs insisted on storing their own new environment variables in this exact segment. DOSBox and DOSBox-X follow this tradition by providing a configuration option for the additional amount of environment space, with the latter adding 1024 additional bytes by default, y'know, just in case someone wants to compile FreeDOS on a slow emulator. It's not even worth sending a bug report for this specific case, because it's only a symptom of the fact that unexpectedly large program environment blocks can and will happen, and are to be expected in DOS land.
So thanks to this cruel joke, it's technically impossible to achieve what we want to do there. Hooray! The only thing we can kind of do here is an educated guess: Sum up the length of all environment variables in our environment block, compare that length against the allocated size of the block, and assume that the MDRV98 process will get as much additional memory as our process got. 🤷

The remaining hurdles came courtesy of some Borland C runtime implementation details. You would think that the temporary reallocation could even be done in pure C using the sbrk(), coreleft(), and brk() functions, but all values passed to or returned from these functions are inaccurate because they don't factor in the aforementioned KiB padding to the underlying DOS memory block. So we have to directly use the DOS syscalls after all. Which at least means that learning about them wasn't completely useless…
The final issue is caused inside Borland's spawn() implementation. The environment block for the child process is built out of all the strings reachable from C's environ pointer, which is what that FreeDOS build process should have used. Coalescing them into a single buffer involves yet another C heap allocation… and since we didn't report our DOS memory block manipulation back to the C heap, the malloc() call might think it needs to request more memory from DOS. This resets the DOS memory block back to its intended level, undoing our manipulation right before the actual INT 21h, AH=4Bh EXEC syscall. Or in short:

Manipulate DOS heap ➜ spawn() call ➜ _LoadProg() ➜ allocate and prepare environment block ➜ _spawn() ➜ DOS EXEC syscall

The obvious solution: Replace _LoadProg(), implement the coalescing ourselves, and do it before the heap manipulation. Fortunately, Borland's internal low-level _spawn() function is not static, so we can call it ourselves whenever we want to:

Allocate and prepare environment block ➜ manipulate DOS heap ➜ _spawn() call ➜ EXEC syscall

So yes, launching MDRV98 from C can be done, but it involves advanced witchcraft and is completely ridiculous. :tannedcirno: Launching external sound drivers from a batch file is the right way of doing things.
Fortunately, you don't have to rely on this auto-launching feature. You can still launch DEBLOAT.EXE or ANNIV.EXE from a batch file that launched MDRV98.COM before, and the binaries will detect this case and skip the attempt of launching MDRV98 from C. It's unlikely that my heuristic will ever break, but I definitely recommend replicating GAME.BAT just to be completely sure – especially for user-friendly repacks that don't want to include the original game anyway.
This is also why ANNIV.EXE doesn't launch ZUNSOFT.COM: The "correct" and stable way to launch ANNIV.EXE still involves a batch file, and I would say that expecting people to remove ZUNSOFT.COM from that file is worse than not playing the animation. It's certainly a debate we can have, though.


This deep dive into memory allocation revealed another previously undocumented bug in the original game. The RLE decompression code for the 東方靈異.伝 packfile contains two heap overflows, which are actually triggered by SinGyoku's BOSS1_3.BOS and Konngara's BOSS8_1.BOS. They only do not immediately crash the game when loading these bosses thanks to two implementation details of Borland's C heap. :zunpet:
Obviously, this is a bug we should fix, but according to the definition of bugs, that fix would be exclusive to the anniversary branch. Isn't that too restrictive for something this critical? This code is guaranteed to blow up with a different heap implementation, if only in a Debug build. :thonk: And besides, nobody would notice a fix just by looking at the game's rendered output…

Looks like we have to introduce a fourth category of weird code, in addition to the previous bloat, bug, and quirk categories, for invisible internal issues like these. Let's call it landmine, and fix them on the debloated branch as well. Thanks to Clerish for the naming inspiration!
With this new category, the full definitions for all categories have become quite extensive. Thus, they now live in CONTRIBUTING.md inside the ReC98 repository.

With the new discoveries and the new landmine category, TH01 is now at 67 bugs and 20 landmines. And the solution for the landmine in question? Simplifying the 61 lines of the original code down to 16. And yes, I'm including comments in these numbers – if the interactions of the code are complex enough to require multi-paragraph comments, these are a necessary and valid part of the code.


While we're on the topic of weird code and its visible or invisible effects, there's one thing you might be concerned about. With all the rearchitecting and data shifting we're doing on the debloated branch, what will happen to the 📝 negative glitch stages? These are the result of a clearly observable bug that, by definition, must not be fixed on the debloated branch. But given that the observable layout of the glitch stages is defined by the memory surrounding the scene stage variable, won't the debloated branch inherently alter their appearance (= ⚠️ fanfiction ⚠️), or even remove them completely?

Well, yes, it will. But we can still preserve their layout by hardcoding the exact original data that the game would originally read, and even emulate the original segment relocations and other pieces of global data.
Doing this is feasible thanks to the fact that there are only 4 glitch stages. Unfortunately, the same can't be said for the timer values, which are determined by an array lookup with the un-modulo'd stage ID. If we wanted to preserve those as well, we'd have to bundle an exact copy of the original REIIDEN.EXE data segment to preserve the values of all 32,768 negative stages you could possibly enter, together with a map of all relocations in this segment. 😵 Which I've decided against for now, since this has been going on for far too long already. Let's first see if anyone ever actually complains about details like this…


Alright, time to start the anniversary branch by rendering everything at its correct internal unaligned X position? Eh… maybe not quite yet. If we just hacked all the necessary bit-shifting code into all the format-specific blitting functions, we'd still retain all this largely redundant, bad, and slow code, and would make no progress in terms of portability. It'd be much better to first write a single generic blitter that's decently optimized, but supports all kinds of sprites to make this optimization actually worth something.
So, next research question: How would such a blitter look like? After I learned during my 📝 first foray into cycle counting that port I/O is slow on 486 CPUs, it became clear that TH04's 📝 GRCG batching for pellets was one of the more useful optimizations that probably contributed a big deal towards achieving the high bullet counts of that game. This leads to two conclusions:

Maybe we should also start by not even doing these unaligned bit shifts ourselves, and instead expect the call site to 📝 always deliver a byte-aligned sprite that is correctly preshifted, if necessary? Some day, we definitely should measure how slow runtime shifting would really be…

What we should do, however, are some further general optimizations that I would have expected from master.lib: Unrolling the vertical loop, and baking a single function for every sprite width to eliminate the horizontal loop. We can then use the widest possible x86 MOV instruction for the lowest possible number of cycles per row – for example, we'd blit a 56-wide sprite with three MOVs (32-bit + 16-bit + 8-bit), and a 64-wide one with two 32-bit MOVs.
Or maybe not? There's a lot of blitting code in both master.lib and PC-98 Touhou that checks for empty bytes within sprites to skip needlessly writing them to VRAM:

uint8_t left_half = ((uint8_t *)(sprite))[0];
uint8_t right_half = ((uint8_t *)(sprite))[1];
if(right_half != 0x00) {
	pokeb(VRAM_SEGMENT, (vram_offset + 0), left_half);
}
if(right_half != 0x00) {
	pokeb(VRAM_SEGMENT, (vram_offset + 1), right_half);
}

Which goes against everything you seem to know about computers. We aren't running on an 8-bit CPU here, so wouldn't it be faster to always write both halves of a sprite in a single operation?

uint16_t both_halves = ((uint16_t *)(sprite))[0];
pokew(VRAM_SEGMENT, vram_offset, both_halves);

That's a single CPU instruction, compared to two instructions and two branches. The only possible explanation for this would be that VRAM writes are so slow on PC-98 that you'd want to avoid them at all costs, even if that means additional branching on the CPU to do so. Or maybe that was something you would want to do on certain models with slow VRAM, but not on others?

So I wrote a benchmark to answer all these questions, and to compare my new blitter against typical TH01 blitting code:

A not really representative run on DOSBox-X. Since the master.lib sprite functions are also unbatched, I expect them to not be much faster than the naive C implementation.

2023-03-05-blitperf.zip And here are the real-hardware results I've got from the PC-9800 Central Discord server:

PC-286LS PC-9801ES PC-9821Cb/Cx PC-9821Ap3 PC-9821An PC-9821Nw133 PC-9821Ra20
80286, 12 MHz i386SX, 16 MHz 486SX, 33 MHz 486DX4, 100 MHz Pentium, 90 MHz Pentium, 133 MHz Pentium Pro, 200 MHz
1987 1989 1994 1994 1994 1997 1996
Unchecked C GRCG 36,85 38,42 26,02 26,87 3,98 4,13 2,08 2,16 1,81 1,87 0,86 0,89 1,25 1,25
MOVS GRCG 15,22 16,87 9,33 10,19 1,22 1,37 0,44 0,44
MOV GRCG 15,42 17,08 9,65 10,53 1,15 1,3 0,44 0,44
4-plane 37,23 43,97 29,2 32,96 4,44 5,01 4,39 4,67 5,11 5,32 5,61 5,74 6,63 6,64
Checking first GRCG 17,49 19,15 10,84 11,72 1,27 1,44 1,04 1,07 0,54 0,54
4-plane 46,49 53,36 35,01 38,79 5,66 6,26 5,43 5,74 6,56 6,8 8,08 8,29 10,25 10,29
Checking second GRCG 16,47 18,12 10,77 11,65 1,25 1,39 1,02 0,51 0,51
4-plane 43,41 50,26 33,79 37,82 5,22 5,81 5,14 5,43 6,18 6,4 7,57 7,77 9,58 9,62
Checking both GRCG 16,14 18,03 10,84 11,71 1,33 1,49 1,01 0,49 0,49
4-plane 43,61 50,45 34,11 37,87 5,39 5,99 4,92 5,23 5,88 6,11 7,19 7,43 9,1 9,13
Amount of frames required to render 2000 16×8 pellet sprites on a variety of PC-98 models, using the new generic blitter. Both preshifted (first column) and runtime-shifted (second column) sprites were tested; empty columns correspond to times faster than a single frame. Thanks to cuba200611, Shoutmon, cybermind, and Digmac for running the tests!

The key takeaways:

Since this won't be the only piece of game-independent and explicitly PC-98-specific custom code involved in this delivery, it makes sense to start a dedicated PC-98 platform layer. This code will gradually eliminate the dependency on master.lib and replace it with better optimized and more readable C++ code. The blitting benchmark, for example, is already implemented completely without master.lib.
While this platform layer is mainly written to generate optimal code within Turbo C++ 4.0J, it can also serve as general PC-98 documentation for everyone who prefers code over machine-translating old Japanese books. Not to mention the immediacy of having all actual relevant information in one place, which might otherwise be pretty well hidden in these books, or some obscure old text file. For example, did you know that uploading gaiji via INT 18h might end up disabling the VSync interrupt trigger, deadlocking the process on the next frame delay loop? This nuisance is not replicated by any emulators, and it's quite frustrating to encounter it when trying to run your code on real hardware. master.lib works around it by simply hooking INT 18h and unconditionally reenabling the VSync interrupt trigger after the original handler returns, and so does our platform layer.


So, with the pellet draw calls batched and routed through the new renderer, we should have gained enough free CPU cycles to disable 📝 interlaced pellet rendering without any impact on frame rates?

Well, kinda. We do get 56.4 FPS, but only together with noticeable and reproducible tearing in the top part of the playfield, suggesting exactly why ZUN interlaced the rendering in the first place. 😕 So have we already reached the limit of single-buffered PC-98 games here, or can we still do something about it?
As it turns out, the main bottleneck actually lies in the pellet unblitting code. Every EGC-"accelerated" unblitting call in TH01 is as unbatched as the pellet blitting calls were, spending an additional 17 I/O port writes per call to completely set up and shut down the EGC, every time. And since this is TH01, the two-instruction operation of changing the active PC-98 VRAM page isn't inlined either, but instead done via a function call to a faraway segment. On the 486, that's:

This sums up to

And this calculation even ignores the lack of small micro-optimizations that could further optimize the blitting loop. Multiply that by the game's pellet cap of 100, and we get a 6-digit number of wasted CPU cycles. On paper, that's roughly 1/6 of the time we have for each of our target 56.423 FPS on the game's target 33 MHz systems. Might not sound all too critical, but the single-buffered nature of the game means that we're effectively racing the beam on every frame. In turn, we have to be even more serious about performance.

So, time to also add a batched EGC API to our PC-98 platform layer? Writing our own EGC code presents a nice opportunity to finally look deeper into all its registers and configuration options, and see what exactly we can do about ZUN's enforced 16-pixel alignment.
To nobody's surprise, this alignment is completely unnecessary, and only displays a lack of knowledge about the chip. While it is true that the EGC wants VRAM to be exclusively addressed in 16-bit chunks at 16-bit-aligned addresses, it specifically provides

And it gets even better: After ⌈bitlength ÷ 16⌉ write instructions, the EGC's internal shifter state automatically reinitializes itself in preparation for blitting another row of pixels with the same initially configured bit addresses and length. This is perfect for blitting rectangles, as two I/O port writes before the start of your blitting loop are enough to define your entire rectangle.

The manual nature of reading and writing in 16-pixel chunks does come with a slight pitfall though. If the source bit address is larger than the destination bit address, the first 16-bit read won't fill the EGC's internal shift register with all pixels that should appear in the first 16-pixel destination chunk. In this case, the EGC simply won't write anything and leave the first chunk unchanged. In a 📝 regular blitting loop, however, you expect that memory to be written and immediately move on to the next chunks within the row. As a result, the actual blitting process for such a rectangle will no longer be aligned to the configured address and bit length. The first row of the rectangle will appear 16 pixels to the right of the destination address, and the second one will start at bit offset 0 with pixels from the rightmost byte of the first line, which weren't blitted and remained in the tile register.
There is an easy solution though: Before the horizontal loop on each line of the rectangle, simply read one additional 16-pixel chunk from the source location to prefill the shift register. Thankfully, it's large enough to also fit the second read of the then full 16 pixels, without dropping any pixels along the way.

And that's how we get arbitrarily unaligned rectangle copies with the EGC! Except for a small register allocation trick to use two-register addressing, there's not much use in further optimizations, as the runtime of these inter-page blit operations is dominated by the VRAM page switches anyway.

Except that T98-Next seems to disagree about the register prefilling issue:

Glitched blitting results on T98-Next when trying EGC copies where the source bit address is larger than the destination bit address

Every other emulator agrees with real hardware in this regard, so we can safely assume this to be a bug in T98-Next. Just in case this old emulator with its last release from June 2010 still has any fans left nowadays… For now though, even they can still enjoy the TH01 Anniversary Edition: The only EGC copy algorithm that TH01 actually needs is the left one during the single-buffered tests, which even that emulator gets right.
That only leaves 📝 my old offer of documenting the EGC raster ops, and we've got the EGC figured out completely!


And that did in fact remove tearing from the pellet rendering function! For the first time, we can now fight Elis, Kikuri, Sariel, and Konngara with a doubled pellet frame rate:

Switchable videos like these can nicely provide evidence that these changes have no effect on gameplay, making it easy to see that the Orb still collides with all pellets on the same frames. Also, check out the difference in remaining conventional memory (coreleft)…

With only pellets and no other animation on screen, this exact pattern presents the optimal demonstration case for the new unblitter. But as you can already tell from the invincibility sprites, we'd also need to route every other kind of sprite through the same new code. This isn't all too trivial: Most sprites are still rendered at byte-aligned positions, and their blitting APIs hide that fact by taking a pixel position regardless. This is why we can't just replace ZUN's original 16-pixel-aligned EGC unblitting function with ours, and always have to replace both the blitter and the unblitter on a per-sprite basis.
To completely remove all flickering, we'd also like to get rid of all the sprite-specific unblit ➜ update ➜ render sequences, and instead gather all unblitting code to the beginning of the game loop, before any update and rendering calls. So yeah, it will take a long time to completely get rid of all flickering. Until we're there, I recommend any backer to tell me their favorite boss, so that I can focus on getting that one rendered without any flickering. Remember that here at ReC98, we can have a Touhou character popularity contest at any time during the year, whenever the store is open! :tannedcirno:

In the meantime, the consistent use of 8×8 rectangles during pellet unblitting does significantly reduce flickering across the entire game, and shrinks certain holes that pellets tend to rip into lazily reblitted sprites:

TH01 SinGyoku's crossing pellet pattern in the Anniversary Edition, demonstrating smaller unblitting artifactsThe same frame in the original game, featuring much more giant holes ripped into the sphere sprite
SinGyoku's "crossing pellets" pattern, shortly before completing the transformation back to the sphere.

To round out the first release, I added all the other bug fixes to achieve parity with my previously released patched REIIDEN.EXE builds:

So here it is, the first build of TH01's Anniversary Edition: 2023-03-05-th01-anniv.zip Edit (2023-03-12): If you're playing on Neko Project and seeing more flickering than in the original game, make sure you've checked the Screen → Disp vsync option.

Next up: The long overdue extended trip through the depths of TH02's low-level code. From what I've seen of it so far, the work on this project is finally going to become a bit more relaxing. Which is quite welcome after, what, 6 months of stressful research-heavy work?

📝 Posted:
🚚 Summary of:
P0227, P0228
Commits:
4f85326...bfd24c6, bfd24c6...739e1d8
💰 Funded by:
nrook, [Anonymous]
🏷 Tags:

Starting the year with a delivery that wasn't delayed until the last day of the month for once, nice! Still, very soon and high-maintenance did not go well together…

It definitely wasn't Sara's fault though. As you would expect from a Stage 1 Boss, her code was no challenge at all. Most of the TH02, TH04, and TH05 bosses follow the same overall structure, so let's introduce a new table to replace most of the boilerplate overview text:

Phase # Patterns HP boundary Timeout condition
Sprite of Sara in TH05 (Entrance) 4,650 288 frames
2 4 2,550 2,568 frames (= 32 patterns)
3 4 450 5,296 frames (= 24 patterns)
4 1 0 1,300 frames
Total 9 9,452 frames

And that's all the gameplay-relevant detail that ZUN put into Sara's code. It doesn't even make sense to describe the remaining patterns in depth, as their groups can significantly change between difficulties and rank values. The 📝 general code structure of TH05 bosses won't ever make for good-code, but Sara's code is just a lesser example of what I already documented for Shinki.
So, no bugs, no unused content, only inconsequential bloat to be found here, and less than 1 push to get it done… That makes 9 PC-98 Touhou bosses decompiled, with 22 to go, and gets us over the sweet 50% overall finalization mark! 🎉 And sure, it might be possible to pass through the lasers in Sara's final pattern, but the boss script just controls the origin, angle, and activity of lasers, so any quirk there would be part of the laser code… wait, you can do what?!?


TH05 expands TH04's one-off code for Yuuka's Master and Double Sparks into a more featureful laser system, and Sara is the first boss to show it off. Thus, it made sense to look at it again in more detail and finalize the code I had purportedly 📝 reverse-engineered over 4 years ago. That very short delivery notice already hinted at a very time-consuming future finalization of this code, and that prediction certainly came true. On the surface, all of the low-level laser ray rendering and collision detection code is undecompilable: It uses the SI and DI registers without Turbo C++'s safety backups on the stack, and its helper functions take their input and output parameters from convenient registers, completely ignoring common calling conventions. And just to raise the confusion even further, the code doesn't just set these registers for the helper function calls and then restores their original values, but permanently shifts them via additions and subtractions. Unfortunately, these convenient registers also include the BP base pointer to the stack frame of a function… and shifting that register throws any intuition behind accessed local variables right out of the window for a good part of the function, requiring a correctly shifted view of the stack frame just to make sense of it again. :godzun: How could such code even have been written?! This goes well beyond the already wrong assumption that using more stack space is somehow bad, and straight into the territory of self-inflicted pain.

So while it's not a lot of instructions, it's quite dense and really hard to follow. This code would really benefit from a decompilation that anchors all this madness as much as possible in existing C++ structures… so let's decompile it anyway? :tannedcirno:
Doing so would involve emitting lots of raw machine code bytes to hide the SI and DI registers from the compiler, but I already had a certain 📝 batshit insane compiler bug workaround abstraction lying around that could make such code more readable. Hilariously, it only took this one additional use case for that abstraction to reveal itself as premature and way too complicated. :onricdennat: Expanding the core idea into a full-on x86 instruction generator ended up simplifying the code structure a lot. All we really want there is a way to set all potential parameters to e.g. a specific form of the MOV instruction, which can all be expressed as the parameters to a force-inlined __emit__() function. Type safety can help by providing overloads for different operand widths here, but there really is no need for classes, templates, or explicit specialization of templates based on classes. We only need a couple of enums with opcode, register, and prefix constants from the x86 reference documentation, and a set of associated macros that token-paste pseudoregisters onto the prefixes of these enum constants.
And that's how you get a custom compile-time assembler in a 1994 C++ compiler and expand the limits of decompilability even further. What's even truly left now? Self-modifying code, layout tricks that can't be replicated with regularly structured control flow… and that's it. That leaves quite a few functions I previously considered undecompilable to be revisited once I get to work on making this game more portable.

With that, we've turned the low-level laser code into the expected horrible monstrosity that exposes all the hidden complexity in those few ASM instructions. The high-level part should be no big deal now… except that we're immediately bombarded with Fixup overflow errors at link time? Oh well, time to finally learn the true way of fixing this highly annoying issue in a second new piece of decompilation tech – and one that might actually be useful for other x86 Real Mode retro developers at that.
Earlier in the RE history of TH04 and TH05, I often wrote about the need to split the two original code segments into multiple segments within two groups, which makes it possible to slot in code from different translation units at arbitrary places within the original segment. If we don't want to define a unique segment name for each of these slotted-in translation units, we need a way to set custom segment and group names in C land. Turbo C++ offers two #pragmas for that:

For the most part, these #pragmas work well, but they seemed to not help much when it came to calling near functions declared in different segments within the same group. It took a bit of trial and error to figure out what was actually going on in that case, but there is a clear logic to it:

Summarized in code:

#pragma option -zCfoo_TEXT -zPfoo

void bar(void);
void near qux(void); // defined somewhere else, maybe in a different segment

#pragma codeseg baz_TEXT baz

// Despite the segment change in the line above, this function will still be
// put into `foo_TEXT`, the active segment during the first appearance of the
// function name.
void bar(void) {
}

// This function hasn't been declared yet, so it will go into `baz_TEXT` as
// expected.
void baz(void) {
	// This `near` function pointer will be calculated by subtracting the
	// flat/linear address of qux() inside the binary from the base address
	// of qux()'s declared segment, i.e., `foo_TEXT`.
	void (near *ptr_to_qux)(void) = qux;
}

So yeah, you might have to put #pragma codeseg into your headers to tell the linker about the correct segment of a near function in advance. 🤯 This is an important insight for everyone using this compiler, and I'm shocked that none of the Borland C++ books documented the interaction of code segment definitions and near references at least at this level of clarity. The TASM manuals did have a few pages on the topic of groups, but that syntax obviously doesn't apply to a C compiler. Fixup overflows in particular are such a common error and really deserved better than the unhelpful 🤷 of an explanation that ended up in the User's Guide. Maybe this whole technique of custom code segment names was considered arcane even by 1993, judging from the mere three sentences that #pragma codeseg was documented with? Still, it must have been common knowledge among Amusement Makers, because they couldn't have built these exact binaries without knowing about these details. This is the true solution to 📝 any issues involving references to near functions, and I'm glad to see that ZUN did not in fact lie to the compiler. 👍


OK, but now the remaining laser code compiles, and we get to write C++ code to draw some hitboxes during the two collision-detected states of each laser. These confirm what the low-level code from earlier already uncovered: Collision detection against lasers is done by testing a 12×12-pixel box at every 16 pixels along the length of a laser, which leaves obvious 4-pixel gaps at regular intervals that the player can just pass through. :zunpet: This adds 📝 yet 📝 another 📝 quirk to the growing list of quirks that were either intentional or must have been deliberately left in the game after their initial discovery. This is what constants were invented for, and there really is no excuse for not using them – especially during intoxicated coding, and/or if you don't have a compile-time abstraction for Q12.4 literals.

When detecting laser collisions, the game checks the player's single center coordinate against any of the aforementioned 12×12-pixel boxes. Therefore, it's correct to split these 12×12 pixels into two 6×6-pixel boxes and assign the other half to the player for a more natural visualization. Always remember that hitbox visualizations need to keep all colliding entities in mind – 📝 assigning a constant-sized hitbox to "the player" and "the bullets" will be wrong in most other cases.

Using subpixel coordinates in collision detection also introduces a slight inaccuracy into any hitbox visualization recorded in-engine on a 16-color PC-98. Since we have to render discrete pixels, we cannot exactly place a Q12.4 coordinate in the 93.75% of cases where the fractional part is non-zero. This is why pretty much every laser segment hitbox in the video above shows up as 7×7 rather than 6×6: The actual W×H area of each box is 13 pixels smaller, but since the hitbox lies between these pixels, we cannot indicate where it lies exactly, and have to err on the side of caution. It's also why Reimu's box slightly changes size as she moves: Her non-diagonal movement speed is 3.5 pixels per frame, and the constant focused movement in the video above halves that to 1.75 pixels, making her end up on an exact pixel every 4 frames. Looking forward to the glorious future of displays that will allow us to scale up the playfield to 16× its original pixel size, thus rendering the game at its exact internal resolution of 6144×5888 pixels. Such a port would definitely add a lot of value to the game…

The remaining high-level laser code is rather unremarkable for the most part, but raises one final interesting question: With no explicitly defined limit, how wide can a laser be? Looking at the laser structure's 1-byte width field and the unsigned comparisons all throughout the update and rendering code, the answer seems to be an obvious 255 pixels. However, the laser system also contains an automated shrinking state, which can be most notably seen in Mai's wheel pattern. This state shrinks a laser by 2 pixels every 2 frames until it reached a width of 0. This presents a problem with odd widths, which would fall below 0 and overflow back to 255 due to the unsigned nature of this variable. So rather than, I don't know, treating width values of 0 as invalid and stopping at a width of 1, or even adding a condition for that specific case, the code just performs a signed comparison, effectively limiting the width of a shrinkable laser to a maximum of 127 pixels. :zunpet: This small signedness inconsistency now forces the distinction between shrinkable and non-shrinkable lasers onto every single piece of code that uses lasers. Yet another instance where 📝 aiming for a cinematic 30 FPS look made the resulting code much more complicated than if ZUN had just evenly spread out the subtraction across 2 frames. 🤷
Oh well, it's not as if any of the fixed lasers in the original scripts came close to any of these limits. Moving lasers are much more streamlined and limited to begin with: Since they're hardcoded to 6 pixels, the game can safely assume that they're always thinner than the 28 pixels they get gradually widened to during their decay animation.

Finally, in case you were missing a mention of hitboxes in the previous paragraph: Yes, the game always uses the aforementioned 12×12 boxes, regardless of a laser's width.

This video also showcases the 127-pixel limit because I wanted to include the shrink animation for a seamless loop.

That was what, 50% of this blog post just being about complications that made laser difficult for no reason? Next up: The first TH01 Anniversary Edition build, where I finally get to reap the rewards of having a 100% decompiled game and write some good code for once.

📝 Posted:
🚚 Summary of:
P0226
Commits:
(Seihou) M0002...P0226
💰 Funded by:
Arandui, alp-bib
🏷 Tags:
> "OK, TH03/TH04/TH05 cutscenes done, let's quickly finish the Touhou Patch Center MediaWiki upgrade. Just some scripting and verification left, it will be done so quickly that I don't even have to mention it on this blog" > Still not done after 3 weeks > Blocked by one final critical bug that really should be fixed upstream > Code reviewers are probably on vacation

And so, the year unfortunately ended with yet another slow month. During the MediaWiki upgrade, I was slowly decompiling the TH05 Sara fight on the side, but stumbled over one interesting but high-maintenance detail there that would really enhance her blog post. TH02 would need a lot of attention for the basic rendering calls as well…

…so let's end the year with Shuusou Gyoku instead, looking at its most critical issue in particular. As if that were the easy option here… :tannedcirno:
The game does not run properly on modern Windows systems due to its usage of the ancient DirectDraw APIs, with issues ranging from unbearable slowdown to glitched colors to the game not even starting at all. Thankfully, Shuusou Gyoku is not the only ancient Windows game affected by these issues, and people have developed a variety of generic DirectDraw wrappers and patches for playing such games on modern systems. Out of all these, DDrawCompat is one of the simpler solutions for Shuusou Gyoku in particular: Just drop its ddraw proxy DLL into the game directory, and the game will run as it's supposed to.
So let's just bundle that DLL with all my future Shuusou Gyoku releases then? That would have been the quick and dirty option, coming with several drawbacks:

Fortunately, I had the budget to dig a bit deeper and figure out what exactly DDrawCompat does to make Shuusou Gyoku work properly. Turns out that among all the hooks and patches, the game only needs the most central one: Enforcing a 32-bit display mode regardless of whatever lower bit depth the game requests natively, combined with converting the game's pixel buffer to 32-bit on the fly.
So does this mean that adding 32-bit to the game's list of supported bit depths is everything we have to do?

The new 32-bit rendering option in the Shuusou Gyoku P0226 build.
Interestingly, Shuusou Gyoku already saved the DirectDraw enumeration flag that indicates support for 32-bit display modes. The official version just did nothing with it.

Well, almost everything. Initially, this surprised me as well: With all the if statements checking for precise bit depths, you would think that supporting one more bit depth would be way harder in this code base. As it turned out though, these conditional branches are not really about 8-bit or 16-bit color for the most part, but instead differentiate between two very distinct rendering approaches:

Consequently, most of these branches deal with differences between these two approaches that couldn't be nicely abstracted away in pbg's renderer interface: Specific palette changes that are exclusive to "8-bit" mode, or certain entities and effects whose Direct3D draw calls in "16-bit" mode require tailor-made approximations for the "8-bit" mode. Since our new 32-bit mode is equivalent to the 16-bit mode in all of these branches, I only needed to replace the raw number comparisons with more meaningful method calls.

That only left a very small number of 2D raster effects that directly write to or read from DirectDraw surface memory, and therefore do need to know the bit size of each pixel. Thanks to std::variant and std::visit(), adding 32-bit support becomes trivial here: By rewriting the code in a generic manner that derives all offsets from the template type, you only have to say hey, I'd like to have 32-bit as well, and C++ will automatically instantiate correct 32-bit variants of all bit depth-dependent code snippets.
There are only three features in the entire game that access pixel buffers this way: a color key retrieval function, the lens ball animation on the logo screen, and… the ending staff roll? Sure, the text sprites fade in and out, but so does the picture next to it, using Direct3D alpha blending or palette color ramping depending on the current rendering mode. Instead, the only reason why these sprites directly access their pixel buffer is… an unused and pretty wild spiral effect. 😮 It's still part of the code, and only doesn't show up because the parameters that control its timing were commented out before release:

They probably considered it too wild for the mood of this ending.
The main ending text was the only remaining issue of mojibake present in my previous Shuusou Gyoku builds, and is now fixed as well. Windows can render Shift-JIS text via GDI even outside Japanese locale, but only when explicitly selecting a font that supports the SHIFTJIS_CHARSET, and the game simply didn't select any font for rendering this text. Thus, GDI fell back onto its default font, which obviously is only guaranteed to support the SHIFTJIS_CHARSET if your system locale is set to Japanese. This is why the font in the original game might look different between systems. For my build, I chose the font that would appear on a clean Windows installation – a basic 400-weighted MS Gothic at font size 16, which is already used all throughout the game.

Alright, 32-bit mode complete, let's set it as the default if possible… and break compatibility to the original 秋霜CFG.DAT format in the process? When validating this file, the original game only allows the originally supported 8-bit or 16-bit modes. Setting the BitDepth field to any other value causes the entire file to be reset to its defaults, re-locking the Extra Stage in the process. :onricdennat:
Introducing a backward-compatible version system for 秋霜CFG.DAT was beyond the scope of this push. Changing the validation to a per-field approach was a good small first step to take though. The new build no longer validates the BitDepth field against a fixed list, but against the actually supported bit depths on your system, picking a different supported one if necessary. With the original approach, this would have caused your entire configuration to fail the validation check. Instead, you can now safely update to the new build without losing your option settings, or your previously unlocked access to the Extra Stage.
Side note: The validation limit for starting bombs is off by one, and the one for starting lives check is off by two. By modifying 秋霜CFG.DAT, you could theoretically get new games to start with 7 lives and 3 bombs… if you then calculate a correct checksum for your hacked config file, that is. 🧑‍💻

Interestingly, DirectDraw doesn't even indicate support for 8-bit or 16-bit color on systems that are affected by the initially mentioned issues. Therefore, these issues are not the fault of DirectDraw, but of Shuusou Gyoku, as the original release requested a bit depth that it has even verified to be unsupported. Unfortunately, Windows sides with Sim City Shuusou Gyoku here: If you previously experimented with the Windows app compatibility settings, you might have ended up with the DWM8And16BitMitigation flag assigned to the full file path of your Shuusou Gyoku executable in either

As the term mitigation suggests, these modes are (poorly) emulated, which is exactly what causes the issues with this game in the first place. Sure, this might be the lesser evil from the point of view of an operating system: If you don't have the budget for a full-blown DDrawCompat-style DirectDraw wrapper, you might consider it better for users to have the game run poorly than have it fail at startup due to incorrect API usage. Controlling this with a flag that sticks around for future runs of a binary is definitely suboptimal though, especially given how hard it is to programmatically remove this flag within the binary itself. It only adds additional complexity to the ideal clean upgrade path.
So, make sure to check your registry and manually remove these flags for the time being. Without them, the new Config → Graphic menu will correctly prevent you from selecting anything else but 32-bit on modern Windows.


After all that, there was just enough time left in this push to implement basic locale independence, as requested by the Seihou development Discord group, without looking into automatic fixes for previous mojibake filenames yet. Combining std::filesystem::path with the native Win32 API should be straightforward and bloat-free, especially with all the abstractions I've been building, right?
Well, turns out that std::filesystem::path does not actually meet my expectations. At least as long as it's not constexpr-enabled, because you still get the unfortunate conversion from narrow to wide encoding at runtime, even for globals with static storage duration. That brings us back to writing our path abstraction in terms of the regular std::string and std::wstring containers, which at least allow us to enforce the respective encoding at compile time. Even std::string_view only adds to the complexity here, as its strings are never inherently null-terminated, which is required by both the POSIX and Win32 APIs. Not to mention dynamic filenames: C++20's std::format() would be the obvious idiomatic choice here, but using it almost doubles the size of the compiled binary… 🤮
In the end, the most bloat-free way of implementing C++ file I/O in 2023 is still the same as it was 30 years ago: Call system APIs, roll a custom abstraction that conditionally uses the L prefix, and pass around raw pointers. And if you need a dynamic filename, just write the dynamic characters into arrays at fixed positions. Just as PC-98 Touhou used to do… :zunpet:
Oh, and the game's window also uses a Unicode title bar now.

And that's it for this push! Make sure to rename your configuration (秋霜CFG.DAT), score (秋霜SC.DAT), and replay (秋霜りぷ*.DAT) filenames if you were previously running the game on a non-Japanese locale, and then grab the new build:

:sh01: Shuusou Gyoku P0226

With that, we've got the most critical bugs out of the way, but the number of potential fixes and features in Shuusou Gyoku has only increased. Looking forward to what's next in this apparent Seihou revolution, later in 2023!

Next up: Starting the new year with all my plans hopefully working out for once. TH05 Sara very soon, ZMBV code review afterward, low-hanging fruit of the TH01 Anniversary Edition after that, and then kicking off TH02 with a bunch of low-level blitting code.

📝 Posted:
🚚 Summary of:
P0223, P0224, P0225
Commits:
139746c...371292d, 371292d...8118e61, 8118e61...4f85326
💰 Funded by:
rosenrose, Blue Bolt, Splashman, -Tom-, Yanga, Enderwolf, 32th System
🏷 Tags:

More than three months without any reverse-engineering progress! It's been way too long. Coincidentally, we're at least back with a surprising 1.25% of overall RE, achieved within just 3 pushes. The ending script system is not only more or less the same in TH04 and TH05, but actually originated in TH03, where it's also used for the cutscenes before stages 8 and 9. This means that it was one of the final pieces of code shared between three of the four remaining games, which I got to decompile at roughly 3× the usual speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The Music Room is largely equivalent in all three remaining games as well, and the sound device selection, ZUN Soft logo screens, and main/option menus are the same in TH04 and TH05. A lot of that code is in the "technically RE'd but not yet decompiled" ASM form though, so it would shift Finalized% more significantly than RE%. Therefore, make sure to order the new Finalization option rather than Reverse-engineering if you want to make number go up.

  1. General overview
  2. Game-specific differences
  3. Command reference
  4. Thoughts about translation support

So, cutscenes. On the surface, the .TXT files look simple enough: You directly write the text that should appear on the screen into the file without any special markup, and add commands to define visuals, music, and other effects at any place within the script. Let's start with the basics of how text is rendered, which are the same in all three games:


Superficially, the list of game-specific differences doesn't look too long, and can be summarized in a rather short table:

:th03: TH03 :th04: TH04 :th05: TH05
Script size limit 65536 bytes (heap-allocated) 8192 bytes (statically allocated)
Delay between every 2 bytes of text 1 frame by default, customizable via \v None
Text delay when holding ESC Varying speed-up factor None
Visibility of new text Immediately typed onto the screen Rendered onto invisible VRAM page, faded in on wait commands
Visibility of old text Unblitted when starting a new box Left on screen until crossfaded out with new text
Key binding for advancing the script Any key ⏎ Return, Shot, or ESC
Animation while waiting for an advance key None ⏎⃣, past right edge of current row
Inexplicable delays None 1 frame before changing pictures and after rendering new text boxes
Additional delay per interpreter loop 614.4 µs None 614.4 µs
The 614.4 µs correspond to the necessary delay for working around the repeated key up and key down events sent by PC-98 keyboards when holding down a key. While the absence of this delay significantly speeds up TH04's interpreter, it's also the reason why that game will stop recognizing a held ESC key after a few seconds, requiring you to press it again.

It's when you get into the implementation that the combined three systems reveal themselves as a giant mess, with more like 56 differences between the games. :zunpet: Every single new weird line of code opened up another can of worms, which ultimately made all of this end up with 24 pieces of bloat and 14 bugs. The worst of these should be quite interesting for the general PC-98 homebrew developers among my audience:


That brings us to the individual script commands… and yes, I'm going to document every single one of them. Some of their interactions and edge cases are not clear at all from just looking at the code.

Almost all commands are preceded by… well, a 0x5C lead byte. :thonk: Which raises the question of whether we should document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded ¥ yen sign. From a gaijin perspective, it seems obvious that it's a backslash, as it's consistently displayed as one in most of the editors you would actually use nowadays. But interestingly, iconv -f shift-jis -t utf-8 does convert any 0x5C lead bytes to actual ¥ U+00A5 YEN SIGN code points :tannedcirno:.
Ultimately, the distinction comes down to the font. There are fonts that still render 0x5C as ¥, but mainly do so out of an obvious concern about backward compatibility to JIS X 0201, where this mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho, the old Japanese fonts from Windows 3.1, but even Meiryo and Yu Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much every other modern font, and freely licensed ones in particular, render this code point as \, even if you set your editor to Shift-JIS. And while ZUN most definitely saw it as a ¥, documenting this code point as \ is less ambiguous in the long run. It can only possibly correspond to one specific code point in either Shift-JIS or UTF-8, and will remain correct even if we later mod the cutscene system to support full-blown Unicode.

Now we've only got to clarify the parameter syntax, and then we can look at the big table of commands:

:th03: :th04: :th05: \@ Clears both VRAM pages by filling them with VRAM color 0.
🐞 In TH03 and TH04, this command does not update the internal text area background used for unblitting. This bug effectively restricts usage of this command to either the beginning of a script (before the first background image is shown) or its end (after no more new text boxes are started). See the image below for an example of using it anywhere else.
:th03: :th04: :th05: \b2 Sets the font weight to a value between 0 (raw font ROM glyphs) to 3 (very thicc). Specifying any other value has no effect.
:th04: :th05: 🐞 In TH04 and TH05, \b3 leads to glitched pixels when rendering half-width glyphs due to a bug in the newly micro-optimized ASM version of 📝 graph_putsa_fx(); see the image below for an example.
In these games, the parameter also directly corresponds to the graph_putsa_fx() effect function, removing the sanity check that was present in TH03. In exchange, you can also access the four dissolve masks for the bold font (\b2) by specifying a parameter between 4 (fewest pixels) to 7 (most pixels). Demo video below.
:th03: :th04: :th05: \c15 Changes the text color to VRAM color 15.
:th05: \c=,15 Adds a color map entry: If is the first code point inside the name area on a new line, the text color is automatically set to 15. Up to 8 such entries can be registered before overflowing the statically allocated buffer.
🐞 The comma is assumed to be present even if the color parameter is omitted.
:th03: :th04: :th05: \e0 Plays the sound effect with the given ID.
:th03: :th04: :th05: \f (no-op)
:th03: :th04: :th05: \fi1
\fo1
Calls master.lib's palette_black_in() or palette_black_out() to play a hardware palette fade animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
:th03: :th04: :th05: \fm1 Fades out BGM volume via PMD's AH=02h interrupt call, in a non-blocking way. The fade speed can range from 1 (slowest) to 127 (fastest).
Values from 128 to 255 technically correspond to AH=02h's fade-in feature, which can't be used from cutscene scripts because it requires BGM volume to first be lowered via AH=19h, and there is no command to do that.
:th03: :th04: :th05: \g8 Plays a blocking 8-frame screen shake animation.
:th03: :th04: \ga0 Shows the gaiji with the given ID from 0 to 255 at the current cursor position. Even in TH03, gaiji always ignore the text delay interval configured with \v.
:th05: @3 TH05's replacement for the \ga command from TH03 and TH04. The default ID of 3 corresponds to the ♫ gaiji. Not to be confused with \@, which starts with a backslash, unlike this command.
:th05: @h Shows the 🎔 gaiji.
:th05: @t Shows the 💦 gaiji.
:th05: @! Shows the ! gaiji.
:th05: @? Shows the ? gaiji.
:th05: @!! Shows the ‼ gaiji.
:th05: @!? Shows the ⁉ gaiji.
:th03: :th04: :th05: \k0 Waits 0 frames (0 = forever) for an advance key to be pressed before continuing script execution. Before waiting, TH05 crossfades in any new text that was previously rendered to the invisible VRAM page…
🐞 …but TH04 doesn't, leaving the text invisible during the wait time. As a workaround, \vp1 can be used before \k to immediately display that text without a fade-in animation.
:th03: :th04: :th05: \m$ Stops the currently playing BGM.
:th03: :th04: :th05: \m* Restarts playback of the currently loaded BGM from the beginning.
:th03: :th04: :th05: \m,filename Stops the currently playing BGM, loads a new one from the given file, and starts playback.
:th03: :th04: :th05: \n Starts a new line at the leftmost X coordinate of the box, i.e., the start of the name area. This is how scripts can "change" the name of the currently speaking character, or use the entire 480×64 pixels without being restricted to the non-name area.
Note that automatic line breaks already move the cursor into a new line. Using this command at the "end" of a line with the maximum number of 30 full-width glyphs would therefore start a second new line and leave the previously started line empty.
If this command moved the cursor into the 5th line of a box, \s is executed afterward, with any of \n's parameters passed to \s.
:th03: :th04: :th05: \p (no-op)
:th03: :th04: :th05: \p- Deallocates the loaded .PI image.
:th03: :th04: :th05: \p,filename Loads the .PI image with the given file into the single .PI slot available to cutscenes. TH04 and TH05 automatically deallocate any previous image, 🐞 TH03 would leak memory without a manual prior call to \p-.
:th03: :th04: :th05: \pp Sets the hardware palette to the one of the loaded .PI image.
:th03: :th04: :th05: \p@ Sets the loaded .PI image as the full-screen 640×400 background image and overwrites both VRAM pages with its pixels, retaining the current hardware palette.
:th03: :th04: :th05: \p= Runs \pp followed by \p@.
:th03: :th04: :th05: \s0
\s-
Ends a text box and starts a new one. Fades in any text rendered to the invisible VRAM page, then waits 0 frames (0 = forever) for an advance key to be pressed. Afterward, the new text box is started with the cursor moved to the top-left corner of the name area.
\s- skips the wait time and starts the new box immediately.
:th03: :th04: :th05: \t100 Sets palette brightness via master.lib's palette_settone() to any value from 0 (fully black) to 200 (fully white). 100 corresponds to the palette's original colors. Preceded by a 1-frame delay unless ESC is held.
:th03: \v1 Sets the number of frames to wait between every 2 bytes of rendered text.
:th04: Sets the number of frames to spend on each of the 4 fade steps when crossfading between old and new text. The game-specific default value is also used before the first use of this command.
:th05: \v2
:th03: :th04: :th05: \vp0 Shows VRAM page 0. Completely useless in TH03 (this game always synchronizes both VRAM pages at a command boundary), only of dubious use in TH04 (for working around a bug in \k), and the games always return to their intended shown page before every blitting operation anyway. A debloated mod of this game would just remove this command, as it exposes an implementation detail that script authors should not need to worry about. None of the original scripts use it anyway.
:th03: :th04: :th05: \w64
  • \w and \wk wait for the given number of frames
  • \wm and \wmk wait until PMD has played back the current BGM for the total number of measures, including loops, given in the first parameter, and fall back on calling \w and \wk with the second parameter as the frame number if BGM is disabled.
    🐞 Neither PMD nor MMD reset the internal measure when stopping playback. If no BGM is playing and the previous BGM hasn't been played back for at least the given number of measures, this command will deadlock.
Since both TH04 and TH05 fade in any new text from the invisible VRAM page, these commands can be used to simulate TH03's typing effect in those games. Demo video below.
Contrary to \k and \s, specifying 0 frames would simply remove any frame delay instead of waiting forever.
The TH03-exclusive k variants allow the delay to be interrupted if ⏎ Return or Shot are held down. TH04 and TH05 recognize the k as well, but removed its functionality.
All of these commands have no effect if ESC is held.
\wm64,64
:th03: \wk64
\wmk64,64
:th03: :th04: :th05: \wi1
\wo1
Calls master.lib's palette_white_in() or palette_white_out() to play a hardware palette fade animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
:th03: :th04: :th05: \=4 Immediately displays the given quarter of the loaded .PI image in the picture area, with no fade effect. Any value ≥ 4 resets the picture area to black.
:th03: :th04: :th05: \==4,1 Crossfades the picture area between its current content and quarter #4 of the loaded .PI image, spending 1 frame on each of the 4 fade steps unless ESC is held. Any value ≥ 4 is replaced with quarter #0.
:th03: :th04: :th05: \$ Stops script execution. Must be called at the end of each file; otherwise, execution continues into whatever lies after the script buffer in memory.
TH05 automatically deallocates the loaded .PI image, TH03 and TH04 require a separate manual call to \p- to not leak its memory.
Bold values signify the default if the parameter is omitted; \c is therefore equivalent to \c15.
Using the \@ command in the middle of a TH03 or TH04 cutscene script
The \@ bug. Yes, the ¥ is fake. It was easier to GIMP it than to reword the sentences so that the backslashes landed on the second byte of a 2-byte half-width character pair. :onricdennat:
Cutscene font weights in TH03Cutscene font weights in TH05, demonstrating the <code>\b3</code> bug that also affects TH04Cutscene font weights in TH03, rendered at a hypothetical unaligned X positionCutscene font weights in TH05, rendered at a hypothetical unaligned X position
The font weights and effects available through \b, including the glitch with \b3 in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels at the left side of each glyph, which would extend into the previous VRAM byte. Ironically, the TH04/TH05 version is more correct in this regard: For half-width glyphs, it preserves any further pixel columns generated by the weight functions in the high byte of the 16-dot glyph variable. Unlike TH03, which still cuts them off when rendering text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them towards their correct place (4️⃣). It's only at byte-aligned X positions (2️⃣) where they remain at their internally calculated place, and appear on screen as these glitched pixel columns, 15 pixels away from the glyph they belong to. It's easy to blame bugs like these on micro-optimized ASM code, but in this instance, you really can't argue against it if the original C++ version was equally incorrect.
Combining \b and s- into a partial dissolve animation. The speed can be controlled with \v.
Simulating TH03's typing effect in TH04 and TH05 via \w. Even prettier in TH05 where we also get an additional fade animation after the box ends.

So yeah, that's the cutscene system. I'm dreading the moment I will have to deal with the other command interpreter in these games, i.e., the stage enemy system. Luckily, that one is completely disconnected from any other system, so I won't have to deal with it until we're close to finishing MAIN.EXE… that is, unless someone requests it before. And it won't involve text encodings or unblitting…


The cutscene system got me thinking in greater detail about how I would implement translations, being one of the main dependencies behind them. This goal has been on the order form for a while and could soon be implemented for these cutscenes, with 100% PI being right around the corner for the TH03 and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching for Latin-script languages could be implemented fairly quickly:

  1. Establish basic UTF-8 parsing for less painful manual editing of the source files
  2. Procedurally generate glyphs for the few required additional letters based on existing font ROM glyphs. For example, we'd generate ä by painting two short lines on top of the font ROM's a glyph, or generate ¿ by vertically flipping the question mark. This way, the text retains a consistent look regardless of whether the translated game is run with an NEC or EPSON font ROM, or the hideous abomination that Neko Project II auto-generates if you don't provide either.
  3. (Optional) Change automatic line breaks to work on a per-word basis, rather than per-glyph

That's it – script editing and distribution would be handled by your local translation group. It might seem as if this would also work for Greek and Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not sure if I want to attempt procedurally shrinking these glyphs from 16×16 to 8×16… For any more thorough solution, we'd need to go for a more "Chad" kind of full-blown translation support:

  1. Implement text subdivisions at a sensible granularity while retaining automatic line and box breaks
  2. Compile translatable text into a Japanese→target language dictionary (I'm too old to develop any further translation systems that would overwrite modded source text with translations of the original text)
  3. Implement a custom Unicode font system (glyphs would be taken from GNU Unifont unless translators provide a different 8×16 font for their language)
  4. Combine the text compiler with the font compiler to only store needed glyphs as part of the translation's font file (dealing with a multi-MB font file would be rather ugly in a Real Mode game)
  5. Write a simple install/update/patch stacking tool that supports both .HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to warrant a separate tool – each patch stack would be statically compiled into a single package file in the game's directory)
  6. Add a nice language selection option to the main menu
  7. (Optional) Support proportional fonts

Which sounds more like a separate project to be commissioned from Touhou Patch Center's Open Collective funds, separate from the ReC98 cap. This way, we can make sure that the feature is completely implemented, and I can talk with every interested translator to make sure that their language works.
It's still cheaper overall to do this on PC-98 than to first port the games to a modern system and then translate them. On the other hand, most of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with the difficulty of getting arbitrary Unicode characters to work natively in a PC-98 DOS game at all, and would be either unnecessary or trivial if we had already ported the game. Depending on where the patrons' interests lie, it may not be worth it. So let's see what all of you think about which way we should go, or whether it's worth doing at all. (Edit (2022-12-01): With Splashman's order towards the stage dialogue system, we've pretty much confirmed that it is.) Maybe we want to meet in the middle – using e.g. procedural glyph generation for dynamic translations to keep text rendering consistent with the rest of the PC-98 system, and just not support non-Latin-script languages in the beginning? In any case, I've added both options to the order form.
Edit (2023-07-28): Touhou Patch Center has agreed to fund a basic feature set somewhere between the Virgin and Chad level. Check the 📝 dedicated announcement blog post for more details and ideas, and to find out how you can support this goal!


Surprisingly, there was still a bit of RE work left in the third push after all of this, which I filled with some small rendering boilerplate. Since I also wanted to include TH02's playfield overlay functions, 1/15 of that last push went towards getting a TH02-exclusive function out of the way, which also ended up including that game in this delivery. :tannedcirno:
The other small function pointed out how TH05's Stage 5 midboss pops into the playfield quite suddenly, since its clipping test thinks it's only 32 pixels tall rather than 64:

Good chance that the pop-in might have been intended.
Edit (2023-06-30): Actually, it's a 📝 systematic consequence of ZUN having to work around the lack of clipping in master.lib's sprite functions.
There's even another quirk here: The white flash during its first frame is actually carried over from the previous midboss, which the game still considers as actively getting hit by the player shot that defeated it. It's the regular boilerplate code for rendering a midboss that resets the responsible damage variable, and that code doesn't run during the defeat explosion animation.

Next up: Staying with TH05 and looking at more of the pattern code of its boss fights. Given the remaining TH05 budget, it makes the most sense to continue in in-game order, with Sara and the Stage 2 midboss. If more money comes in towards this goal, I could alternatively go for the Mai & Yuki fight and immediately develop a pretty fix for the cheeto storage glitch. Also, there's a rather intricate pull request for direct ZMBV decoding on the website that I've still got to review…

📝 Posted:
🚚 Summary of:
P0218, P0219, P0220, P0221, P0222
Commits:
(Website) 21f0a4d...8ebf201, (Website) 8ebf201...52375e2, (Website) 52375e2...ba6359b, (Website) ba6359b...94e48e9, (Website) 94e48e9...358e16f
💰 Funded by:
[Anonymous], Yanga, Ember2528
🏷 Tags:

Yes, I'm still alive. This delivery was just plagued by all of the worst luck: Data loss, physical hard drive failure, exploding phone batteries, minor illness… and after taking 4 weeks to recover from all of that, I had to face this beast of a task. 😵

Turns out that neither part of improving video performance and usability on this blog was particularly easy. Decently encoding the videos into all web-supported formats required unexpected trade-offs even for the low-res, low-color material we are working with, and writing custom video player controls added the timing precision resistance of HTML <video> on top of the inherent complexity of frontend web development. Why did this need to be 800 lines of commented JavaScript and 200 lines of commented CSS, and consume almost more than 5 pushes?! Apparently, the latest price increase also seemed to have raised the minimum level of acceptable polish in my work, since that's more than the maximum of 3.67 pushes it should have taken. To fund the rest, I stole some of the reserved JIS trail word rendering research pushes, which means that the next towards anything will go back towards that goal.


The codec situation is especially sad because it seems like so much of a solved problem. ZMBV, the lossless capture codec introduced by DOSBox, is both very well suited for retro game footage and remarkably simple too: DOSBox-X's implementation of both an encoder and decoder comes in at under 650 lines of C++, excluding the Deflate implementation. Heck, the AVI container around the codec is more complicated to write than the compressed video data itself, and AVI is already the easiest choice you have for a widely supported video container format.
Currently, this blog contains 9:02 minutes of video across 86 files, with a total frame count of 24,515. In case this post attracts a general video encoding audience that isn't familiar with what I'm encoding here: The maximum resolution is 640×400, and most of the video uses 16 colors, with some parts occasionally using more. With ZMBV, the lossless source files take up 43.8 MiB, and that's even with AVI's infamously bad overhead. While you can always spend more time on any compression task and precisely tune your algorithm to match your source data even better, 43.8 MiB looks like a more than reasonable amount for this type of content.

Especially compared with what I actually have to ship here, because sadly, ZMBV is not supported by browsers. 😔 Writing a WebAssembly player for ZMBV would have certainly been interesting, but it already took 5 pushes to get to what we have now. So, let's instead shell out to ffmpeg and build a pipeline to convert ZMBV to the ill-suited codecs supported by web browsers, replacing the previously committed VP9 and VP8 files. From that point, we can then look into AV1, the latest and greatest web-supported video codec, to save some additional bandwidth.

But first, we've got to gather all the ZMBV source files. While I was working on the 📝 2022-07-10 blog post, I noticed some weirdly washed-out colors in the converted videos, leading to the shocking realization that my previous, historically grown conversion script didn't actually encode in a lossless way. 😢 By extension, this meant that every video before that post could have had minor discolorations as well.
For the majority of videos, I still had the original ZMBV capture files straight out of DOSBox-X, and reproducing the final videos wasn't too big of a deal. For the few cases where I didn't, I went the extra mile, took the VP9 files, and manually fixed up all the minor color errors based on reference videos from the same gameplay stage. There might be a huge ffmpeg command line with a complicated filter graph to do the job, but for such a small 4-digit number of frames, it is much more straightforward to just dump each frame as an image and perform the color replacement with ImageMagick's -opaque and -fill options. :tannedcirno:


So, time to encode our new definite collection of source files into AV1, and what the hell, how slow is this codec? With ffmpeg's libaom-av1, fully encoding all 86 videos takes almost 9 hours on my mid-range development system, regardless of the quality selected.
But sure, the encoded videos are managed by a cache, and this obviously only needs to be done once. If the results are amazing, they might even justify these glacial encoding speeds. Unfortunately, they don't: In its lossless -crf 0 mode, AV1 performs even worse than VP9, taking up 222 MiB rather than 182 MiB. It might not sound bad now, but as we're later going to find out, we want to have a lot of keyframes in these videos, which will blow up video sizes even further.

So, time to go lossy and maybe take a deep dive into AV1 tuning? Turns out that it only gets worse from there:

Because that's what all this tuning ended up being: a complete waste of time. No matter which tuning options I tried, all they did was cut down encoding time in exchange for slightly larger files on average. If there is a magic tuning option that would suddenly cause AV1 to maybe even beat ZMBV, I haven't found it. Heck, at particularly low settings, -enable-intrabc even caused blocky glitches with certain pellet patterns that looked like the internal frame block hashes were colliding all over the place. Unfortunately, I didn't save the video where it happened.

So yeah, if you've already invested the computation time and encoded your content by just specifying a -crf value and keeping the remaining settings at their time-consuming defaults, any further tuning will make no difference. Which is… an interesting choice from a usability perspective. :thonk: I would have expected the exact opposite: default to a reasonably fast and efficient profile, and leave the vast selection of tuning options for those people to explore who do want to wait 5× as long for their encoder for that additional 5% of compression efficiency. On the other hand, that surely is one way to get people to extensively study your glorious engineering efforts, I guess? You know what would maybe even motivate people to intrinsically do that? Good documentation, with examples of the intent behind every option and its optimal use case. Nobody needs long help strings that just spell out all of the abbreviations that occur in the name of the option…
But hey, that at least means there's no reason to not use anything but ZMBV for storing and archiving the lossless source files. Best compression efficiency, encodes in real-time, and the files are much easier to edit.

OK, end of rant. To understand why anyone could be hyped about AV1 to begin with, we just have to compare it to VP9, not to ZMBV. In that light, AV1 is pretty impressive even at -crf 1, compressing all 86 videos to 68.9 MiB, and even preserving 22.3% of frames completely losslessly. The remaining frames exhibit the exact kind of quality loss you'd want for retro game footage: Minor discoloration in individual pixels, so minuscule that subtracting the encoded image from the source yields an almost completely black image. Even after highlighting the errors by normalizing such a difference image, they are barely visible even if you know where to look. If "compressed PNG size of the normalized difference between ZMBV and AV1 -crf 1" is a useful metric, this would be its median frame among the 77.7% of non-lossless frames:

The lossless source imageThe same image encoded in AV1The normalized difference between both images
That's frame 455 (0-based) of 📝 YuugenMagan's reconstructed Phase 5 pattern on Easy mode. The AV1 version does in fact expand the original image's 16 distinct colors to 38.

For comparison, here's the 13th worst one. The codec only resorts to color bleeding with particularly heavy effects, exactly where it doesn't matter:

The lossless source imageThe same image encoded in AV1The normalized difference between both images
Frame 25 (0-based) of the 📝 TH05 Reimu bomb animation quirk video. 139 colors in the AV1 version.

Whether you can actually spot the difference is pretty much down to the glass between the physical pixels and your eyes. In any case, it's very hard, even if you know where to look. As far as I'm concerned, I can confidently call this "visually lossless", and it's definitely good enough for regular watching and even single-frame stepping on this blog.
Since the appeal of the original lossless files is undeniable though, I also made those more easily available. You can directly download the one for the currently active video with the button in the new video player – or directly get all of them from the Git repository if you don't like clicking.


Unfortunately, even that only made up for half of the complexity in this pipeline. As impressive as the AV1 -crf 1 result may be, it does in fact come with the drawback of also being impressively heavy to decode within today's browsers. Seeking is dog slow, with even the latencies for single-frame stepping being way beyond what I'd consider tolerable. To compensate, we have to invest another 78 MiB into turning every 10th frame into a keyframe until single-stepping through an entire video becomes as fast as it could be on my system.
But fine, 146 MiB, that's still less than the 178 MiB that the old committed VP9 files used to take up. However, we still want to support VP9 for older browsers, older hardware, and people who use Safari. And it's this codec where keyframes are so bad that there is no clear best solution, only compromises. The main issue: The lower you turn VP9's -crf value, the slower the seeking performance with the same number of keyframes. Conversely, this means that raising quality also requires more keyframes for the same seeking performance – and at these file sizes, you really don't want to raise either. We're talking 1.2 GiB for all 86 videos at -crf 10 and -g 5, and even on that configuration, seeking takes 1.3× as long as it would in the optimal case.

Thankfully, a full VP9 encode of all 86 videos only takes some 30 minutes as opposed to 9 hours. At that speed, it made sense to try a larger number of encoding settings during the ongoing development of the player. Here's a table with all the trials I've kept:

Codec -crf -g Other parameters Total size Seek time
VP9 32 20 -vf format=yuv420p 111 MiB 32 s
VP8 10 30 -qmin 10 -qmax 10 -b:v 1G 120 MiB 32 s
VP8 7 30 -qmin 7 -qmax 7 -b:v 1G 140 MiB 32 s
AV1 1 10 146 MiB 32 s
VP8 10 20 -qmin 10 -qmax 10 -b:v 1G 147 MiB 32 s
VP8 6 30 -qmin 6 -qmax 6 -b:v 1G 149 MiB 32 s
VP8 15 10 -qmin 15 -qmax 15 -b:v 1G 177 MiB 32 s
VP8 10 10 -qmin 10 -qmax 10 -b:v 1G 225 MiB 32 s
VP9 32 10 -vf format=yuv422p 329 MiB 32 s
VP8 0-4 10 -qmin 0 -qmax 4 -b:v 1G 376 MiB 32 s
VP8 5 30 -qmin 5 -qmax 5 -b:v 1G 169 MiB 33 s
VP9 63 40 47 MiB 34 s
VP9 32 20 -vf format=yuv422p 146 MiB 34 s
VP8 4 30 -qmin 0 -qmax 4 -b:v 1G 192 MiB 34 s
VP8 4 40 -qmin 4 -qmax 4 -b:v 1G 168 MiB 35 s
VP9 25 20 -vf format=yuv422p 173 MiB 36 s
VP9 15 15 -vf format=yuv422p 252 MiB 36 s
VP9 32 25 -vf format=yuv422p 118 MiB 37 s
VP9 20 20 -vf format=yuv422p 190 MiB 37 s
VP9 19 21 -vf format=yuv422p 187 MiB 38 s
VP9 32 10 553 MiB 38 s
VP9 32 10 -tune-content screen 553 MiB
VP9 32 10 -tile-columns 6 -tile-rows 2 553 MiB
VP9 15 20 -vf format=yuv422p 207 MiB 39 s
VP9 10 5 1210 MiB 43 s
VP9 32 20 264 MiB 45 s
VP9 32 20 -vf format=yuv444p 215 MiB 46 s
VP9 32 20 -vf format=gbrp10le 272 MiB 49 s
VP9 63 24 MiB 67 s
VP8 0-4 -qmin 0 -qmax 4 -b:v 1G 119 MiB 76 s
VP9 32 107 MiB 170 s
The bold rows correspond to the final encoding choices that are live right now. The seeking time was measured by holding → Right on the 📝 cheeto dodge strategy video.

Yup, the compromise ended up including a chroma subsampling conversion to YUV422P. That's the one thing you don't want to do for retro pixel graphics, as it's the exact cause behind washed-out colors and red fringing around edges:

The lossless source imageThe same image encoded in VP9, exhibiting a severe case of chroma subsamplingThe normalized difference between both images
The worst example of chroma subsampling in a VP9-encoded file according to the above metric, from frame 130 (0-based) of 📝 Sariel's restored leaf "spark" animation, featuring smeared-out contours and even an all-around darker image, blowing up the image to a whopping 3653 colors. It's certainly an aesthetic.

But there simply was no satisfying solution around the ~200 MiB mark with RGB colors, and even this compromise is still a disappointment in both size and seeking speed. Let's hope that Safari users do get AV1 support soon… Heck, even VP8, with its exclusive support for YUV420P, performs much better here, with the impact of -crf on seeking speed being much less pronounced. Encoding VP8 also just takes 3 minutes for all 86 videos, so I could have experimented much more. Too bad that it only matters for really ancient systems… :onricdennat:
Two final takeaways about VP9:


Alright, now we're done with codecs and get to finish the work on the pipeline with perhaps its biggest advantage. With a ffmpeg conversion infrastructure in place, we can also easily output a video's first frame as a poster image to be passed into the <video> tag. If this image is kept at the exact resolution of the video, the browser doesn't need to wait for an indeterminate amount of "video metadata" to be loaded, and can reserve the necessary space in the page layout much faster and without any of these dreaded loading spinners. For the big /blog page, this cuts down the minimum amount of required resources from 69.5 MB to 3.6 MB, finally making it usable again without waiting an eternity for the page to fully load. It's become pretty bad, so I really had to prioritize this task before adding any more blog posts on top.

That leaves the player itself, which is basically a sum of lots of little implementation challenges. Single-frame stepping and seeking to discrete frames is the biggest one of them, as it's technically not possible within the <video> tag, which only returns the current time as a continuous value in seconds. It only sort of works for us because the backend can pass the necessary FPS and frame count values to the frontend. These allow us to place a discrete grid of frame "frets" at regular intervals, and thus establish a consistent mapping from frames to seconds and back. The only drawback here is a noticeably weird jump back by one frame when pausing a video within the second half of a frame, caused by snapping the continuous time in seconds back onto the frame grid in order to maintain a consistent frame counter. But the whole feature of frame-based seeking more than makes up for that.
The new scrubbable timeline might be even nicer to use with a mouse or a finger than just letting a video play regularly. With all the tuning work I put into keyframes, seeking is buttery smooth, and much better than the built-in <video> UI of either Chrome or Firefox. Unfortunately, it still costs a whole lot of CPU, but I'd say it's worth it. 🥲

Finally, the new player also has a few features that might not be immediately obvious:

And with that, development hell is over, and I finally get to return to the core business! Just more than one month late. :tannedcirno: Next up: Shipping the oldest still pending order, covering the TH04/TH05 ending script format. Meanwhile, the Seihou community also wants to keep investing in Shuusou Gyoku, so we're also going to see more of that on the side.

📝 Posted:
🚚 Summary of:
P0217
Commits:
(Seihou) pbg...P0217
💰 Funded by:
Arandui
🏷 Tags:

First of all: This blog is now available as a web feed, in three different formats!

Thanks to handlerug for implementing and PR'ing the feature in a very clean way. That makes at least two people I know who wanted to see feed support, so there are probably a few more out there.


So, Shuusou Gyoku. pbg released the original source code for the first two Seihou games back in February 2019, but notably removed the crucial decompression code for the original packfiles due to… various unspecified reasons, considerations, and implications. :thonk: This vague language and subsequent rejection of a pull request to add these features back in were probably the main reasons why no one has publicly done anything with this codebase since.

The only other fork I know about is Priw8's private fork from 2020, but only because WishMakers informed me about it shortly after this push was funded. Both of them might also contribute some features to my fork in the future if their time allows it.
In this fork, Priw8 replaced packfile decompression with raw reads from directories with the pre-extracted contents of all the .DAT files. This works for playing the game, but there are actually two more things that require the original packfile code:

We can surely implement our own simple and uncompressed formats for these things, but it's not the best idea to build all future Shuusou Gyoku features on top of a replay-incompatible fork. So, what do we do? On the one hand, pbg expressed the clear wish to not include data reverse-engineered from the original binary. On the other hand, he released the code under the MIT license, which allows us to modify the code and distribute the results in any way we wish.
So, let's meet in the middle, and go for a clean-room implementation of the missing features as indicated by their usage, without looking at either the original binary or wangqr's reverse-engineered code.


With incremental rebuilds being broken in the latest Visual Studio project files as well, it made sense to start from scratch on pbg's last commit. Of course, I can't pass up a chance to use 📝 Tup, my favorite build system for every project I'm the main developer of. It might not fit Shuusou Gyoku as well as it fits ReC98, but let's see whether it would be reasonable at all…

… and it's actually not too bad! Modern Visual Studio makes this a bit harder than it should be with all the intermediate build artifacts you have to keep track of. In the end though, it's still only 70 lines of Lua to have a nice abstraction for both Debug and Release builds. With this layer underneath, the actual Shuusou Gyoku-specific part can be expressed as succinctly as in any other modern build system, while still making every compiler flag explicit. It might be slightly slower than a traditional .vcxproj build due to launching one cl.exe process per translation unit, but the result is way more reliable and trustworthy compared to anything that involves Visual Studio project files. This simplicity paves the way for expanding the build process to multiple steps, and doing all the static checking on translation strings that I never got to do for thcrap-based patches. Heck, I might even compile all future translations directly into the binary…

Every C++ build system will invariably be hated by someone, so I'd say that your goal should always be to simplify the actually important parts of your build enough to allow everyone else to easily adapt it to their favorite system. This Tupfile definitely does a better job there than your average .vcxproj file – but if you still want such a thing (or, gasp, 🤮 CMake project files 🤮) for better Visual Studio IDE integration, you should have no problem generating them for yourself.
There might still be a point in doing that because that's the one part that unfortunately sucks about this approach. Visual Studio is horribly broken for any nonstandard C++ project even in 2022:

In both cases, IntelliSense doesn't work properly at all even if it appears to be configured correctly, and Tup's dependency tracking appeared to be weirdly cut off for the very final .PDB file. Interestingly though, using the big Visual Studio IDE for just debugging a binary via devenv bin/GIAN07.exe suddenly eliminates all the IntelliSense issues. Looks like there's a lot of essential information stored in the .PDB files that Visual Studio just refuses to read in any other context. :thonk:

But now compare that to Visual Studio Code: Open it from the x64_x86 Cross Tools Command Prompt via code ., launch a build or debug task, or browse the code with perfect IntelliSense. Three small configuration files and everything just works – heck, you even get the Tup progress bar in the terminal. It might be Electron bloatware and horribly slow at times, but Visual Studio Code has long outperformed regular Visual Studio in terms of non-debug functionality.


On to the compression algorithm then… and it's just textbook LZSS, with 13 bits for the offset of a back-reference and 4 bits for its length? Hardly a trade secret there. The hard parts there all come from unexpected inefficiencies in the bitstream format:

  1. Encoding back-references as offsets into an 8 KiB ring buffer dictionary means that the most straightforward implementation actually needs an 8 KiB array for the LZSS sliding window. This could have easily been done with zero additional memory if the offset was encoded as the difference to the current byte instead.
  2. The packfile format stores the uncompressed size of every file in its header, which is a good thing because you want to know in advance how much heap memory to allocate for a specific file. Nevertheless, the original game only stops reading bits from the packfile once it encountered a back-reference with an offset of 0. This means that the compressor not only has to write this technically unneeded back-reference to the end of the compressed bitstream, but also ignore any potential other longest back-reference with an offset of 0 within the file. The latter can easily happen with a ring buffer dictionary.

The original game used a single BIT_DEVICE class with mode flags for every combination of reading and writing memory buffers and on-disk files. Since that would have necessitated a lot of error checking for all (pseudo-)methods of this class, I wrote one dedicated small class for each one of these permutations instead. To further emphasize the clean-room property of this code, these use modern C++ memory ownership features: std::unique_ptr for the fixed-size read-only buffers we get from packfiles, std::vector for the newly compressed buffers where we don't know the size in advance, and std::span for a borrowed reference to an immutable region of memory that we want to treat as a bitstream. Definitely better than using the native Win32 LocalAlloc() and LocalFree() allocator, especially if we want to port the game away from Windows one day.

One feature I didn't use though: C++ fstreams, because those are trash. :tannedcirno: These days, they would seem to be the natural choice with the new std::filesystem::path type from C++17: Correctly constructed, you can pass that type to an fstream constructor and gain both locale independence on Windows and portability to everything else, without writing any Windows-specific UTF-16 code. But even in a Release build, fstreams add ~100 KB of locale-related bloat to the .EXE which adds no value for just reading binary files. That's just too embarrassing if you look at how much space the rest of the game takes up. Writing your own platform layer that calls the Win32 CreateFileW(), ReadFile(), and WriteFile() API functions is apparently still the way to go even in 2022. And with std::filesystem::path still being a welcome addition to C++, it's not too much code to write either.

This gets us file format compatibility with the original release… and a crash as soon as the ending starts, but only in Release mode? As it turns out, this crash is caused by an out-of-bounds array access bug that was present even in the original game, and only turned into a crash now because the optimizer in modern Visual Studio versions reorders static data. As a result, the 6-element pFontInfo array got placed in front of an ECL-related counter variable that then got corrupted by the write to the 7th element, which subsequently crashed the game with a read access to previously deallocated danmaku script data. That just goes to show that these technical bugs are important and worth fixing even if they don't cause issues in the original game. Who knows how many of these will turn into crashes once we get to porting PC-98 Touhou?


So here we go, a new build of Shuusou Gyoku, compiled with Visual Studio 2022, and compatible with all original data formats:

:sh01: Shuusou Gyoku P0217

Inside the regular Shuusou Gyoku installation directory, this binary works as a full-fledged drop-in replacement for the original 秋霜玉.exe. It still has all of the original binary's problems though:

As well as some of its own:

So all in all, it's a strict downgrade at this point in time. :onricdennat: And more of a symbol that we can now start doing actual work on this game. Seihou has been a fun change of pace, and I hope that I get to do more work on the series. There is quite a lot to be done with Shuusou Gyoku alone, and the 21 GitHub issues I've opened are probably only scratching the surface.
However, all the required research for this one consumed more like 1⅔ pushes. Despite just one push being funded, it wouldn't have made sense to release the commits or this binary in any earlier state. To repay this debt, I'm going to put the next for Seihou towards the small code maintenance and performance tasks that I usually do for free, before doing any more feature and bugfix work. Next up: Improving video playback on the blog, and maybe delivering some microtransaction work on the side?

📝 Posted:
🚚 Summary of:
P0216
Commits:
3123c9d...a0ff3f1
💰 Funded by:
JonathKane
🏷 Tags:

On August 15, 1997, at Comiket 52, an unknown doujin developer going by the name of ZUN released his first game, 東方靈異伝 ~ The Highly Responsive to Prayers, marking the start of the Touhou Project game series that keeps running to this day. Today, exactly 25 years later, the C++ source code to version 1.10 of that game has been completely and perfectly reconstructed, reviewed, and documented.

The TH01 title image.

And with that, a warm welcome to all game journalists who have (re-)discovered this project through these news! Here's a summary for everyone who doesn't want to go through 3 years worth of blog posts:

What does this mean?
What does this not mean?

So while this milestone opened the floodgates to PC-98-native mods, I wouldn't advise trying to attempt a port away from PC-98 right now. But then again, I have a financial interest in being a part of the porting process, and who knows, maybe you can just merge in a PC-98 emulator core and get started with something halfway decent in a short amount of time. After all, TH01 is by far the easiest PC-98 Touhou game to port to other systems, as it makes the least use of hardware features. (Edit (2023-03-30): 📝 Turns out that this crown actually goes to TH02. It features the least amount of ZUN-written PC-98-specific rendering code out of all the 5 games, with most of it being decently abstracted via master.lib.)

However, this game in particular raises the question of what exactly one would even want to port. TH01 is a broken flicker-fest that overwhelmingly suffers the drawbacks of PC-98 hardware rather than using it to its advantage. Out of the 78 bugs that I ended up labeling as such, the majority are sprite blitting issues, while you can count the instances of good hardware use on one hand.
And even at the level of game logic, this game features a lot of weird, inconsistent behavior. Less rigorous projects such as uth05win would probably promptly identify these issues as bugs and fix them. On the one hand, this shows that there is a part of the community that wants sane versions of these games which behave as expected. In other parts of the community though, such projects quickly gain the reputation of being too inaccurate to bother about them.

Some terminology might help here. If you look over the ReC98 codebase, you'll find that I classified any weird code into three categories. Edit (2023-03-05): These have been overhauled with a new landmine category for invisible issues. Check CONTRIBUTING.md for the complete and current current definition of all weird code categories.

Some examples:

Since I'm not in the business of writing fanfiction, I won't offer any option that fixes quirks. That's where all of you can come in, and use ReC98 as a base for remasters and remakes. As for bloat and bugs though, there are many ways we could go from here:

Then again, with all these choices in mind, maybe we should just let TH01 be what it is: ZUN's first game, evidence for the truth that no programmer writes good code the first time around, and more of a historical curiosity than anything you'd want to maintain and modernize. The idea of moving on to the next game and decompiling all 5 PC-98 Touhou games in order has certainly shown to be popular among the backers who funded this 100% goal.


Since the beginning of the year, I've been dramatically raising the level of quality and care I've been putting into this project, leading to 9 of the 10 longest blog posts having been written in the past 8 months. The community reception has been even more supportive as well, with all of you still regularly selling out the store in return. To match the level of quality with the community demand, I'm raising push prices from to per push, as of this blog post. 📝 As usual, I'm going to deliver any existing orders in the backlog at the value they were originally purchased at. Due to the way the cap has to be calculated, these contributions now appear to have increased in value by 25%.

However, I do realize that this might make regular pushes prohibitively expensive for some. This could especially prevent all these exciting modding goals from ever getting off the ground. Thinking about it though, the push system is only really necessary for the core reverse-engineering business, where longer, concentrated stretches of work allow me to study a new piece of code in a larger context and improve the quality of the final result. In contrast, modding-related goals could theoretically be segmented into arbitrarily small portions of work, as I have a clear idea of where I want to go and how to get there.
Thus, I'm introducing microtransactions, now available for all modding-related goals. These allow you to order fractional pieces of work for as low as 1 €, which I will immediately deliver without requiring others to fund a full push first. Edit (2022-08-16): And then the store still sold out with a single regular contribution by nrook towards more reverse-engineering. Guess that this experiment will have to wait a little while longer, then… 😅

Next up: Taking a break and recovering from crunch time by improving video playback on this blog and working on Shuusou Gyoku, before returning to Touhou in September.

📝 Posted:
🚚 Summary of:
P0214, P0215
Commits:
158a91e...414770c, 414770c...3123c9d
💰 Funded by:
Ember2528, Yanga
🏷 Tags:

Last blog post before the 100% completion of TH01! The final parts of REIIDEN.EXE would feel rather out of place in a celebratory blog post, after all. They provided quite a neat summary of the typical technical details that are wrong with this game, and that I now get to mention for one final time:

But hey, there's an error message if you start REIIDEN.EXE without a resident MDRV2 or a correctly prepared resident structure! And even a good, user-friendly one, asking the user to launch the batch file instead. For some reason, this convenience went out of fashion in the later games.


The Game Over animation (how fitting) gives us TH01's final piece of weird sprite blitting code, which seriously manages to include 2 bugs and 3 quirks in under 50 lines of code. In test mode (game t or game d), you can trigger this effect by pressing the ⬇️ down arrow key, which certainly explains why I encountered seemingly random Game Over events during all the tests I did with this game…
The animation appears to have changed quite a bit during development, to the point that probably even ZUN himself didn't know what he wanted it to look like in the end:

The original version unblits a 32×32 rectangle around Reimu that only grows on the X axis… for the first 5 frames. The unblitting call is only run if the corresponding sprite wasn't clipped at the edges of the playfield in the frame before, and ZUN uses the animation's frame number rather than the sprite loop variable to index the per-sprite clip flag array. The resulting out-of-bounds access then reads the sprite coordinates instead, which are never 0, thus interpreting all 5 sprites as clipped.
This variant would interpret the declared 5 effect coordinates as distinct sprites and unblit them correctly every frame. The end result is rather wimpy though… hardly appropriate for a Game Over, especially with the original animation in mind.
This variant would not unblit anything, and is probably closest to what the final animation should have been.

Finally, we get to the big main() function, serving as the duct tape that holds this game together. It may read rather disorganized with all the (actually necessary) assignments and function calls, but the only actual minor issue I've seen there is that you're robbed of any pellet destroy bonus collected on the final frame of the final boss. There is a certain charm in directly nesting the infinite main gameplay loop within the infinite per-life loop within the infinite stage loop. But come on, why is there no fourth scene loop? :zunpet: Instead, the game just starts a new REIIDEN.EXE process before and after a boss fight. With all the wildly mutated global state, that was probably a much saner choice.

The final secrets can be found in the debug stage selection. ZUN implemented the prompts using the C standard library's scanf() function, which is the natural choice for quick-and-dirty testing features like this one. However, the C standard library is also complete and utter trash, and so it's not surprising that both of the scanf() calls do… well, probably not what ZUN intended. The guaranteed out-of-bounds memory access in the select_flag route prompt thankfully has no real effect on the game, but it gets really interesting with the 面数 stage prompt.
Back in 2020, I already wrote about 📝 stages 21-24, and how they're loaded from actual data that ZUN shipped with the game. As it now turns out, the code that maps stage IDs to STAGE?.DAT scene numbers contains an explicit branch that maps any (1-based) stage number ≥21 to scene 7. Does this mean that an Extra Stage was indeed planned at some point? That branch seems way too specific to just be meant as a fallback. Maybe Asprey was on to something after all…

However, since ZUN passed the stage ID as a signed integer to scanf(), you can also enter negative numbers. The only place that kind of accidentally checks for them is the aforementioned stage ID → scene mapping, which ensures that (1-based) stages < 5 use the shrine's background image and BGM. With no checks anywhere else, we get a new set of "glitch stages":

TH01's stage -1
Stage -1
TH01's stage -2
Stage -2
TH01's stage -3
Stage -3
TH01's stage -4
Stage -4
TH01's stage -5
Stage -5

The scene loading function takes the entered 0-based stage ID value modulo 5, so these 4 are the only ones that "exist", and lower stage numbers will simply loop around to them. When loading these stages, the function accesses the data in REIIDEN.EXE that lies before the statically allocated 5-element stages-of-scene array, which happens to encompass Borland C++'s locale and exception handling data, as well as a small bit of ZUN's global variables. In particular, the obstacle/card HP on the tile I highlighted in green corresponds to the lowest byte of the 32-bit RNG seed. If it weren't for that and the fact that the obstacles/card HP on the few tiles before are similarly controlled by the x86 segment values of certain initialization function addresses, these glitch stages would be completely deterministic across PC-98 systems, and technically canon… :tannedcirno:
Stage -4 is the only playable one here as it's the only stage to end up below the 📝 heap corruption limit of 102 stage objects. Completing it loads Stage -3, which crashes with a Divide Error just like it does if it's directly selected. Unsurprisingly, this happens because all 50 card bytes at that memory location are 0, so one division (or in this case, modulo operation) by the number of cards is enough to crash the game.
Stage -5 is modulo'd to 0 and thus loads the first regular stage. The only apparent broken element there is the timer, which is handled by a completely different function that still operates with a (0-based) stage ID value of -5. Completing the stage loads Stage -4, which also crashes, but only because its 61 cards naturally cause the 📝 stack overflow in the flip-in animation for any stage with more than 50 cards.

And that's REIIDEN.EXE, the biggest and most bloated PC-98 Touhou executable, fully decompiled! Next up: Finishing this game with the main menu, and hoping I'll actually pull it off within 24 hours. (If I do, we might all have to thank 32th System, who independently decompiled half of the remaining 14 functions…)

📝 Posted:
🚚 Summary of:
P0212, P0213
Commits:
d398a94...363fd54, 363fd54...158a91e
💰 Funded by:
LeyDud, Lmocinemod, GhostRiderCog, Ember2528
🏷 Tags:

Wow, it's been 3 days and I'm already back with an unexpectedly long post about TH01's bonus point screens? 3 days used to take much longer in my previous projects…

Before I talk about graphics for the rest of this post, let's start with the exact calculations for both bonuses. Touhou Wiki already got these right, but it still makes sense to provide them here, in a format that allows you to cross-reference them with the source code more easily. For the card-flipping stage bonus:

Time min((Stage timer * 3), 6553)
Continuous min((Highest card combo * 100), 6553)
Bomb&Player min(((Lives * 200) + (Bombs * 100)), 6553)
STAGE min(((Stage number - 1) * 200), 6553)
BONUS Point Sum of all above values * 10

The boss stage bonus is calculated from the exact same metrics, despite half of them being labeled differently. The only actual differences are in the higher multipliers and in the cap for the stage number bonus. Why remove it if raising it high enough also effectively disables it? :tannedcirno:

Time min((Stage timer * 5), 6553)
Continuous min((Highest card combo * 200), 6553)
MIKOsan min(((Lives * 500) + (Bombs * 200)), 6553)
Clear min((Stage number * 1000), 65530)
TOTLE Sum of all above values * 10

The transition between the gameplay and TOTLE screens is one of the more impressive effects showcased in this game, especially due to how wavy it often tends to look. Aside from the palette interpolation (which is, by the way, the first time ZUN wrote a correct interpolation algorithm between two 4-bit palettes), the core of the effect is quite simple. With the TOTLE image blitted to VRAM page 1:

So it's really more like two interlaced shift effects with opposite directions, starting on different scanlines. No trigonometry involved at all.

Horizontally scrolling pixels on a single VRAM page remains one of the few 📝 appropriate uses of the EGC in a fullscreen 640×400 PC-98 game, regardless of the copied block size. The few inter-page copies in this effect are also reasonable: With 8 new lines starting on each effect frame, up to (8 × 20) = 160 lines are transferred at any given time, resulting in a maximum of (160 × 2 × 2) = 640 VRAM page switches per frame for the newly transferred pixels. Not that frame rate matters in this situation to begin with though, as the game is doing nothing else while playing this effect.
What does sort of matter: Why 32 pixels every 2 frames, instead of 16 pixels on every frame? There's no performance difference between doing one half of the work in one frame, or two halves of the work in two frames. It's not like the overhead of another loop has a serious impact here, especially with the PC-98 VRAM being said to have rather high latencies. 32 pixels over 2 frames is also harder to code, so ZUN must have done it on purpose. Guess he really wanted to go for that 📽 cinematic 30 FPS look 📽 here… :zunpet:

Removing the palette interpolation and transitioning from a black screen to CLEAR3.GRP makes it a lot clearer how the effect works.

Once all the metrics have been calculated, ZUN animates each value with a rather fancy left-to-right typing effect. As 16×16 images that use a single bright-red color, these numbers would be perfect candidates for gaiji… except that ZUN wanted to render them at the more natural Y positions of the labels inside CLEAR3.GRP that are far from aligned to the 8×16 text RAM grid. Not having been in the mood for hardcoding another set of monochrome sprites as C arrays that day, ZUN made the still reasonable choice of storing the image data for these numbers in the single-color .GRC form– yeah, no, of course he once again chose the .PTN hammer, and its 📝 16×16 "quarter" wrapper functions around nominal 32×32 sprites.

.PTN sprite for the TOTLE metric digits of 0, 1, 2, and 3.PTN sprite for the TOTLE metric digits of 4, 5, 6, and 7 .PTN sprite for the TOTLE metric digits of 8 and 9, filled with two blank quarters
The three 32×32 TOTLE metric digit sprites inside NUMB.PTN.

Why do I bring up such a detail? What's actually going on there is that ZUN loops through and blits each digit from 0 to 9, and then continues the loop with "digit" numbers from 10 to 19, stopping before the number whose ones digit equals the one that should stay on screen. No problem with that in theory, and the .PTN sprite selection is correct… but the .PTN quarter selection isn't, as ZUN wrote (digit % 4) instead of the correct ((digit % 10) % 4). :onricdennat: Since .PTN quarters are indexed in a row-major way, the 10-19 part of the loop thus ends up blitting 23016745(nothing):

This footage was slowed down to show one sprite blitting operation per frame. The actual game waits a hardcoded 4 milliseconds between each sprite, so even theoretically, you would only see roughly every 4th digit. And yes, we can also observe the empty quarter here, only blitted if one of the digits is a 9.

Seriously though? If the deadline is looming and you've got to rush some part of your game, a standalone screen that doesn't affect anything is the best place to pick. At 4 milliseconds per digit, the animation goes by so fast that this quirk might even add to its perceived fanciness. It's exactly the reason why I've always been rather careful with labeling such quirks as "bugs". And in the end, the code does perform one more blitting call after the loop to make sure that the correct digit remains on screen.


The remaining ¾ of the second push went towards transferring the final data definitions from ASM to C land. Most of the details there paint a rather depressing picture about ZUN's original code layout and the bloat that came with it, but it did end on a real highlight. There was some unused data between ZUN's non-master.lib VSync and text RAM code that I just moved away in September 2015 without taking a closer look at it. Those bytes kind of look like another hardcoded 1bpp image though… wait, what?!

An unused mouse cursor sprite found in all of TH01's binaries

Lovely! With no mouse-related code left in the game otherwise, this cursor sprite provides some great fuel for wild fan theories about TH01's development history:

  1. Could ZUN have 📝 stolen the basic PC-98 VSync or text RAM function code from a source that also implemented mouse support?
  2. Did he have a mouse-controlled level editor during development? It's highly likely that he had something, given all the 📝 bit twiddling seen in the STAGE?.DAT format.
  3. Or was this game actually meant to have mouse-controllable portions at some point during development? Even if it would have just been the menus.

… Actually, you know what, with all shared data moved to C land, I might as well finish FUUIN.EXE right now. The last secret hidden in its main() function: Just like GAME.BAT supports launching the game in various debug modes from the DOS command line, FUUIN.EXE can directly launch one of the game's endings. As long as the MDRV2 driver is installed, you can enter fuuin t1 for the 魔界/Makai Good Ending, or fuuin t for 地獄/Jigoku Good Ending.
Unfortunately, the command-line parameter can only control the route. Choosing between a Good or Bad Ending is still done exclusively through TH01's resident structure, and the continues_per_scene array in particular. But if you pre-allocate that structure somehow and set one of the members to a nonzero value, it would work. Trainers, anyone?

Alright, gotta get back to the code if I want to have any chance of finishing this game before the 15th… Next up: The final 17 functions in REIIDEN.EXE that tie everything together and add some more debug features on top.

📝 Posted:
🚚 Summary of:
P0207, P0208, P0209, P0210, P0211
Commits:
454c105...c26ef4b, c26ef4b...239a3ec, 239a3ec...5030867, 5030867...149fbca, 149fbca...d398a94
💰 Funded by:
GhostPhanom, Yanga, Arandui, Lmocinemod
🏷 Tags:

Whew, TH01's boss code just had to end with another beast of a boss, taking way longer than it should have and leaving uncomfortably little time for the rest of the game. Let's get right into the overview of YuugenMagan, the most sequential and scripted battle in this game:


At a pixel-perfect 81×61 pixels, the Orb hitboxes are laid out rather generously this time, reaching quite a bit outside the 64×48 eye sprites:

TH01 YuugenMagan's hitboxes.

And that's about the only positive thing I can say about a position calculation in this fight. Phase 0 already starts with the lasers being off by 1 pixel from the center of the iris. Sure, 28 may be a nicer number to add than 29, but the result won't be byte-aligned either way? This is followed by the eastern laser's hitbox somehow being 24 pixels larger than the others, stretching a rather unexpected 70 pixels compared to the 46 of every other laser.
On a more hilarious note, the eye closing keyframe contains the following (pseudo-)code, comprising the only real accidentally "unused" danmaku subpattern in TH01:

// Did you mean ">= RANK_HARD"?
if(rank == RANK_HARD) {
	eye_north.fire_aimed_wide_5_spread();
	eye_southeast.fire_aimed_wide_5_spread();
	eye_southwest.fire_aimed_wide_5_spread();

	// Because this condition can never be true otherwise.
	// As a result, no pellets will be spawned on Lunatic mode.
	// (There is another Lunatic-exclusive subpattern later, though.)
	if(rank == RANK_LUNATIC) {
		eye_west.fire_aimed_wide_5_spread();
		eye_east.fire_aimed_wide_5_spread();
	}
}

Featuring the weirdly extended hitbox for the eastern laser, as well as an initial Reimu position that points out the disparity between byte-aligned rendering and the internal coordinates one final time.

After a few utility functions that look more like a quickly abandoned refactoring attempt, we quickly get to the main attraction: YuugenMagan combines the entire boss script and most of the pattern code into a single 2,634-instruction function, totaling 9,677 bytes inside REIIDEN.EXE. For comparison, ReC98's version of this code consists of at least 49 functions, excluding those I had to add to work around ZUN's little inconsistencies, or the ones I added for stylistic reasons.
In fact, this function is so large that Turbo C++ 4.0J refuses to generate assembly output for it via the -S command-line option, aborting with a Compiler table limit exceeded in function error. Contrary to what the Borland C++ 4.0 User Guide suggests, this instance of the error is not at all related to the number of function bodies or any metric of algorithmic complexity, but is simply a result of the compiler's internal text representation for a single function overflowing a 64 KiB memory segment. Merely shortening the names of enough identifiers within the function can help to get that representation down below 64 KiB. If you encounter this error during regular software development, you might interpret it as the compiler's roundabout way of telling you that it inlined way more function calls than you probably wanted to have inlined. Because you definitely won't explicitly spell out such a long function in newly-written code, right? :tannedcirno:
At least it wasn't the worst copy-pasting job in this game; that trophy still goes to 📝 Elis. And while the tracking code for adjusting an eye's sprite according to the player's relative position is one of the main causes behind all the bloat, it's also 100% consistent, and might have been an inlined class method in ZUN's original code as well.

The clear highlight in this fight though? Almost no coordinate is precisely calculated where you'd expect it to be. In particular, all bullet spawn positions completely ignore the direction the eyes are facing to:

Pellets unexpectedly spawned at the exact
	bottom center of an eye
Combining the bottom of the pupil with the exact horizontal center of the sprite as a whole might sound like a good idea, but looks especially wrong if the eye is facing right.
Missile spawn positions in the TH01
	YuugenMagan fight
Here it's the other way round: OK for a right-facing eye, really wrong for a left-facing one.
Spawn position of the 3-pixel laser in the
	TH01 YuugenMagan fight
Dude, the eye is even supposed to track the laser in this one!
The final center position of the regular
	pentagram in the TH01 YuugenMagan fight
Hint: That's not the center of the playfield. At least the pellets spawned from the corners are sort of correct, but with the corner calculates precomputed, you could only get them wrong on purpose.

Due to their effect on gameplay, these inaccuracies can't even be called "bugs", and made me devise a new "quirk" category instead. More on that in the TH01 100% blog post, though.


While we did see an accidentally unused bullet pattern earlier, I can now say with certainty that there are no truly unused danmaku patterns in TH01, i.e., pattern code that exists but is never called. However, the code for YuugenMagan's phase 5 reveals another small piece of danmaku design intention that never shows up within the parameters of the original game.
By default, pellets are clipped when they fly past the top of the playfield, which we can clearly observe for the first few pellets of this pattern. Interestingly though, the second subpattern actually configures its pellets to fall straight down from the top of the playfield instead. You never see this happening in-game because ZUN limited that subpattern to a downwards angle range of 0x73 or 162°, resulting in none of its pellets ever getting close to the top of the playfield. If we extend that range to a full 360° though, we can see how ZUN might have originally planned the pattern to end:

YuugenMagan's phase 5 patterns on every difficulty, with the second subpattern extended to reveal the different pellet behavior that remained in the final game code. In the original game, the eyes would stop spawning bullets on the marked frame.

If we also disregard everything else about YuugenMagan that fits the upcoming definition of quirk, we're left with 6 "fixable" bugs, all of which are a symptom of general blitting and unblitting laziness. Funnily enough, they can all be demonstrated within a short 9-second part of the fight, from the end of phase 9 up until the pentagram starts spinning in phase 13:

  1. General flickering whenever any sprite overlaps an eye. This is caused by only reblitting each eye every 3 frames, and is an issue all throughout the fight. You might have already spotted it in the videos above.
  2. Each of the two lasers is unblitted and blitted individually instead of each operation being done for both lasers together. Remember how 📝 ZUN unblits 32 horizontal pixels for every row of a line regardless of its width? That's why the top part of the left, right-moving laser is never visible, because it's blitted before the other laser is unblitted.
  3. ZUN forgot to unblit the lasers when phase 9 ends. This footage was recorded by pressing ↵ Return in test mode (game t or game d), and it's probably impossible to achieve this during actual gameplay without TAS techniques. You would have to deal the required 6 points of damage within 491 frames, with the eye being invincible during 240 of them. Simply shooting up an Orb with a horizontal velocity of 0 would also only work a single time, as boss entities always repel the Orb with a horizontal velocity of ±4.
  4. The shrinking pentagram is unblitted after the eyes were blitted, adding another guaranteed frame of flicker on top of the ones in 1). Like in 2), the blockiness of the holes is another result of unblitting 32 pixels per row at a time.
  5. Another missing unblitting call in a phase transition, as the pentagram switches from its not quite correctly interpolated shrunk form to a regular star polygon with a radius of 64 pixels. Indirectly caused by the massively bloated coordinate calculation for the shrink animation being done separately for the unblitting and blitting calls. Instead of, y'know, just doing it once and storing the result in variables that can later be reused.
  6. The pentagram is not reblitted at all during the first 100 frames of phase 13. During that rather long time, it's easily possible to remove it from VRAM completely by covering its area with player shots. Or HARRY UP pellets.

Definitely an appropriate end for this game's entity blitting code. :onricdennat: I'm really looking forward to writing a proper sprite system for the Anniversary Edition…

And just in case you were wondering about the hitboxes of these pentagrams as they slam themselves into Reimu:

62 pixels on the X axis, centered around each corner point of the star, 16 pixels below, and extending infinitely far up. The latter part becomes especially devious because the game always collision-detects all 5 corners, regardless of whether they've already clipped through the bottom of the playfield. The simultaneously occurring shape distortions are simply a result of the line drawing function's rather poor re-interpolation of any line that runs past the 640×400 VRAM boundaries; 📝 I described that in detail back when I debugged the shootout laser crash. Ironically, using fixed-size hitboxes for a variable-sized pentagram means that the larger one is easier to dodge.


The final puzzle in TH01's boss code comes 📝 once again in the form of weird hardware palette changes. The kanji on the background image goes through various colors throughout the fight, which ZUN implemented by gradually incrementing and decrementing either a single one or none of the color's three 4-bit components at the beginning of each even-numbered phase. The resulting color sequence, however, doesn't quite seem to follow these simple rules:

Adding some debug output sheds light on what's going on there:

Since each iteration of phase 12 adds 63 to the red component, integer overflow will cause the color to infinitely alternate between dark-blue and red colors on every 2.03 iterations of the pentagram phase loop. The 65th iteration will therefore be the first one with a dark-blue color for a third iteration in a row – just in case you manage to stall the fight for that long.

Yup, ZUN had so much trust in the color clamping done by his hardware palette functions that he did not clamp the increment operation on the stage_palette itself. :zunpet: Therefore, the 邪 colors and even the timing of their changes from Phase 6 onwards are "defined" by wildly incrementing color components beyond their intended domain, so much that even the underlying signed 8-bit integer ends up overflowing. Given that the decrement operation on the stage_palette is clamped though, this might be another one of those accidents that ZUN deliberately left in the game, 📝 similar to the conclusion I reached with infinite bumper loops.
But guess what, that's also the last time we're going to encounter this type of palette component domain quirk! Later games use master.lib's 8-bit palette system, which keeps the comfort of using a single byte per component, but shifts the actual hardware color into the top 4 bits, leaving the bottom 4 bits for added precision during fades.

OK, but now we're done with TH01's bosses! 🎉That was the 8th PC-98 Touhou boss in total, leaving 23 to go.


With all the necessary research into these quirks going well into a fifth push, I spent the remaining time in that one with transferring most of the data between YuugenMagan and the upcoming rest of REIIDEN.EXE into C land. This included the one piece of technical debt in TH01 we've been carrying around since March 2015, as well as the final piece of the ending sequence in FUUIN.EXE. Decompiling that executable's main() function in a meaningful way requires pretty much all remaining data from REIIDEN.EXE to also be moved into C land, just in case you were wondering why we're stuck at 99.46% there.
On a more disappointing note, the static initialization code for the 📝 5 boss entity slots ultimately revealed why YuugenMagan's code is as bloated and redundant as it is: The 5 slots really are 5 distinct variables rather than a single 5-element array. That's why ZUN explicitly spells out all 5 eyes every time, because the array he could have just looped over simply didn't exist. 😕 And while these slot variables are stored in a contiguous area of memory that I could just have taken the address of and then indexed it as if it were an array, I didn't want to annoy future port authors with what would technically be out-of-bounds array accesses for purely stylistic reasons. At least it wasn't that big of a deal to rewrite all boss code to use these distinct variables, although I certainly had to get a bit creative with Elis.

Next up: Finding out how many points we got in totle, and hoping that ZUN didn't hide more unexpected complexities in the remaining 45 functions of this game. If you have to spare, there are two ways in which that amount of money would help right now:

📝 Posted:
🚚 Summary of:
P0205, P0206
Commits:
3259190...327730f, 327730f...454c105
💰 Funded by:
[Anonymous], Yanga
🏷 Tags:

Oh look, it's another rather short and straightforward boss with a rather small number of bugs and quirks. Yup, contrary to the character's popularity, Mima's premiere is really not all that special in terms of code, and continues the trend established with 📝 Kikuri and 📝 SinGyoku. I've already covered 📝 the initial sprite-related bugs last November, so this post focuses on the main code of the fight itself. The overview:


And there aren't even any weird hitboxes this time. What is maybe special about Mima, however, is how there's something to cover about all of her patterns. Since this is TH01, it's won't surprise anyone that the rotating square patterns are one giant copy-pasta of unblitting, updating, and rendering code. At least ZUN placed the core polar→Cartesian transformation in a separate function for creating regular polygons with an arbitrary number of sides, which might hint toward some more varied shapes having been planned at one point?
5 of the 6 patterns even follow the exact same steps during square update frames:

  1. Calculate square corner coordinates
  2. Unblit the square
  3. Update the square angle and radius
  4. Use the square corner coordinates for spawning pellets or missiles
  5. Recalculate square corner coordinates
  6. Render the square

Notice something? Bullets are spawned before the corner coordinates are updated. That's why their initial positions seem to be a bit off – they are spawned exactly in the corners of the square, it's just that it's the square from 8 frames ago. :tannedcirno:

Mima's first pattern on Normal difficulty.

Once ZUN reached the final laser pattern though, he must have noticed that there's something wrong there… or maybe he just wanted to fire those lasers independently from the square unblit/update/render timer for a change. Spending an additional 16 bytes of the data segment for conveniently remembering the square corner coordinates across frames was definitely a decent investment.

Mima's laser pattern on Lunatic difficulty, now with correct laser spawn positions. If this pattern reminds you of the game crashing immediately when defeating Mima, 📝 check out the Elis blog post for the details behind this bug, and grab the bugfix patch from there.

When Mima isn't shooting bullets from the corners of a square or hopping across the playfield, she's raising flame pillars from the bottom of the playfield within very specifically calculated random ranges… which are then rendered at byte-aligned VRAM positions, while collision detection still uses their actual pixel position. Since I don't want to sound like a broken record all too much, I'll just direct you to 📝 Kikuri, where we've seen the exact same issue with the teardrop ripple sprites. The conclusions are identical as well.

Mima's flame pillar pattern. This video was recorded on a particularly unlucky seed that resulted in great disparities between a pillar's internal X coordinate and its byte-aligned on-screen appearance, leading to lots of right-shifted hitboxes.
Also note how the change from the meteor animation to the three-arm 🚫 casting sprite doesn't unblit the meteor, and leaves that job to any sprite that happens to fly over those pixels.

However, I'd say that the saddest part about this pattern is how choppy it is, with the circle/pillar entities updating and rendering at a meager 7 FPS. Why go that low on purpose when you can just make the game render ✨ smoothly ✨ instead?

So smooth it's almost uncanny.

The reason quickly becomes obvious: With TH01's lack of optimization, going for the full 56.4 FPS would have significantly slowed down the game on its intended 33 MHz CPUs, requiring more than cheap surface-level ASM optimization for a stable frame rate. That might very well have been ZUN's reason for only ever rendering one circle per frame to VRAM, and designing the pattern with these time offsets in mind. It's always been typical for PC-98 developers to target the lowest-spec models that could possibly still run a game, and implementing dynamic frame rates into such an engine-less game is nothing I would wish on anybody. And it's not like TH01 is particularly unique in its choppiness anyway; low frame rates are actually a rather typical part of the PC-98 game aesthetic.


The final piece of weirdness in this fight can be found in phase 1's hop pattern, and specifically its palette manipulation. Just from looking at the pattern code itself, each of the 4 hops is supposed to darken the hardware palette by subtracting #444 from every color. At the last hop, every color should have therefore been reduced to a pitch-black #000, leaving the player completely blind to the movement of the chasing pellets for 30 frames and making the pattern quite ghostly indeed. However, that's not what we see in the actual game:

Nothing in the pattern's code would cause the hardware palette to get brighter before the end of the pattern, and yet…
The expected version doesn't look all too unfair, even on Lunatic… well, at least at the default rank pellet speed shown in this video. At maximum pellet speed, it is in fact rather brutal.

Looking at the frame counter, it appears that something outside the pattern resets the palette every 40 frames. The only known constant with a value of 40 would be the invincibility frames after hitting a boss with the Orb, but we're not hitting Mima here… :thonk:
But as it turns out, that's exactly where the palette reset comes from: The hop animation darkens the hardware palette directly, while the 📝 infamous 12-parameter boss collision handler function unconditionally resets the hardware palette to the "default boss palette" every 40 frames, regardless of whether the boss was hit or not. I'd classify this as a bug: That function has no business doing periodic hardware palette resets outside the invincibility flash effect, and it completely defies common sense that it does.

That explains one unexpected palette change, but could this function possibly also explain the other infamous one, namely, the temporary green discoloration in the Konngara fight? That glitch comes down to how the game actually uses two global "default" palettes: a default boss palette for undoing the invincibility flash effect, and a default stage palette for returning the colors back to normal at the end of the bomb animation or when leaving the Pause menu. And sure enough, the stage palette is the one with the green color, while the boss palette contains the intended colors used throughout the fight. Sending the latter palette to the graphics chip every 40 frames is what corrects the discoloration, which would otherwise be permanent.

The green color comes from BOSS7_D1.GRP, the scrolling background of the entrance animation. That's what turns this into a clear bug: The stage palette is only set a single time in the entire fight, at the beginning of the entrance animation, to the palette of this image. Apart from consistency reasons, it doesn't even make sense to set the stage palette there, as you can't enter the Pause menu or bomb during a blocking animation function.
And just 3 lines of code later, ZUN loads BOSS8_A1.GRP, the main background image of the fight. Moving the stage palette assignment there would have easily prevented the discoloration.

But yeah, as you can tell, palette manipulation is complete jank in this game. Why differentiate between a stage and a boss palette to begin with? The blocking Pause menu function could have easily copied the original palette to a local variable before darkening it, and then restored it after closing the menu. It's not so easy for bombs as the intended palette could change between the start and end of the animation, but the code could have still been simplified a lot if there was just one global "default palette" variable instead of two. Heck, even the other bosses who manipulate their palettes correctly only do so because they manually synchronize the two after every change. The proper defense against bugs that result from wild mutation of global state is to get rid of global state, and not to put up safety nets hidden in the middle of existing effect code.

The easiest way of reproducing the green discoloration bug in the TH01 Konngara fight, timed to show the maximum amount of time the discoloration can possibly last.

In any case, that's Mima done! 7th PC-98 Touhou boss fully decompiled, 24 bosses remaining, and 59 functions left in all of TH01.


In other thrilling news, my call for secondary funding priorities in new TH01 contributions has given us three different priorities so far. This raises an interesting question though: Which of these contributions should I now put towards TH01 immediately, and which ones should I leave in the backlog for the time being? Since I've never liked deciding on priorities, let's turn this into a popularity contest instead: The contributions with the least popular secondary priorities will go towards TH01 first, giving the most popular priorities a higher chance to still be left over after TH01 is done. As of this delivery, we'd have the following popularity order:

  1. TH05 (1.67 pushes), from T0182
  2. Seihou (1 push), from T0184
  3. TH03 (0.67 pushes), from T0146

Which means that T0146 will be consumed for TH01 next, followed by T0184 and then T0182. I only assign transactions immediately before a delivery though, so you all still have the chance to change up these priorities before the next one.

Next up: The final boss of TH01 decompilation, YuugenMagan… if the current or newly incoming TH01 funds happen to be enough to cover the entire fight. If they don't turn out to be, I will have to pass the time with some Seihou work instead, missing the TH01 anniversary deadline as a result. Edit (2022-07-18): Thanks to Yanga for securing the funding for YuugenMagan after all! That fight will feature slightly more than half of all remaining code in TH01's REIIDEN.EXE and the single biggest function in all of PC-98 Touhou, let's go!

📝 Posted:
🚚 Summary of:
P0203, P0204
Commits:
4568bf7...86cdf5f, 86cdf5f...0c682b5
💰 Funded by:
GhostRiderCog, [Anonymous], Yanga
🏷 Tags:

Let's start right with the milestones:


So, how did this card-flipping stage obstacle delivery get so horribly delayed? With all the different layouts showcased in the 28 card-flipping stages, you'd expect this to be among the more stable and bug-free parts of the codebase. Heck, with all stage objects being placed on a 32×32-pixel grid, this is the first TH01-related blog post this year that doesn't have to describe an alignment-related unblitting glitch!

That alone doesn't mean that this code is free from quirky behavior though, and we have to look no further than the first few lines of the collision handling for round bumpers to already find a whole lot of that. Simplified, they do the following:

pixel_t delta_y_between_orb_and_bumper = (orb.top - bumper.top);
if(delta_y_between_orb_and_bumper <= 0) {
	orb.top = (bumper.top - 24);
} else {
	orb.top = (bumper.top + 24);
}

Immediately, you wonder why these assignments only exist for the Y coordinate. Sure, hitting a bumper from the left or right side should happen less often, but it's definitely possible. Is it really a good idea to warp the Orb to the top or bottom edge of a bumper regardless?
What's more important though: The fact that these immediate assignments exist at all. The game's regular Orb physics work by producing a Y velocity from the single force acting on the Orb and a gravity factor, and are completely independent of its current Y position. A bumper collision does also apply a new force onto the Orb further down in the code, but these assignments still bypass the physics system and are bound to have some knock-on effect on the Orb's movement.

To observe that effect, we just have to enter Stage 18 on the 地獄/Jigoku route, where it's particularly trivial to reproduce. At a 📝 horizontal velocity of ±4, these assignments are exactly what can cause the Orb to endlessly bounce between two bumpers. As rudimentary as the Orb's physics may be, just letting them do their work would have entirely prevented these loops:

One of at least three infinite bumper loop constellations within just this 10×5-tile section of TH01's Stage 18 on the 地獄/Jigoku route. With an effective 56 horizontal pixels between both hitboxes, the Orb would have to travel an absolute Y distance of at least 16 vertical pixels within (56 / 4) = 14 frames to escape the other bumper's hitbox. If the initial bounce reduces the Orb's Y velocity far enough for it to not manage that distance the first time, it will never reach the necessary speed again. In this loop, the bounce-off force even stabilizes, though this doesn't have to happen. The blue areas indicate the pixel-perfect* hitboxes of each bumper.
TH01 bumper collision handling without ZUN's manual assignment of the Y coordinate. The Orb still bounces back and forth between two bumpers for a while, but its top position always follows naturally from its Y velocity and the force applied to it, and gravity wins out in the end. The blue areas indicate the pixel-perfect* hitboxes of each bumper.

Now, you might be thinking that these Y assignments were just an attempt to prevent the Orb from colliding with the same bumper again on the next frame. After all, those 24 pixels exactly correspond to ⅓ of the height of a bumper's hitbox with an additional pixel added on top. However, the game already perfectly prevents repeated collisions by turning off collision testing with the same bumper for the next 7 frames after a collision. Thus, we can conclude that ZUN either explicitly coded bumper collision handling to facilitate these loops, or just didn't take out that code after inevitably discovering what it did. This is not janky code, it's not a glitch, it's not sarcasm from my end, and it's not the game's physics being bad.

But wait. Couldn't these assignments just be a remnant from a time in development before ZUN decided on the 7-frame delay on further collisions? Well, even that explanation stops holding water after the next few lines of code. Simplified, again:

pixel_t delta_x_between_orb_and_bumper = (orb.left - bumper.left);
if((orb.velocity.x == +4) && (delta_x_between_orb_and_bumper < 0)) {
	orb.velocity.x = -4;
} else if((orb.velocity.x == -4) && (delta_x_between_orb_and_bumper > 0)) {
	orb.velocity.x = +4;
}

What's important here is the part that's not in the code – namely, anything that handles X velocities of -8 or +8. In those cases, the Orb simply continues in the same horizontal direction. The manual Y assignment is the only part of the code that actually prevents a collision there, as the newly applied force is not guaranteed to be enough:

An infinite loop across three bumpers, made possible by the edge of the playfield and bumper bars on opposite sides, an unchanged horizontal direction, and the Y assignments neatly placing the Orb on either the top or bottom side of a bumper. The alternating sign of the force further ensures that the Orb will travel upwards half the time, canceling out gravity during the short time between two hitboxes.
With the unchanged horizontal direction and the Y assignments removed, nothing keeps an Orb at ±8 pixels per frame from flying into/over a bumper. The collision force pushes the Orb slightly, but not enough to truly matter. The final force sends the Orb on a significant downward trajectory beyond the next bumper's hitbox, breaking the original loop.

Forgetting to handle ⅖ of your discrete X velocity cases is simply not something you do by accident. So we might as well say that ZUN deliberately designed the game to behave exactly as it does in this regard.


Bumpers also come in vertical or horizontal bar shapes. Their collision handling also turns off further collision testing for the next 7 frames, and doesn't do any manual coordinate assignment. That's definitely a step up in cleanliness from round bumpers, but it doesn't seem to keep in mind that the player can fire a new shot every 4 frames when standing still. That makes it immediately obvious why this works:

The green numbers show the amount of frames since the last detected collision with the respective bumper bar, and indicate that collision testing with the bar below is currently disabled.

That's the most well-known case of reducing the Orb's horizontal velocity to 0 by exactly hitting it with shots in its center and then button-mashing it through a horizontal bar. This also works with vertical bars and yields even more interesting results there, but if we want to have any chance of understanding what happens there, we have to first go over some basics:

However, if that were everything the game did, kicking the Orb into a column of vertical bumper bars would lead them to behave more like a rope that the Orb can climb, as the initial collision with two hitboxes cancels out the intended sign change that reflects the Orb away from the bars:

This footage was recorded without the workaround I am about to describe. It does not reflect the behavior of the original game. You cannot do this in the original game.
While the visualization reveals small sections where three hitboxes overlap, the Orb can never actually collide with three of them at the same time, as those 3-hitbox regions are 2 pixels smaller than they would need to be to fit the Orb. That's exactly the difference between using < rather than <= in these hitbox comparisons.

While that would have been a fun gameplay mechanic on its own, it immediately breaks apart once you place two vertical bumper bars next to each other. Due to how these bumper bar hitboxes extend past their sprites, any two adjacent vertical bars will end up with the exact same hitbox in absolute screen coordinates. Stage 17 on the 魔界/Makai route contains exactly such a layout:

The collision handlers of adjacent vertical bars always activate in the same frame, independently invert the Orb's X velocity, and therefore fully cancel out their intended effect on the Orb… if the game did not have the workaround I am about to describe. This cannot happen in the original game.

ZUN's workaround: Setting a "vertical bumper bar block flag" after any collision with such a bar, which simply disables any collision with any vertical bar for the next 7 frames. This quick hack made all vertical bars work as intended, and avoided the need for involving the Orb's X velocity in any kind of physics system. :zunpet:


Edit (2022-07-12): This flag only works around glitches that would be caused by simultaneously colliding with more than one vertical bar. The actual response to a bumper bar collision still remains unaffected, and is very naive:

These conditions are only correct if the Orb comes in at an angle roughly between 45° and 135° on either side of a bar. If it's anywhere close to 0° or 180°, this response will be incorrect, and send the Orb straight through the bar. Since the large hitboxes make this easily possible, you can still get the Orb to climb a vertical column, or glide along a horizontal row:

Here's the hitbox overlay for 地獄/Jigoku Stage 19, and here's an updated version of the 📝 Orb physics debug mod that now also shows bumper bar collision frame numbers: 2022-07-10-TH01OrbPhysicsDebug.zip See the th01_orb_debug branch for the code. To use it, simply replace REIIDEN.EXE, and run the game in debug mode, via game d on the DOS prompt. If you encounter a gameplay situation that doesn't seem to be covered by this blog post, you can now verify it for yourself. Thanks to touhou-memories for bringing these issues to my attention! That definitely was a glaring omission from the initial version of this blog post.


With that clarified, we can now try mashing the Orb into these two vertical bars:

At first, that workaround doesn't seem to make a difference here. As we expect, the frame numbers now tell us that only one of the two bumper bars in a row activates, but we couldn't have told otherwise as the number of bars has no effect on newly applied Y velocity forces. On a closer look, the Orb's rise to the top of the playfield is in fact caused by that workaround though, combined with the unchanged top-to-bottom order of collision testing. As soon as any bumper bar completed its 7 collision delay frames, it resets the aforementioned flag, which already reactivates collision handling for any remaining vertical bumper bars during the same frame. Look out for frames with both a 7 and a 1, like the one marked in the video above: The 7 will always appear before the 1 in the row-major order. Whenever this happens, the current oscillation period is cut down from 7 to 6 frames – and because collision testing runs from top to bottom, this will always happen during the falling part. Depending on the Y velocity, the rising part may also be cut down to 6 frames from time to time, but that one at least has a chance to last for the full 7 frames. This difference adds those crucial extra frames of upward movement, which add up to send the Orb to the top. Without the flag, you'd always see the Orb oscillating between a fixed range of the bar column.
Finally, it's the "top of playfield" force that gradually slows down the Orb and makes sure it ultimately only moves at sub-pixel velocities, which have no visible effect. Because 📝 the regular effect of gravity is reset with each newly applied force, it's completely negated during most of the climb. This even holds true once the Orb reached the top: Since the Orb requires a negative force to repeatedly arrive up there and be bounced back, this force will stay active for the first 5 of the 7 collision frames and not move the Orb at all. Once gravity kicks in at the 5th frame and adds 1 to the Y velocity, it's already too late: The new velocity can't be larger than 0.5, and the Orb only has 1 or 2 frames before the flag reset causes it to be bounced back up to the top again.


Portals, on the other hand, turn out to be much simpler than the old description that ended up on Touhou Wiki in October 2005 might suggest. Everything about their teleportations is random: The destination portal, the exit force (as an integer between -9 and +9), as well as the exit X velocity, with each of the 📝 5 distinct horizontal velocities having an equal chance of being chosen. Of course, if the destination portal is next to the left or right edge of the playfield and it chooses to fire the Orb towards that edge, it immediately bounces off into the opposite direction, whereas the 0 velocity is always selected with a constant 20% probability.

The selection process for the destination portal involves a bit more than a single rand() call. The game bundles all obstacles in a single structure of dynamically allocated arrays, and only knows how many obstacles there are in total, not per type. Now, that alone wouldn't have much of an impact on random portal selection, as you could simply roll a random obstacle ID and try again if it's not a portal. But just to be extra cute, ZUN instead iterates over all obstacles, selects any non-entered portal with a chance of ¼, and just gives up if that dice roll wasn't successful after 16 loops over the whole array, defaulting to the entered portal in that case.
In all its silliness though, this works perfectly fine, and results in a chance of 0.7516(𝑛 - 1) for the Orb exiting out of the same portal it entered, with 𝑛 being the total number of portals in a stage. That's 1% for two portals, and 0.01% for three. Pretty decent for a random result you don't want to happen, but that hurts nobody if it does.

The one tiny ZUN bug with portals is technically not even part of the newly decompiled code here. If Reimu gets hit while the Orb is being sent through a portal, the Orb is immediately kicked out of the portal it entered, no matter whether it already shows up inside the sprite of the destination portal. Neither of the two portal sprites is reset when this happens, leading to "two Orbs" being visible simultaneously. :tannedcirno::onricdennat:
This makes very little sense no matter how you look at it. The Orb doesn't receive a new velocity or force when this happens, so it will simply re-enter the same portal once the gameplay resumes on Reimu's next life:

And that's it! At least the turrets don't have anything notable to say about them 📝 that I haven't said before.


That left another ½ of a push over at the end. Way too much time to finish FUUIN.exe, way too little time to start with Mima… but the bomb animation fit perfectly in there. No secrets or bugs there, just a bunch of sprite animation code wasting at least another 82 bytes in the data segment. The special effect after the kuji-in sprites uses the same single-bitplane 32×32 square inversion effect seen at the end of Kikuri's and Sariel's entrance animation, except that it's a 3-stack of 16-rings moving at 6, 7, and 8 pixels per frame respectively. At these comparatively slow speeds, the byte alignment of each square adds some further noise to the discoloration pattern… if you even notice it below all the shaking and seizure-inducing hardware palette manipulation.
And yes, due to the very destructive nature of the effect, the game does in fact rely on it only being applied to VRAM page 0. While that will cause every moving sprite to tear holes into the inverted squares along its trajectory, keeping a clean playfield on VRAM page 1 is what allows all that pixel damage to be easily undone at the end of this 89-frame animation.

Next up: Mima! Let's hope that stage obstacles already were the most complex part remaining in TH01…

📝 Posted:
🚚 Summary of:
P0201, P0202
Commits:
9342665...ff49e9e, ff49e9e...4568bf7
💰 Funded by:
Ember2528, Yanga, [Anonymous]
🏷 Tags:

The positive:

The negative:

The overview:


This time, we're back to the Orb hitbox being a logical 49×49 pixels in SinGyoku's center, and the shot hitbox being the weird one. What happens if you want the shot hitbox to be both offset to the left a bit and stretch the entire width of SinGyoku's sprite? You get a hitbox that ends in mid-air, far away from the right edge of the sprite:

Due to VRAM byte alignment, all player shots fired between gx = 376 and gx = 383 inclusive appear at the same visual X position, but are internally already partly outside the hitbox and therefore won't hit SinGyoku – compare the marked shot at gx = 376 to the one at gx = 380. So much for precisely visualizing hitboxes in this game…

Since the female and male forms also use the sphere entity's coordinates, they share the same hitbox.


Onto the rendering glitches then, which can – you guessed it – all be found in the sphere form's slam movement:

By having the sphere move from the right edge of the playfield to the left, this video demonstrates both the lazy reblitting and broken unblitting at the right edge for negative X velocities. Also, isn't it funny how Reimu can partly disappear from all the sloppy SinGyoku-related unblitting going on after her sprite was blitted?

Due to the low contrast of the sphere against the background, you typically don't notice these glitches, but the white invincibility flashing after a hit really does draw attention to them. This time, all of these glitches aren't even directly caused by ZUN having never learned about the EGC's bit length register – if he just wrote correct code for SinGyoku, none of this would have been an issue. Sigh… I wonder how many more glitches will be caused by improper use of this one function in the last 18% of REIIDEN.EXE.

There's even another bug here, with ZUN hardcoding a horizontal delta of 8 pixels rather than just passing the actual X velocity. Luckily, the maximum movement speed is 6 pixels on Lunatic, and this would have only turned into an additional observable glitch if the X velocity were to exceed 24 pixels. But that just means it's the kind of bug that still drains RE attention to prove that you can't actually observe it in-game under some circumstances.


The 5 pellet patterns are all pretty straightforward, with nothing to talk about. The code architecture during phase 2 does hint towards ZUN having had more creative patterns in mind – especially for the male form, which uses the transformation function's three pattern callback slots for three repetitions of the same pellet group.
There is one more oddity to be found at the very end of the fight:

The first frame of TH01 SinGyoku's defeat animation, showing the sphere blitted on top of a potentially active person form

Right before the defeat white-out animation, the sphere form is explicitly reblitted for no reason, on top of the form that was blitted to VRAM in the previous frame, and regardless of which form is currently active. If SinGyoku was meant to immediately transform back to the sphere form before being defeated, why isn't the person form unblitted before then? Therefore, the visibility of both forms is undeniably canon, and there is some lore meaning to be found here… :thonk:
In any case, that's SinGyoku done! 6th PC-98 Touhou boss fully decompiled, 25 remaining.


No FUUIN.EXE code rounding out the last push for a change, as the 📝 remaining missile code has been waiting in front of SinGyoku for a while. It already looked bad in November, but the angle-based sprite selection function definitely takes the cake when it comes to unnecessary and decadent floating-point abuse in this game.
The algorithm itself is very trivial: Even with 📝 .PTN requiring an additional quarter parameter to access 16×16 sprites, it's essentially just one bit shift, one addition, and one binary AND. For whatever reason though, ZUN casts the 8-bit missile angle into a 64-bit double, which turns the following explicit comparisons (!) against all possible 4 + 16 boundary angles (!!) into FPU operations. :zunpet: Even with naive and readable division and modulo operations, and the whole existence of this function not playing well with Turbo C++ 4.0J's terrible code generation at all, this could have been 3 lines of code and 35 un-inlined constant-time instructions. Instead, we've got this 207-instruction monster… but hey, at least it works. 🤷
The remaining time then went to YuugenMagan's initialization code, which allowed me to immediately remove more declarations from ASM land, but more on that once we get to the rest of that boss fight.

That leaves 76 functions until we're done with TH01! Next up: Card-flipping stage obstacles.

📝 Posted:
🚚 Summary of:
P0198, P0199, P0200
Commits:
48db0b7...440637e, 440637e...5af2048, 5af2048...67e46b5
💰 Funded by:
Ember2528, Lmocinemod, Yanga
🏷 Tags:

What's this? A simple, straightforward, easy-to-decompile TH01 boss with just a few minor quirks and only two rendering-related ZUN bugs? Yup, 2½ pushes, and Kikuri was done. Let's get right into the overview:

So yeah, there's your new timeout challenge. :godzun:


The few issues in this fight all relate to hitboxes, starting with the main one of Kikuri against the Orb. The coordinates in the code clearly describe a hitbox in the upper center of the disc, but then ZUN wrote a < sign instead of a > sign, resulting in an in-game hitbox that's not quite where it was intended to be…

Kikuri's actual hitbox. Since the Orb sprite doesn't change its shape, we can visualize the hitbox in a pixel-perfect way here. The Orb must be completely within the red area for a hit to be registered.
TODO TH01 Kikuri's intended hitboxTH01 Kikuri's actual hitbox

Much worse, however, are the teardrop ripples. It already starts with their rendering routine, which places the sprites from TAMAYEN.PTN at byte-aligned VRAM positions in the ultimate piece of if(…) {…} else if(…) {…} else if(…) {…} meme code. Rather than tracking the position of each of the five ripple sprites, ZUN suddenly went purely functional and manually hardcoded the exact rendering and collision detection calls for each frame of the animation, based on nothing but its total frame counter. :zunpet:
Each of the (up to) 5 columns is also unblitted and blitted individually before moving to the next column, starting at the center and then symmetrically moving out to the left and right edges. This wouldn't be a problem if ZUN's EGC-powered unblitting function didn't word-align its X coordinates to a 16×1 grid. If the ripple sprites happen to start at an odd VRAM byte position, their unblitting coordinates get rounded both down and up to the nearest 16 pixels, thus touching the adjacent 8 pixels of the previously blitted columns and leaving the well-known black vertical bars in their place. :tannedcirno:

OK, so where's the hitbox issue here? If you just look at the raw calculation, it's a slightly confusingly expressed, but perfectly logical 17 pixels. But this is where byte-aligned blitting has a direct effect on gameplay: These ripples can be spawned at any arbitrary, non-byte-aligned VRAM position, and collisions are calculated relative to this internal position. Therefore, the actual hitbox is shifted up to 7 pixels to the right, compared to where you would expect it from a ripple sprite's on-screen position:

Due to the deterministic nature of this part of the fight, it's always 5 pixels for this first set of ripples. These visualizations are obviously not pixel-perfect due to the different potential shapes of Reimu's sprite, so they instead relate to her 32×32 bounding box, which needs to be entirely inside the red area.

We've previously seen the same issue with the 📝 shot hitbox of Elis' bat form, where pixel-perfect collision detection against a byte-aligned sprite was merely a sidenote compared to the more serious X=Y coordinate bug. So why do I elevate it to bug status here? Because it directly affects dodging: Reimu's regular movement speed is 4 pixels per frame, and with the internal position of an on-screen ripple sprite varying by up to 7 pixels, any micrododging (or "grazing") attempt turns into a coin flip. It's sort of mitigated by the fact that Reimu is also only ever rendered at byte-aligned VRAM positions, but I wouldn't say that these two bugs cancel out each other.
Oh well, another set of rendering issues to be fixed in the hypothetical Anniversary Edition – obviously, the hitboxes should remain unchanged. Until then, you can always memorize the exact internal positions. The sequence of teardrop spawn points is completely deterministic and only controlled by the fixed per-difficulty spawn interval.


Aside from more minor coordinate inaccuracies, there's not much of interest in the rest of the pattern code. In another parallel to Elis though, the first soul pattern in phase 4 is aimed on every difficulty except Lunatic, where the pellets are once again statically fired downwards. This time, however, the pattern's difficulty is much more appropriately distributed across the four levels, with the simultaneous spinning circle pellets adding a constant aimed component to every difficulty level.

Kikuri's phase 4 patterns, on every difficulty.


That brings us to 5 fully decompiled PC-98 Touhou bosses, with 26 remaining… and another ½ of a push going to the cutscene code in FUUIN.EXE.
You wouldn't expect something as mundane as the boss slideshow code to contain anything interesting, but there is in fact a slight bit of speculation fuel there. The text typing functions take explicit string lengths, which precisely match the corresponding strings… for the most part. For the "Gatekeeper 'SinGyoku'" string though, ZUN passed 23 characters, not 22. Could that have been the "h" from the Hepburn romanization of 神玉?!
Also, come on, if this text is already blitted to VRAM for no reason, you could have gone for perfect centering at unaligned byte positions; the rendering function would have perfectly supported it. Instead, the X coordinates are still rounded up to the nearest byte.

The hardcoded ending cutscene functions should be even less interesting – don't they just show a bunch of images followed by frame delays? Until they don't, and we reach the 地獄/Jigoku Bad Ending with its special shake/"boom" effect, and this picture:

Picture #2 from ED2A.GRP.

Which is rendered by the following code:

for(int i = 0; i <= boom_duration; i++) { // (yes, off-by-one)
	if((i & 3) == 0) {
		graph_scrollup(8);
	} else {
		graph_scrollup(0);
	}

	end_pic_show(1); // ← different picture is rendered
	frame_delay(2);  // ← blocks until 2 VSync interrupts have occurred

	if(i & 1) {
		end_pic_show(2); // ← picture above is rendered
	} else {
		end_pic_show(1);
	}
}

Notice something? You should never see this picture because it's immediately overwritten before the frame is supposed to end. And yet it's clearly flickering up for about one frame with common emulation settings as well as on my real PC-9821 Nw133, clocked at 133 MHz. master.lib's graph_scrollup() doesn't block until VSync either, and removing these calls doesn't change anything about the blitted images. end_pic_show() uses the EGC to blit the given 320×200 quarter of VRAM from page 1 to the visible page 0, so the bottleneck shouldn't be there either…

…or should it? After setting it up via a few I/O port writes, the common method of EGC-powered blitting works like this:

  1. Read 16 bits from the source VRAM position on any single bitplane. This fills the EGC's 4 16-bit tile registers with the VRAM contents at that specific position on every bitplane. You do not care about the value the CPU returns from the read – in optimized code, you would make sure to just read into a register to avoid useless additional stores into local variables.
  2. Write any 16 bits to the target VRAM position on any single bitplane. This copies the contents of the EGC's tile registers to that specific position on every bitplane.

To transfer pixels from one VRAM page to another, you insert an additional write to I/O port 0xA6 before 1) and 2) to set your source and destination page… and that's where we find the bottleneck. Taking a look at the i486 CPU and its cycle counts, a single one of these page switches costs 17 cycles – 1 for MOVing the page number into AL, and 16 for the OUT instruction itself. Therefore, the 8,000 page switches required for EGC-copying a 320×200-pixel image require 136,000 cycles in total.

And that's the optimal case of using only those two instructions. 📝 As I implied last time, TH01 uses a function call for VRAM page switches, complete with creating and destroying a useless stack frame and unnecessarily updating a global variable in main memory. I tried optimizing ZUN's code by throwing out unnecessary code and using 📝 pseudo-registers to generate probably optimal assembly code, and that did speed up the blitting to almost exactly 50% of the original version's run time. However, it did little about the flickering itself. Here's a comparison of the first loop with boom_duration = 16, recorded in DOSBox-X with cputype=auto and cycles=max, and with i overlaid using the text chip. Caution, flashing lights:

The original animation, completing in 50 frames instead of the expected 34, thanks to slow blitting. Combined with the lack of double-buffering, this results in noticeable tearing as the screen refreshes while blitting is still in progress. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic #1.)
This optimized version completes in the expected 34 frames. No tearing happens to be visible in this recording, but the ドカーン image is still visible on every second loop iteration. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic #1.)

I pushed the optimized code to the th01_end_pic_optimize branch, to also serve as an example of how to get close to optimal code out of Turbo C++ 4.0J without writing a single ASM instruction.
And if you really want to use the EGC for this, that's the best you can do. It really sucks that it merely expanded the GRCG's 4×8-bit tile register to 4×16 bits. With 32 bits, ≥386 CPUs could have taken advantage of their wider registers and instructions to double the blitting performance. Instead, we now know the reason why 📝 Promisence Soft's EGC-powered sprite driver that ZUN later stole for TH03 is called SPRITE16 and not SPRITE32. What a massive disappointment.

But what's perhaps a bigger surprise: Blitting planar images from main memory is much faster than EGC-powered inter-page VRAM copies, despite the required manual access to all 4 bitplanes. In fact, the blitting functions for the .CDG/.CD2 format, used from TH03 onwards, would later demonstrate the optimal method of using REP MOVSD for blitting every line in 32-pixel chunks. If that was also used for these ending images, the core blitting operation would have taken ((12 + (3 × (320 / 32))) × 200 × 4) = 33,600 cycles, with not much more overhead for the surrounding row and bitplane loops. Sure, this doesn't factor in the whole infamous issue of VRAM being slow on PC-98, but the aforementioned 136,000 cycles don't even include any actual blitting either. And as you move up to later PC-98 models with Pentium CPUs, the gap between OUT and REP MOVSD only becomes larger. (Note that the page I linked above has a typo in the cycle count of REP MOVSD on Pentium CPUs: According to the original Intel Architecture and Programming Manual, it's 13+𝑛, not 3+𝑛.)
This difference explains why later games rarely use EGC-"accelerated" inter-page VRAM copies, and keep all of their larger images in main memory. It especially explains why TH04 and TH05 can get away with naively redrawing boss backdrop images on every frame.

In the end, the whole fact that ZUN did not define how long this image should be visible is enough for me to increment the game's overall bug counter. Who would have thought that looking at endings of all things would teach us a PC-98 performance lesson… Sure, optimizing TH01 already seemed promising just by looking at its bloated code, but I had no idea that its performance issues extended so far past that level.

That only leaves the common beginning part of all endings and a short main() function before we're done with FUUIN.EXE, and 98 functions until all of TH01 is decompiled! Next up: SinGyoku, who not only is the quickest boss to defeat in-game, but also comes with the least amount of code. See you very soon!

📝 Posted:
🚚 Summary of:
P0193, P0194, P0195, P0196, P0197
Commits:
e1f3f9f...183d7a2, 183d7a2...5d93a50, 5d93a50...e18c53d, e18c53d...57c9ac5, 57c9ac5...48db0b7
💰 Funded by:
Ember2528, Yanga
🏷 Tags:

With Elis, we've not only reached the midway point in TH01's boss code, but also a bunch of other milestones: Both REIIDEN.EXE and TH01 as a whole have crossed the 75% RE mark, and overall position independence has also finally cracked 80%!

And it got done in 4 pushes again? Yup, we're back to 📝 Konngara levels of redundancy and copy-pasta. This time, it didn't even stop at the big copy-pasted code blocks for the rift sprite and 256-pixel circle animations, with the words "redundant" and "unnecessary" ending up a total of 18 times in my source code comments.
But damn is this fight broken. As usual with TH01 bosses, let's start with a high-level overview:

This puts the earliest possible end of the fight at the first frame of phase 5. However, nothing prevents Elis' HP from reaching 0 before that point. You can nicely see this in 📝 debug mode: Wait until the HP bar has filled up to avoid heap corruption, hold ↵ Return to reduce her HP to 0, and watch how Elis still goes through a total of two patterns* and four teleport animations before accepting defeat.

But wait, heap corruption? Yup, there's a bug in the HP bar that already affected Konngara as well, and it isn't even just about the graphical glitches generated by negative HP:

Since Elis starts with 14 HP, which is an even number, this corruption is trivial to cause: Simply hold ↵ Return from the beginning of the fight, and the completion condition will never be true, as the HP and frame numbers run past the off-by-one meeting point.

Edit (2023-07-21): Pressing ↵ Return to reduce HP also works in test mode (game t). There, the game doesn't even check the heap, and consequently won't report any corruption, allowing the HP bar to be glitched even further.

Regular gameplay, however, entirely prevents this due to the fixed start positions of Reimu and the Orb, the Orb's fixed initial trajectory, and the 50 frames of delay until a bomb deals damage to a boss. These aspects make it impossible to hit Elis within the first 14 frames of phase 1, and ensure that her HP bar is always filled up completely. So ultimately, this bug ends up comparable in seriousness to the 📝 recursion / stack overflow bug in the memory info screen.


These wavy teleport animations point to a quite frustrating architectural issue in this fight. It's not even the fact that unblitting the yellow star sprites rips temporary holes into Elis' sprite; that's almost expected from TH01 at this point. Instead, it's all because of this unused frame of the animation:

An unused wave animation frame from TH01's BOSS5.BOS

With this sprite still being part of BOSS5.BOS, Girl-Elis has a total of 9 animation frames, 1 more than the 📝 8 per-entity sprites allowed by ZUN's architecture. The quick and easy solution would have been to simply bump the sprite array size by 1, but… nah, this would have added another 20 bytes to all 6 of the .BOS image slots. :zunpet: Instead, ZUN wrote the manual position synchronization code I mentioned in that 2020 blog post. Ironically, he then copy-pasted this snippet of code often enough that it ended up taking up more than 120 bytes in the Elis fight alone – with, you guessed it, some of those copies being redundant. Not to mention that just going from 8 to 9 sprites would have allowed ZUN to go down from 6 .BOS image slots to 3. That would have actually saved 420 bytes in addition to the manual synchronization trouble. Looking forward to SinGyoku, that's going to be fun again…


As for the fight itself, it doesn't take long until we reach its most janky danmaku pattern, right in phase 1:

The "pellets along circle" pattern on Lunatic, in its original version and with fanfiction fixes for everything that can potentially be interpreted as a bug.

Then again, it might very well be that all of this was intended, or, most likely, just left in the game as a happy accident. The latter interpretation would explain why ZUN didn't just delete the rendering calls for the lower-right quarter of the circle, because seriously, how would you not spot that? The phase 3 patterns continue with more minor graphical glitches that aren't even worth talking about anymore.


And then Elis transforms into her bat form at the beginning of Phase 5, which displays some rather unique hitboxes. The one against the Orb is fine, but the one against player shots…

… uses the bat's X coordinate for both X and Y dimensions. :zunpet: In regular gameplay, it's not too bad as most of the bat patterns fire aimed pellets which typically don't allow you to move below her sprite to begin with. But if you ever tried destroying these pellets while standing near the middle of the playfield, now you know why that didn't work. This video also nicely points out how the bat, like any boss sprite, is only ever blitted at positions on the 8×1-pixel VRAM byte grid, while collision detection uses the actual pixel position.

The bat form patterns are all relatively simple, with little variation depending on the difficulty level, except for the "slow pellet spreads" pattern. This one is almost easiest to dodge on Lunatic, where the 5-spreads are not only always fired downwards, but also at the hardcoded narrow delta angle, leaving plenty of room for the player to move out of the way:

The "slow pellet spreads" pattern of Elis' bat form, on every difficulty. Which version do you think is the easiest one?

Finally, we've got another potential timesave in the girl form's "safety circle" pattern:

After the circle spawned completely, you lose a life by moving outside it, but doing that immediately advances the pattern past the circle part. This part takes 200 frames, but the defeat animation only takes 82 frames, so you can save up to 118 frames there.

Final funny tidbit: As with all dynamic entities, this circle is only blitted to VRAM page 0 to allow easy unblitting. However, it's also kind of static, and there needs to be some way to keep the Orb, the player shots, and the pellets from ripping holes into it. So, ZUN just re-blits the circle every… 4 frames?! 🤪 The same is true for the Star of David and its surrounding circle, but there you at least get a flash animation to justify it. All the overlap is actually quite a good reason for not even attempting to 📝 mess with the hardware color palette instead.


And that's the 4th PC-98 Touhou boss decompiled, 27 to go… but wait, all these quirks, and I still got nothing about the one actual crash that can appear in regular gameplay? There has even been a recent video about it. The cause has to be in Elis' main function, after entering the defeat branch and before the blocking white-out animation. It can't be anywhere else other than in the 📝 central line blitting and unblitting function, called from 📝 that one broken laser reset+unblit function, because everything else in that branch looks fine… and I think we can rule out a crash in MDRV2's non-blocking fade-out call. That's going to need some extra research, and a 5th push added on top of this delivery.

Reproducing the crash was the whole challenge here. Even after moving Elis and Reimu to the exact positions seen in Pearl's video and setting Elis' HP to 0 on the exact same frame, everything ran fine for me. It's definitely no division by 0 this time, the function perfectly guards against that possibility. The line specified in the function's parameters is always clipped to the VRAM region as well, so we can also rule out illegal memory accesses here…

… or can we? Stepping through it all reminded me of how this function brings unblitting sloppiness to the next level: For each VRAM byte touched, ZUN actually unblits the 4 surrounding bytes, adding one byte to the left and two bytes to the right, and using a single 32-bit read and write per bitplane. So what happens if the function tries to unblit the topmost byte of VRAM, covering the pixel positions from (0, 0) to (7, 0) inclusive? The VRAM offset of 0x0000 is decremented to 0xFFFF to cover the one byte to the left, 4 bytes are written to this address, the CPU's internal offset overflows… and as it turns out, that is illegal even in Real Mode as of the 80286, and will raise a General Protection Fault. Which is… ignored by DOSBox-X, every Neko Project II version in common use, the CSCP emulators, SL9821, and T98-Next. Only Anex86 accurately emulates the behavior of real hardware here.

OK, but no laser fired by Elis ever reaches the top-left corner of the screen. How can such a fault even happen in practice? That's where the broken laser reset+unblit function comes in: Not only does it just flat out pass the wrong parameters to the line unblitting function – describing the line already traveled by the laser and stopping where the laser begins – but it also passes them wrongly, in the form of raw 32-bit fixed-point Q24.8 values, with no conversion other than a truncation to the signed 16-bit pixels expected by the function. What then follows is an attempt at interpolation and clipping to find a line segment between those garbage coordinates that actually falls within the boundaries of VRAM:

  1. right/bottom correspond to a laser's origin position, and left/top to the leftmost pixel of its moved-out top line. The bug therefore only occurs with lasers that stopped growing and have started moving.
  2. Moreover, it will only happen if either (left % 256) or (right % 256) is ≤ 127 and the other one of the two is ≥ 128. The typecast to signed 16-bit integers then turns the former into a large positive value and the latter into a large negative value, triggering the function's clipping code.
  3. The function then follows Bresenham's algorithm: left is ensured to be smaller than right by swapping the two values if necessary. If that happened, top and bottom are also swapped, regardless of their value – the algorithm does not care about their order.
  4. The slope in the X dimension is calculated using an integer division of ((bottom - top) / (right - left)). Both subtractions are done on signed 16-bit integers, and overflow accordingly.
  5. (-left × slope_x) is added to top, and left is set to 0.
  6. If both top and bottom are < 0 or ≥ 640, there's nothing to be unblitted. Otherwise, the final coordinates are clipped to the VRAM range of [(0, 0), (639, 399)].
  7. If the function got this far, the line to be unblitted is now very likely to reach from
    1. the top-left to the bottom-right corner, starting out at (0, 0) right away, or
    2. from the bottom-left corner to the top-right corner. In this case, you'd expect unblitting to end at (639, 0), but thanks to an off-by-one error, it actually ends at (640, -1), which is equivalent to (0, 0). Why add clipping to VRAM offset calculations when everything else is clipped already, right? :godzun:
Possible laser states that will cause the fault, with some debug output to help understand the cause, and any pellets removed for better readability. This can happen for all bosses that can potentially have shootout lasers on screen when being defeated, so it also applies to Mima. Fixing this is easier than understanding why it happens, but since y'all love reading this stuff…

tl;dr: TH01 has a high chance of freezing at a boss defeat sequence if there are diagonally moving lasers on screen, and if your PC-98 system raises a General Protection Fault on a 4-byte write to offset 0xFFFF, and if you don't run a TSR with an INT 0Dh handler that might handle this fault differently.

The easiest fix option would be to just remove the attempted laser unblitting entirely, but that would also have an impact on this game's… distinctive visual glitches, in addition to touching a whole lot of code bytes. If I ever get funded to work on a hypothetical TH01 Anniversary Edition that completely rearchitects the game to fix all these glitches, it would be appropriate there, but not for something that purports to be the original game.

(Sidenote to further hype up this Anniversary Edition idea for PC-98 hardware owners: With the amount of performance left on the table at every corner of this game, I'm pretty confident that we can get it to work decently on PC-98 models with just an 80286 CPU.)

Since we're in critical infrastructure territory once again, I went for the most conservative fix with the least impact on the binary: Simply changing any VRAM offsets >= 0xFFFD to 0x0000 to avoid the GPF, and leaving all other bugs in place. Sure, it's rather lazy and "incorrect"; the function still unblits a 32-pixel block there, but adding a special case for blitting 24 pixels would add way too much code. And seriously, it's not like anything happens in the 8 pixels between (24, 0) and (31, 0) inclusive during gameplay to begin with. To balance out the additional per-row if() branch, I inlined the VRAM page change I/O, saving two function calls and one memory write per unblitted row.

That means it's time for a new community_choice_fixes build, containing the new definitive bugfixed versions of these games: 2022-05-31-community-choice-fixes.zip Check the th01_critical_fixes branch for the modified TH01 code. It also contains a fix for the HP bar heap corruption in test or debug mode – simply changing the == comparison to <= is enough to avoid it, and negative HP will still create aesthetic glitch art.


Once again, I then was left with ½ of a push, which I finally filled with some FUUIN.EXE code, specifically the verdict screen. The most interesting part here is the player title calculation, which is quite sneaky: There are only 6 skill levels, but three groups of titles for each level, and the title you'll see is picked from a random group. It looks like this is the first time anyone has documented the calculation?
As for the levels, ZUN definitely didn't expect players to do particularly well. With a 1cc being the standard goal for completing a Touhou game, it's especially funny how TH01 expects you to continue a lot: The code has branches for up to 21 continues, and the on-screen table explicitly leaves room for 3 digits worth of continues per 5-stage scene. Heck, these counts are even stored in 32-bit long variables.

Next up: 📝 Finally finishing the long overdue Touhou Patch Center MediaWiki update work, while continuing with Kikuri in the meantime. Originally I wasn't sure about what to do between Elis and Seihou, but with Ember2528's surprise contribution last week, y'all have demonstrated more than enough interest in the idea of getting TH01 done sooner rather than later. And I agree – after all, we've got the 25th anniversary of its first public release coming up on August 15, and I might still manage to completely decompile this game by that point…

📝 Posted:
🚚 Summary of:
P0190, P0191, P0192
Commits:
5734815...293e16a, 293e16a...71cb7b5, 71cb7b5...e1f3f9f
💰 Funded by:
nrook, -Tom-, [Anonymous]
🏷 Tags:

The important things first:

So, Shinki! As far as final boss code is concerned, she's surprisingly economical, with 📝 her background animations making up more than ⅓ of her entire code. Going straight from TH01's 📝 final 📝 bosses to TH05's final boss definitely showed how much ZUN had streamlined danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there is still room for improvement: TH05 not only 📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month, but also uses them 4× as often, and even for midbosses. Most importantly though, defining danmaku patterns using a single global instance of the group template structure is just bad no matter how you look at it:

Declaring a separate structure instance with the static data for every pattern would be both safer and more space-efficient, and there's more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow. The "devil" pattern is significantly more complex than the others, but still far from TH01's final bosses at their worst. I especially like the clear architectural separation between "one-shot pattern" functions that return true once they're done, and "looping pattern" functions that run as long as they're being called from a boss's main function. Not many all too interesting things in these pattern functions for the most part, except for two pieces of evidence that Shinki was coded after Yumeko:


Speaking about that wing sprite: If you look at ST05.BB2 (or any other file with a large sprite, for that matter), you notice a rather weird file layout:

Raw file layout of TH05's ST05.BB2, demonstrating master.lib's supposed BFNT width limit of 64 pixels
A large sprite split into multiple smaller ones with a width of 64 pixels each? What's this, hardware sprite limitations? On my PC-98?!

And it's not a limitation of the sprite width field in the BFNT+ header either. Instead, it's master.lib's BFNT functions which are limited to sprite widths up to 64 pixels… or at least that's what MASTER.MAN claims. Whatever the restriction was, it seems to be completely nonexistent as of master.lib version 0.23, and none of the master.lib functions used by the games have any issues with larger sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the game that expects Shinki's winged form to consist of 4 physical sprites, not just 1. Any conversion from another, more logical sprite sheet layout back into BFNT+ must therefore replicate the original number of sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly loaded sprite no longer match ZUN's hardcoded IDs, causing the game to crash. This is exactly what used to happen with -Tom-'s MysticTK automation scripts, which combined these exact sprites into a single large one. This issue has now been fixed – just in case there are some underground modders out there who used these scripts and wonder why their game crashed as soon as the Shinki fight started.


And then the code quality takes a nosedive with Shinki's main function. :onricdennat: Even in TH05, these boss and midboss update functions are still very imperative:

The biggest WTF in there, however, goes to using one of the 16 state bytes as a "relative phase" variable for differentiating between boss phases that share the same branch within the switch(boss.phase) statement. While it's commendable that ZUN tried to reduce code duplication for once, he could have just branched depending on the actual boss.phase variable? The same state byte is then reused in the "devil" pattern to track the activity state of the big jerky lasers in the second half of the pattern. If you somehow managed to end the phase after the first few bullets of the pattern, but before these lasers are up, Shinki's update function would think that you're still in the phase before the "devil" pattern. The main function then sequence-breaks right to the defeat phase, skipping the final pattern with the burning Makai background. Luckily, the HP boundaries are far away enough to make this impossible in practice.
The takeaway here: If you want to use the state bytes for your custom boss script mods, alias them to your own 16-byte structure, and limit each of the bytes to a clearly defined meaning across your entire boss script.

One final discovery that doesn't seem to be documented anywhere yet: Shinki actually has a hidden bomb shield during her two purple-wing phases. uth05win got this part slightly wrong though: It's not a complete shield, and hitting Shinki will still deal 1 point of chip damage per frame. For comparison, the first phase lasts for 3,000 HP, and the "devil" pattern phase lasts for 5,800 HP.

And there we go, 3rd PC-98 Touhou boss script* decompiled, 28 to go! 🎉 In case you were expecting a fix for the Shinki death glitch: That one is more appropriately fixed as part of the Mai & Yuki script. It also requires new code, should ideally look a bit prettier than just removing cheetos between one frame and the next, and I'd still like it to fit within the original position-dependent code layout… Let's do that some other time.
Not much to say about the Stage 1 midboss, or midbosses in general even, except that their update functions have to imperatively handle even more subsystems, due to the relative lack of helper functions.


The remaining ¾ of the third push went to a bunch of smaller RE and finalization work that would have hardly got any attention otherwise, to help secure that 50% RE mark. The nicest piece of code in there shows off what looks like the optimal way of setting up the 📝 GRCG tile register for monochrome blitting in a variable color:

mov ah, palette_index ; Any other non-AL 8-bit register works too.
                      ; (x86 only supports AL as the source operand for OUTs.)

rept 4                ; For all 4 bitplanes…
    shr ah,  1        ; Shift the next color bit into the x86 carry flag
    sbb al,  al       ; Extend the carry flag to a full byte
                      ; (CF=0 → 0x00, CF=1 → 0xFF)
    out 7Eh, al       ; Write AL to the GRCG tile register
endm

Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles into a surprisingly nice one-liner. What a beautiful micro-optimization, at a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there, becoming increasingly dumb and undecompilable. Was it really necessary to save 4 x86 instructions in the highly unlikely case of a new spark sprite being spawned outside the playfield? That one 2D polar→Cartesian conversion function then pointed out Turbo C++ 4.0J's woefully limited support for 32-bit micro-optimizations. The code generation for 32-bit 📝 pseudo-registers is so bad that they almost aren't worth using for arithmetic operations, and the inline assembler just flat out doesn't support anything 32-bit. No use in decompiling a function that you'd have to entirely spell out in machine code, especially if the same function already exists in multiple other, more idiomatic C++ variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC replay file reading code, which should finally prove that nothing about the game's original replay system could serve as even just the foundation for community-usable replays. Just in case anyone was still thinking that.


Next up: Back to TH01, with the Elis fight! Got a bit of room left in the cap again, and there are a lot of things that would make a lot of sense now:

📝 Posted:
🚚 Summary of:
P0189
Commits:
22abdd1...b4876b6
💰 Funded by:
Arandui, Lmocinemod
🏷 Tags:

(Before we start: Make sure you've read the current version of the FAQ section on a potential takedown of this project, updated in light of the recent DMCA claims against PC-98 Touhou game downloads.)


Slight change of plans, because we got instructions for reliably reproducing the TH04 Kurumi Divide Error crash! Major thanks to Colin Douglas Howell. With those, it also made sense to immediately look at the crash in the Stage 4 Marisa fight as well. This way, I could release both of the obligatory bugfix mods at the same time.
Especially since it turned out that I was wrong: Both crashes are entirely unrelated to the custom entity structure that would have required PI-centric progress. They are completely specific to Kurumi's and Marisa's danmaku-pattern code, and really are two separate bugs with no connection to each other. All of the necessary research nicely fit into Arandui's 0.5 pushes, with no further deep understanding required here.

But why were there still three weeks between Colin's message and this blog post? DMCA distractions aside: There are no easy fixes this time, unlike 📝 back when I looked at the Stage 5 Yuuka crash. Just like how division by zero is undefined in mathematics, it's also, literally, undefined what should happen instead of these two Divide error crashes. This means that any possible "fix" can only ever be a fanfiction interpretation of the intentions behind ZUN's code. The gameplay community should be aware of this, and might decide to handle these cases differently. And if we have to go into fanfiction territory to work around crashes in the canon games, we'd better document what exactly we're fixing here and how, as comprehensible as possible.

  1. Kurumi's crash
  2. Marisa's crash

With that out of the way, let's look at Kurumi's crash first, since it's way easier to grasp. This one is known to primarily happen to new players, and it's easy to see why:

The pattern that causes the crash in Kurumi's fight. Also demonstrates how the number of bullets in a ring is always halved on Easy Mode after the rank-based tuning, leading to just a 3-ring on playperf = 16.

So, what should the workaround look like? Obviously, we want to modify neither the default number of ring bullets nor the tuning algorithm – that would change all other non-crashing variations of this pattern on other difficulties and ranks, creating a fork of the original gameplay. Instead, I came up with four possible workarounds that all seemed somewhat logical to me:

  1. Firing no bullet, i.e., interpreting 0-ring literally. This would create the only constellation in which a call to the bullet group spawn functions would not spawn at least one new bullet.
  2. Firing a "1-ring", i.e., a single bullet. This would be consistent with how the bullet spawn functions behave for "0-way" stack and spread groups.
  3. Firing a "∞-ring", i.e., 200 bullets, which is as much as the game's cap on 16×16 bullets would allow. This would poke fun at the whole "division by zero" idea… but given that we're still talking about Easy Mode (and especially new players) here, it might be a tad too cruel. Certainly the most trollish interpretation.
  4. Triggering an immediate Game Over, exchanging the hard crash for a softer and more controlled shutdown. Certainly the option that would be closest to the behavior of the original games, and perhaps the only one to be accepted in Serious, High-Level Play™.

As I was writing this post, it felt increasingly wrong for me to make this decision. So I once again went to Twitter, where 56.3% voted in favor of the 1-bullet option. Good that I asked! I myself was more leaning towards the 0-bullet interpretation, which only got 28.7% of the vote. Also interesting are the 2.3% in favor of the Game Over option but I get it, low-rank Easy Mode isn't exactly the most competitive mode of playing TH04.
There are reports of Kurumi crashing on higher difficulties as well, but I could verify none of them. If they aren't fixed by this workaround, they're caused by an entirely different bug that we have yet to discover.


Onto the Stage 4 Marisa crash then, which does in fact apply to all difficulty levels. I was also wrong on this one – it's a hell of a lot more intricate than being just a division by the number of on-screen bits. Without having decompiled the entire fight, I can't give a completely accurate picture of what happens there yet, but here's the rough idea:

Reference points for Marisa's point-reflected movement. Cyan: Marisa's position, green: (192, 112), yellow: the intended end point.
One of the two patterns in TH04's Stage 4 Marisa boss fight that feature frame number-dependent point-reflected movement. The bits were hacked to self-destruct on the respective frame.

tl;dr: "Game crashes if last bit destroyed within 4-frame window near end of two patterns". For an informed decision on a new movement behavior for these last 8 frames, we definitely need to know all the details behind the crash though. Here's what I would interpret into the code:

  1. Not moving at all, i.e., interpreting 0 as the middle ground between positive and negative movement. This would also make sense because a 12-frame duration implies 100% of the movement to consist of the braking phase – and Marisa wasn't moving before, after all.
  2. Move at maximum speed, i.e., dividing by 1 rather than 0. Since the movement duration is still 12 in this case, Marisa will immediately start braking. In total, she will move exactly ¾ of the way from her initial position to (192, 112) within the 8 frames before the pattern ends.
  3. Directly warping to (192, 112) on frame 0, and to the point-reflected target on 4, respectively. This "emulates" the division by zero by moving Marisa at infinite speed to the exact two points indicated by the velocity formula. It also fits nicely into the 8 frames we have to fill here. Sure, Marisa can't reach these points at any other duration, but why shouldn't she be able to, with infinite speed? Then again, if Marisa is far away enough from (192, 112), this workaround would warp her across the entire playfield. Can Marisa teleport according to lore? I have no idea… :tannedcirno:
  4. Triggering an immediate Game O– hell no, this is the Stage 4 boss, people already hate losing runs to this bug!

Asking Twitter worked great for the Kurumi workaround, so let's do it again! Gotta attach a screenshot of an earlier draft of this blog post though, since this stuff is impossible to explain in tweets…

…and it went through the roof, becoming the most successful ReC98 tweet so far?! Apparently, y'all really like to just look at descriptions of overly complex bugs that I'd consider way beyond the typical attention span that can be expected from Twitter. Unfortunately, all those tweet impressions didn't quite translate into poll turnout. The results were pretty evenly split between 1) and 2), with option 1) just coming out slightly ahead at 49.1%, compared to 41.5% of option 2).

(And yes, I only noticed after creating the poll that warping to both the green and yellow points made more sense than warping to just one of the two. Let's hope that this additional variant wouldn't have shifted the results too much. Both warp options only got 9.4% of the vote after all, and no one else came up with the idea either. :onricdennat: In the end, you can always merge together your preferred combination of workarounds from the Git branches linked below.)


So here you go: The new definitive version of TH04, containing not only the community-chosen Kurumi and Stage 4 Marisa workaround variant, but also the 📝 No-EMS bugfix from last year. Edit (2022-05-31): This package is outdated, 📝 the current version is here! 2022-04-18-community-choice-fixes.zip Oh, and let's also add spaztron64's TH03 GDC clock fix from 2019 because why not. This binary was built from the community_choice_fixes branch, and you can find the code for all the individual workarounds on these branches:

Again, because it can't be stated often enough: These fixes are fanfiction. The gameplay community should be aware of this, and might decide to handle these cases differently.


With all of that taking way more time to evaluate and document, this research really had to become part of a proper push, instead of just being covered in the quick non-push blog post I initially intended. With ½ of a push left at the end, TH05's Stage 1-5 boss background rendering functions fit in perfectly there. If you wonder how these static backdrop images even need any boss-specific code to begin with, you're right – it's basically the same function copy-pasted 4 times, differing only in the backdrop image coordinates and some other inconsequential details.
Only Sara receives a nice variation of the typical 📝 blocky entrance animation: The usually opaque bitmap data from ST00.BB is instead used as a transition mask from stage tiles to the backdrop image, by making clever use of the tile invalidation system:

TH04 uses the same effect a bit more frequently, for its first three bosses.

Next up: Shinki, for real this time! I've already managed to decompile 10 of her 11 danmaku patterns within a little more than one push – and yes, that one is included in there. Looks like I've slightly overestimated the amount of work required for TH04's and TH05's bosses…

📝 Posted:
🚚 Summary of:
P0186, P0187, P0188
Commits:
a21ab3d...bab5634, bab5634...426a531, 426a531...e881f95
💰 Funded by:
Blue Bolt, [Anonymous], nrook
🏷 Tags:

Did you know that moving on top of a boss sprite doesn't kill the player in TH04, only in TH05?

Screenshot of Reimu moving on top of Stage 6 Yuuka, demonstrating the lack of boss↔player collision in TH04
Yup, Reimu is not getting hit… yet.

That's the first of only three interesting discoveries in these 3 pushes, all of which concern TH04. But yeah, 3 for something as seemingly simple as these shared boss functions… that's still not quite the speed-up I had hoped for. While most of this can be blamed, again, on TH04 and all of its hardcoded complexities, there still was a lot of work to be done on the maintenance front as well. These functions reference a bunch of code I RE'd years ago and that still had to be brought up to current standards, with the dependencies reaching from 📝 boss explosions over 📝 text RAM overlay functionality up to in-game dialog loading.

The latter provides a good opportunity to talk a bit about x86 memory segmentation. Many aspiring PC-98 developers these days are very scared of it, with some even going as far as to rather mess with Protected Mode and DOS extenders just so that they don't have to deal with it. I wonder where that fear comes from… Could it be because every modern programming language I know of assumes memory to be flat, and lacks any standard language-level features to even express something like segments and offsets? That's why compilers have a hard time targeting 16-bit x86 these days: Doing anything interesting on the architecture requires giving the programmer full control over segmentation, which always comes down to adding the typical non-standard language extensions of compilers from back in the day. And as soon as DOS stopped being used, these extensions no longer made sense and were subsequently removed from newer tools. A good example for this can be found in an old version of the NASM manual: The project started as an attempt to make x86 assemblers simple again by throwing out most of the segmentation features from MASM-style assemblers, which made complete sense in 1996 when 16-bit DOS and Windows were already on their way out. But there was a point to all those features, and that's why ReC98 still has to use the supposedly inferior TASM.

Not that this fear of segmentation is completely unfounded: All the segmentation-related keywords, directives, and #pragmas provided by Borland C++ and TASM absolutely can be the cause of many weird runtime bugs. Even if the compiler or linker catches them, you are often left with confusing error messages that aged just as poorly as memory segmentation itself.
However, embracing the concept does provide quite the opportunity for optimizations. While it definitely was a very crazy idea, there is a small bit of brilliance to be gained from making proper use of all these segmentation features. Case in point: The buffer for the in-game dialog scripts in TH04 and TH05.

// Thanks to the semantics of `far` pointers, we only need a single 32-bit
// pointer variable for the following code.
extern unsigned char far *dialog_p;

// This master.lib function returns a `void __seg *`, which is a 16-bit
// segment-only pointer. Converting to a `far *` yields a full segment:offset
// pointer to offset 0000h of that segment.
dialog_p = (unsigned char far *)hmem_allocbyte(/* … */);

// Running the dialog script involves pointer arithmetic. On a far pointer,
// this only affects the 16-bit offset part, complete with overflow at 64 KiB,
// from FFFFh back to 0000h.
dialog_p += /* … */;
dialog_p += /* … */;
dialog_p += /* … */;

// Since the segment part of the pointer is still identical to the one we
// allocated above, we can later correctly free the buffer by pulling the
// segment back out of the pointer.
hmem_free((void __seg *)dialog_p);

If dialog_p was a huge pointer, any pointer arithmetic would have also adjusted the segment part, requiring a second pointer to store the base address for the hmem_free call. Doing that will also be necessary for any port to a flat memory model. Depending on how you look at it, this compression of two logical pointers into a single variable is either quite nice, or really, really dumb in its reliance on the precise memory model of one single architecture. :tannedcirno:


Why look at dialog loading though, wasn't this supposed to be all about shared boss functions? Well, TH04 unnecessarily puts certain stage-specific code into the boss defeat function, such as loading the alternate Stage 5 Yuuka defeat dialog before a Bad Ending, or initializing Gengetsu after Mugetsu's defeat in the Extra Stage.
That's TH04's second core function with an explicit conditional branch for Gengetsu, after the 📝 dialog exit code we found last year during EMS research. And I've heard people say that Shinki was the most hardcoded fight in PC-98 Touhou… Really, Shinki is a perfectly regular boss, who makes proper use of all internal mechanics in the way they were intended, and doesn't blast holes into the architecture of the game. Even within TH05, it's Mai and Yuki who rely on hacks and duplicated code, not Shinki.

The worst part about this though? How the function distinguishes Mugetsu from Gengetsu. Once again, it uses its own global variable to track whether it is called the first or the second time within TH04's Extra Stage, unrelated to the same variable used in the dialog exit function. But this time, it's not just any newly created, single-use variable, oh no. In a misguided attempt to micro-optimize away a few bytes of conventional memory, TH04 reserves 16 bytes of "generic boss state", which can (and are) freely used for anything a boss doesn't want to store in a more dedicated variable.
It might have been worth it if the bosses actually used most of these 16 bytes, but the majority just use (the same) two, with only Stage 4 Reimu using a whopping seven different ones. To reverse-engineer the various uses of these variables, I pretty much had to map out which of the undecompiled danmaku-pattern functions corresponds to which boss fight. In the end, I assigned 29 different variable names for each of the semantically different use cases, which made up another full push on its own.

Now, 16 bytes of wildly shared state, isn't that the perfect recipe for bugs? At least during this cursory look, I haven't found any obvious ones yet. If they do exist, it's more likely that they involve reused state from earlier bosses – just how the Shinki death glitch in TH05 is caused by reusing cheeto data from way back in Stage 4 – and hence require much more boss-specific progress.
And yes, it might have been way too early to look into all these tiny details of specific boss scripts… but then, this happened:

TH04 crashing to the DOS prompt in the Stage 4 Marisa fight, right as the last of her bits is destroyed

Looks similar to another screenshot of a crash in the same fight that was reported in December, doesn't it? I was too much in a hurry to figure it out exactly, but notice how both crashes happen right as the last of Marisa's four bits is destroyed. KirbyComment has suspected this to be the cause for a while, and now I can pretty much confirm it to be an unguarded division by the number of on-screen bits in Marisa-specific pattern code. But what's the cause for Kurumi then? :thonk:
As for fixing it, I can go for either a fast or a slow option:

  1. Superficially fixing only this crash will probably just take a fraction of a push.
  2. But I could also go for a deeper understanding by looking at TH04's version of the 📝 custom entity structure. It not only stores the data of Marisa's bits, but is also very likely to be involved in Kurumi's crash, and would get TH04 a lot closer to 100% PI. Taking that look will probably need at least 2 pushes, and might require another 3-4 to completely decompile Marisa's fight, and 2-3 to decompile Kurumi's.

OK, now that that's out of the way, time to finish the boss defeat function… but not without stumbling over the third of TH04's quirks, relating to the Clear Bonus for the main game or the Extra Stage:

And after another few collision-related functions, we're now truly, finally ready to decompile bosses in both TH04 and TH05! Just as the anything funds were running out… :onricdennat: The remaining ¼ of the third push then went to Shinki's 32×32 ball bullets, rounding out this delivery with a small self-contained piece of the first TH05 boss we're probably going to look at.

Next up, though: I'm not sure, actually. Both Shinki and Elis seem just a little bit larger than the 2¼ or 4 pushes purchased so far, respectively. Now that there's a bunch of room left in the cap again, I'll just let the next contribution decide – with a preference for Shinki in case of a tie. And if it will take longer than usual for the store to sell out again this time (heh), there's still the 📝 PC-98 text RAM JIS trail word rendering research waiting to be documented.

📝 Posted:
🚚 Summary of:
P0184, P0185
Commits:
f9d983e...f918298, f918298...a21ab3d
💰 Funded by:
-Tom-, Blue Bolt, [Anonymous]
🏷 Tags:

Two years after 📝 the first look at TH04's and TH05's bullets, we finally get to finish their logic code by looking at the special motion types. Bullets as a whole still aren't completely finished as the rendering code is still waiting to be RE'd, but now we've got everything about them that's required for decompiling the midboss and boss fights of these games.

Just like the motion types of TH01's pellets, the ones we've got here really are special enough to warrant an enum, despite all the overlap in the "slow down and turn" and "bounce at certain edges of the playfield" types. Sure, including them in the bitfield I proposed two years ago would have allowed greater variety, but it wouldn't have saved any memory. On the contrary: These types use a single global state variable for the maximum turn count and delta speed, which a proper customizable architecture would have to integrate into the bullet structure. Maybe it is possible to stuff everything into the same amount of bytes, but not without first completely rearchitecting the bullet structure and removing every single piece of redundancy in there. Simply extending the system by adding a new enum value for a new motion type would be way more straightforward for modders.

Speaking about memory, TH05 already extends the bullet structure by 6 bytes for the "exact linear movement" type exclusive to that game. This type is particularly interesting for all the prospective PC-98 game developers out there, as it nicely points out the precision limits of Q12.4 subpixels.
Regular bullet movement works by adding a Q12.4 velocity to a Q12.4 position every frame, with the velocity typically being calculated only once on spawn time from an 8-bit angle and a Q12.4 speed. Quantization errors from this initial calculation can quickly compound over all the frames a bullet spends moving across the playfield. If a bullet is only supposed to move on a straight line though, there is a more precise way of calculating its position: By storing the origin point, movement angle, and total distance traveled, you can perform a full polar→Cartesian transformation every frame. Out of the 10 danmaku patterns in TH05 that use this motion type, the difference to regular bullet movement can be best seen in Louise's final pattern:

Louise's final pattern in its original form, demonstrating exact linear bullet movement. Note how each bullet spawns slightly behind the delay cloud: ZUN simply forgot to shift the fixed origin point along with it.
The same pattern with standard bullet movement, corrupting its intended appearance. No delay cloud-related oversights here though, at least.

Not far away from the regular bullet code, we've also got the movement function for the infamous curve / "cheeto" bullets. I would have almost called them "cheetos" in the code as well, which surely fits more nicely into 8.3 filenames than "curve bullets" does, but eh, trademarks…

As for hitboxes, we got a 16×16 one on the head node, and a 12×12 one on the 16 trail nodes. The latter simply store the position of the head node during the last 16 frames, Snake style. But what you're all here for is probably the turning and homing algorithm, right? Boiled down to its essence, it works like this:

// [head] points to the controlled "head" part of a curve bullet entity.
// Angles are stored with 8 bits representing a full circle, providing free
// normalization on arithmetic overflow.
// The directions are ordered as you would expect:
// • 0x00: right	(sin(0x00) =  0, cos(0x00) = +1)
// • 0x40: down 	(sin(0x40) = +1, cos(0x40) =  0)
// • 0x80: left 	(sin(0x80) =  0, cos(0x80) = -1)
// • 0xC0: up   	(sin(0xC0) = -1, cos(0xC0) =  0)
uint8_t angle_delta = (head->angle - player_angle_from(
	head->pos.cur.x, head->pos.cur.y
));

// Stop turning if the player is 1/128ths of a circle away from this bullet
const uint8_t SNAP = 0x02;

// Else, turn either clockwise or counterclockwise by 1/256th of a circle,
// depending on what would reach the player the fastest.
if((angle_delta > SNAP) && (angle_delta < static_cast<uint8_t>(-SNAP))) {
	angle_delta = (angle_delta >= 0x80) ? -0x01 : +0x01;
}
head_p->angle -= angle_delta;

5 lines of code, and not all too difficult to follow once you are familiar with 8-bit angles… unlike what ZUN actually wrote. Which is 26 lines, and includes an unused "friction" variable that is never set to any value that makes a difference in the formula. :zunpet: uth05win correctly saw through that all and simplified this code to something equivalent to my explanation. Redoing that work certainly wasted a bit of my time, and means that I now definitely need to spend another push on RE'ing all the shared boss functions before I can start with Shinki.

So while a curve bullet's speed does get faster over time, its angular velocity is always limited to 1/256th of a circle per frame. This reveals the optimal strategy for dodging them: Maximize this delta angle by staying as close to 180° away from their current direction as possible, and let their acceleration do the rest.

At least that's the theory for dodging a single one. As a danmaku designer, you can now of course place other bullets at these technically optimal places to prevent a curve bullet pattern from being cheesed like that. I certainly didn't record the video above in a single take either… :tannedcirno:


After another bunch of boring entity spawn and update functions, the playfield shaking feature turned out as the most notable (and tricky) one to round out these two pushes. It's actually implemented quite well in how it simply "un-shakes" the screen by just marking every stage tile to be redrawn. In the context of all the other tile invalidation that can take place during a frame, that's definitely more performant than 📝 doing another EGC-accelerated memmove(). Due to these two games being double-buffered via page flipping, this invalidation only really needs to happen for the frame after the next one though. The immediately next frame will show the regular, un-shaken playfield on the other VRAM page first, except during the multi-frame shake animation when defeating a midboss, where it will also appear shifted in a different direction… 😵 Yeah, no wonder why ZUN just always invalidates all stage tiles for the next two frames after every shaking animation, which is guaranteed to handle both sporadic single-frame shakes and continuous ones. So close to good-code here.

Finally, this delivery was delayed a bit because -Tom- requested his round-up amount to be limited to the cap in the future. Since that makes it kind of hard to explain on a static page how much money he will exactly provide, I now properly modeled these discounts in the website code. The exact round-up amount is now included in both the pre-purchase breakdown, as well as the cap bar on the main page.
With that in place, the system is now also set up for round-up offers from other patrons. If you'd also like to support certain goals in this way, with any amount of money, now's the time for getting in touch with me about that. Known contributors only, though! 😛

Next up: The final bunch of shared boring boss functions. Which certainly will give me a break from all the maintenance and research work, and speed up delivery progress again… right?

📝 Posted:
🚚 Summary of:
P0182, P0183
Commits:
313450f...1e2c7ad, 1e2c7ad...f9d983e
💰 Funded by:
Lmocinemod, [Anonymous], Yanga
🏷 Tags:

Been 📝 a while since we last looked at any of TH03's game code! But before that, we need to talk about Y coordinates.

During TH03's MAIN.EXE, the PC-98 graphics GDC runs in its line-doubled 640×200 resolution, which gives the in-game portion its distinctive stretched low-res look. This lower resolution is a consequence of using 📝 Promisence Soft's SPRITE16 driver: Its performance simply stems from the fact that it expects sprites to be stored in the bottom half of VRAM, which allows them to be blitted using the same EGC-accelerated VRAM-to-VRAM copies we've seen again and again in all other games. Reducing the visible resolution also means that the sprites can be stored on both VRAM pages, allowing the game to still be double-buffered. If you force the graphics chip to run at 640×400, you can see them:

The full VRAM contents during TH03's in-game portion, as seen when forcing the system into a 640×400 resolution.
TH03's VRAM at regular line-doubled 640×200 resolutionTH03's VRAM at full 640×400 resolution, including the SPRITE16 sprite areaTH03's text layer during an in-game round.

Note that the text chip still displays its overlaid contents at 640×400, which means that TH03's in-game portion technically runs at two resolutions at the same time.

But that means that any mention of a Y coordinate is ambiguous: Does it refer to undoubled VRAM pixels, or on-screen stretched pixels? Especially people who have known about the line doubling for years might almost expect technical blog posts on this game to use undoubled VRAM coordinates. So, let's introduce a new formatting convention for both on-screen 640×400 and undoubled 640×200 coordinates, and always write out both to minimize the confusion.


Alright, now what's the thing gonna be? The enemy structure is highly overloaded, being used for enemies, fireballs, and explosions with seemingly different semantics for each. Maybe a bit too much to be figured out in what should ideally be a single push, especially with all the functions that would need to be decompiled? Bullet code would be easier, but not exactly single-push material either. As it turns out though, there's something more fundamental left to be done first, which both of these subsystems depend on: collision detection!

And it's implemented exactly how I always naively imagined collision detection to be implemented in a fixed-resolution 2D bullet hell game with small hitboxes: By keeping a separate 1bpp bitmap of both playfields in memory, drawing in the collidable regions of all entities on every frame, and then checking whether any pixels at the current location of the player's hitbox are set to 1. It's probably not done in the other games because their single data segment was already too packed for the necessary 17,664 bytes to store such a bitmap at pixel resolution, and 282,624 bytes for a bitmap at Q12.4 subpixel resolution would have been prohibitively expensive in 16-bit Real Mode DOS anyway. In TH03, on the other hand, this bitmap is doubly useful, as the AI also uses it to elegantly learn what's on the playfield. By halving the resolution and only tracking tiles of 2×2 / 2×1 pixels, TH03 only requires an adequate total of 6,624 bytes of memory for the collision bitmaps of both playfields.

So how did the implementation not earn the good-code tag this time? Because the code for drawing into these bitmaps is undecompilable hand-written x86 assembly. :zunpet: And not just your usual ASM that was basically compiled from C and then edited to maybe optimize register allocation and maybe replace a bunch of local variables with self-modifying code, oh no. This code is full of overly clever bit twiddling, abusing the fact that the 16-bit AX, BX, CX, and DX registers can also be accessed as two 8-bit registers, calculations that change the semantic meaning behind the value of a register, or just straight-up reassignments of different values to the same small set of registers. Sure, in some way it is impressive, and it all does work and correctly covers every edge case, but come on. This could have all been a lot more readable in exchange for just a few CPU cycles.

What's most interesting though are the actual shapes that these functions draw into the collision bitmap. On the surface, we have:

  1. vertical slopes at any angle across the whole playfield; exclusively used for Chiyuri's diagonal laser EX attack
  2. straight vertical lines, with a width of 1 tile; exclusively used for the 2×2 / 2×1 hitboxes of bullets
  3. rectangles at arbitrary sizes

But only 2) actually draws a full solid line. 1) and 3) are only ever drawn as horizontal stripes, with a hardcoded distance of 2 vertical tiles between every stripe of a slope, and 4 vertical tiles between every stripe of a rectangle. That's 66-75% of each rectangular entity's intended hitbox not actually taking part in collision detection. Now, if player hitboxes were ≤ 6 / 3 pixels, we'd have one possible explanation of how the AI can "cheat", because it could just precisely move through those blank regions at TAS speeds. So, let's make this two pushes after all and tell the complete story, since this is one of the more interesting aspects to still be documented in this game.


And the code only gets worse. :godzun: While the player collision detection function is decompilable, it might as well not have been, because it's just more of the same "optimized", hard-to-follow assembly. With the four splittable 16-bit registers having a total of 20 different meanings in this function, I would have almost preferred self-modifying code…

In fact, it was so bad that it prompted some maintenance work on my inline assembly coding standards as a whole. Turns out that the _asm keyword is not only still supported in modern Visual Studio compilers, but also in Clang with the -fms-extensions flag, and compiles fine there even for 64-bit targets. While that might sound like amazing news at first ("awesome, no need to rewrite this stuff for my x86_64 Linux port!"), you quickly realize that almost all inline assembly in this codebase assumes either PC-98 hardware, segmented 16-bit memory addressing, or is a temporary hack that will be removed with further RE progress.
That's mainly because most of the raw arithmetic code uses Turbo C++'s register pseudovariables where possible. While they certainly have their drawbacks, being a non-standard extension that's not supported in other x86-targeting C compilers, their advantages are quite significant: They allow this code to stay in the same language, and provide slightly more immediate portability to any other architecture, together with 📝 readability and maintainability improvements that can get quite significant when combined with inlining:

// This one line compiles to five ASM instructions, which would need to be
// spelled out in any C compiler that doesn't support register pseudovariables.
// By adding typed aliases for these registers via `#define`, this code can be
// both made even more readable, and be prepared for an easier transformation
// into more portable local variables.
_ES = (((_AX * 4) + _BX) + SEG_PLANE_B);

However, register pseudovariables might cause potential portability issues as soon as they are mixed with inline assembly instructions that rely on their state. The lazy way of "supporting pseudo-registers" in other compilers would involve declaring the full set as global variables, which would immediately break every one of those instances:

_DI = 0;
_AX = 0xFFFF;

// Special x86 instruction doing the equivalent of
//
// 	*reinterpret_cast(MK_FP(_ES, _DI)) = _AX;
// 	_DI += sizeof(uint16_t);
//
// Only generated by Turbo C++ in very specific cases, and therefore only
// reliably available through inline assembly.
asm { movsw; }

What's also not all too standardized, though, are certain variants of the asm keyword. That's why I've now introduced a distinction between the _asm keyword for "decently sane" inline assembly, and the slightly less standard asm keyword for inline assembly that relies on the contents of pseudo-registers, and should break on compilers that don't support them.
So yeah, have some minor portability work in exchange for these two pushes not having all that much in RE'd content.

With that out of the way and the function deciphered, we can confirm the player hitboxes to be a constant 8×8 / 8×4 pixels, and prove that the hit stripes are nothing but an adequate optimization that doesn't affect gameplay in any way.


And what's the obvious thing to immediately do if you have both the collision bitmap and the player hitbox? Writing a "real hitbox" mod, of course:

  1. Reorder the calls to rendering functions so that player and shot sprites are rendered after bullets
  2. Blank out all player sprite pixels outside an 8×8 / 8×4 box around the center point
  3. After the bullet rendering function, turn on the GRCG in RMW mode and set the tile register set to the background color
  4. Stretch the negated contents of collision bitmap onto each playfield, leaving only collidable pixels untouched
  5. Do the same with the actual, non-negated contents and a white color, for extra contrast against the background. This also makes sure to show any collidable areas whose sprite pixels are transparent, such as with the moon enemy. (Yeah, how unfair.) Doing that also loses a lot of information about the playfield, such as enemy HP indicated by their color, but what can you do:
A decently busy TH03 in-game frame.The underlying content of the collision bitmap, showing off all three different shapes together with the player hitboxes.
A decently busy TH03 in-game frame and its underlying collision bitmap, showing off all three different collision shapes together with the player hitboxes.

2022-02-18-TH03-real-hitbox.zip The secret for writing such mods before having reached a sufficient level of position independence? Put your new code segment into DGROUP, past the end of the uninitialized data section. That's why this modded MAIN.EXE is a lot larger than you would expect from the raw amount of new code: The file now actually needs to store all these uninitialized 0 bytes between the end of the data segment and the first instruction of the mod code – normally, this number is simply a part of the MZ EXE header, and doesn't need to be redundantly stored on disk. Check the th03_real_hitbox branch for the code.

And now we know why so many "real hitbox" mods for the Windows Touhou games are inaccurate: The games would simply be unplayable otherwise – or can you dodge rapidly moving 2×2 / 2×1 blocks as an 8×8 / 8×4 rectangle that is smaller than your shot sprites, especially without focused movement? I can't. :tannedcirno: Maybe it will feel more playable after making explosions visible, but that would need more RE groundwork first.
It's also interesting how adding two full GRCG-accelerated redraws of both playfields per frame doesn't significantly drop the game's frame rate – so why did the drawing functions have to be micro-optimized again? It would be possible in one pass by using the GRCG's TDW mode, which should theoretically be 8× faster, but I have to stop somewhere. :onricdennat:

Next up: The final missing piece of TH04's and TH05's bullet-moving code, which will include a certain other type of projectile as well.

📝 Posted:
🚚 Summary of:
P0174, P0175, P0176, P0177, P0178, P0179, P0180, P0181
Commits:
27f901c...a0fe812, a0fe812...40ac9a7, 40ac9a7...c5dc45b, c5dc45b...5f0cabc, 5f0cabc...60621f8, 60621f8...9e5b344, 9e5b344...091f19f, 091f19f...313450f
💰 Funded by:
Ember2528, Yanga
🏷 Tags:

Here we go, TH01 Sariel! This is the single biggest boss fight in all of PC-98 Touhou: If we include all custom effect code we previously decompiled, it amounts to a total of 10.31% of all code in TH01 (and 3.14% overall). These 8 pushes cover the final 8.10% (or 2.47% overall), and are likely to be the single biggest delivery this project will ever see. Considering that I only managed to decompile 6.00% across all games in 2021, 2022 is already off to a much better start!

So, how can Sariel's code be that large? Well, we've got:

In total, it's just under 3,000 lines of C++ code, containing a total of 8 definite ZUN bugs, 3 of them being subpixel/pixel confusions. That might not look all too bad if you compare it to the 📝 player control function's 8 bugs in 900 lines of code, but given that Konngara had 0… (Edit (2022-07-17): Konngara contains two bugs after all: A 📝 possible heap corruption in test or debug mode, and the infamous 📝 temporary green discoloration.) And no, the code doesn't make it obvious whether ZUN coded Konngara or Sariel first; there's just as much evidence for either.

Some terminology before we start: Sariel's first form is separated into four phases, indicated by different background images, that cycle until Sariel's HP reach 0 and the second, single-phase form starts. The danmaku patterns within each phase are also on a cycle, and the game picks a random but limited number of patterns per phase before transitioning to the next one. The fight always starts at pattern 1 of phase 1 (the random purple lasers), and each new phase also starts at its respective first pattern.


Sariel's bugs already start at the graphics asset level, before any code gets to run. Some of the patterns include a wand raise animation, which is stored in BOSS6_2.BOS:

TH01 BOSS6_2.BOS
Umm… OK? The same sprite twice, just with slightly different colors? So how is the wand lowered again?

The "lowered wand" sprite is missing in this file simply because it's captured from the regular background image in VRAM, at the beginning of the fight and after every background transition. What I previously thought to be 📝 background storage code has therefore a different meaning in Sariel's case. Since this captured sprite is fully opaque, it will reset the entire 128×128 wand area… wait, 128×128, rather than 96×96? Yup, this lowered sprite is larger than necessary, wasting 1,967 bytes of conventional memory.
That still doesn't quite explain the second sprite in BOSS6_2.BOS though. Turns out that the black part is indeed meant to unblit the purple reflection (?) in the first sprite. But… that's not how you would correctly unblit that?

VRAM after blitting the first sprite of TH01's BOSS6_2.BOS VRAM after blitting the second sprite of TH01's BOSS6_2.BOS

The first sprite already eats up part of the red HUD line, and the second one additionally fails to recover the seal pixels underneath, leaving a nice little black hole and some stray purple pixels until the next background transition. :tannedcirno: Quite ironic given that both sprites do include the right part of the seal, which isn't even part of the animation.


Just like Konngara, Sariel continues the approach of using a single function per danmaku pattern or custom entity. While I appreciate that this allows all pattern- and entity-specific state to be scoped locally to that one function, it quickly gets ugly as soon as such a function has to do more than one thing.
The "bird function" is particularly awful here: It's just one if(…) {…} else if(…) {…} else if(…) {…} chain with different branches for the subfunction parameter, with zero shared code between any of these branches. It also uses 64-bit floating-point double as its subpixel type… and since it also takes four of those as parameters (y'know, just in case the "spawn new bird" subfunction is called), every call site has to also push four double values onto the stack. Thanks to Turbo C++ even using the FPU for pushing a 0.0 constant, we have already reached maximum floating-point decadence before even having seen a single danmaku pattern. Why decadence? Every possible spawn position and velocity in both bird patterns just uses pixel resolution, with no fractional component in sight. And there goes another 720 bytes of conventional memory.

Speaking about bird patterns, the red-bird one is where we find the first code-level ZUN bug: The spawn cross circle sprite suddenly disappears after it finished spawning all the bird eggs. How can we tell it's a bug? Because there is code to smoothly fly this sprite off the playfield, that code just suddenly forgets that the sprite's position is stored in Q12.4 subpixels, and treats it as raw screen pixels instead. :zunpet: As a result, the well-intentioned 640×400 screen-space clipping rectangle effectively shrinks to 38×23 pixels in the top-left corner of the screen. Which the sprite is always outside of, and thus never rendered again.
The intended animation is easily restored though:

Sariel's third pattern, and the first to spawn birds, in its original and fixed versions. Note that I somewhat fixed the bird hatch animation as well: ZUN's code never unblits any frame of animation there, and simply blits every new one on top of the previous one.

Also, did you know that birds actually have a quite unfair 14×38-pixel hitbox? Not that you'd ever collide with them in any of the patterns…

Another 3 of the 8 bugs can be found in the symmetric, interlaced spawn rays used in three of the patterns, and the 32×32 debris "sprites" shown at their endpoint, at the edge of the screen. You kinda have to commend ZUN's attention to detail here, and how he wrote a lot of code for those few rapidly animated pixels that you most likely don't even notice, especially with all the other wrong pixels resulting from rendering glitches. One of the bugs in the very final pattern of phase 4 even turns them into the vortex sprites from the second pattern in phase 1 during the first 5 frames of the first time the pattern is active, and I had to single-step the blitting calls to verify it.
It certainly was annoying how much time I spent making sense of these bugs, and all weird blitting offsets, for just a few pixels… Let's look at something more wholesome, shall we?


So far, we've only seen the PC-98 GRCG being used in RMW (read-modify-write) mode, which I previously 📝 explained in the context of TH01's red-white HP pattern. The second of its three modes, TCR (Tile Compare Read), affects VRAM reads rather than writes, and performs "color extraction" across all 4 bitplanes: Instead of returning raw 1bpp data from one plane, a VRAM read will instead return a bitmask, with a 1 bit at every pixel whose full 4-bit color exactly matches the color at that offset in the GRCG's tile register, and 0 everywhere else. Sariel uses this mode to make sure that the 2×2 particles and the wind effect are only blitted on top of "air color" pixels, with other parts of the background behaving like a mask. The algorithm:

  1. Set the GRCG to TCR mode, and all 8 tile register dots to the air color
  2. Read N bits from the target VRAM position to obtain an N-bit mask where all 1 bits indicate air color pixels at the respective position
  3. AND that mask with the alpha plane of the sprite to be drawn, shifted to the correct start bit within the 8-pixel VRAM byte
  4. Set the GRCG to RMW mode, and all 8 tile register dots to the color that should be drawn
  5. Write the previously obtained bitmask to the same position in VRAM

Quite clever how the extracted colors double as a secondary alpha plane, making for another well-earned good-code tag. The wind effect really doesn't deserve it, though:

As far as I can tell, ZUN didn't use TCR mode anywhere else in PC-98 Touhou. Tune in again later during a TH04 or TH05 push to learn about TDW, the final GRCG mode!


Speaking about the 2×2 particle systems, why do we need three of them? Their only observable difference lies in the way they move their particles:

  1. Up or down in a straight line (used in phases 4 and 2, respectively)
  2. Left or right in a straight line (used in the second form)
  3. Left and right in a sinusoidal motion (used in phase 3, the "dark orange" one)

Out of all possible formats ZUN could have used for storing the positions and velocities of individual particles, he chose a) 64-bit / double-precision floating-point, and b) raw screen pixels. Want to take a guess at which data type is used for which particle system?

If you picked double for 1) and 2), and raw screen pixels for 3), you are of course correct! :godzun: Not that I'm implying that it should have been the other way round – screen pixels would have perfectly fit all three systems use cases, as all 16-bit coordinates are extended to 32 bits for trigonometric calculations anyway. That's what, another 1.080 bytes of wasted conventional memory? And that's even calculated while keeping the current architecture, which allocates space for 3×30 particles as part of the game's global data, although only one of the three particle systems is active at any given time.

That's it for the first form, time to put on "Civilization of Magic"! Or "死なばもろとも"? Or "Theme of 地獄めくり"? Or whatever SYUGEN is supposed to mean…


… and the code of these final patterns comes out roughly as exciting as their in-game impact. With the big exception of the very final "swaying leaves" pattern: After 📝 Q4.4, 📝 Q28.4, 📝 Q24.8, and double variables, this pattern uses… decimal subpixels? Like, multiplying the number by 10, and using the decimal one's digit to represent the fractional part? Well, sure, if you really insist on moving the leaves in cleanly represented integer multiples of ⅒, which is infamously impossible in IEEE 754. Aside from aesthetic reasons, it only really combines less precision (10 possible fractions rather than the usual 16) with the inferior performance of having to use integer divisions and multiplications rather than simple bit shifts. And it's surely not because the leaf sprites needed an extended integer value range of [-3276, +3276], compared to Q12.4's [-2047, +2048]: They are clipped to 640×400 screen space anyway, and are removed as soon as they leave this area.

This pattern also contains the second bug in the "subpixel/pixel confusion hiding an entire animation" category, causing all of BOSS6GR4.GRC to effectively become unused:

The "swaying leaves" pattern. ZUN intended a splash animation to be shown once each leaf "spark" reaches the top of the playfield, which is never displayed in the original game.

At least their hitboxes are what you would expect, exactly covering the 30×30 pixels of Reimu's sprite. Both animation fixes are available on the th01_sariel_fixes branch.

After all that, Sariel's main function turned out fairly unspectacular, just putting everything together and adding some shake, transition, and color pulse effects with a bunch of unnecessary hardware palette changes. There is one reference to a missing BOSS6.GRP file during the first→second form transition, suggesting that Sariel originally had a separate "first form defeat" graphic, before it was replaced with just the shaking effect in the final game.
Speaking about the transition code, it is kind of funny how the… um, imperative and concrete nature of TH01 leads to these 2×24 lines of straight-line code. They kind of look like ZUN rattling off a laundry list of subsystems and raw variables to be reinitialized, making damn sure to not forget anything.


Whew! Second PC-98 Touhou boss completely decompiled, 29 to go, and they'll only get easier from here! 🎉 The next one in line, Elis, is somewhere between Konngara and Sariel as far as x86 instruction count is concerned, so that'll need to wait for some additional funding. Next up, therefore: Looking at a thing in TH03's main game code – really, I have little idea what it will be!

Now that the store is open again, also check out the 📝 updated RE progress overview I've posted together with this one. In addition to more RE, you can now also directly order a variety of mods; all of these are further explained in the order form itself.

📝 Posted:
🚚 Summary of:
P0172, P0173
Commits:
49e6789...2d5491e, 2d5491e...27f901c
💰 Funded by:
Blue Bolt, [Anonymous]
🏷 Tags:

TH03 finally passed 20% RE, and the newly decompiled code contains no serious ZUN bugs! What a nice way to end the year.

There's only a single unlockable feature in TH03: Chiyuri and Yumemi as playable characters, unlocked after a 1CC on any difficulty. Just like the Extra Stages in TH04 and TH05, YUME.NEM contains a single designated variable for this unlocked feature, making it trivial to craft a fully unlocked score file without recording any high scores that others would have to compete against. So, we can now put together a complete set for all PC-98 Touhou games: 2021-12-27-Fully-unlocked-clean-score-files.zip It would have been cool to set the randomly generated encryption keys in these files to a fixed value so that they cancel out and end up not actually encrypting the file. Too bad that TH03 also started feeding each encrypted byte back into its stream cipher, which makes this impossible.

The main loading and saving code turned out to be the second-cleanest implementation of a score file format in PC-98 Touhou, just behind TH02. Only two of the YUME.NEM functions come with nonsensical differences between OP.EXE and MAINL.EXE, rather than 📝 all of them, as in TH01 or 📝 too many of them, as in TH04 and TH05. As for the rest of the per-difficulty structure though… well, it quickly becomes clear why this was the final score file format to be RE'd. The name, score, and stage fields are directly stored in terms of the internal REGI*.BFT sprite IDs used on the high score screen. TH03 also stores 10 score digits for each place rather than the 9 possible ones, keeps any leading 0 digits, and stores the letters of entered names in reverse order… yeah, let's decompile the high score screen as well, for a full understanding of why ZUN might have done all that. (Answer: For no reason at all. :zunpet:)


And wow, what a breath of fresh air. It's surely not good-code: The overlapping shadows resulting from using a 24-pixel letterspacing with 32-pixel glyphs in the name column led ZUN to do quite a lot of unnecessary and slightly confusing rendering work when moving the cursor back and forth, and he even forgot about the EGC there. But it's nowhere close to the level of jank we saw in 📝 TH01's high score menu last year. Good to see that ZUN had learned a thing or two by his third game – especially when it comes to storing the character map cursor in terms of a character ID, and improving the layout of the character map:

The alphabet available for TH03 high score names.

That's almost a nicely regular grid there. With the question mark and the double-wide SP, BS, and END options, the cursor movement code only comes with a reasonable two exceptions, which are easily handled. And while I didn't get this screen completely decompiled, one additional push was enough to cover all important code there.

The only potential glitch on this screen is a result of ZUN's continued use of binary-coded decimal digits without any bounds check or cap. Like the in-game HUD score display in TH04 and TH05, TH03's high score screen simply uses the next glyph in the character set for the most significant digit of any score above 1,000,000,000 points – in this case, the period. Still, it only really gets bad at 8,000,000,000 points: Once the glyphs are exhausted, the blitting function ends up accessing garbage data and filling the entire screen with garbage pixels. For comparison though, the current world record is 133,650,710 points, so good luck getting 8 billion in the first place.

Next up: Starting 2022 with the long-awaited decompilation of TH01's Sariel fight! Due to the 📝 recent price increase, we now got a window in the cap that is going to remain open until tomorrow, providing an early opportunity to set a new priority after Sariel is done.

📝 Posted:
🚚 Summary of:
P0170, P0171
Commits:
(Website) 0c4ab41...4f04091, (Website) 4f04091...e12cf26
💰 Funded by:
[Anonymous]
🏷 Tags:

The "bad" news first: Expanding to Stripe in order to support Google Pay requires bureaucratic effort that is not quite justified yet, and would only be worth it after the next price increase.

Visualizing technical debt has definitely been overdue for a while though. With 1 of these 2 pushes being focused on this topic, it makes sense to summarize once again what "technical debt" means in the context of ReC98, as this info was previously kind of scattered over multiple blog posts. Mainly, it encompasses

Technically (ha), it would also include all of master.lib, which has always been compiled into the binaries in this way, and which will require quite a bit of dedicated effort to be moved out into a properly linkable library, once it's feasible. But this code has never been part of any progress metric – in fact, 0% RE is defined as the total number of x86 instructions in the binary minus any library code. There is also no relation between instruction numbers and the time it will take to finalize master.lib code, let alone a precedent of how much it would cost.

If we now want to express technical debt as a percentage, it's clear where the 100% point would be: when all RE'd code is also compiled in from a translation unit outside the big .ASM one. But where would 0% be? Logically, it would be the point where no reverse-engineered code has ever been moved out of the big translation units yet, and nothing has ever been decompiled. With these boundary points, this is what we get:

Visualizing technical debt in terms of the total amount of instructions that could possibly be not finalized

Not too bad! So it's 6.22% of total RE that we will have to revisit at some point, concentrated mostly around TH04 and TH05 where it resulted from a focus on position independence. The prices also give an accurate impression of how much more work would be required there.

But is that really the best visualization? After all, it requires an understanding of our definition of technical debt, so it's maybe not the most useful measurement to have on a front page. But how about subtracting those 6.22% from the number shown on the RE% bars? Then, we get this:

Visualizing technical debt in terms of the absolute number of 'finalized' instructions per binary

Which is where we get to the good news: Twitter surprisingly helped me out in choosing one visualization over the other, voting 7:2 in favor of the Finalized version. While this one requires you to manually calculate € finalized - € RE'd to obtain the raw financial cost of technical debt, it clearly shows, for the first time, how far away we are from the main goal of fully decompiling all 5 games… at least to the extent it's possible.


Now that the parser is looking at these recursively included .ASM files for the first time, it needed a small number of improvements to correctly handle the more advanced directives used there, which no automatic disassembler would ever emit. Turns out I've been counting some directives as instructions that never should have been, which is where the additional 0.02% total RE came from.

One more overcounting issue remains though. Some of the RE'd assembly slices included by multiple games contain different if branches for each game, like this:

; An example assembly file included by both TH04's and TH05's MAIN.EXE:
if (GAME eq 5)
	; (Code for TH05)
else
	; (Code for TH04)
endif

Currently, the parser simply ignores if, else, and endif, leading to the combined code of all branches being counted for every game that includes such a file. This also affects the calculated speed, and is the reason why finalization seems to be slightly faster than reverse-engineering, at currently 471 instructions per push compared to 463. However, it's not that bad of a signal to send: Most of the not yet finalized code is shared between TH04 and TH05, so finalizing it will roughly be twice as fast as regular reverse-engineering to begin with. (Unless the code then turns out to be twice as complex than average code… :tannedcirno:).

For completeness, finalization is now also shown as part of the per-commit metrics. Now it's clearly visible what I was doing in those very slow five months between P0131 and P0140, where the progress bar didn't move at all: Repaying 3.49% of previously accumulated technical debt across all games. 👌


As announced, I've also implemented a new caching system for this website, as the second main feature of these two pushes. By appending a hash string to the URLs of static resources, your browser should now both cache them forever and re-download them once they did change on the server. This avoids the unnecessary (and quite frankly, embarrassing) re-requests for all static resources that typically just return a 304 Not Modified response. As a result, the blog should now load a bit faster on repeated visits, especially on slower connections. That should allow me to deliberately not paginate it for another few years, without it getting all too slow – and should prepare us for the day when our first game reaches 100% and the server will get smashed. :onricdennat: However, I am open to changing the progress blog link in the navigation bar at the top to the list of tags, once people start complaining.

Apart frome some more invisible correctness and QoL improvements, I've also prepared some new funding goals, but I'll cover those once the store reopens, next year. Syntax highlighting for code snippets would have also been cool, but unfortunately didn't make it into those two pushes. It's still on the list though!

Next up: Back to RE with the TH03 score file format, and other code that surrounds it.

📝 Posted:
🏷 Tags:

Made it through almost three years without a price increase! It's been overdue for a while, though.

With the last months being full of rather research- and documentation-heavy pushes, I've been just about able to keep up with the existing subscriptions. By now, the amount of quality control and documentation I found myself putting into this project has far surpassed the raw reverse-engineering work. Back at the beginning of 2019 when I decided on the previous push price of 30 €, I didn't have this blog nor the current aspirations at code quality. Neither of these have ever been reflected in the price, and I still find it hard to put a number on them. On the other hand, I continue to dislike the typical Patreon model of no inherent defined obligations on my part, and no direct association of the resulting work with the person who funded it. You might have noticed that I don't use the word "donations" anywhere, and instead refer to them as "orders" or "purchases" – and that's precisely for this reason.
The result, however, has been a sold-out store for pretty much all of 2021. I can only begin to imagine how much potential revenue I've already lost from people who might have wanted to contribute at one point, but couldn't, and have already written off this project…

Raising prices is pretty much the only way to get the pending workload back to a more comfortable amount. I also thought about a two-tiered system: Have a documentation-less option for 30 €, and take 60 € for any push that should be accompanied by a blog post. However, skimping on documentation will compromise the quality of the code as well. Writing these blog posts presents another chance of improving it before release, which has made quite a difference on many occasions. And after all, this documentation is the one thing about ReC98 that people mainly interact with. As long as we haven't hit 100% RE, the actual code seems to be an afterthought, which is perfectly understandable: Why start work on a bigger mod or port now if the code is steadily improving in every aspect, and it all will be just a bit more maintainable in a few months?

But why go more commercial then, and especially now? If the recent attention to spaztron64's PC-98 Touhou collection package is any indication, ReC98 has a way bigger career potential than the dead-end RL job I found myself in. Demand for fixed translations and replay support is definitely there – and given that these haven't been done so far, it's very likely that I'll end up as the one to implement such mods, especially if that should happen before reaching 100% RE or PI. People also still seem to want* a port to IBM-compatible DOS, 📝 even though this makes no sense? But if this is something you all want to pay for, then sure, why not. :tannedcirno:
And even right now, working on ReC98 sure beats writing junk software using ill-suited technologies for highly corporate clients, or living close to a world where academic papers are valued higher than working and maintained code. I am in fact very happy whenever I'm done with that for the day, and get to work on ReC98! Who would have thought.

So let's try to grow this into an actual business and raise prices to match demand, going up to a nicely divisible 60 € per push. If you all still manage to regularly sell out the store at this level and I get to raise prices again, I should be able to reduce RL work further and therefore raise the cap as well. Now that I've also clarified a potential route towards self-employment, I'm going to react to these sell-out events more quickly, and with smaller raises. So, no further immediate doubling in the future.

Now, will this delay the currently highly awaited 100% completion of TH01 past August 2022, the 25th anniversary of its release? We'll see once we're back to an almost empty backlog, after I'm done with the TH01 Sariel fight. I'm hopeful that such a price increase will give a new voice to the goals and priorities of less wealthy potential patrons. This crowdfunding is very much designed to be hacked by "microtransactions" – small contributions with specific requests that require other, larger generic contributions to be fulfilled – and I'd like to see more of that. 😛 And even if 60 € per push is already more than the combined fandom wants to pay, that means I can get the 📝 16-bit build system done before the first big 100% release. (Trust me, you really want that!)

I will still deliver the entire current backlog at the value the contributions were originally purchased at. Due to the way the cap has to be calculated, these contributions now appear to have doubled in value. All existing subscriptions will then pay for half of their original pushes starting with their respective December 2021 transaction.

Next up: A bunch of smaller website features, including:

📝 Posted:
🚚 Summary of:
P0168, P0169
Commits:
c2de6ab...8b046da, 8b046da...479b766
💰 Funded by:
rosenrose, Blue Bolt
🏷 Tags:

EMS memory! The infamous stopgap measure between the 640 KiB ("ought to be enough for everyone") of conventional memory offered by DOS from the very beginning, and the later XMS standard for accessing all the rest of memory up to 4 GiB in the x86 Protected Mode. With an optionally active EMS driver, TH04 and TH05 will make use of EMS memory to preload a bunch of situational .CDG images at the beginning of MAIN.EXE:

  1. The "eye catch" game title image, shown while stages are loaded
  2. The character-specific background image, shown while bombing
  3. The player character dialog portraits
  4. TH05 additionally stores the boss portraits there, preloading them at the beginning of each stage. (TH04 instead keeps them in conventional memory during the entire stage.)

Once these images are needed, they can then be copied into conventional memory and accessed as usual.

Uh… wait, copied? It certainly would have been possible to map EMS memory to a regular 16-bit Real Mode segment for direct access, bank-switching out rarely used system or peripheral memory in exchange for the EMS data. However, master.lib doesn't expose this functionality, and only provides functions for copying data from EMS to regular memory and vice versa.
But even that still makes EMS an excellent fit for the large image files it's used for, as it's possible to directly copy their pixel data from EMS to VRAM. (Yes, I tried!) Well… would, because ZUN doesn't do that either, and always naively copies the images to newly allocated conventional memory first. In essence, this dumbs down EMS into just another layer of the memory hierarchy, inserted between conventional memory and disk: Not quite as slow as disk, but still requiring that memcpy() to retrieve the data. Most importantly though: Using EMS in this way does not increase the total amount of memory simultaneously accessible to the game. After all, some other data will have to be freed from conventional memory to make room for the newly loaded data.


The most idiomatic way to define the game-specific layout of the EMS area would be either a struct or an enum. Unfortunately, the total size of all these images exceeds the range of a 16-bit value, and Turbo C++ 4.0J supports neither 32-bit enums (which are silently degraded to 16-bit) nor 32-bit structs (which simply don't compile). That still leaves raw compile-time constants though, you only have to manually define the offset to each image in terms of the size of its predecessor. But instead of doing that, ZUN just placed each image at a nice round decimal offset, each slightly larger than the actual memory required by the previous image, just to make sure that everything fits. :tannedcirno: This results not only in quite a bit of unnecessary padding, but also in technically the single biggest amount of "wasted" memory in PC-98 Touhou: Out of the 180,000 (TH04) and 320,000 (TH05) EMS bytes requested, the game only uses 135,552 (TH04) and 175,904 (TH05) bytes. But hey, it's EMS, so who cares, right? Out of all the opportunities to take shortcuts during development, this is among the most acceptable ones. Any actual PC-98 model that could run these two games comes with plenty of memory for this to not turn into an actual issue.

On to the EMS-using functions themselves, which are the definition of "cross-cutting concerns". Most of these have a fallback path for the non-EMS case, and keep the loaded .CDG images in memory if they are immediately needed. Which totally makes sense, but also makes it difficult to find names that reflect all the global state changed by these functions. Every one of these is also just called from a single place, so inlining them would have saved me a lot of naming and documentation trouble there.
The TH04 version of the EMS allocation code was actually displayed on ZUN's monitor in the 2010 MAG・ネット documentary; WindowsTiger already transcribed the low-quality video image in 2019. By 2015 ReC98 standards, I would have just run with that, but the current project goal is to write better code than ZUN, so I didn't. 😛 We sure ain't going to use magic numbers for EMS offsets.

The dialog init and exit code then is completely different in both games, yet equally cross-cutting. TH05 goes even further in saving conventional memory, loading each individual player or boss portrait into a single .CDG slot immediately before blitting it to VRAM and freeing the pixel data again. People who play TH05 without an active EMS driver are surely going to enjoy the hard drive access lag between each portrait change… :godzun: TH04, on the other hand, also abuses the dialog exit function to preload the Mugetsu defeat / Gengetsu entrance and Gengetsu defeat portraits, using a static variable to track how often the function has been called during the Extra Stage… who needs function parameters anyway, right? :zunpet:

This is also the function in which TH04 infamously crashes after the Stage 5 pre-boss dialog when playing with Reimu and without any active EMS driver. That crash is what motivated this look into the games' EMS usage… but the code looks perfectly fine? Oh well, guess the crash is not related to EMS then. Next u–

OK, of course I can't leave it like that. Everyone is expecting a fix now, and I still got half of a push left over after decompiling the regular EMS code. Also, I've now RE'd every function that could possibly be involved in the crash, and this is very likely to be the last time I'll be looking at them.


Turns out that the bug has little to do with EMS, and everything to do with ZUN limiting the amount of conventional RAM that TH04's MAIN.EXE is allowed to use, and then slightly miscalculating this upper limit. Playing Stage 5 with Reimu is the most asset-intensive configuration in this game, due to the combination of

The star image used in TH04's Stage 5.
The star image used in TH04's Stage 5.

Remove any single one of the above points, and this crash would have never occurred. But with all of them combined, the total amount of memory consumed by TH04's MAIN.EXE just barely exceeds ZUN's limit of 320,000 bytes, by no more than 3,840 bytes, the size of the star image.

But wait: As we established earlier, EMS does nothing to reduce the amount of conventional memory used by the game. In fact, if you disabled TH04's EMS handling, you'd still get this crash even if you are running an EMS driver and loaded DOS into the High Memory Area to free up as much conventional RAM as possible. How can EMS then prevent this crash in the first place?

The answer: It's only because ZUN's usage of EMS bypasses the need to load the cached images back out of the XOR-encrypted 東方幻想.郷 packfile. Leaving aside the general stupidity of any game data file encryption*, master.lib's decryption implementation is also quite wasteful: It uses a separate buffer that receives fixed-size chunks of the file, before decrypting every individual byte and copying it to its intended destination buffer. That really resembles the typical slowness of a C fread() implementation more than it does the highly optimized ASM code that master.lib purports to be… And how large is this well-hidden decryption buffer? 4 KiB. :onricdennat:

So, looking back at the game, here is what happens once the Stage 5 pre-battle dialog ends:

  1. Reimu's bomb background image, which was previously freed to make space for her dialog portraits, has to be loaded back into conventional memory from disk
  2. BB0.CDG is found inside the 東方幻想.郷 packfile
  3. file_ropen() ends up allocating a 4 KiB buffer for the encrypted packfile data, getting us the decisive ~4 KiB closer to the memory limit
  4. The .CDG loader tries to allocate 52 608 contiguous bytes for the pixel data of Reimu's bomb image
  5. This would exceed the memory limit, so hmem_allocbyte() fails and returns a nullptr
  6. ZUN doesn't check for this case (as usual)
  7. The pixel data is loaded to address 0000:0000, overwriting the Interrupt Vector Table and whatever comes after
  8. The game crashes
The final frame rendered before the TH04 Stage 5 Reimu No-EMS crash
The final frame rendered by a crashing TH04.

The 4 KiB encryption buffer would only be freed by the corresponding file_close() call, which of course never happens because the game crashes before it gets there. At one point, I really did suspect the cause to be some kind of memory leak or fragmentation inside master.lib, which would have been quite delightful to fix.
Instead, the most straightforward fix here is to bump up that memory limit by at least 4 KiB. Certainly easier than squeezing in a cdg_free() call for the star image before the pre-boss dialog without breaking position dependence.

Or, even better, let's nuke all these memory limits from orbit because they make little sense to begin with, and fix every other potential out-of-memory crash that modders would encounter when adding enough data to any of the 4 games that impose such limits on themselves. Unless you want to launch other binaries (which need to do their own memory allocations) after launching the game, there's really no reason to restrict the amount of memory available to a DOS process. Heck, whenever DOS creates a new one, it assigns all remaining free memory by default anyway.
Removing the memory limits also removes one of ZUN's few error checks, which end up quitting the game if there isn't at least a given maximum amount of conventional RAM available. While it might be tempting to reserve enough memory at the beginning of execution and then never check any allocation for a potential failure, that's exactly where something like TH04's crash comes from.
This game is also still running on DOS, where such an initial allocation failure is very unlikely to happen – no one fills close to half of conventional RAM with TSRs and then tries running one of these games. It might have been useful to detect systems with less than 640 KiB of actual, physical RAM, but none of the PC-98 models with that little amount of memory are fast enough to run these games to begin with. How ironic… a place where ZUN actually added an error check, and then it's mostly pointless.

Here's an archive that contains both fix variants, just in case. These were compiled from the th04_noems_crash_fix and mem_assign_all branches, and contain as little code changes as possible.
Edit (2022-04-18): For TH04, you probably want to download the 📝 community choice fix package instead, which contains this fix along with other workarounds for the Divide error crashes. 2021-11-29-Memory-limit-fixes.zip

So yeah, quite a complex bug, leaving no time for the TH03 scorefile format research after all. Next up: Raising prices.

📝 Posted:
🚚 Summary of:
P0165, P0166, P0167
Commits:
7a0e5d8...f2bca01, f2bca01...e697907, e697907...c2de6ab
💰 Funded by:
Ember2528
🏷 Tags:

OK, TH01 missile bullets. Can we maybe have a well-behaved entity type, without any weirdness? Just once?

Ehh, kinda. Apart from another 150 bytes wasted on unused structure members, this code is indeed more on the low end in terms of overall jank. It does become very obvious why dodging these missiles in the YuugenMagan, Mima, and Elis fights feels so awful though: An unfair 46×46 pixel hitbox around Reimu's center pixel, combined with the comeback of 📝 interlaced rendering, this time in every stage. ZUN probably did this because missiles are the only 16×16 sprite in TH01 that is blitted to unaligned X positions, which effectively ends up touching a 32×16 area of VRAM per sprite.
But even if we assume VRAM writes to be the bottleneck here, it would have been totally possible to render every missile in every frame at roughly the same amount of CPU time that the original game uses for interlaced rendering:

That's an optimization that would have significantly benefitted the game, in contrast to all of the fake ones introduced in later games. Then again, this optimization is actually something that the later games do, and it might have in fact been necessary to achieve their higher bullet counts without significant slowdown.

Unfortunately, it was only worth decompiling half of the missile code right now, thanks to gratuitous FPU usage in the other half, where 📝 double variables are compared to float literals. That one will have to wait 📝 until after SinGyoku.


After some effectively unused Mima sprite effect code that is so broken that it's impossible to make sense out of it, we get to the final feature I wanted to cover for all bosses in parallel before returning to Sariel: The separate sprite background storage for moving or animated boss sprites in the Mima, Elis, and Sariel fights. But, uh… why is this necessary to begin with? Doesn't TH01 already reserve the other VRAM page for backgrounds?
Well, these sprites are quite big, and ZUN didn't want to blit them from main memory on every frame. After all, TH01 and TH02 had a minimum required clock speed of 33 MHz, half of the speed required for the later three games. So, he simply blitted these boss sprites to both VRAM pages, leading the usual unblitting calls to only remove the other sprites on top of the boss. However, these bosses themselves want to move across the screen… and this makes it necessary to save the stage background behind them in some other way.

Enter .PTN, and its functions to capture a 16×16 or 32×32 square from VRAM into a sprite slot. No problem with that approach in theory, as the size of all these bigger sprites is a multiple of 32×32; splitting a larger sprite into these smaller 32×32 chunks makes the code look just a little bit clumsy (and, of course, slower).
But somewhere during the development of Mima's fight, ZUN apparently forgot that those sprite backgrounds existed. And once Mima's 🚫 casting sprite is blitted on top of her regular sprite, using just regular sprite transparency, she ends up with her infamous third arm:

TH01 Mima's third arm

Ironically, there's an unused code path in Mima's unblit function where ZUN assumes a height of 48 pixels for Mima's animation sprites rather than the actual 64. This leads to even clumsier .PTN function calls for the bottom 128×16 pixels… Failing to unblit the bottom 16 pixels would have also yielded that third arm, although it wouldn't have looked as natural. Still wouldn't say that it was intentional; maybe this casting sprite was just added pretty late in the game's development?


So, mission accomplished, Sariel unblocked… at 2¼ pushes. :thonk: That's quite some time left for some smaller stage initialization code, which bundles a bunch of random function calls in places where they logically really don't belong. The stage opening animation then adds a bunch of VRAM inter-page copies that are not only redundant but can't even be understood without knowing the hidden internal state of the last VRAM page accessed by previous ZUN code…
In better news though: Turbo C++ 4.0 really doesn't seem to have any complexity limit on inlining arithmetic expressions, as long as they only operate on compile-time constants. That's how we get macro-free, compile-time Shift-JIS to JIS X 0208 conversion of the individual code points in the 東方★靈異伝 string, in a compiler from 1994. As long as you don't store any intermediate results in variables, that is… :tannedcirno:

But wait, there's more! With still ¼ of a push left, I also went for the boss defeat animation, which includes the route selection after the SinGyoku fight.
As in all other instances, the 2× scaled font is accomplished by first rendering the text at regular 1× resolution to the other, invisible VRAM page, and then scaled from there to the visible one. However, the route selection is unique in that its scaled text is both drawn transparently on top of the stage background (not onto a black one), and can also change colors depending on the selection. It would have been no problem to unblit and reblit the text by rendering the 1× version to a position on the invisible VRAM page that isn't covered by the 2× version on the visible one, but ZUN (needlessly) clears the invisible page before rendering any text. :zunpet: Instead, he assigned a separate VRAM color for both the 魔界 and 地獄 options, and only changed the palette value for these colors to white or gray, depending on the correct selection. This is another one of the 📝 rare cases where TH01 demonstrates good use of PC-98 hardware, as the 魔界へ and 地獄へ strings don't need to be reblitted during the selection process, only the Orb "cursor" does.

Then, why does this still not count as good-code? When changing palette colors, you kinda need to be aware of everything else that can possibly be on screen, which colors are used there, and which aren't and can therefore be used for such an effect without affecting other sprites. In this case, well… hover over the image below, and notice how Reimu's hair and the bomb sprites in the HUD light up when Makai is selected:

Demonstration of palette changes in TH01's route selection

This push did end on a high note though, with the generic, non-SinGyoku version of the defeat animation being an easily parametrizable copy. And that's how you decompile another 2.58% of TH01 in just slightly over three pushes.


Now, we're not only ready to decompile Sariel, but also Kikuri, Elis, and SinGyoku without needing any more detours into non-boss code. Thanks to the current TH01 funding subscriptions, I can plan to cover most, if not all, of Sariel in a single push series, but the currently 3 pending pushes probably won't suffice for Sariel's 8.10% of all remaining code in TH01. We've got quite a lot of not specifically TH01-related funds in the backlog to pass the time though.

Due to recent developments, it actually makes quite a lot of sense to take a break from TH01: spaztron64 has managed what every Touhou download site so far has failed to do: Bundling all 5 game onto a single .HDI together with pre-configured PC-98 emulators and a nice boot menu, and hosting the resulting package on a proper website. While this first release is already quite good (and much better than my attempt from 2014), there is still a bit of room for improvement to be gained from specific ReC98 research. Next up, therefore:

📝 Posted:
🚚 Summary of:
P0162, P0163, P0164
Commits:
81dd96e...24b3a0d, 24b3a0d...6d572b3, 6d572b3...7a0e5d8
💰 Funded by:
Ember2528, Yanga
🏷 Tags:

No technical obstacles for once! Just pure overcomplicated ZUN code. Unlike 📝 Konngara's main function, the main TH01 player function was every bit as difficult to decompile as you would expect from its size.

With TH01 using both separate left- and right-facing sprites for all of Reimu's moves and separate classes for Reimu's 32×32 and 48×* sprites, we're already off to a bad start. Sure, sprite mirroring is minimally more involved on PC-98, as the planar nature of VRAM requires the bits within an 8-pixel byte to also be mirrored, in addition to writing the sprite bytes from right to left. TH03 uses a 256-byte lookup table for this, generated at runtime by an infamous micro-optimized and undecompilable ASM algorithm. With TH01's existing architecture, ZUN would have then needed to write 3 additional blitting functions. But instead, he chose to waste a total of 26,112 bytes of memory on pre-mirrored sprites… :godzun:

Alright, but surely selecting those sprites from code is no big deal? Just store the direction Reimu is facing in, and then add some branches to the rendering code. And there is in fact a variable for Reimu's direction… during regular arrow-key movement, and another one while shooting and sliding, and a third as part of the special attack types, launched out of a slide.
Well, OK, technically, the last two are the same variable. But that's even worse, because it means that ZUN stores two distinct enums at the same place in memory: Shooting and sliding uses 1 for left, 2 for right, and 3 for the "invalid" direction of holding both, while the special attack types indicate the direction in their lowest bit, with 0 for right and 1 for left. I decompiled the latter as bitflags, but in ZUN's code, each of the 8 permutations is handled as a distinct type, with copy-pasted and adapted code… :zunpet: The interpretation of this two-enum "sub-mode" union variable is controlled by yet another "mode" variable… and unsurprisingly, two of the bugs in this function relate to the sub-mode variable being interpreted incorrectly.

Also, "rendering code"? This one big function basically consists of separate unblit→update→render code snippets for every state and direction Reimu can be in (moving, shooting, swinging, sliding, special-attacking, and bombing), pasted together into a tangled mess of nested if(…) statements. While a lot of the code is copy-pasted, there are still a number of inconsistencies that defeat the point of my usual refactoring treatment. After all, with a total of 85 conditional branches, anything more than I did would have just obscured the control flow too badly, making it even harder to understand what's going on.
In the end, I spotted a total of 8 bugs in this function, all of which leave Reimu invisible for one or more frames:

Thanks to the last one, Reimu's first swing animation frame is never actually rendered. So whenever someone complains about TH01 sprite flickering on an emulator: That emulator is accurate, it's the game that's poorly written. :tannedcirno:

And guess what, this function doesn't even contain everything you'd associate with per-frame player behavior. While it does handle Yin-Yang Orb repulsion as part of slides and special attacks, it does not handle the actual player/Orb collision that results in lives being lost. The funny thing about this: These two things are done in the same function… :onricdennat:

Therefore, the life loss animation is also part of another function. This is where we find the final glitch in this 3-push series: Before the 16-frame shake, this function only unblits a 32×32 area around Reimu's center point, even though it's possible to lose a life during the non-deflecting part of a 48×48-pixel animation. In that case, the extra pixels will just stay on screen during the shake. They are unblitted afterwards though, which suggests that ZUN was at least somewhat aware of the issue?
Finally, the chance to see the alternate life loss sprite Alternate TH01 life loss sprite is exactly ⅛.


As for any new insights into game mechanics… you know what? I'm just not going to write anything, and leave you with this flowchart instead. Here's the definitive guide on how to control Reimu in TH01 we've been waiting for 24 years:

(SVG download)

Pellets are deflected during all gray states. Not shown is the obvious "double-tap Z and X" transition from all non-(#1) states to the Bomb state, but that would have made this diagram even more unwieldy than it turned out. And yes, you can shoot twice as fast while moving left or right.

While I'm at it, here are two more animations from MIKO.PTN which aren't referenced by any code:

An unused animation from TH01's MIKO.PTNAn unused animation from TH01's MIKO.PTN

With that monster of a function taken care of, we've only got boss sprite animation as the final blocker of uninterrupted Sariel progress. Due to some unfavorable code layout in the Mima segment though, I'll need to spend a bit more time with some of the features used there. Next up: The missile bullets used in the Mima and YuugenMagan fights.

📝 Posted:
🚚 Summary of:
P0160, P0161
Commits:
e491cd7...42ba4a5, 42ba4a5...81dd96e
💰 Funded by:
Yanga, [Anonymous]
🏷 Tags:

Nothing really noteworthy in TH01's stage timer code, just yet another HUD element that is needlessly drawn into VRAM. Sure, ZUN applies his custom boldfacing effect on top of the glyphs retrieved from font ROM, but he could have easily installed those modified glyphs as gaiji.
Well, OK, halfwidth gaiji aren't exactly well documented, and sometimes not even correctly emulated 📝 due to the same PC-98 hardware oddity I was researching last month. I've reserved two of the pending anonymous "anything" pushes for the conclusion of this research, just in case you were wondering why the outstanding workload is now lower after the two delivered here.

And since it doesn't seem to be clearly documented elsewhere: Every 2 ticks on the stage timer correspond to 4 frames.


So, TH01 rank pellet speed. The resident pellet speed value is a factor ranging from a minimum of -0.375 up to a maximum of 0.5 (pixels per frame), multiplied with the difficulty-adjusted base speed for each pellet and added on top of that same speed. This multiplier is modified

Apparently, ZUN noted that these deltas couldn't be losslessly stored in an IEEE 754 floating-point variable, and therefore didn't store the pellet speed factor exactly in a way that would correspond to its gameplay effect. Instead, it's stored similar to Q12.4 subpixels: as a simple integer, pre-multiplied by 40. This results in a raw range of -15 to 20, which is what the undecompiled ASM calls still use. When spawning a new pellet, its base speed is first multiplied by that factor, and then divided by 40 again. This is actually quite smart: The calculation doesn't need to be aware of either Q12.4 or the 40× format, as ((Q12.4 * factor×40) / factor×40) still comes out as a Q12.4 subpixel even if all numbers are integers. The only limiting issue here would be the potential overflow of the 16-bit multiplication at unadjusted base speeds of more than 50 pixels per frame, but that'd be seriously unplayable.
So yeah, pellet speed modifications are indeed gradual, and don't just fall into the coarse three "high, normal, and low" categories.


That's ⅝ of P0160 done, and the continue and pause menus would make good candidates to fill up the remaining ⅜… except that it seemed impossible to figure out the correct compiler options for this code?
The issues centered around the two effects of Turbo C++ 4.0J's -O switch:

  1. Optimizing jump instructions: merging duplicate successive jumps into a single one, and merging duplicated instructions at the end of conditional branches into a single place under a single branch, which the other branches then jump to
  2. Compressing ADD SP and POP CX stack-clearing instructions after multiple successive CALLs to __cdecl functions into a single ADD SP with the combined parameter stack size of all function calls

But how can the ASM for these functions exhibit #1 but not #2? How can it be seemingly optimized and unoptimized at the same time? The only option that gets somewhat close would be -O- -y, which emits line number information into the .OBJ files for debugging. This combination provides its own kind of #1, but these functions clearly need the real deal.

The research into this issue ended up consuming a full push on its own. In the end, this solution turned out to be completely unrelated to compiler options, and instead came from the effects of a compiler bug in a totally different place. Initializing a local structure instance or array like

const uint4_t flash_colors[3] = { 3, 4, 5 };

always emits the { 3, 4, 5 } array into the program's data segment, and then generates a call to the internal SCOPY@ function which copies this data array to the local variable on the stack. And as soon as this SCOPY@ call is emitted, the -O optimization #1 is disabled for the entire rest of the translation unit?!
So, any code segment with an SCOPY@ call followed by __cdecl functions must strictly be decompiled from top to bottom, mirroring the original layout of translation units. That means no TH01 continue and pause menus before we haven't decompiled the bomb animation, which contains such an SCOPY@ call. 😕
Luckily, TH01 is the only game where this bug leads to significant restrictions in decompilation order, as later games predominantly use the pascal calling convention, in which each function itself clears its stack as part of its RET instruction.


What now, then? With 51% of REIIDEN.EXE decompiled, we're slowly running out of small features that can be decompiled within ⅜ of a push. Good that I haven't been looking a lot into OP.EXE and FUUIN.EXE, which pretty much only got easy pieces of code left to do. Maybe I'll end up finishing their decompilations entirely within these smaller gaps?
I still ended up finding one more small piece in REIIDEN.EXE though: The particle system, seen in the Mima fight.

I like how everything about this animation is contained within a single function that is called once per frame, but ZUN could have really consolidated the spawning code for new particles a bit. In Mima's fight, particles are only spawned from the top and right edges of the screen, but the function in fact contains unused code for all other 7 possible directions, written in quite a bloated manner. This wouldn't feel quite as unused if ZUN had used an angle parameter instead… :thonk: Also, why unnecessarily waste another 40 bytes of the BSS segment?

But wait, what's going on with the very first spawned particle that just stops near the bottom edge of the screen in the video above? Well, even in such a simple and self-contained function, ZUN managed to include an off-by-one error. This one then results in an out-of-bounds array access on the 80th frame, where the code attempts to spawn a 41st particle. If the first particle was unlucky to be both slow enough and spawned away far enough from the bottom and right edges, the spawning code will then kill it off before its unblitting code gets to run, leaving its pixel on the screen until something else overlaps it and causes it to be unblitted.
Which, during regular gameplay, will quickly happen with the Orb, all the pellets flying around, and your own player movement. Also, the RNG can easily spawn this particle at a position and velocity that causes it to leave the screen more quickly. Kind of impressive how ZUN laid out the structure of arrays in a way that ensured practically no effect of this bug on the game; this glitch could have easily happened every 80 frames instead. He almost got close to all bugs canceling out each other here! :godzun:

Next up: The player control functions, including the second-biggest function in all of PC-98 Touhou.

📝 Posted:
🚚 Summary of:
P0158, P0159
Commits:
bf7bb7e...c0c0ebc, c0c0ebc...e491cd7
💰 Funded by:
Yanga
🏷 Tags:

Of course, Sariel's potentially bloated and copy-pasted code is blocked by even more definitely bloated and copy-pasted code. It's TH01, what did you expect? :tannedcirno:

But even then, TH01's item code is on a new level of software architecture ridiculousness. First, ZUN uses distinct arrays for both types of items, with their own caps of 4 for bomb items, and 10 for point items. Since that obviously makes any type-related switch statement redundant, he also used distinct functions for both types, with copy-pasted boilerplate code. The main per-item update and render function is shared though… and takes every single accessed member of the item structure as its own reference parameter. Like, why, you have a structure, right there?! That's one way to really practice the C++ language concept of passing arbitrary structure fields by mutable reference… :zunpet:
To complete the unwarranted grand generic design of this function, it calls back into per-type collision detection, drop, and collect functions with another three reference parameters. Yeah, why use C++ virtual methods when you can also implement the effectively same polymorphism functionality by hand? Oh, and the coordinate clamping code in one of these callbacks could only possibly have come from nested min() and max() preprocessor macros. And that's how you extend such dead-simple functionality to 1¼ pushes…

Amidst all this jank, we've at least got a sensible item↔player hitbox this time, with 24 pixels around Reimu's center point to the left and right, and extending from 24 pixels above Reimu down to the bottom of the playfield. It absolutely didn't look like that from the initial naive decompilation though. Changing entity coordinates from left/top to center was one of the better lessons from TH01 that ZUN implemented in later games, it really makes collision detection code much more intuitive to grasp.


The card flip code is where we find out some slightly more interesting aspects about item drops in this game, and how they're controlled by a hidden cycle variable:

Then again, score players largely ignore point items anyway, as card combos simply have a much bigger effect on the score. With this, I should have RE'd all information necessary to construct a tool-assisted score run, though?
Edit: Turns out that 1) point items are becoming increasingly important in score runs, and 2) Pearl already did a TAS some months ago. Thanks to spaztron64 for the info!

The Orb↔card hitbox also makes perfect sense, with 24 pixels around the center point of a card in every direction.

The rest of the code confirms the card flip score formula documented on Touhou Wiki, as well as the way cards are flipped by bombs: During every of the 90 "damaging" frames of the 140-frame bomb animation, there is a 75% chance to flip the card at the [bomb_frame % total_card_count_in_stage] array index. Since stages can only have up to 50 cards 📝 thanks to a bug, even a 75% chance is high enough to typically flip most cards during a bomb. Each of these flips still only removes a single card HP, just like after a regular collision with the Orb.
Also, why are the card score popups rendered before the cards themselves? That's two needless frames of flicker during that 25-frame animation. Not all too noticeable, but still.


And that's over 50% of REIIDEN.EXE decompiled as well! Next up: More HUD update and rendering code… with a direct dependency on rank pellet speed modifications?

📝 Posted:
🚚 Summary of:
P0157
Commits:
4bc6405...bf7bb7e
💰 Funded by:
Yanga
🏷 Tags:

Yup, there still are features that can be fully covered in a single push and don't lead to sprawling blog posts. The giant STAGE number and HARRY UP messages, as well as the flashing transparent 東方★靈異伝 at the beginning of each scene are drawn by retrieving the glyphs for each letter from font ROM, and then "blitting" them to text RAM by placing a colored fullwidth 16×16 square at every pixel that is set in the font bitmap.
And 📝 once again, ZUN's code there matches the mediocre example code for the related hardware interrupt from the PC-9801 Programmers' Bible. It's not 100% copied this time, but definitely inspired by the code on page 121. Therefore, we can conclude that these letters are probably only displayed as these 16× scaled glyphs because that book had code on how to achieve this effect.

ZUN "improved" on the example code by implementing a write-only cursor over the entire text RAM that fills every 16×16 cell with a differently colored space character, fully clearing the text RAM as a side effect. For once, he even removed some redundancy here by using helper functions! It's all still far from good-code though. For example, there's a function for filling 5 rows worth of cells, which he uses for both the top and bottom margin of these letters. But since the bottom margin starts at the 22nd line, the code writes past the 25th line and into the second TRAM page. Good that this page is not used by either the hardware or the game.

These cursor functions can actually write any fullwidth JIS code point to text RAM… and seem to do that in a rather simplified way, because shouldn't you set the most significant bit to indicate the right half of a fullwidth character? That's what's written in the same book that ZUN copied all functions out of, after all. 🤔 Researching this led me down quite the rabbit hole, where I found an oddity in PC-98 text RAM rendering that no single one of the widely-used PC-98 emulators gets completely right. I'm almost done with the 2-push research into this issue, which will include fixes for DOSBox-X and Neko Project II. The only thing I'm missing to get these fully accurate is a screenshot of the output created by this binary, on any PC-98 model made by EPSON: 2021-09-12-jist0x28.com.zip That's the reason why this push was rather delayed. Thanks in advance to anyone who'd like to help with this!


In maybe more disappointing news: Sariel is going to be delayed for a while longer. 😕 The player- and HUD-related functions, which previously delayed further progress there, turned out to call a lot of not yet RE'd functions themselves. Seems as if we're doing most of the card-flipping code second, after all? Next up: Point and bomb items, which at least are a significant step in terms of position independence.

📝 Posted:
🚚 Summary of:
P0153, P0154, P0155, P0156
Commits:
624e0cb...d05c9ba, d05c9ba...031b526, 031b526...9ad578e, 9ad578e...4bc6405
💰 Funded by:
Ember2528
🏷 Tags:

📝 7 pushes to get Konngara done, according to my previous estimate? Well, how about being twice as fast, and getting the entire boss fight done in 3.5 pushes instead? So much copy-pasted code in there… without any flashy unused content, apart from four calculations with an unclear purpose. And the three strings "ANGEL", "OF", "DEATH", which were probably meant to be rendered using those giant upscaled font ROM glyphs that also display the STAGE # and HARRY UP strings? Those three strings are also part of Sariel's code, though.

On to the remaining 11 patterns then! Konngara's homing snakes, shown in the video above, are one of the more notorious parts of this battle. They occur in two patterns – one with two snakes and one with four – with all of the spawn, aim, update, and render code copy-pasted between the two. :zunpet: Three gameplay-related discoveries here:


This was followed by really weird aiming code for the "sprayed pellets from cup" pattern… which can only possibly have been done on purpose, but is sort of mitigated by the spraying motion anyway.
After a bunch of long if(…) {…} else if(…) {…} else if(…) {…} chains, which remain quite popular in certain corners of the game dev scene to this day, we've got the three sword slash patterns as the final notable ones. At first, it seemed as if ZUN just improvised those raw number constants involved in the pellet spawner's movement calculations to describe some sort of path that vaguely resembles the sword slash. But once I tried to express these numbers in terms of the slash animation's keyframes, it all worked out perfectly, and resulted in this:

Triangular path of the pellet spawner during Konngara's slash patterns

Yup, the spawner always takes an exact path along this triangle. Sometimes, I wonder whether I should just rush this project and don't bother about naming these repeated number literals. Then I gain insights like these, and it's all worth it.


Finally, we've got Konngara's main function, which coordinates the entire fight. Third-longest function in both TH01 and all of PC-98 Touhou, only behind some player-related stuff and YuugenMagan's gigantic main function… and it's even more of a copy-pasta, making it feel not nearly as long as it is. Key insights there:

Seriously, 📝 line drawing was much harder to decompile.


And that's it for Konngara! First boss with not a single piece of ASM left, 30 more to go! 🎉 But wait, what about the cause behind the temporary green discoloration after leaving the Pause menu? I expected to find something on that as well, but nope, it's nothing in Konngara's code segment. We'll probably only get to figure that out near the very end of TH01's decompilation, once we get to the one function that directly calls all of the boss-specific main functions in a switch statement.
Edit (2022-07-17): 📝 Only took until Mima.

So, Sariel next? With half of a push left, I did cover Sariel's first few initialization functions, but all the sprite unblitting and HUD manipulation will need some extra attention first. The first one of these functions is related to the HUD, the stage timer, and the HARRY UP mode, whose pellet pattern I've also decompiled now.

All of this brings us past 75% PI in all games, and TH01 to under 30,000 remaining ASM instructions, leaving TH03 as the now most expensive game to be completely decompiled. Looking forward to how much more TH01's code will fall apart if you just tap it lightly… Next up: The aforementioned helper functions related to HARRY UP, drawing the HUD, and unblitting the other bosses whose sprites are a bit more animated.

📝 Posted:
🚚 Summary of:
P0149, P0150, P0151, P0152
Commits:
e1a26bb...05e4c4a, 05e4c4a...768251d, 768251d...4d24ca5, 4d24ca5...81fc861
💰 Funded by:
Blue Bolt, Ember2528, -Tom-, [Anonymous]
🏷 Tags:

…or maybe not that soon, as it would have only wasted time to untangle the bullet update commits from the rest of the progress. So, here's all the bullet spawning code in TH04 and TH05 instead. I hope you're ready for this, there's a lot to talk about!

(For the sake of readability, "bullets" in this blog post refers to the white 8×8 pellets and all 16×16 bullets loaded from MIKO16.BFT, nothing else.)


But first, what was going on 📝 in 2020? Spent 4 pushes on the basic types and constants back then, still ended up confusing a couple of things, and even getting some wrong. Like how TH05's "bullet slowdown" flag actually always prevents slowdown and fires bullets at a constant speed instead. :tannedcirno: Or how "random spread" is not the best term to describe that unused bullet group type in TH04.
Or that there are two distinct ways of clearing all bullets on screen, which deserve different names:

Mechanic #1: Clearing bullets for a custom amount of time, awarding 1000 points for all bullets alive on the first frame, and 100 points for all bullets spawned during the clear time.
Mechanic #2: Zapping bullets for a fixed 16 frames, awarding a semi-exponential and loudly announced Bonus!! for all bullets alive on the first frame, and preventing new bullets from being spawned during those 16 frames. In TH04 at least; thanks to a ZUN bug, zapping got reduced to 1 frame and no animation in TH05…

Bullets are zapped at the end of most midboss and boss phases, and cleared everywhere else – most notably, during bombs, when losing a life, or as rewards for extends or a maximized Dream bonus. The Bonus!! points awarded for zapping bullets are calculated iteratively, so it's not trivial to give an exact formula for these. For a small number 𝑛 of bullets, it would exactly be 5𝑛³ - 10𝑛² + 15𝑛 points – or, using uth05win's (correct) recursive definition, Bonus(𝑛) = Bonus(𝑛-1) + 15𝑛² - 5𝑛 + 10. However, one of the internal step variables is capped at a different number of points for each difficulty (and game), after which the points only increase linearly. Hence, "semi-exponential".


On to TH04's bullet spawn code then, because that one can at least be decompiled. And immediately, we have to deal with a pointless distinction between regular bullets, with either a decelerating or constant velocity, and special bullets, with preset velocity changes during their lifetime. That preset has to be set somewhere, so why have separate functions? In TH04, this separation continues even down to the lowest level of functions, where values are written into the global bullet array. TH05 merges those two functions into one, but then goes too far and uses self-modifying code to save a grand total of two local variables… Luckily, the rest of its actual code is identical to TH04.

Most of the complexity in bullet spawning comes from the (thankfully shared) helper function that calculates the velocities of the individual bullets within a group. Both games handle each group type via a large switch statement, which is where TH04 shows off another Turbo C++ 4.0 optimization: If the range of case values is too sparse to be meaningfully expressed in a jump table, it usually generates a linear search through a second value table. But with the -G command-line option, it instead generates branching code for a binary search through the set of cases. 𝑂(log 𝑛) as the worst case for a switch statement in a C++ compiler from 1994… that's so cool. But still, why are the values in TH04's group type enum all over the place to begin with? :onricdennat:
Unfortunately, this optimization is pretty rare in PC-98 Touhou. It only shows up here and in a few places in TH02, compared to at least 50 switch value tables.

In all of its micro-optimized pointlessness, TH05's undecompilable version at least fixes some of TH04's redundancy. While it's still not even optimal, it's at least a decently written piece of ASM… if you take the time to understand what's going on there, because it certainly took quite a bit of that to verify that all of the things which looked like bugs or quirks were in fact correct. And that's how the code for this function ended up with 35% comments and blank lines before I could confidently call it "reverse-engineered"…
Oh well, at least it finally fixes a correctness issue from TH01 and TH04, where an invalid bullet group type would fill all remaining slots in the bullet array with identical versions of the first bullet.

Something that both games also share in these functions is an over-reliance on globals for return values or other local state. The most ridiculous example here: Tuning the speed of a bullet based on rank actually mutates the global bullet template… which ZUN then works around by adding a wrapper function around both regular and special bullet spawning, which saves the base speed before executing that function, and restores it afterward. :zunpet: Add another set of wrappers to bypass that exact tuning, and you've expanded your nice 1-function interface to 4 functions. Oh, and did I mention that TH04 pointlessly duplicates the first set of wrapper functions for 3 of the 4 difficulties, which can't even be explained with "debugging reasons"? That's 10 functions then… and probably explains why I've procrastinated this feature for so long.

At this point, I also finally stopped decompiling ZUN's original ASM just for the sake of it. All these small TH05 functions would look horribly unidiomatic, are identical to their decompiled TH04 counterparts anyway, except for some unique constant… and, in the case of TH05's rank-based speed tuning function, actually become undecompilable as soon as we want to return a C++ class to preserve the semantic meaning of the return value. Mainly, this is because Turbo C++ does not allow register pseudo-variables like _AX or _AL to be cast into class types, even if their size matches. Decompiling that function would have therefore lowered the quality of the rest of the decompiled code, in exchange for the additional maintenance and compile-time cost of another translation unit. Not worth it – and for a TH05 port, you'd already have to decompile all the rest of the bullet spawning code anyway!


The only thing in there that was still somewhat worth being decompiled was the pre-spawn clipping and collision detection function. Due to what's probably a micro-optimization mistake, the TH05 version continues to spawn a bullet even if it was spawned on top of the player. This might sound like it has a different effect on gameplay… until you realize that the player got hit in this case and will either lose a life or deathbomb, both of which will cause all on-screen bullets to be cleared anyway. So it's at most a visual glitch.

But while we're at it, can we please stop talking about hitboxes? At least in the context of TH04 and TH05 bullets. The actual collision detection is described way better as a kill delta of 8×8 pixels between the center points of the player and a bullet. You can distribute these pixels to any combination of bullet and player "hitboxes" that make up 8×8. 4×4 around both the player and bullets? 1×1 for bullets, and 8×8 for the player? All equally valid… or perhaps none of them, once you keep in mind that other entity types might have different kill deltas. With that in mind, the concept of a "hitbox" turns into just a confusing abstraction.

The same is true for the 36×44 graze box delta. For some reason, this one is not exactly around the center of a bullet, but shifted to the right by 2 pixels. So, a bullet can be grazed up to 20 pixels right of the player, but only up to 16 pixels left of the player. uth05win also spotted this… and rotated the deltas clockwise by 90°?!


Which brings us to the bullet updates… for which I still had to research a decompilation workaround, because 📝 P0148 turned out to not help at all? Instead, the solution was to lie to the compiler about the true segment distance of the popup function and declare its signature far rather than near. This allowed ZUN to save that ridiculous overhead of 1 additional far function call/return per frame, and those precious 2 bytes in the BSS segment that he didn't have to spend on a segment value. 📝 Another function that didn't have just a single declaration in a common header file… really, 📝 how were these games even built???

The function itself is among the longer ones in both games. It especially stands out in the indentation department, with 7 levels at its most indented point – and that's the minimum of what's possible without goto. Only two more notable discoveries there:

  1. Bullets are the only entity affected by Slow Mode. If the number of bullets on screen is ≥ (24 + (difficulty * 8) + rank) in TH04, or (42 + (difficulty * 8)) in TH05, Slow Mode reduces the frame rate by 33%, by waiting for one additional VSync event every two frames.
    The code also reveals a second tier, with 50% slowdown for a slightly higher number of bullets, but that conditional branch can never be executed :zunpet:
  2. Bullets must have been grazed in a previous frame before they can be collided with. (Note how this does not apply to bullets that spawned on top of the player, as explained earlier!)

Whew… When did ReC98 turn into a full-on code review?! 😅 And after all this, we're still not done with TH04 and TH05 bullets, with all the special movement types still missing. That should be less than one push though, once we get to it. Next up: Back to TH01 and Konngara! Now have fun rewriting the Touhou Wiki Gameplay pages 😛

📝 Posted:
🚚 Summary of:
P0148
Commits:
c940059...e1a26bb
💰 Funded by:
[Anonymous]
🏷 Tags:

Back after taking way too long to get Touhou Patch Center's MediaWiki update feature complete… I'm still waiting for more translators to test and review the new translation interface before delivering and deploying it all, which will most likely lead to another break from ReC98 within the next few months. For now though, I'm happy to have mostly addressed the nagging responsibility I still had after willing that site into existence, and to be back working on ReC98. 🙂

As announced, the next few pushes will focus on TH04's and TH05's bullet spawning code, before I get to put all that accumulated TH01 money towards finishing all of konngara's code in TH01. For a full picture of what's happening with bullets, we'd really also like to have the bullet update function as readable C code though.
Clearing all bullets on the playfield will trigger a Bonus!! popup, displayed as 📝 gaiji in that proportional font. Unfortunately, TLINK refused to link the code as soon as I referenced the function for animating the popups at the top of the playfield? Which can only mean that we have to decompile that function first… :thonk:

So, let's turn that piece of technical debt into a full push, and first decompile another random set of previously reverse-engineered TH04 and TH05 functions. Most of these are stored in a different place within the two MAIN.EXE binaries, and the tried-and-true method of matching segment names would therefore have introduced several unnecessary translation units. So I resorted to a segment splitting technique I should have started using way earlier: Simply creating new segments with names derived from their functions, at the exact positions they're needed. All the new segment start and end directives do bloat the ASM code somewhat, and certainly contributed to this push barely removing any actual lines of code. However, what we get in return is total freedom as far as decompilation order is concerned, 📝 which should be the case for any ReC project, really. And in the end, all these tiny code segments will cancel out anyway.
If only we could do the same with the data segment…


The popup function happened to be the final one I RE'd before my long break in the spring of 2019. Back then, I didn't even bother looking into that 64-frame delay between changing popups, and what that meant for the game.
Each of these popups stays on screen for 128 frames, during which, of course, another popup-worthy event might happen. Handling this cleanly without removing previous popups too early would involve some sort of event queue, whose size might even be meaningfully limited to the number of distinct events that can happen. But still, that'd be a data structure, and we're not gonna have that! :zunpet: Instead, ZUN simply keeps two variables for the new and current popup ID. During an active popup, any change to that ID will only be committed once the current popup has been shown for at least 64 frames. And during that time, that new ID can be freely overwritten with a different one, which drops any previous, undisplayed event. But surely, there won't be more than two events happening within 63 frames, right? :tannedcirno:

The rest was fairly uneventful – no newly RE'd functions in this push, after all – until I reached the widely used helper function for applying the current vertical scrolling offset to a Y coordinate. Its combination of a function parameter, the pascal calling convention, and no stack frame was previously thought to be undecompilable… except that it isn't, and the decompilation didn't even require any new workarounds to be developed? Good thing that I already forgot how impossible it was to decompile the first function I looked at that fell into this category!
Oh well, this discovery wasn't too groundbreaking. Looking back at all the other functions with that combination only revealed a grand total of 1 additional one where a decompilation made sense: TH05's version of snd_kaja_interrupt(), which is now compiled from the same C++ file for all 4 games that use it. And well, looks like some quirks really remain unnoticed and undocumented until you look at a function for the 11th time: Its return value is undefined if BGM is inactive – that is, if the user disabled it, or if no FM board is installed. Not that it matters for the original code, which never uses this function to retrieve anything from KAJA's drivers. But people apparently do copy ReC98 code into their own projects, so it is something to keep in mind.


All in all, nothing quite at jank level in this one, but we were surely grazing that tag. Next up, with that out of the way: The bullet update/step function! Very soon in fact, since I've mostly got it done already.

📝 Posted:
🚚 Summary of:
P0147
Commits:
456b621...c940059
💰 Funded by:
Ember2528, -Tom-
🏷 Tags:

Didn't quite get to cover background rendering for TH05's Stage 1-5 bosses in this one, as I had to reverse-engineer two more fundamental parts involved in boss background rendering before.

First, we got the those blocky transitions from stage tiles to bomb and boss backgrounds, loaded from BB*.BB and ST*.BB, respectively. These files store 16 frames of animation, with every bit corresponding to a 16×16 tile on the playfield. With 384×368 pixels to be covered, that would require 69 bytes per frame. But since that's a very odd number to work with in micro-optimized ASM, ZUN instead stores 512×512 pixels worth of bits, ending up with a frame size of 128 bytes, and a per-frame waste of 59 bytes. :tannedcirno: At least it was possible to decompile the core blitting function as __fastcall for once.
But wait, TH05 comes with, and loads, a bomb .BB file for every character, not just for the Reimu and Yuuka bomb transitions you see in-game… 🤔 Restoring those unused stage tile → bomb image transition animations for Mima and Marisa isn't that trivial without having decompiled their actual bomb animation functions before, so stay tuned!

Interestingly though, the code leaves out what would look like the most obvious optimization: All stage tiles are unconditionally redrawn each frame before they're erased again with the 16×16 blocks, no matter if they weren't covered by such a block in the previous frame, or are going to be covered by such a block in this frame. The same is true for the static bomb and boss background images, where ZUN simply didn't write a .CDG blitting function that takes the dirty tile array into account. If VRAM writes on PC-98 really were as slow as the games' README.TXT files claim them to be, shouldn't all the optimization work have gone towards minimizing them? :thonk: Oh well, it's not like I have any idea what I'm talking about here. I'd better stop talking about anything relating to VRAM performance on PC-98… :onricdennat:


Second, it finally was time to solve the long-standing confusion about all those callbacks that are supposed to render the playfield background. Given the aforementioned static bomb background images, ZUN chose to make this needlessly complicated. And so, we have two callback function pointers: One during bomb animations, one outside of bomb animations, and each boss update function is responsible for keeping the former in sync with the latter. :zunpet:

Other than that, this was one of the smoothest pushes we've had in a while; the hardest parts of boss background rendering all were part of 📝 the last push. Once you figured out that ZUN does indeed dynamically change hardware color #0 based on the current boss phase, the remaining one function for Shinki, and all of EX-Alice's background rendering becomes very straightforward and understandable.


Meanwhile, -Tom- told me about his plans to publicly release 📝 his TH05 scripting toolkit once TH05's MAIN.EXE would hit around 50% RE! That pretty much defines what the next bunch of generic TH05 pushes will go towards: bullets, shared boss code, and one full, concrete boss script to demonstrate how it's all combined. Next up, therefore: TH04's bullet firing code…? Yes, TH04's. I want to see what I'm doing before I tackle the undecompilable mess that is TH05's bullet firing code, and you all probably want readable code for that feature as well. Turns out it's also the perfect place for Blue Bolt's pending contributions.

📝 Posted:
🚚 Summary of:
P0146
Commits:
08bc188...456b621
💰 Funded by:
Ember2528, -Tom-
🏷 Tags:

Y'know, I kinda prefer the pending crowdfunded workload to stay more near the middle of the cap, rather than being sold out all the time. So to reach this point more quickly, let's do the most relaxing thing that can be easily done in TH05 right now: The boss backgrounds, starting with Shinki's, 📝 now that we've got the time to look at it in detail.

… Oh come on, more things that are borderline undecompilable, and require new workarounds to be developed? Yup, Borland C++ always optimizes any comparison of a register with a literal 0 to OR reg, reg, no matter how many calculations and inlined function calls you replace the 0 with. Shinki's background particle rendering function contains a CMP AX, 0 instruction though… so yeah, 📝 yet another piece of custom ASM that's worse than what Turbo C++ 4.0J would have generated if ZUN had just written readable C. This was probably motivated by ZUN insisting that his modified master.lib function for blitting particles takes its X and Y parameters as registers. If he had just used the __fastcall convention, he also would have got the sprite ID passed as a register. 🤷
So, we really don't want to be forced into inline assembly just because of the third comparison in the otherwise perfectly decompilable four-comparison if() expression that prevents invisible particles from being drawn. The workaround: Comparing to a pointer instead, which only the linker gets to resolve to the actual value of 0. :tannedcirno: This way, the compiler has to make room for any 16-bit literal, and can't optimize anything.


And then we go straight from micro-optimization to waste, with all the duplication in the code that animates all those particles together with the zooming and spinning lines. This push decompiled 1.31% of all code in TH05, and thanks to alignment, we're still missing Shinki's high-level background rendering function that calls all the subfunctions I decompiled here.
With all the manipulated state involved here, it's not at all trivial to see how this code produces what you see in-game. Like:

  1. If all lines have the same Y velocity, how do the other three lines in background type B get pushed down into this vertical formation while the top one stays still? (Answer: This velocity is only applied to the top line, the other lines are only pushed based on some delta.)
  2. How can this delta be calculated based on the distance of the top line with its supposed target point around Shinki's wings? (Answer: The velocity is never set to 0, so the top line overshoots this target point in every frame. After calculating the delta, the top line itself is pushed down as well, canceling out the movement. :zunpet:)
  3. Why don't they get pushed down infinitely, but stop eventually? (Answer: We only see four lines out of 20, at indices #0, #6, #12, and #18. In each frame, lines [0..17] are copied to lines [1..18], before anything gets moved. The invisible lines are pushed down based on the delta as well, which defines a distance between the visible lines of (velocity * array gap). And since the velocity is capped at -14 pixels per frame, this also means a maximum distance of 84 pixels between the midpoints of each line.)
  4. And why are the lines moving back up when switching to background type C, before moving down? (Answer: Because type C increases the velocity rather than decreasing it. Therefore, it relies on the previous velocity state from type B to show a gapless animation.)

So yeah, it's a nice-looking effect, just very hard to understand. 😵

With the amount of effort I'm putting into this project, I typically gravitate towards more descriptive function names. Here, however, uth05win's simple and seemingly tiny-brained "background type A/B/C/D" was quite a smart choice. It clearly defines the sequence in which these animations are intended to be shown, and as we've seen with point 4 from the list above, that does indeed matter.

Next up: At least EX-Alice's background animations, and probably also the high-level parts of the background rendering for all the other TH05 bosses.

📝 Posted:
🚚 Summary of:
P0143, P0144, P0145
Commits:
(Website) 9069fb7...c8ac7e5, (Website) c8ac7e5...69dd597, (Website) 69dd597...71417b6
💰 Funded by:
[Anonymous], Yanga, Lmocinemod
🏷 Tags:

Who said working on the website was "fun"? That code is a mess. :tannedcirno: This right here is the first time I seriously wrote a website from (almost) scratch. Its main job is to parse over a Git repository and calculate numbers, so any additional bulky frameworks would only be in the way, and probably need to be run on some sort of wobbly, unmaintainable "stack" anyway, right? 😛 📝 As with the main project though, I'm only beginning to figure out the best structure for this, and these new features prompted quite a lot of upfront refactoring…

Before I start ranting though, let's quickly summarize the most visible change, the new tag system for this blog!

Finally, the order page now shows the exact number of pushes a contribution will fund – no more manual divisions required. Shoutout to the one email I received, which pointed out this potential improvement!


As for the "invisible" changes: The one main feature of this website, the aforementioned calculation of the progress metrics, also turned out as its biggest annoyance over the years. It takes a little while to parse all the big .ASM files in the source tree, once for every push that can affect the average number of removed instructions and unlabeled addresses. And without a cache, we've had to do that every time we re-launch the app server process.
Fundamentally, this is – you might have guessed it – a dependency tracking problem, with two inputs: the .ASM files from the ReC98 repo, and the Golang code that calculates the instruction and PI numbers. Sure, the code has been pretty stable, but what if we do end up extending it one day? I've always disliked manually specified version numbers for use cases like this one, where the problem at hand could be exactly solved with a hashing function, without being prone to human error.

(Sidenote: That's why I never actively supported thcrap mods that affected gameplay while I was still working on that project. We still want to be able to save and share replays made on modded games, but I do not want to subject users to the unacceptable burden of manually remembering which version of which patch stack they've recorded a given replay with. So, we'd somehow need to calculate a hash of everything that defines the gameplay, exclude the things that don't, and only show replays that were recorded on the hash that matches the currently running patch stack. Well, turns out that True Touhou Fans™ quite enjoy watching the games get broken in every possible way. That's the way ZUN intended the games to be experienced, after all. Otherwise, he'd be constantly maintaining the games and shipping bugfix patches… 🤷)

Now, why haven't I been caching the progress numbers all along? Well, parallelizing that parsing process onto all available CPU cores seemed enough in 2019 when this site launched. Back then, the estimates were calculated from slightly over 10 million lines of ASM, which took about 7 seconds to be parsed on my mid-range dev system.
Fast forward to P0142 though, and we have to parse 34.3 million lines of ASM, which takes about 26 seconds on my dev system. That would have only got worse with every new delivery, especially since this production server doesn't have as many cores.

I was thinking about a "doing less" approach for a while: Parsing only the files that had changed between the start and end commit of a push, and keeping those deltas across push boundaries. However, that turned out to be slightly more complex than the few hours I wanted to spend on it. And who knows how well that would have scaled. We've still got a few hundred pushes left to go before we're done here, after all.

So with the tag system, as always, taking longer and consuming more pushes than I had planned, the time had come to finally address the underlying dependency tracking problem.
Initially, this sounded like a nail that was tailor-made for 📝 my favorite hammer, Tup: Move the parser to a separate binary, gather the list of all commits via git rev-list, and run that parser binary on every one of the commits returned. That should end up correctly tracking the relevant parts of .git/ and the new binary as inputs, and cause the commits to be re-parsed if the parser binary changes, right? Too bad that Tup both refuses to track anything inside .git/, and can't track a Golang binary either, due to all of the compiler's unpredictable outputs into its build cache. But can't we at least turn off–

> The build cache is now required as a step toward eliminating $GOPATH/pkg. — Go 1.12 release notes

Oh, wonderful. Hey, I always liked $GOPATH! 🙁

But sure, Golang is too smart anyway to require an external build system. The compiler's build ID is exactly what we need to correctly invalidate the progress number cache. Surely there is a way to retrieve the build ID for any package that makes up a binary at runtime via some kind of reflection, right? Right? …Of course not, in the great Unix tradition, this functionality is only available as a CLI tool that prints its result to stdout. 🙄
But sure, no problem, let's just exec() a separate process on the parser's library package file… oh wait, such a thing doesn't exist anymore, unless you manually install the package. This would have added another complication to the build process, and you'd still have to manually locate the package file, with its version-specific directory name. That might have worked out in the end, but figuring all this out would have probably gone way beyond the budget.

OK, but who cares about packages? We just care about one single file here, anyway. Didn't they put the official Golang source code parser into the standard library? Maybe that can give us something close to the build ID, by hashing the abstract syntax tree of that file. Well, for starters, one does not simply serialize the returned AST. At least into Golang's own, most "native" Gob format, which requires all types from the go/ast package to be manually registered first.
That leaves ast.Fprint() as the only thing close to a ready-made serialization function… and guess what, that one suffers from Golang's typical non-deterministic order when rendering any map to a string. 🤦

Guess there's no way around the simplest, most stupid way of simply calculating any cryptographically secure hash over the ASM parser file. 😶 It's not like we frequently change comments in this file, but still, this could have been so much nicer.
Oh well, at least I did get that issue resolved now, in an acceptable way. If you ever happened to see this website rebuilding: That should now be a matter of seconds, rather than minutes. Next up: Shinki's background animations!

📝 Posted:
🚚 Summary of:
P0140, P0141, P0142
Commits:
d985811...d856f7d, d856f7d...5afee78, 5afee78...08bc188
💰 Funded by:
[Anonymous], rosenrose, Yanga
🏷 Tags:

Alright, onto Konngara! Let's quickly move the escape sequences used later in the battle to C land, and then we can immediately decompile the loading and entrance animation function together with its filenames. Might as well reverse-engineer those escape sequences while I'm at it, though – even if they aren't implemented in DOSBox-X, they're well documented in all those Japanese PDFs, so this should be no big deal…

…wait, ESC )3 switches to "graph mode"? As opposed to the default "kanji mode", which can be re-entered via ESC )0? Let's look up graph mode in the PC-9801 Programmers' Bible then…

> Kanji cannot be handled in this mode.

…and that's apparently all it has to say. Why have it then, on a platform whose main selling point is a kanji ROM, and where Shift-JIS (and, well, 7-bit ASCII) are the only native encodings? No support for graph mode in DOSBox-X either… yeah, let's take a deep dive into NEC's IO.SYS, and get to the bottom of this.

And yes, graph mode pretty much just disables Shift-JIS decoding for characters written via INT 29h, the lowest-level way of "just printing a char" on DOS, which every printf() will ultimately end up calling. Turns out there is a use for it though, which we can spot by looking at the 8×16 half-width section of font ROM:

8×16 half-width section of font ROM, with the characters in the Shift-JIS lead byte range highlighted in red

The half-width glyphs marked in red correspond to the byte ranges from 0x80-0x9F and 0xE0-0xFF… which Shift-JIS defines as lead bytes for two-byte, full-width characters. But if we turn off Shift-JIS decoding…

Visible differences between the kanji and graph modes on PC-98 DOS
(Yes, that g in the function row is how NEC DOS indicates that graph mode is active. Try it yourself by pressing Ctrl+F4!)

Jackpot, we get those half-width characters when printing their corresponding bytes.
I've re-implemented all my findings into DOSBox-X, which will include graph mode in the upcoming 0.83.14 release. If P0140 looks a bit empty as a result, that's why – most of the immediate feature work went into DOSBox-X, not into ReC98. That's the beauty of "anything" pushes. :tannedcirno:

So, after switching to graph mode, TH01 does… one of the slowest possible memset()s over all of text RAM – one printf(" ") call for every single one of its 80×25 half-width cells – before switching back to kanji mode. What a waste of RE time…? Oh well, at least we've now got plenty of proof that these weird escape sequences actually do nothing of interest.


As for the Konngara code itself… well, it's script-like code, what can you say. Maybe minimally sloppy in some places, but ultimately harmless.
One small thing that might not be widely known though: The large, blue-green Siddhaṃ seed syllables are supposed to show up immediately, with no delay between them? Good to know. Clocking your emulator too low tends to roll them down from the top of the screen, and will certainly add a noticeable delay between the four individual images.

… Wait, but this means that ZUN could have intended this "effect". Why else would he not only put those syllables into four individual images (and therefore add at least the latency of disk I/O between them), but also show them on the foreground VRAM page, rather than on the "back buffer"?

Meanwhile, in 📝 another instance of "maybe having gone too far in a few places": Expressing distances on the playfield as fractions of its width and height, just to avoid absolute numbers? Raw numbers are bad because they're in screen space in this game. But we've already been throwing PLAYFIELD_ constants into the mix as a way of explicitly communicating screen space, and keeping raw number literals for the actual playfield coordinates is looking increasingly sloppy… I don't know, fractions really seemed like the most sensible thing to do with what we're given here. 😐


So, 2 pushes in, and we've got the loading code, the entrance animation, facial expression rendering, and the first one out of Konngara's 12 danmaku patterns. Might not sound like much, but since that first pattern involves those ◆ blue-green diamond sprites and therefore is one of the more complicated ones, it all amounts to roughly 21.6% of Konngara's code. That's 7 more pushes to get Konngara done, then? Next up though: Two pushes of website improvements.

📝 Posted:
🏷 Tags:

Last updated: 📝 2024-02-03

Secured a 22.5-hour RL workweek to leave plenty of time for this project, Touhou Patch Center's commissioned MediaWiki update work is also nearing completion, time to reopen the store! Since it's been a long time, here's an overview of where we currently are in each game and binary, and what the next logical step would be:

But as always, you can request pretty much any other part of any game. We're now at a pretty good place as far as arbitrary requests are concerned, as I simply can't decide myself where to put all the current pending contributions in the funding backlog. 😅 By spending only the missing amount of money to complete any of those, you can capture any of those "fractional" contributions towards a specific goal.

The next specific requests are going to set the priorities of this project for quite some time! The best strategy: Spend a low amount of money on something very specific, and watch as existing generic contributions will necessarily have to be put towards making that specific goal happen 😛

📝 Posted:
🚚 Summary of:
P0139
Commits:
864e864...d985811
💰 Funded by:
[Anonymous]
🏷 Tags:

Technical debt, part 10… in which two of the PMD-related functions came with such complex ramifications that they required one full push after all, leaving no room for the additional decompilations I wanted to do. At least, this did end up being the final one, completing all SHARED segments for the time being.


The first one of these functions determines the BGM and sound effect modes, combining the resident type of the PMD driver with the Option menu setting. The TH04 and TH05 version is apparently coded quite smartly, as PC-98 Touhou only needs to distinguish "OPN- / PC-9801-26K-compatible sound sources handled by PMD.COM" from "everything else", since all other PMD varieties are OPNA- / PC-9801-86-compatible.
Therefore, I only documented those two results returned from PMD's AH=09h function. I'll leave a comprehensive, fully documented enum to interested contributors, since that would involve research into basically the entire history of the PC-9800 series, and even the clearly out-of-scope PC-88VA. After all, distinguishing between more versions of the PMD driver in the Option menu (and adding new sprites for them!) is strictly mod territory.


The honor of being the final decompiled function in any SHARED segment went to TH04's snd_load(). TH04 contains by far the sanest version of this function: Readable C code, no new ZUN bugs (and still missing file I/O error handling, of course)… but wait, what about that actual file read syscall, using the INT 21h, AH=3Fh DOS file read API? Reading up to a hardcoded number of bytes into PMD's or MMD's song or sound effect buffer, 20 KiB in TH02-TH04, 64 KiB in TH05… that's kind of weird. About time we looked closer into this. :thonk:

Turns out that no, KAJA's driver doesn't give you the full 64 KiB of one memory segment for these, as especially TH05's code might suggest to anyone unfamiliar with these drivers. :zunpet: Instead, you can customize the size of these buffers on its command line. In GAME.BAT, ZUN allocates 8 KiB for FM songs, 2 KiB for sound effects, and 12 KiB for MMD files in TH02… which means that the hardcoded sizes in snd_load() are completely wrong, no matter how you look at them. :onricdennat: Consequently, this read syscall will overflow PMD's or MMD's song or sound effect buffer if the given file is larger than the respective buffer size.
Now, ZUN could have simply hardcoded the sizes from GAME.BAT instead, and it would have been fine. As it also turns out though, PMD has an API function (AH=22h) to retrieve the actual buffer sizes, provided for exactly that purpose. There is little excuse not to use it, as it also gives you PMD's default sizes if you don't specify any yourself.
(Unless your build process enumerates all PMD files that are part of the game, and bakes the largest size into both snd_load() and GAME.BAT. That would even work with MMD, which doesn't have an equivalent for AH=22h.)

What'd be the consequence of loading a larger file then? Well, since we don't get a full segment, let's look at the theoretical limit first.
PMD prefers to keep both its driver code and the data buffers in a single memory segment. As a result, the limit for the combined size of the song, instrument, and sound effect buffer is determined by the amount of code in the driver itself. In PMD86 version 4.8o (bundled with TH04 and TH05) for example, the remaining size for these buffers is exactly 45,555 bytes. Being an actually good programmer who doesn't blindly trust user input, KAJA thankfully validates the sizes given via the /M, /V, and /E command-line options before letting the driver reside in memory, and shuts down with an error message if they exceed 40 KiB. Would have been even better if he calculated the exact size – even in the current PMD version 4.8s from January 2020, it's still a hardcoded value (see line 8581).
Either way: If the file is larger than this maximum, the concrete effect is down to the INT 21h, AH=3Fh implementation in the underlying DOS version. DOS 3.3 treats the destination address as linear and reads past the end of the segment, DOS 5.0 and DOSBox-X truncate the number of bytes to not exceed the remaining space in the segment, and maybe there's even a DOS that wraps around and ends up overwriting the PMD driver code. In any case: You will overwrite what's after the driver in memory – typically, the game .EXE and its master.lib functions.

It almost feels like a happy accident that this doesn't cause issues in the original games. The largest PMD file in any of the 4 games, the -86 version of 幽夢 ~ Inanimate Dream, takes up 8,099 bytes, just under the 8,192 byte limit for BGM. For modders, I'd really recommend implementing this properly, with PMD's AH=22h function and error handling, once position independence has been reached.

Whew, didn't think I'd be doing more research into KAJA's drivers during regular ReC98 development! That's probably been the final time though, as all involved functions are now decompiled, and I'm unlikely to iterate over them again.


And that's it! Repaid the biggest chunk of technical debt, time for some actual progress again. Next up: Reopening the store tomorrow, and waiting for new priorities. If we got nothing by Sunday, I'm going to put the pending [Anonymous] pushes towards some work on the website.

📝 Posted:
🚚 Summary of:
P0138
Commits:
8d953dc...864e864
💰 Funded by:
[Anonymous], Blue Bolt
🏷 Tags:

Technical debt, part 9… and as it turns out, it's highly impractical to repay 100% of it at this point in development. 😕

The reason: graph_putsa_fx(), ZUN's function for rendering optionally boldfaced text to VRAM using the font ROM glyphs, in its ridiculously micro-optimized TH04 and TH05 version. This one sets the "callback function" for applying the boldface effect by self-modifying the target of two CALL rel16 instructions… because there really wasn't any free register left for an indirect CALL, eh? The necessary distance, from the call site to the function itself, has to be calculated at assembly time, by subtracting the target function label from the call site label.
This usually wouldn't be a problem… if ZUN didn't store the resulting lookup tables in the .DATA segment. With code segments, we can easily split them at pretty much any point between functions because there are multiple of them. But there's only a single .DATA segment, with all ZUN and master.lib data sandwiched between Borland C++'s crt0 at the top, and Borland C++'s library functions at the bottom of the segment. Adding another split point would require all data after that point to be moved to its own translation unit, which in turn requires EXTERN references in the big .ASM file to all that moved data… in short, it would turn the codebase into an even greater mess.
Declaring the labels as EXTERN wouldn't work either, since the linker can't do fancy arithmetic and is limited to simply replacing address placeholders with one single address. So, we're now stuck with this function at the bottom of the SHARED segment, for the foreseeable future.


We can still continue to separate functions off the top of that segment, though. Pretty much the only thing noteworthy there, so far: TH04's code for loading stage tile images from .MPN files, which we hadn't reverse-engineered so far, and which nicely fit into one of Blue Bolt's pending ⅓ RE contributions. Yup, we finally moved the RE% bars again! If only for a tiny bit. :tannedcirno:
Both TH02 and TH05 simply store one pointer to one dynamically allocated memory block for all tile images, as well as the number of images, in the data segment. TH04, on the other hand, reserves memory for 8 .MPN slots, complete with their color palettes, even though it only ever uses the first one of these. There goes another 458 bytes of conventional RAM… I should start summing up all the waste we've seen so far. Let's put the next website contribution towards a tagging system for these blog posts.

At 86% of technical debt in the SHARED segment repaid, we aren't quite done yet, but the rest is mostly just TH04 needing to catch up with functions we've already separated. Next up: Getting to that practical 98.5% point. Since this is very likely to not require a full push, I'll also decompile some more actual TH04 and TH05 game code I previously reverse-engineered – and after that, reopen the store!

📝 Posted:
🚚 Summary of:
P0137
Commits:
07bfcf2...8d953dc
💰 Funded by:
[Anonymous]
🏷 Tags:

Whoops, the build was broken again? Since P0127 from mid-November 2020, on TASM32 version 5.3, which also happens to be the one in the DevKit… That version changed the alignment for the default segments of certain memory models when requesting .386 support. And since redefining segment alignment apparently is highly illegal and absolutely has to be a build error, some of the stand-alone .ASM translation units didn't assemble anymore on this version. I've only spotted this on my own because I casually compiled ReC98 somewhere else – on my development system, I happened to have TASM32 version 5.0 in the PATH during all this time.
At least this was a good occasion to get rid of some weird segment alignment workarounds from 2015, and replace them with the superior convention of using the USE16 modifier for the .MODEL directive.

ReC98 would highly benefit from a build server – both in order to immediately spot issues like this one, and as a service for modders. Even more so than the usual open-source project of its size, I would say. But that might be exactly because it doesn't seem like something you can trivially outsource to one of the big CI providers for open-source projects, and quickly set it up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular 64-bit Windows 10 and DOSBox running the exact build tools from the DevKit. Ideally, though, such a server should really run the optimal configuration of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step to run natively, which already is something that no popular CI service out there offers. Then, we'd optimally expand to Linux, every other Windows version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd be a lot. An experimental project all on its own, with additional hosting costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest there is once the store reopens (which will be at the beginning of May, at the latest). That aside, it would 📝 also be a great project for outside contributors!


So, technical debt, part 8… and right away, we're faced with TH03's low-level input function, which 📝 once 📝 again 📝 insists on being word-aligned in a way we can't fake without duplicating translation units. Being undecompilable isn't exactly the best property for a function that has been interesting to modders in the past: In 2018, spaztron64 created an ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human multiplayer mode: 2021-04-04-TH03-WASD-2player.zip However, this remapping attempt remained quite limited, since we hadn't (and still haven't) reached full position independence for TH03 yet. There's quite some potential for size optimizations in this function, which would allow more BIOS key groups to already be used right now, but it's not all that obvious to modders who aren't intimately familiar with x86 ASM. Therefore, I really wouldn't want to keep such a long and important function in ASM if we don't absolutely have to…

… and apparently, that's all the motivation I needed? So I took the risk, and spent the first half of this push on reverse-engineering TCC.EXE, to hopefully find a way to get word-aligned code segments out of Turbo C++ after all.

And there is! The -WX option, used for creating DPMI applications, messes up all sorts of code generation aspects in weird ways, but does in fact mark the code segment as word-aligned. We can consider ourselves quite lucky that we get to use Turbo C++ 4.0, because this feature isn't available in any previous version of Borland's C++ compilers.
That allowed us to restore all the decompilations I previously threw away… well, two of the three, that lookup table generator was too much of a mess in C. :tannedcirno: But what an abuse this is. The subtly different code generation has basically required one creative workaround per usage of -WX. For example, enabling that option causes the regular PUSH BP and POP BP prolog and epilog instructions to be wrapped with INC BP and DEC BP, for some reason:

a_function_compiled_with_wx proc
	inc 	bp    	; ???
	push	bp
	mov 	bp, sp
	    	      	; [… function code …]
	pop 	bp
	dec 	bp    	; ???
	ret
a_function_compiled_with_wx endp

Luckily again, all the functions that currently require -WX don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty close: snd_se_reset(void) is one of the functions that require word alignment. Previously, it shared a translation unit with the immediately following snd_se_play(int new_se), which does take a parameter, and therefore would have had its prolog and epilog code messed up by -WX. Since the latter function has a consistent (and thus, fakeable) alignment, I simply split that code segment into two, with a new -WX translation unit for just snd_se_reset(void). Problem solved – after all, two C++ translation units are still better than one ASM translation unit. :onricdennat: Especially with all the previous #include improvements.

The rest was more of the usual, getting us 74% done with repaying the technical debt in the SHARED segment. A lot of the remaining 26% is TH04 needing to catch up with TH03 and TH05, which takes comparatively little time. With some good luck, we might get this done within the next push… that is, if we aren't confronted with all too many more disgusting decompilations, like the two functions that ended this push. If we are, we might be needing 10 pushes to complete this after all, but that piece of research was definitely worth the delay. Next up: One more of these.

📝 Posted:
🚚 Summary of:
P0135, P0136
Commits:
a6eed55...252c13d, 252c13d...07bfcf2
💰 Funded by:
[Anonymous]
🏷 Tags:

Alright, no more big code maintenance tasks that absolutely need to be done right now. Time to really focus on parts 6 and 7 of repaying technical debt, right? Except that we don't get to speed up just yet, as TH05's barely decompilable PMD file loading function is rather… complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98 Touhou, I first consult the disassembly of Wolfenstein 3D. That game was originally compiled with the quite similar Borland C++ 3.0, so it's quite helpful to compare its ASM to the officially released source code. If I find the instructions in question, they mostly come from that game's ASM code, leading to the amusing realization that "even John Carmack was unable to get these instructions out of this compiler" :onricdennat: This time though, Wolfenstein 3D did point me to Borland's intrinsics for common C functions like memcpy() and strchr(), available via #pragma intrinsic. Bu~t those unfortunately still generate worse code than what ZUN micro-optimized here. Commenting how these sequences of instructions should look in C is unfortunately all I could do here.
The conditional branches in this function did compile quite nicely though, clarifying the control flow, and clearly exposing a ZUN bug: TH05's snd_load() will hang in an infinite loop when trying to load a non-existing -86 BGM file (with a .M2 extension) if the corresponding -26 BGM file (with a .M extension) doesn't exist either.

Unsurprisingly, the PMD channel monitoring code in TH05's Music Room remains undecompilable outside the two most "high-level" initialization and rendering functions. And it's not because there's data in the middle of the code segment – that would have actually been possible with some #pragmas to ensure that the data and code segments have the same name. As soon as the SI and DI registers are referenced anywhere, Turbo C++ insists on emitting prolog code to save these on the stack at the beginning of the function, and epilog code to restore them from there before returning. Found that out in September 2019, and confirmed that there's no way around it. All the small helper functions here are quite simply too optimized, throwing away any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate that I do try.


Within that same 6th push though, we've finally reached the one function in TH05 that was blocking further progress in TH04, allowing that game to finally catch up with the others in terms of separated translation units. Feels good to finally delete more of those .ASM files we've decompiled a while ago… finally!

But since that was just getting started, the most satisfying development in both of these pushes actually came from some more experiments with macros and inline functions for near-ASM code. By adding "unused" dummy parameters for all relevant registers, the exact input registers are made more explicit, which might help future port authors who then maybe wouldn't have to look them up in an x86 instruction reference quite as often. At its best, this even allows us to declare certain functions with the __fastcall convention and express their parameter lists as regular C, with no additional pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even more amazing than previously thought when it comes to returning pseudo-registers from inline functions. A nice example for how this can improve readability can be found in this piece of TH02 code for polling the PC-98 keyboard state using a BIOS interrupt:

inline uint8_t keygroup_sense(uint8_t group) {
	_AL = group;
	_AH = 0x04;
	geninterrupt(0x18);
	// This turns the output register of this BIOS call into the return value
	// of this function. Surprisingly enough, this does *not* naively generate
	// the `MOV AL, AH` instruction you might expect here!
	return _AH;
}

void input_sense(void)
{
	// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
	// never emits as such, giving us only the three instructions we need.
	_AH = keygroup_sense(8);

	// Whereas this one gives us the one additional `MOV BH, AH` instruction
	// we'd expect, and nothing more.
	_BH = keygroup_sense(7);

	// And now it's obvious what both of these registers contain, from just
	// the assignments above.
	if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
		key_det |= INPUT_UP;
	}
	// […]
}

I love it. No inline assembly, as close to idiomatic C code as something like this is going to get, yet still compiling into the minimum possible number of x86 instructions on even a 1994 compiler. This is how I keep this project interesting for myself during chores like these. :tannedcirno: We might have even reached peak inline already?

And that's 65% of technical debt in the SHARED segment repaid so far. Next up: Two more of these, which might already complete that segment? Finally!

📝 Posted:
🚚 Summary of:
P0134
Commits:
1d5db71...a6eed55
💰 Funded by:
[Anonymous]
🏷 Tags:

Technical debt, part 5… and we only got TH05's stupidly optimized .PI functions this time?

As far as actual progress is concerned, that is. In maintenance news though, I was really hyped for the #include improvements I've mentioned in 📝 the last post. The result: A new x86real.h file, bundling all the declarations specific to the 16-bit x86 Real Mode in a smaller file than Turbo C++'s own DOS.H. After all, DOS is something else than the underlying CPU. And while it didn't speed up build times quite as much as I had hoped, it now clearly indicates the x86-specific parts of PC-98 Touhou code to future port authors.

After another couple of improvements to parameter declaration in ASM land, we get to TH05's .PI functions… and really, why did ZUN write all of them in ASM? Why (re)declare all the necessary structures and data in ASM land, when all these functions are merely one layer of abstraction above master.lib, which does all the actual work?
I get that ZUN might have wanted masked blitting to be faster, which is used for the fade-in effect seen during TH05's main menu animation and the ending artwork. But, uh… he knew how to modify master.lib. In fact, he did already modify the graph_pack_put_8() function used for rendering a single .PI image row, to ignore master.lib's VRAM clipping region. For this effect though, he first blits each row regularly to the invisible 400th row of VRAM, and then does an EGC-accelerated VRAM-to-VRAM blit of that row to its actual target position with the mask enabled. It would have been way more efficient to add another version of this function that takes a mask pattern. No amount of REP MOVSW is going to change the fact that two VRAM writes per line are slower than a single one. Not to mention that it doesn't justify writing every other .PI function in ASM to go along with it…
This is where we also find the most hilarious aspect about this: For most of ZUN's pointless micro-optimizations, you could have maybe made the argument that they do save some CPU cycles here and there, and therefore did something positive to the final, PC-98-exclusive result. But some of the hand-written ASM here doesn't even constitute a micro-optimization, because it's worse than what you would have got out of even Turbo C++ 4.0J with its 80386 optimization flags! :zunpet:

At least it was possible to "decompile" 6 out of the 10 functions here, making them easy to clean up for future modders and port authors. Could have been 7 functions if I also decided to "decompile" pi_free(), but all the C++ code is already surrounded by ASM, resulting in 2 ASM translation units and 2 C++ translation units. pi_free() would have needed a single translation unit by itself, which wasn't worth it, given that I would have had to spell out every single ASM instruction anyway.

void pascal pi_free(int slot)
{
	if(pi_buffers[slot]) {
		graph_pi_free(&pi_headers[slot], &pi_buffers[slot]);
		pi_buffers[slot] = NULL;
	}
}

There you go. What about this needed to be written in ASM?!?

The function calls between these small translation units even seemed to glitch out TASM and the linker in the end, leading to one CALL offset being weirdly shifted by 32 bytes. Usually, TLINK reports a fixup overflow error when this happens, but this time it didn't, for some reason? Mirroring the segment grouping in the affected translation unit did solve the problem, and I already knew this, but only thought of it after spending quite some RTFM time… during which I discovered the -lE switch, which enables TLINK to use the expanded dictionaries in Borland's .OBJ and .LIB files to speed up linking. That shaved off roughly another second from the build time of the complete ReC98 repository. The more you know… Binary blobs compiled with non-Borland tools would be the only reason not to use this flag.

So, even more slowdown with this 5th dedicated push, since we've still only repaid 41% of the technical debt in the SHARED segment so far. Next up: Part 6, which hopefully manages to decompile the FM and SSG channel animations in TH05's Music Room, and hopefully ends up being the final one of the slow ones.

📝 Posted:
🚚 Summary of:
P0133
Commits:
045450c...1d5db71
💰 Funded by:
[Anonymous]
🏷 Tags:

Wow, 31 commits in a single push? Well, what the last push had in progress, this one had in maintenance. The 📝 master.lib header transition absolutely had to be completed in this one, for my own sanity. And indeed, it reduced the build time for the entirety of ReC98 to about 27 seconds on my system, just as expected in the original announcement. Looking forward to even faster build times with the upcoming #include improvements I've got up my sleeve! The port authors of the future are going to appreciate those quite a bit.

As for the new translation units, the funniest one is probably TH05's function for blitting the 1-color .CDG images used for the main menu options. Which is so optimized that it becomes decompilable again, by ditching the self-modifying code of its TH04 counterpart in favor of simply making better use of CPU registers. The resulting C code is still a mess, but what can you do. :tannedcirno:
This was followed by even more TH05 functions that clearly weren't compiled from C, as evidenced by their padding bytes. It's about time I've documented my lack of ideas of how to get those out of Turbo C++. :onricdennat:

And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…

In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?

📝 Posted:
🚚 Summary of:
P0132
Commits:
dc9e3ee...045450c
💰 Funded by:
[Anonymous]
🏷 Tags:

Now that's the amount of translation unit separation progress I was looking for! Too bad that RL is keeping me more and more occupied these days, and ended up delaying this push until 2021. Now that Touhou Patch Center is also commissioning me to update their infrastructure, it's going to take a while for ReC98 to return to full speed, and for the store to be reopened. Should happen by April at the latest, though!

With everything related to this separation of translation units explained earlier, we've really got a push with nothing to talk about, this time. Except, maybe, for the realization that 📝 this current approach might not be the best fit for TH02 after all: Not only did it force us to 📝 throw away the previous decompilation of the sound effect playback functions, but OP.EXE also contains obviously copy-pasted code in addition to the common, shared set of library functions. How was that game even built, originally??? No way around compiling that one instance of the "delay until given BGM measure" function separately then, if it insists on using its own instance of the VSync delay function…
Oh well, this separated layout still works better for the later games, and consistency is good. Smooth sailing with all of the other functions, at least.

Next up: One more of these, which might even end up completing the 📝 transition to our own master.lib header file. In terms of the total number of ASM code left in the SHARED code segments, we're now 30% done after 3 dedicated pushes. It really shouldn't require 7 more pushes, though!

📝 Posted:
🚚 Summary of:
P0130, P0131
Commits:
6d69ea8...576def5, 576def5...dc9e3ee
💰 Funded by:
Yanga
🏷 Tags:

50% hype! 🎉 But as usual for TH01, even that final set of functions shared between all bosses had to consume two pushes rather than one…

First up, in the ongoing series "Things that TH01 draws to the PC-98 graphics layer that really should have been drawn to the text layer instead": The boss HP bar. Oh well, using the graphics layer at least made it possible to have this half-red, half-white pattern for the middle section.
This one pattern is drawn by making surprisingly good use of the GRCG. So far, we've only seen it used for fast monochrome drawing:

// Setting up fast drawing using color #9 (1001 in binary)
grcg_setmode(GC_RMW);
outportb(0x7E, 0xFF); // Plane 0: (B): (********)
outportb(0x7E, 0x00); // Plane 1: (R): (        )
outportb(0x7E, 0x00); // Plane 2: (G): (        )
outportb(0x7E, 0xFF); // Plane 3: (E): (********)

// Write a checkerboard pattern (* * * * ) in color #9 to the top-left corner,
// with transparent blanks. Requires only 1 VRAM write to a single bitplane:
// The GRCG automatically writes to the correct bitplanes, as specified above
*(uint8_t *)(MK_FP(0xA800, 0)) = 0xAA;

But since this is actually an 8-pixel tile register, we can set any 8-pixel pattern for any bitplane. This way, we can get different colors for every one of the 8 pixels, with still just a single VRAM write of the alpha mask to a single bitplane:

grcg_setmode(GC_RMW); //  Final color: (A7A7A7A7)
outportb(0x7E, 0x55); // Plane 0: (B): ( * * * *)
outportb(0x7E, 0xFF); // Plane 1: (R): (********)
outportb(0x7E, 0x55); // Plane 2: (G): ( * * * *)
outportb(0x7E, 0xAA); // Plane 3: (E): (* * * * )

And I thought TH01 only suffered the drawbacks of PC-98 hardware, making so little use of its actual features that it's perhaps not fair to even call it "a PC-98 game"… Still, I'd say that "bad PC-98 port of an idea" describes it best.

However, after that tiny flash of brilliance, the surrounding HP rendering code goes right back to being the typical sort of confusing TH01 jank. There's only a single function for the three distinct jobs of

with magic numbers to select between all of these.

VRAM of course also means that the backgrounds behind the individual hit points have to be stored, so that they can be unblitted later as the boss is losing HP. That's no big deal though, right? Just allocate some memory, copy what's initially in VRAM, then blit it back later using your foundational set of blitting funct– oh, wait, TH01 doesn't have this sort of thing, right :tannedcirno: The closest thing, 📝 once again, are the .PTN functions. And so, the game ends up handling these 8×16 background sprites with 16×16 wrappers around functions for 32×32 sprites. :zunpet: That's quite the recipe for confusion, especially since ZUN preferred copy-pasting the necessary ridiculous arithmetic expressions for calculating positions, .PTN sprite IDs, and the ID of the 16×16 quarter inside the 32×32 sprite, instead of just writing simple helper functions. He did manage to make the result mostly bug-free this time around, though! (Edit (2022-05-31): Nope, there's a 📝 potential heap corruption after all, which can be triggered in some fights in test mode (game t) or debug mode (game d).) There's one minor hit point discoloration bug if the red-white or white sections start at an odd number of hit points, but that's never the case for any of the original 7 bosses.
The remaining sloppiness is ultimately inconsequential as well: The game always backs up twice the number of hit point backgrounds, and thus uses twice the amount of memory actually required. Also, this self-restriction of only unblitting 16×16 pixels at a time requires any remaining odd hit point at the last position to, of course, be rendered again :onricdennat:


After stumbling over the weakest imaginable random number generator, we finally arrive at the shared boss↔orb collision handling function, the final blocker among the final blockers. This function takes a whopping 12 parameters, 3 of them being references to int values, some of which are duplicated for every one of the 7 bosses, with no generic boss struct anywhere. 📝 Previously, I speculated that YuugenMagan might have been the first boss to be programmed for TH01. With all these variables though, there is some new evidence that SinGyoku might have been the first one after all: It's the only boss to use its own HP and phase frame variables, with the other bosses sharing the same two globals.

While this function only handles the response to a boss↔orb collision, it still does way too much to describe it briefly. Took me quite a while to frame it in terms of invincibility (which is the main impact of all of this that can be observed in gameplay code). That made at least some sort of sense, considering the other usages of the variables passed as references to that function. Turns out that YuugenMagan, Kikuri, and Elis abuse what's meant to be the "invincibility frame" variable as a frame counter for some of their animations 🙄
Oh well, the game at least doesn't call the collision handling function during those, so "invincibility frame" is technically still a correct variable name there.


And that's it! We're finally ready to start with Konngara, in 2021. I've been waiting quite a while for this, as all this high-level boss code is very likely to speed up TH01 progress quite a bit. Next up though: Closing out 2020 with more of the technical debt in the other games.

📝 Posted:
🚚 Summary of:
P0128, P0129
Commits:
dc65b59...dde36f7, dde36f7...f4c2e45
💰 Funded by:
Yanga
🏷 Tags:

So, only one card-flipping function missing, and then we can start decompiling TH01's two final bosses? Unfortunately, that had to be the one big function that initializes and renders all gameplay objects. #17 on the list of longest functions in all of PC-98 Touhou, requiring two pushes to fully understand what's going on there… and then it immediately returns for all "boss" stages whose number is divisible by 5, yet is still called during Sariel's and Konngara's initialization 🤦

Oh well. This also involved the final file format we hadn't looked at yet – the STAGE?.DAT files that describe the layout for all stages within a single 5-stage scene. Which, for a change is a very well-designed form– no, of course it's completely weird, what did you expect? Development must have looked somewhat like this:

With all that, it's almost not worth mentioning how there are 12 turret types, which only differ in which hardcoded pellet group they fire at a hardcoded interval of either 100 or 200 frames, and that they're all explicitly spelled out in every single switch statement. Or how the layout of the internal card and obstacle SoA classes is quite disjointed. So here's the new ZUN bugs you've probably already been expecting!


Cards and obstacles are blitted to both VRAM pages. This way, any other entities moving on top of them can simply be unblitted by restoring pixels from VRAM page 1, without requiring the stationary objects to be redrawn from main memory. Obviously, the backgrounds behind the cards have to be stored somewhere, since the player can remove them. For faster transitions between stages of a scene, ZUN chose to store the backgrounds behind obstacles as well. This way, the background image really only needs to be blitted for the first stage in a scene.

All that memory for the object backgrounds adds up quite a bit though. ZUN actually made the correct choice here and picked a memory allocation function that can return more than the 64 KiB of a single x86 Real Mode segment. He then accesses the individual backgrounds via regular array subscripts… and that's where the bug lies, because he stores the returned address in a regular far pointer rather than a huge one. This way, the game still can only display a total of 102 objects (i. e., cards and obstacles combined) per stage, without any unblitting glitches.
What a shame, that limit could have been 127 if ZUN didn't needlessly allocate memory for alpha planes when backing up VRAM content. :onricdennat:

And since array subscripts on far pointers wrap around after 64 KiB, trying to save the background of the 103rd object is guaranteed to corrupt the memory block header at the beginning of the returned segment. :zunpet: When TH01 runs in debug mode, it correctly reports a corrupted heap in this case.
After detecting such a corruption, the game loudly reports it by playing the "player hit" sound effect and locking up, freezing any further gameplay or rendering. The locking loop can be left by pressing ↵ Return, but the game will simply re-enter it if the corruption is still present during the next heapcheck(), in the next frame. And since heap corruptions don't tend to repair themselves, you'd have to constantly hold ↵ Return to resume gameplay. Doing that could actually get you safely to the next boss, since the game doesn't allocate or free any further heap memory during a 5-stage card-flipping scene, and just throws away its C heap when restarting the process for a boss. But then again, holding ↵ Return will also auto-flip all cards on the way there… 🤨


Finally, some unused content! Upon discovering TH01's stage selection debug feature, probably everyone tried to access Stage 21, just to see what happens, and indeed landed in an actual stage, with a black background and a weird color palette. Turns out that ZUN did ship an unused scene in SCENE7.DAT, which is exactly what's loaded there.
However, it's easy to believe that this is just garbage data (as I initially did): At the beginning of "Stage 22", the game seems to enter an infinite loop somewhere during the flip-in animation.

Well, we've had a heap overflow above, and the cause here is nothing but a stack buffer overflow – a perhaps more modern kind of classic C bug, given its prevalence in the Windows Touhou games. Explained in a few lines of code:

void stageobjs_init_and_render()
{
	int card_animation_frames[50]; // even though there can be up to 200?!
	int total_frames = 0;

	(code that would end up resetting total_frames if it ever tried to reset
	card_animation_frames[50]…)
}

The number of cards in "Stage 22"? 76. There you have it.

But of course, it's trivial to disable this animation and fix these stage transitions. So here they are, Stages 21 to 24, as shipped with the game in STAGE7.DAT:

TH01 stage 21, loaded from <code>STAGE7.DAT</code>TH01 stage 22, loaded from <code>STAGE7.DAT</code>TH01 stage 23, loaded from <code>STAGE7.DAT</code>TH01 stage 24, loaded from <code>STAGE7.DAT</code>

Wow, what a mess. All that was just a bit too much to be covered in two pushes… Next up, assuming the current subscriptions: Taking a vacation with one smaller TH01 push, covering some smaller functions here and there to ensure some uninterrupted Konngara progress later on.

📝 Posted:
🚚 Summary of:
P0126, P0127
Commits:
6c22af7...8b01657, 8b01657...dc65b59
💰 Funded by:
Blue Bolt, [Anonymous]
🏷 Tags:

Alright, back to continuing the master.hpp transition started in P0124, and repaying technical debt. The last blog post already announced some ridiculous decompilations… and in fact, not a single one of the functions in these two pushes was decompilable into idiomatic C/C++ code.

As usual, that didn't keep me from trying though. The TH04 and TH05 version of the infamous 16-pixel-aligned, EGC-accelerated rectangle blitting function from page 1 to page 0 was fairly average as far as unreasonable decompilations are concerned.
The big blocker in TH03's MAIN.EXE, however, turned out to be the .MRS functions, used to render the gauge attack portraits and bomb backgrounds. The blitting code there uses the additional FS and GS segment registers provided by the Intel 386… which

  1. are not supported by Turbo C++'s inline assembler, and
  2. can't be turned into pointers, due to a compiler bug in Turbo C++ that generates wrong segment prefix opcodes for the _FS and _GS pseudo-registers.

Apparently I'm the first one to even try doing that with this compiler? I haven't found any other mention of this bug…
Compiling via assembly (#pragma inline) would work around this bug and generate the correct instructions. But that would incur yet another dependency on a 16-bit TASM, for something honestly quite insignificant.

What we can always do, however, is using __emit__() to simply output x86 opcodes anywhere in a function. Unlike spelled-out inline assembly, that can even be used in helper functions that are supposed to inline… which does in fact allow us to fully abstract away this compiler bug. Regular if() comparisons with pseudo-registers wouldn't inline, but "converting" them into C++ template function specializations does. All that's left is some C preprocessor abuse to turn the pseudo-registers into types, and then we do retain a normal-looking poke() call in the blitting functions in the end. 🤯

Yeah… the result is batshit insane. I may have gone too far in a few places…


One might certainly argue that all these ridiculous decompilations actually hurt the preservation angle of this project. "Clearly, ZUN couldn't have possibly written such unreasonable C++ code. So why pretend he did, and not just keep it all in its more natural ASM form?" Well, there are several reasons:

Unfortunately, these pushes also demonstrated a second disadvantage in trying to decompile everything possible: Since Turbo C++ lacks TASM's fine-grained ability to enforce code alignment on certain multiples of bytes, it might actually be unfeasible to link in a C-compiled object file at its intended original position in some of the .EXE files it's used in. Which… you're only going to notice once you encounter such a case. Due to the slightly jumbled order of functions in the 📝 second, shared code segment, that might be long after you decompiled and successfully linked in the function everywhere else.

And then you'll have to throw away that decompilation after all 😕 Oh well. In this specific case (the lookup table generator for horizontally flipping images), that decompilation was a mess anyway, and probably helped nobody. I could have added a dummy .OBJ that does nothing but enforce the needed 2-byte alignment before the function if I really insisted on keeping the C version, but it really wasn't worth it.


Now that I've also described yet another meta-issue, maybe there'll really be nothing to say about the next technical debt pushes? :onricdennat: Next up though: Back to actual progress again, with TH01. Which maybe even ends up pushing that game over the 50% RE mark?

📝 Posted:
🚚 Summary of:
P0124, P0125
Commits:
72dfa09...056b1c7, 056b1c7...f6a3246
💰 Funded by:
Blue Bolt, [Anonymous]
🏷 Tags:

Turns out that TH04's player selection menu is exactly three times as complicated as TH05's. Two screens for character and shot type rather than one, and a way more intricate implementation for saving and restoring the background behind the raised top and left edges of a character picture when moving the cursor between Reimu and Marisa. TH04 decides to backup precisely only the two 256×8 (top) and 8×244 (left) strips behind the edges, indicated in red in the picture below.

Backed-up VRAM area in TH04's player character selection

These take up just 4 KB of heap memory… but require custom blitting functions, and expanding this explicitly hardcoded approach to TH05's 4 characters would have been pretty annoying. So, rather than, uh, not explicitly hardcoding it all, ZUN decided to just be lazy with the backup area in TH05, saving the entire 640×400 screen, and thus spending 128 KB of heap memory on this rather simple selection shadow effect. :zunpet:


So, this really wasn't something to quickly get done during the first half of a push, even after already having done TH05's equivalent of this menu. But since life is very busy right now, I also used the occasion to start addressing another code organization annoyance: master.lib's single master.h header file.

So, time to start a new master.hpp header that would contain just the declarations from master.h that PC-98 Touhou actually needs, plus some semantic (yes, semantic) sugar. Comparing just the old master.h to just the new master.hpp after roughly 60% of the transition has been completed, we get median build times of 319 ms for master.h, and 144 ms for master.hpp on my (admittedly rather slow) DOSBox setup. Nice!
As of this push, ReC98 consists of 107 translation units that have to be compiled with Turbo C++ 4.0J. Fully rebuilding all of these currently takes roughly 37.5 seconds in DOSBox. After the transition to master.hpp is done, we could therefore shave some 10 to 15 seconds off this time, simply by switching header files. And that's just the beginning, as this will also pave the way for further #include optimizations. Life in this codebase will be great!


Unfortunately, there wasn't enough time to repay some of the actual technical debt I was looking forward to, after all of this. Oh well, at least we now also have nice identifiers for the three different boldface options that are used when rendering text to VRAM, after procrastinating that issue for almost 11 months. Next up, assuming the existing subscriptions: More ridiculous decompilations of things that definitely weren't originally written in C, and a big blocker in TH03's MAIN.EXE.

📝 Posted:
🚚 Summary of:
P0123
Commits:
4406c3d...72dfa09
💰 Funded by:
Yanga
🏷 Tags:

Done with the .BOS format, at last! While there's still quite a bunch of undecompiled non-format blitting code left, this was in fact the final piece of graphics format loading code in TH01.

📝 Continuing the trend from three pushes ago, we've got yet another class, this time for the 48×48 and 48×32 sprites used in Reimu's gohei, slide, and kick animations. The only reason these had to use the .BOS format at all is simply because Reimu's regular sprites are 32×32, and are therefore loaded from 📝 .PTN files.
Yes, this makes no sense, because why would you split animations for the same character across two file formats and two APIs, just because of a sprite size difference? This necessity for switching blitting APIs might also explain why Reimu vanishes for a few frames at the beginning and the end of the gohei swing animation, but more on that once we get to the high-level rendering code.

Now that we've decompiled all the .BOS implementations in TH01, here's an overview of all of them, together with .PTN to show that there really was no reason for not using the .BOS API for all of Reimu's sprites:

CBossEntity CBossAnim CPlayerAnim ptn_* (32×32)
Format .BOS .BOS .BOS .PTN
Hitbox
Byte-aligned blitting
Byte-aligned unblitting
Unaligned blitting Single-line and wave only
Precise unblitting
Per-file sprite limit 8 8 32 64
Pixels blitted at once 16 16 8 32

And even that last property could simply be handled by branching based on the sprite width, and wouldn't be a reason for switching formats. But well, it just wouldn't be TH01 without all that redundant bloat though, would it?

The basic loading, freeing, and blitting code was yet another variation on the other .BOS code we've seen before. So this should have caused just as little trouble as the CBossAnim code… except that CPlayerAnim did add one slightly difficult function to the mix, which led to it requiring almost a full push after all. Similar to 📝 the unblitting code for moving lasers we've seen in the last push, ZUN tries to minimize the amount of VRAM writes when unblitting Reimu's slide animations. Technically, it's only necessary to restore the pixels that Reimu traveled by, plus the ones that wouldn't be redrawn by the new animation frame at the new X position.
The theoretically arbitrary distance between the two sprites is, of course, modeled by a fixed-size buffer on the stack :onricdennat:, coming with the further assumption that the sprite surely hasn't moved by more than 1 horizontal VRAM byte compared to the last frame. Which, of course, results in glitches if that's not the case, leaving little Reimu parts in VRAM if the slide speed ever exceeded 8 pixels per frame. :tannedcirno: (Which it never does, being hardcoded to 6 pixels, but still.). As it also turns out, all those bit masking operations easily lead to incredibly sloppy C code. Which compiles into incredibly terrible ASM, which in turn might end up wasting way more CPU time than the final VRAM write optimization would have gained? Then again, in-depth profiling is way beyond the scope of this project at this point.

Next up: The TH04 main menu, and some more technical debt.

📝 Posted:
🚚 Summary of:
P0122
Commits:
164591f...4406c3d
💰 Funded by:
Yanga
🏷 Tags:

This time around, laser is 📝 actually not difficult, with TH01's shootout laser class being simple enough to nicely fit into a single push. All other stationary lasers (as used by YuugenMagan, for example) don't even use a class, and are simply treated as regular lines with collision detection.

But of course, the shootout lasers also come with the typical share of TH01 jank we've all come to expect by now. This time, it already starts with the hardcoded sprite data:

TH01 shootout laser 'sprites'

A shootout laser can have a width from 1 to 8 pixels, so ZUN stored a separate 16×1 sprite with a line for each possible width (left-to-right). Then, he shifted all of these sprites 1 pixel to the right for all of the 8 possible start positions within a planar VRAM byte (top-to-bottom). Because… doing that bit shift programmatically is way too expensive, so let's pre-shift at compile time, and use 16× the memory per sprite? :tannedcirno:

Since a bunch of other sprite sheets need to be pre-shifted as well (this is the 5th one we've found so far), our sprite converter has a feature to automatically generate those pre-shifted variations. This way, we can abstract away that implementation detail and leave modders with .BMP files that still only contain a single version of each sprite. But, uh…, wait, in this sprite sheet, the second row for 1-pixel lasers is accidentally shifted right by one more pixel that it should have been?! Which means that

  1. we can't use the auto-preshift feature here, and have to store this weird-looking (and quite frankly, completely unnecessary) sprite sheet in its entirety
  2. ZUN did, at least during TH01's development, not have a sprite converter, and directly hardcoded these dot patterns in the C++ code :zunpet:

The waste continues with the class itself. 69 bytes, with 22 bytes outright unused, and 11 not really necessary. As for actual innovations though, we've got 📝 another 32-bit fixed-point type, this time actually using 8 bits for the fractional part. Therefore, the ray position is tracked to the 1/256th of a pixel, using the full precision of master.lib's 8-bit sin() and cos() lookup tables.
Unblitting is also remarkably efficient: It's only done once the laser stopped extending and started moving, and only for the exact pixels at the start of the ray that the laser traveled by in a single frame. If only the ray part was also rendered as efficiently – it's fully blitted every frame, right next to the collision detection for each row of the ray.


With a public interface of two functions (spawn, and update / collide / unblit / render), that's superficially all there is to lasers in this game. There's another (apparently inlined) function though, to both reset and, uh, "fully unblit" all lasers at the end of every boss fight… except that it fails hilariously at doing the latter, and ends up effectively unblitting random 32-pixel line segments, due to ZUN confusing both the coordinates and the parameter types for the line unblitting function. :zunpet:
A while ago, I was asked about this crash that tends to happen when defeating Elis. And while you can clearly see the random unblitted line segments that are missing from the sprites, I don't quite think we've found the cause for the crash, since the 📝 line unblitting function used there does clip its coordinates to the VRAM range.

Next up: The final piece of image format code in TH01, covering Reimu's sprites!

📝 Posted:
🚚 Summary of:
P0120, P0121
Commits:
453dd3c...3c008b6, 3c008b6...5c42fcd
💰 Funded by:
Yanga
🏷 Tags:

Back to TH01, and its boss sprite format… with a separate class for storing animations that only differs minutely from the 📝 regular boss entity class I covered last time? Decompiling this class was almost free, and the main reason why the first of these pushes ended up looking pretty huge.

Next up were the remaining shape drawing functions from the code segment that started with the .GRC functions. P0105 already started these with the (surprisingly sanely implemented) 8×8 diamond, star, and… uh, snowflake (?) sprites , prominently seen in the Konngara, Elis, and Sariel fights, respectively. Now, we've also got:

The weirdness becomes obvious with just a single screenshot:

TH01 invincibility sprite weirdness

First, we've got the obvious issue of the sprites not being clipped at the right edge of VRAM, with the rightmost pixels in each row of the sprite extending to the beginning of the next row. Well, that's just what you get if you insist on writing unique low-level blitting code for the majority of the individual sprites in the game… 🤷
More importantly though, the sprite sheet looks like this: So how do we even get these fully filled red diamonds?

Well, turns out that the sprites are never consistently unblitted during their 8 frames of animation. There is a function that looks like it unblits the sprite… except that it starts with by enabling the GRCG and… reading from the first bitplane on the background page? If this was the EGC, such a read would fill some internal registers with the contents of all 4 bitplanes, which can then subsequently be blitted to all 4 bitplanes of any VRAM page with a single memory write. But with the GRCG in RMW mode, reads do nothing special, and simply copy the memory contents of one bitplane to the read destination. Maybe ZUN thought that setting the RMW color to red also sets some internal 4-plane mask register to match that color? :zunpet:
Instead, the rather random pixels read from the first bitplane are then used as a mask for a second blit of the same red sprite. Effectively, this only really "unblits" the invincibility pixels that are drawn on top of Reimu's sprite. Since Reimu is drawn first, the invincibility sprites are overwritten anyway. But due to the palette color layout of Reimu's sprite, its pixels end up fully masking away any invincibility sprite pixels in that second blit, leaving VRAM untouched as a result. Anywhere else though, this animation quickly turns into the union of all animation frames.

Then again, if that 16-dot-aligned rectangular unblitting function is all you know about the EGC, and you can't be bothered to write a perfect unblitter for 8×8 sprites, it becomes obvious why you wouldn't want to use it:

Because Reimu would barely be visible under all that flicker. In comparison, those fully filled diamonds actually look pretty good.


After all that, the remaining time wouldn't have been enough for the next few essential classes, so I closed out the push with three more VRAM effects instead:


And with that, ReC98, as a whole, is not only ⅓ done, but I've also fully caught up with the feature backlog for the first time in the history of this crowdfunding! Time to go into maintenance mode then, while we wait for the next pushes to be funded. Got a huge backlog of tiny maintenance issues to address at a leisurely pace, and of course there's also the 📝 16-bit build system waiting to be finished.

📝 Posted:
🚚 Summary of:
P0119
Commits:
cbf14eb...453dd3c
💰 Funded by:
[Anonymous], -Tom-
🏷 Tags:

So, TH05 OP.EXE. The first half of this push started out nicely, with an easy decompilation of the entire player character selection menu. Typical ZUN quality, with not much to say about it. While the overall function structure is identical to its TH04 counterpart, the two games only really share small snippets inside these functions, and do need to be RE'd separately.

The high score viewing (not registration) menu would have been next. Unfortunately, it calls one of the GENSOU.SCR loading functions… which are all a complete mess that still needed to be sorted out first. 5 distinct functions in 6 binaries, and of course TH05 also micro-optimized its MAIN.EXE version to directly use the DOS INT 21h file loading API instead of master.lib's wrappers. Could have all been avoided with a single method on the score data structure, taking a player character ID and a difficulty level as parameters…

So, no score menu in this push then. Looking at the other end of the ASM code though, we find the starting functions for the main game, the Extra Stage, and the demo replays, which did fit perfectly to round out this push.

Which is where we find an easter egg! 🥚 If you've ever looked into 怪綺談2.DAT, you might have noticed 6 .REC files with replays for the Demo Play mode. However, the game only ever seems to cycle between 4 replays. So what's in the other two, and why are they 40 KB instead of just 10 KB like the others? Turns out that they combine into a full Extra Stage Clear replay with Mima, with 3 bombs and 1 death, obviously recorded by ZUN himself. The split into two files for the stage (DEMO4.REC) and boss (DEMO5.REC) portion is merely an attempt to limit the amount of simultaneously allocated heap memory.
To watch this replay without modding the game, unlock the Extra Stage with all 4 characters, then hold both the ⬅️ left and ➡️ right arrow keys in the main menu while waiting for the usual demo replay. I can't possibly be the first one to discover this, but I couldn't find any other mention of it.
Edit (2021-03-15): ZUN did in fact document this replay in Section 6 of TH05's OMAKE.TXT, along with the exact method to view it. Thanks to Popfan for the discovery!

Here's a recording of the whole replay:

Note how the boss dialogue is skipped. MAIN.EXE actually contains no less than 6 if() branches just to distinguish this overly long replay from the regular ones.


I'd really like to do the TH04 and TH05 main menus in parallel, since we can expect a bit more shared code after all the initial differences. Therefore, I'm going to put the next "anything" push towards covering the TH04 version of those functions. Next up though, it's back to TH01, with more redundant image format code…

📝 Posted:
🚚 Summary of:
P0118
Commits:
0bb5bc3...cbf14eb
💰 Funded by:
-Tom-, Ember2528
🏷 Tags:

🎉 TH05 is finally fully position-independent! 🎉 To celebrate this milestone, -Tom- coded a little demo, which we recorded on both an emulator and on real PC-98 hardware:

For all the new people who are unfamiliar with PC-98 Touhou internals: Boss behavior is hardcoded into MAIN.EXE, rather than being scriptable via separate .ECL files like in Windows Touhou. That's what makes this kind of a big deal.


What does this mean?

You can now freely add or remove both data and code anywhere in TH05, by editing the ReC98 codebase, writing your mod in ASM or C/C++, and recompiling the code. Since all absolute memory addresses have now been converted to labels, this will work without causing any instability. See the position independence section in the FAQ for a more thorough explanation about why this was a problem.

By extension, this also means that it's now theoretically possible to use a different compiler on the source code. But:

What does this not mean?

The original ZUN code hasn't been completely reverse-engineered yet, let alone decompiled. As the final PC-98 Touhou game, TH05 also happens to have the largest amount of actual ZUN-written ASM that can't ever be decompiled within ReC98's constraints of a legit source code reconstruction. But a lot of the originally-in-C code is also still in ASM, which might make modding a bit inconvenient right now. And while I have decompiled a bunch of functions, I selected them largely because they would help with PI (as requested by the backers), and not because they are particularly relevant to typical modding interests.

As a result, the code might also be a bit confusingly organized. There's quite a conflict between various goals there: On the one hand, I'd like to only have a single instance of every function shared with earlier games, as well as reduce ZUN's code duplication within a single game. On the other hand, this leads to quite a lot of code being scattered all over the place and then #include-pasted back together, except for the places where 📝 this doesn't work, and you'd have to use multiple translation units anyway… I'm only beginning to figure out the best structure here, and some more reverse-engineering attention surely won't hurt.

Also, keep in mind that the code still targets x86 Real Mode. To work effectively in this codebase, you'd need some familiarity with memory segmentation, and how to express it all in code. This tends to make even regular C++ development about an order of magnitude harder, especially once you want to interface with the remaining ASM code. That part made -Tom- struggle quite a bit with implementing his custom scripting language for the demo above. For now, he built that demo on quite a limited foundation – which is why he also chose to release neither the build nor the source publically for the time being.
So yeah, you're definitely going to need the TASM and Borland C++ manuals there.

tl;dr: We now know everything about this game's data, but not quite as much about this game's code.

So, how long until source ports become a realistic project?

You probably want to wait for 100% RE, which is when everything that can be decompiled has been decompiled.

Unless your target system is 16-bit Windows, in which case you could theoretically start right away. 📝 Again, this would be the ideal first system to port PC-98 Touhou to: It would require all the generic portability work to remove the dependency on PC-98 hardware, thus paving the way for a subsequent port to modern systems, yet you could still just drop in any undecompiled ASM.

Porting to IBM-compatible DOS would only be a harder and less universally useful version of that. You'd then simply exchange one architecture, with its idiosyncrasies and limits, for another, with its own set of idiosyncrasies and limits. (Unless, of course, you already happen to be intimately familiar with that architecture.) The fact that master.lib provides DOS/V support would have only mattered if ZUN consistently used it to abstract away PC-98 hardware at every single place in the code, which is definitely not the case.


The list of actually interesting findings in this push is, 📝 again, very short. Probably the most notable discovery: The low-level part of the code that renders Marisa's laser from her TH04 Illusion Laser shot type is still present in TH05. Insert wild mass guessing about potential beta version shot types… Oh, and did you know that the order of background images in the Extra Stage staff roll differs by character?

Next up: Finally driving up the RE% bar again, by decompiling some TH05 main menu code.

📝 Posted:
🚚 Summary of:
P0117
Commits:
03048c3...0bb5bc3
💰 Funded by:
[Anonymous]
🏷 Tags:

Wouldn't it be a bit disappointing to have TH05 completely position-independent, but have it still require hex-editing of the original ZUN.COM to mod its gaiji characters? As in, these custom "text" glyphs, available to the PC-98 text RAM:

TH05 gaiji characters

Especially since we now even have a sprite converter… the lack of which was exactly 📝 what made rebuilding ZUN.COM not that worthwhile before. So, before the big release, let's get all the remaining ZUN.COM sub-binaries of TH04 and TH05 dumped into .ASM files, and re-assembled and linked during the build process.

This is also the moment in which Egor's 2018 reimplementation of O. Morikawa's comcstm finally gets to shine. Back then, I considered it too early to even bother with ZUN.COM and reimplementing the .COM wrapper that ZUN originally used to bundle multiple smaller executables into that single binary. But now that the time is right, it is nice to have that code, as it allowed me to get these rebuilds done in half a push. Otherwise, it would have surely required one or two dedicated ones.

Since we like correctness here, newly dumped ZUN code means that it also has to be included in the RE% baseline calculation. This is why TH04's and TH05's overall RE% bars have gone back a tiny bit… in case you remember how they previously looked like :tannedcirno: After all, I would like to figure out where all that memory allocated during TH04's and TH05's memory check is freed, if at all.


Alright, one half of a push left… Y'know, getting rid of those last few PI false positives is actually one of the most annoying chores in this project, and quite stressful as well: I have to convince myself that the remaining false positives are, in fact, not memory references, but with way too little time for in-depth RE and to denote what they are instead. In that situation, everyone (including myself!) is anticipating that PI goal, and no one is really interested in RE. (Well… that is, until they actually get to developing their mod. But more on that tomorrow. :onricdennat:) Which means that it boils down to quite some hasty, dumb, and superficial RE around those remaining numbers.

So, in the hope of making it less annoying for the other 4 games in the future, let's systematically cover the sources of those remaining false positives in TH05, over all games. I/O port accesses with either the port or the value in registers (and thus, no longer as an immediate argument to the IN or OUT instructions, which the PI counter can clearly ingore), palette color arithmetic, or heck, 0xFF constants that obviously just mean "-1" and are not a reference to offset 0xFF in the data segment. All of this, of course, once again had a way bigger effect on everything but an almost position-independent TH05… but hey, that's the sort of thing you reserve the "anything" pushes for. And that's also how we get some of the single biggest PI% gains we have seen so far, and will be seeing before the 100% PI mark. And yes, those will continue in the next push.

Alright! Big release tomorrow…

📝 Posted:
🚚 Summary of:
P0115, P0116
Commits:
967bb8b...e5328a3, e5328a3...03048c3
💰 Funded by:
Lmocinemod, Blue Bolt, [Anonymous]
🏷 Tags:

Finally, after a long while, we've got two pushes with barely anything to talk about! Continuing the road towards 100% PI for TH05, these were exactly the two pushes that TH05 MAINE.EXE PI was estimated to additionally cost, relative to TH04's. Consequently, they mostly went to TH05's unique data structures in the ending cutscenes, the score name registration menu, and the staff roll.

A unique feature in there is TH05's support for automatic text color changes in its ending scripts, based on the first full-width Shift-JIS codepoint in a line. The \c=codepoint,color commands at the top of the _ED??.TXT set up exactly this codepoint→color mapping. As far as I can tell, TH05 is the only Touhou game with a feature like this – even the Windows Touhou games went back to manually spelling out each color change.

The orb particles in TH05's staff roll also try to be a bit unique by using 32-bit X and Y subpixel variables for their current position. With still just 4 fractional bits, I can't really tell yet whether the extended range was actually necessary. Maybe due to how the "camera scrolling" through "space" was implemented? All other entities were pretty much the usual fare, though.
12.4, 4.4, and now a 28.4 fixed-point format… yup, 📝 C++ templates were definitely the right choice.

At the end of its staff roll, TH05 not only displays the usual performance verdict, but then scrolls in the scores at the end of each stage before switching to the high score menu. The simplest way to smoothly scroll between two full screens on a PC-98 involves a separate bitmap… which is exactly what TH05 does here, reserving 28,160 bytes of its global data segment for just one overly large monochrome 320×704 bitmap where both the screens are rendered to. That's… one benefit of splitting your game into multiple executables, I guess? :tannedcirno:
Not sure if it's common knowledge that you can actually scroll back and forth between the two screens with the Up and Down keys before moving to the score menu. I surely didn't know that before. But it makes sense – might as well get the most out of that memory.


The necessary groundwork for all of this may have actually made TH04's (yes, TH04's) MAINE.EXE technically position-independent. Didn't quite reach the same goal for TH05's – but what we did reach is ⅔ of all PC-98 Touhou code now being position-independent! Next up: Celebrating even more milestones, as -Tom- is about to finish development on his TH05 MAIN.EXE PI demo…

📝 Posted:
🚚 Summary of:
P0113, P0114
Commits:
150d2c6...6204fdd, 6204fdd...967bb8b
💰 Funded by:
Lmocinemod
🏷 Tags:

Alright, tooling and technical debt. Shouldn't be really much to talk about… oh, wait, this is still ReC98 :tannedcirno:

For the tooling part, I finished up the remaining ergonomics and error handling for the 📝 sprite converter that Jonathan Campbell contributed two months ago. While I familiarized myself with the tool, I've actually ran into some unreported errors myself, so this was sort of important to me. Still got no command-line help in there, but the error messages can now do that job probably even better, since we would have had to write them anyway.

So, what's up with the technical debt then? Well, by now we've accumulated quite a number of 📝 ASM code slices that need to be either decompiled or clearly marked as undecompilable. Since we define those slices as "already reverse-engineered", that decision won't affect the numbers on the front page at all. But for a complete decompilation, we'd still have to do this someday. So, rather than incorporating this work into pushes that were purchased with the expectation of measurable progress in a certain area, let's take the "anything goes" pushes, and focus entirely on that during them.

The second code segment seemed like the best place to start with this, since it affects the largest number of games simultaneously. Starting with TH02, this segment contains a set of random "core" functions needed by the binary. Image formats, sounds, input, math, it's all there in some capacity. You could maybe call it all "libzun" or something like that? But for the time being, I simply went with the obvious name, seg2. Maybe I'll come up with something more convincing in the future.


Oh, but wait, why were we assembling all the previous undecompilable ASM translation units in the 16-bit build part? By moving those to the 32-bit part, we don't even need a 16-bit TASM in our list of dependencies, as long as our build process is not fully 16-bit.
And with that, ReC98 now also builds on Windows 95, and thus, every 32-bit Windows version. 🎉 Which is certainly the most user-visible improvement in all of these two pushes. :onricdennat:


Back in 2015, I already decompiled all of TH02's seg2 functions. As suggested by the Borland compiler, I tried to follow a "one translation unit per segment" layout, bundling the binary-specific contents via #include. In the end, it required two translation units – and that was even after manually inserting the original padding bytes via #pragma codestring… yuck. But it worked, compiled, and kept the linker's job (and, by extension, segmentation worries) to a minimum. And as long as it all matched the original binaries, it still counted as a valid reconstruction of ZUN's code. :zunpet:

However, that idea ultimately falls apart once TH03 starts mixing undecompilable ASM code inbetween C functions. Now, we officially have no choice but to use multiple C and ASM translation units, with maybe only just one or two #includes in them…

…or we finally start reconstructing the actual seg2 library, turning every sequence of related functions into its own translation unit. This way, we can simply reuse the once-compiled .OBJ files for all the binaries those functions appear in, without requiring that additional layer of translation units mirroring the original segmentation.
The best example for this is TH03's almost undecompilable function that generates a lookup table for horizontally flipping 8 1bpp pixels. It's part of every binary since TH03, but only used in that game. With the previous approach, we would have had to add 9 C translation units, which would all have just #included that one file. Now, we simply put the .OBJ file into the correct place on the linker command line, as soon as we can.

💡 And suddenly, the linker just inserts the correct padding bytes itself.

The most immediate gains there also happened to come from TH03. Which is also where we did get some tiny RE% and PI% gains out of this after all, by reverse-engineering some of its sprite blitting setup code. Sure, I should have done even more RE here, to also cover those 5 functions at the end of code segment #2 in TH03's MAIN.EXE that were in front of a number of library functions I already covered in this push. But let's leave that to an actual RE push 😛


All in all though, I was just getting started with this; the real gains in terms of removed ASM files are still to come. But in the meantime, the funding situation has become even better in terms of allowing me to focus on things nobody asked for. 🙂 So here's a slightly better idea: Instead of spending two more pushes on this, let's shoot for TH05 MAINE.EXE position independence next. If I manage to get it done, we'll have a 100% position-independent TH05 by the time -Tom- finishes his MAIN.EXE PI demo, rather than the 94% we'd get from just MAIN.EXE. That's bound to make a much better impression on all the people who will then (re-)discover the project.

📝 Posted:
🚚 Summary of:
P0001
Commits:
e447a2d...150d2c6
💰 Funded by:
GhostPhanom
🏷 Tags:

(tl;dr: ReC98 has switched to Tup for the 32-bit build. You probably want to get 💾 this build of Tup, and put it somewhere in your PATH. It's optional, and always will be, but highly recommended.)


P0001! Reserved for the delivery of the very first financial contribution I've ever received for ReC98, back in January 2018. GhostPhanom requested the exact opposite of immediate results, which motivated me to go on quite a passionate quest for the perfect ReC98 build system. A quest that went way beyond the crowdfunding…

Makefiles are a decent idea in theory: Specify the targets to generate, the source files these targets depend on and are generated from, and the rules to do the generating, with some helpful shorthand syntax. Then, you have a build dependency graph, and your make tool of choice can provide minimal rebuilds of only the targets whose sources changed since the last make call. But, uh… wait, this is C/C++ we're talking about, and doesn't pretty much every source file come with a second set of dependent source files, namely, every single #include in the source file itself? Do we really have to duplicate all these inside the Makefile, and keep it in sync with the source file? 🙄

This fact alone means that Makefiles are inherently unsuited for any language with an #include feature… that is, pretty much every language out there. Not to mention other aspects like changes to the compilation command lines, or the build rules themselves, all of which require metadata of the previous build to be persistently stored in some way. I have no idea why such a trash technology is even touted as a viable build tool for code.

But wait! Most make implementations, including Borland's, do support the notion of auto-dependency information, emitted by the compiler in a specific format, to provide make with the additional list of #includes. Sure, this should be a basic feature of any self-respecting build tool, and not something you have to add as an extension, but let's just set our idealism aside for a moment. Well, too bad that Borland's implementation only works if you spell out both the source➜object and the object➜binary rules, which loses the performance gained from compiling multiple translation units in a single BCC or TCC process. And even then, it tends to break in that DOS VM you're probably using. Not to mention, again, all the other aspects that still remain unsolved.

So, I decided to just write my own build system, tailor-made for the needs of ReC98's 16-bit build process, and combining a number of experimental ideas. Which is still not quite bug-free and ready for public use, given that the entire past year has kept me busy with actual tangible RE and PI progress. What did finally become ready, however, is the improvement for the 32-bit build part, and that's what we've got here.


💭 Now, if only there was a build system that would perfectly track dependencies of any compiler it calls, by injecting code and hooking file opening syscalls. It'd be completely unrealistic for it to also run on DOS (and we probably don't want to traverse a graph database in a cycle-limited DOSBox), but it would be perfect for our 32-bit build part, as long as that one still exists.

Turns out Tup is exactly that system. In practice, its low-level nature as a make replacement does limit its general usefulness, which is why you probably haven't seen it used in a lot of projects. But for something like ReC98 with its reliance on outdated compilers that aren't supported by any decent high-level tool, it's exactly the right tool for the job. Also, it's completely beyond me how Ninja, the most popular make replacement these days, was inspired by Tup, yet went a step back to parsing the specific dependency information produced by gcc, Clang, and Visual Studio, and only those

Sure, it might seem really minor to worry about not unconditionally rebuilding all 32-bit .asm files, which just takes a couple of seconds anyway. But minimal rebuilds in the 32-bit part also provide the foundation for minimal rebuilds in the 16-bit part – and those TLINK invocations do take quite some time after all.

Using Tup for ReC98 was an idea that dated back to January 2017. Back then, I already opened the pull request with a fix to allow Tup to work together with 32-bit TASM. As much as I love Tup though, the fact that it only worked on 64-bit Windows ≥Vista would have meant that we had to exchange perfect dependency tracking for the ability to build on 32-bit and older Windows versions at all. For a project that relies on DOS compilers, this would have been exactly the wrong trade-off to make.

What's worse though: TLINK fails to run on modern 32-bit Windows with Loader error (0000) : Unrecognized Error. Therefore, the set of systems that Tup runs on, and the set of systems that can actually compile ReC98's 16-bit build part natively, would have been exactly disjoint, with no OS getting to use both at the same time.
So I've kept using Tup for only my own development, but indefinitely shelved the idea of making it the official build system, due to those drawbacks. Recently though, it all came together:

As I'm writing this post, the pull request has unfortunately not yet been merged. So, here's my own custom build instead:

💾 Download Tup for 32-bit Windows (optimized build at this commit)

I've also added it to the DevKit, for any newcomers to ReC98.


After the switch to Tup and the fallback option, I extensively tested building ReC98 on all operating systems I had lying around. And holy cow, so much in that build was broken beyond belief. In the end, the solution involved just fully rebuilding the entire 16-bit part by default. :tannedcirno: Which, of course, nullifies any of the advantages we might have gotten from a Makefile in the first place, due to just how unreliable they are. If you had problems building ReC98 in the past, try again now!

And sure, it would certainly be possible to also get Tup working on Windows ≤XP, or 9x even. But I leave that to all those tinkerers out there who are actually motivated to keep those OSes alive. My work here is done – we now have a build process that is optimal on 32-bit Windows ≧Vista, and still functional and reliable on 64-bit Windows, Linux, and everything down to Windows 98 SE, and therefore also real PC-98 hardware. Pretty good, I'd say.

(If it weren't for that weird crash of the 16-bit TASM.EXE in that Windows 95 command prompt I've tried it in, it would also work on that OS. Probably just a misconfiguration on my part?)

Now, it might look like a waste of time to improve a 32-bit build part that won't even exist anymore once this project is done. However, a fully 16-bit DOS build will only make sense after

And who knows whether this project will get funded for that long. So yeah, the 32-bit build part will stay with us for quite some more time, and for all upcoming PI milestones. And with the current build process, it's pretty much the most minor among all the minor issues I can think of. Let's all enjoy the performance of a 32-bit build while we can 🙂

Next up: Paying some technical debt while keeping the RE% and PI% in place.

📝 Posted:
🚚 Summary of:
P0111, P0112
Commits:
8b5c146...4ef4c9e, 4ef4c9e...e447a2d
💰 Funded by:
[Anonymous], Blue Bolt
🏷 Tags:

Only one newly ordered push since I've reopened the store? Great, that's all the justification I needed for the extended maintenance delay that was part of these two pushes 😛

Having to write comments to explain whether coordinates are relative to the top-left corner of the screen or the top-left corner of the playfield has finally become old. So, I introduced distinct types for all the coordinate systems we typically encounter, applying them to all code decompiled so far. Note how the planar nature of PC-98 VRAM meant that X and Y coordinates also had to be different from each other. On the X side, there's mainly the distinction between the [0; 640] screen space and the corresponding [0; 80] VRAM byte space. On the Y side, we also have the [0; 400] screen space, but the visible area of VRAM might be limited to [0; 200] when running in the PC-98's line-doubled 640×200 mode. A VRAM Y coordinate also always implies an added offset for vertical scrolling.
During all of the code reconstruction, these types can only have a documenting purpose. Turning them into anything more than just typedefs to int, in order to define conversion operators between them, simply won't recompile into identical binaries. Modding and porting projects, however, now have a nice foundation for doing just that, and can entirely lift coordinate system transformations into the type system, without having to proofread all the meaningless int declarations themselves.


So, what was left in terms of memory references? EX-Alice's fire waves were our final unknown entity that can collide with the player. Decently implemented, with little to say about them.

That left the bomb animation structures as the one big remaining PI blocker. They started out nice and simple in TH04, with a small 6-byte star animation structure used for both Reimu and Marisa. TH05, however, gave each character her own animation… and what the hell is going on with Reimu's blue stars there? Nope, not going to figure this out on ASM level.

A decompilation first required some more bomb-related variables to be named though. Since this was part of a generic RE push, it made sense to do this in all 5 games… which then led to nice PI gains in anything but TH05. :tannedcirno: Most notably, we now got the "pulling all items to player" flag in TH04 and TH05, which is actually separate from bombing. The obvious cheat mod is left as an exercise to the reader.


So, TH05 bomb animations. Just like the 📝 custom entity types of this game, all 4 characters share the same memory, with the superficially same 10-byte structure.
But let's just look at the very first field. Seen from a low level, it's a simple struct { int x, y; } pos, storing the current position of the character-specific bomb animation entity. But all 4 characters use this field differently:

Therefore, I decompiled it as 4 separate structures once again, bundled into an union of arrays.

As for Reimu… yup, that's some pointer arithmetic straight out of Jigoku* for setting and updating the positions of the falling star trails. :zunpet: While that certainly required several comments to wrap my head around the current array positions, the one "bug" in all this arithmetic luckily has no effect on the game.
There is a small glitch with the growing circles, though. They are spawned at the end of the loop, with their position taken from the star pointer… but after that pointer has already been incremented. On the last loop iteration, this leads to an out-of-bounds structure access, with the position taken from some unknown EX-Alice data, which is 0 during most of the game. If you look at the animation, you can easily spot these bugged circles, consistently growing from the top-left corner (0, 0) of the playfield:


After all that, there was barely enough remaining time to filter out and label the final few memory references. But now, TH05's MAIN.EXE is technically position-independent! 🎉 -Tom- is going to work on a pretty extensive demo of this unprecedented level of efficient Touhou game modding. For a more impactful effect of both the 100% PI mark and that demo, I'll be delaying the push covering the remaining false positives in that binary until that demo is done. I've accumulated a pretty huge backlog of minor maintenance issues by now…
Next up though: The first part of the long-awaited build system improvements. I've finally come up with a way of sanely accelerating the 32-bit build part on most setups you could possibly want to build ReC98 on, without making the building experience worse for the other few setups.

📝 Posted:
🚚 Summary of:
P0110
Commits:
2c7d86b...8b5c146
💰 Funded by:
[Anonymous], Blue Bolt
🏷 Tags:

… and just as I explained 📝 in the last post how decompilation is typically more sensible and efficient than ASM-level reverse-engineering, we have this push demonstrating a counter-example. The reason why the background particles and lines in the Shinki and EX-Alice battles contributed so much to position dependence was simply because they're accessed in a relatively large amount of functions, one for each different animation. Too many to spend the remaining precious crowdfunded time on reverse-engineering or even decompiling them all, especially now that everyone anticipates 100% PI for TH05's MAIN.EXE.

Therefore, I only decompiled the two functions of the line structure that also demonstrate best how it works, which in turn also helped with RE. Sadly, this revealed that we actually can't 📝 overload operator =() to get that nice assignment syntax for 12.4 fixed-point values, because one of those new functions relies on Turbo C++'s built-in optimizations for trivially copyable structures. Still, impressive that this abstraction caused no other issues for almost one year.

As for the structures themselves… nope, nothing to criticize this time! Sure, one good particle system would have been awesome, instead of having separate structures for the Stage 2 "starfield" particles and the one used in Shinki's battle, with hardcoded animations for both. But given the game's short development time, that was quite an acceptable compromise, I'd say.
And as for the lines, there just has to be a reason why the game reserves 20 lines per set, but only renders lines #0, #6, #12, and #18. We'll probably see once we get to look at those animation functions more closely.

This was quite a 📝 TH03-style RE push, which yielded way more PI% than RE%. But now that that's done, I can finally not get distracted by all that stuff when looking at the list of remaining memory references. Next up: The last few missing structures in TH05's MAIN.EXE!

📝 Posted:
🚚 Summary of:
P0109
Commits:
dcf4e2c...2c7d86b
💰 Funded by:
[Anonymous], Blue Bolt
🏷 Tags:

Back to TH05! Thanks to the good funding situation, I can strike a nice balance between getting TH05 position-independent as quickly as possible, and properly reverse-engineering some missing important parts of the game. Once 100% PI will get the attention of modders, the code will then be in better shape, and a bit more usable than if I just rushed that goal.

By now, I'm apparently also pretty spoiled by TH01's immediate decompilability, after having worked on that game for so long. Reverse-engineering in ASM land is pretty annoying, after all, since it basically boils down to meticulously editing a piece of ASM into something I can confidently call "reverse-engineered". Most of the time, simply decompiling that piece of code would take just a little bit longer, but be massively more useful. So, I immediately tried decompiling with TH05… and it just worked, at every place I tried!? Whatever the issue was that made 📝 segment splitting so annoying at my first attempt, I seem to have completely solved it in the meantime. 🤷 So yeah, backers can now request pretty much any part of TH04 and TH05 to be decompiled immediately, with no additional segment splitting cost.

(Protip for everyone interested in starting their own ReC project: Just declare one segment per function, right from the start, then group them together to restore the original code segmentation…)


Except that TH05 then just throws more of its infamous micro-optimized and undecompilable ASM at you. 🙄 This push covered the function that adjusts the bullet group template based on rank and the selected difficulty, called every time such a group is configured. Which, just like pretty much all of TH05's bullet spawning code, is one of those undecompilable functions. If C allowed labels of other functions as goto targets, it might have been decompilable into something useful to modders… maybe. But like this, there's no point in even trying.

This is such a terrible idea from a software architecture point of view, I can't even. Because now, you suddenly have to mirror your C++ declarations in ASM land, and keep them in sync with each other. I'm always happy when I get to delete an ASM declaration from the codebase once I've decompiled all the instances where it was referenced. But for TH05, we now have to keep those declarations around forever. 😕 And all that for a performance increase you probably couldn't even measure. Oh well, pulling off Galaxy Brain-level ASM optimizations is kind of fun if you don't have portability plans… I guess?

If I started a full fangame mod of a PC-98 Touhou game, I'd base it on TH04 rather than TH05, and backport selected features from TH05 as needed. Just because it was released later doesn't make it better, and this is by far not the only one of ZUN's micro-optimizations that just went way too far.

Dropping down to ASM also makes it easier to introduce weird quirks. Decompiled, one of TH05's tuning conditions for stack groups on Easy Mode would look something like:

case BP_STACK:
	// […]
	if(spread_angle_delta >= 2) {
		stack_bullet_count--;
	}

The fields of the bullet group template aren't typically reset when setting up a new group. So, spread_angle_delta in the context of a stack group effectively refers to "the delta angle of the last spread group that was fired before this stack – whenever that was". uth05win also spotted this quirk, considered it a bug, and wrote fanfiction by changing spread_angle_delta to stack_bullet_count.
As usual for functions that occur in more than one game, I also decompiled the TH04 bullet group tuning function, and it's perfectly sane, with no such quirks.


In the more PI-focused parts of this push, we got the TH05-exclusive smooth boss movement functions, for flying randomly or towards a given point. Pretty unspectacular for the most part, but we've got yet another uth05win inconsistency in the latter one. Once the Y coordinate gets close enough to the target point, it actually speeds up twice as much as the X coordinate would, whereas uth05win used the same speedup factors for both. This might make uth05win a couple of frames slower in all boss fights from Stage 3 on. Hard to measure though – and boss movement partly depends on RNG anyway.


Next up: Shinki's background animations – which are actually the single biggest source of position dependence left in TH05.

📝 Posted:
🚚 Summary of:
P0105, P0106, P0107, P0108
Commits:
3622eb6...11b776b, 11b776b...1f1829d, 1f1829d...1650241, 1650241...dcf4e2c
💰 Funded by:
Yanga
🏷 Tags:

And indeed, I got to end my vacation with a lot of image format and blitting code, covering the final two formats, .GRC and .BOS. .GRC was nothing noteworthy – one function for loading, one function for byte-aligned blitting, and one function for freeing memory. That's it – not even a unblitting function for this one. .BOS, on the other hand…

…has no generic (read: single/sane) implementation, and is only implemented as methods of some boss entity class. And then again for Sariel's dress and wand animations, and then again for Reimu's animations, both of which weren't even part of these 4 pushes. Looking forward to decompiling essentially the same algorithms all over again… And that's how TH01 became the largest and most bloated PC-98 Touhou game. So yeah, still not done with image formats, even at 44% RE.

This means I also had to reverse-engineer that "boss entity" class… yeah, what else to call something a boss can have multiple of, that may or may not be part of a larger boss sprite, may or may not be animated, and that may or may not have an orb hitbox?
All bosses except for Kikuri share the same 5 global instances of this class. Since renaming all these variables in ASM land is tedious anyway, I went the extra mile and directly defined separate, meaningful names for the entities of all bosses. These also now document the natural order in which the bosses will ultimately be decompiled. So, unless a backer requests anything else, this order will be:

  1. Konngara
  2. Sariel
  3. Elis
  4. Kikuri
  5. SinGyoku
  6. (code for regular card-flipping stages)
  7. Mima
  8. YuugenMagan

As everyone kind of expects from TH01 by now, this class reveals yet another… um, unique and quirky piece of code architecture. In addition to the position and hitbox members you'd expect from a class like this, the game also stores the .BOS metadata – width, height, animation frame count, and 📝 bitplane pointer slot number – inside the same class. But if each of those still corresponds to one individual on-screen sprite, how can YuugenMagan have 5 eye sprites, or Kikuri have more than one soul and tear sprite? By duplicating that metadata, of course! And copying it from one entity to another :onricdennat:
At this point, I feel like I even have to congratulate the game for not actually loading YuugenMagan's eye sprites 5 times. But then again, 53,760 bytes of waste would have definitely been noticeable in the DOS days. Makes much more sense to waste that amount of space on an unused C++ exception handler, and a bunch of redundant, unoptimized blitting functions :tannedcirno:

(Thinking about it, YuugenMagan fits this entire system perfectly. And together with its position in the game's code – last to be decompiled means first on the linker command line – we might speculate that YuugenMagan was the first boss to be programmed for TH01?)

So if a boss wants to use sprites with different sizes, there's no way around using another entity. And that's why Girl-Elis and Bat-Elis are two distinct entities internally, and have to manually sync their position. Except that there's also a third one for Attacking-Girl-Elis, because Girl-Elis has 9 frames of animation in total, and the global .BOS bitplane pointers are divided into 4 slots of only 8 images each. :zunpet:
Same for SinGyoku, who is split into a sphere entity, a person entity, and a… white flash entity for all three forms, all at the same resolution. Or Konngara's facial expressions, which also require two entities just for themselves.


And once you decompile all this code, you notice just how much of it the game didn't even use. 13 of the 50 bytes of the boss entity class are outright unused, and 10 bytes are used for a movement clamping and lock system that would have been nice if ZUN also used it outside of Kikuri's soul sprites. Instead, all other bosses ignore this system completely, and just party on the X/Y coordinates of the boss entities directly.

As for the rendering functions, 5 out of 10 are unused. And while those definitely make up less than half of the code, I still must have spent at least 1 of those 4 pushes on effectively unused functionality.
Only one of these functions lends itself to some speculation. For Elis' entrance animation, the class provides functions for wavy blitting and unblitting, which use a separate X coordinate for every line of the sprite. But there's also an unused and sort of broken one for unblitting two overlapping wavy sprites, located at the same Y coordinate. This might indicate that Elis could originally split herself into two sprites, similar to TH04 Stage 6 Yuuka? Or it might just have been some other kind of animation effect, who knows.


After over 3 months of TH01 progress though, it's finally time to look at other games, to cover the rest of the crowdfunding backlog. Next up: Going back to TH05, and getting rid of those last PI false positives. And since I can potentially spend the next 7 weeks on almost full-time ReC98 work, I've also re-opened the store until October!

📝 Posted:
🚚 Summary of:
P0103, P0104
Commits:
b60f38d...05c0028, 05c0028...3622eb6
💰 Funded by:
Ember2528
🏷 Tags:

It's vacation time! Which, for ReC98, means "relaxing by looking at something boring and uninteresting that we'll ultimately have to cover anyway"… like the TH01 HUD.

📝 As noted earlier, all the score, card combo, stage, and time numbers are drawn into VRAM. Which turns TH01's HUD rendering from the trivial, gaiji-assisted text RAM writes we see in later games to something that, once again, requires blitting and unblitting steps. For some reason though, everything on there is blitted to both VRAM pages? And that's why the HUD chose to allocate a bunch of .PTN sprite slots to store the background behind all "animated" elements at the beginning of a 4-stage scene or boss battle… separately for every affected 16×16 area. (Looking forward to the completely unnecessary code in the Sariel fight that updates these slots after the backgrounds were animated!) And without any separation into helper functions, we end up with the same blitting calls separately copy-pasted for every single HUD element. That's why something as seemingly trivial as this isn't even done after 2 pushes, as we're still missing the stage timer.

Thankfully, the .PTN function signatures come with none of ZUN's little inconsistencies, so I was able to mostly reduce this copy-pasta to a bunch of small inline functions and macros. Those interfaces still remain a bit annoying, though. As a 32×32 format, .PTN merely supports 16×16 sprites with a separate bunch of functions that take an additional quarter parameter from 0 to 3, to select one of the 4 16×16 quarters in a such a sprite…


For life and bomb counts, there was no way around VRAM though, since ZUN wanted to use more than a single color for those. This is where we find at least somewhat of a mildly interesting quirk in all of this: Any life counts greater than the intended 6 will wrap into new rows, with the bombs in the second row overlapping those excess lives. With the way the rest of the HUD rendering works, that wrapping code code had to be explicitly written… which means that ZUN did in fact accomodate (his own?) cheating there.

TH01 life wrapping

Now, I promised image formats, and in the middle of this copy-pasta, we did get one… sort of. MASK.GRF, the red HUD background, is entirely handled with two small bespoke functions… and that's all the code we have for this format. Basically, it's a variation on the 📝 .GRZ format we've seen earlier. It uses the exact same RLE algorithm, but only has a single byte stream for both RLE commands and pixel data… as you would expect from an RLE format.

.GRF actually stores 4 separately encoded RLE streams, which suggests that it was intended for full 16-color images. Unfortunately, MASK.GRF only contains 4 copies of the same HUD background :zunpet:, so no unused beta data for us there. The only thing we could derive from 4 identical bitplanes would be that the background was originally meant to be drawn using color #15, rather than the red seen in the final game. Color #15 is a stage-specific background color that would have made the HUD blend in quite nicely – in the YuugenMagan fight, it's the changing color of the in the background, for example. But really, with no generic implementation of this format, that's all just speculation.

Oh, and in case you were looking for a rip of that image:

TH01 HUD background (MASK.GRF)

So yeah, more of the usual TH01 code, with the usual small quirks, but nothing all too horrible – as expected. Next up: The image formats that didn't make it into this push.

📝 Posted:
🚚 Summary of:
P0099, P0100, P0101, P0102
Commits:
1799d67...1b25830, 1b25830...ceb81db, ceb81db...c11a956, c11a956...b60f38d
💰 Funded by:
Ember2528, Yanga
🏷 Tags:

Well, make that three days. Trying to figure out all the details behind the sprite flickering was absolutely dreadful…
It started out easy enough, though. Unsurprisingly, TH01 had a quite limited pellet system compared to TH04 and TH05:

As expected from TH01, the code comes with its fair share of smaller, insignificant ZUN bugs and oversights. As you would also expect though, the sprite flickering points to the biggest and most consequential flaw in all of this.


Apparently, it started with ZUN getting the impression that it's only possible to use the PC-98 EGC for fast blitting of all 4 bitplanes in one CPU instruction if you blit 16 horizontal pixels (= 2 bytes) at a time. Consequently, he only wrote one function for EGC-accelerated sprite unblitting, which can only operate on a "grid" of 16×1 tiles in VRAM. But wait, pellets are not only just 8×8, but can also be placed at any unaligned X position…

… yet the game still insists on using this 16-dot-aligned function to unblit pellets, forcing itself into using a super sloppy 16×8 rectangle for the job. 🤦 ZUN then tried to mitigate the resulting flickering in two hilarious ways that just make it worse:

  1. An… "interlaced rendering" mode? This one's activated for all Stage 15 and 20 fights, and separates pellets into two halves that are rendered on alternating frames. Collision detection with the Yin-Yang Orb and the player is only done for the visible half, but collision detection with player shots is still done for all pellets every frame, as are motion updates – so that pellets don't end up moving half as fast as they should.
    So yeah, your eyes weren't deceiving you. The game does effectively drop its perceived frame rate in the Elis, Kikuri, Sariel, and Konngara fights, and it does so deliberately.
  2. 📝 Just like player shots, pellets are also unblitted, moved, and rendered in a single function. Thanks to the 16×8 rectangle, there's now the (completely unnecessary) possibility of accidentally unblitting parts of a sprite that was previously drawn into the 8 pixels right of a pellet. And this is where ZUN went full :tannedcirno: and went "oh, I know, let's test the entire 16 pixels, and in case we got an entity there, we simply make the pellet invisible for this frame! Then we don't even have to unblit it later!" :zunpet:

    Except that this is only done for the first 3 elements of the player shot array…?! Which don't even necessarily have to contain the 3 shots fired last. It's not done for the player sprite, the Orb, or, heck, other pellets that come earlier in the pellet array. (At least we avoided going 𝑂(𝑛²) there?)

    Actually, and I'm only realizing this now as I type this blog post: This test is done even if the shots at those array elements aren't active. So, pellets tend to be made invisible based on comparisons with garbage data. :onricdennat:

    And then you notice that the player shot unblit​/​move​/​render function is actually only ever called from the pellet unblit​/​move​/​render function on the one global instance of the player shot manager class, after pellets were unblitted. So, we end up with a sequence of

    Pellet unblit → Pellet move → Shot unblit → Shot move → Shot render → Pellet render

    which means that we can't ever unblit a previously rendered shot with a pellet. Sure, as terrible as this one function call is from a software architecture perspective, it was enough to fix this issue. Yet we don't even get the intended positive effect, and walk away with pellets that are made temporarily invisible for no reason at all. So, uh, maybe it all just was an attempt at increasing the ramerate on lower spec PC-98 models?

Yup, that's it, we've found the most stupid piece of code in this game, period. It'll be hard to top this.


I'm confident that it's possible to turn TH01 into a well-written, fluid PC-98 game, with no flickering, and no perceived lag, once it's position-independent. With some more in-depth knowledge and documentation on the EGC (remember, there's still 📝 this one TH03 push waiting to be funded), you might even be able to continue using that piece of blitter hardware. And no, you certainly won't need ASM micro-optimizations – just a bit of knowledge about which optimizations Turbo C++ does on its own, and what you'd have to improve in your own code. It'd be very hard to write worse code than what you find in TH01 itself.

(Godbolt for Turbo C++ 4.0J when? Seriously though, that would 📝 also be a great project for outside contributors!)


Oh well. In contrast to TH04 and TH05, where 4 pushes only covered all the involved data types, they were enough to completely cover all of the pellet code in TH01. Everything's already decompiled, and we never have to look at it again. 😌 And with that, TH01 has also gone from by far the least RE'd to the most RE'd game within ReC98, in just half a year! 🎉
Still, that was enough TH01 game logic for a while. :tannedcirno: Next up: Making up for the delay with some more relaxing and easy pieces of TH01 code, that hopefully make just a bit more sense than all this garbage. More image formats, mainly.

📝 Posted:
🏷 Tags:

TH01 pellets are coming up next, and for the first time, we'll have the chance to move hardcoded sprite data from ASM land to C land. As it would turn out, bad luck with the 2-byte alignment at the end of REIIDEN.EXE's data segment pretty much forces us to declare TH01's pellet sprites in C if we want to decompile the final few pellet functions without ugly workarounds for the float literals there. And while I could have just converted them into a C array and called it a day, it did raise the question of when we are going to do this The Right And Moddable Way, by auto-converting actual image files into ASM or C arrays during the build process. These arrays are even more annoying to edit in C, after all – unlike TASM, the old C++ we have to work with doesn't support binary number literals, only hexadecimal or, gasp, octal.
Without the explicit funding for such a converter, I reached out to GitHub, asking backers and outside contributors whether they'd be in favor of it. As something that requires no RE skills and collides with nothing else, it would be a perfect task for C/C++ coders who want to support ReC98 with something other than money.

And surprisingly, those still exist! Jonathan Campbell, of DOSBox-X fame, went ahead and implemented all the required functionality, within just a few days. Thanks again! The result is probably a lot more portable than it would have been if I had written it. Which is pretty relevant for future port authors – any additional tooling we write ourselves should not add to the list of problems they'll have to worry about.

Right now, all of the sprites are #included from the big ASM dump files, which means that they have to be converted before those files are assembled during the 32-bit build part. We could have introduced a third distinct build step there, perhaps even a 16-bit one so that we can use Turbo C++ 4.0J to also compile the converter… However, the more reasonable option was to do this at the beginning of the 32-bit build step, and add a 32-bit Windows C++ compiler to the list of tools required for ReC98's build process.
And the best choice for ReC98 is, in fact… 🥁… the 20-year-old Borland C++ 5.5 freeware release. See the README for a lengthy justification, as well as download links.

So yes, all sprites mentioned in the GitHub issue can now be modded by simply editing .BMP files, using an image editor of your choice. 🖌
And now that that's dealt with, it's finally time for more actual progress! TH01 pellets coming tomorrow.

📝 Posted:
🚚 Summary of:
P0096, P0097, P0098
Commits:
8ddb778...8283c5e, 8283c5e...600f036, 600f036...ad06748
💰 Funded by:
Ember2528, Yanga
🏷 Tags:

So, let's finally look at some TH01 gameplay structures! The obvious choices here are player shots and pellets, which are conveniently located in the last code segment. Covering these would therefore also help in transferring some first bits of data in REIIDEN.EXE from ASM land to C land. (Splitting the data segment would still be quite annoying.) Player shots are immediately at the beginning…

…but wait, these are drawn as transparent sprites loaded from .PTN files. Guess we first have to spend a push on 📝 Part 2 of this format.
Hm, 4 functions for alpha-masked blitting and unblitting of both 16×16 and 32×32 .PTN sprites that align the X coordinate to a multiple of 8 (remember, the PC-98 uses a planar VRAM memory layout, where 8 pixels correspond to a byte), but only one function that supports unaligned blitting to any X coordinate, and only for 16×16 sprites? Which is only called twice? And doesn't come with a corresponding unblitting function? :thonk:

Yeah, "unblitting". TH01 isn't double-buffered, and uses the PC-98's second VRAM page exclusively to store a stage's background and static sprites. Since the PC-98 has no hardware sprites, all you can do is write pixels into VRAM, and any animated sprite needs to be manually removed from VRAM at the beginning of each frame. Not using double-buffering theoretically allows TH01 to simply copy back all 128 KB of VRAM once per frame to do this. :tannedcirno: But that would be pretty wasteful, so TH01 just looks at all animated sprites, and selectively copies only their occupied pixels from the second to the first VRAM page.


Alright, player shot class methods… oh, wait, the collision functions directly act on the Yin-Yang Orb, so we first have to spend a push on that one. And that's where the impression we got from the .PTN functions is confirmed: The orb is, in fact, only ever displayed at byte-aligned X coordinates, divisible by 8. It's only thanks to the constant spinning that its movement appears at least somewhat smooth.
This is purely a rendering issue; internally, its position is tracked at pixel precision. Sadly, smooth orb rendering at any unaligned X coordinate wouldn't be that trivial of a mod, because well, the necessary functions for unaligned blitting and unblitting of 32×32 sprites don't exist in TH01's code. Then again, there's so much potential for optimization in this code, so it might be very possible to squeeze those additional two functions into the same C++ translation unit, even without position independence…

More importantly though, this was the right time to decompile the core functions controlling the orb physics – probably the highlight in these three pushes for most people.
Well, "physics". The X velocity is restricted to the 5 discrete states of -8, -4, 0, 4, and 8, and gravity is applied by simply adding 1 to the Y velocity every 5 frames :zunpet: No wonder that this can easily lead to situations in which the orb infinitely bounces from the ground.
At least fangame authors now have a reference of how ZUN did it originally, because really, this bad approximation of physics had to have been written that way on purpose. But hey, it uses 64-bit floating-point variables! :onricdennat:

…sometimes at least, and quite randomly. This was also where I had to learn about Turbo C++'s floating-point code generation, and how rigorously it defines the order of instructions when mixing double and float variables in arithmetic or conditional expressions. This meant that I could only get ZUN's original instruction order by using literal constants instead of variables, which is impossible right now without somehow splitting the data segment. In the end, I had to resort to spelling out ⅔ of one function, and one conditional branch of another, in inline ASM. 😕 If ZUN had just written 16.0 instead of 16.0f there, I would have saved quite some hours of my life trying to decompile this correctly…

To sort of make up for the slowdown in progress, here's the TH01 orb physics debug mod I made to properly understand them. Edit (2022-07-12): This mod is outdated, 📝 the current version is here! 2020-06-13-TH01OrbPhysicsDebug.zip To use it, simply replace REIIDEN.EXE, and run the game in debug mode, via game d on the DOS prompt.
Its code might also serve as an example of how to achieve this sort of thing without position independence.

Screenshot of the TH01 orb physics debug mod

Alright, now it's time for player shots though. Yeah, sure, they don't move horizontally, so it's not too bad that those are also always rendered at byte-aligned positions. But, uh… why does this code only use the 16×16 alpha-masked unblitting function for decaying shots, and just sloppily unblits an entire 16×16 square everywhere else?

The worst part though: Unblitting, moving, and rendering player shots is done in a single function, in that order. And that's exactly where TH01's sprite flickering comes from. Since different types of sprites are free to overlap each other, you'd have to first unblit all types, then move all types, and then render all types, as done in later PC-98 Touhou games. If you do these three steps per-type instead, you will unblit sprites of other types that have been rendered before… and therefore end up with flicker.
Oh, and finally, ZUN also added an additional sloppy 16×16 square unblit call if a shot collides with a pellet or a boss, for some guaranteed flicker. Sigh.


And that's ⅓ of all ZUN code in TH01 decompiled! Next up: Pellets!

📝 Posted:
🚚 Summary of:
P0095
Commits:
57a8487...8ddb778
💰 Funded by:
Yanga
🏷 Tags:

🎉 TH01's OP.EXE and FUUIN.EXE are now fully position-independent! 🎉

What does this mean?

You can now add any data or code to TH01's main menu or ending cutscenes, by simply editing the ReC98 source, writing your mod in ASM or C++, and recompiling the code. Since all absolute memory addresses in OP and FUUIN have now been converted to labels, this will work without causing any instability. See the position independence section in the FAQ for a more thorough explanation about why this was a problem.
As an example, the most popular TH01 mod idea, replacing MDRV2 with PMD, could now at least be prototyped and tested in OP.EXE, without having to worry about x86 instruction lengths.
📝 Check the video I made for the TH04/TH05 OP.EXE PI announcement for a basic overview of how to do that.

What does this not mean?

The original ZUN code hasn't been completely decompiled yet. The final high-level parts of both the main menu and the cutscenes are still ASM, which might make modding a bit inconvenient right now.
It's not that much more code though, and could quickly be covered in a few pushes if requested. Due to the plentiful monthly subscriptions, the shop will stay closed for regular orders until the end of June, but backers with outstanding contributions could request that now if they want to – simply drop me a mail. Otherwise, the "generic TH01 RE" money will continue to go towards the main game. That way, we'll have more substance to show once we do decide to decompile the rest of OP.EXE and FUUIN.EXE, and likely get some press coverage as a result.


Then again, we've been building up to this point over the last few pushes, and it only really needed a quick look over the remaining false positives. The majority of the time therefore went towards more PI in REIIDEN.EXE, where the bitplane pointers for .BOS files yielded some quite big gains. Couldn't really find any obvious reason why ZUN used two slighly different variations on loading and blitting those files, though… :onricdennat:

As the final function in this rather random push, we got TH01's hardware-powered scrolling function, used for screen shaking effects and the scrolling backgrounds at the start of the Final Boss stages. And while I tried to document all these I/O writes… it turned out that ZUN actually copied the entire function straight from the PC-9801 Programmers' Bible, with no changes. :zunpet: It's the setgsta() example function on page 150. Which is terribly suboptimal and bloated – all those integer divisions are really not how you'd write such code for a 16-bit compiler from the 90's…

And that gives us 60% PI overall, and 50% PI over all of TH01! Next up: More structures… and classes, even?

📝 Posted:
🚚 Summary of:
P0092, P0093, P0094
Commits:
29c5a73...4403308, 4403308...0e73029, 0e73029...57a8487
💰 Funded by:
Yanga, Ember2528
🏷 Tags:

Three pushes to decompile the TH01 high score menu… because it's completely terrible, and needlessly complicated in pretty much every aspect:

In the end, I just gave up with my usual redundancy reduction efforts for this one. Anyone wanting to change TH01's high score name entering code would be better off just rewriting the entire thing properly.

And that's all of the shared code in TH01! Both OP.EXE and FUUIN.EXE are now only missing the actual main menu and ending code, respectively. Next up, though: The long awaited TH01 PI push. Which will not only deliver 100% PI for OP.EXE and FUUIN.EXE, but also probably quite some gains in REIIDEN.EXE. With now over 30% of the game decompiled, it's about time we get to look at some gameplay code!

📝 Posted:
🚚 Summary of:
P0090, P0091
Commits:
90252cc...07dab29, 07dab29...29c5a73
💰 Funded by:
Yanga, Ember2528
🏷 Tags:

Back to TH01, and its high score menu… oh, wait, that one will eventually involve keyboard input. And thanks to the generous TH01 funding situation, there's really no reason not to cover that right now. After all, TH01 is the last game where input still hadn't been RE'd.
But first, let's also cover that one unused blitting function, together with REIIDEN.CFG loading and saving, which are in front of the input function in OP.EXE… (By now, we all know about the hidden start bomb configuration, right?)

Unsurprisingly, the earliest game also implements input in the messiest way, with a different function for each of the three executables. "Because they all react differently to keyboard inputs :zunpet:", apparently? OP.EXE even has two functions for it, one for the START / CONTINUE / OPTION / QUIT main menu, and one for both Option and Music Test menus, both of which directly perform the ring arithmetic on the menu cursor variable. A consistent separation of keyboard polling from input processing apparently wasn't all too obvious of a thought, since it's only truly done from TH02 on.

This lack of proper architecture becomes actually hilarious once you notice that it did in fact facilitate a recursion bug! :godzun: In case you've been living under a rock for the past 8 years, TH01 shipped with debugging features, which you can enter by running the game via game d from the DOS prompt. These features include a memory info screen, shown when pressing PgUp, implemented as one blocking function (test_mem()) called directly in response to the pressed key inside the polling function. test_mem() only returns once that screen is left by pressing PgDown. And in order to poll input… it directly calls back into the same polling function that called it in the first place, after a 3-frame delay.

Which means that this screen is actually re-entered for every 3 frames that the PgUp key is being held. And yes, you can, of course, also crash the system via a stack overflow this way by holding down PgUp for a few seconds, if that's your thing.
Edit (2020-09-17): Here's a video from spaztron64, showing off this exact stack overflow crash while running under the VEM486 memory manager, which displays additional information about these sorts of crashes:

What makes this even funnier is that the code actually tracks the last state of every polled key, to prevent exactly that sort of bug. But the copy-pasted assignment of the last input state is only done after test_mem() already returned, making it effectively pointless for PgUp. It does work as intended for PgDown… and that's why you have to actually press and release this key once for every call to test_mem() in order to actually get back into the game. Even though a single call to PgDown will already show the game screen again.

In maybe more relevant news though, this function also came with what can be considered the first piece of actual gameplay logic! Bombing via double-tapping the Z and X keys is also handled here, and now we know that both keys simply have to be tapped twice within a window of 20 frames. They are tracked independently from each other, so you don't necessarily have to press them simultaneously.
In debug mode, the bomb count tracks precisely this window of time. That's why it only resets back to 0 when pressing Z or X if it's ≥20.

Sure, TH01's code is expectedly terrible and messy. But compared to the micro-optimizations of TH04 and TH05, it's an absolute joy to work on, and opening all these ZUN bug loot boxes is just the icing on the cake. Looking forward to more of the high score menu in the next pushes!

📝 Posted:
🚚 Summary of:
P0088, P0089
Commits:
97ce7b7...da6b856, da6b856...90252cc
💰 Funded by:
-Tom-, [Anonymous], Blue Bolt
🏷 Tags:

As expected, we've now got the TH04 and TH05 stage enemy structure, finishing position independence for all big entity types. This one was quite straightfoward, as the .STD scripting system is pretty simple.

Its most interesting aspect can be found in the way timing is handled. In Windows Touhou, all .ECL script instructions come with a frame field that defines when they are executed. In TH04's and TH05's .STD scripts, on the other hand, it's up to each individual instruction to add a frame time parameter, anywhere in its parameter list. This frame time defines for how long this instruction should be repeatedly executed, before it manually advances the instruction pointer to the next one. From what I've seen so far, these instruction typically apply their effect on the first frame they run on, and then do nothing for the remaining frames.
Oh, and you can't nest the LOOP instruction, since the enemy structure only stores one single counter for the current loop iteration.

Just from the structure, the only innovation introduced by TH05 seems to have been enemy subtypes. These can be used to parametrize scripts via conditional jumps based on this value, as a first attempt at cutting down the need to duplicate entire scripts for similar enemy behavior. And thanks to TH05's favorable segment layout, this game's version of the .STD enemy script interpreter is even immediately ready for decompilation, in one single future push.

As far as I can tell, that now only leaves

until TH05's MAIN.EXE is completely position-independent.
Which, however, won't be all it needs for that 100% PI rating on the front page. And with that many false positives, it's quite easy to get lost with immediately reverse-engineering everything around them. This time, the rendering of the text dissolve circles, used for the stage and BGM title popups, caught my eye… and since the high-level code to handle all of that was near the end of a segment in both TH04 and TH05, I just decided to immediately decompile it all. Like, how hard could it possibly be? Sure, it needed another segment split, which was a bit harder due to all the existing ASM referencing code in that segment, but certainly not impossible…

Oh wait, this code depends on 9 other sets of identifiers that haven't been declared in C land before, some of which require vast reorganizations to bring them up to current consistency standards. Whoops! Good thing that this is the part of the project I'm still offering for free…
Among the referenced functions was tiles_invalidate_around(), which marks the stage background tiles within a rectangular area to be redrawn this frame. And this one must have had the hardest function signature to figure out in all of PC-98 Touhou, because it actually seems impossible. Looking at all the ways the game passes the center coordinate to this function, we have

  1. X and Y as 16-bit integer literals, merged into a single PUSH of a 32-bit immediate
  2. X and Y calculated and pushed independently from each other
  3. by-value copies of entire Point instances

Any single declaration would only lead to at most two of the three cases generating the original instructions. No way around separately declaring the function in every translation unit then, with the correct parameter list for the respective calls. That's how ZUN must have also written it.

Oh well, we would have needed to do all of this some time. At least there were quite a bit of insights to be gained from the actual decompilation, where using const references actually made it possible to turn quite a number of potentially ugly macros into wholesome inline functions.

But still, TH04 and TH05 will come out of ReC98's decompilation as one big mess. A lot of further manual decompilation and refactoring, beyond the limits of the original binary, would be needed to make these games portable to any non-PC-98, non-x86 architecture.
And yes, that includes IBM-compatible DOS – which, for some reason, a number of people see as the obvious choice for a first system to port PC-98 Touhou to. This will barely be easier. Sure, you'll save the effort of decompiling all the remaining original ASM. But even with master.lib's MASTER_DOSV setting, these games still very much rely on PC-98 hardware, with corresponding assumptions all over ZUN's code. You will need to provide abstractions for the PC-98's superimposed text mode, the gaiji, and planar 4-bit color access in general, exchanging the use of the PC-98's GRCG and EGC blitter chips with something else. At that point, you might as well port the game to one generic 640×400 framebuffer and away from the constraints of DOS, resulting in that Doom source code-like situation which made that game easily portable to every architecture to begin with. But ZUN just wasn't a John Carmack, sorry.

Or what do I know. I've never programmed for IBM-compatible DOS, but maybe ReC98's audience does include someone who is intimately familiar with IBM-compatible DOS so that the constraints aren't much of an issue for them? But even then, 16-bit Windows would make much more sense as a first porting target if you don't want to bother with that undecompilable ASM.

At least I won't have to look at TH04 and TH05 for quite a while now. :tannedcirno: The delivery delays have made it obvious that my life has become pretty busy again, probably until September. With a total of 9 TH01 pushes from monthly subscriptions now waiting in the backlog, the shop will stay closed until I've caught up with most of these. Which I'm quite hyped for!

📝 Posted:
🚚 Summary of:
P0086, P0087
Commits:
54ee99b...24b96cd, 24b96cd...97ce7b7
💰 Funded by:
[Anonymous], Blue Bolt, -Tom-
🏷 Tags:

Alright, the score popup numbers shown when collecting items or defeating (mid)bosses. The second-to-last remaining big entity type in TH05… with quite some PI false positives in the memory range occupied by its data. Good thing I still got some outstanding generic RE pushes that haven't been claimed for anything more specific in over a month! These conveniently allowed me to RE most of these functions right away, the right way.

Most of the false positives were boss HP values, passed to a "boss phase end" function which sets the HP value at which the next phase should end. Stage 6 Yuuka, Mugetsu, and EX-Alice have their own copies of this function, in which they also reset certain boss-specific global variables. Since I always like to cover all varieties of such duplicated functions at once, it made sense to reverse-engineer all the involved variables while I was at it… and that's why this was exactly the right time to cover the implementation details of Stage 6 Yuuka's parasol and vanishing animations in TH04. :zunpet:

With still a bit of time left in that RE push afterwards, I could also start looking into some of the smaller functions that didn't quite fit into other pushes. The most notable one there was a simple function that aims from any point to the current player position. Which actually only became a separate function in TH05, probably since it's called 27 times in total. That's 27 places no longer being blocked from further RE progress.

WindowsTiger already did most of the work for the score popup numbers in January, which meant that I only had to review it and bring it up to ReC98's current coding styles and standards. This one turned out to be one of those rare features whose TH05 implementation is significantly less insane than the TH04 one. Both games lazily redraw only the tiles of the stage background that were drawn over in the previous frame, and try their best to minimize the amount of tiles to be redrawn in this way. For these popup numbers, this involves calculating the on-screen width, based on the exact number of digits in the point value. TH04 calculates this width every frame during the rendering function, and even resorts to setting that field through the digit iteration pointer via self-modifying code… yup. TH05, on the other hand, simply calculates the width once when spawning a new popup number, during the conversion of the point value to binary-coded decimal. The "×2" multiplier suffix being removed in TH05 certainly also helped in simplifying that feature in this game.

And that's ⅓ of TH05 reverse-engineered! Next up, one more TH05 PI push, in which the stage enemies hopefully finish all the big entity types. Maybe it will also be accompanied by another RE push? In any case, that will be the last piece of TH05 progress for quite some time. The next TH01 stretch will consist of 6 pushes at the very least, and I currently have no idea of how much time I can spend on ReC98 a month from now…

📝 Posted:
🚚 Summary of:
P0085
Commits:
110d6dd...54ee99b
💰 Funded by:
-Tom-
🏷 Tags:

Wait, PI for FUUIN.EXE is mainly blocked by the high score menu? That one should really be properly decompiled in a separate RE push, since it's also present in largely identical form in REIIDEN.EXE… but I currently lack the explicit funding to do that.

And as it turns out, I shouldn't really capture any of the existing generic RE contributions for it either. Back in 2018 when I ran the crowdfunding on the Touhou Patch Center Discord server, I said that generic RE contributions would never go towards TH01. No one was interested in that game back then, and as it's significantly different from all the other games, it made sense to only cover it if explicitly requested.
As Touhou Patch Center still remains one of the biggest supporters and advertisers for ReC98, someone recently believed that this rule was still in effect, despite not being mentioned anywhere on this website.

Fast forward to today, and TH01 has become the single most supported game lately, with plenty of incomplete pushes still open to be completed. Reverse-engineering it has proven to be quite efficient, yielding lots of completion percentage points per push. This, I suppose, is exactly what backers that don't give any specific priorities are mainly interested in. Therefore, I will allocate future partial contributions to TH01, whenever it makes sense.

So, instead of rushing TH01 PI, let's wait for Ember2528's April subscription, and get the 25% total RE milestone with some TH05 PI progress instead. This one primarily focused on the gather circles (spirals…?), the third-last missing entity type in TH05. These are rendered using the same 8×8 pellet sprite introduced in TH02… except that the actual pellets received a darkened bottom part in TH04 . Which, in turn, is actually rendered quite efficiently – the games first render the top white part of all pellets, followed by the bottom gray part of all pellets. The PC-98 GRCG is used throughout the process, doing its typical job of accelerating monochrome blitting, and by arranging the rendering like this, only two GRCG color changes are required to draw any number of pellets. I guess that makes it quite a worthwhile optimization? Don't ask me for specific performance numbers or even saved cycles, though :onricdennat:

Next up, one more TH05 PI push!

📝 Posted:
🚚 Summary of:
P0084
Commits:
dfac2f2...110d6dd
💰 Funded by:
Yanga
🏷 Tags:

Final TH01 RE push for the time being, and as expected, we've got the superficially final piece of shared code between the TH01 executables. However, just having a single implementation for loading and recreating the REYHI*.DAT score files would have been way above ZUN's standards of consistency. So ZUN had the unique idea to mix up the file I/O APIs, using master.lib functions in REIIDEN.EXE, and POSIX functions (along with error messages and disabled interrupts) in FUUIN.EXE:zunpet: Could have been worse though, as it was possible to abstract that away quite nicely.

That code wasn't quite in the natural way of decompilation either. As it turns out though, 📝 segment splitting isn't so painful after all if one of the new segments only has a few functions. Definitely going to do that more often from now on, since it allows a much larger number of functions to be immediately decompiled. Which is always superior to somehow transforming a function's ASM into a form that I can confidently call "reverse-engineered", only to revisit it again later for its decompilation.

And while I unfortunately missed 25% of total RE by a bit, this push reached two other and perhaps even more significant milestones:

Next up, PI milestones!

📝 Posted:
🚚 Summary of:
P0083
Commits:
f6cbff0...dfac2f2
💰 Funded by:
Yanga
🏷 Tags:

Nope, RL has given me plenty of things to do from home after all, so the current cap still remains an accurate representation of my free time. 😕

For now though, we've got one more TH01 file format push, covering the core functions for loading and displaying the 32×32 and 16×16 sprites from the .PTN files, as announced – and probably one of the last ones for quite a while to yield both RE and PI progress way above average. But what is this, error return values in a ZUN game?! And actually good code for deriving the alpha channel from the 16th color in the hardware palette?! Sure, the rest of the code could still be improved a lot, but that was quite a surprise, especially after the spaghetti code of 📝 the last push. That makes up for two of the .PTN structure fields (one of them always 0, and one of them always 1) remaining unused, and therefore unknown.

ZUN also uses the .PTN image slots to store the background of frequently updated VRAM sections, in order to be able to repeatedly draw on top of them – like for example the HUD area where the score and time numbers are drawn. Future games would simply use the text RAM and gaiji for those numbers. This would have worked just fine for TH01 too – especially since all the functions decompiled so far align the VRAM X coordinate to the 8-pixel byte grid, which is the simplest way of accessing VRAM given the PC-98's planar memory layout. Looks as if ZUN simply wasn't aware of gaiji during the development of TH01.

This won't be the last time I cover the .PTN format, since all the blitting functions that actually use alpha are exclusive to REIIDEN.EXE, and currently out of decompilation reach. But after some more long overdue cleaning work, TH01 has now passed both TH02 and even TH04 to become the second-most reverse-engineered game in all of ReC98, in terms of absolute numbers! 🎉

Also, PI for TH01's OP.EXE is imminent. Next up though, we've first got the probably final double-speed push for TH01, covering the last set of duplicated functions between the three binaries – quite fitting for the currently last fully funded, outstanding TH01 RE push. Then, we also might get FUUIN.EXE PI within the same push afterwards? After that, TH01 progress will be slowing down, since I'd then have to cover either the main menu or in-game code or the cutscenes, depending on what the backers request. (By default, it's going to be in-game code, of course.)

📝 Posted:
🚚 Summary of:
P0082
Commits:
5ac9b30...f6cbff0
💰 Funded by:
Ember2528
🏷 Tags:

Last of the 3 weeks of almost full-time ReC98 work, supposedly the least stressful one, and then things still get delayed thanks to illness 😕 In better news though, it looks like I'll be able to extend these 3 weeks to 8, as my RL is shutting down for coronavirus reasons. I'm going to wait a bit for the dust to settle before raising the crowdfunding cap though, since RL might give me more to do from home after all. I may or may not also get commissioned for a non-Touhou translation patch project to be worked on in that time…

The .GRP file functions turned out to, of course, also be present in FUUIN.EXE. In fact, that binary had the largest share of progress in this push, since it's the only one to include another reimplementation of master.lib-style hardware palette fading. As a typical little ZUN inconsistency, the FUUIN.EXE version of one .GRP palette function directly calls one of these functions.

As for the functions themselves, they basically wrap the single-function Pi load and display library by 電脳科学研究所/BERO in a bowl of global state spaghetti. 🍝 At least the function names now clearly encode important side effects like, y'know, a changed hardware palette. The reason ZUN used this separate library over master.lib's PI loading functions was probably its support for defining a color as transparent. This feature is used for the red box in the main menu, and the large cyan Siddhaṃ seed syllables in (again) the Konngara fight.

Next up, we've got the .PTN format!

📝 Posted:
🚚 Summary of:
P0081
Commits:
0252da2...5ac9b30
💰 Funded by:
Ember2528
🏷 Tags:

Sadly, we've already reached the end of fast triple-speed TH01 progress with 📝 the last push, which decompiled the last segment shared by all three of TH01's executables. There's still a bit of double-speed progress left though, with a small number of code segments that are shared between just two of the three executables.

At the end of the first one of these, we've got all the code for the .GRZ format – which is yet another run-length encoded image format, but this time storing up to 16 full 640×400 16-color images with an alpha bit. This one is exclusively used to wastefully store Konngara's sword slash and kuji-in kill animations. Due to… suboptimal code organization, the code for the format is also present in OP.EXE, despite not being used there. But hey, that brings TH01 to over 20% in RE!

Decoupling the RLE command stream from the pixel data sounds like a nice idea at first, allowing the format to efficiently encode a variety of animation frames displayed all over the screen… if ZUN actually made use of it. The RLE stream also has quite some ridiculous overhead, starting with 1 byte to store the 1-bit command (putting a single 8×1 pixel block, or entering a run of N such blocks). Run commands then store another 1-byte run length, which has to be followed by another command byte to identify the run as putting N blocks, or skipping N blocks. And the pixel data is just a sequence of these blocks for all 4 bitplanes, in uncompressed form…

Also, have some rips of all the images this format is used for:

<code>boss8.grz</code>, image 1/16<code>boss8.grz</code>, image 2/16<code>boss8.grz</code>, image 3/16<code>boss8.grz</code>, image 4/16<code>boss8.grz</code>, image 5/16<code>boss8.grz</code>, image 6/16<code>boss8.grz</code>, image 7/16<code>boss8.grz</code>, image 8/16<code>boss8.grz</code>, image 9/16<code>boss8.grz</code>, image 10/16<code>boss8.grz</code>, image 11/16<code>boss8.grz</code>, image 12/16<code>boss8.grz</code>, image 13/16<code>boss8.grz</code>, image 14/16<code>boss8.grz</code>, image 15/16<code>boss8.grz</code>, image 16/16

To make these, I just wrote a small viewer, calling the same decompiled TH01 code: 2020-03-07-grzview.zip Obviously, this means that it not only must to be run on a PC-98, but also discards the alpha information. If any backers are really interested in having a proper converter to and from PNG, I can implement that in an upcoming push… although that would be the perfect thing for outside contributors to do.

Next up, we got some code for the PI format… oh, wait, the actual files are called "GRP" in TH01.

📝 Posted:
🚚 Summary of:
P0080
Commits:
cd48aa3...0252da2
💰 Funded by:
Splashman, Ember2528
🏷 Tags:

Last part of TH01's main graphics function segment, and we've got even more code that alternates between being boring and being slightly weird. But at least, "boring" also meant "consistent" for once. And so progress continued to be as fast as expected from the last TH01 pushes, yielding 3.3% in TH01 RE%, and 1% in overall RE%, within a single day. There even was enough time to decompile another full code segment, which bundles all the hardware initialization and cleanup calls into single functions to be run when starting and exiting the game. Which might be interesting for at least one person, I guess :tannedcirno:

But seriously, trying to access page 2 on a system with only page 0 and 1? Had to get out my real PC-98 to double-check that I wasn't missing anything here, since every emulator only looks at the bottom bit of the page number. But real hardware seems to do the same, and there really is nothing special to it semantically, being equivalent to page 0. 🤷

Next up in TH01, we'll have some file format code!

📝 Posted:
🚚 Summary of:
P0078, P0079
Commits:
f4eb7a8...9e52cb1, 9e52cb1...cd48aa3
💰 Funded by:
iruleatgames, -Tom-
🏷 Tags:

To finish this TH05 stretch, we've got a feature that's exclusive to TH05 for once! As the final memory management innovation in PC-98 Touhou, TH05 provides a single static (64 * 26)-byte array for storing up to 64 entities of a custom type, specific to a stage or boss portion. (Edit (2023-05-29): This system actually debuted in 📝 TH04, where it was used for much simpler entities.)

TH05 uses this array for

  1. the Stage 2 star particles,
  2. Alice's puppets,
  3. the tip of curve ("jello") bullets,
  4. Mai's snowballs and Yuki's fireballs,
  5. Yumeko's swords,
  6. and Shinki's 32×32 bullets,

which makes sense, given that only one of those will be active at any given time.

On the surface, they all appear to share the same 26-byte structure, with consistently sized fields, merely using its 5 generic fields for different purposes. Looking closer though, there actually are differences in the signedness of certain fields across the six types. uth05win chose to declare them as entirely separate structures, and given all the semantic differences (pixels vs. subpixels, regular vs. tiny master.lib sprites, …), it made sense to do the same in ReC98. It quickly turned out to be the only solution to meet my own standards of code readability.

Which blew this one up to two pushes once again… But now, modders can trivially resize any of those structures without affecting the other types within the original (64 * 26)-byte boundary, even without full position independence. While you'd still have to reduce the type-specific number of distinct entities if you made any structure larger, you could also have more entities with fewer structure members.

As for the types themselves, they're full of redundancy once again – as you might have already expected from seeing #4, #5, and #6 listed as unrelated to each other. Those could have indeed been merged into a single 32×32 bullet type, supporting all the unique properties of #4 (destructible, with optional revenge bullets), #5 (optional number of twirl animation frames before they begin to move) and #6 (delay clouds). The *_add(), *_update(), and *_render() functions of #5 and #6 could even already be completely reverse-engineered from just applying the structure onto the ASM, with the ones of #3 and #4 only needing one more RE push.

But perhaps the most interesting discovery here is in the curve bullets: TH05 only renders every second one of the 17 nodes in a curve bullet, yet hit-tests every single one of them. In practice, this is an acceptable optimization though – you only start to notice jagged edges and gaps between the fragments once their speed exceeds roughly 11 pixels per second:

And that brings us to the last 20% of TH05 position independence! But first, we'll have more cheap and fast TH01 progress.

📝 Posted:
🚚 Summary of:
P0076, P0077
Commits:
222fc99...9ae9754, 9ae9754...f4eb7a8
💰 Funded by:
[Anonymous], -Tom-, Splashman
🏷 Tags:

Well, that took twice as long as I thought, with the two pushes containing a lot more maintenance than actual new research. Spending some time improving both field names and types in 32th System's TH03 resident structure finally gives us all of those structures. Which means that we can now cover all the remaining decompilable ZUN.COM parts at once…

Oh wait, their main() functions have stayed largely identical since TH02? Time to clean up and separate that first, then… and combine two recent code generation observations into the solution to a decompilation puzzle from 4½ years ago. Alright, time to decomp-

Oh wait, we'd kinda like to properly RE all the code in TH03-TH05 that deals with loading and saving .CFG files. Almost every outside contributor wanted to grab this supposedly low-hanging fruit a lot earlier, but (of course) always just for a single game, while missing how the format evolved.

So, ZUN.COM. For some reason, people seem to consider it particularly important, even though it contains neither any game logic nor any code specific to PC-98 hardware… All that this decompilable part does is to initialize a game's .CFG file, allocate an empty resident structure using master.lib functions, release it after you quit the game, error-check all that, and print some playful messages~ (OK, TH05's also directly fills the resident structure with all data from MIKO.CFG, which all the other games do in OP.EXE.) At least modders can now freely change and extend all the resident structures, as well as the .CFG files? And translators can translate those messages that you won't see on a decently fast emulator anyway? Have fun, I guess 🤷‍

And you can in fact do this right now – even for TH04 and TH05, whose ZUN.COM currently isn't rebuilt by ReC98. There is actually a rather involved reason for this:

So yeah, no meaningful RE and PI progress at any of these levels. Heck, even as a modder, you can just replace the zun zun_res (TH02), zun -5 (TH03), or zun -s (TH04/TH05) calls in GAME.BAT with a direct call to your modified *RES*.COM. And with the alternative being "manually typing 0 and 1 bits into a text file", editing the sprites in TH05's GJINIT.COM is way more comfortable in a binary sprite editor anyway.

For me though, the best part in all of this was that it finally made sense to throw out the old Borland C++ run-time assembly slices 🗑 This giant waste of time became obvious 5 years ago, but any ASM dump of a .COM file would have needed rather ugly workarounds without those slices. Now that all .COM binaries that were originally written in C are compiled from C, we can all enjoy slightly faster grepping over the entire repository, which now has 229 fewer files. Productivity will skyrocket! :tannedcirno:

Next up: Three weeks of almost full-time ReC98 work! Two more PI-focused pushes to finish this TH05 stretch first, before switching priorities to TH01 again.

📝 Posted:
🚚 Summary of:
P0072, P0073, P0074, P0075
Commits:
4bb04ab...cea3ea6, cea3ea6...5286417, 5286417...1807906, 1807906...222fc99
💰 Funded by:
[Anonymous], -Tom-, Myles
🏷 Tags:

Long time no see! And this is exactly why I've been procrastinating bullets while there was still meaningful progress to be had in other parts of TH04 and TH05: There was bound to be quite some complexity in this most central piece of game logic, and so I couldn't possibly get to a satisfying understanding in just one push.

Or in two, because their rendering involves another bunch of micro-optimized functions adapted from master.lib.

Or in three, because we'd like to actually name all the bullet sprites, since there are a number of sprite ID-related conditional branches. And so, I was refining things I supposedly RE'd in the the commits from the first push until the very end of the fourth.

When we talk about "bullets" in TH04 and TH05, we mean just two things: the white 8×8 pellets, with a cap of 240 in TH04 and 180 in TH05, and any 16×16 sprites from MIKO16.BFT, with a cap of 200 in TH04 and 220 in TH05. These are by far the most common types of… err, "things the player can collide with", and so ZUN provides a whole bunch of pre-made motion, animation, and n-way spread / ring / stack group options for those, which can be selected by simply setting a few fields in the bullet template. All the other "non-bullets" have to be fired and controlled individually.

Which is nothing new, since uth05win covered this part pretty accurately – I don't think anyone could just make up these structure member overloads. The interesting insights here all come from applying this research to TH04, and figuring out its differences compared to TH05. The most notable one there is in the default groups: TH05 allows you to add a stack to any single bullet, n-way spread or ring, but TH04 only lets you create stacks separately from n-way spreads and rings, and thus gets by with fewer fields in its bullet template structure. On the other hand, TH04 has a separate "n-way spread with random angles, yet still aimed at the player" group? Which seems to be unused, at least as far as midbosses and bosses are concerned; can't say anything about stage enemies yet.

In fact, TH05's larger bullet template structure illustrates that these distinct group types actually are a rather redundant piece of over-engineering. You can perfectly indicate any permutation of the basic groups through just the stack bullet count (1 = no stack), spread bullet count (1 = no spread), and spread delta angle (0 = ring instead of spread). Add a 4-flag bitfield to cover the rest (aim to player, randomize angle, randomize speed, force single bullet regardless of difficulty or rank), and the result would be less redundant and even slightly more capable.

Even those 4 pushes didn't quite finish all of the bullet-related types, stopping just shy of the most trivial and consistent enum that defines special movement. This also left us in a 📝 TH03-like situation, in which we're still a bit away from actually converting all this research into actual RE%. Oh well, at least this got us way past 50% in overall position independence. On to the second half! 🎉

For the next push though, we'll first have a quick detour to the remaining C code of all the ZUN.COM binaries. Now that the 📝 TH04 and TH05 resident structures no longer block those, -Tom- has requested TH05's RES_KSO.COM to be covered in one of his outstanding pushes. And since 32th System recently RE'd TH03's resident structure, it makes sense to also review and merge that, before decompiling all three remaining RES_*.COM binaries in hopefully a single push. It might even get done faster than that, in which case I'll then review and merge some more of WindowsTiger's research.

📝 Posted:
🚚 Summary of:
P0071
Commits:
327021b...4bb04ab
💰 Funded by:
KirbyComment, -Tom-
🏷 Tags:

Turns out that covering TH03's 128-byte player structure was way more insightful than expected! And while it doesn't include every bit of per-player data, we still got to know quite a bit about the game from just trying to name its members:

The next TH03 pushes can now cover all the functions that reference this structure in one way or another, and actually commit all this research and translate it into some RE%. Since the non-TH05 priorities have become a bit unclear after the last 50 € RE contribution though (as of this writing, it's still 10 € to decide on what game to cover in two RE pushes!), I'll be returning to TH05 until that's decided.

📝 Posted:
🚚 Summary of:
P0070
Commits:
a931758...327021b
💰 Funded by:
KirbyComment
🏷 Tags:

As noted in 📝 P0061, TH03 gameplay RE is indeed going to progress very slowly in the beginning. A lot of the initial progress won't even be reflected in the RE% – there are just so many features in this game that are intertwined into each other, and I only consider functions to be "reverse-engineered" once we understand every involved piece of code and data, and labeled every absolute memory reference in it. (Yes, that means that the percentages on the front page are actually underselling ReC98's progress quite a bit, and reflect a pretty low bound of our actual understanding of the games.)

So, when I get asked to look directly at gameplay code right now, it's quite the struggle to find a place that can be covered within a push or two and that would immediately benefit scoreplayers. The basics of score and combo handling themselves managed to fit in pretty well, though:

Oh well, at least we still got a bit of PI% out of this one. From this point though, the next push (or two) should be enough to cover the big 128-byte player structure – which by itself might not be immediately interesting to scoreplayers, but surely is quite a blocker for everything else.

📝 Posted:
🚚 Summary of:
P0067, P0068, P0069
Commits:
e55a48b...ebb30ce, ebb30ce...2ac00d4, e0d0dcd...0f18dbc
💰 Funded by:
Splashman, Yanga, [Anonymous]
🏷 Tags:

Now that's more like the speed I was expecting! After a few more unused functions for palette fading and rectangle blitting, we've reached the big line drawing functions. And the biggest one among them, drawing a straight line at any angle between two points using Bresenham's algorithm, actually happens to be the single longest function present in more than one binary in all of PC-98 Touhou, and #23 on the list of individual longest functions.

And it technically has a ZUN bug! If you pass a point outside the (0, 0) - (639, 399) screen range, the function will calculate a new point at the edge of the screen, so that the resulting line will retain the angle intended by the points given. Except that it does so by calculating the line slope using an integer division rather than a floating-point one :zunpet: Doesn't seem like it actually causes any weirdly skewed lines to be drawn in-game, though; that case is only hit in the Mima boss fight, which draws a few lines with a bottom coordinate of 400 rather than the maximum of 399. It might also cause the wrong background pixels to be restored during parts of the YuugenMagan fight, leading to flickering sprites, but seriously, that's an issue everywhere you look in this game.

Together with the rendering-text-to-VRAM function we've mostly already known from TH02, this pushed the total RE percentage well over 20%, and almost doubled the TH01 RE percentage, all within three pushes. And comparatively, it went really smoothly, to the point (ha) where I even had enough time left to also include the single-point functions that come next in that code segment. Since about half of the remaining functions in OP.EXE are present in more than just itself, I'll be able to at least keep up this speed until OP.EXE hits the 70% RE mark. That is, as long as the backers' priorities continue to be generic RE or "giving some love to TH01"… we don't have a precedent for TH01's actual game code yet.

And that's all the TH01 progress funded for January! Next up, we actually do have a focus on TH03's game and scoring mechanics… or at least the foundation for that.

📝 Posted:
🚚 Summary of:
P0066
Commits:
042b780...e55a48b
💰 Funded by:
Yanga, Splashman
🏷 Tags:

So, the thing that made me so excited about TH01 were all those bulky C reimplementations of master.lib functions. Identical copies in all three executables, trivial to figure out and decompile, removing tons of instructions, and providing a foundation for large parts of the game later. The first set of functions near the end of that shared code segment deals with color palette handling, and master.lib's resident palette structure in particular. (No relation to the game's resident structure.) Which directly starts us out with pretty much all the decompilation difficulties imaginable:

And as it turns out, the game doesn't even use the resident palette feature. Which adds yet another set of functions to the, uh, learning experience that ZUN must have chosen this game to be. I wouldn't be surprised if we manage to uncover actual scrapped beta game content later on, among all the unused code that's bound to still be in there.

At least decompilation should get easier for the next few TH01 pushes now… right?

📝 Posted:
🚚 Summary of:
P0065
Commits:
9faa29a...042b780
💰 Funded by:
Touhou Patch Center
🏷 Tags:

A~nd resident structures ended up being exactly the right thing to start off the new year with. WindowsTiger and spaztron64 have already been pushing for them with their own reverse-engineering, and together with my own recent GENSOU.SCR RE work, we've clarified just enough context around the harder-to-explain values to make both TH04's and TH05's structures fit nicely into the typical time frame of a single push.

With all the apparently obvious and seemingly just duplicated values, it has always been easy to do a superficial job for most of the structure, then lose motivation for the last few unknown fields. Pretty glad to got this finally covered; I've heard that people are going to write trainer tools now?

Also, where better to slot in a push that, in terms of figures, seems to deliver 0% RE and only miniscule PI progress, than at the end of Touhou Patch Center's 5-push order that already had multiple pushes yielding above-average progress? :onricdennat: As usual, we'll be reaping the rewards of this work in the next few TH04/TH05 pushes…

…whenever they get funded, that is, as for January, the backers have shifted the priorities towards TH01 and TH03. TH01 especially is something I'm quite excited about, as we're finally going to see just how fast this bloated game is really going to progress. Are you excited?

📝 Posted:
🚚 Summary of:
P0064
Commits:
80cec5b...9faa29a
💰 Funded by:
Touhou Patch Center
🏷 Tags:

🎉 TH04's and TH05's OP.EXE are now fully position-independent! 🎉

What does this mean?

You can now add any data or code to the main menus of the two games, by simply editing the ReC98 source, writing your mod in ASM or C/C++, and recompiling the code. Since all absolute memory addresses have now been converted to labels, this will work without causing any instability. See the position independence section in the FAQ for a more thorough explanation about why this was a problem.

What does this not mean?

The original ZUN code hasn't been completely reverse-engineered yet, let alone decompiled. Pretty much all of that is still ASM, which might make modding a bit inconvenient right now.

Since this push was otherwise pretty unremarkable, I made a video demonstrating a few basic things you can do with this:

Now, what to do for the last outstanding Touhou Patch Center push? Bullets, or resident structures? :thonk:

📝 Posted:
🚚 Summary of:
P0063
Commits:
034ae4b...8dbb450
💰 Funded by:
-Tom-
🏷 Tags:

Almost!

Just like most of the time, it was more sensible to cover GENSOU.SCR, the last structure missing in TH05's OP.EXE, everywhere it's used, rather than just rushing out OP.EXE position independence. I did have to look into all of the functions to fully RE it after all, and to find out whether the unused fields actually are unused. The only thing that kept this push from yielding even more above-average progress was the sheer inconsistency in how the games implemented the operations on this PC-98 equivalent of score*.dat:

Technically though, TH05's OP.EXE is position-independent now, and the rest are (should be? :tannedcirno:) merely false positives. However, TH04's is still missing another structure, in addition to its false positives. So, let's wait with the big announcement until the next push… which will also come with a demo video of what will be possible then.

📝 Posted:
🚚 Summary of:
P0062
Commits:
1d6fbb8...f275e04
💰 Funded by:
Touhou Patch Center
🏷 Tags:

Big gains, as expected, but not much to say about this one. With TH05 Reimu being way too easy to decompile after 📝 the shot control groundwork done in October, there was enough time to give the comprehensive PI false-positive treatment to two other sets of functions present in TH04's and TH05's OP.EXE. One of them, master.lib's super_*() functions, was used a lot in TH02, more than in any other game… I wonder how much more that game will progress without even focusing on it in particular.

Alright then! 100% PI for TH04's and TH05's OP.EXE upcoming… (Edit: Already got funding to cover this!)

📝 Posted:
🏷 Tags:

Did WindowsTiger just cover 2% over all games on his own? While not all of that passed my review, +1.59% RE and +1.66% PI over all 5 games is still pretty noteworthy, and comfortably pushes TH05 over the 25% mark in RE, and the 60% mark in PI.

However.

While I definitely do appreciate such contributions, reviewing and adapting these to my current code organization standards also takes more time than I'd like it to take. And taken to this level, it does kind of undermine this crowdfunding project, causing both a literal denial of service and exactly the stress that this crowdfunding was designed to avoid. Most of the time, I can't merge all of that as-is without knowingly creating annoyances down the line. But I don't want to just ignore it either, or reject every non-perfect commit…
That's also why I let it slide this time, due to some of the RE work in there being genuinely amazing. In the future though, be aware that your chance of having your work merged diminishes the further you move ahead of my current master branch. In extreme cases like this one, I'll then just be waiting until enough generic reverse-engineering pushes have accrued, and treat the merge as regular work.

But now, time to continue with the regular programming… I am kind of exhausted from all of this, so no bullets for the next two Touhou Patch Center pushes, still… Good thing there's still plenty of simpler things with big percentage gains to be done:

📝 Posted:
🚚 Summary of:
P0061
Commits:
96684f4...5f4f5d8
💰 Funded by:
Touhou Patch Center
🏷 Tags:

… nope, with a game whose MAIN.EXE is still just 5% reverse-engineered and which naturally makes heavy use of structures, there's still a lot more PI groundwork to be done before RE progress can speed up to the levels that we've now reached with TH05. The good news is that this game is (now) way easier to understand: In contrast to TH04 and TH05, where we needed to work towards player shots over a two-digit number of pushes, TH03 only needed two for SPRITE16, and a half one for the playfield shaking mechanism. After that, I could even already decompile the per-frame shot update and render functions, thanks to TH03's high number of code segments. Now, even the big 128-byte player structure doesn't seem all too far off.

Then again, as TH03 shares no code with any other game, this actually was a completely average PI push. For the remaining three, we'll return to TH04 and TH05 though, which should more than make up for the slight drop in RE speed after this one.

In other news, we've now also reached peak C++, with the introduction of templates! TH03 stores movement speeds in a 4.4 fixed-point format, which is an 8-bit spin on the usual 16-bit, 12.4 fixed-point format.

📝 Posted:
🚚 Summary of:
P0060
Commits:
29385dd...73f5ae7
💰 Funded by:
Touhou Patch Center
🏷 Tags:

So, where to start? Well, TH04 bullets are hard, so let's procrastinate start with TH03 instead :tannedcirno: The 📝 sprite display functions are the obvious blocker for any structure describing a sprite, and therefore most meaningful PI gains in that game… and I actually did manage to fit a decompilation of those three functions into exactly the amount of time that the Touhou Patch Center community votes alloted to TH03 reverse-engineering!

And a pretty amazing one at that. The original code was so obviously written in ASM and was just barely decompilable by exclusively using register pseudovariables and a bit of goto, but I was able to abstract most of that away, not least thanks to a few helpful optimization properties of Turbo C++… seriously, I can't stop marveling at this ancient compiler. The end result is both readable, clear, and dare I say portable?! To anyone interested in porting TH03, take a look. How painful would it be to port that away from 16-bit x86?

However, this push is also a typical example that the RE/PI priorities can only control what I look at, and the outcome can actually differ greatly. Even though the priorities were 65% RE and 35% PI, the progress outcome was +0.13% RE and +1.35% PI. But hey, we've got one more push with a focus on TH03 PI, so maybe that one will include more RE than PI, and then everything will end up just as ordered? :onricdennat:

📝 Posted:
🚚 Summary of:
P0059
Commits:
01de290...8b62780
💰 Funded by:
[Anonymous], -Tom-
🏷 Tags:

With no feedback to 📝 last week's blog post, I assume you all are fine with how things are going? Alright then, another one towards position independence, with the same approach as before…

Since -Tom- wanted to learn something about how the PC-98 EGC is used in TH04 and TH05, I took a look at master.lib's egc_shift_*() functions. These simply do a hardware-accelerated memmove() of any VRAM region, and are used for screen shaking effects. Hover over the image below for the raw effect:

Demonstration of an egc_shift_left() call

Then, I finally wanted to take a look at the bullet structures, but it required way too much reverse-engineering to even start within ¾ of a position independence push. Even with the help of uth05win – bullet handling was changed quite a bit from TH04 to TH05.

What I ultimately settled on was more raw, "boring" PI work based around an already known set of functions. For this one, I looked at vector construction… and this time, that actually made the games a little bit more position-independent, and wasn't just all about removing false positives from the calculation. This was one of the few sets of functions that would also apply to TH01, and it revealed just how chaotically that game was coded. This one commit shows three ways how ZUN stored regular 2D points in TH01:

… yeah. But in more productive news, this did actually lay the groundwork for TH04 and TH05 bullet structures. Which might even be coming up within the next big, 5-push order from Touhou Patch Center? These are the priorities I got from them, let's see how close I can get!

📝 Posted:
🚚 Summary of:
P0057, P0058
Commits:
1cb9731...ac7540d, ac7540d...fef0299
💰 Funded by:
[Anonymous], -Tom-
🏷 Tags:

So, here we have the first two pushes with an explicit focus on position independence… and they start out looking barely different from regular reverse-engineering? They even already deduplicate a bunch of item-related code, which was simple enough that it required little additional work? Because the actual work, once again, was in comparing uth05win's interpretations and naming choices with the original PC-98 code? So that we only ended up removing a handful of memory references there?

(Oh well, you can mod item drops now!)

So, continuing to interpret PI as a mere by-product of reverse-engineering might ultimately drive up the total PI cost quite a bit. But alright then, let's systematically clear out some false positives by looking at master.lib function calls instead… and suddenly we get the PI progress we were looking for, nicely spread out over all games since TH02. That kinda makes it sound like useless work, only done because it's dictated by some counting algorithm on a website. But decompilation will want to convert all of these values to decimal anyway. We're merely doing that right now, across all games.

Then again, it doesn't actually make any game more position-independent, and only proves how position-independent it already was. So I'm really wondering right now whether I should just rush actual position independence by simply identifying structures and their sizes, and not bother with members or false positives until that's done. That would certainly get the job done for TH04 and TH05 in just a few more pushes, but then leave all the proving work (and the road to 100% PI on the front page) to reverse-engineering.

I don't know. Would it be worth it to have a game that's "maybe fully position-independent", only for there to maybe be rare edge cases where it isn't?

Or maybe, continuing to strike a balance between identifying false positives (fast) and reverse-engineering structures (slow) will continue to work out like it did now, and make us end up close to the current estimate, which was attractive enough to sell out the crowdfunding for the first time… 🤔

Please give feedback! If possible, by Friday evening UTC+1, before I start working on the next PI push, this time with a focus on TH04.

📝 Posted:
🚚 Summary of:
P0056
Commits:
b28cefc...c09446a
💰 Funded by:
rosenrose, [Anonymous]
🏷 Tags:

No priorities, again…?! Please don't do this to me… 😕

Well, let's not continue with TH05 then 😛 And instead use the occasion to commit this interesting discovery, made by @m1yur1 last year. Yup, TH03's "ZUNSP" sprite driver is actually a "rebranded" version of Promisence Soft's SPRITE16.COM. Sure, you were allowed to use this driver in your own game, but replacing the copyright with your own isn't exactly the nicest thing to do… That now makes three library programmers that ZUN didn't credit. Makes me wonder what makes M. Kajihara so special. Probably the fact that Touhou has always been about the music for ZUN, first and foremost.

But what makes this more than a piece of trivia is the fact that Promiscence Soft's SPRITE16 sample game StormySpace was bundled with documentation on the driver. Shoutout to the Neo Kobe PC-98 collection for preserving he original release!

That means more documented third-party code that we don't necessarily have to reverse-engineer, just like master.lib or KAJA's PMD driver. However, the PC-98 EGC is rather complex and definitely not designed for alpha-tested 16-color sprite blitting. So it (once again) took quite a while to make sense of SPRITE16's code and the available documentation on the EGC, to come up with satisfying function names. As a result, I'm going to distribute the entire RE work related to TH03's SPRITE16 interface across a total of three pushes, this one being the first of them.

Next up, we do have more TH05 progress though, now for the first time specifically with position independence in mind. Pretty excited for this one!

📝 Posted:
🚚 Summary of:
P0036, P0037
Commits:
a533b5d...82b0e1d, 82b0e1d...e7e1cbc
💰 Funded by:
zorg
🏷 Tags:

And just in time for zorg's last outstanding pushes, the TH05 shot type control functions made the speedup happen!

It would have been really nice to also include Reimu's shot control functions in this last push, but figuring out this entire system, with its weird bitflags and switch statement micro-optimizations, was once again taking way longer than it should have. Especially with my new-found insistence on turning this obvious copy-pasta into something somewhat readable and terse…

But with such a rather tabular visual structure, things should now be moddable in hopefully easily consistent way. Of course, since we're only at 54% position independence for MAIN.EXE, this isn't possible yet without crashing the game, but modifying damage would already work.

Despite my earlier claims of ZUN only having used C++ in TH01, as it's the only game using new and delete, it's now pretty much confirmed that ZUN used it for all games, as inlined functions (and by extension, C++ class methods) are the only way to get certain instructions out of the Turbo C++ code generator. Also, I've kept my promise and started really filling that decompilation pattern file.

And now, with the reverse-engineering backlog finally being cleared out, we wait for the next orders, and the direction they might focus on…

📝 Posted:
🚚 Summary of:
P0034, P0035
Commits:
6cdd229...6f1f367, 6f1f367...a533b5d
💰 Funded by:
zorg
🏷 Tags:

Deathbombs confirmed, in both TH04 and TH05! On the surface, it's the same 8-frame window as in most Windows games, but due to the slightly lower PC-98 frame rate of 56.4 Hz, it's actually slightly more lenient in TH04 and TH05.

The last function in front of the TH05 shot type control functions marks the player's previous position in VRAM to be redrawn. But as it turns out, "player" not only means "the player's option satellites on shot levels ≥ 2", but also "the explosion animation if you lose a life", which required reverse-engineering both things, ultimately leading to the confirmation of deathbombs.

It actually was kind of surprising that we then had reverse-engineered everything related to rendering all three things mentioned above, and could also cover the player rendering function right now. Luckily, TH05 didn't decide to also micro-optimize that function into un-decompilability; in fact, it wasn't changed at all from TH04. Unlike the one invalidation function whose decompilation would have actually been the goal here…

But now, we've finally gotten to where we wanted to… and only got 2 outstanding decompilation pushes left. Time to get the website ready for hosting an actual crowdfunding campaign, I'd say – It'll make a better impression if people can still see things being delivered after the big announcement.

📝 Posted:
🚚 Summary of:
P0031, P0032, P0033
Commits:
dea40ad...9f764fa, 9f764fa...e6294c2, e6294c2...6cdd229
💰 Funded by:
zorg
🏷 Tags:

The glacial pace continues, with TH05's unnecessarily, inappropriately micro-optimized, and hence, un-decompilable code for rendering the current and high score, as well as the enemy health / dream / power bars. While the latter might still pass as well-written ASM, the former goes to such ridiculous levels that it ends up being technically buggy. If you enjoy quality ZUN code, it's definitely worth a read.

In TH05, this all still is at the end of code segment #1, but in TH04, the same code lies all over the same segment. And since I really wanted to move that code into its final form now, I finally did the research into decompiling from anywhere else in a segment.

Turns out we actually can! It's kinda annoying, though: After splitting the segment after the function we want to decompile, we then need to group the two new segments back together into one "virtual segment" matching the original one. But since all ASM in ReC98 heavily relies on being assembled in MASM mode, we then start to suffer from MASM's group addressing quirk. Which then forces us to manually prefix every single function call

with the group name. It's stupidly boring busywork, because of all the function calls you mustn't prefix. Special tooling might make this easier, but I don't have it, and I'm not getting crowdfunded for it.

So while you now definitely can request any specific thing in any of the 5 games to be decompiled right now, it will take slightly longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)

Only one function away from the TH05 shot type control functions now!

📝 Posted:
🚚 Summary of:
P0029, P0030
Commits:
6ff427a...c7fc4ca, c7fc4ca...dea40ad
💰 Funded by:
zorg
🏷 Tags:

Here we go, new C code! …eh, it will still take a bit to really get decompilation going at the speeds I was hoping for. Especially with the sheer amount of stuff that is set in the first few significant functions we actually can decompile, which now all has to be correctly declared in the C world. Turns out I spent the last 2 years screwing up the case of exported functions, and even some of their names, so that it didn't actually reflect their calling convention… yup. That's just the stuff you tend to forget while it doesn't matter.

To make up for that, I decided to research whether we can make use of some C++ features to improve code readability after all. Previously, it seemed that TH01 was the only game that included any C++ code, whereas TH02 and later seemed to be 100% C and ASM. However, during the development of the soon to be released new build system, I noticed that even this old compiler from the mid-90's, infamous for prioritizing compile speeds over all but the most trivial optimizations, was capable of quite surprising levels of automatic inlining with class methods…

…leading the research to culminate in the mindblow that is 9d121c7 – yes, we can use C++ class methods and operator overloading to make the code more readable, while still generating the same code than if we had just used C and preprocessor macros.

Looks like there's now the potential for a few pull requests from outside devs that apply C++ features to improve the legibility of previously decompiled and terribly macro-ridden code. So, if anyone wants to help without spending money…

📝 Posted:
🚚 Summary of:
P0028
Commits:
6023f5c...6ff427a
💰 Funded by:
zorg
🏷 Tags:

Back to actual development! Starting off this stretch with something fairly mechanical, the few remaining generic boss and midboss state variables. And once we start converting the constant numbers used for and around those variables into decimal, the estimated position independence probability immediately jumped by 5.31% for TH04's MAIN.EXE, and 4.49% for TH05's – despite not having made the game any more position- independent than it was before. Yup… lots of false positives in there, but who can really know for sure without having put in the work.

But now, we've RE'd enough to finally decompile something again next, 4 years after the last decompilation of anything!

📝 Posted:
🚚 Summary of:
P0016, P0017
Commits:
(Website) 98b5090...bca833b, (Website) 3f81d1f...d5b9ea2
💰 Funded by:
qp
🏷 Tags:

Website development time: 12/12

Calculating the average speed of the previous crowdfunded pushes, we arrive at estimated "goals" of…

Crowdfunding estimate at 60235fc

So, time's up, and I didn't even get to the entire PayPal integration and FAQ parts… 😕 Still got to clarify a couple of legal questions before formally starting this, though. So for now, let's continue with zorg's next 5 TH05 reverse-engineering and decompilation pushes, and watch those prices go down a bit… hopefully quite significantly!

📝 Posted:
🚚 Summary of:
P0013, P0014, P0015
Commits:
(Website) b9805d2...efeddd8, (Website) 31474a0...9dc9632, (Website) 9dc9632...8d3652f
💰 Funded by:
qp
🏷 Tags:

Website development time: 10/12

In order to be able to calculate how many instructions and absolute memory references are actually being removed with each push, we first need the database with the previous pushes from the Discord crowdfunding days. And while I was at it, I also imported the summary posts from back then.

Also, we now got something resembling a web design!

Crowdfunding logBlog
📝 Posted:
🚚 Summary of:
P0012
Commits:
(Website) b9918cc...b9805d2
💰 Funded by:
qp
🏷 Tags:

Website development time: 7/12

So yeah, "upper bound" and "probability". In reality it's certainly better than the numbers suggest, but as I keep saying, we can't say much about position independence without having reverse-engineered everything.

Next up: Money goals.

Upper bound of remaining absolute memory references at 60235fcProbability of position independence at 60235fc
📝 Posted:
🚚 Summary of:
P0011
Commits:
(Website) 40c1e98...b9918cc
💰 Funded by:
qp
🏷 Tags:

Website development time: 6/12

Here we go, overall ReC98 reverse-engineering progress. Now viewable for every commit on the page.

Number of not yet reverse-engineered x86 instructions at 60235fcReverse-engineering completion percentage at 60235fc
📝 Posted:
🚚 Summary of:
P0010, P0054, P0055
Commits:
(Website) cbda977...94127fb, (Website) 94127fb...3161d7e, (Website) 3161d7e...40c1e98
💰 Funded by:
DTM, Egor
🏷 Tags:

Website development time: 5/12

Now with the number of not yet RE'd x86 instructions the you might have seen in the thpatch Discord. They're a bit smaller now, didn't filter out a couple of directives back then.

Yes, requesting these currently is super slow. That's why I didn't want to have everyone here yet!

Next step: Figuring out the actual total number of game code instructions, for that nice "% done". Also, trying to do the same for position independence.

📝 Posted:
🚚 Summary of:
P0051, P0052, P0053
Commits:
6ed8e60...3ba536a
💰 Funded by:
-Tom-
🏷 Tags:

Boss explosions! And… urgh, I really also had to wade through that overly complicated HUD rendering code. Even though I had to pick -Tom-'s 7th push here as well, the worst of that is still to come. TH04 and TH05 exclusively store the current and high score internally as unpacked little-endian BCD, with some pretty dense ASM code involving the venerable x86 BCD instructions to update it.

So, what's actually the goal here. Since I was given no priorities :onricdennat:, I still haven't had to (potentially) waste time researching whether we really can decompile from anywhere else inside a segment other than backwards from the end. So, the most efficient place for decompilation right now still is the end of TH05's main_01_TEXT segment. With maybe 1 or 2 more reverse-engineering commits, we'd have everything for an efficient decompilation up to sub_123AD. And that mass of code just happens to include all the shot type control functions, and makes up 3,007 instructions in total, or 12% of the entire remaining unknown code in MAIN.EXE.

So, the most reasonable thing would be to actually put some of the upcoming decompilation pushes towards reverse-engineering that missing part. I don't think that's a bad deal since it will allow us to mod TH05 shot types in C sooner, but zorg and qp might disagree :thonk:

Next up: thcrap TL notes, followed by finally finishing GhostPhanom's old ReC98 future-proofing pushes. I really don't want to decompile without a proper build system.

📝 Posted:
🚚 Summary of:
P0049, P0050
Commits:
893bd46...6ed8e60
💰 Funded by:
-Tom-
🏷 Tags:

Sometimes, "strategically picking things to reverse-engineer" unfortunately also means "having to move seemingly random and utterly uninteresting stuff, which will only make sense later, out of the way". Really, this was so boring. Gonna get a lot more exciting in the next ones though.

📝 Posted:
🚚 Summary of:
P0047, P0048
Commits:
9a2c6f7...893bd46
💰 Funded by:
-Tom-
🏷 Tags:

So, let's continue with player shots! …eh, or maybe not directly, since they involve two other structure types in TH05, which we'd have to cover first. One of them is a different sort of sprite, and since I like me some context in my reverse-engineering, let's disable every other sprite type first to figure out what it is.

One of those other sprite types were the little sparks flying away from killed stage enemies, midbosses, and grazed bullets; easy enough to also RE right now. Turns out they use the same 8 hardcoded 8×8 sprites in TH02, TH04, and TH05. Except that it's actually 64 16×8 sprites, because ZUN wanted to pre-shift them for all 8 possible start pixels within a planar VRAM byte (rather than, like, just writing a few instructions to shift them programmatically), leading to them taking up 1,024 bytes rather than just 64.
Oh, and the thing I wanted to RE *actually* was the decay animation whenever a shot hits something. Not too complex either, especially since it's exclusive to TH05.

And since there was some time left and I actually have to pick some of the next RE places strategically to best prepare for the upcoming 17 decompilation pushes, here's two more function pointers for good measure.

📝 Posted:
🚚 Summary of:
P0046
Commits:
612beb8...deb45ea
💰 Funded by:
-Tom-
🏷 Tags:

Stumbled across one more drawing function in the way… which was only a duplicated and seemingly pointlessly micro-optimized copy of master.lib's super_roll_put_tiny() function, used for fast display of 4-color 16×16 sprites.

With this out of the way, we can tackle player shot sprite animation next. This will get rid of a lot of code, since every power level of every character's shot type is implemented in its own function. Which makes up thousands of instructions in both TH04 and TH05 that we can nicely decompile in the future without going through a dedicated reverse-engineering step.

📝 Posted:
🚚 Summary of:
P0043, P0044, P0045
Commits:
261d503...612beb8
💰 Funded by:
-Tom-
🏷 Tags:

Turns out I had only been about half done with the drawing routines. The rest was all related to redrawing the scrolling stage backgrounds after other sprites were drawn on top. Since the PC-98 does have hardware-accelerated scrolling, but no hardware-accelerated sprites, everything that draws animated sprites into a scrolling VRAM must then also make sure that the background tiles covered by the sprite are redrawn in the next frame, which required a bit of ZUN code. And that are the functions that have been in the way of the expected rapid reverse-engineering progress that uth05win was supposed to bring. So, looks like everything's going to go really fast now?

📝 Posted:
🚚 Summary of:
P0025, P0026, P0027
Commits:
0cde4b7...261d503
💰 Funded by:
zorg
🏷 Tags:

… yeah, no, we won't get very far without figuring out these drawing routines.
Which process data that comes from the .STD files. Which has various arrays related to the background… including one to specify the scrolling speed. And wait, setting that to 0 actually is what starts a boss battle?

So, have a TH05 Boss Rush patch: 2018-12-26-TH05BossRush.zip Theoretically, this should have also worked for TH04, but for some reason, the Stage 3 boss gets stuck on the first phase if we do this?

Here's the diff for the Boss Rush. Turning it into a thcrap-style Skipgame patch is left as an exercise for the reader.

📝 Posted:
🚚 Summary of:
P0023, P0024
Commits:
807df3d...0cde4b7
💰 Funded by:
zorg
🏷 Tags:

Actually, I lied, and lasers ended up coming with everything that makes reverse-engineering ZUN code so difficult: weirdly reused variables, unexpected structures within structures, and those TH05-specific nasty, premature ASM micro-optimizations that will waste a lot of time during decompilation, since the majority of the code actually was C, except for where it wasn't.

📝 Posted:
🚚 Summary of:
P0042
Commits:
f3b6137...807df3d
💰 Funded by:
-Tom-
🏷 Tags:

Laser… is not difficult. In fact, out of the remaining entity types I checked, it's the easiest one to fully grasp from uth05win alone, as it's only drawn using master.lib's line, circle, and polygon functions. Everything else ends up calling… something sprite-related that needs to be RE'd separately, and which uth05win doesn't help with, at all.

Oh, and since the speed of shoot-out lasers (as used by TH05's Stage 2 boss, for example) always depends on rank, we also got this variable now.

This only covers the structure itself – uth05win's member names for the LASER structure were not only a bit too unclear, but also plain wrong and misleading in one instance. The actual implementation will follow in the next one.

📝 Posted:
🚚 Summary of:
P0041
Commits:
b03bc91...f3b6137
💰 Funded by:
-Tom-
🏷 Tags:

So, after introducing instruction number statistics… let's go for over 2,000 lines that won't show up there immediately :tannedcirno: That being (mid-)boss HP, position, and sprite ID variables for TH04/TH05. Doesn't sound like much, but it kind of is if you insist on decimal numbers for easier comparison with uth05win's source code.

📝 Posted:
🚚 Summary of:
P0040
Commits:
d7483c0...b03bc91
💰 Funded by:
-Tom-
🏷 Tags:

Let's start this stretch with a pretty simple entity type, the growing and shrinking circles shown during bomb animations and around bosses in TH04 and TH05. Which can be drawn in varying colors… wait, what's all this inlined and duplicated GRCG mode and color setting code? Let's move that out into macros too, it takes up too much space when grepping for constants, and will raise a "wait, what was that I/O port doing again" question for most people reading the code again after a few months.

🎉 With this push, we've also hit a milestone! Less than 200,000 unknown x86 instructions remain until we've completely reverse-engineered all of PC-98 Touhou.

📝 Posted:
🚚 Summary of:
P0009
Commits:
79cc3ed...141baa4
💰 Funded by:
DTM
🏷 Tags:

While we're waiting for Bruno to release the next thcrap build with ANM header patching, here are the resulting commits of the ReC98 CDG/CD2 special offer purchased by DTM, reverse-engineering all code that covers these formats.

📝 Posted:
🚚 Summary of:
P0019, P0020, P0021, P0022
Commits:
c592464, cbe8a37, 8dfc2cd, 79cc3ed
💰 Funded by:
zorg
🏷 Tags:
> OK, let's do a quick ReC98 update before going back to thcrap, shouldn't take long > Hm, all that input code is kind of in the way, would be nice to cover that first to ease comparisons with uth05win's source code > What the hell, why does ZUN do this? Need to do more research > … > OK, research done, wait, what are those other functions doing? > Wha, everything about this is just ever so slightly awkward

Which ended up turning this one update into 2/10, 3/10, 4/10 and 5/10 of zorg's reverse-engineering commits. But at least we now got all shared input functions of TH02-TH05 covered and well understood.

📝 Posted:
🚚 Summary of:
P0018
Commits:
746681d...178d589
💰 Funded by:
zorg
🏷 Tags:

What do you do if the TH06 text image feature for thcrap should have been done 3 days™ ago, but keeps getting more and more complex, and you have a ton of other pushes to deliver anyway? Get some distraction with some light ReC98 reverse-engineering work. This is where it becomes very obvious how much uth05win helps us with all the games, not just TH05.

5a5c347 is the most important one in there, this was the missing substructure that now makes every other sprite-like structure trivial to figure out.

📝 Posted:
🚚 Summary of:
P0008
Commits:
47e5601...d62dd06
💰 Funded by:
-Tom-
🏷 Tags:

You could use this to get a homing Mima, for example.