Surprise! The last missing main menu in PC-98 Touhou was, in fact, not that hard. Finishing the rest of TH03's OP.EXE took slightly less than the expected 2 pushes, which left enough room to uncover an unexpected mystery and take important leaps in position independence…
For TH03, ZUN stepped up the visual quality of the main menu items by exchanging TH02's monospaced font with fixed, pre-composited strings of proportional text. While TH04 would later place its menu text in VRAM, TH03 still wanted to stay with TH02's approach of using gaiji to display the menu items on the PC-98 text layer. Since gaiji have a fixed size of 16×16 pixels, this requires the pre-composited bitmaps to be cut into blocks of that size and padded with blank pixels as necessary:
If your combined amount of text is short enough to fit into the PC-98's 256 gaiji slots, this is a nice way of using hardware features to replace the need for a proportional text renderer. It especially simplifies transitions between menus – simply wiping the entire TRAM is both cheap and certainly less error-prone than (un)blitting pixels in VRAM, which 📝 ZUN was always kind of sloppy at.
However, all this text still needs to be composited and cut into gaiji somewhere. If you do that manually, it's easy to lose sight of how the text is supposed to appear on screen, especially if you decide to horizontally center it. Then, you're in for some awkward coordinate fiddling as you try to place these 16-pixel bricks into the 8-pixel text grid to somehow make it all appear centered:
The VS Start menu actually is correctly centered.
Then again, did ZUN actually want to center the Option menu like this? Even the main menu looks kind of uncanny with perfect centering, probably because I'm so used to the original. Imperfect centering usually counts as a bug, but this case is quirky enough to leave it as is. We might want to perfectly center any future translations, but that would definitely cost a bit as I'd then actually need to write that proportional text renderer.
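To make the coordinate fiddling a bit more tangible, here's a minimal sketch of the arithmetic involved. It assumes the usual 640-pixel-wide screen with 8-pixel-wide text cells; all function and type names are made up for illustration and don't appear anywhere in the game:

```cpp
// Hypothetical helpers, not taken from ZUN's code.
const int GAIJI_W = 16;     // width of a single gaiji in pixels
const int TRAM_CELL_W = 8;  // width of a single text RAM cell in pixels
const int SCREEN_W = 640;   // horizontal resolution of the PC-98 screen

// Number of 16×16 gaiji needed to cover a string of the given pixel width.
int gaiji_count(int text_w)
{
    return ((text_w + GAIJI_W - 1) / GAIJI_W);
}

// Blank pixels that pad the last gaiji of the string.
int padding_px(int text_w)
{
    return ((gaiji_count(text_w) * GAIJI_W) - text_w);
}

struct Placement {
    int tram_column; // text RAM column of the first gaiji
    int error_px;    // pixels lost by snapping to the 8-pixel grid
};

// Places the string as close to the horizontal center as the text grid
// allows, rounding toward the left.
Placement center_on_tram(int text_w)
{
    const int ideal_left = ((SCREEN_W - text_w) / 2);
    Placement ret;
    ret.tram_column = (ideal_left / TRAM_CELL_W);
    ret.error_px = (ideal_left - (ret.tram_column * TRAM_CELL_W));
    return ret;
}
```

A 100-pixel string, for example, would need 7 gaiji with 12 pixels of padding, and lands 6 pixels to the left of the ideal position when snapped to column 33. Any remaining error would have to be baked into the bitmap itself, by shifting the text within its padding.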
Apart from that, we're left with only a very short list of actual bugs and landmines:
The Cancel key is not handled inside the VS menu, arrgghh…! 🤬
ZUN almost managed to write a title screen and menu without a 📝 screen📝 tearing landmine, but a single one still managed to sneak into the first frame of the title screen's short fade-in animation. This one will blow up when returning from the Music Room, and can be entirely blamed on that screen's choice to leave 📝 a purple color in hardware palette slot 0. Replacing that color with black before returning would have completely hidden the potential tearing.
There might be another one in the long sliding animation, but I can only tell for sure once I've fully decompiled MAINL.EXE.
While the rest of the code is not free of the usual nitpicks, those don't matter in the grand scheme of things. The code for the sliding 東方夢時空 animation is even better: it makes decent use of the EGC and page flipping, and places the 📝 loading calls for the character selection portraits at sensible points where the animation naturally wants to have a delay anyway. We're definitely ending the main menus of PC-98 Touhou on a high note here.
You might have already spotted some unfamiliar text in the gaiji above, and indeed, we've got three pieces of unused text in these two menus! Starting from the top, the label is entirely unused as none of its gaiji IDs are referenced anywhere in the final code. The label's placement within the gaiji IDs would imply that this option was once part of the main menu, but nothing in the game suggests that the main menu ever had a bigger box that could fit a 7th element. On the contrary, every piece of menu code assumes that the box sprites loaded from OPWIN.BFT are exactly 128 pixels high:
Fun fact: The code doesn't even use the 16 pixels in the middle, and instead just assumes that the pixels between the X coordinates of [8; 16[ and [32; 40[ are identical.
The unused MIDI music option has already been widely documented elsewhere. Changing the first byte in YUME.CFG to 02 has no functional effect because ZUN removed most MIDI-related code before release. He did forget a few instances though, and the surviving dedicated switch case in the Option menu is now the entire reason why you can reveal this text without modifying the binary. Changing the option will always flip its value back to either off or FM(86).
Last but not least, we have the label and its associated numbers. These are the most interesting ones in my eyes; nobody talks about them, even though we have definite proof that they were used for the KeyConfig options at some earlier point in development:
That's all I've got about the menus, so let's talk characters and gameplay! When playing Story Mode, OP.EXE picks the opponents for all stages immediately after the 📝 Select screen has faded out. Each character fights a fixed and hardcoded opponent in Stage 7's Decisive Match:
| Player | Stage 7 opponent |
| --- | --- |
| Reimu | Mima |
| Mima | Reimu |
| Marisa | Reimu |
| Ellen | Marisa |
| Kotohime | Reimu |
| Kana | Ellen |
| Rikako | Kana |
| Chiyuri | Kotohime |
| Yumemi | Rikako |
The opponents for the first 6 stages, however, are indeed completely random, and picked by master.lib's reimplementation of the Borland RNG. The game only needs to ensure that no character is picked twice, which it does like this:
const int stage_7_opponent = HARDCODED_STAGE_7_OPPONENT_FOR[playchar];
bool opponent_seen[7] = { false };
for(int stage = 0; stage < 6; stage++) {
    int candidate;
    do {
        // Pick a random character between Reimu and Rikako
        candidate = (irand() % 7);
    } while(opponent_seen[candidate] || (stage_7_opponent == candidate));
    opponent_seen[candidate] = true;
    story_opponent[stage] = candidate;
}
Characters are numbered from 0 (Reimu) to 8 (Yumemi), following the order in the Stage 7 table above.
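If you want to play around with this algorithm outside of the game, here's a self-contained sketch of it. The RNG is an assumption on my part: it uses the commonly documented constants of Borland's rand(), and I haven't verified here that master.lib's irand() matches it bit for bit. The struct and function names are hypothetical, and the added roll counter is not part of the original code:

```cpp
#include <array>
#include <cstdint>

// Stage 7 opponents from the table above, with characters numbered from
// 0 (Reimu) to 8 (Yumemi).
const std::array<int, 9> STAGE_7_OPPONENT_FOR = { 1, 0, 0, 2, 0, 3, 5, 4, 6 };

// Assumption: a Borland-style linear congruential generator.
struct BorlandLikeRng {
    uint32_t seed;

    int next()
    {
        seed = ((seed * 0x015A4E35u) + 1u);
        return static_cast<int>((seed >> 16) & 0x7FFF);
    }
};

struct StoryOpponents {
    std::array<int, 6> opponent; // opponents for stages 1 to 6
    int rolls;                   // total number of RNG calls
};

StoryOpponents pick_story_opponents(int playchar, uint32_t seed)
{
    BorlandLikeRng rng = { seed };
    const int stage_7_opponent = STAGE_7_OPPONENT_FOR[playchar];
    std::array<bool, 7> opponent_seen = {};
    StoryOpponents ret = {};
    for(int stage = 0; stage < 6; stage++) {
        int candidate;
        do {
            // Re-roll until we get a character between Reimu and Rikako
            // who is neither taken nor reserved for Stage 7.
            candidate = (rng.next() % 7);
            ret.rolls++;
        } while(opponent_seen[candidate] || (candidate == stage_7_opponent));
        opponent_seen[candidate] = true;
        ret.opponent[stage] = candidate;
    }
    return ret;
}
```

Since the RNG state is a sketch, the roll counts of this version are not guaranteed to match the ones from the real game for any given seed.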
Yup. For every stage, ZUN re-rolls until the RNG returns a character who hasn't yet been seen in a previous stage – even in Stage 6 where there's only one possible character left. Since each successive stage makes it harder for the inner loop to find a valid answer, you start to wonder if there is some unlucky combination of seed and player character that causes the game to just hang forever.
So I tested all 2³² possible seed values for all 9 player characters and… nope, Borland's RNG is good enough to eventually always return the only possible answer. The inner loop for Stage 6 does occasionally run for a disproportionate number of iterations, with the worst case being 134 re-rolls when playing Rikako's Story Mode with a seed value of 0x099BDA86. But even that is many orders of magnitude away from manifesting as any kind of noticeable delay. And on average, it just takes 17.15 iterations to determine all 6 random opponents.
The attract demos are another intriguing aspect that I initially didn't even have on my radar for the main menu. touhou-memories raises an interesting question: The demos start at Gauge and Boss Attack level 9, which would imply Lunatic difficulty, but the enemy formations don't match what you'd normally get on Lunatic. So, which difficulty were they recorded on?
Our already RE'd code clears up the first part of that question. TH03's demos are not recordings, but simply regular VS rounds in CPU vs. CPU mode that automatically quit back to the title screen after 7,000 frames. They can only possibly appear pre-recorded because the game cycles through a mere four hardcoded character pairings with fixed RNG seeds:
| Demo # | P1 | P2 | Seed |
| --- | --- | --- | --- |
| 1 | Mima | Reimu | 600 |
| 2 | Marisa | Rikako | 1000 |
| 3 | Ellen | Kana | 3200 |
| 4 | Kotohime | Marisa | 500 |
Certainly an odd choice if your game already had the feature to let arbitrary CPU-controlled characters fight each other. That would have even naturally worked for the trial version, which doesn't contain demos at all.
Then again, even a "random" character selection would have appeared deterministic to an outside observer. As usual for PC-98 Touhou, the RNG seed is initialized to 0 at startup and then simply increments after every frame you spend on the title screen and inside the top-level main, Option, and character selection menus – and yes, it does stay constant inside the VS Start menu. But since these demos always start after waiting exactly 520 frames on the title screen without pressing any key to enter the main menu, there's no actual source of randomness anywhere. ZUN could have classically initialized the RNG with the current system time, which is what we used to do back in the day before operating systems had easily accessible APIs for true randomness, but he chose not to, for whatever reason.
The difficulty question, however, is not so easy to answer. The demo startup code in the main menu doesn't override the configured difficulty, and neither does any other of the binaries depending on the demo ID. This seems to suggest that the demos simply run at the difficulty you last configured in the Option menu, just like regular VS matches. But then, you'd expect them to run differently depending on that difficulty, which they demonstrably don't. They always start on Gauge and Boss Attack level 9, and their last frame before the exit animation is always identical, right down to the score, reinforcing the pre-recorded impression:
Note that it takes much longer than the expected 2:04 minutes for the game to reach this end state. Each WARNING!! You are forced to evade / Your life is in peril popup freezes gameplay for 26 frames which don't count toward the demo frame counter. That's why these popups will provide such a great 📝 resynchronization opportunity for netplay. It's almost as if Versus Touhou was designed from the start with rollback netcode in mind!
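For reference, here's where the 2:04 comes from and how the popups stretch it. This sketch assumes the PC-98's frame rate of roughly 56.4 Hz; the number of popups depends on how the round plays out, so it's left as a parameter, and the function name is made up:

```cpp
const double FRAME_RATE_HZ = 56.4;  // approximate PC-98 refresh rate
const int DEMO_FRAMES = 7000;       // frames counted by the demo timer
const int POPUP_FREEZE_FRAMES = 26; // uncounted frames per WARNING!! popup

// Real-time duration of a demo that shows the given number of popups.
double demo_seconds(int popups)
{
    const int total_frames = (DEMO_FRAMES + (popups * POPUP_FREEZE_FRAMES));
    return (total_frames / FRAME_RATE_HZ);
}
```

Without any popups, this yields about 124.1 seconds, and every 10 popups add roughly another 4.6 seconds on top.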
With quite a bit of time left over in the second push, it made sense to look at a bit of code around the Gauge and Boss Attack levels to hopefully get a better idea of what's going on there. The Gauge Attack levels are very straightforward – they can range from 1 to 16 inclusive, which matches the range that the game can depict with its gaiji, and all parts of the game agree about how they're interpreted:
Stored in GAMEFT.BFT.
The same can't be said about the Boss Attack level though, as the gauge and the WARNING!! popup interpret the same internal variable as two different levels?
This apparent inconsistency raises quite a few questions. After all, these gaiji have to be addressed by adding an offset from 0 to 15 to the ID of the 1 gaiji, but the levels are supposed to range from 1 to 16. Does this mean that one of these two displays has an off-by-one error? You can't fire a Level 0 Boss Attack because the level always increments before every attack, but would 0 still be a technically valid Boss Attack level?
Decompiling the static HUD code debunks at least the first question as ZUN resolves the apparent off-by-one error by explicitly capping the displayed level to 16. And indeed, if a round lasts until the maximum Boss Attack level, the two numbers end up matching:
This suggests that the popup indicates the level of the incoming attack while the gauge indicates the level of the next one to be fired by any player. That said, this theory not only needs tons of comments to explain it within the code, but also contradicts 夢時空.TXT, which explicitly describes the level next to the gauge as the 現在のBOSSアタックのレベル. Still, it remains our best bet until we've decompiled a few of the Boss Attacks and seen how they actually use this single variable.
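Expressed in code, that theory would look something like the sketch below. To be clear, this is only one possible reading: the function names are hypothetical, and I'm assuming that the internal variable directly holds the level shown in the popup.

```cpp
#include <algorithm>

const int BOSS_ATTACK_LEVEL_MAX = 16;

// Assumption: the popup shows the internal variable as is.
int popup_level(int level)
{
    return level;
}

// The gauge would then show the level of the next attack, capped to the
// highest level that the gaiji can depict.
int gauge_level(int level)
{
    return std::min((level + 1), BOSS_ATTACK_LEVEL_MAX);
}

// Offset to add to the ID of the "1" gaiji for a displayed level.
int level_gaiji_offset(int displayed_level)
{
    return (displayed_level - 1);
}
```

Under this reading, the two displays only agree once the variable has reached 16, which is consistent with the numbers matching at the maximum level.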
So, what does this tell us about the demo difficulty? Now that we can search the code for these variables, we quickly come across the dedicated demo-specific branch that initializes these levels to the observable fixed values, along with two other variables I haven't researched so far. This confirms that demos run at a custom difficulty, as the two other variables receive slightly different values in regular gameplay.
However, it's still a good idea to check the code for any other potential effects of the difficulty setting. Maybe they're just hard to spot in demos? Doesn't difficulty typically affect a whole lot of other things in Touhou game code? Well, not in TH03 – MAIN.EXE only ever looks at the configured difficulty in three places, and all of them are part of the code that initializes a new round.
This reveals the true nature of difficulty in TH03: It's exclusively specified in terms of these five variables, and the Easy/Normal/Hard/Lunatic/"Demo" settings can be thought of as simply being presets for them. Story Mode adds 📝 the AI's number of safety frames to the list of variables and factors the current stage number into their values, but the concept stays the same. In this regard, TH03's design is unusually clean, making it perhaps the only Touhou game with not even a single "if difficulty is this, then do that" branch in script code. It's certainly the only PC-98 Touhou game with this property.
But it gets even better if we consider what this means for netplay. We now know that the configured difficulty is part of the match-defining parameters that must be synced between both players, just like the selected characters and the RNG seed. But why stop there? How about letting players not just choose between the presets, but allowing them to customize each of the five variables independently? Boom, we've just skyrocketed the replay value of netplay. 🚀 It's discoveries like these that justify my decision to start the road toward netplay by decompiling all of OP.EXE: In-engine menus are the cleanest and most friendly way of allowing players to configure all these variables, and now they're also the easiest and most natural choice from a technical point of view.
But wait, there's still some time left in that second push! The remaining fraction of the OP.EXE reverse-engineering contribution had repeating decimals, so let's do some quick TH02 PI work to remove the matching instance of repeating decimals from the backlog. This was very much a continuation of 📝 last year's light PI work; while the regular TH02 decompilation progress has focused and will continue to focus on the big features, it still left plenty of low-hanging PI fruit in boss code. Back then, we left with the positions of the Five Magic Stones, where ZUN's choice of storing them in arrays was almost revolutionary compared to what we saw in TH01. The same now applies to the state flags and total damage amount of not just the boss of Stage 3, but also the two independently damageable entities of the stage's midboss. In total, all of the newly identified arrays made up 3.36% of all memory references in TH02, and we're not even done with Stage 3.
Actually, you know what, let's round out that second push with even more low-hanging PI fruit and ensure 📝 technical position independence for TH03's MAINL.EXE. This was very helpful considering that I'm going to build netplay into the anniversary branch, whose debloated foundation 📝 aims to merge every game into as few executables as possible. Due to TH03's overall lower level of bloat and the dedicated SPRITE16-based rendering code in MAIN.EXE, it might not make as much sense to merge all three of TH03's .EXE binaries as it did for TH01, and MAIN.EXE's lack of position independence currently prevents this anyway. However, merging just OP.EXE and MAINL.EXE makes tremendous sense not just for TH03, but for the other three games as well. These binaries have a much smaller ratio of ZUN code to library code, and use the same file formats and subsystems.
But that's not even the best part! Once we've factored out all the invisible inconsistencies between the games, we get to share all of this code across all of the four games. Hence, technical position independence for TH03's MAINL.EXE also was the final obstacle in the way of a single consistent and ultimately portable version of all of this code. 🙌
So, how do we go from here to 📝 the short-term half-PC-98/half-modern netplay option that Ember2528 is now funding? Most of the netcode will be unrelated to TH03 in particular, but we'd obviously still want to reverse-engineer more of MAIN.EXE to ensure a high-quality integration. So how about alternating the upcoming deliveries between pure RE work and any new or modded code? Next up, therefore, I'll go for the latter and debloat OP.EXE so that I can later add the netplay features without pulling my hair out. At that point, it also makes sense to take the first steps into portability; I've got some initial ideas I'm excited to implement, and Congrio's tiny bit of funding just begs to be removed from the backlog.
(And I'm definitely going to defuse all the tearing landmines because my goodness are they infuriating when slowing down the game or working with screen recordings.)
TH05's OP.EXE? It's not one of the 📝 main blockers for multilingual translation support, but fine, let's push it to 100% RE. This didn't go all too quickly after all, though – sure, we were only missing the High Score viewer, but that's technically a menu. By now, we all know the level of code quality we can reasonably expect from ZUN's menu code, especially if we simultaneously look at how it's implemented in TH04 as well. But how much could I possibly say about even a static screen?
Then again, with half of the funding for this push not being constrained to RE, OP.EXE wasn't the worst choice. In both TH04 and TH05, the High Score viewer's code is preceded by all the functions needed to handle the GENSOU.SCR scorefile format, which I already RE'd 📝 in late 2019. Back then, it turned out to be one of the most needlessly inconsistent pieces of code in all of PC-98 Touhou, with a slightly different implementation in each of the 6 binaries that was waiting for its equally messy decompilation ever since.
Most of these inconsistencies just add bloat, but TH05's different stage number defaults for the Extra Stage do have the tiniest visible impact on the game. Since 2019 was before we had our current system of classifying weird code, let's take a quick look at this again:
In the end, this is a landmine, albeit a slightly unusual one. OP.EXE always needs to load GENSOU.SCR to determine whether the Extra Stage is unlocked and can be selected in the main menu. If that file is corrupted or doesn't exist yet, OP.EXE will always recreate it. Therefore, MAINE.EXE's recreation code would only ever run if GENSOU.SCR got deleted or corrupted while playing the game. This can only happen through code that runs outside the game or as the result of failing hardware, and thus goes beyond our criteria for observability.
On to the actual High Score screen then! The OP.EXE code I decompiled here only covers the viewer, the actual score registration is part of MAINE.EXE and is a completely different beast that only shares a few code snippets at best. This means that I'll have to do this all over again at some point down the line, which will result in another few pushes that look very similar to this one. 🥲
By now, it's no surprise that even this static screen has more or less the same density of bugs, landmines, and bloat as ZUN's more dynamic and animated menus. This time however, the worst source of bloat lies more on the meta level: TH04's version explicitly spells out every single loading and rendering call for both of that game's playable characters, rather than covering them with loops like TH05 does for its four characters. As a result, the two games only share 3¼ out of the 7 functions in even this simple viewer screen. It definitely didn't have to be this way.
On the bright side, the code starts off with a feature that probably only scoreplayers and their followers have been consciously aware of: The High Score screens can display 9-digit scores without glitches, unlike the in-game HUD's infamous overflow that turns the 8th digit into a letter once the score exceeds 100 million points.
To understand why this is such a surprise, we have to look at how scores are tracked in-game where the glitch does happen. This brings us back to the binary-coded decimal format that the final three PC-98 Touhou games use for their scores, which we didn't have to deal with 📝 for almost three years. On paper, the fixed-size array of 8 digits used by the three games would leave no room for a 9th one, so why don't we get a counterstop at 99,999,999 points, similar to what happens in modern Touhou? Let's look at the concrete example of adding, say, 200,000 points to a score of 99,899,990 points, and step through the algorithm for the most significant four digits:
| | Digits |
| --- | --- |
| score | 09 09 08 09 09 09 09 00 |
| BCD delta | + 00 00 02 00 00 00 00 00 |
| | = 09 09 08 09 09 09 09 00 |
| | + 00 00 02 00 00 00 00 00 |
| | = 09 0A 00 09 09 09 09 00 |
| | + 00 00 02 00 00 00 00 00 |
| | = 0A 00 00 09 09 09 09 00 |
| | + 00 00 02 00 00 00 00 00 |
| | = 0A 00 00 09 09 09 09 00 |
It sure is neat how ZUN arranged the gaiji font in such a way that the HUD's rendering is an exact visual representation of the bytes in memory… at least for scores between 100,000,000 (A0000000) and 159,999,999 (F9999999) inclusive.
Formatted as big-endian for easier reading. Here's the relevant undecompilable ASM code, featuring the venerable AAA instruction.
In other words: The carry of each addition is regularly added to the next digit as if it were binary, and then the next iteration has to adjust that value as necessary and pass along any carry to the digit after that. But once we've reached the most significant digit, there is nowhere for its carry to go. So it just stays there, leaving the last digit with a value greater than 9 and effectively turning it from a BCD digit into a regular old 8-bit binary value. This leaves us with a maximum representable score of 2,559,999,999 points (FF 09 09 09 09 09 09 09) – and with the scores achieved by current TAS runs being far below that limit in both games, it's definitely not worth it to bother about rendering that 10th score digit anywhere.
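Here's a simplified model of that behavior in C++. It is not a translation of the original ASM and doesn't emulate AAA itself; it only reproduces the observable effect described above. The names are hypothetical, and I'm assuming that digits are stored least significant first:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

typedef std::array<uint8_t, 8> ScoreDigits; // least significant digit first

// Adds [delta] to [score] digit by digit. Any carry is added to the next
// digit as a plain binary increment and only adjusted back into the 0-9
// range once that digit is processed, which never happens for a carry
// out of the lower seven digits into the final one.
void score_add(ScoreDigits& score, const ScoreDigits& delta)
{
    for(std::size_t i = 0; i < (score.size() - 1); i++) {
        uint8_t sum = static_cast<uint8_t>(score[i] + delta[i]);
        if(sum > 9) {
            sum = static_cast<uint8_t>(sum - 10);
            score[i + 1]++;
        }
        score[i] = sum;
    }
    // Most significant digit: nowhere to carry to, so it turns into a
    // regular 8-bit binary value.
    score[7] = static_cast<uint8_t>(score[7] + delta[7]);
}

// Interprets the digits as a regular integer.
uint32_t score_value(const ScoreDigits& score)
{
    uint32_t ret = 0;
    uint32_t factor = 1;
    for(std::size_t i = 0; i < score.size(); i++) {
        ret += (score[i] * factor);
        factor *= 10;
    }
    return ret;
}
```

Feeding this with the example from the table turns 99,899,990 into a top digit of 0A followed by 0099990, and a score of FF 09 09 09 09 09 09 09 evaluates to the 2,559,999,999 points mentioned above.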
In the High Score screens, ZUN also zero-padded each score to 8 digits, but only blitted the 9th digit into the padding between name and score if it's nonzero. From this code detail alone, we can tell that ZUN was fully aware of ≥100 million points being possible, but probably considered such high scores unlikely enough to not bother rearranging the in-game HUD to support 9 digits. After all, it only looks like there's plenty of unused space next to the HUD, but in reality, it's tightly surrounded by important VRAM regions on both sides: The 32 pixels to the left provide the much-needed sprite garbage area to support 📝 visually clipped sprites despite master.lib's lack of sprite clipping, and the 64 pixels to the right are home to the 📝 tile source area:
It sure wouldn't have been impossible. You could either sacrifice the two tiles that would cover the 9th digit in both the HiScore and Score row, or – even better – move these tiles under the existing padding space within the HUD. 📝 The tile sections of TH04 and TH05 already address their images using raw VRAM addresses, so this wouldn't have even required an additional tile index→VRAM address lookup table.
And sure enough, ZUN confirms this awareness in TH04's OMAKE.TXT:
However, the highest score that the High Score screens of both games can display without visual glitches is not 999,999,999, as you would expect from 9 digits, but rather…
959 million?
(Also, this 9th digit nicely highlights a slight asymmetry in TH04's screen, where Marisa gets 4 fewer pixels of padding between names and scores.)
What a weird limit. Regardless of whether GENSOU.SCR saves its scores in a sane unsigned 32-bit format or a silly 8-digit BCD one, this limit makes no sense in either representation. In fact, GENSOU.SCR goes even further than BCD values, and instead uses… the ID of the corresponding gaiji in the 📝 bold font?
How cute. No matter how you look at it, storing digits with an added offset of 160 makes no sense:
It's suboptimal for the High Score screens (which want to display scores with the digit sprites from SCNUM.BFT and thus have to subtract 160 from every digit),
it's suboptimal for the HiScore row in the in-game HUD (which also needs actual digits under the hood for easier comparison and replacement with the current Score, and rendering just adds 160 again), and
it doesn't even work as obfuscation (with an offset of 160 / 0xA0, you can always read the number by just looking at the lower 4 bits, and each character/rank section in GENSOU.SCR is encrypted with its own key anyway).
It does start to explain the 959 million limit, though. Since each digit in GENSOU.SCR takes up 1 byte as well, they are indeed limited to a maximum value of (255 - 160) = 95 before they wrap back to 0.
But wait. If the game simply subtracts 160 from the gaiji index to get the digit value, shouldn't this subtraction also wrap back around from 0 to 255 and recover higher values without issue? The answer is, 📝 again, C's integer promotion: Splitting the binary value into two digits involves a division by 10, the C standard mandates that a regular untyped 10 is always of type int, the uint8_t digit operand gets promoted to match, and the result is actually negative and thus doesn't even get recognized as a 9th digit because no negative value is ≥10.
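The following sketch isolates this promotion issue for the most significant byte. The function names are hypothetical and the code is reduced to the bare arithmetic, so treat it as an illustration rather than the game's actual logic:

```cpp
#include <cstdint>

const int GAIJI_DIGIT_OFFSET = 160;

// What ends up in the scorefile: the value plus the gaiji offset,
// truncated to a single byte.
uint8_t encode_top_digit(int value)
{
    return static_cast<uint8_t>(GAIJI_DIGIT_OFFSET + value);
}

// The subtraction as C would evaluate it: the uint8_t operand gets
// promoted to int, so the result can turn negative instead of wrapping.
int decode_promoted(uint8_t stored)
{
    return (stored - GAIJI_DIGIT_OFFSET);
}

// Hypothetical alternative: truncating the difference back to 8 bits
// recovers the original value.
uint8_t decode_wrapped(uint8_t stored)
{
    return static_cast<uint8_t>(stored - GAIJI_DIGIT_OFFSET);
}

// A 9th digit only exists if the decoded top value has two digits.
bool has_ninth_digit(int top_value)
{
    return (top_value >= 10);
}
```

A top value of 95 survives the round trip, but 96 gets stored as 0 and decodes to -160, which fails the ≥10 check. The truncating variant would return 96 as intended.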
So what would happen if we were to enter a score that exceeds this limit? The registration screen in MAINE.EXE doesn't display the 9th digit and the 8th one wraps around. But it still sorts the score correctly, so at least the internal processing seems to work without any problem…
(160 + 99) = 259, which wraps around to 3, so this makes perfect sense. We'll figure out the exact logic behind the differently colored sprite once RE progress reaches this screen.
But once you try viewing this score, you're instead greeted with VRAM corruption resulting from master.lib's super_put() function not bounds-checking the negative sprite IDs passed by the viewer:
In a rare case for PC-98 Touhou, the High Score viewer also hides two interesting details regarding its BGM. Just like for the graphics, ZUN also coded a fade-in call for the music. In abbreviated ASM code:
mov ax, 0000h ; PMD AH=00H (start music playback)
int 60h
mov ax, 0280h ; PMD AH=02H (fade in/out)
int 60h
However, the AH=02H fade-in call has no effect because AH=00h resets the music volume and would need to be followed by a volume-lowering AH=19h call. But even if there was such a call, the fade-in would sound terrible. 80h corresponds to the fastest possible fade-in speed of -128, which is almost but not quite instant. As such, the fade-in would leave the initial note on each channel muted while the rest of the track fades in very abruptly, which clashes badly with the bass and chord notes you'd expect to hear in the name registration themes of the two games:
At least the first issue could have been avoided if PMD's AH=00h call took optional parameters that describe the initial playback state instead of relying on these mutating calls later on. After all, it might be entirely possible for a bunch of interrupts to fire between AH=00h and these further calls, and if those interrupts take a while, the FM chip might have already played a few samples at PMD's default volume. Sure, Real Mode doesn't stop you from wrapping this sequence in CLI and STI instructions to work around this issue, but why rely on even more CPU state mutation when there would have been plenty of free x86 registers for passing more initial state to AH=00h?
The second detail is the complete opposite: It's a fade-out when leaving the menu, it uses PMD's slowest fade speed, and it does work and sound good. However, the speed is so slow that you typically barely notice the feature before the main menu theme starts playing again. But ZUN hid a small easter egg in the code: After the title screen background faded back in, the games wait for all inputs to be released before moving back into the main menu and playing the title screen theme. By holding any key when leaving the High Score viewer, you can therefore listen to the fade-out for as long as you want.
Although when I said that it works, this does not include TH04. 📝 As📝 usual, this game's menus do not address the PC-98's keyboard scancode quirk with regard to held keys, causing the loop to break even while the player is still holding a key. There are 21 not yet RE'd input polling calls in TH02 and TH04 that will most certainly reveal similar inconsistencies, are you excited yet?
But in TH05, holding a key indeed reveals the hidden content of a 37-second fade-out:
I'm holding Esc here, but this works with any key, even the ⬅️ left and ➡️ right arrow keys that don't quit out of the menu.
As you can already tell by the markers, the final bugs in TH05's (and only TH05's) OP.EXE are palette-related and revealed by switching between these two screens:
1) Why does the title screen initially use an ever so slightly darker palette than it does when returning from the menu?
2) What's with the sudden palette change between frames 1 and 2? Why are the colors suddenly much brighter?
1) is easily traced and attributed to an off-by-one error in the animation's palette fade code, but 2) is slightly more complex. This palette glitch only happens if the High Score viewer is the first palette-changing submenu you enter after the 📝 title animation. Just like 📝 TH03's character portraits, both TH04 and TH05 load the sprites for the High Score screen's digits (SCNUM.BFT) and rank indicator (HI_M.BFT) as soon as the title animation has finished. Since these are regular BFNT sprite sheets, ZUN loads them using master.lib's super_entry_bfnt(), and that's where the issue hides: master.lib's blocking palette fade functions operate on master.lib's main 8-bit palette, and super_entry_bfnt() overwrites this palette with the one in the BFNT header. Synchronizing the hardware palette with this newly loaded one would have immediately revealed this possibly unintended state mutation, but I get why master.lib might not have wanted to do that – after all, 📝 palette uploads aren't exactly cheap and would be very noticeable when loading multiple sprite sheets in a row.
In any case, this is no problem in TH04 as that game's HI_M.BFT and OP1.PI have identical palettes. But in TH05, HI_M.BFT has a significantly brighter palette:
[Palette comparison of colors 0 to 15 in OP1.PI and HI01.PI / HI_M.BFT]
And that's 100% RE for TH05's OP.EXE! 🎉 TH04's counterpart is not far behind either now, and only misses its title screen animation to reach the same mark.
As for 100% finalization, there's still the not yet decompiled TH04/TH05 version of the ZUN Soft logo that separates both OP.EXE binaries from this goal. But as I've mentioned 📝 time and time again, the most fitting moment for decompiling that animation would be right before reaching 100% on the entirety of either game. Really – as long as we aren't there, your funding is better invested into literally anything else. The ZUN Soft logo does not interact with or block work on any other part of the game, and any potential modding should be easy enough on the ASM level.
But thankfully, nobody actually scrolls down to the Finalized section. So I can rest assured that no one will take that moment away from me!
Next up: I'd kinda like to stay with PC-98 Touhou for a little longer, but the current backlog is pulling into too many different directions and doesn't convincingly point toward one goal over any other. TH02 is close, but with an active subscription, it makes more sense to accumulate 3 pushes of funding and then go for that game's bullet system in January. This is why I'm OK with subscriptions exceeding the cap every once in a while, because they do allow me to plan ahead in the long term.
So, let's wait a few days for all of you to capture the open towards something more specific. But if the backlog stays as indecisive as it is now, I'll instead go for finishing the Shuusou Gyoku Linux port, hopefully in time for the holiday season.
As for prices, indeed seems to be the point where my supply meets the community's demand for this project and the store no longer sells out immediately. So for the time being, we're going to stay at that push price and I won't increase it any further upon hitting the cap.
Remember when ReC98 was about researching the PC-98 Touhou games? After over half a year, we're finally back with some actual RE and decompilation work. The 📝 build system improvement break was definitely worth it though, the new system is a pure joy to use and injected some newfound excitement into day-to-day development.
And what game would be better suited for this occasion than TH03, which currently has the highest number of individual backers interested in it? Funding the full decompilation of TH03's OP.EXE is the clearest signal you can send me that 📝 you want your future TH03 netplay to be as seamlessly integrated and user-friendly as possible. We're just two menu screens away from reaching that goal anyway, and the character selection screen fits nicely into a single push.
The code of a menu typically starts with loading all its graphics, and TH03's character selection already stands out in that regard due to the sheer amount of image data it involves. Each of the game's 9 selectable characters comes with
a 192×192-pixel portrait (??SL.CD2),
a 32×44-pixel pictogram describing her Extra Attack (in SLEX.CD2), and
a 128×16-pixel image of her name (in CHNAME.BFT). While this image just consists of regular boldfaced versions of font ROM glyphs that the game could just render procedurally, pre-rendering these names and keeping them around in memory does make sense for performance reasons, as we're soon going to see. What doesn't make sense, though, is the fact that this is a 16-color BFNT image instead of a monochrome one, wasting both memory and rendering time.
Luckily, ZUN was sane enough to draw each character's stats programmatically. If you've ever looked through this game's data, you might have wondered where the game stores the sprite for an individual stat star. There's SLWIN.CDG, but that file just contains a full stat window with five stars in all three rows. And sure enough, ZUN renders each character's stats not by blitting sprites, but by painting (5 - value) yellow rectangles over the existing stars in that image.
The only stat-related image you will find as part of the game files. The number of stat stars per character is hardcoded and not based on any other internal constant we know about.
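To make the trick concrete, here's a minimal Python sketch of the idea. The function names, the text-mode stand-ins for stars and rectangles, and the assumption that the rectangles cover the rightmost stars are all mine; the original is 16-bit C++ that paints onto the SLWIN.CDG image in VRAM.

```python
MAX_STARS = 5  # every row in the pre-drawn stat window shows five stars


def stars_to_cover(value: int) -> int:
    """Number of pre-drawn stars that must be painted over for a stat."""
    if not 0 <= value <= MAX_STARS:
        raise ValueError("stat value out of range")
    return MAX_STARS - value


def render_row(value: int) -> str:
    """Text-mode stand-in for one stat row."""
    # Start with the full row, as it would appear in the pre-drawn image...
    row = ["★"] * MAX_STARS
    # ...and then cover (5 - value) stars with background-colored blocks.
    # Covering from the right is an assumption of this sketch.
    for i in range(value, MAX_STARS):
        row[i] = "█"
    return "".join(row)
```

A stat of 3 would thus leave three stars visible and cover the remaining two, without the game ever needing a sprite of a single star.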
Together with the EXTRA🎔 window and the question mark portrait for Story Mode, all of this sums up to 255,216 bytes of image data across 14 files. You could remove the unnecessary alpha plane from SLEX.CD2 (-1,584 bytes) or store CHNAME.BFT in a 1-bit format (-6,912 bytes), but using 3.3% less memory barely makes a difference in the grand scheme of things.
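The quoted savings can be roughly reproduced from the image dimensions alone. The following Python snippet is a back-of-the-envelope check that ignores file headers and assumes 1 bit per pixel for an alpha plane and 4 bits per pixel for a 16-color image; these assumptions are mine, but they do land on the numbers above.

```python
CHARACTERS = 9
TOTAL_BYTES = 255_216  # total quoted above for all 14 files

# SLEX.CD2: one 1-bit alpha plane per 32×44 pictogram.
slex_alpha = CHARACTERS * (32 * 44) // 8

# CHNAME.BFT: 128×16 pixels per name, 4 bits per pixel vs. 1 bit per pixel.
chname_16_color = CHARACTERS * (128 * 16) * 4 // 8
chname_1_bit = CHARACTERS * (128 * 16) * 1 // 8
chname_saved = chname_16_color - chname_1_bit

saved = slex_alpha + chname_saved
percent = 100 * saved / TOTAL_BYTES
print(slex_alpha, chname_saved, f"{percent:.1f}%")
```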
From the code, we can assume that loading such an amount of data all at once would have led to a noticeable pause on the game's target PC-98 models. The obvious alternative would be to just start out with the initially visible images and lazy-load the data for other characters as the cursors move through the menu, but the resulting mini-latencies would have been bound to cause minor frame drops as well. Instead, ZUN opted for a rather creative solution: By segmenting the loading process into four parts and moving three of these parts ahead into the main menu, we instead get four smaller latencies in places where they don't stick out as much, if at all:
The loading process starts at the logo animation, with Ellen's, Kotohime's, and Kana's portraits getting loaded after the 東方夢時空 letters finished sliding in. Why ZUN chose to start with characters #3, #4, and #5 is anyone's guess.
Reimu's, Mima's, and Marisa's portraits as well as all 9 EXTRA🎔 attack pictograms are loaded at the end of the flash animation once the full title image is shown on screen and before the game is waiting for the player to press a key.
The stat and EXTRA🎔 windows are loaded at the end of the main menu's slide-in animation… together with the question mark portrait for Story Mode, even though the player might not actually want to play Story Mode.
Finally, the game loads Rikako's, Chiyuri's, and Yumemi's portraits after it cleared VRAM upon entering the Select screen, regardless of whether the latter two are even unlocked.
I don't like how ZUN implemented this split by using three separately named standalone functions with their own copy-pasted character loop, and the load calls for specific files could have also been arranged in a more optimal order. But otherwise, this has all the ingredients of good-code. As usual, though, ZUN then definitively ruins it all by counteracting the intended latency hiding with… deliberately added latency frames:
The entire initialization process of the character selection screen, including Step #4 of image loading, is enforced to take at least 30 frames, with the count starting before the switch to the Selection theme. Presumably, this is meant to give the player enough time to release the Z key that entered this menu, because holding it would immediately select Reimu (in Story mode) or the previously selected 1P character (in VS modes) on the very first frame. But this is a workaround at best – and a completely unnecessary one at that, given that regular navigation in this menu already needs to lock keys until they're released. In the end, you can still auto-select the default choice by just not releasing the Z key.
And if that wasn't enough, the 1P vs. 2P variant of the menu adds 16 more frames of startup delay on top.
Sure, maybe loading the fourth part's 69,120 bytes from a highly fragmented hard drive might have even taken longer than 30 frames on a period-correct PC-98, but the point still stands that these delays don't solve the problem they are supposed to solve.
But the unquestionable main attraction of this menu is its fancy background animation. Mathematically, it consists of Lissajous curves with a twist: Instead of calculating each point as
x = sin((fx·t)+ẟx), y = sin((fy·t)+ẟy), TH03 effectively calculates its points as
x = cos(fx·((t+ẟx) % 0x100)), y = sin(fy·((t+ẟy) % 0x100)), due to t and ẟ being 📝 8-bit angles. Since the result of the addition remains 8-bit as well, it can and will regularly overflow before the frequency scaling factors fx and fy are applied, thus leading to sudden jumps between both ends of the 8-bit value range. The combination of this overflow and the gradual changes to fx and fy create all these interesting splits along the 360° of the curve:
At a high level, there really is just one big curve and one small curve, plus an array of trailing curves that approximate motion blur by subtracting from ẟx and ẟy.
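The effect of wrapping before scaling is easier to see in code. Below is a small Python model of the formula, not a transcription of the game's code: all function names are made up, the Q8.8 scaling (0x100 = 1.0) is modeled as a multiplication followed by a right shift, and truncating the scaled angle back to 8 bits is an assumption that mirrors what a 256-entry sine table lookup would do.

```python
import math


def wrap8(value: int) -> int:
    """Truncate to 8 bits, like storing into an unsigned 8-bit variable."""
    return value & 0xFF


def scale_q8_8(angle: int, factor_q8_8: int) -> int:
    """Scale an 8-bit angle by a Q8.8 factor and wrap the result again."""
    return wrap8((angle * factor_q8_8) >> 8)


def to_radians(angle: int) -> float:
    return angle * 2.0 * math.pi / 0x100


def curve_point(t: int, dx: int, dy: int, fx: int, fy: int):
    """One point of the curve, in the range [-1, 1] on both axes."""
    ax = scale_q8_8(wrap8(t + dx), fx)  # 8-bit overflow happens *before* scaling
    ay = scale_q8_8(wrap8(t + dy), fy)
    return math.cos(to_radians(ax)), math.sin(to_radians(ay))


def largest_jump(factor_q8_8: int) -> int:
    """Largest circular distance between the scaled angles of two
    neighboring 8-bit input angles, including the 0xFF → 0x00 wrap."""
    worst = 0
    for angle in range(0x100):
        here = scale_q8_8(angle, factor_q8_8)
        there = scale_q8_8(wrap8(angle + 1), factor_q8_8)
        distance = wrap8(there - here)
        worst = max(worst, min(distance, 0x100 - distance))
    return worst
```

With a factor of exactly 1.0 (0x100), neighboring angles never end up more than one unit apart. With a non-integer factor such as 1.5 (0x180), the wrap from 0xFF to 0x00 suddenly moves the scaled angle by 126 units, which is the kind of discontinuity that shows up as a split in the curve.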
In a rather unusual display of mathematical purity, ZUN fully re-calculates all variables and every point on every frame from just the single byte of state that indicates the current time within the animation's 128-frame cycle. However, that beauty is quickly tarnished by the sheer cost of fully recalculating these curves every frame:
In total, the effect calculates, clips, and plots 16 curves: 2 main ones, with up to 7×2 = 14 darker trailing curves.
Each of these curves is made up of the 256 maximum possible points you can get with 8-bit angles, giving us 4,096 points in total.
Each of these points takes at least 333 cycles on a 486 if it passes all clipping checks, not including VRAM latencies or the performance impact of the 📝 GRCG's RMW mode.
Due to the larger curve's diameter of 440 pixels, a few of the points at its edges are needlessly calculated only to then be discarded by the clipping checks as they don't fit within the 400 VRAM rows. Still, >1.3 million cycles for a single frame remains a reasonable ballpark assumption.
This is decidedly more than the 1.17 million cycles we have between each VSync on the game's target 66 MHz CPUs. So it's not surprising that this effect is not rendered at 56.4 FPS, but instead drops the frame rate of the entire menu by targeting a hardcoded 1 frame per 3 VSync interrupts, or 18.8 FPS. Accordingly, I reduced the frame rate of the video above to represent the actual animation cycle as cleanly as possible.
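For reference, here's the arithmetic behind these numbers as a few lines of Python. All inputs are taken from the text above; the 333 cycles are the stated lower bound per point, so the real cost would only be higher.

```python
MAIN_CURVES = 2
TRAILING_CURVES = 7 * 2
POINTS_PER_CURVE = 256   # every possible 8-bit angle
CYCLES_PER_POINT = 333   # lower bound on a 486, without VRAM latencies
CPU_HZ = 66_000_000
VSYNC_HZ = 56.4

points = (MAIN_CURVES + TRAILING_CURVES) * POINTS_PER_CURVE
cycles_needed = points * CYCLES_PER_POINT
cycles_per_vsync = CPU_HZ / VSYNC_HZ
effective_fps = VSYNC_HZ / 3  # 1 frame per 3 VSync interrupts

print(points, cycles_needed, round(cycles_per_vsync), round(effective_fps, 1))
```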
Apparently, ZUN also tested the game on the 33 MHz PC-98 model that he targeted with TH01 and TH02, and realized that 4,096 points were way too much even at 18.8 FPS. So he also added a mechanism that decrements the number of trailing curves if the last frame took ≥5 VSync interrupts, down to a minimum of only a single extra curve. You can see this in action by underclocking the CPU in your Neko Project fork of choice.
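As a rough illustration of such a feedback mechanism, consider the following Python sketch. The names and the exact bookkeeping are hypothetical; only the threshold of 5 VSync interrupts and the minimum of one extra curve come from the description above.

```python
SLOW_FRAME_VSYNCS = 5  # a frame that took this long counts as too slow
MIN_TRAILING = 1       # never go below a single extra curve


def next_trailing_count(current: int, vsyncs_last_frame: int) -> int:
    """Trailing curve count for the next frame."""
    if vsyncs_last_frame >= SLOW_FRAME_VSYNCS and current > MIN_TRAILING:
        return current - 1
    return current


def simulate(vsyncs_per_frame, start: int):
    """Trailing curve count used on each frame, given how many VSync
    interrupts each of those frames took to render."""
    counts = []
    trailing = start
    for vsyncs in vsyncs_per_frame:
        counts.append(trailing)
        trailing = next_trailing_count(trailing, vsyncs)
    return counts
```

On a fast machine where every frame fits into 3 VSync interrupts, `simulate([3, 3, 3], start=7)` stays at 7 throughout. On a machine that constantly needs 6, the count drops by one per frame until it bottoms out at 1.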
But were any of these measures really necessary? Couldn't ZUN just have allocated a 12 KiB ring buffer to keep the coordinates of previous curves, thus reducing per-frame calculations to just 512 points? Well, he could have, but we now can't use such a buffer to optimize the original animation. The 8-bit main angle offset/animation cycle variable advances by 0x02 every frame, but some of the trailing curves subtract odd numbers from this variable and thus fall between two frames of the main curves.
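The parity argument can be checked in a few lines of Python. This is only a model of the reasoning: the function names are made up, the cycle is assumed to start at 0x00, and the subtracted values in the usage note are arbitrary examples rather than the game's actual constants.

```python
STEP = 0x02   # advance of the main angle offset per frame
FRAMES = 128  # length of the animation cycle


def main_offsets() -> set:
    """Every value the main angle offset takes on within one cycle."""
    return {(frame * STEP) & 0xFF for frame in range(FRAMES)}


def found_in_history(frame: int, subtracted: int) -> bool:
    """Could a trailing curve's offset be served from a buffer that only
    holds points previously calculated for the main curves?"""
    wanted = (frame * STEP - subtracted) & 0xFF
    return wanted in main_offsets()
```

Since the main offset only ever takes on even values, `found_in_history(10, 2)` is true, whereas `found_in_history(10, 3)` is false: an odd subtraction always lands between two frames of the main curve.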
So let's shelve the idea of high-level algorithmic optimizations. In this particular case though, even micro-optimizations can have massive benefits. The sheer number of points magnifies the performance impact of every suboptimal code generation decision within the inner point loop:
Frequency scaling works by multiplying the 8-bit angles with a fixed-point Q8.8 factor. The result is then scaled back to regular integers via… two divisions by 256 rather than two bitshifts? That's another ≥46 cycles where ≥4 would have sufficed.
The biggest gains, however, would come from inlining the two far calls to the 5-instruction function that calculates one dimension of a polar coordinate, saving another ≥100 cycles.
Multiplied by the number of points, even these low-hanging fruit already save a whopping ≥753,664 cycles per frame on an i486, without writing a single line of ASM! On Pentium CPUs such as the one in the PC-9821Xa7 that ZUN supposedly developed this game on, the savings are slightly smaller because far calls are much faster, but still come in at a hefty ≥491,520 cycles. Thus, this animation easily beats 📝 TH01's sprite blitting and unblitting code, which just barely hit the 6-digit mark of wasted cycles, and snatches the crown of being the single most unoptimized code in all of PC-98 Touhou.
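Dividing these totals by the number of points gives the per-point savings they imply. The Python snippet below does that and also checks that a division by 256 and a right shift by 8 yield the same result for the non-negative products involved in scaling an 8-bit angle. Reading the 486 figure as two divisions plus the inlined calls is my interpretation; it happens to match the quoted total exactly.

```python
POINTS = 16 * 256

SAVED_486 = 753_664
SAVED_PENTIUM = 491_520

per_point_486 = SAVED_486 // POINTS
per_point_pentium = SAVED_PENTIUM // POINTS

# One possible reading of the 486 number: two divisions at >=46 cycles
# each replaced with >=4 cycles of shifting, plus >=100 cycles for
# the two inlined far calls.
reading_486 = 2 * (46 - 4) + 100


def scale_div(angle: int, factor_q8_8: int) -> int:
    return (angle * factor_q8_8) // 256


def scale_shift(angle: int, factor_q8_8: int) -> int:
    return (angle * factor_q8_8) >> 8
```

Note that Python's `//` floors just like `>>` does, so the equivalence here only says something about non-negative values; how a C++ compiler treats signed operands is a separate question that this sketch doesn't cover.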
The incredible irony here is that TH03 is the point where ZUN 📝 really 📝 started 📝 going 📝 overboard with useless ASM micro-optimizations, yet he didn't even begin to optimize the one thing that would have actually benefitted from it. Maybe he 📝 once again went for the 📽️ cinematic look 📽️ on purpose?
Unlike TH01's sprites though, all this wasted performance doesn't really matter much in the end. Sure, optimizing the animation would give us more trailing curves on slower PC-98 models, but any attempt to increase the frame rate by interpolating angles would send us straight into fanfiction territory. Due to the 0x02/2.8125° increment per cycle, tripling the frame rate of this animation would require a change to a very awkward log₂(384) ≈ 8.58-bit angle format, complete with a new 384-entry sine/cosine lookup table. And honestly, the effect does look quite impressive even at 18.8 FPS.
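Here's where the 384 comes from, spelled out in Python. The only assumption is that a tripled frame rate should keep the speed of the animation, so that each frame advances by a third of the current increment.

```python
import math
from fractions import Fraction

ANGLES_PER_CIRCLE = 0x100
STEP = 0x02

step_degrees = STEP * 360 / ANGLES_PER_CIRCLE

# At three times the frame rate, each frame would advance by 2/3 of a
# current angle unit. For that to be a whole number of units, the
# circle needs to be divided more finely.
step_at_triple_rate = Fraction(STEP, 3)
units_per_circle = ANGLES_PER_CIRCLE / step_at_triple_rate
bits = math.log2(units_per_circle)

print(step_degrees, units_per_circle, round(bits, 2))
```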
There are three more bugs and quirks in this animation that are unrelated to performance:
If you've tried counting the number of trailing dots in the video above, you might have noticed that the very first frame actually renders 8×2 trailing curves instead of 7×2, thus rendering an even higher 4,608 points. What's going on there is that ZUN actually requested 8 trailing curves, but then forgot to reset the VSync counter after the initial 30-frame delay. As a result, the game always thinks that the first frame of the menu took ≥30 VSync interrupts to render, thus causing the decrement mechanism to kick in and deterministically reduce the trailing curve count to 7.
This is a textbook example of my definition of a ZUN bug: The code unmistakably says 8, and we only don't get 8 because ZUN forgot to mutate a piece of global state.
The small trailing curves have a noticeable discontinuity where they suddenly get rotated by ±90° between the last and first frame of the animation cycle.
This quirk comes down to the small curve's ẟy angle offset being calculated as ((c/2)-i), with i being the number of the trailing curve. Halving the main cycle variable effectively restricts this smaller curve to only the first half of the sine oscillation, between [0x00, 0x80[. For the main curve, this is fine as i is always zero. But once the trailing curves leave us with a negative value after the subtraction, the resulting angle suddenly flips over into the second half of the sine oscillation that the regular curve never touches. And if you recall how a sine wave looks, the resulting visual rotation immediately makes sense:
Negated input, negated output.
Removing the division would be the most obvious fix, but that would double the speed of the sine oscillation and change the shape of the curve way beyond ZUN's intentions. The second-most obvious fix involves matching the trailing curves to the movement of the main one by restricting the subtraction to the first half of the oscillation, i.e., calculating ẟy as (((c/2)-i) % 0x80) instead. With c increasing by 0x02 on each frame of the animation, this fix would only affect the first 8 frames.
ZUN decided to plot the darker trailing curves on top of the lighter main ones. Maybe it should have been the other way round?
Now with the full 18 curves, a direction change of the smaller trailing curves at the end of the loop that only looks slightly odd, and a reversed and more natural plotting order.
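The second quirk and the proposed fix are small enough to model directly. In this Python sketch, the function names are made up, the 8-bit truncation stands in for the game's 8-bit angle type, and 8 is used as the highest trailing curve number.

```python
TRAILING = 8
FRAMES = 128


def delta_y_original(c: int, i: int) -> int:
    """((c/2)-i), truncated to an 8-bit angle."""
    return ((c // 2) - i) & 0xFF


def delta_y_fixed(c: int, i: int) -> int:
    """(((c/2)-i) % 0x80), staying in the first half of the oscillation."""
    return ((c // 2) - i) % 0x80


# c advances by 0x02 per frame, so (c/2) is just the frame number.
affected_frames = [
    frame for frame in range(FRAMES)
    if any(delta_y_original(2 * frame, i) != delta_y_fixed(2 * frame, i)
           for i in range(TRAILING + 1))
]
print(affected_frames)
```

On frame 0, trailing curve 1 gets an angle of 0xFF in the original formula, well inside the second half of the sine oscillation, whereas the fixed version wraps to 0x7F instead. From frame 8 onward, the subtraction no longer goes negative for any curve, and both formulas agree.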
Now that we fully understand how the curve animation works, there's one more issue left to investigate. Let's actually try holding the Z key to auto-select Reimu on the very first frame of the Story Mode Select screen:
The confirmation flash even happens before the menu's first page flip.
Stepping through the individual frames of the video above reveals quite a bit of tearing, particularly when VRAM is cleared in frame 1 and during the menu's first page flip in frame 49. This might remind you of 📝 the tearing issues in the Music Rooms – and indeed, this tearing is once again the expected result of ZUN landmines in the code, not an emulation bug. In fact, quite the contrary: Scanline-based rendering is a mark of quality in an emulator, as it always requires more coding effort and processing power than not doing it. Everyone's favorite two PC-98 emulators from 20 years ago might look nicer on a per-frame basis, but only because they effectively hide ZUN's frequent confusion around VRAM page flips.
To understand these tearing issues, we need to consider two more code details:
If a frame took longer than 3 VSync interrupts to render, ZUN flips the VRAM pages immediately without waiting for the next VSync interrupt.
The hardware palette fade-out is the last thing done at the end of the per-frame rendering loop, but before busy-waiting for the VSync interrupt.
The combination of 1) and the aforementioned 30-frame delay quirk explains Frame 49. There, the page flip happens within the second frame of the three-frame chunk while the electron beam is drawing row #156. DOSBox-X doesn't try to be cycle-accurate to specific CPUs, but 1 menu frame taking 1.39 real-time frames at 56.4 FPS is roughly in line with the cycle counting we did earlier.
Frame 97 is the much more intriguing one, though. While it's mildly amusing to see the palette actually go brighter for a single frame before it fades out, the interesting aspect here is that 2) practically guarantees its palette changes to happen mid-frame. And since the CRT's electron beam might be anywhere at that point… yup, that's how you'd get more than 16 colors out of the PC-98's 16-color graphics mode. 🎨
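A toy model shows how this gets past the 16-color limit. This is a Python sketch under simplifying assumptions of my own: the palette switches instantly between two scanlines, every row of the 400-row image uses all 16 color indices, and the two example palettes don't share any color.

```python
def visible_colors(rows, palette_before, palette_after, beam_row):
    """Set of RGB colors that end up on screen within one refresh if the
    palette changes just as the beam reaches `beam_row`."""
    colors = set()
    for y, row in enumerate(rows):
        # Rows above the beam were already scanned out with the old palette.
        palette = palette_before if y < beam_row else palette_after
        colors.update(palette[index] for index in row)
    return colors


rows = [list(range(16)) for _ in range(400)]
darker = [(0, i, i) for i in range(16)]     # 4-bit components
brighter = [(15, i, i) for i in range(16)]
```

A palette change at the very top or after the bottom of the screen results in the expected 16 colors, but `visible_colors(rows, darker, brighter, beam_row=200)` contains 32 of them.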
Let's exaggerate the brightness difference a bit in case the original difference doesn't come across too clearly on your display:
Probably not too much of a reason for demosceners to get excited; generic PC-98 code that doesn't try to target specific CPUs would still need a way of reliably timing such mid-frame palette changes. Bit 6 (0x40) of I/O port 0xA0 indicates HBlank, and the usual documentation suggests that you could just busy-wait for that bit to flip, but an HBlank interrupt would be much nicer.
This reproduces on both DOSBox-X and Neko Project 21/W, although the latter needs the Screen → Real palettes option enabled to actually emulate a CRT electron beam. Unfortunately, I couldn't confirm it on real hardware because my PC-9821Nw133's screen vinegar'd at the beginning of the year. But just as with the image loading times, TH03's remaining code sort of indicates that mid-frame palette changes were noticeable on real hardware, by means of this little flag I RE'd way back in March 2019. Sure, palette_show() takes >2,850 cycles on a 486 to downconvert master.lib's 8-bit palette to the GDC's 4-bit format and send it over, and that might add up with more than one palette-changing effect per frame. But tearing is a way more likely explanation for deferring all palette updates until after VSync and to the next frame.
And that completes another menu, placing us a very likely 2 pushes away from completing TH03's OP.EXE! Not many of those left now…
To balance out this heavy research into a comparatively small amount of code, I slotted in 2024's Part 2 of my usual bi-annual website improvements. This time, they went toward future-proofing the blog and making it a lot more navigable. You've probably already noticed the changes, but here's the full changelog:
The Progress blog link in the main navigation bar now points to a new list page with just the post headers and each post's table of contents, instead of directly overwhelming your browser with a view of every blog post ever on a single page.
If you've been reading this blog regularly, you've probably been starting to dread clicking this link just as much as I've been. 14 MB of initially loaded content isn't too bad for 136 posts with an increasing amount of media content, but laying out the now 2 MB of HTML sure takes a while, leaving you with a sluggish and unresponsive browser in the meantime. The old one-page view is still available at a dedicated URL in case you want to Ctrl-F over the entire history from time to time, but it's no longer the default.
The new 🔼 and 🔽 buttons now allow quick jumps between blog posts without going through the table of contents or the old one-page view. These work as expected on all views of the blog: On single-post pages, the buttons link to the adjacent single-post pages, whereas they jump up and down within the same page on the list of posts or the tag-filtered and one-page views.
The header section of each post now shows the individual goals of each push that the post documents, providing a sort of title. This is much more useful than wasting space with meaningless commit hashes; just like in the log, links to the commit diffs don't need to be longer than a GitHub icon.
The web feeds that 📝 handlerug implemented two years ago are now prominently displayed in the new blog navigation sub-header. Listing them using <link rel="alternate"> tags in the HTML <head> is usually enough for integrated feed reader extensions to automatically discover their presence, but it can't hurt to draw more attention to them. Especially now that Twitter has been locking out unregistered users for quite some time…
Speaking of microblogging platforms, I've now also followed a good chunk of the Touhou community to Bluesky! The algorithms there seem to treat my posts much more favorably than Twitter has been doing lately, despite me having less than 1/10 of mostly automatically migrated followers there. For now, I'm going to cross-post new stuff to both platforms, but I might eventually spend a push to migrate my entire tweet history over to a self-hosted PDS to own the primary source of this data.
Next up: Staying with main menus, but jumping forward to TH04 and TH05 and finalizing some code there. Should be a quick one.
📝 Over two years since the previous largest delivery, we've now got a new record in every regard: 12 pushes across 5 repos, 215 commits, and a blog post with over 14,000 words and 48 pieces of media. 😱 Who would have thought that the superficially simple task of putting SC-88Pro recordings into Shuusou Gyoku would actually mainly focus on deep research into the underlying MIDI files? I don't typically cover much music-related content because it's a non-issue as far as PC-98 Touhou code is concerned, so it's quite fitting how extensive this one turned out. So here we go, the result of virtually unlimited funding and patience:
So where's the controversy? Romantique Tp obviously made the best and most careful real-hardware SC-88Pro recordings of all of ZUN's old MIDIs, including the original (OST) and arranged (AST) soundtrack of Shuusou Gyoku, right? Surely all I have to do now is to cut them into seamless loops to save a bit of disk space, and then put them into the game? Let's start at the end of the track list with the name registration theme, since it's light on instruments and has an obvious loop point that will be easy to spot in the waveform. But, um… wait a moment, that very first drum note comes a bit late, doesn't it?
At a notated tempo of 96 BPM, these first four beats should take exactly 2.5 seconds, which they do in this seamlessly looping softsynth rendering.
That's… not quite the accuracy and perfection I was expecting. But I think I know what we're seeing and hearing there. Let's look at the first few MIDI events across all channels:
Delta  Pulse  Beat   Channel  Event
+540   960    2:000  1        Controller { CC 0, value 0 }
+0     960    2:000  1        Controller { CC 32, value 0 }
+0     960    2:000  1        ProgramChange { 37 }
[…]
+0     960    2:000  2        Controller { CC 0, value 0 }
+0     960    2:000  2        Controller { CC 32, value 0 }
+0     960    2:000  2        ProgramChange { 19 }
[…]
+0     960    2:000  3        Controller { CC 0, value 0 }
+0     960    2:000  3        Controller { CC 32, value 0 }
+0     960    2:000  3        ProgramChange { 6 }
[…]
+0     960    2:000  4        Controller { CC 0, value 0 }
+0     960    2:000  4        Controller { CC 32, value 0 }
+0     960    2:000  4        ProgramChange { 2 }
[…]
Also, the fact that GS doesn't put its drums on a non-general voice bank and instead relies on external channel configuration to differentiate drums from pitched instruments is making this Yamaha kid uncontrollably furious. 🤬
Yup. That's the sound of a vintage hardware synth being slow and taking a two-digit number of milliseconds to process a barrage of simultaneous Program Change messages, playing a MIDI file that doesn't take this reality into account and expects program changes to happen instantly.
I can only speak from my own experience of writing MIDIs for hardware synths here, but having the first note displaced by 50 ms is very much not the way a composer would have intended the music to be heard if the note is clearly notated to occur on the beat. If you had told me about such an issue when playing one of my MIDIs on a certain synth, I would have thanked you for the bug report! And I would have promptly released a fixed version of the MIDI with the Program Change events moved back by a beat or two. In the case of Shuusou Gyoku's MIDIs, this wouldn't even have added any additional delay in-game, as all of these files already start with at least one beat of leading silence to make room for setting Roland-specific synth parameters.
OK, but that's just a single isolated bass drum hit. If we wanted to, we could even fix this issue ourselves by splicing the same note from around the loop end point. Maybe this is just an isolated case and the rest of Romantique Tp's recordings are fine? Well…
By the way, this seamless audio player is what consumed most of the two website pushes this time. The rest went to the slightly redesigned main page, whose progress bars now use the cap bar style and the GitHub badge colors.
This one is even worse. Here, the delay is so long relative to the tempo of the piece that the intended five drum hits pretty much turn into four.
This type of issue doesn't even have to be isolated to the very beginning of a piece. A few of the tracks in both the OST and AST start with an anacrusis on just one or two channels and leave the Program Change event barrage at the beginning of the first full measure. In 幻想科学 ~ Doll's Phantom for example, this creates a flam-like glitch where the bass on channel 2 is pretty much on time, but the crash hit on channel 10 only follows 50 ms later, after the SC-88Pro took its sweet time to process all the Program Change events on the channels between:
This is from the arranged soundtrack for a change. In that one, ZUN at least fixed the issue in the final three MIDIs (シルクロードアリス, 魔女達の舞踏会, and 二色蓮花蝶 ~ Ancients) that closed out this rearranging project in May 2001, which spread out their per-channel setup events over at least a single measure before playing any note.
Sure, all of this is barely noticeable in casual listening, but very noticeable if you're the one who now has to cut these recordings into seamless loops. And these are just the most obvious timing issues that can be easily pinpointed and documented – the actual worst aspects are all the minor tempo and timing fluctuations throughout most of the pieces. With recordings that deviate ever so slightly from the tempo defined in the MIDI files, you can no longer rely on mathematically exact sample positions when cutting loops. Even if those positions do work out from time to time, there'd pretty much always be a discontinuity in the waveform at both ends of the loop, manifesting as a clearly audible click. In the end, the only way of finding good loop points in existing recordings involves straining your ears and listening very, very closely to avoid any audible glitches. 😩
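For comparison, this is what "mathematically exact" means for a rendering that follows the notated tempo. The Python sketch below assumes a single constant tempo for the whole piece; the helper names are made up for this example.

```python
from fractions import Fraction


def beat_to_sample(beat, bpm, sample_rate):
    """Exact sample position of a beat at a constant tempo."""
    return Fraction(beat) * 60 * sample_rate / Fraction(bpm)


def ms_to_samples(milliseconds, sample_rate):
    """Number of samples covered by a time span in milliseconds."""
    return Fraction(milliseconds, 1000) * sample_rate


# The first four beats of the 96 BPM example from further above:
seconds = Fraction(4) * 60 / 96
at_44100 = beat_to_sample(4, 96, 44_100)
at_32000 = beat_to_sample(4, 96, 32_000)
late_note = ms_to_samples(50, 44_100)
print(float(seconds), at_44100, at_32000, late_note)
```

Four beats at 96 BPM come out as 2.5 seconds, or exactly 110,250 samples at 44,100 Hz. A note that arrives 50 ms late is off by 2,205 samples at that rate, which is why calculated positions stop being useful as soon as a recording drifts from the notated tempo.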
But if you've taken a look at the second tabs in the clips above, you will have noticed that we don't necessarily have to be stuck with recordings from real hardware. In late 2015, Roland released Sound Canvas VA, a VST plugin that emulates the classic core of Roland's old Sound Canvas lineup, including the SC-88Pro. As long as we run such a software synthesizer through a quality VST host, a purely software-based solution should be way superior for recording looped BGM:
By moving from real-time recording to an offline rendering paradigm, we get perfectly accurate note timing, as it no longer matters how long the synth takes to produce each output sample.
We stay entirely in the digital realm instead of going from digital (SC-88Pro) to analog (RCA cable) to digital (line-in recording) again, removing any chance for noise or distortion to ruin audio quality.
We get to directly render at 44,100 Hz instead of being limited to the 32,000 Hz signal coming out of the SC-88Pro's DAC. This can be easily noticed in the half-speed video above, whose SCVA version retains significantly more sibilant high-frequency content compared to the more muffled sound of Romantique Tp's recording.
Doing that also makes it feasible to preserve loudness differences between the pieces of a soundtrack instead of eradicating them by normalizing the volume of each individual track to the digital maximum.
Finally, it's much more time-efficient. We simply hit foobar2000's Convert button and get all MIDIs rendered within a few seconds each, instead of having to wait the entire length of a piece.
Any drawbacks? For our use case, all of them are found in the abysmal software quality of everything around the synth engine. As is typical for the VST industry, Sound Canvas VA is excessively DRM'd – it takes multiple seconds to start up, and even then only allows a single process to run at any given time, immediately quitting every process beyond the first one with a misleading Parameter File1 Read Error message box. I totally believe anyone who claims that this makes SCVA more annoying than real hardware when composing new music. Retro gamers also dislike how Roland themselves no longer sell the 32-bit builds they used to offer for the first few versions. These old versions are now exclusively available through resellers, or on the seven seas.
But as far as the SC-88Pro emulation is concerned, there don't seem to be any technical reasons against it. There is a long thread over at VOGONS discussing all sorts of issues, but you have to dig quite deep to find any clear descriptions of bugs in SCVA's synth engine. Everything I found either only applies to the SC-55 emulation and not the SC-88Pro, was fixed by Roland in the meantime, or turned out to be a fixable bug in a MIDI file.
But wait, we've already heard one obvious difference between the real SC-88Pro and Sound Canvas VA. Let's listen to the very first clip again:
Ha! You can clearly hear a panning echo in the real-hardware recording that is missing from the Sound Canvas VA rendering. That's an obvious case of a core system effect not being reproduced correctly. If even that's undeniably broken, who knows which other subtle bugs SCVA suffers from, right? Case closed, Romantique Tp was right all along, SCVA is trash, real hardware reigns supreme
Actually, let's look closer into this one. Panning delay effects like this are typically reverb-related, but General MIDI only specifies a single controller to specify the per-channel reverb level from 0 to 127. Any specific characteristics of the reverb therefore have to be configured using vendor-specific system-exclusive messages, or SysEx for short.
So it's down to one of the four SysEx messages at the beginning of the MIDI file:
Since these byte strings represent Roland-specific instructions, we can't learn anything from a raw MIDI event dump alone here. No problem though, let's just load these files into some old MIDI sequencer that targeted Roland synths, open its MIDI event list, and then they will be automatically decoded into a human-readable representation…
…or at least that's what I expected. In Yamaha land, XGworks has done that for Yamaha's own XG SysEx messages ever since 1997:
No configuration required. You can even edit the textual Value1 representation and XGworks parses it back into the closest supported value!
But for Roland synths, there's… nothing similar? Seriously? 😶 Roland fanboys, how do you even live?! I mean, they are quick to recommend the typical bloated and sluggish big-name DAWs that take up multiple gigabytes of disk space, but none of the ones I tried seemed to have this feature. They can't have possibly been flinging around raw byte strings for the past 33 years?!
But once you look more into today's MIDI community, it becomes clear that this is exactly what they've been doing. Why else would so many people use the word complicated to describe Roland SysEx, or call it an old school/cryptic communication protocol in hexadecimal format? The latter is particularly hilarious because if you removed the word cryptic, this might as well describe all of MIDI, not just SysEx. Everything about this is a tooling issue, and Yamaha showed how easily it could have been solved. Instead, we get Sound Canvas experts, who should know more about the ecosystem than I do, making the incredible mental leap from "my DAW doesn't decode or easily generate SysEx" to "SysEx is antiquated" to "please just lift up these settings to the VST level and into my proprietary DAW's proprietary project format, that would be so much better"…
Thankfully that's not entirely true. After some more digging and configuration, I found a somewhat workable solution involving a comparatively modern sequencer called Domino:
Open the File → Preferences menu and associate your MIDI output device with a module map. This makes sense for SysEx encoding/generation since it can limit the options in the UI to what's actually available on your target hardware, but is also required for loading the respective SysEx map into Domino's SysEx decoder. There is no technical reason for this because SC-88Pro SysEx messages can be uniquely identified by the three vendor, device, and model ID bytes that every message starts with, but that would be too easy and user-friendly. The perception of SysEx being a black art must be upheld at all costs.
I've kept the garbled text of the partial translation to emphasize the sheer amount of jank involved in this entire process.
Load a MIDI file and let Domino "analyze" it:
Strangely enough, this will take quite a while – on my system, this analysis step runs at a speed of roughly 4.25 KB/s of MIDI data. Yes, kilobytes.
Unfortunately, "control change macro restoration" also seems to mean that you don't get to see any raw bytes when selecting the respective MIDI track in the UI, but at least we get what we were looking for:
…for the most part?
Alright, that's something we can work with. The GS Reset message is something that every Roland GS MIDI should start with, but it's immediately followed by a message that Domino failed to decode? The two subsequent reverb parameters make sense, but panning delays typically have more parameters than just a reverb level and time.
That unknown SysEx message shares much of the same bytes with the decoded ones though. So let's do what we maybe should have done all along, return to caveman, and check the SC-88Pro manual:
The relevant section from page 194. We can see how the address and value correspond to bytes 5-7 and 8 in the SysEx messages. Byte 9 is a checksum and byte 10 signals the end of the message.
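To make that byte layout concrete, here's a minimal Python sketch that splits the 11-byte message in question into its fields and verifies the checksum. The function names are made up for illustration and only cover single-value Data Set messages; the checksum rule is the usual Roland one, where the address, data, and checksum bytes have to sum up to a multiple of 128.

```python
def roland_checksum(payload: bytes) -> int:
    """Checksum over address and data bytes: the 7-bit value that makes
    address + data + checksum add up to a multiple of 128."""
    return (128 - sum(payload) % 128) % 128


def parse_gs_data_set(msg: bytes) -> dict:
    """Splits an 11-byte Roland Data Set message into its fields.
    Byte indices are 0-based, matching the numbering used above."""
    if len(msg) != 11 or msg[0] != 0xF0 or msg[10] != 0xF7:
        raise ValueError("not an 11-byte SysEx message")
    return {
        "vendor": msg[1],    # 0x41 = Roland
        "device": msg[2],
        "model": msg[3],     # 0x42 = GS
        "command": msg[4],   # 0x12 = Data Set
        "address": msg[5:8].hex(" ").upper(),
        "value": msg[8],
        "checksum_ok": msg[9] == roland_checksum(msg[5:9]),
    }


msg = bytes.fromhex("F0 41 10 42 12 40 01 30 14 7B F7")
fields = parse_gs_data_set(msg)
print(fields["address"], hex(fields["value"]), fields["checksum_ok"])
```

For this message, the address is 40 01 30, the value is 14h, and the checksum matches, so the message itself is well-formed. Only the value is questionable.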
And that's where we find what this particular issue boils down to. The missing SysEx message is clearly intended to be a Reverb Macro command, whose value can range from 0 to 7 inclusive on the SC-88Pro, but ZUN tries to specify Reverb Macro #14h, or 20 in decimal. The SC-88Pro manual does not specify what happens if a SysEx message wants to write an invalid value to a valid address, which means that we've firmly entered the territory of undefined behavior. Edit (2024-03-10): Romantique Tp confirmed that the real SC-88Pro clamps these Reverb Macro IDs to the supported range of 0-7. Therefore, the appropriate course of action for guaranteeing the same sound on other Roland synths would be to fix the MIDI file and specify Reverb Macro #7 instead. But since this behavior remains technically undefined, we can still argue about ZUN's intention behind specifying the Reverb Macro like this:
Clearly, ZUN did want to specify a valid Reverb Macro, but made a typo when manually entering the SysEx byte string, as he was forced to do thanks to terrible tooling. He clearly liked the resulting sound though, so the track should still be preserved with the panning reverb intact.
Clearly, the typical behavior for MIDI synths is to ignore invalid and unsupported SysEx messages, because validating user input is an important characteristic of quality software. This is what SCVA does, and what we hear in its rendering is the default hall reverb with ZUN's level and time adjustments. Therefore, SCVA is right, and the fact that we get a panning delay on the real SC-88Pro is a bug in real hardware.
Clearly, ZUN did not care enough about the reverb to specify a valid Reverb Macro. Whether we get the default reverb or a panning delay is an irrelevant performance detail, and intentionally does not matter when it comes to the intended sound of this track – especially since these four SysEx messages are the full extent of Roland GS-specific sound design in this piece, and the rest of it only uses standard MIDI features.
In fact, 32 out of the 39 MIDIs across both of Shuusou Gyoku's soundtracks use this invalid Reverb Macro. The only ones that don't are
both versions of Gates' theme (天空アーミー), which use the equally invalid Reverb Macro #11,
both versions of Milia's theme (プリムローズシヴァ), which use Reverb Macro #0 (Room 1),
and, again, the three arranged MIDIs that ZUN released last (シルクロードアリス, 魔女達の舞踏会, and 二色蓮花蝶 ~ Ancients), which feature a more detailed effect setup with custom chorus and EQ settings. In the case of Reimu's theme, these settings are even commented within the MIDI file.
And that's where this quest seemed to end, until Romantique Tp themselves came in and suggested that I take a closer look at the GS Advanced Editor, or GSAE for short.
Make sure to connect a MIDI input device before starting GSAE, or it will silently crash immediately after this splash screen. At least it accepts any controller, so this might just be a bug instead of the typical user-hostile kind of hardware dongle DRM that is pervasive in today's synth industry. 1997 would seem a bit too early for that, thankfully.
I was aware of this tool, but hadn't initially considered it because it's always described as just a SysEx generator/encoder. In fact, the very existence of such a tool made no sense to me at first, and seemed to prove my point that the usability of GS SysEx was wholly inferior to what I was used to in Yamaha land. Like, why not build at least a tiny and stripped-down MIDI sequencer around this functionality that would allow you to insert SC-88Pro-specific messages at any point within a sequence, and not just the beginning? I can see the need for such a tool in today's world of closed-source DAWs where hardware MIDI modules are niche and retro and are only kept alive by a small community of enthusiasts. But why would its developers guarantee that MIDI composers would have to hop between programs even back in 1997? I can only imagine that they saw how every just slightly advanced MIDI sequencer or DAW back then already used its own project format instead of raw Standard MIDI Files, and assumed that composers would therefore be program-hopping anyway?
However, GSAE does support the import of settings from a MIDI file and features a SysEx history window that decodes every newly processed Roland SysEx byte string, which is all I was looking for. So let's throw in that same MIDI and…
That's the result of sending just the single F0 41 10 42 12 40 01 30 14 7B F7 message at the top.
Now that's some wild numbers. An equally invalid Reverb Character, and Reverb Level and Time values that even exceed their defined range of 0-127? Could it be that GSAE emulates the real-hardware response to invalid Reverb Macros here, and gives us the exact reverb setting we can hear in Romantique Tp's recording? This could even be the reason why GSAE is still used and recommended within today's Roland MIDI sequencing scene, and hasn't been supplanted by some more modern open-source tool written by the community.
In any case, these values have to come from somewhere, so let's reverse-engineer GSAE and figure out the logic behind them. Shoutout to IDR for being a great help with its automatic generation of IDC debug symbols for the Delphi standard library, and even including a few names of application-level widget class methods by reading Delphi-specific type information from the binary. This little sub-project made me also come around to appreciating Ghidra, whose decompiler and data type manager helped a lot and allowed me to find the relevant code section within just a few hours.
A~nd it turns out that the values all come from out-of-bounds accesses into arrays on the stack. If we combine 25, 235, and 132 back into a 32-bit value, we get 0x19EB84, which is the virtual address of the relevant function's stack frame base pointer.
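The byte arithmetic is quick to double-check. This tiny sketch assumes nothing beyond the three numbers quoted above, read from most to least significant byte:

```python
values = [25, 235, 132]

# Shift each byte in from the right, most significant byte first.
address = 0
for value in values:
    address = (address << 8) | value

print(hex(address))
```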
But it gets even more hilarious: If you enable debug text output via Option → Other Options → SMF → Insert text events to setup measures and export these imported settings back into a MIDI file, GSAE not only retains these invalid Reverb Macro IDs, but stringifies them via a simple lookup into a hardcoded string pointer array, again without any bounds checks. The effects of this are roughly what you would expect:
Reverb Macro IDs between 8 and 27 simply insert wrong strings from adjacent string pointer arrays
Reverb Macro 28 crashes GSAE
Reverb Macro 64 causes GSAE to vomit 65,512 bytes of garbage into the MIDI file
In the end, we have Domino not decoding the Reverb Macro message, and GSAE, the premier SysEx tool for Roland synths, responding to it in even more undefined and clearly bugged ways than real hardware apparently does. That's two programs confirming that whatever ZUN intended was never supposed to work reliably. And while we still don't know exactly what these reverb parameters are supposed to be, these observations solve the mystery as far as I'm concerned, and solidify my personal opinion on the matter.
So what do we do now, and which version do we go with? Optimally, I'd offer both versions and turn this controversy into a personal choice so that everybody wins… and Ember2528 agreed and generously provided all the funding to make it happen. 💸
If you haven't picked your favorite yet, here are some final arguments:
The Romantique Tp recordings certainly have something going for them with their provenance of coming from real hardware, and the care that Romantique Tp put into manually recording every single track, warts and all. I wholeheartedly agree that preserving the raw sound of playing the MIDI files into the hardware without thinking about bugs or quirks is an important angle to take when it comes to preservation. It's good that these recordings exist – after all, you wouldn't know which musical elements you'd possibly be missing in an emulation if you have nothing to compare it to. Even the muffled sound in the half-speed clip above can be an argument in their favor, as the SC-88Pro's DAC operates at 32 kHz and you wouldn't expect any meaningful frequency content between 16,000 and 22,050 Hz to begin with. Any frequency content in that range that does remain in Romantique Tp's recording is simply 📝 rolled-off imaging noise added during the ADC's resampling process.
All this is why they are a definite improvement over kaorin's 2007 recordings of only the AST, which used to be the reference recordings within the community. Those had all of the same timing issues and more, in addition to being so excessively volume-boosted that 0.15% of the samples across the entire soundtrack ended up clipped. That's 6.25 seconds out of 68:39m being lost to pure digital noise.
Most importantly though: ZUN himself said that only the real SC-88Pro will play back these files as he intended them to sound. This quote is likely where the tagline of Romantique Tp's entire recording project came from in the first place:
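> 全てのデエタはSC-88ProもしくはSC-8850(ロオランド社)にて最適に聴けるように調整してあります
> それ以外の音源でも、作者の意図した音ではない場合があります。
> (All data has been tuned to sound best on the SC-88Pro or SC-8850 (Roland). On any other sound source, it may not sound the way the author intended.)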
— ZUN on 東方幻想的音楽, his old MIDI page
However. ZUN is not exactly known for accurately and carefully preserving the legacy of his series, or really doing anything beyond parading his old games as unobtainable showpieces at conventions. With all the issues we've seen, preferring real hardware is ultimately just that: an angle, and a preference. This is why I disagree with the heavy and uncritical advertising that is mainly responsible for elevating the Romantique Tp recordings to their current reference status within the community, especially if at least half of the alleged superiority of real hardware is founded on undefined behavior that can easily be fixed in the MIDI files themselves if people only bothered to look.
Here's where I stand: MIDI files are digital sheet music first and foremost, not an inferior version of tracker modules where the samples are sold separately. As such, the specific synth a MIDI file was written for is merely a secondary property of the composition – and even more so if the MIDI file contains little to nothing in terms of sound design and mostly restricts itself to the basic feature set of General MIDI. In turn, synth quirks and bugs are not a defined part of the composition either, unless they are clearly annotated and documented in the file itself. And most importantly: If the MIDI file specifies a certain timing and a recording fails to reproduce that timing, then that recording is not an accurate representation of the MIDI file.
In that regard, Sound Canvas VA is not only the closest alternative to the real thing, as a few people in the MIDI and retrogaming scene do have to admit, but superior to the real thing. I'll gladly take clarity and perfect timing accuracy in exchange for minor differences in effects, especially if the MIDI file does not explicitly and correctly define said effects to begin with. If I want a panning delay as part of the reverb, I add the respective and correct SysEx message to define one – and if I don't, I do not care about the reverb. You might still get a panning delay on a certain synth, and you might even prefer how it sounds, but it's ultimately a rendering artifact and not a consciously intended part of the composition. In that way, it's similar to the individual flavor a musician adds to a performance of a piece of classical music.
And as far as the differences in frequency response and resonant filters are concerned: In Yamaha land, these are exactly the main distinguishing factors between vintage WF-192XG sound cards (resembling the real SC-88Pro in these characteristics) and the S-YXG50 softsynth (resembling SCVA). Once I found out about that softsynth and how much clearer it sounded in comparison, I sold that old PCI sound card soon after.
In the interest of preservation though, there's still one more unexplored solution that could be the ideal middle ground between the two approaches:
Play the MIDIs through a real-hardware SC-88Pro again
Capture the actually observed system-exclusive settings that fall within the synth's supported and documented ranges
Insert them back into the MIDI file, creating a new bugfixed version
Re-record that bugfixed version through Sound Canvas VA
Edit (2024-03-10): And since Romantique Tp has confirmed what exactly happens on real hardware, I'm going to do exactly that. These bugfixed Sound Canvas VA renderings will be a free bonus of the single next Shuusou Gyoku push, and will add another angle to the preservation of these soundtracks. In the meantime though, the Sound Canvas VA packs will sound like they do in the preview videos above.
Just to be clear: I'm not suggesting that Romantique Tp should have been the one to cut their recordings into loops, or even just the one who defined where the loop points are supposed to be. On the surface, this seems to be a non-issue, and you'd just pick a point wherever each track appears to loop, right? But with 39 MIDIs to cut and all the financial support from Ember2528, it made sense to also solve this problem more thoroughly, and algorithmically detect provably correct loop points for all of these files. Who knows, maybe we even find some surprises that make it all worth it?
This is the algorithm I came up with:
At a basic level, we loop over the list of MIDI events and return the earliest and longest subrange that is immediately followed by an identical copy.
MIDI players, however, need loop point definitions that use MIDI pulse units rather than event list indices. This is especially necessary for multi-track/SMF Type 1 sequences, which would otherwise require one loop start/end index pair per track, and then it still wouldn't work because some of the tracks might not even have an event at the loop start/end point. This requires the detection algorithm and the player to agree on how to map event indices to time points and back, and simply going for the first event of each pulse (i.e., any event with a nonzero delta time) makes the most sense here. In turn, we can skip any potential start or end events that have a delta time of 0, speeding up the algorithm significantly for typical compositions with a high degree of polyphony.
Naively considering just the raw MIDI events works for MIDI playback. But as soon as we want to cut a recording based on the detected loop points, we need to account for the fact that MIDI playback is inherently stateful. Each of the 16 channels at the protocol level features at least the 128 continuous controllers (CCs) with a 7-bit state, the 14-bit pitch bend controller, and the 7-bit instrument program value, in addition to the global tempo of the piece. As a result, two ranges of events might look identical, but can still sound different if the events before the first range changed one piece of state which is then only touched again near the end of that range. This requires us to track the full MIDI state at both the start and end of a loop, and reject any potential loop that differs in these states:
In this example, a naive event-level scan would detect a loop between beats 3 and 6 as the same events are immediately repeated between beats 6 and 9. However, the piece starts with the first four notes at a channel volume of 50, which is only set to its later value of 100 on beat 5. Therefore, the actual loop ranges from beat 5 to 8. In turn, the piece needed to be at least 11 beats long to include the full second copy of the looped events and prove the loop as such.
This check can be a bit too strict in some cases, though. A channel might start with one of its CCs at a specific value but then change the same CC to a different value at a later point before playing the first note. In such a case, the detected loop would be delayed to the second CC change even though the initial CC value has no impact on the sound. By filtering these redundant CC changes, we get to move the loop start point of a few tracks (original 夢機械 ~ Innocent Power and arranged 魔法少女十字軍) back by a few seconds, to the position you'd expect.
Finally, we reject any overlong loops that themselves fully consist of multiple successive copies of the first N events.
Shuusou Gyoku's original MIDI files hide the original game's lack of MIDI looping by simply duplicating the looping sections enough times so that a typical player won't notice. The algorithm we have so far, however, would return a much longer loop if a MIDI file contains more than three successive copies of a looping section. The original version of ハーセルヴズ in particular repeats its 8 looping bars a total of 15 times before the MIDI ends, and this condition is necessary to detect the actual 8-bar loop instead of a 56-bar one.
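The core of these steps fits into a few lines. What follows is a simplified Python sketch and not the actual Rust implementation: events are modeled as plain tuples, controller changes as a hypothetical `("cc", channel, controller, value)` shape, and both the pulse-based start/end mapping and the filtering of redundant CC changes are left out. All function names are made up for illustration.

```python
def state_before(events, index):
    """Controller state after applying all events before `index`."""
    state = {}
    for event in events[:index]:
        if event[0] == "cc":
            _, channel, controller, value = event
            state[(channel, controller)] = value
    return state


def is_repetition(seq):
    """True if `seq` consists of multiple successive copies of its first N events."""
    n = len(seq)
    return any(
        n % period == 0 and seq == seq[:period] * (n // period)
        for period in range(1, n // 2 + 1)
    )


def find_loop(events, check_state=True):
    """Returns (start, length) of the longest range that is immediately
    followed by an identical copy, preferring the earliest one on ties."""
    best = None
    n = len(events)
    for start in range(n):
        for length in range((n - start) // 2, 0, -1):
            if best is not None and length <= best[1]:
                break  # nothing from this start can beat the current best
            end = start + length
            if events[start:end] != events[end:end + length]:
                continue
            if is_repetition(events[start:end]):
                continue  # overlong loop made up of shorter copies
            if check_state and state_before(events, start) != state_before(events, end):
                continue  # same events, but different controller state
            best = (start, length)
            break
    return best


VOL50, VOL100 = ("cc", 0, 7, 50), ("cc", 0, 7, 100)
X, Y = ("note", 0, 60), ("note", 0, 62)
events = [VOL50, X, VOL100, Y, X, VOL100, Y, X, VOL100, Y]

print(find_loop(events, check_state=False))  # event-level scan only
print(find_loop(events))                     # with the state check
```

In this toy sequence, the event-level scan settles on a loop starting at index 1, where the channel volume is still 50, while the state check moves the start to index 3, the first point where both ends of the loop see a volume of 100. The brute-force scan is cubic in the worst case, which is fine for a demonstration but not something to run on a long MIDI file.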
Of course, this algorithm isn't perfect and won't work for every MIDI file out there. It doesn't consider things like differently ordered events within the same MIDI pulse, (non-)registered parameter numbers, or the effect that SysEx messages can have on the state of individual channels. The latter would require the general SysEx decoding logic that I would have liked to have for the research above… actually, let's add an issue and add the project to the order form. I'd really like to see a comprehensive open-source cross-vendor SysEx decoder library in my lifetime.
As for the implementation, I was happy to write some Rust again for a change, as it's a great fit for these standalone greenfield command-line tools that don't have to directly interact with the legacy C++ code bases that this project usually deals with. It's even better if the foundational functionality is not just available in a crate, but in four, with the community already having gone through multiple iterations to arrive at a tried and tested winner. Who knows, maybe I even get to rewrite this website in it one day? Just for the sheer meme value of doing so, of course.
I also enjoyed this a lot from a technical point of view:
You might think that Rust's typical safety guarantees don't matter for the problem at hand. But then you accidentally write -= instead of += for a u32 that starts out at 0, and Rust immediately panics instead of silently underflowing to u32::MAX. This must have saved me at least 5 minutes of debugging the resulting logic error.
As it turns out, my loop detection algorithm is embarrassingly parallel. You might initially think about it in a sequential way because we always want the earliest occurrence of the longest repeating section of MIDI events, which means that each new loop candidate further into the track has to be longer than the previous one. But since we always iterate over the entire MIDI, it makes perfect sense to divide and conquer the problem. Let's split the list of possible loop end points into equal chunks, scan them all in parallel for the earliest and longest loop within that chunk, and then pick the earliest and longest loop among those intermediate results as the final one. In Rust, you don't even have to think much about the chunks, as all of that can be easily done by replacing the iteration with Rayon's parallel fold and adding a reduce() with the same condition for the final step. This sped up the algorithm by exactly the number of cores in my system.
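The shape of that fold-and-reduce approach can be sketched in Python as well. This is only an illustration of the structure and not the Rayon code: the function names are hypothetical, the chunks are taken over loop start points for simplicity, the state and repetition checks are omitted, and CPython threads won't deliver a real speedup for CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def better(a, b):
    """Prefers the longer loop, and the earlier one among equally long loops."""
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b, key=lambda loop: (-loop[1], loop[0]))


def best_loop_in(events, starts):
    """Sequential scan over one chunk of candidate start points."""
    best = None
    for start in starts:
        for length in range((len(events) - start) // 2, 0, -1):
            end = start + length
            if events[start:end] == events[end:end + length]:
                best = better(best, (start, length))
                break
    return best


def find_loop_parallel(events, workers=4):
    starts = range(len(events))
    size = max(1, -(-len(events) // workers))  # ceiling division
    chunks = [starts[i:i + size] for i in range(0, len(events), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(lambda chunk: best_loop_in(events, chunk), chunks)
        # The final step reuses the same comparison as the per-chunk scan.
        return reduce(better, partial_results, None)


events = list("xyABCABCABC")
print(find_loop_parallel(events))
```

The important property is that `better()` is used both within each chunk and for merging the intermediate results, which is why the chunked version returns the same loop as a single sequential pass.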
This algorithm works well for the long MIDI files of Shuusou Gyoku's OST that all contain multiple duplicates of their loop section, but it quickly reaches its limit with the AST. Following the classic two-loop + fade-out format, that soundtrack was meant to be played back in generic MIDI players, and not to actually be put back into the game in looped form. Since the loop algorithm did, in fact, find inconsistencies even in the OST, two copies of the apparent loop are sometimes not enough to prove cases where the actual loop ends much later than you think it does. In a few cases, it would be enough to simply remove all volume change events from the fade-out to prove the actual loop, but in others, the algorithm would need MIDI event data far past the end of the fade-out.
However, just giving up and not looping any of these tracks would be equally unfortunate. So how about shifting the question, from what's the best loop in this MIDI file to what's the best loop if the MIDI didn't fade out and instead repeated its apparent second loop a third time? As long as the detected loop in such a pre-processed file ends before the repeated range, it's still a valid loop in terms of the unmodified original.
Ideally, we want to do this pre-processing programmatically with the same Rust library instead of manually editing the MIDI. Many sequencers (and especially XGworks) apply significant changes to a MIDI file's internal structure when saving its internal representation back to a MIDI file, which might even mess with our loop algorithm. So it would be very nice to have a more trustworthy tool that applies only the edit we actually want, and perfectly retains the rest of the MIDI.
And that's how this sub-project turned into a small suite of command-line MIDI operations in the classic Unix filter/pipeline style: Each command reads a MIDI file from stdin, transforms it, and outputs text or the resulting MIDI file on stdout. This way, we gain maximum transparency and reproducibility as I can document the unique pre-processing steps for each AST track by simply providing the command lines. And sure, we're re-encoding and re-decoding the full MIDI sequence at every step along such a pipeline, but computers are fast, Rust and the midly library in particular are ⚡ blazingly fast ⚡, and the usability benefits of this pipeline model far outweigh any theoretical performance drops.
Here's the full list of commands that made it into the resulting mly tool:
cut: Extremely basic removal of MIDI events within a certain range.
dump: Dumps all MIDI events into a textual table. All event lists in this blog post are based on this output.
duration: Shows the duration of a MIDI file in pulses, beats, seconds, and PCM samples.
filter-note: Removes all Note On events within a certain range, retaining all other events. This allows us to generate separate intro and loop MIDIs, whose renderings we can then splice back into a single loopable waveform with no discontinuities, which is not guaranteed when rendering a single MIDI file. This provides the last missing piece needed for rendering perfect, sample-accurate loops through Sound Canvas VA.
loop-find: The loop detection algorithm described above.
loop-unfold: Duplicates MIDI events from a given point to the end of the track. A budget solution for the problem of creating synthetic loops – arbitrary copying of arbitrary subranges to arbitrary destinations would have been undeniably nicer, but also much more complex, and I didn't need that full flexibility for the task at hand.
smf0: Flattening multi-track/SMF Type 1 MIDI sequences into single-track/SMF Type 0 ones. Having this conversion as a distinct operation in our toolset allows other operations to exclusively support SMF Type 0 if a Type 1 implementation would either take significant additional effort or just duplicate the Type 0 flattening algorithm. This group of operations includes loop-find, cut, and even the real-time output for duration because tempo events can theoretically occur on any track.
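As an illustration of what the smf0 flattening involves, here's a rough Python sketch that is not taken from mly. It assumes each track is a list of `(delta_pulses, message)` pairs, and the tie-break for events on the same pulse (lower track index first) is my own assumption.

```python
def flatten_tracks(tracks):
    """Merges multiple delta-timed tracks into a single delta-timed track."""
    merged = []
    for track_index, track in enumerate(tracks):
        time = 0
        for order, (delta, message) in enumerate(track):
            time += delta  # delta time -> absolute time in pulses
            merged.append((time, track_index, order, message))

    # Sort by absolute time, keeping a stable order within each pulse.
    merged.sort(key=lambda event: event[:3])

    flat = []
    previous = 0
    for time, _, _, message in merged:
        flat.append((time - previous, message))  # back to delta time
        previous = time
    return flat


tracks = [
    [(0, "tempo"), (480, "end of track 0")],
    [(0, "note on"), (240, "note off"), (240, "note on 2")],
]
for delta, message in flatten_tracks(tracks):
    print(delta, message)
```

Once everything lives on a single timeline like this, operations such as the real-time duration calculation only have to walk one list to see every tempo change.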
This feature set should strike a good balance between not spending too much of the Shuusou Gyoku budget on tangential problems, but still offering a decent solution for the problem at hand. As a counterexample, the obvious killer feature – deserializing a dump back into a Standard MIDI File – would have gone way past the budget. While there are crates that free you from the need to write manual parsing code for basic data structures, they would instead require a lot of attribute boilerplate – and if the library that provided the structures doesn't already come with these attributes, you now have to duplicate all the structures, and convert back and forth between the original structures and your copies. Not to mention that we'd still have to write code for the high-level structure of the dump output…
If we put it all together, this is what we can do:
The best loop found in the raw MIDI file spans 4 events and 200 milliseconds. Clearly, this is not the loop we're looking for.
Let's cut off all events from the start of the fade-out to the end, do a loop-unfold copy of all events from the position during the apparent second loop that corresponds to where the fade-out started, and try looking for a loop in that modified MIDI.
The resulting loop is 1:31m long, which is exactly what we were hoping to find.
The note space loop represents the earliest possible event range with equivalent per-channel controller and pitch bend state at both ends. This loop is only appropriate for MIDI players, as its bounds can fall into the middle of notes that are played with a different channel state at the start and end of the loop. This is why it doesn't show any sample positions.
The recording space loop ensures that this doesn't happen. It's also always placed on a Note On event with non-zero velocity, which eases the splicing of separate filter-note recordings. This way, it's enough to remove leading silence from the loop part and mix it exactly at the indicated sample position.
The detected loop is also nowhere close to the cut point at beat 466, matching our condition for validity. All events within the loop came from ZUN's original composition, and the cut/loop-unfold combo merely provided the remaining 63% of events necessary to prove this loop as such.
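The cut/loop-unfold preprocessing and the validity condition boil down to very little logic. Here's a hedged sketch on a plain Python list, with single letters standing in for MIDI events. The function names mirror the mly commands, but this is neither their actual implementation nor their command-line interface.

```python
def cut_from(events, start):
    """Drops every event from index `start` to the end, i.e., the fade-out."""
    return events[:start]


def loop_unfold(events, start):
    """Appends a copy of all events from index `start` to the end."""
    return events + events[start:]


def loop_is_valid(loop_start, loop_length, cut_point):
    """A loop found in the extended list only counts if it ends before
    the cut point, so that it consists entirely of original events."""
    return loop_start + loop_length <= cut_point


# Intro, two full loops, the start of a third one, and then the fade-out.
track = list("iABCDABCDAB") + ["fade", "fade"]
cut_point = 11          # index where the fade-out starts
unfold_from = 7         # same position within the second loop

extended = loop_unfold(cut_from(track, cut_point), unfold_from)
print("".join(extended))
print(loop_is_valid(1, 4, cut_point))
```

The extended list now continues the third loop as if the piece had never faded out, and a loop of four events starting at index 1 ends well before the cut point.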
So, where are these loop quirks that justify why some of these audio files are longer than you'd think they should be? Just listing them as text wouldn't really communicate just how minor these are. It would be much nicer to visualize them in a way that highlights the exact inconsistencies within a fixed range of MIDI measures. Screenshots of MIDI sequencer or DAW windows won't capture these aspects all too well because these programs are geared toward fine-grained editing of single tracks, not visualization of details across all channels.
REAPER's piano roll nicely snaps to a certain range, but good luck picking out the individual lines from the single volume lane at the bottom of the screen, or spotting a 7-point difference. Not to mention that CC #11 (Expression) makes up an equal part of a channel's final perceived volume, which is the metric we'd actually want to visualize.
Typical MIDI visualizers, however, are on the complete opposite end of the spectrum. In recent years, MIDI visualization has become synonymous with the typical Synthesia style of YouTube videos with a big keyboard at the bottom, note bars flying in from the top, and optional fancy effects once those notes hit the top of the keyboard. The Black MIDI community has been churning out tons of identical-looking MIDI visualizers that mainly seem to differ in the programming language they're written in, and in how well they can cope with the blackest of black MIDIs.
Thankfully, most of these visualizers are open-source and have small and manageable codebases. The project with the most GitHub stars and the most generic name seemed to be the best starting point for hacking in the missing features, despite using GLSL shaders which I had no prior experience with. It was long overdue that I did something with GLSL though – it added a nice educational aspect to these hacks, and it still was easier than deciphering whatever the fastest and hyper-optimized Rust visualizer is doing.
Still, this visualizer needed a total of 18 small features and bugfixes to be actually usable for demonstrating Shuusou Gyoku's loop quirks. As such, these hacks turned into yet another tangential sub-project that could have easily consumed another two pushes if I cleaned up the code and published the result. But that would have really gone way past the budget for something that people might not even care about. So here's what we're going to do:
I've added this MIDI visualizer as a new goal to the order form. This goal is eligible for microtransactions, so you don't have to fund a full push to see the first changes committed and released.
The upstream project seems to have been abandoned recently, which is the perfect excuse for not even trying to merge in my sweeping changes with a series of pull requests. The code sure needs a lot of cleanup and deduplication, and especially a more build system-friendly way of embedding its shader source code.
Every backer who supports this goal with at least 0.1 pushes or microtransactions will get a Windows binary with my current hacked-in changes as a preview, immediately after the purchase. Shoutout to the MIT license for letting me do this 😛
As usual, once the code is done, the final cleaned-up version will be available for free for everyone, in both source code and binary release form.
Alright then! Here's how to read the visualizations:
The opacity of each note represents its velocity multiplied by the channel volume and expression. To spot volume inconsistencies, you'd compare the opacity of equivalent notes in the two ranges.
The X-axis of these visualizations uses linear/real time, so the width of each measure represents the exact time it takes to be played relative to the other measures in the visualized range. To spot tempo inconsistencies, you'd compare the distance between the bar lines.
Notes that are duplicated on two or more channels may be colored differently in the loop start and end views. These are rendering order inconsistencies and don't communicate anything about the MIDI.
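For a rough idea of how subtle some of these differences are, here's the opacity rule as a small Python function. The linear scaling to the 0 to 1 range is an assumption on my part for the sake of the example, and the function name is hypothetical; the hacked visualizer may map the product differently.

```python
def note_opacity(velocity, volume, expression):
    """Product of the three 7-bit values, each scaled to the 0..1 range."""
    return (velocity / 127) * (volume / 127) * (expression / 127)


full = note_opacity(100, 100, 127)
reduced = note_opacity(100, 100, 120)
print(round(reduced / full, 3))
```

An expression value of 120 instead of 127 leaves about 94% of the original opacity under this model, which is why such quirks are much easier to spot when equivalent notes are placed right next to each other.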
Stage 1 theme (フォルスストロベリー), original and arranged version: The string and harmonica channels are slightly louder on the apparent first loop than on the others.
Apparent loop:
0:01m – 1:31m
Actual loop:
1:04m – 2:34m
Mei and Mai's theme (ディザストラスジェミニ), arranged version: The one and only quirk that's caused by different notes – the first loop has an E♭ on the slap bass channel in measure 32, but the second loop has a G♭ in the corresponding measure 72.
Apparent loop:
0:01m – 1:02m
Actual loop:
0:50m – 1:51m
Stage 3 theme (華の幻想 紅夢の宙), original and arranged version:
The trumpet channel starts out panned to the center of the stereo field (64), before being left-panned by 25% (48) at 1:04m, where it stays for the rest of the track.
Apparent loop:
0:01m – 1:29m
Actual loop:
1:04m – 2:32m
I didn't come up with a good way of visualizing panning in a 2D plane, so you have to trust your ears with this one.
Marie's theme (機械サーカス ~ Reverie), arranged version: Every apparent loop modulates up by a semitone 16 measures before it ends, and remains in that new key at the start of the next loop, so the piece technically doesn't loop at all. The original stays in G♯m throughout.
Stage 5 theme (カナベラルの夢幻少女), original version: The ritardando near the supposed end of the first loop drops from 145 BPM to 118 BPM, but only to 129 BPM in all further loops.
Apparent loop:
0:01m – 1:39m
Actual loop:
1:33m – 3:11m
Yup, that means that the intro part technically makes up almost the entire apparent loop. ZUN replaced the ritardando with instant tempo changes in the arranged version, which moves the loop to its expected place at the start of the track.
The loop start and end points are in the respective next measure past this range.
Stage 6 theme (アンティークテラー), arranged version: The string channel starts out with the maximum expression of 127, but then only goes up to 120 after some fading notes later in the piece, where it stays for the beginning of the second loop.
Apparent loop:
0:01m – 1:53m
Actual loop:
0:13m – 2:05m
Same here.
VIVIT-captured-'s first theme (夢機械 ~ Innocent Power), arranged version: Has a unique ending section that starts in Gm and then modulates through Em and Fm before it fades out on F♯m.
VIVIT-captured-'s second theme (幻想科学 ~ Doll's Phantom), original and arranged version: Another fade-related 127 vs. 120 expression inconsistency, this time on the orange square channel.
Apparent loop:
0:01m – 1:32m
Actual loop:
1:03m – 2:34m
VIVIT-captured-'s third theme (少女神性 ~ Pandora's Box), original and arranged version: Another tempo inconsistency: A slightly differently shaped ritardando before the bell tree hit in the supposed first loop.
Marisa's theme (魔女達の舞踏会), arranged version: Has a unique 8-bar ending section that is first played in Cm and then loops in C♯m while fading out.
Ending theme (ハーセルヴズ), arranged version: Probably the best-known one out of these, and I'm talking of course about the beautiful ending section. I'm making the executive decision to not loop this track in-game, and letting it fade to silence instead.
Before we package up these looped soundtracks, let's take a quick look at how they would be shown off in the Music Room. The Seihou Music Rooms carry over the per-channel keyboards from TH05, add the current per-channel volume, expression, and pan pot values, and top it off with a fake spectrum analyzer. All of these visualizations rely on MIDI data, and the Music Room would feel very dull and boring without them. Just look at Kioh Gyoku, whose Music Room basically turns into a still image in WAVE mode.
Retaining these visualizations even when playing waveform BGM was very important for me, and not just because it would make for a unique high-quality feature that would break new ground. It can also double as proof that the waveform versions are, in fact, in perfect sync with both the MIDIs they are based on, and, by extension, the respective stage scripts.
However, this would require the game to process the MIDIs and update the internal visualization state without simultaneously playing them back through the WinMM / MME / midiOut*() API. And just like graphics and text rendering, Shuusou Gyoku's original code came with zero architectural separation between platform-independent processing logic and platform-specific playback…
So I accidentally rewrote almost the entire MIDI code to achieve said separation. This also provided a great occasion to modernize this code and add some much-needed robustness for potential MIDI mods, while retaining the original code's approach of iterating over raw SMF byte streams. It might all have been very excessive for a delivery that was supposed to be just about waveform BGM support, but on the plus side, MIDI output is now portable to any other system's MIDI API as well.
Surprisingly though, it was Shuusou Gyoku's original MIDI timing that quickly turned out to be rather inaccurate, and not the waveforms. The exact numbers vary depending on the piece, but the game played back every MIDI about 1% slower than notated, adding about 2 or 3 seconds to their total playback time after 5 minutes. Tempo changes in particular were the biggest causes of desynchronizations with the waveforms…
To understand how this can happen to begin with, we have to look closer at how you're supposed to use the midiOut*() API. This API is as low-level as it gets, only covering the transmission of a single MIDI message to the selected output device right now. There is no concept of note timing at this low level, so it's completely up to the program to parse delta times and tempo change events out of the MIDI file and correctly time the calls to this API for each MIDI message. With all the code that runs between the API and the actual renderer of the synth for every single message, the resulting timing can only ever be an approximation of the MIDI file. This doesn't really matter for the timescales and polyphony levels of typical music because, again, computers are fast, but such an API is fundamentally unsuitable for accurately playing back even just a moderately complex million-note Black MIDI.
Shuusou Gyoku handles this required manual timing in the simplest possible way: It runs a MIDI processing function (Mid_Proc() in the code) at an interval of 10 ms, which processes and instantly sends out all MIDI events that have occurred at any point within the last 10 ms, maintaining merely their order. This explains not only why the original game incremented its MIDI TIMER by multiples of 10, but also the infamous missing drums when playing the soundtrack through the Microsoft GS Wavetable Synth:
ZUN reduced all drum notes to the minimum possible length allowed by the 480 PPQN pulse resolution of these MIDI files.
In regular music notation, this corresponds to 1/1920th notes.
While the exact real-time length in purely mathematical terms depends on the tempo of a piece, it only has to be ≥13 BPM for a 1/1920th note to be shorter than 10 ms.
Therefore, the higher the BPM, the higher the chance that both a drum note's Note On and Note Off messages are sent within the same call to Mid_Proc(), with the respective two midiOut*() API calls only being at best a two-digit number of microseconds apart.
So it only makes sense why cheap MIDI synths that don't even respond to reverb or release time messages completely drop any note with such a short length. After all, at a sampling rate of 44,100 Hz, a note would have to be at least 22.7 µs long to be represented by even a single PCM sample.
This also extends to the visualizations above, and was the reason why I chose to render all drum notes as fixed-size diamonds. Otherwise, they would barely be visible.
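The arithmetic behind that ≥13 BPM threshold can be sketched in a few lines. This is only an illustration: `pulse_duration_us()` and `fits_into_interval()` are hypothetical helper names that don't exist in the game's code, and the BPM is assumed to be an integer for simplicity.

```cpp
#include <cassert>
#include <cstdint>

// Length of a single MIDI pulse in microseconds for the given tempo and
// pulse resolution. Integer division, so the result is rounded down.
// (Hypothetical helper, not part of the game's code.)
int64_t pulse_duration_us(int64_t bpm, int64_t ppqn)
{
	assert((bpm > 0) && (ppqn > 0));
	const int64_t quarter_note_us = (60000000 / bpm);
	return (quarter_note_us / ppqn);
}

// Whether a note that lasts a single pulse is shorter than the processing
// interval, which allows its Note On and Note Off messages to land within
// the same processing call.
bool fits_into_interval(int64_t bpm, int64_t ppqn, int64_t interval_us)
{
	return (pulse_duration_us(bpm, ppqn) < interval_us);
}
```

With 480 PPQN and a 10 ms interval, `fits_into_interval(13, 480, 10000)` is true and `fits_into_interval(12, 480, 10000)` is false. At a more typical 150 BPM, a single pulse lasts just 833 µs.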
But while sending MIDI events in such quantized chunks might not be perfect, it can't be the cause behind multi-second playback slowdowns. Instead, this issue has to boil down to the way Shuusou Gyoku times each individual message, and specifically how it converts between MIDI pulse units and real-time (milli)seconds. pbg's original MIDI code chose to do this in an equally confusing and inaccurate way: it kept two counters that tracked the current MIDI pulse before and after the latest tempo change, used the value of the latter counter to decide which events to process, and only added the pulse equivalent of 10 ms to this counter at the end of Mid_Proc() in the then current tempo. The commit message for my rewritten algorithm details the problems with this approach using nice ASCII art in case you're interested, but in short, the main problem lies in how the single final addition can only consider a single tempo change within each call to Mid_Proc(). If a MIDI file contains tempo ramps with less than 10 ms between each different tempo, the original game would only use the last of these tempo values as the basis for converting the entire 10 ms back into MIDI pulses. Not to mention that maybe MIDI pulses aren't the best unit in a game that still 📝 treats the FPU as lava and doesn't use any fixed-point means of increasing the resolution of the 10 ms→pulse division either…
On the contrary, it's much more accurate to immediately convert every encountered MIDI delta time to a real-time quantity and use that unit for event timing, especially if we want to restrict ourselves to integer math. Signed 64-bit integers are enough to fit the product of the slowest possible MIDI tempo ((2²⁴ − 1) µs per quarter note) and the highest possible MIDI delta time (2²⁸ − 1) at nanosecond precision (10³), with one bit to spare. Then, we arrive at a much simpler timing algorithm:
Each simultaneously playing track gets a next event timer, starting out at 0
When looking at the next event, add the converted nanosecond value of its delta time to this timer
Subtract the equivalent of 10 ms from each track's timer at the beginning of the processing function
As long as the timer is ≤0, process and send the next message
The additive nature of this timer not only naturally allows more than one event to happen within a single Mid_Proc() call, but also averages out any minor timing inconsistencies across the length of a track.
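As a rough illustration, the four steps above could look like the sketch below. This is not the actual code from the rewrite: `Event`, `Track`, `delta_to_ns()`, `track_start()`, and `track_proc()` are made-up names, the tempo is assumed to be constant, and "sending" a message is reduced to collecting an ID.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The worst-case product mentioned above fits with one bit to spare.
static_assert(
	(((static_cast<int64_t>(1) << 24) - 1) *
	 ((static_cast<int64_t>(1) << 28) - 1) * 1000) <
	(static_cast<int64_t>(1) << 62),
	"worst-case nanosecond delta must fit into a signed 64-bit integer"
);

struct Event {
	uint32_t delta_pulses; // MIDI delta time relative to the previous event
	int id;                // Stand-in for the actual MIDI message
};

struct Track {
	std::vector<Event> events;
	std::size_t next = 0;
	int64_t timer_ns = 0; // Next event timer; the event is due once this is ≤0
};

// Converts a delta time to nanoseconds (assumption: constant tempo).
int64_t delta_to_ns(uint32_t delta_pulses, uint32_t tempo_us, uint16_t ppqn)
{
	assert(ppqn > 0);
	return ((static_cast<int64_t>(delta_pulses) * tempo_us * 1000) / ppqn);
}

// Step 1 and 2: Start the timer at 0 and look at the first event.
void track_start(Track& t, uint32_t tempo_us, uint16_t ppqn)
{
	t.next = 0;
	t.timer_ns = 0;
	if(!t.events.empty()) {
		t.timer_ns += delta_to_ns(t.events[0].delta_pulses, tempo_us, ppqn);
	}
}

// Step 3 and 4: Subtract the elapsed time, then process all due events.
std::vector<int> track_proc(
	Track& t, int64_t elapsed_ns, uint32_t tempo_us, uint16_t ppqn
)
{
	std::vector<int> sent;
	t.timer_ns -= elapsed_ns;
	while((t.next < t.events.size()) && (t.timer_ns <= 0)) {
		sent.push_back(t.events[t.next].id);
		t.next++;
		if(t.next < t.events.size()) {
			const Event& e = t.events[t.next];
			t.timer_ns += delta_to_ns(e.delta_pulses, tempo_us, ppqn);
		}
	}
	return sent;
}
```

Because the timer is allowed to go negative, any overshoot from one call carries over into the next one instead of being discarded, which is what averages out the timing inconsistencies.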
assert(length_of_tempo_message == 3);
uint32_t tempo = 0;
for(int i = 0; i < length_of_tempo_message; i++) {
- tempo += ((tempo << 8) + (*track_data++));
+ tempo = ((tempo << 8) + (*track_data++));
}
Yup – the original code performed two additions per byte, which incorrectly added the interim value at every byte to the final result, and yielded a tempo that is ≈0.8% / ≈1 BPM slower than notated in the MIDI file, matching the number we were looking for. That's why the |/OR operator is the safer one to use in such a bit-twiddling context…
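To see the effect in numbers, here's a small self-contained comparison of both variants. The two function names are made up for this sketch, and the fixed variant uses the `|` operator suggested above instead of the `+` from the diff.

```cpp
#include <cassert>
#include <cstdint>

// Original behavior: `+=` adds the interim value a second time at every
// byte, which effectively multiplies by 257 instead of 256.
uint32_t tempo_from_bytes_buggy(const uint8_t* track_data, int length)
{
	assert(length == 3);
	uint32_t tempo = 0;
	for(int i = 0; i < length; i++) {
		tempo += ((tempo << 8) + (*track_data++));
	}
	return tempo;
}

// Fixed behavior: plain big-endian deserialization.
uint32_t tempo_from_bytes_fixed(const uint8_t* track_data, int length)
{
	assert(length == 3);
	uint32_t tempo = 0;
	for(int i = 0; i < length; i++) {
		tempo = ((tempo << 8) | (*track_data++));
	}
	return tempo;
}
```

For the bytes `07 A1 20`, which encode 500,000 µs per quarter note or 120 BPM, the fixed variant returns 500000 and the buggy one returns 503752. That works out to roughly 119.1 BPM.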
But now I'm curious. This is such a tiny bug that is bound to remain unnoticed until someone compares the game's MIDI output to another renderer. It must have certainly made it into other games whose MIDI code is based on Shuusou Gyoku's, or that pbg was involved with. And sure enough, not only did this bug survive Kioh Gyoku's OOP refactoring, but it even traveled into Windows Touhou, where it remained in every single game that supported MIDI playback. Now we know for a fact that pbg's Program Support role in the TH06 credits involved sharing ready-made, finished code with ZUN:
The broken tempo deserialization in the respective latest full versions of TH06 through TH10. And yes, that's TH10 – even though TH09's trial version was the last game to ship MIDI versions of its soundtrack, TH10 still contained all of pbg's MIDI code that originated back in Shuusou Gyoku, before TH11 finally removed it.
Amusingly, ZUN's compiler even started optimizing the combination of left-shifting and addition to a multiplication with 257 for TH09, which even sort of highlights this bug if you're used to reading x86 ASM.
That leaves support for MIDI loop points as the only missing feature for syncing MIDI data with a looping waveform track. While it didn't require all too much code, pbg's original zero-copy approach of iterating over raw MIDI data definitely injected a lot of complexity into the required branches. Multi-track/SMF Type 1 files require quite a bit of extra thought to correctly calculate delta times across loop boundaries that reach past the end of the respective track, while still allowing the real-time delta values to be resynchronized at tempo changes within the loop – and yes, 3 of ZUN's 19 arranged MIDI files actually do use more than one track, so this wasn't just about maximizing MIDI compatibility for mods. I stuck to the original approach mostly as a challenge and to prove that it's possible without first parsing the entire MIDI sequence into a friendlier internal representation, but I absolutely do not recommend this to anyone else.
After hardcoding the loop points detected by mly into the binary, we only need to call Mid_Proc() once per frame in the Music Room and pass the frame delta time instead of the 10 ms constant. And then, we get this:
The MIDI TIMER now shows off the arguably more interesting current MIDI pulse value rather than just formatting the PASSED TIME in milliseconds. Ironically, displaying this value in a constantly counting way takes more effort now – the new nanosecond-based timing code doesn't use any measure of total MIDI pulses anymore, and they don't naturally fall out of the algorithm either. Instead, the code remembers the total pulse value of the last event it processed and adds the real-time duration that has passed since, similar to the original timing algorithm.
This naturally causes the timer to jump from the loop end pulse to the loop start pulse, proving that Mid_Proc() is in fact looping the sequence.
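The idea behind the displayed value can be sketched as follows. `displayed_pulse()` and its parameters are hypothetical names for this illustration, and the sketch assumes that no tempo change has happened since the last processed event.

```cpp
#include <cassert>
#include <cstdint>

// Estimates the current MIDI pulse from the total pulse value of the last
// processed event plus the real time that has passed since.
// (Hypothetical helper; assumes a constant tempo since that event.)
int64_t displayed_pulse(
	int64_t last_event_pulse,
	int64_t ns_since_last_event,
	uint32_t tempo_us_per_quarter,
	uint16_t ppqn
)
{
	assert(tempo_us_per_quarter > 0);
	const int64_t ns_per_quarter = (int64_t{ tempo_us_per_quarter } * 1000);
	return (last_event_pulse + ((ns_since_last_event * ppqn) / ns_per_quarter));
}
```

At 120 BPM and 480 PPQN, a quarter of a second after an event at pulse 960 would show up as pulse 1200.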
Alright, now we know what to package:
We're going to have 8 BGM packs, one for each permutation of soundtrack (OST / AST), sound source (Romantique Tp / Sound Canvas VA), and codec (FLAC / Vorbis), making up 1.15 GiB of music data in total.
When looking at the package names, you will notice that I don't particularly highlight the FLAC versions as lossless. And for good reason – the Romantique Tp recordings had dithering and noise shaping applied to them, and the Sound Canvas VA versions will necessarily have to be volume-normalized and quantized to 16-bit during the conversion to FLAC. If we wanted a BGM pack with the actual raw Sound Canvas VA output, we'd have to implement WavPack support, which is the only lossless codec that supports 32-bit float – and even that codec could only compress these files down to 14 MiB per minute of music, or 508 MB for the entire original soundtrack. That's 1.4× the size of an equivalent thbgm.dat!
The whole packaging process will be complex enough to warrant a build system. I'd also like to generate an extensive README file for each package, not least to describe the Sound Canvas VA rendering and loop-cutting process in complete detail.
The AST packs need to bundle the MIDI files from ZUN's site for Music Room visualization. We might as well add a 9th MIDI-only AST pack then, as it will naturally fall out of the packaging pipeline anyway. Some people sure love their MIDI synths, after all.
The OST packs can fall back on the original game's MIDI files from MUSIC.DAT for their Music Room visualization, so there's no need to bundle those and infringe copyright. Ironically, the game will still require a MUSIC.DAT even if you use a BGM pack, if only for the one number in that file that says that Shuusou Gyoku's soundtrack consists of 20 tracks in total.
ZUN didn't arrange タイトルドメイド, so we need to copy the OST version recorded with the respective sound source into the AST pack.
Unfortunately, we still haven't reached the end of the complications and weird issues that haunt Shuusou Gyoku's music:
The original game reads the in-game track title directly out of the first Sequence Name event of the playing MIDI file. The waveform equivalent would be the Vorbis comment TITLE tag, which therefore should exactly match the original track's title, down to the exact placement of whitespace. As usual, if I emphasize minor things like this, it's not without reason: 幻想科学 ~ Doll's Phantom inconsistently uses halfwidth spaces at both sides of the ~, and wouldn't fit into the Music Room's limited space otherwise.
However, the AST MIDI files jam a bunch of other metadata into their Sequence Names, roughly following the format
【 $title 】 from 秋霜玉 for sc88Pro comp.ZUN
The track titles should definitely not appear in this format in-game, but how do we get rid of this format without hardcoding either the names or the magic to parse the names out of this format?
The absolute state of GS SysEx tooling rears its ugly head one final time in three of the AST MIDIs, which for some reason are missing the Roland vendor prefix byte in all of their SysEx messages and are therefore undeniably bugged. There even seemed to be another SysEx-related bug which Romantique Tp explained away, but not this one:
The irony of using invalid Reverb Macros within already invalid SysEx messages is not lost on me.
This is something we should fix even before running these files through Sound Canvas VA in order to render these with the reverb settings that ZUN clearly (and, for once, unironically) intended.
For perfect preservation of the original BGM/gameplay synchronicity, it makes sense for the waveform versions to retain the leading 1 or 2 beats of silence that the original MIDI files use for their SysEx setup. While some of the AST tracks use a slightly different tempo compared to their OST counterparts, they would still be largely in sync as ZUN didn't rearrange the layout of their setup area… except for, once again, the three tracks used in the Extra Stage. Marisa's and Reimu's boss themes aren't too bad with their 4 beats of setup, but シルクロードアリス takes the cake with a whopping 12 beats of leading silence. That's 5 seconds from the start of the Extra Stage to the first note you'd hear. 🐌
The issues with the AST Sequence Name format and the long leading silence could theoretically be worked around in Shuusou Gyoku's MIDI code, but there's no way around editing the MIDI files themselves as far as the broken SysEx messages are concerned. Thus, it makes sense to apply all of the workarounds to the AST MIDIs as part of the BGM build process – parsing the titles out of the 【brackets】, inserting the Roland vendor prefix byte where necessary, and compressing the setup bars in the Extra Stage themes to match their OST counterparts. Adding any hidden magic to the MIDI code would only have needlessly increased complexity and/or annoyed some modder in the future who would then have to work around it.
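For the title part of that build step, the parsing could look roughly like this. It's a sketch under a few assumptions: the Sequence Name has already been converted to UTF-8, only ASCII spaces are trimmed, and names without brackets are passed through unchanged. `title_from_sequence_name()` is a made-up name and not a function of any of the tools mentioned here.

```cpp
#include <cassert>
#include <string>

// Extracts the text between 【 and 】 and trims the surrounding ASCII spaces.
// Returns the input unchanged if it doesn't contain both brackets.
// (Assumption: `name` is encoded as UTF-8.)
std::string title_from_sequence_name(const std::string& name)
{
	const std::string open_bracket = "\xE3\x80\x90";  // 【 (U+3010)
	const std::string close_bracket = "\xE3\x80\x91"; // 】 (U+3011)

	const auto open = name.find(open_bracket);
	if(open == std::string::npos) {
		return name;
	}
	const auto start = (open + open_bracket.size());
	const auto close = name.find(close_bracket, start);
	if(close == std::string::npos) {
		return name;
	}
	assert(close >= start);

	const auto first = name.find_first_not_of(' ', start);
	if((first == std::string::npos) || (first >= close)) {
		return ""; // Nothing but spaces between the brackets
	}
	const auto last = name.find_last_not_of(' ', (close - 1));
	return name.substr(first, ((last - first) + 1));
}
```

Running this at build time rather than in-game keeps the game's MIDI code free of format-specific magic, in line with the reasoning above.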
Ideally, these edits would involve taking the mly dump output, performing the necessary replacements at a plaintext level, and rebuilding the result back into a MIDI file, bu~t we're unfortunately missing the latter feature. Luckily, someone else had the same idea 13 years ago and wrote a tool in C that does exactly what we need. Getting it to compile in 2024 only required fixing a typical C thing… why are students and boomers defending this antique of a language again? 🙄
The single most glaring issue, however, is the drastic difference in volume between the individual tracks in both soundtracks. While Romantique Tp had to normalize each track to the maximum possible volume individually as a consequence of the recording process, the Sound Canvas VA renderings reveal just how inconsistent the volume levels of these MIDI files really are:
The peak amplitudes of every track in both soundtracks, as rendered by Sound Canvas VA at maximum volume. Looking at these, you might think that kaorin's 2007 recordings were purposely trying to preserve the clipping that would come out of an SC-88Pro if you don't manually adjust the volume knob for each song, but those recordings are still much louder than even these numbers.
So how do we interpret this? Is this a bug, because no one in their right mind would want their music to clip on purpose, and that in turn means that everything about these volume levels is arbitrary and unintentional? Or is this a quirk, and ZUN deliberately chose these volume levels for compositional reasons? It certainly would make sense for the name registration theme.
Once again, the AST version of シルクロードアリス is the worst offender in this regard as well, but it might also provide some evidence for the quirk interpretation. The fact that almost all of its MIDI channels blast away at full volume might have been an accident that could have gone unnoticed if the volume knob of ZUN's SC-88Pro was turned rather low during the time he arranged this piece, but the excessive left-panning must have been deliberate. Even Romantique Tp agrees:
It might have even made compositional sense if Silk Road Alice was supposed to be a "Western-style piece", but it's not.
And that's with the volume already normalized. Because this one channel of this one track is almost twice as loud as anything else in the AST, we would consequently have to bring down the volume of every other arranged track and the right channel of the same track by almost 50% if we wanted to maintain the volume differences between the individual tracks of the AST. In the process, we lose almost one entire bit of dynamic range. At this rate, you might even consider remixing and remastering the entire thing, but that would involve so many creative decisions to definitely fall into fanfiction territory…
However, normalizing each track to a peak level of 0 dBFS makes much more sense for in-game playback if you consider how loud Shuusou Gyoku's sound effects are. Once again, the best solution would involve offering both versions, but should we really add two more SCVA BGM packs just to cover volume differences? ReplayGain solves this exact problem for regular music listening in a non-destructive way by writing the per-track and per-album gain levels into an audio file's metadata. Since we need metadata support for titles anyway, we can do something similar, albeit not exactly the same for two reasons:
ReplayGain is specified to target an average volume of −17 dBFS, whereas we'd like to target a peak volume of 0 dBFS in order to always use the entire available digital scale. We've got some loud sound effects to compete with, after all.
ReplayGain expresses its gain values in dB, which is cumbersome to work with. In the realm of PCM, volume changes don't need to involve more than a simple multiplication, so let's go with a simple scalar GAIN FACTOR.
And so, we hard-apply the album-level gain during the conversion from 32-bit float to FLAC to preserve the volume differences between the tracks, calculate the track-level GAIN FACTOR based on the resulting peak levels, add a volume normalization toggle to the Sound / Config menu, enable it by default, and thus make everyone happy. ✅
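In code, the two gain levels boil down to a handful of divisions. The following is a minimal sketch with made-up function names, assuming that peaks are measured as absolute sample values where 1.0 corresponds to 0 dBFS. The actual packaging pipeline might calculate these differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Highest absolute sample value of a track.
float peak_of(const std::vector<float>& samples)
{
	float peak = 0.0f;
	for(const float s : samples) {
		peak = std::max(peak, std::fabs(s));
	}
	return peak;
}

// Album-level gain: a single factor for the whole soundtrack that brings
// its loudest peak to 1.0 and keeps the volume differences between tracks.
float album_gain(const std::vector<float>& track_peaks)
{
	const float loudest = *std::max_element(
		track_peaks.begin(), track_peaks.end()
	);
	assert(loudest > 0.0f);
	return (1.0f / loudest);
}

// Track-level GAIN FACTOR: brings a single track's peak to 1.0, based on
// its peak level after the album-level gain was applied.
float track_gain_factor(float peak_after_album_gain)
{
	assert(peak_after_album_gain > 0.0f);
	return (1.0f / peak_after_album_gain);
}
```

With the normalization toggle enabled, playback would then multiply each sample by the track's factor. With the toggle disabled, the samples are used as stored, with only the album-level gain baked in.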
The final interesting tidbit in building these packages can be found in the way the Sound Canvas VA recordings are looped. When manually cutting loops, you always have to consider that the intro might end with unique notes that aren't present at the end of the loop, which will still be fading out at the calculated loop start point. This necessitates shifting the loop start point by a few bars until these notes are no longer audible – or you could simply ignore the issue because ZUN's compositions are so frantic that no one would ever notice.
With the separate intro and loop files generated by mly, on the other hand, the reverb/release trails are immediately visible and, after trimming trailing silence, exactly define the number of samples that the calculated loop start point needs to be shifted by. The .loop file then remains always exactly as long, in samples, as the duration of the loop reported by mly. If a piece happens to have a constant tempo whose beat duration corresponds to an integer number of samples, we get some very satisfying, round loop durations out of this process. ☺️
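My reading of this process, expressed as code: the rendered intro is trimmed, and whatever remains past the calculated loop start point is the required shift. Both functions below are hypothetical and treat only exact zero samples of a mono 16-bit signal as silence, which is a simplification. A dithered recording would need a threshold instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of samples that remain after trimming trailing silence.
// (Assumption: silence consists of exact zero samples.)
std::size_t trimmed_length(const std::vector<int16_t>& samples)
{
	std::size_t length = samples.size();
	while((length > 0) && (samples[length - 1] == 0)) {
		length--;
	}
	return length;
}

// Number of samples by which the loop start point has to move so that the
// reverb/release trail of the intro is no longer cut off.
std::size_t loop_start_shift(
	std::size_t trimmed_intro_length, std::size_t calculated_loop_start
)
{
	if(trimmed_intro_length <= calculated_loop_start) {
		return 0; // The intro has fully faded out before the loop starts
	}
	const std::size_t shift = (trimmed_intro_length - calculated_loop_start);
	assert(shift <= trimmed_intro_length);
	return shift;
}
```

The length of the loop itself is unaffected by the shift. Only its start and end points move by the same number of samples.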
So let's play it all back in-game… and immediately run into two unexpected miniaudio limitations, what the…?!
miniaudio uses a fixed linear function for its fade-out envelope, and doesn't offer anything else? We might not even want a logarithmic one this time because symmetry with MIDI's simple quadratic curve would be neat, but we sure don't want a linear function – those stay near the original volume for too long, and then turn quiet way too quickly.
There is no way to access FLAC metadata from miniaudio's public API, even though the library bundles the author's own FLAC library which has this feature?
📝 Back when I evaluated miniaudio, I alluded that I consider single-file C libraries to be massively overrated, and this is exactly why: Once they grow as massive as miniaudio (how ironic), they can quickly lead to their authors treating their dependencies as implementation details and melting down the interfaces that would naturally arise. In a regular library, dr_flac would be a separate, proper dependency, and the API would have a way to initialize a stream from an externally loaded drflac object. But since the C community collectively pretends that multi-file libraries are a burden on other developers, miniaudio ended up with dr_flac copy-pasted into its giant single file, with a silly ma_ namespacing prefix added to all its functions. And why? Did we have to move so far in the other direction just because CMake doesn't support globbing? That's a symptom of CMake not actually solving any problem, not a valid architectural decision that libraries should bend around. 🙄
So unless we fork and hack around in miniaudio, there's now no way around depending on a second, regular copy of dr_flac. Which has now led to the same project organization bloat that single-file libraries originally set out to prevent…
Sigh. At this rate, it makes more sense to just copy-paste and adapt the old BGM streaming code I wrote for thcrap in late 2018, which used dr_flac directly, and extend it with metadata support. With the streaming code moved out of the platform layer and into game logic, it also makes much more sense to implement the squared fade-out curve at that same level instead of copy-pasting and adjusting an unhealthy amount of miniaudio's verbose C code.
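The difference between the two envelopes is tiny in code. Here's a sketch of both as functions of the fade progress, with names made up for this example. The streaming code would multiply each output sample by the returned factor.

```cpp
#include <cassert>

// Volume factor of a linear fade-out. `progress` runs from 0.0 (start of
// the fade) to 1.0 (silence).
float fade_linear(float progress)
{
	assert((progress >= 0.0f) && (progress <= 1.0f));
	return (1.0f - progress);
}

// Volume factor of a squared fade-out, which drops faster at the beginning
// and approaches silence more gently.
float fade_squared(float progress)
{
	const float linear = fade_linear(progress);
	return (linear * linear);
}
```

Halfway through the fade, the linear curve is still at 50% of the original amplitude, while the squared one is already down to 25%.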
While I'm doing the same for the old Vorbis streaming code, it would also make sense to rewrite that one to use stb_vorbis instead of the old libogg+libvorbis reference libraries. There's no need to add two more dependencies if miniaudio already comes with stb_vorbis.c, and that library is widely acclaimed. So, integration should be a breeze, right?
Well, surprise, rarely have I seen a C library so actively hostile toward being integrated. Both of its API variants are completely unreasonable:
The pulldata API pulls Vorbis data as needed from either a memory buffer containing the entire Vorbis file, or a C FILE* handle.
Effectively, this forces either you to give up disk streaming completely, or your program into C's terrible I/O API with all its buffering slowness and Unicode issues on Windows. The documentation even goes on to suggest just modifying the code if you need anything else, which might be acceptable in the strange world of game development this library originates from, but it sure isn't in the kind of open-source development I do.
The pushdata API expects the caller to gradually feed chunks of Vorbis data. How large do these chunks have to be? Nobody knows – and, even worse, the API doesn't retain any of the data already pushed in. If the buffer you passed is too small, which you don't get to know in advance, you have to pass the same data plus more in the next call. I get that you might want an API like this to avoid dynamic memory allocations, but not only does this API perform plenty of allocations itself, it actively forces its caller to realloc() over and over again. 🙄 The lack of seeking support reveals that this API is geared towards live-streamed audio, and it might very well be acceptable in such a case, but it's nothing we could use for BGM.
What happened to the tried-and-true idea of providing a structure with read, tell, and seek callbacks, and then providing an optional variant for C FILE* handles if you absolutely must? Sure, the whole point of Vorbis is to be small and nobody these days would care about spending a few MB on keeping an entire Vorbis file in memory, but come on. If pulldata made the deliberate and opinionated choice to only support buffers of complete Vorbis streams and argued in the name of simplicity that hand-coded disk streaming isn't worth it in this day and age, I might have even been convinced. And this is from the guy who popularized the concept of single-file C libraries in the first place?
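For reference, this is roughly the kind of interface I mean, together with a memory-backed implementation. Everything here is a hypothetical sketch: `StreamCallbacks` and `MemoryStream` are invented for illustration and aren't taken from stb_vorbis, libvorbis, or miniaudio.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// A decoder would only ever talk to its input through these three functions.
struct StreamCallbacks {
	void* user;
	std::size_t (*read)(void* user, void* buf, std::size_t size);
	bool (*seek)(void* user, int64_t offset_from_start);
	int64_t (*tell)(void* user);
};

// One possible backend: a complete file that was loaded into memory.
struct MemoryStream {
	const uint8_t* data;
	std::size_t size;
	std::size_t cursor;
};

std::size_t mem_read(void* user, void* buf, std::size_t size)
{
	auto& s = *static_cast<MemoryStream*>(user);
	assert(s.cursor <= s.size);
	const std::size_t count = std::min(size, (s.size - s.cursor));
	std::memcpy(buf, (s.data + s.cursor), count);
	s.cursor += count;
	return count; // Can be less than `size` at the end of the stream
}

bool mem_seek(void* user, int64_t offset_from_start)
{
	auto& s = *static_cast<MemoryStream*>(user);
	if(
		(offset_from_start < 0) ||
		(static_cast<uint64_t>(offset_from_start) > s.size)
	) {
		return false;
	}
	s.cursor = static_cast<std::size_t>(offset_from_start);
	return true;
}

int64_t mem_tell(void* user)
{
	return static_cast<int64_t>(static_cast<MemoryStream*>(user)->cursor);
}

StreamCallbacks memory_callbacks(MemoryStream& stream)
{
	return StreamCallbacks{ &stream, mem_read, mem_seek, mem_tell };
}
```

A `FILE*` or Win32 `HANDLE` backend would fill in the same three function pointers, and the decoder wouldn't need to know the difference.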
Oh well, tupblocks go brrr. libvorbis definitely shows its age with all the old command-line tools in the lib/ directory that they never moved away and that we now have to remove from our glob. But even that just adds a single line to the Tupfile, and then we get to enjoy its much friendlier API. That sure beats the almost 800 lines of code that miniaudio had to write to integrate stb_vorbis… which I can't even link because the file is too big for GitHub. 🤷
At this point, it would have even made sense to upgrade from a 24-year-old lossy codec to an 11-year-old lossy codec and use Opus instead, since the enforced 48,000 Hz sampling rate is a non-issue when you control the entire audio pipeline. But let's keep compatibility with existing thcrap mods for now.
In the end, the Windows build ended up using only a single one of the miniaudio features that DirectSound doesn't have, and that's the ability to use the more modern WASAPI instead of DirectSound. We're still going to use miniaudio for the Linux port, but as far as Windows is concerned, it would be quite nice to backport BGM streaming to the game's original DirectSound backend. The P0275 build is pushing 1 MiB of binary size for a game that originally came in a 220 KiB binary, so it would remove a noticeable amount of bloat from GIAN07.EXE, but it would also allow waveform BGM to work in the Windows 98-compatible i586 build. If that sounds cool to you, this is the issue you want to fund.
That only left some logic and UI busywork to put it all together, which means that we've almost reached the end of things to talk about! Here's what it all looks like:
BGM pack selection is done in-game through a new submenu. The <Download> option will open the BGM pack release page in the system's preferred browser:
This window presented a great occasion for already implementing the generic boilerplate for vertically scrolling windows with an unlimited number of items. That will come in quite handy once we introduce better replay support… 👀
Even with per-track BGM volume normalization, Shuusou Gyoku's sound effects are still a bit too loud in comparison, especially when mixed on top of that excessively and unfixably left-panned AST version of the Extra Stage theme. Adding separate volume controls for BGM and sound effects really was the only sustainable solution here, and conveniently checks an important quality-of-life box the original game lacked. So important that it was the very first issue I added to the GitHub tracker of my fork:
I really wanted to have Japanese help text in these menus, as it makes them look just so much more consistent and polished. Many thanks to Elfin, who responded to my bounty offer, and will most likely also provide localizations for future features.
In-game music titles are now consistently right-aligned. Leading whitespace in 4 of the original MIDI Sequence Names suggests that pbg might have intended these titles to be centered within the 216 maximum pixels that the original code designated for music titles, but none of those 4 had the correct amount of spaces that would have been required for exact centering:
Right-aligned text matches the one certain intention I can read out of the code, and allows us to consistently trim whitespace from both the original MIDI Sequence Names and the TITLE tags in the BGM packs… at the cost of significantly changing the animation. 🤔
Maybe, all this whitespace had the explicit purpose of making the animation look the way it did originally? But hard-padding the title tags in the BGM packs would be so dumb… 😩 Let's keep it like this for now and fix the animation later.
At startup, the game now shows a new screen if any of the game's .DAT files are missing, displaying their expected absolute path. This is bound to be very important on Linux because each distribution might have its own idea of where these files are supposed to be stored. But even on Windows, this allows GIAN07.EXE to at least run and show something if one or more of these files are not present, instead of crashing at the first attempt of loading anything from them.
The ¥ instead of \ is, 📝 once again, a font issue. Good luck finding a font not named MS Gothic that looks good when rendered in this game…
On a more unfortunate note, I dropped the i586 build from this release. Visual Studio 2022's CRT implements the new filesystem and threading code using Win32 API functions that are only available on Vista or later and are not covered by the one ready-made KernelEx package I was able to find, so I couldn't easily test such a build on Windows 98 anymore. Resurrecting the i586 build would therefore involve additional platform abstraction layers that we wouldn't need otherwise. Writing them wouldn't be too expensive, but it only makes sense if there's actual demand. Backporting waveform BGM to DirectSound to restore feature parity would also be a good idea here, as it would avoid the need to litter the current code with #ifdefs at any place that references anything related to BGM packs.
After half a year of being bought out way past the cap, I've finally got some small room left for new orders again. If it weren't for this blog post and the required research and web development work, this delivery would have probably come out in early January, taking half the time it ended up taking. So I really have to start factoring the blog posts into the push prices in a better and fairer way.
Meanwhile, the hate toward my day job only keeps growing, but there's little point in looking for a new one as long as ReC98 remains this motivating and complex. It leaves pretty much no cognitive room for any similarly demanding job. Thus, I want 2024 to be the year where ReC98 either becomes profitable enough to be my only full-time job, or where we conclusively find out that it can't, I go look for a better day job, and ReC98 shifts to a slower pace. Here's the plan:
From now on, I will immediately increase the push price whenever we reach 100% of the cap, either directly through new orders or indirectly through existing subscriptions. The price increase will be relative to how long it took to reach that point since the last re-opening.
If the store continues selling out, I will aim for per push by the end of the year.
In exchange, microtransactions (i.e., deliveries containing just code and no blog posts) will now be half the price of regular pushes for the same amount of delivered code. Or in other words: If you want to fund a goal that's eligible for microtransactions, you can now decide whether your fixed amount of money goes to 2× coding work and 0× blogging, or 1× coding work and 1× blogging.
I'll permanently increase the default level of the cap from 8 to 10 pushes. The past 12 months were full of mod releases that raised the bar, and 2024 shows no signs of stopping that trend.
If we ever reach per push, I plan to hire people for some of the contribution-ideas or anything else that might improve this project. (Well-produced YouTube videos about the findings of this project might be a nice idea!) At that point, I will have reached my goal of living decently off this project alone, and it's time for others to make money in this space as well.
With the new price of per push, this means that there's now a small window in which you can get a full push worth of functionality for , until the current cap is filled up again.
Next up: Probably TH02's endings to relax a bit. Maybe we're also getting some new Touhou-related contributions?
P0264
TH03/TH04/TH05 decompilation (Music Rooms, part 1/2)
P0265
TH03/TH04/TH05 decompilation (Music Rooms, part 2/2 + MAINE.EXE main()) + TH02 PI/RE (Boss damage and position)
💰 Funded by:
Blue Bolt, [Anonymous], iruleatgames
🏷️ Tags:
Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.
But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!
As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.
The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:
Let's call this "the blank image".
That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right?
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:
The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.
On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.
Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
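To illustrate the stencil idea, here's a small model of it in portable C++. It is only a sketch under simplifying assumptions: VRAM is reduced to 8 bytes per plane, `Vram`, `color_at()`, and `render_frame()` are hypothetical names, and the polygons arrive as a ready-made bitmask instead of being rasterized by master.lib.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t PLANE_SIZE = 8; // Tiny plane, for illustration only

using Plane = std::array<uint8_t, PLANE_SIZE>;

struct Vram {
    // plane[0] is the first bitplane, which holds the lowest bit of every
    // pixel's color index.
    std::array<Plane, 4> plane{};
};

// Color index of the pixel at `bit` (7 = leftmost) within byte `offset`.
int color_at(const Vram& v, std::size_t offset, int bit) {
    int ret = 0;
    for(int p = 0; p < 4; p++) {
        ret |= (((v.plane[p][offset] >> bit) & 1) << p);
    }
    return ret;
}

// One frame of the animation: Restore the first bitplane from its captured
// original state, then OR the polygon pixels on top. The other three
// planes are never touched.
void render_frame(Vram& v, const Plane& nopoly_B, const Plane& polygon_mask) {
    for(std::size_t i = 0; i < PLANE_SIZE; i++) {
        v.plane[0][i] = (nopoly_B[i] | polygon_mask[i]);
    }
}
```

A pixel with the even color 2 turns into the odd color 3 wherever a polygon or a caption pixel sets its bit on the first plane, and goes back to 2 once the polygon has moved on.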
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:
// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);
// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);
// Regular GRCG blitting code to any VRAM segment…
pokeb(0xA800, offset, …);
// We're done, turn off the GRCG.
outportb(0x7C, 0x00);
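The two mode bytes above can also be derived instead of hardcoded. This is just a sketch: `grcg_mode_rmw()` and the two constants are hypothetical names, not identifiers from master.lib or ReC98.

```cpp
#include <cstdint>

constexpr uint8_t GRCG_ENABLE = 0x80;
constexpr uint8_t GRCG_RMW = 0x40;

// Builds the mode register value for RMW mode. `planes_to_write` has bit 0
// set for the first bitplane, bit 1 for the second one, and so on. Since
// the lowest four bits of the register *disable* planes, they are the
// inverse of that parameter.
constexpr uint8_t grcg_mode_rmw(uint8_t planes_to_write) {
    return static_cast<uint8_t>(
        GRCG_ENABLE | GRCG_RMW | (~planes_to_write & 0x0F)
    );
}
```

Here, `grcg_mode_rmw(0x0F)` yields the regular 0xC0, and `grcg_mode_rmw(0x01)` yields the 0xCE used in the Music Rooms.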
This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck?
An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.
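The difference between the three kinds of writes fits into a few lines when modeled on a single VRAM byte. This is a simplified sketch with hypothetical function names, and the RMW model only covers what matters here: bits set in the mask take their value from the tile register, and all other bits keep their previous value.

```cpp
#include <cstdint>

// Plain MOV without the GRCG: The bitmask replaces the entire byte.
constexpr uint8_t write_mov(uint8_t, uint8_t mask) {
    return mask;
}

// Simplified model of a GRCG RMW write to one enabled bitplane.
constexpr uint8_t write_grcg_rmw(uint8_t old, uint8_t mask, uint8_t tile) {
    return static_cast<uint8_t>((old & ~mask) | (tile & mask));
}

// The hypothetical non-GRCG variant with OR instead of MOV.
constexpr uint8_t write_or(uint8_t old, uint8_t mask) {
    return static_cast<uint8_t>(old | mask);
}
```

With a previous byte of 0xF0 and a polygon edge mask of 0x0C, the MOV variant leaves just 0x0C in VRAM and destroys the four left pixels, while both the RMW write with an all-1 tile and the OR variant end up with 0xFC.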
As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:
Colors 0 or 1 can't be used, because those don't include any of the bits that can stay constant between frames.
If the lowest bit of a palette color index has no effect on the displayed color, text drawn in either of the two colors won't be visually affected by the polygon animation and will always appear on top. TH04 and TH05 rely on this property with their colors 2/3, 4/5, and 6/7 being identical, but this would work in TH02 and TH03 as well.
But this doesn't apply to TH02 and TH03's palettes, so how do they do it? The secret: They simply include all text pixels in nopoly_B. This allows text to use any color with an odd palette index – the lowest bit then won't be affected by the polygons ORed into the first bitplane, and the other bitplanes remain unchanged.
TH04 is a curious case. Ostensibly, it seems to remove support for odd text colors, probably because the new 10-frame fade-in animation on the comment text would require at least the comment area in VRAM to be captured into nopoly_B on every one of the 10 frames. However, the initial pixels of the tracklist are still included in nopoly_B, which would allow those to still use any odd color in this game. ZUN only removed those from nopoly_B in TH05, where it had to be changed because that game lets you scroll and browse through multiple tracklists.
The contents of nopoly_B with each game's first track selected.
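The rules above can be condensed into a single predicate. Note that this is my reading of them and not code from any of the games: `text_stays_on_top()` is a hypothetical name, and the palette is modeled as 16 opaque values that only get compared for equality.

```cpp
#include <array>
#include <cstdint>

using Palette = std::array<uint32_t, 16>; // Any packed color format works

// Returns whether text drawn once in `color` keeps its look while polygons
// are ORed into the first bitplane on every frame.
bool text_stays_on_top(const Palette& pal, int color, bool text_in_nopoly_B) {
    const int even = (color & ~1);
    const int odd = (color | 1);

    // Colors 0 and 1 have no bits outside the first bitplane.
    if(even == 0) {
        return false;
    }
    // TH04/TH05 approach: Both colors of the pair look the same.
    if(pal[even] == pal[odd]) {
        return true;
    }
    // TH02/TH03 approach: Odd color whose lowest bit is part of nopoly_B
    // and therefore gets restored on every frame.
    return (text_in_nopoly_B && ((color & 1) != 0));
}
```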
Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:
Due to the polygon animation, the Music Room is one of the few double-buffered menus in PC-98 Touhou, rendering to both VRAM pages on alternate frames instead of using the other page to store a background image. Unfortunately though, this doesn't actually translate to tearing-free rendering because ZUN's initial implementation for TH02 mixed up the order of the required operations. You're supposed to first wait for the GDC's VSync interrupt and then, within the display's vertical blanking interval, write to the relevant I/O ports to flip the accessed and shown pages. Doing it the other way around and flipping as soon as you're finished with the last draw call of a frame means that you'll very likely hit a point where the (real or emulated) electron beam is still traveling across the screen. This ensures that there will be a tearing line somewhere on the screen on all but the fastest PC-98 models that can render an entire frame of the Music Room completely within the vertical blanking interval, causing the very issue that double-buffering was supposed to prevent.
ZUN only fixed this landmine in TH05.
The polygons have a fixed vertex count and radius depending on their index; everything else is randomized. They are also never reinitialized while OP.EXE is running – if you leave the Music Room and reenter it, they will continue animating from the same position.
TH02 and TH04 don't handle it at all, causing held keys to be processed again after about a second.
TH03 and TH05 correctly work around the quirk, at the usual cost of a 614.4 µs delay per frame. Except that the delay is actually twice as long in frames in which a previously held key is released, because this code is a mess.
But even in 2024, DOSBox-X is the only emulator that actually replicates this detail of real hardware. On anything else, keyboard input will behave as ZUN intended it to. At least I've now mentioned this once for every game, and can just link back to this blog post for the other menus we still have to go through, in case their game-specific behavior matches this one.
TH02 is the only game that
1) separately lists the stage and boss themes of the main game, rather than following the in-game order of appearance,
2) continues playing the selected track when leaving the Music Room,
3) always loads both MIDI and PMD versions, regardless of the currently selected mode, and
4) does not stop the currently playing track before loading the new one into the PMD and MMD drivers.
The combination of 2) and 3) allows you to leave the Music Room and change the music mode in the Option menu to listen to the same track in the other version, without the game changing back to the title screen theme. 4), however, might cause the PMD and MMD drivers to play garbage for a short while if the music data is loaded from a slow storage device that takes longer than a single period of the OPN timer to fill the driver's song buffer. Probably not worth mentioning anymore though, now that people no longer try fitting PC-98 Touhou games on floppy disks.
Exactly 40 (TH02/TH03) / 38 (TH04/TH05) visible bytes per line,
padded with 2 bytes that can hold a CR/LF newline sequence for easier editing.
Every track starts with a title line that mostly just duplicates the names from the hardcoded tracklist,
followed by a fixed 19 (TH02/TH03/TH04) / 9 (TH05) comment lines.
In TH04 and TH05, lines can start with a semicolon (;) to prevent them from being rendered. This is purely a performance hint, and is visually equivalent to filling the line with spaces.
All in all, the quality of the code is even slightly below the already poor standard for PC-98 Touhou: More VRAM page copies than necessary, conditional logic that is nested way too deeply, a distinct avoidance of state in favor of loops within loops, and – of course – a couple of gotos to jump around as needed.
In TH05, this gets so bad with the scrolling and game-changing tracklist that it all gives birth to a wonderfully obscure inconsistency: When pressing both ⬆️/⬇️ and ⬅️/➡️ at the same time, the game first processes the vertical input and then the horizontal one in the next frame, making it appear as if the latter takes precedence. Except when the cursor is highlighting the first (⬆️) or 12th (⬇️) element of the list, and said list element is not the first track (⬆️) or the quit option (⬇️), in which case the horizontal input is ignored.
And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier…
For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, Try to Normal Rank!! could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".
With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌
The remaining 1/6th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.
Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there!
Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.
This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!
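In code, the benefit of this ID order looks roughly like the following sketch. None of these names are taken from the actual decompilation, and the three ranges are only an assumed example of how phases could map to subranges of stones.

```cpp
#include <array>

constexpr int STONE_COUNT = 5;

// Separate arrays per piece of state, as in the original game.
struct Stones {
    std::array<int, STONE_COUNT> left{};
    std::array<int, STONE_COUNT> top{};
    std::array<int, STONE_COUNT> damage{};
};

// Half-open range of stone IDs, [begin, end).
struct StoneRange {
    int begin;
    int end;
};

// With IDs following the fight order, the active stones of a phase are
// always one contiguous range.
constexpr StoneRange STONES_INNER = { 0, 2 };
constexpr StoneRange STONES_OUTER = { 2, 4 };
constexpr StoneRange STONES_CENTER = { 4, 5 };

// Calls `func(id)` for every stone in the given range.
template <class Func> void for_each_stone(const StoneRange& range, Func func) {
    for(int id = range.begin; id < range.end; id++) {
        func(id);
    }
}
```

If the IDs followed the horizontal order on the playfield instead, the two outer stones would be 0 and 4, and a phase function would need a lookup table or special cases rather than a plain loop.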
These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just 1/6th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.
Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.
P0262
Decompilation (TH04/TH05 main/option menu)
P0263
Decompilation (TH04/TH05 first-launch sound setup menu + TH05 title screen animation)
💰 Funded by:
Blue Bolt, [Anonymous]
🏷️ Tags:
And once again, the Shuusou Gyoku task was too complex to be satisfyingly solved within a single month. Even just finding provably correct loop sections in both the original and arranged MIDI files required some rather involved detection algorithms. I could have just defined what sounded like correct loops, but the results of these algorithms were quite surprising indeed. Turns out that not even Seihou is safe from ZUN quirks, and some tracks technically loop much later than you'd think they do, or don't loop at all. And since I then wanted to put these MIDI loops back into the game to ensure perfect synchronization between the recordings and MIDI versions, I ended up rewriting basically all the MIDI code in a cross-platform way. This rewrite also uncovered a pbg bug that has traveled from Shuusou Gyoku into Windows Touhou, where it survived until ZUN ultimately removed all MIDI code in TH11 (!)…
Fortunately, the backlog still had enough general PC-98 Touhou funds that I could spend on picking some soon-important low-hanging fruit, giving me something to deliver for the end of the month after all. TH04 and TH05 use almost identical code for their main/option menus, so decompiling it would make number go up quite significantly and the associated blog post won't be that long…
Wait, what's this, a bug report from touhou-memories concerning the website?
Tab switchers tended to break on certain Firefox versions, and
video playback didn't work on Microsoft Edge at all?
Those are definitely some high-priority bugs that demand immediate attention.
The tab switcher issue was easily fixed by replacing the previous z-index trickery with a more robust solution involving the hidden attribute. The second one, however, is much more aggravating, because video playback on Edge has been broken ever since I 📝 switched the preferred video codec to AV1.
This goes so far beyond not supporting a specific codec. Usually, unsupported codecs aren't supposed to be an issue: As soon as you start using the HTML <video> tag, you'll learn that not every browser supports all codecs. And so you set up an encoding pipeline to serve each video in a mix of new and ancient formats, put the <source> tag of the most preferred codec first, and rest assured that browsers will fall back on the best-supported option as necessary. Except that Edge doesn't even try, and insists on staying on a non-playing AV1 video. 🙄
The codecs parameter for the <source> type attribute was the first potential solution I came across. Specifying the video codec down to the finest encoding details right in the HTML markup sounds like a good idea, similar to specifying sizes of images and videos to prevent layout reflows on long pages during the initial page load. So why was this the first time I heard of this feature? The fact that there isn't a simple ffprobe -show_html_codecs_string command to retrieve this string might already give a clue about how useful it is in practice. Instead, you have to manually piece the string together by grepping your way through all of a video's metadata…
…and then it still doesn't change anything about Edge's behavior, even when also specifying the string for the VP9 and VP8 sources. Calling the infamously ridiculous HTMLMediaElement.canPlayType() method with a representative parameter of "video/webm; codecs=av01.1.04M.08.0.000.01.13.00.0" explains why: Both the AV1-supporting Chrome and Edge return "probably", but only the former can actually play this format. 🤦
But wait, there is an AV1 video extension in the Microsoft Store that would add support to any unspecified favorite video app. Except that it stopped working inside Edge as of version 116. And even if it did: If you can't query the presence of this extension via JavaScript, it might as well not exist at all.
Not to mention that the favorite video app part is obviously a lie as a lot of widely preferred Windows video apps are bundled with their own codecs, and have probably long supported AV1.
In the end, there's no way around the utter desperation move of removing the AV1 <source> for Edge users. Serving each video in two other formats means that we can at least do something here – try visiting the GitHub release page of the P0234-1 TH01 Anniversary Edition build in Edge and you also don't get to see anything, because that video uses AV1 and GitHub understandably doesn't re-encode every uploaded video into a variety of old formats.
Just for comparison, I tried both that page and the ReC98 blog on an old Android 6 phone from 2014, and even that phone picked and played the AV1 videos with the latest available Chrome and Firefox versions. This was the phone whose available Firefox version didn't support VP9 in 2019, which was my initial reason for adding the VP8 versions. Looks like it's finally time to drop those… 🤔 Maybe in the far future once I start running out of space on this server.
Removing the <source> tags can be done in one of two places:
1) server-side, detecting Edge via the User-Agent header, or
2) client-side, in JavaScript.
I went with 2) because more dynamic server-side code would only move us further away from static site generation, which would make a lot of sense as the next evolutionary step in the architecture of this website. The client-side solution is much simpler too, and we can defer the deletion until a user actually hovers over a specific video.
And while we're at it, let's also add a popup complaining about this whole state of affairs. Edge is heavily marketed inside Windows as "the modern browser recommended by Microsoft", and you sure wouldn't expect low-quality chroma-subsampled VP9 from such a tagline. With such a level of anti-support for AV1, Edge users deserve to know exactly what's going on, especially since this post also explains what they will encounter on other websites.
That's the polite way of putting it.
Alright, where was I? For TH01, the main menu was the last thing I decompiled before the 100% finalization mark, so it's rather anticlimactic to already cover the TH04/TH05 one now, with both of the games still being very far away from 100%, just because people will soon want to translate the description text in the bottom-right corner of the screen. But then again, the ZUN Soft logo animation would make for an even nicer final piece of decompiled code, especially since the bouncing-ball logo from TH01, TH02, and TH03 was the very first decompilation I did, all the way back in 2015.
The code quality of ZUN's VRAM-based menus has barely increased between TH01 and TH05. Both the top-level and option menu still need to know the bounding rectangle of the other one to unblit the right pixels when switching between the two. And since ZUN sure loved hardcoded and copy-pasted numbers in the PC-98 days, the coordinates both tend to be excessively large, and excessively wrong. Luckily, each menu item comes with its own correct unblitting rectangle, which avoids any graphical glitches that would otherwise occur.
As for actual observable quirks and bugs, these menus only contain one of each, and both are exclusive to TH04:
Quitting out of the Music Room moves the cursor to the Start option. In TH05, it stays on Music Room.
Changing the S.E. mode seems to do nothing within TH04's menus, and would only take effect if you also change the Music mode afterward, or launch into the game.
And yes, these videos do have a frame rate of 2 FPS.
Now that 100% finalization of their OP.EXE binaries is within reach, all this bloat made me think about the viability of a 📝 single-executable build for TH04's and TH05's debloated and anniversary versions. It would be really nice to have such a build ready before I start working on the non-ASCII translations – not just because they will be based on the anniversary branch by default, but also because it would significantly help their development if there are 4 fewer executables to worry about.
However, it's not as simple for these games as it was for TH01. The unique code in their OP.EXE and MAINE.EXE binaries is much larger than Borland's easily removed C++ exception handler, so I'd have to remove a lot more bloat to keep the resulting single binary at or below the size of the original MAIN.EXE. But I'm sure going to try.
Speaking of code that can be debloated for great effect: The second push of this delivery focused on the first-launch sound setup menu, whose BGM and sound effect submenus are almost complete code duplicates of each other. The debloated branch could easily remove more than half of the code in there, yielding another ≈800 bytes in case we need them.
If hex-editing MIKO.CFG is more convenient for you than deleting that file, you can set its first byte to FF to re-trigger this menu. Decompiling this screen was not only relevant now because it contains text rendered with font ROM glyphs and it would help dig our way towards more important strings in the data segment, but also because of its visual style. I can imagine many potential mods that might want to use the same backgrounds and box graphics for their menus.
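For anyone who'd rather script the MIKO.CFG edit than open a hex editor, something like the following sketch would do. `set_first_byte_to_ff()` is a hypothetical helper, not part of ReC98, and it deliberately knows nothing else about the file's format.

```cpp
#include <filesystem>
#include <fstream>

// Overwrites the first byte of an existing file with 0xFF and leaves all
// other bytes untouched. Returns false if the file couldn't be opened or
// written.
bool set_first_byte_to_ff(const std::filesystem::path& cfg) {
    // in | out opens the file for updating without truncating it, and
    // fails if the file doesn't exist.
    std::fstream f(cfg, (std::ios::in | std::ios::out | std::ios::binary));
    if(!f) {
        return false;
    }
    f.seekp(0);
    f.put(static_cast<char>(0xFF));
    f.flush();
    return f.good();
}
```

Calling it with the path to MIKO.CFG inside the game directory should then bring the setup menu back on the next launch.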
How about an initial language selection menu in the same style?
With the two submenus being shown in a fixed sequence, there's not a lot of room for the code to do anything wrong, and it's even more identical between the two games than the main menu already was. Thankfully, ZUN just reblits the respective options in the new color when moving the cursor, with no 📝 palette tricks. TH04's background image only uses 7 colors, so he could have easily reserved 3 colors for that. In exchange, the TH05 image gets to use the full 16 colors with no change to the code.
Rounding out this delivery, we also got TH05's rolling Yin-Yang Orb animation before the title screen… and it's just more bloat and landmines on a smaller scale that might be noticeable on slower PC-98 models. In total, there are three unnecessary inter-page copies of the entire VRAM that can easily insert lag frames, and two minor page-switching landmines that can potentially lead to tearing on the first frame of the roll or fade animation. Clearly, ZUN did not have smoothness or code quality in mind there, as evidenced by the fact that this animation simply displays 8 .PI files in sequence. But hey, a short animation like this is 📝 another perfectly appropriate place for a quick-and-dirty solution if you develop with a deadline.
And that's 1.30% of all PC-98 Touhou code finalized in two pushes! We're slowly running out of these big shared pieces of ASM code…
I've been neglecting TH03's OP.EXE quite a bit since it simply doesn't contain any translatable plaintext outside the Music Room. All menu labels are gaiji, and even the character selection menu displays its monochrome character names using the 4-plane sprites from CHNAME.BFT. Splitting off half of its data into a separate .ASM file was more akin to getting out a jackhammer to free up the room in front of the third remaining Music Room, but now we're there, and I can decompile all three of them in a natural way, with all referenced data.
Next up, therefore: Doing just that, securing another important piece of text for the upcoming non-ASCII translations and delivering another big piece of easily finalized code. I'm going to work full-time on ReC98 for almost all of December, and delivering that and the Shuusou Gyoku SC-88Pro recording BGM back-to-back should free up about half of the slightly higher cap for this month.
TH03 finally passed 20% RE, and the newly decompiled code contains no
serious ZUN bugs! What a nice way to end the year.
There's only a single unlockable feature in TH03: Chiyuri and Yumemi as
playable characters, unlocked after a 1CC on any difficulty. Just like the
Extra Stages in TH04 and TH05, YUME.NEM contains a single
designated variable for this unlocked feature, making it trivial to craft a
fully unlocked score file without recording any high scores that others
would have to compete against. So, we can now put together a complete set
for all PC-98 Touhou games: 2021-12-27-Fully-unlocked-clean-score-files.zip
It would have been cool to set the randomly generated encryption keys in
these files to a fixed value so that they cancel out and end up not actually
encrypting the file. Too bad that TH03 also started feeding each encrypted
byte back into its stream cipher, which makes this impossible.
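To illustrate why ciphertext feedback rules out the "canceling key" idea, here is a toy sketch in modern C++. Both ciphers are hypothetical stand-ins made up for this example and are not TH03's actual algorithm; they only show the difference between a key stream that depends on the key alone and one that also depends on every previously encrypted byte.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy ciphers for illustration only; NOT TH03's real algorithm.

// Variant 1: the key stream depends only on the initial key. A key of 0
// leaves the data untouched.
std::vector<std::uint8_t> encrypt_plain(
    const std::vector<std::uint8_t>& in, std::uint8_t key
)
{
    std::vector<std::uint8_t> out;
    for(std::uint8_t byte : in) {
        out.push_back(static_cast<std::uint8_t>(byte ^ key));
    }
    return out;
}

// Variant 2: every encrypted byte is fed back into the key stream. Even with
// an initial key of 0, the key becomes nonzero as soon as the data is.
std::vector<std::uint8_t> encrypt_feedback(
    const std::vector<std::uint8_t>& in, std::uint8_t key
)
{
    std::vector<std::uint8_t> out;
    for(std::uint8_t byte : in) {
        const std::uint8_t encrypted = static_cast<std::uint8_t>(byte ^ key);
        out.push_back(encrypted);
        key = static_cast<std::uint8_t>(key + encrypted); // feedback step
    }
    return out;
}
```

With the first variant, a fixed key of 0 would yield an effectively unencrypted file. With the second, the output diverges from the input after the first nonzero byte, no matter which initial key is chosen.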
The main loading and saving code turned out to be the second-cleanest
implementation of a score file format in PC-98 Touhou, just behind TH02.
Only two of the YUME.NEM functions come with nonsensical
differences between OP.EXE and MAINL.EXE, rather
than 📝 all of them, as in TH01 or
📝 too many of them, as in TH04 and TH05. As
for the rest of the per-difficulty structure though… well, it quickly
becomes clear why this was the final score file format to be RE'd. The name,
score, and stage fields are directly stored in terms of the internal
REGI*.BFT sprite IDs used on the high score screen. TH03 also
stores 10 score digits for each place rather than the 9 possible ones, keeps
any leading 0 digits, and stores the letters of entered names in reverse
order… yeah, let's decompile the high score screen as well, for a full
understanding of why ZUN might have done all that. (Answer: For no reason at
all.)
And wow, what a breath of fresh air. It's surely not
good-code: The overlapping shadows resulting from using
a 24-pixel letterspacing with 32-pixel glyphs in the name column led ZUN to
do quite a lot of unnecessary and slightly confusing rendering work when
moving the cursor back and forth, and he even forgot about the EGC there.
But it's nowhere close to the level of jank we saw in
📝 TH01's high score menu last year. Good to
see that ZUN had learned a thing or two by his third game – especially when
it comes to storing the character map cursor in terms of a character ID,
and improving the layout of the character map:
That's almost a nicely regular grid there. With the question mark and the
double-wide SP, BS, and END options, the cursor
movement code only comes with a reasonable two exceptions, which are easily
handled. And while I didn't get this screen completely decompiled,
one additional push was enough to cover all important code there.
The only potential glitch on this screen is a result of ZUN's continued use
of binary-coded
decimal digits without any bounds check or cap. Like the in-game HUD
score display in TH04 and TH05, TH03's high score screen simply uses the
next glyph in the character set for the most significant digit of any score
above 1,000,000,000 points – in this case, the period. Still, it only
really gets bad at 8,000,000,000 points: Once the glyphs are
exhausted, the blitting function ends up accessing garbage data and filling
the entire screen with garbage pixels. For comparison though, the current world record
is 133,650,710 points, so good luck getting 8 billion in the first
place.
Next up: Starting 2022 with the long-awaited decompilation of TH01's Sariel
fight! Due to the 📝 recent price increase,
we now got a window in the cap that
is going to remain open until tomorrow, providing an early opportunity to
set a new priority after Sariel is done.
P0165
TH01 decompilation (Missiles, part 1/2 + large boss sprites, part 1/3)
P0166
TH01 decompilation (Large boss sprites, part 2/3)
P0167
TH01 decompilation (Large boss sprites, part 3/3 + Stage initialization + Defeat animation + Route selection)
💰 Funded by:
Ember2528
OK, TH01 missile bullets. Can we maybe have a well-behaved entity type,
without any weirdness? Just once?
Ehh, kinda. Apart from another 150 bytes wasted on unused structure members,
this code is indeed more on the low end in terms of overall jank. It does
become very obvious why dodging these missiles in the YuugenMagan, Mima, and
Elis fights feels so awful though: An unfair 46×46 pixel hitbox around
Reimu's center pixel, combined with the comeback of
📝 interlaced rendering, this time in every
stage. ZUN probably did this because missiles are the only 16×16 sprite in
TH01 that is blitted to unaligned X positions, which effectively ends up
touching a 32×16 area of VRAM per sprite.
But even if we assume VRAM writes to be the bottleneck here, it would
have been totally possible to render every missile in every frame at roughly
the same amount of CPU time that the original game uses for interlaced
rendering:
Note that all missile sprites only use two colors, white and green.
Instead of naively going with the usual four bitplanes, extract the
pixels drawn in each of the two used colors into their own bitplanes.
master.lib calls this the "tiny format".
Use the GRCG to draw these two bitplanes in the intended white and green
colors, halving the amount of VRAM writes compared to the original
function.
(Not using the .PTN format would have also avoided the inconsistency of
storing the missile sprites in boss-specific sprite slots.)
That's an optimization that would have significantly benefitted the game, in
contrast to all of the fake ones
introduced in later games. Then again, this optimization is
actually something that the later games do, and it might have in fact been
necessary to achieve their higher bullet counts without significant
slowdown.
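The color extraction step of the optimization described above could be sketched as follows, in modern C++ rather than Turbo C++. The type names, the plane order, and the bit order are assumptions made for this example; the only thing taken from the text is the idea of turning a 4-plane sprite into one 1bpp bitplane per used color.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int SPRITE_H = 16;
constexpr int PLANE_COUNT = 4;

// One 16-pixel row per plane. Assumption: bit 15 is the leftmost pixel, and
// plane N contributes bit N of each pixel's 4-bit color.
using Planar4 = std::array<std::array<std::uint16_t, SPRITE_H>, PLANE_COUNT>;
using Mask = std::array<std::uint16_t, SPRITE_H>;

// Returns a 1bpp bitplane with a bit set wherever the pixel's color equals
// `color`.
Mask extract_color(const Planar4& sprite, unsigned color)
{
    Mask mask{};
    for(int y = 0; y < SPRITE_H; y++) {
        std::uint16_t row = 0xFFFF;
        for(int plane = 0; plane < PLANE_COUNT; plane++) {
            const std::uint16_t bits = sprite[plane][y];
            // Keep only pixels whose bit in this plane matches `color`.
            row &= (
                ((color >> plane) & 1) ? bits : static_cast<std::uint16_t>(~bits)
            );
        }
        mask[y] = row;
    }
    return mask;
}
```

Calling this once per used color at load time would leave two bitplanes per sprite. Blitting would then need two GRCG-colored writes per row instead of one write per row for each of the four planes.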
After some effectively unused Mima sprite effect code that is so broken that
it's impossible to make sense out of it, we get to the final feature I
wanted to cover for all bosses in parallel before returning to Sariel: The
separate sprite background storage for moving or animated boss sprites in
the Mima, Elis, and Sariel fights. But, uh… why is this necessary to begin
with? Doesn't TH01 already reserve the other VRAM page for backgrounds?
Well, these sprites are quite big, and ZUN didn't want to blit them from
main memory on every frame. After all, TH01 and TH02 had a minimum required
clock speed of 33 MHz, half of the speed required for the later three games.
So, he simply blitted these boss sprites to both VRAM pages, leading
the usual unblitting calls to only remove the other sprites on top of the
boss. However, these bosses themselves want to move across the screen…
and this makes it necessary to save the stage background behind them
in some other way.
Enter .PTN, and its functions to capture a 16×16 or 32×32 square from VRAM
into a sprite slot. No problem with that approach in theory, as the size of
all these bigger sprites is a multiple of 32×32; splitting a larger sprite
into these smaller 32×32 chunks makes the code look just a little bit clumsy
(and, of course, slower).
But somewhere during the development of Mima's fight, ZUN apparently forgot
that those sprite backgrounds existed. And once Mima's 🚫 casting sprite is
blitted on top of her regular sprite, using just regular sprite
transparency, she ends up with her infamous third arm:
Ironically, there's an unused code path in Mima's unblit function where ZUN
assumes a height of 48 pixels for Mima's animation sprites rather than the
actual 64. This leads to even clumsier .PTN function calls for the bottom
128×16 pixels… Failing to unblit the bottom 16 pixels would have also
yielded that third arm, although it wouldn't have looked as natural. Still
wouldn't say that it was intentional; maybe this casting sprite was just
added pretty late in the game's development?
So, mission accomplished, Sariel unblocked… at 2¼ pushes. That's quite some time left for some smaller stage initialization
code, which bundles a bunch of random function calls in places where they
logically really don't belong. The stage opening animation then adds a bunch
of VRAM inter-page copies that are not only redundant but can't even be
understood without knowing the hidden internal state of the last VRAM page
accessed by previous ZUN code…
In better news though: Turbo C++ 4.0 really doesn't seem to have any
complexity limit on inlining arithmetic expressions, as long as they only
operate on compile-time constants. That's how we get macro-free,
compile-time Shift-JIS to JIS X 0208 conversion of the individual code
points in the 東方★靈異伝 string, in a compiler from 1994. As long as you
don't store any intermediate results in variables, that is…
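For readers who want to see what such a conversion involves, here is a sketch of the well-known arithmetic Shift-JIS to JIS X 0208 conversion as a single expression without intermediate variables. It is written as a modern C++ `constexpr` function, which Turbo C++ 4.0 obviously doesn't support, and it is not the code from the ReC98 repository; the function name is made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Converts one double-byte Shift-JIS code point to JIS X 0208. Only valid for
// lead bytes 0x81-0x9F and 0xE0-0xEF with a valid trail byte; no error
// checking is done.
constexpr std::uint16_t sjis_to_jis(std::uint16_t sjis)
{
    return static_cast<std::uint16_t>(
        (
            (
                // Row: two JIS rows are packed into each Shift-JIS lead byte.
                (((sjis >> 8) - (((sjis >> 8) < 0xA0) ? 0x70 : 0xB0)) << 1)
                - (((sjis & 0xFF) < 0x9F) ? 1 : 0)
            ) << 8
        ) | (
            // Cell: undo the trail byte offset, which skips 0x7F.
            (sjis & 0xFF) - (
                ((sjis & 0xFF) < 0x9F)
                    ? (((sjis & 0xFF) > 0x7F) ? 0x20 : 0x1F)
                    : 0x7E
            )
        )
    );
}

// 0x938C is the first kanji of the title string, 0x819A is the star.
static_assert(sjis_to_jis(0x938C) == 0x456C, "kanji conversion");
static_assert(sjis_to_jis(0x819A) == 0x217A, "symbol conversion");
```

Since the whole function is one arithmetic expression over its argument, any compiler that folds constants aggressively enough can evaluate it at compile time when given a literal.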
But wait, there's more! With still ¼ of a push left, I also went for the
boss defeat animation, which includes the route selection after the SinGyoku
fight.
As in all other instances, the 2× scaled font is accomplished by first
rendering the text at regular 1× resolution to the other, invisible VRAM
page, and then scaled from there to the visible one. However, the route
selection is unique in that its scaled text is both drawn transparently on
top of the stage background (not onto a black one), and can also change
colors depending on the selection. It would have been no problem to unblit
and reblit the text by rendering the 1× version to a position on the
invisible VRAM page that isn't covered by the 2× version on the visible one,
but ZUN (needlessly) clears the invisible page before rendering any text.
Instead, he assigned a separate VRAM color for both
the 魔界 and 地獄 options, and only changed the palette value for
these colors to white or gray, depending on the correct selection. This is
another one of the
📝 rare cases where TH01 demonstrates good use of PC-98 hardware,
as the 魔界へ and 地獄へ strings don't need to be reblitted during the selection process, only the Orb "cursor" does.
Then, why does this still not count as good-code? When
changing palette colors, you kinda need to be aware of everything
else that can possibly be on screen, which colors are used there, and which
aren't and can therefore be used for such an effect without affecting other
sprites. In this case, well… hover over the image below, and notice how
Reimu's hair and the bomb sprites in the HUD light up when Makai is
selected:
This push did end on a high note though, with the generic, non-SinGyoku
version of the defeat animation being an easily parametrizable copy. And
that's how you decompile another 2.58% of TH01 in just slightly over three
pushes.
Now, we're not only ready to decompile Sariel, but also Kikuri, Elis, and
SinGyoku without needing any more detours into non-boss code. Thanks to the
current TH01 funding subscriptions, I can plan to cover most, if not all, of
Sariel in a single push series, but the currently 3 pending pushes probably
won't suffice for Sariel's 8.10% of all remaining code in TH01. We've got
quite a lot of not specifically TH01-related funds in the backlog to pass
the time though.
Due to recent developments, it actually makes quite a lot of sense to take a
break from TH01: spaztron64 has
managed what every Touhou download site so far has failed to do: Bundling
all 5 games onto a single .HDI together with pre-configured PC-98
emulators and a nice boot menu, and hosting the resulting package on a
proper website. While this first release is already quite good (and much
better than my attempt from 2014), there is still a bit of room for
improvement to be gained from specific ReC98 research. Next up,
therefore:
Researching how TH04 and TH05 use EMS memory, together with the cause
behind TH04's crash in Stage 5 when playing as Reimu without an EMS driver
loaded, and
reverse-engineering TH03's score data file format
(YUME.NEM), which hopefully also comes with a way of building a
file that unlocks all characters without any high scores.
P0135
Separating translation units, part 6/10 (TH05 PMD loading / Music Room piano)
P0136
Separating translation units, part 7/10 (starting to catch up with TH04)
💰 Funded by:
[Anonymous]
Alright, no more big code maintenance tasks that absolutely need to be
done right now. Time to really focus on parts 6 and 7 of repaying
technical debt, right? Except that we don't get to speed up just yet, as
TH05's barely decompilable PMD file loading function is rather…
complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98
Touhou, I first consult the disassembly of Wolfenstein 3D. That game was
originally compiled with the quite similar Borland C++ 3.0, so it's quite
helpful to compare its ASM to the
officially released source
code. If I find the instructions in question, they mostly come from
that game's ASM code, leading to the amusing realization that "even John
Carmack was unable to get these instructions out of this compiler".
This time though, Wolfenstein 3D did point me
to Borland's intrinsics for common C functions like memcpy()
and strchr(), available via #pragma intrinsic.
Bu~t those unfortunately still generate worse code than what ZUN
micro-optimized here. Commenting how these sequences of instructions
should look in C is unfortunately all I could do here.
The conditional branches in this function did compile quite nicely
though, clarifying the control flow, and clearly exposing a ZUN
bug: TH05's snd_load() will hang in an infinite loop when
trying to load a non-existing -86 BGM file (with a .M2
extension) if the corresponding -26 BGM file (with a .M
extension) doesn't exist either.
Unsurprisingly, the PMD channel monitoring code in TH05's Music Room
remains undecompilable outside the two most "high-level" initialization
and rendering functions. And it's not because there's data in the
middle of the code segment – that would have actually been possible with
some #pragmas to ensure that the data and code segments have
the same name. As soon as the SI and DI registers are referenced
anywhere, Turbo C++ insists on emitting prolog code to save these
on the stack at the beginning of the function, and epilog code to restore
them from there before returning.
Found that out in
September 2019, and confirmed that there's no way around it. All the
small helper functions here are quite simply too optimized, throwing away
any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate
that I do try.
Within that same 6th push though, we've finally reached the one function
in TH05 that was blocking further progress in TH04, allowing that game
to finally catch up with the others in terms of separated translation
units. Feels good to finally delete more of those .ASM files we've
decompiled a while ago… finally!
But since that was just getting started, the most satisfying development
in both of these pushes actually came from some more experiments with
macros and inline functions for near-ASM code. By adding
"unused" dummy parameters for all relevant registers, the exact input
registers are made more explicit, which might help future port authors who
then maybe wouldn't have to look them up in an x86 instruction
reference quite as often. At its best, this even allows us to
declare certain functions with the __fastcall convention and
express their parameter lists as regular C, with no additional
pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even
more amazing than previously thought when it comes to returning
pseudo-registers from inline functions. A nice example for
how this can improve readability can be found in this piece of TH02 code
for polling the PC-98 keyboard state using a BIOS interrupt:
inline uint8_t keygroup_sense(uint8_t group) {
    _AL = group;
    _AH = 0x04;
    geninterrupt(0x18);
    // This turns the output register of this BIOS call into the return value
    // of this function. Surprisingly enough, this does *not* naively generate
    // the `MOV AL, AH` instruction you might expect here!
    return _AH;
}

void input_sense(void)
{
    // As a result, this assignment becomes `_AH = _AH`, which Turbo C++
    // never emits as such, giving us only the three instructions we need.
    _AH = keygroup_sense(8);

    // Whereas this one gives us the one additional `MOV BH, AH` instruction
    // we'd expect, and nothing more.
    _BH = keygroup_sense(7);

    // And now it's obvious what both of these registers contain, from just
    // the assignments above.
    if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
        key_det |= INPUT_UP;
    }
    // […]
}
I love it. No inline assembly, as close to idiomatic C code as something
like this is going to get, yet still compiling into the minimum possible
number of x86 instructions on even a 1994 compiler. This is how I keep
this project interesting for myself during chores like these.
We might have even reached peak
inline already?
And that's 65% of technical debt in the SHARED segment repaid
so far. Next up: Two more of these, which might already complete that
segment? Finally!
P0124
TH04 decompilation (Character selection, part 1/2)
P0125
TH04 decompilation (Character selection, part 2/2)
💰 Funded by:
Blue Bolt, [Anonymous]
Turns out that TH04's player selection menu is exactly three times as
complicated as TH05's. Two screens for character and shot type rather than
one, and a way more intricate implementation for saving and restoring the
background behind the raised top and left edges of a character picture
when moving the cursor between Reimu and Marisa. TH04 decides to back up
precisely only the two 256×8 (top) and 8×244 (left) strips behind the
edges, indicated in red in the picture
below.
These take up just 4 KB of heap memory… but require custom blitting
functions, and expanding this explicitly hardcoded approach to TH05's 4
characters would have been pretty annoying. So, rather than, uh, not
explicitly hardcoding it all, ZUN decided to just be lazy with the backup
area in TH05, saving the entire 640×400 screen, and thus spending 128 KB
of heap memory on this rather simple selection shadow effect.
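The memory figures above can be checked with a bit of arithmetic. This sketch assumes the usual 4 bitplanes at 1 bit per pixel and plane; the helper name is made up for this example.

```cpp
#include <cassert>

constexpr long PLANE_COUNT = 4;

// Bytes needed to back up a w×h pixel area across all bitplanes, assuming
// that w is a multiple of 8.
constexpr long planar_bytes(long w, long h)
{
    return ((w / 8) * h * PLANE_COUNT);
}

constexpr long STRIP_TOP = planar_bytes(256, 8);     // 1,024 bytes
constexpr long STRIP_LEFT = planar_bytes(8, 244);    //   976 bytes
constexpr long FULL_SCREEN = planar_bytes(640, 400); // 128,000 bytes

static_assert((STRIP_TOP + STRIP_LEFT) == 2000, "one set of both strips");
static_assert(FULL_SCREEN == 128000, "entire 640x400 screen");
```

Under these assumptions, one top strip plus one left strip come out to 2,000 bytes, so the 4 KB quoted above would be consistent with two such sets, although the text doesn't spell out how the game arranges them. The full-screen backup is 64 times as large as a single set.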
So, this really wasn't something to quickly get done during the first half
of a push, even after already having done TH05's equivalent of this menu.
But since life is very busy right now, I also used the occasion to start
addressing another code organization annoyance: master.lib's single master.h header file.
Now that ReC98 is trying to develop (or at least mimic) a more
type-safe C++ foundation to model the PC-98 hardware, a pure C header
(with counter-productive C++ extensions) is becoming increasingly
unidiomatic. By moving some of the original assumptions about function
parameters into the type system, we can also reduce the reliance on its
Japanese-only documentation without having to translate it.
master.h is also quite bloated, with at least 2800 lines of code that
currently are #included into the vast majority of files, not
counting master.h's recursively included C standard library
headers. PC-98 Touhou only makes direct use of a rather small fraction of
its contents.
And finally, all the DOS/V compatibility definitions are especially
useless in the context of ReC98. As I've noted
📝 time and
📝 time again, porting PC-98 Touhou to
IBM-compatible DOS won't be easy, and MASTER_DOSV won't be
helping much. Therefore, my upstream version of ReC98 will never include
all of master.lib. There's no point in lengthening compile times for
everyone by default, and those will be getting quite noticeable
after moving to a full 16-bit build process.
(Actually, what retro system ports should rather be doing: Get rid
of master.lib's original ASM code, replace it with
readable, modern
C++, and then simply convert the optimized assembly output of modern
compilers to your ISA of choice. Improving the landscape of such
assembly or object file converters would benefit everyone!)
So, time to start a new master.hpp header that would contain
just the declarations from master.h that PC-98 Touhou
actually needs, plus some semantic (yes, semantic) sugar. Comparing just
the old master.h to just the new master.hpp
after roughly 60% of the transition has been completed, we get median
build times of 319 ms for master.h, and 144 ms for
master.hpp on my (admittedly rather slow) DOSBox setup.
Nice!
As of this push, ReC98 consists of 107 translation units that have to be
compiled with Turbo C++ 4.0J. Fully rebuilding all of these currently
takes roughly 37.5 seconds in DOSBox. After the transition to
master.hpp is done, we could therefore shave some 10 to 15
seconds off this time, simply by switching header files. And that's just
the beginning, as this will also pave the way for further
#include optimizations. Life in this codebase will be great!
Unfortunately, there wasn't enough time to repay some of the actual
technical debt I was looking forward to, after all of this. Oh well, at
least we now also have nice identifiers for the three different boldface
options that are used when rendering text to VRAM, after procrastinating
that issue for almost 11 months. Next up, assuming the existing
subscriptions: More ridiculous decompilations of things that definitely
weren't originally written in C, and a big blocker in TH03's
MAIN.EXE.
So, TH05 OP.EXE. The first half of this push started out
nicely, with an easy decompilation of the entire player character
selection menu. Typical ZUN quality, with not much to say about it. While
the overall function structure is identical to its TH04 counterpart, the
two games only really share small snippets inside these functions, and do
need to be RE'd separately.
The high score viewing (not registration) menu would have been next.
Unfortunately, it calls one of the GENSOU.SCR loading
functions… which are all a complete mess that still needed to be sorted
out first. 5 distinct functions in 6 binaries, and of course TH05 also
micro-optimized its MAIN.EXE version to directly use the DOS
INT 21h file loading API instead of master.lib's wrappers.
Could have all been avoided with a single method on the score data
structure, taking a player character ID and a difficulty level as
parameters…
So, no score menu in this push then. Looking at the other end of the ASM
code though, we find the starting functions for the main game, the Extra
Stage, and the demo replays, which did fit perfectly to round out
this push.
Which is where we find an easter egg! 🥚 If you've ever looked into
怪綺談2.DAT, you might have noticed 6 .REC files
with replays for the Demo Play mode. However, the game only ever seems to
cycle between 4 replays. So what's in the other two, and why are they
40 KB instead of just 10 KB like the others? Turns out that they
combine into a full Extra Stage Clear replay with Mima, with 3 bombs and 1
death, obviously recorded by ZUN himself. The split into two files for the
stage (DEMO4.REC) and boss (DEMO5.REC) portion is
merely an attempt to limit the amount of simultaneously allocated heap
memory.
To watch this replay without modding the game, unlock the Extra Stage with
all 4 characters, then hold both the ⬅️ left and ➡️ right arrow keys in the
main menu while waiting for the usual demo replay.
I can't possibly be the first one to discover this, but I couldn't find
any other mention of it. Edit (2021-03-15): ZUN did in fact document this replay
in Section 6 of TH05's OMAKE.TXT, along with the exact method
to view it.
Thanks to Popfan for the discovery!
Here's a recording of the whole replay:
Note how the boss dialogue is skipped. MAIN.EXE actually
contains no less than 6 if() branches just to distinguish
this overly long replay from the regular ones.
I'd really like to do the TH04 and TH05 main menus in parallel, since we
can expect a bit more shared code after all the initial differences.
Therefore, I'm going to put the next "anything" push towards covering the
TH04 version of those functions. Next up though, it's back to TH01, with
more redundant image format code…
Finally, after a long while, we've got two pushes with barely anything to
talk about! Continuing the road towards 100% PI for TH05, these were
exactly the two pushes that TH05 MAINE.EXE PI was estimated
to additionally cost, relative to TH04's. Consequently, they mostly went
to TH05's unique data structures in the ending cutscenes, the score name
registration menu, and the
staff roll.
A unique feature in there is TH05's support for automatic text color
changes in its ending scripts, based on the first full-width Shift-JIS
codepoint in a line. The \c=codepoint,color
commands at the top of the _ED??.TXT set up exactly this
codepoint→color mapping. As far as I can tell, TH05 is the only Touhou
game with a feature like this – even the Windows Touhou games went back to
manually spelling out each color change.
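A codepoint→color mapping like this boils down to a small lookup table. The following is a hypothetical sketch in modern C++, not the game's code: the names are made up, parsing of the `\c=` command itself is left out, and it simplifies matters by assuming that the relevant full-width codepoint sits in the first two bytes of the line.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Maps a double-byte Shift-JIS codepoint to a color index.
using ColorMap = std::map<std::uint16_t, int>;

// Returns the text color for `line`, based on its first two bytes read as one
// big-endian double-byte codepoint. Falls back to `default_color` if the line
// is too short or the codepoint has no mapping.
int line_color(
    const ColorMap& colors, const std::string& line, int default_color
)
{
    if(line.size() < 2) {
        return default_color;
    }
    const std::uint16_t codepoint = static_cast<std::uint16_t>(
        (static_cast<unsigned char>(line[0]) << 8) |
        static_cast<unsigned char>(line[1])
    );
    const auto it = colors.find(codepoint);
    return ((it != colors.end()) ? it->second : default_color);
}
```

A script interpreter would fill the map once while processing the commands at the top of the file, and then call something like `line_color()` at the start of every text line.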
The orb particles in TH05's staff roll also try to be a bit unique by
using 32-bit X and Y subpixel variables for their current position. With
still just 4 fractional bits, I can't really tell yet whether the extended
range was actually necessary. Maybe due to how the "camera scrolling"
through "space" was implemented? All other entities were pretty much the
usual fare, though.
12.4, 4.4, and now a 28.4 fixed-point format… yup,
📝 C++ templates were
definitely the right choice.
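To show what such a template buys you, here is a minimal sketch of an x.4 fixed-point type that is parametrized over its storage type. This is modern C++ with made-up names, not the actual ReC98 subpixel class.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical x.4 fixed-point type: 4 fractional bits, with the number of
// integer bits depending on the storage type T.
template <class T> struct Subpixel {
    static constexpr int FRACTION_BITS = 4;

    T v; // raw value, in units of 1/16 pixel

    static constexpr Subpixel from_pixel(T pixel) {
        return Subpixel{ static_cast<T>(pixel * (1 << FRACTION_BITS)) };
    }

    // Truncates toward zero.
    constexpr T to_pixel() const {
        return static_cast<T>(v / (1 << FRACTION_BITS));
    }
};

using Subpixel8 = Subpixel<std::int8_t>;   //  4.4
using Subpixel16 = Subpixel<std::int16_t>; // 12.4
using Subpixel32 = Subpixel<std::int32_t>; // 28.4
```

All three formats share the same conversion code, and supporting a new width is a one-line type alias.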
At the end of its staff roll, TH05 not only displays
the usual performance
verdict, but then scrolls in the scores at the end of each stage
before switching to the high score menu. The simplest way to smoothly
scroll between two full screens on a PC-98 involves a separate bitmap…
which is exactly what TH05 does here, reserving 28,160 bytes of its global
data segment for just one overly large monochrome 320×704 bitmap where
both the screens are rendered to. That's… one benefit of splitting your
game into multiple executables, I guess?
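The byte count follows directly from the bitmap dimensions. As a quick check, assuming 1 bit per pixel and the 64 KiB size limit of a 16-bit x86 segment (helper names are made up for this example):

```cpp
#include <cassert>

// Bytes needed for a 1bpp bitmap whose width is a multiple of 8 pixels.
constexpr long monochrome_bytes(long width, long height)
{
    return ((width / 8) * height);
}

// Share of a single 64 KiB (65,536-byte) segment, in percent, rounded down.
constexpr long segment_percent(long bytes)
{
    return ((bytes * 100) / 65536);
}

static_assert(monochrome_bytes(320, 704) == 28160, "matches the quoted size");
```

That single bitmap therefore takes up roughly 43% of the space a data segment can hold at most.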
Not sure if it's common knowledge that you can actually scroll back and
forth between the two screens with the Up and Down keys before moving to
the score menu. I surely didn't know that before. But it makes sense –
might as well get the most out of that memory.
The necessary groundwork for all of this may have actually made
TH04's (yes, TH04's) MAINE.EXE technically
position-independent. Didn't quite reach the same goal for TH05's – but
what we did reach is ⅔ of all PC-98 Touhou code now being
position-independent! Next up: Celebrating even more milestones, as
-Tom- is about to finish development on his TH05
MAIN.EXE PI demo…
P0092
TH01 decompilation (Score menu, part 2)
P0093
TH01 decompilation (Score menu, part 3)
P0094
TH01 decompilation (Score menu, part 4 + Endings, part 1)
💰 Funded by:
Yanga, Ember2528
Three pushes to decompile the TH01 high score menu… because it's
completely terrible, and needlessly complicated in pretty much every
aspect:
Another, final set of differences between the REIIDEN.EXE
and FUUIN.EXE versions of the code. Which are so
insignificant that it must mean that ZUN kept this code in two
separate, manually and imperfectly synced files. The REIIDEN.EXE
version, only shown when game-overing, automatically jumps to the
enter/終 button after the 8th character was entered,
and also has a completely invisible timeout that force-enters a high score
name after 1000… key presses? Not frames? Why. Like, how do you
even realistically reach such a number. (Best guess: It's a hidden easter egg to
amuse players who place drinking glasses on cursor keys. Or beer bottles.)
That's all the differences that are maybe visible if you squint
hard enough. On top of that though, we got a bunch of further, minor code
organization differences that serve no purpose other than to waste
decompilation time, and certainly did their part in stretching this out to
3 pushes instead of 2.
Entered names are restricted to a set of 16-bit, full-width Shift-JIS
codepoints, yet are still accessed as 8-bit byte arrays everywhere. This
bloats both the C++ and generated ASM code with needless byte splits,
swaps, and bit shifts. Same for the route kanji. You have this 16-, heck,
even 32-bit CPU, why not use it?! (Fun fact: FUUIN.EXE is
explicitly compiled for an 80186, for the most part – unlike
REIIDEN.EXE, which does use Turbo C++'s 80386 mode.)
The sensible way of storing the current position of the alphabet
cursor would simply be two variables, indicating the logical row and
column inside the character map. When rendering, you'd then transform
these into screen space. This can keep the on-screen position constants in
a single place of code.
TH01 does the opposite: The selected character is stored directly in terms
of its on-screen position, which is then mapped back to a character
index for every processed input and the subsequent screen update. There's
no notion of a logical row or column anywhere, and consequently, the
position constants are vomited all over the code.
Which might not be as bad if the character map had a uniform
grid structure, with no gaps. But the one in TH01 looks like this:
And with no sense of abstraction anywhere, both input handling and
rendering end up with a separate if branch for at least 4 of
the 6 rows.
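The "sensible way" of storing the cursor described above could look roughly like the following sketch. All names and numbers are hypothetical placeholders for a gap-free grid, not TH01's actual layout; the point is only that the screen position constants appear in exactly one function.

```cpp
#include <cassert>

// Hypothetical layout constants, not taken from the game.
constexpr int MAP_COLS = 17;
constexpr int MAP_ROWS = 6;
constexpr int MAP_LEFT = 32; // screen X of column 0
constexpr int MAP_TOP = 256; // screen Y of row 0
constexpr int CELL_W = 32;
constexpr int CELL_H = 24;

struct Cursor { int col; int row; };
struct ScreenPoint { int x; int y; };

// The single place where logical coordinates turn into screen space.
constexpr ScreenPoint to_screen(Cursor c)
{
    return ScreenPoint{
        (MAP_LEFT + (c.col * CELL_W)), (MAP_TOP + (c.row * CELL_H))
    };
}

// Index into a row-major character table.
constexpr int char_index(Cursor c)
{
    return ((c.row * MAP_COLS) + c.col);
}

// Moves the cursor by one step, wrapping around on both axes.
constexpr Cursor moved(Cursor c, int dx, int dy)
{
    return Cursor{
        ((c.col + dx + MAP_COLS) % MAP_COLS),
        ((c.row + dy + MAP_ROWS) % MAP_ROWS)
    };
}
```

Gaps and double-wide cells would then turn into a few special cases inside `moved()`, rather than into one branch per row in both the input and the rendering code.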
In the end, I just gave up with my usual redundancy reduction efforts for
this one. Anyone wanting to change TH01's high score name entering code
would be better off just rewriting the entire thing properly.
And that's all of the shared code in TH01! Both OP.EXE and
FUUIN.EXE are now only missing the actual main menu and
ending code, respectively. Next up, though: The long awaited TH01 PI push.
Which will not only deliver 100% PI for OP.EXE and
FUUIN.EXE, but also probably quite some gains in
REIIDEN.EXE. With now over 30% of the game decompiled, it's about
time we get to look at some gameplay code!
P0090
TH01 decompilation (Input blockers + Input, part 1)
P0091
TH01 decompilation (Input, part 2 + Score menu, part 1)
💰 Funded by:
Yanga, Ember2528
Back to TH01, and its high score menu… oh, wait, that one will eventually
involve keyboard input. And thanks to the generous TH01 funding situation,
there's really no reason not to cover that right now. After all,
TH01 is the last game where input still hadn't been RE'd.
But first, let's also cover that one unused blitting function, together
with REIIDEN.CFG loading and saving, which are in front of
the input function in OP.EXE… (By now, we all know about
the hidden start bomb configuration, right?)
Unsurprisingly, the earliest game also implements input in the messiest
way, with a different function for each of the three executables. "Because
they all react differently to keyboard inputs",
apparently? OP.EXE even has two functions for it, one for the
START / CONTINUE / OPTION / QUIT main
menu, and one for both Option and Music Test menus, both of which directly
perform the ring arithmetic on the menu cursor variable. A consistent
separation of keyboard polling from input processing apparently wasn't all
too obvious of a thought, since it's only truly done from TH02 on.
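For comparison, here is a sketch of what the processing half of such a separation can look like: a pure function that performs the ring arithmetic on a menu cursor, given input flags that some other function polled from the keyboard. The flag names and values are made up for this example and don't match any of the games.

```cpp
#include <cassert>

// Hypothetical input flags, as they might be returned by a separate
// keyboard polling function.
enum InputFlags : unsigned {
    INPUT_NONE = 0,
    INPUT_UP = 1,
    INPUT_DOWN = 2,
};

// Returns the new cursor position for a menu with `item_count` entries,
// wrapping around at both ends. Doesn't touch any hardware.
constexpr int menu_cursor_moved(int cursor, unsigned input, int item_count)
{
    return (
        (input & INPUT_UP) ? ((cursor + item_count - 1) % item_count) :
        (input & INPUT_DOWN) ? ((cursor + 1) % item_count) :
        cursor
    );
}
```

Every menu can then share this one function, and the polling code no longer needs to know that menus exist.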
This lack of proper architecture becomes actually hilarious once you
notice that it did in fact facilitate a recursion bug!
In case you've been living under a rock for the past 8 years, TH01 shipped
with debugging features, which you can enter by running the game via
game d from the DOS prompt. These features include a
memory info screen, shown when pressing PgUp, implemented as one blocking
function (test_mem()) called directly in response to the
pressed key inside the polling function. test_mem() only
returns once that screen is left by pressing PgDown. And in order to poll
input… it directly calls back into the same polling function that called
it in the first place, after a 3-frame delay.
Which means that this screen is actually re-entered for every 3 frames
that the PgUp key is being held. And yes, you can, of course, also
crash the system via a stack overflow this way by holding down PgUp for a
few seconds, if that's your thing. Edit (2020-09-17): Here's a video from
spaztron64, showing off this exact stack overflow crash while running
under the VEM486 memory manager, which displays additional information
about these sorts of crashes:
What makes this even funnier is that the code actually tracks the last
state of every polled key, to prevent exactly that sort of bug. But the
copy-pasted assignment of the last input state is only done after
test_mem() already returned, making it effectively pointless
for PgUp. It does work as intended for PgDown… and that's why you
have to actually press and release this key once for every call to
test_mem() in order to actually get back into the game. Even
though a single press of PgDown will already show the game screen
again.
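The re-entrancy described above can be modeled with a small C++ simulation. This is a sketch under assumptions, not ZUN's code: only test_mem(), the 3-frame delay, and the late assignment of the last PgUp state come from the description above. The scripted keyboard, the leave flag, and every other name are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of the debug-mode polling function.
struct Keys {
	bool pgup;
	bool pgdown;
};

struct DebugInputModel {
	std::vector<Keys> script; // scripted keyboard state, one entry per frame
	std::size_t frame = 0;
	bool assign_before_call = false; // false = the order described above
	bool prev_pgup = false;
	bool prev_pgdown = false;
	bool leave = false;
	int depth = 0;
	int max_depth = 0;
	int exits_by_pgdown = 0;

	bool script_done() const { return (frame >= script.size()); }

	Keys poll() const {
		return (script_done() ? Keys{ false, false } : script[frame]);
	}

	void test_mem() {
		depth++;
		if(depth > max_depth) {
			max_depth = depth;
		}
		while(!leave && !script_done()) {
			frame += 3;    // the 3-frame delay
			input_sense(); // calls back into the polling function
		}
		if(leave) {
			exits_by_pgdown++;
		}
		leave = false; // one PgDown press only leaves one level
		depth--;
	}

	void input_sense() {
		const bool pgup = poll().pgup;
		const bool pgup_pressed = (pgup && !prev_pgup);
		if(assign_before_call) {
			prev_pgup = pgup;
		}
		if(pgup_pressed) {
			test_mem();
		}
		if(!assign_before_call) {
			prev_pgup = pgup; // too late: nested calls saw the old value
		}
		const bool pgdown = poll().pgdown; // read after test_mem() returned
		if(pgdown && !prev_pgdown && (depth > 0)) {
			leave = true;
		}
		prev_pgdown = pgdown;
	}
};

// Holds PgUp for `pgup_frames`, then taps PgDown `pgdown_taps` times
// (6 frames pressed, 6 frames released per tap).
DebugInputModel simulate(int pgup_frames, int pgdown_taps, bool assign_before_call)
{
	DebugInputModel m;
	m.assign_before_call = assign_before_call;
	m.script.assign(pgup_frames, Keys{ true, false });
	for(int i = 0; i < pgdown_taps; i++) {
		m.script.insert(m.script.end(), 6, Keys{ false, true });
		m.script.insert(m.script.end(), 6, Keys{ false, false });
	}
	while(!m.script_done()) {
		m.input_sense();
		m.frame++;
	}
	return m;
}
```

In this model, simulate(30, 1, false) nests test_mem() 10 levels deep, one level per 3 frames of held PgUp, and the single PgDown tap only leaves the innermost one. Moving the assignment in front of the call, as in simulate(30, 1, true), keeps the depth at 1.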
In maybe more relevant news though, this function also came with what can
be considered the first piece of actual gameplay logic! Bombing via
double-tapping the Z and X keys is also handled here, and now we know that
both keys simply have to be tapped twice within a window of 20 frames.
They are tracked independently from each other, so you don't necessarily
have to press them simultaneously.
In debug mode, the bomb count tracks precisely this window of
time. That's why it only resets back to 0 when pressing Z or X if it's
≥20.
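One possible reading of this logic, sketched in C++. The 20-frame window, the independent tracking of Z and X, and the reset at ≥20 are taken from the description above. The structure, the names, and the exact order of operations are assumptions and will differ from the original code:

```cpp
#include <algorithm>
#include <vector>

static const int BOMB_DOUBLETAP_WINDOW = 20;

// Hypothetical model of the double-tap detection.
struct BombDoubleTap {
	int frames = BOMB_DOUBLETAP_WINDOW; // starts outside of any window
	int taps_z = 0;
	int taps_x = 0;

	// Call once per frame with the newly pressed state of both keys.
	// Returns true on the frame on which a bomb would be triggered.
	bool update(bool z_tapped, bool x_tapped) {
		frames++;
		if(z_tapped || x_tapped) {
			if(frames >= BOMB_DOUBLETAP_WINDOW) {
				// Previous window expired: this tap opens a new one.
				frames = 0;
				taps_z = 0;
				taps_x = 0;
			}
			taps_z += (z_tapped ? 1 : 0);
			taps_x += (x_tapped ? 1 : 0);
		}
		if((taps_z >= 2) && (taps_x >= 2)) {
			taps_z = 0;
			taps_x = 0;
			frames = BOMB_DOUBLETAP_WINDOW;
			return true;
		}
		return false;
	}
};

// Feeds scripted tap frames into the model and returns the first frame
// on which a bomb triggers, or -1 if none does within `frame_count`.
int first_bomb_frame(
	const std::vector<int>& z_taps, const std::vector<int>& x_taps, int frame_count
)
{
	BombDoubleTap state;
	for(int f = 0; f < frame_count; f++) {
		const bool z = (std::count(z_taps.begin(), z_taps.end(), f) > 0);
		const bool x = (std::count(x_taps.begin(), x_taps.end(), f) > 0);
		if(state.update(z, x)) {
			return f;
		}
	}
	return -1;
}
```

Under these assumptions, tapping Z on frames 0 and 10 and X on frames 2 and 15 triggers a bomb on frame 15, even though the two keys are never pressed on the same frame. Tapping both keys on frames 0 and 25 does not, because the second tap arrives after the window has run out and merely opens a new one.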
Sure, TH01's code is expectedly terrible and messy. But compared to the
micro-optimizations of TH04 and TH05, it's an absolute joy to work on, and
opening all these ZUN bug loot boxes is just the icing on the cake.
Looking forward to more of the high score menu in the next pushes!
🎉 TH04's and TH05's OP.EXE are now fully position-independent! 🎉
What does this mean?
You can now add any data or code to the main menus of the two games, by
simply editing the ReC98 source, writing your mod in ASM or C/C++, and
recompiling the code. Since all absolute memory addresses have now been
converted to labels, this will work without causing any instability. See
the position independence section in the FAQ
for a more thorough explanation about why this was a problem.
What does this not mean?
The original ZUN code hasn't been completely reverse-engineered yet, let
alone decompiled. Pretty much all of that is still ASM, which might make
modding a bit inconvenient right now.
Since this push was otherwise pretty unremarkable, I made a video
demonstrating a few basic things you can do with this:
Now, what to do for the last outstanding Touhou Patch Center push?
Bullets, or resident structures?