Blog

💰 Funded by:
[Anonymous], Ember2528, LeyDud

Most of my blog posts are way too long because they tend to cover multiple and often unrelated topics. I'm well aware that this blog would reach way more people if I just split these posts into multiple smaller and more thematically focused ones. This time, however, there would have been absolutely no way I could have split off anything about TH03's enemy, fireball, explosion, chaining, and combo systems while still leaving you with a coherent understanding. If this were a multi-post series, you'd be clicking back and forth to even just fully understand one of those systems in its entirety.

Really, what is this mess? I've procrastinated RE work on any of these systems ever since 📝 2022, when it became clear just how much enemies, fireballs, and explosions are connected with each other. Because both enemies and fireballs are things that can explode, ZUN decided to use the same 34-byte structure for them and the explosions they turn into, forming a single 64-element array of explodable entities.
This might remind you of 📝 TH04's and 📝 TH05's custom entity structures. Aren't these even used for 6 different things in each of the two games? Well, the difference between those games and TH03 is like night and day:

This was very much not code written to be read and understood. This was code that naturally evolved from an explorative process of playing around with the data, with the goal of creating an exciting game whose exact mechanics are hard to figure out from the outside.
But the result is a ≥2,200-line mess of tangled spaghetti that not only accesses the ostensibly same structure in subtly different ways, but also splashes an unhealthy dose of mutable global state on top, adding an extra layer of intractability. It's so convoluted that I would have loved to refactor it immediately before continuing with anything else in this project. It certainly makes sense to do so earlier rather than later because TH03 is the single most ideal PC-98 Touhou game for these sorts of sweeping architecture changes. Its highly dynamic gameplay makes it highly likely for small accidental deviations from ZUN's gameplay logic to snowball into wildly different game states, and we can easily observe these by comparing the demo replays against their 📝 original state. Alas, we're still 719 potential memory references away from position independence making that possible.

This has been the single hardest to understand piece of game logic in all of PC-98 Touhou. The RE content of this post alone beats 📝 the previous record holder for the post with the largest amount of PC-98 Touhou RE content by 2.38×, and would be the second-heaviest post by HTML size that this project has seen so far, just behind 📝 the giant Shuusou Gyoku waveform BGM project. What else could possibly be left in these games that would be harder than this? A lot of the not yet RE'd code is either an isolated gameplay system or isolated to specific characters or bosses, and 📝 a lot of script code will be highly repetitive… yeah, I'm pretty certain that PC-98 Touhou won't have a single remaining RE task that will take as long to complete as this one.
How do you structure a blog post that covers so many intertwined systems? Let's try building this up in the logical order from most to least intertwined:

  1. An important note about durations in TH03
  2. TH03's chaining system
  3. Difficulty-specific tuning
  4. TH03's combo system
  5. TH03's enemy formation scripts
  6. TH03's enemies themselves
  7. TH03's explosions
  8. TH03's fireballs
  9. TH03's score reduction and extend glitches
  10. Completing the Fediverse migration

Before I can talk about anything else, I need to clarify the most important fact about in-game durations in TH03:

Any gameplay-related frame time or real-time duration in TH03 does not include the first 27 frames of every WARNING!! popup shown throughout a round. These frames freeze gameplay precisely because they run in a separate blocking function that runs its own VSync delay loop.

Obviously, this applies both to this blog and to anything you can read in the code. The gameplay community has been rather confused as far as precise durations are concerned, so I'm going to occasionally link back to this explanation in this post.


With that out of the way, let's start with…

The chaining system

…which orchestrates most of TH03's wildly escalating gameplay. Internally, chaining works like this:

The charge values added for destroying enemies with explosions require further explanation, but we can already cover the pellet_and_fireball_value-dependent values added when destroying those two:

Pellet/fireball value Chain effect
≥ 00 and < 04 charge_fireball +2
≥ 04 and < 10 charge_fireball +5
≥ 10 and < 20 charge_fireball +2, charge_exatt +1
≥ 20 charge_exatt +2

A particularly interesting detail about those: If more than one explosion hits a pellet or fireball within the same frame, these charges are added separately for each colliding explosion, to their respective chains. For red fireballs, TH03 then also adds a final +1 to the charge_exatt value of the chain whose colliding explosion has the highest index within the overall array of explodable entities.

Explaining the combo system would be the logical next step, but that system has a dependency on…

Difficulty-specific tuning

Where we get right to the final variable that would 📝 define TH03's difficulty in a round of netplay:

round_speed

This is a Q4.4 fixed-point value that is limited to a range from 0.0 to 7.9375 inclusive, increments by 0.0625 every 64 frames, and starts out at the following values:

Difficulty Starting round_speed
Easy (Round ID × 1.0)
Normal (Round ID × 2.0)
Hard (2.0 + (Round ID × 2.0))
Lunatic 6.0
📝 Demo 4.0
With the round ID being 0-based and representing either the current VS Mode round or the number of times a Story Mode stage has been repeated.

This variable is then used to derive a whole variety of speeds and limits:

Yup – the difficulty setting in the Option menu merely controls where all of these values start out at. After 6,144 frames of gameplay, or 1:49 minutes, even an Easy round will have accelerated to Lunatic levels.
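The 6,144-frame figure can be double-checked with a few lines of arithmetic. This is only a sketch: the function names, the assumption that each increment lands on an exact multiple of 64 frames, the clamping of the starting value, and the ~56.42 Hz frame rate used for the real-time estimate are all illustrative and not taken from the game's code.

```python
# Sketch of the round_speed progression described above.
# Assumptions (not from the game's code): all names, the exact frame on which
# each increment lands, clamping of the starting value, and the frame rate.

Q4_4_ONE = 16            # 1.0 as a raw Q4.4 value
ROUND_SPEED_MAX = 0x7F   # 7.9375 as a raw Q4.4 value
FRAMES_PER_STEP = 64     # one raw step (0.0625) every 64 frames
ASSUMED_FPS = 56.42      # assumed PC-98 frame rate, only used for the estimate


def initial_round_speed(difficulty: str, round_id: int) -> int:
    """Raw Q4.4 starting value for a 0-based round ID."""
    start = {
        "easy": round_id * 1.0,
        "normal": round_id * 2.0,
        "hard": 2.0 + (round_id * 2.0),
        "lunatic": 6.0,
        "demo": 4.0,
    }[difficulty]
    return min(int(start * Q4_4_ONE), ROUND_SPEED_MAX)


def round_speed_at(start_raw: int, frame: int) -> int:
    """Raw Q4.4 value after the given number of gameplay frames."""
    return min(start_raw + (frame // FRAMES_PER_STEP), ROUND_SPEED_MAX)


def frames_until(start_raw: int, target_raw: int) -> int:
    """Gameplay frames needed to climb from start_raw to target_raw."""
    return max(target_raw - start_raw, 0) * FRAMES_PER_STEP


easy = initial_round_speed("easy", 0)
lunatic = initial_round_speed("lunatic", 0)
frames = frames_until(easy, lunatic)
minutes, seconds = divmod(round(frames / ASSUMED_FPS), 60)
print(frames, f"{minutes}:{seconds:02d}")
```

Keeping the value as a raw integer avoids any floating-point rounding, which matches how a Q4.4 value would behave in the game.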

enemy_speed

Whenever the game spawns a new formation, it updates this variable by integer-dividing round_speed by 2.0, resulting in a variable that ranges from 0 to 3 inclusive.
This value is then used to scale several speed-related variables using rather weird formulas. For example, ZUN scales movement durations using a formula of

(⌊duration/2⌋ × 3) - ⌊(enemy_speed × duration)/6⌋

Which looks like you could simplify it down to

(duration × (9 - enemy_speed))/6

But as the ⌊floor signs⌋ indicate, all of these divisions operate, once again, on integers, where additional divisions lead to additionally truncated remainders. In this case, they introduce an error of ±1 compared to the simplified formula at regular intervals. Such an error might look like ZUN tried to correctly round the non-truncated floating-point result, but that result would sometimes be even further away from what you get from the janky double-division formula. Let's take a duration of 25, for example:

enemy_speed Simplified, real Double-divided
0 37.500 36
1 33.333 32
2 29.166 28
3 25.000 24

Hence, this makes no sense on any level. The simplified expressions would take up less space in both the C++ code and the binary and execute faster. :zunpet:
The same happens for the speeds that enemies move at. In the code, ZUN uses a formula of

⌊(enemy_speed × speed)/9⌋ + ⌊(2 × speed)/3⌋

which looks like

(speed × (enemy_speed + 6))/9

but actually isn't, and suffers from the same off-by-one errors at regular intervals.
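To see the off-by-one behavior of the duration and speed formulas for yourself, here is a minimal sketch of both next to their simplified counterparts. The function names are made up for illustration, and the sketch assumes non-negative inputs: Python's `//` rounds towards negative infinity, while the integer division in the original C++ code truncates towards zero.

```python
# Sketch of the enemy_speed-dependent formulas quoted above, with hypothetical
# function names. Only valid for non-negative inputs, see the note above.

def enemy_speed_from(round_speed_raw: int) -> int:
    """Integer-divides a raw Q4.4 round_speed by 2.0 (= 32 raw units)."""
    return round_speed_raw // 32


def duration_double_divided(duration: int, enemy_speed: int) -> int:
    """The formula from the game, with two separately truncated divisions."""
    return ((duration // 2) * 3) - ((enemy_speed * duration) // 6)


def duration_simplified(duration: int, enemy_speed: int) -> float:
    """The algebraically simplified formula, evaluated on real numbers."""
    return (duration * (9 - enemy_speed)) / 6


def speed_double_divided(speed: int, enemy_speed: int) -> int:
    """The formula from the game, with two separately truncated divisions."""
    return ((enemy_speed * speed) // 9) + ((2 * speed) // 3)


def speed_simplified(speed: int, enemy_speed: int) -> float:
    """The algebraically simplified formula, evaluated on real numbers."""
    return (speed * (enemy_speed + 6)) / 9


for enemy_speed in range(4):
    print(
        enemy_speed,
        duration_simplified(25, enemy_speed),
        duration_double_divided(25, enemy_speed),
    )
```

Looping over other durations or speeds in the same way shows where the truncated results drift away from the simplified ones.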

The final and most complex enemy_speed-dependent formula is used for tuning the angles for wavy or circular motions. This time around, ZUN actually used 16-bit angles, treating each enemy's angle field as effectively a Q8.8 fixed-point value. The movement vectors themselves are still calculated by indexing master.lib's 8-bit sine and cosine lookup tables with the top 8 bits of such an angle, but the extra 8 bits below add extra precision when adding the tuned angle_speed on every frame of such a motion.
This extended value range allowed ZUN to reduce the impact of enemy_speed to, literally, a very small degree:

⌊(512 × angle_speed)/3⌋ + ⌊(enemy_speed × ⌊(512 × angle_speed)/3⌋)/9⌋

This formula would have benefited most from a simplification to

((512 × angle_speed)/3) + ((512 × angle_speed × enemy_speed)/27)

which would have only introduced a maximum discrete error of 1.7 per frame, or an error of 0.0094° when converted to 360° angles. But the more critical issue with this formula lies in the first term, ⌊(512 × angle_speed)/3⌋. Multiplying an 8-bit angle_speed by 512 effectively left-shifts the value by 9 bits, requiring 17 bits to store all possible resulting values. However, ZUN then assigns this temporary result to a signed 16-bit variable, causing an overflow into negative numbers – and thus, an incorrectly reversed enemy rotation – as soon as angle_speed reaches ≥0x40, rather than the ≥0x80 where the sign is actually supposed to flip.
Then again, who would seriously add ±90° to an enemy's movement angle per frame? -0x08 and +0x08 are the highest speeds used in the original scripts, so this counts as neither a quirk nor a landmine in my book.
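The sign flip at 0x40 is easy to reproduce by emulating 16-bit arithmetic. The sketch below assumes that the wrap-around happens on the (512 × angle_speed) product before the division by 3, and that the division truncates towards zero as it would in C++; both the helper names and that exact order of operations are assumptions for illustration.

```python
# Sketch of the signed 16-bit overflow in the first term of the angle formula.
# Assumption: the product wraps to 16 bits before being divided by 3.

def to_int16(value: int) -> int:
    """Wraps an arbitrary integer into the signed 16-bit range."""
    value &= 0xFFFF
    return value - 0x10000 if (value & 0x8000) else value


def trunc_div(a: int, b: int) -> int:
    """Integer division that truncates towards zero, as in C++."""
    quotient = abs(a) // abs(b)
    return quotient if ((a < 0) == (b < 0)) else -quotient


def first_term(angle_speed: int) -> int:
    """⌊(512 × angle_speed)/3⌋ with a 16-bit intermediate product."""
    return trunc_div(to_int16(512 * angle_speed), 3)


for angle_speed in (0x08, 0x3F, 0x40):
    print(hex(angle_speed), first_term(angle_speed))
```

0x08 and 0x3F keep their positive sign, while 0x40 already comes out negative.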


But now we can look at…

The combo system

As you might have already guessed from the list of per-chain fields, the game indeed only tracks the total combo bonus for each player as a single divided-by-10 16-bit value that every chain with ≥2 hits adds onto. For clarity, I'll only use the displayed values in this post. Once the displayed value reaches 655,350, it therefore becomes impossible to cancel the 📝 80-frame cooldown period by adding more bonus points. This also means that you lose all bonus points you would have earned within these last 80 frames.
The bonus value itself grows as you destroy certain entity types with an explosion:

Entity Bonus
Enemies (160 × hits)
Pellets (((20 × hits) + (160 × round_speed)) × f)
Blue fireballs ((1000 + (160 × hits) + (160 × round_speed)) × f)
Red fireballs ((4440 + (160 × hits) + (160 × round_speed)) × f)
This multiplication effectively converts round_speed into an integer value.

With f being:

The thick border in the table indicates once again that the pellet and fireball bonuses are awarded multiple times if the entity is destroyed by more than one explosion on the same frame, just like the charges of its chain. Here are some messy screenshots of that case happening:

Screenshot of two explosions destroying a fireball on the same frame of TH03 gameplay
Screenshot of the immediately following frame, showing how the game added 31,600 bonus points for destroying just this one fireball, which is more than twice as high as you'd expect from the table
That's 2 more hits on the Hit! counter, and an additional 1,600 points of score, and Mima demonstrably is not hitting anything other than this one red fireball here.

The bonus difference between the two frames is 31,600, which is exactly

First hit ((4,440 + (14 × 160) + (160 × 7.125)) × 2) +
Second hit ((4,440 + (15 × 160) + (160 × 7.125)) × 2)

With a round_speed of 7.125, we can further deduce that this screenshot was made about 21 seconds into a round at Lunatic difficulty, which checks out when looking at the score. Conclusion: The best way to make things happen in TH03 is to destroy red fireballs with explosions, and preferably multiple explosions in a single frame.
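The 31,600 points from the two screenshots can be recomputed with a tiny sketch. The function and constant names are hypothetical; the formula is the one from the worked example above, with round_speed passed as a raw Q4.4 value so that (160 × round_speed) turns into the integer (10 × raw).

```python
# Sketch of the red fireball bonus from the worked example above.
# Names are hypothetical; round_speed is passed as a raw Q4.4 value.

RED_FIREBALL_BASE = 4440


def red_fireball_bonus(hits: int, round_speed_raw: int, f: int) -> int:
    """((4,440 + (160 × hits) + (160 × round_speed)) × f), in displayed points."""
    # 160 × (raw / 16) = 10 × raw, so everything stays in integer arithmetic.
    return (RED_FIREBALL_BASE + (160 * hits) + (10 * round_speed_raw)) * f


ROUND_SPEED_RAW = 114  # 7.125 in Q4.4 (7.125 × 16)

first_hit = red_fireball_bonus(14, ROUND_SPEED_RAW, 2)
second_hit = red_fireball_bonus(15, ROUND_SPEED_RAW, 2)
print(first_hit, second_hit, first_hit + second_hit)
```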

Boss Attacks and Panics

People seem to have the impression that this system and its bonus point (or Spell Point) requirements are complicated, so let's describe it all in just a few bullet points:


Now that we know most of what there is to know about scoring, let's get the whole system going by placing some enemies in more or less chainable positions on the playfield!

Enemy formations

These are loaded from ENEDAT.DAT, which is the only bytecode format in the entirety of TH03. The original game does come with hardcoded limits of up to 16 enemies per formation and up to 24 formations in total, but the actual number of formations is neatly taken from the file itself.
The original ENEDAT.DAT defines 18 different formations, all of which I recorded and posted to the Fediverse along with screenshots of their reconstructed script code, in their original order within the file. Isn't it nice that my primary PR channel can now hold lossless AV1 videos and lossless images that are just as self-hosted as this blog is, and that I can just link to without them taking up space in a blog post? Since there's nothing beyond the internal IDs in the game's memory that you could possibly want to cross-reference these formations with, I chose to number them using the same 0-based scheme. For the same reason, the names aren't important because nothing in the game needs to name these formations, so I just used the names that the gameplay community came up with, taken from these videos by Christian Azinn and KirbyComment.

Obviously, all these source code images mean that I've also written a dumper for the bytecode format, supporting the 13 functions that ZUN actually used in the original ENEDAT.DAT. Since we 📝 no longer build 32-bit Windows binaries as part of our build process, the ReC98 build process currently exclusively compiles this dumper as a DOS binary to bin\Pipeline\enedat.com. This allowed me to use the game's own header files at the expense of not having a trivially buildable native Windows or Linux version at this point. Such a build wouldn't be all that interesting without a compile option either, I'd say – and that part would definitely fall into the realm of contribution-ideas or require dedicated funding.

Gameplay details

These are quickly summarized:

As usual for these sorts of script file-formats, I once again wrote a full…

Command reference

…which only requires a few additional bits of context to understand:

Opcode Function and parameters Description
0x00 Stop Immediately removes the enemy from the playfield and stops script execution.
0x01 Linear move
angle, speed, duration
Sets the enemy's angle and speed fields to the given parameters, recalculates the velocity, and moves the enemy at this velocity for the given duration.
0x02 Circular move
angle_start, speed, angle_speed, duration
Moves the enemy along a circular path. Calculates a new velocity from the given speed and the sine and cosine of angle on every frame. Starts by resetting angle to angle_start, and then adds angle_speed on every frame of the motion.
0x03 Wait
duration
Does nothing for a while. Exclusively used to delay later enemies that follow the same path as earlier ones.
0x04 Wavy X / linear Y move
speed_x, angle_speed, velocity_y, duration
Moves the enemy on a sinusoidal path. Calculates a new velocity by applying a cosine oscillation multiplied by the given speed on the wavy axis while using the given constant velocity on the linear axis. Starts out with angle at 0x00 and adds angle_speed on every frame of the motion.
Most notably, this is the only movement type used in formations #1 (Drunkards) and #12 (Off-centered Crossing).
0x05 Wavy Y / linear X move
speed_y, angle_speed, velocity_x, duration
0x06 Move
duration
Continues moving the enemy at its current velocity.
0x07 Set speed and move
speed, duration
Like 0x01, but only sets the enemy's speed to the given parameter, retaining its current angle for the velocity calculation.
0x08 Linear move, stopping at player Y
angle, speed, duration
Like 0x01, but stops if the enemy's center coordinate intersects with the player sprite on the Y or X axis.
0x09 Linear move, stopping at player X
angle, speed, duration
0x0A Directional circular move
angle_start, speed, angle_speed, velocity_x_plus, velocity_y_plus, duration
Like 0x02, but adds a constant vector on top of the recalculated per-frame velocity.
0x10 Spawn
center_x÷8, center_y÷8, size_words[4], hp[4], clip_x, clip_bottom, unused
Makes the enemy appear. The center_x and center_y values are in playfield space; storing them as multiples of 8 is a neat way to cover a range from (-128 × 8) = -1024 to (+127 × 8) = +1016 pixels in a single byte. size_words (the size of the enemy in multiples of 16) and hp are conveniently indexed with enemy_speed (see above).
0x80 Loop (absolute jump)
target, count
Supposed to loop the block between the current instruction pointer (IP) and target (0x80) or (disp + IP) (0x81), but broken due to several bugs in the implementation. These bugs would later be fixed in TH04, where such a loop appears in the very first enemy formation of Stage 1.
0x81 Loop (relative jump)
disp, count
0x82 Set clip_x flag Sets either of the two flags to true. Useful if the enemy was previously spawned with clip_x or clip_bottom set to false: In that case, the enemy first gets to live outside the boundaries of its playfield without being clipped, and this call can then reactivate clipping for proper removal later in the script. This can be seen in formations #16 (Zigzag) and #17 (Flying Junction).
0x83 Set clip_bottom flag
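The ÷8 coordinate encoding of the Spawn instruction (0x10) in the table above boils down to a signed byte multiplied by 8. A minimal sketch, with a made-up function name:

```python
# Sketch of decoding the center_x÷8 / center_y÷8 parameters of opcode 0x10.
# The function name is hypothetical.

def decode_spawn_coordinate(byte_value: int) -> int:
    """Interprets a script byte as a signed multiple of 8 playfield pixels."""
    signed = byte_value - 0x100 if (byte_value & 0x80) else byte_value
    return signed * 8


print(decode_spawn_coordinate(0x80), decode_spawn_coordinate(0x7F))
```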

With no way of deduplicating the multiple enemies that fly along the same path in most formations, the scripts in ENEDAT.DAT end up highly copy-pasted, with individual enemies often only differing in the initial delay. It's hard to criticize this though, as this simple non-abstracted approach also provided the flexibility for ZUN to have enemies on multiple paths within the same formation in the first place.

If there is one design flaw in this format, it's the mere existence of the stop instruction. It does fulfill the role of terminating the script interpreter, whose instruction pointer would otherwise continue past the end of an enemy's script, but even that detail wouldn't justify its existence if we consider the design of these enemy formations. Every enemy of every formation is designed to fly past some edge of the playfield, get clipped, despawn, and end script execution that way, yet they still remove themselves using this stop instruction instead of relying on playfield edge clipping to do so.
As usual, I wouldn't mention this if it didn't cause at least one quirk in the game. If we look closely at formation #13 (Folding "7"), we see that the center enemies on the straight vertical path call stop() just a bit too early and despawn themselves just a few pixels above the bottom edge of the playfield:

You could equally argue that this quirk is simply caused by the human error of passing a duration to move_linear() that is shorter than it should have been. Since the playfield is 368 pixels tall, a 64-pixel enemy spawned at a top Y position of (-32 - (64/2)) = -64 and moving at a velocity of 4 pixels per frame would need to keep moving for ((64 + 368)/4) = 108 frames rather than just 100. But removing the stop instruction would have made that quirk obvious by causing clearly visible glitches as a result of the script instruction pointer no longer being kept from running past the end of the respective enemy scripts.
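The same arithmetic as a small sketch, using only the numbers from the paragraph above; the names are made up for illustration, and the velocity is treated as pixels per frame.

```python
# Sketch of the despawn timing for the center enemies of formation #13.
# Names are hypothetical; the numbers come from the paragraph above.

PLAYFIELD_H = 368


def frames_to_leave_bottom(top_y: int, velocity_y: int) -> int:
    """Frames until an enemy's top edge reaches the bottom playfield edge."""
    distance = PLAYFIELD_H - top_y
    return -(-distance // velocity_y)  # ceiling division


def visible_rows_after(top_y: int, velocity_y: int, frames: int) -> int:
    """Rows between the enemy's top edge and the bottom playfield edge."""
    return max(PLAYFIELD_H - (top_y + (velocity_y * frames)), 0)


print(frames_to_leave_bottom(-64, 4), visible_rows_after(-64, 4, 100))
```

After the 100 frames from the script, the enemy's top edge is still 32 rows away from the bottom edge, which is why the early stop() is visible.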


Enemies

Despite all the explanations above, there's still a bit left to be said about enemies themselves:


For reasons that will soon become apparent, I'll skip over fireballs for now. Instead, let's continue with…

Explosions

Collision detection is the only piece of explosion-related code shared between enemies and fireballs, so we can visualize most of the gritty internal details by simply looking at the hitbox frame data… and of course you're getting all 30 permutations of explosion size/source and destroyable entity:

Thanks to the anonymous backer for providing the Anything budget for these multi-row tabs at the end of the 6th push!

We can summarize these 30 videos in just 5 bullet points:

But there is one more obscure detail that we can't see in these frame data videos. Even within the window where explosions can hit enemies, TH03 further caps the total amount of damage that a single explosion can deal to them. Each enemy-originating explosion can deal up to (size_words × 2) HP of damage in total, while fireball-originating explosions can deal up to 4 HP if the explosion originated from a blue fireball and 6 HP if it originated from a red one. ZUN's code suggests that this detail is among the most accidentally emerged features in the entire game – because of course he did not separately store the damage cap in one of the 14 padding bytes within the explodable structure, but simply reinterpreted size_words (for enemies) or the variant field (for fireballs) as half of the cap's value. :zunpet:
This limitation even seems largely pointless from a game design standpoint at first:

Sure, once we consider the small number of frames where explosions can actually hit enemies, this might make it less likely for older explosions to kill enemies further along the formation's path, but does it really matter in practice?
Turns out that we have to look no further than formation #4 (Racetrack), which sends 12 enemies with increasing HP values along the same path. If we colorize explosions based on the total amount of damage they've dealt, we can clearly see the desync just 48 frames after the first explosion:

I chose black for explosions that currently can't deal damage to enemies. The other 6 colors (dark green, light green, dark red, light red, dark blue, and light blue to represent 0 to 5 dealt hits in this order) are TH03's regular in-game VRAM colors from #5 to #10.

And now imagine a more complex game state where these earlier explosions might in turn cause hit-count-dependent pellets, charged fireballs, or charged Extra Attacks to fire earlier…


Only one left to go then!

Fireballs

As usual, let's start with a few facts in bullet-point form:

The more interesting parts about fireballs all relate to their destruction. Let's start with destruction by explosions, which seems to always send one new red fireball to the other player's field. In reality, these spawns are limited by a "generation number" system: Each blue fireball starts with a generation number of 0, while each red fireball starts, on paper, with 1 plus the generation number of whatever fireball was involved in its creation. In the case of explosions, the game obviously uses the generation number of the exploding fireball itself, but will only spawn such a new fireball if that number is <4.
This might look like even more of a rarely-triggered and expendable gameplay detail than the enemy damage cap we've seen above. How often does the "same" fireball really get transferred between both playfields 4 times in a row, without any player dropping a single generation? Well, exactly this case happens a mere 41.8 real-time seconds into 📝 the very first demo of Reimu vs. Mima. Removing the generation limit would fork gameplay as an explosion would then spawn another 5th-generation fireball that otherwise wouldn't have been there.
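The generation limit for explosion-destroyed fireballs can be sketched like this. The names are hypothetical, and the sketch only models the explosion path described above, ignoring the quirks covered further below.

```python
# Sketch of the generation limit on the explosion path. Names are hypothetical.

from typing import Optional

BLUE_GENERATION = 0
GENERATION_LIMIT = 4


def spawn_from_explosion(generation: int) -> Optional[int]:
    """Generation of the new red fireball, or None if nothing is spawned."""
    if generation < GENERATION_LIMIT:
        return generation + 1
    return None


# Keep exploding the "same" fireball on alternating playfields.
generations = []
current = spawn_from_explosion(BLUE_GENERATION)
while current is not None:
    generations.append(current)
    current = spawn_from_explosion(current)
print(generations)
```

Without the check, the loop above would never stop, which is the 5th-generation fireball that would fork the demo.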

Sometimes, however, the game also appears to spawn additional red fireballs in response to destroying only a single one with an explosion. The cleanest example I could find in the four demos happens 1:05 minutes into the Kotohime vs. Marisa demo:

Note how the "primary" new fireball described above is spawned from the center of its explosion, while the "additional" new fireball is spawned from the center of the explosion that caused the old fireball to explode.

With all the features we've previously looked at, it's easy to explain why we get that second fireball – because the chain's fireball charge value was high enough to spawn one. But why is this new fireball red? Doesn't 夢時空.TXT say that red [fireballs] have been sent back at least once?

赤いのは、1回以上送りかえされたことのある奴です。

Yup, 📝 another case where the manual is flat out wrong. The color variant of newly spawned fireballs is read from a global variable that is set to red at the beginning of the fireball collision detection function, set back to blue before returning, and not modified anywhere else. Hence, this color variable does not just apply to the fireballs spawned directly in the collision handler itself, but also to any fireballs spawned indirectly via the chaining system as a result of fireballs colliding with explosions. Given the fact that there's a significant gameplay difference between red fireballs and blue ones, this is quite a significant error in the manual, I'd say… :thonk:

But it gets even quirkier. Which generation number do these chain-charged red fireballs spawn with? Logically, you'd expect 0, but red fireballs are always spawned with a generation number of at least 1. Apart from that, though, there is no other generation number that the code could meaningfully propagate. These fireballs are spawned by the same generic explosion collision handler that would also spawn blue fireballs upon destroying pellets, which only has access to a hitbox and not to the structure instance of the object it tests all explosions against.
But once again, the fireball spawn function reads the previous generation number from a global variable, so it will assign… 1 plus whatever generation number the previous explosion-destroyed fireball had. :tannedcirno:
And yes, the previously destroyed fireball, not the current one, because the collision handler that spawns these chain-charged fireballs is called before the global variable is updated with the affected fireball's generation number.

You can also destroy a fireball without an explosion though. Internally, this is done by removing the old fireball and spawning a new transferring red one at the same position. This new fireball will start with the old fireball's generation number incremented by 1, matching the explanation of the generation system from above. Unlike explosion-destroyed fireballs, there is no limit to the number of times a particular fireball can be transferred in this way. So you can go higher than 5 generations as long as you keep using shots to transfer fireballs back and forth – and if you manage 256 of those transfers, the generation counter will overflow back to 0. :onricdennat:
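For comparison, here is the unlimited shot-transfer path as a sketch. The 8-bit wrap-around is implied by the overflow after 256 transfers mentioned above; the function name is made up.

```python
# Sketch of the generation counter on the shot-transfer path.
# Assumption: the counter is stored in a single unsigned byte.

def transfer_by_shot(generation: int) -> int:
    """Generation of the new red fireball after a transfer by shot."""
    return (generation + 1) & 0xFF


generation = 0
for _ in range(256):
    generation = transfer_by_shot(generation)
print(generation)
```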
But then, the game logic goes completely off the rails in the rest of the collision handler. For starters, this kind of fireball destruction still increases the Extra Attack charge of a chain, despite the fact that player shots and Charge Shots exist completely separate from the chaining system and the fact that the respective player may not even have any active explosions on their playfield. ZUN simply takes the chain slot that the next explosion would be assigned to, and adds +1 (for blue fireballs) or +2 (for red fireballs) to its charge_exatt value. Second reason why these charges aren't reset when starting a new chain, I guess…

And then, the code performs another charge_exatt firing check without adding any new charge, in what could have only possibly been a leftover function call that was supposed to go into the branch that handles fireball destruction by explosions. On this non-explosion code path, the chain_slot variable used in this check remains uninitialized, which leads to a potential read and write access outside of the bounds of this array…


…wait a moment, this immediately reminds me of certain bug reports from players that smelled like they were caused by out-of-bounds array accesses in unrelated parts of the game. I've been waiting to find one of these accesses ever since I heard about these bugs, and it's great that I immediately stumbled over the issue on my first RE pass over the fireball collision handler. Could it be that I've just found…

TH03's score reduction and extend glitches

Yup. In their thousands of hours of mastering this game's intricate systems, the gameplay community has encountered two very rare and seemingly random glitches without any obvious cause:

  1. Random score reduction
  2. Two extra Story Mode lives out of nowhere

And indeed, both of them could be caused if charge_exatt were indexed with a chain_slot ≥16, which would cause certain bytes to accidentally get reinterpreted as an Extra Attack charge value, compared against round_speed, and reset to 0 if their value is high enough. Want to take a guess at which particular pieces of data lie particularly close to charge_exatt in TH03's original memory layout? Well…

DS: +00 +01 +02 +03 +04 +05 +06 +07 +08 +09 +0A +0B +0C +0D +0E +0F
4B3E P1 Extra Attack charge per chain
4B4E P2 Extra Attack charge per chain
4B5E (temporary data), P1 score digits (📝 little-endian BCD), P2 score digits (continued in the next row)
4B6E P2 score digits (📝 little-endian BCD), (temporary data), ☯️
With ☯️ being the number of Story Mode extends gained. These score digits do not include the one's digit, which represents the number of continues used and is stored separately.
The 16 bytes below hold more gameplay-relevant data relating to Yumemi's Charge Shot and Gauge Attack, but researching that was out of scope for this delivery.

This gives us the following chain_slot ranges for the two glitches, depending on which player destroys the fireball:

  1. If chain_slot is between 0x26 and 0x35 inclusive (P1) or between 0x16 and 0x25 inclusive (P2), the charge check will hit one of the score digits, causing the score reduction variant of the glitch if the digit's value is high enough.
  2. If chain_slot is exactly 0x3E (P1) or 0x2E (P2), the charge check will hit the byte that controls the number of Story Mode extends gained, causing the extend variant of the glitch if the byte's value is high enough.
    To understand how this can work, we need to look into how ZUN implemented TH03's extend system:

    • At the start of each round, extends_gained is initialized to the value of P1's second-highest score digit, or to 255 if the overall score is ≥20 million.
    • On every frame, the game then compares P1's second-highest score digit against extends_gained. If the digit is higher, it then grants an extend and increments the byte accordingly.
    • At a score of ≥20 million, extends_gained is set to 255, which is higher than any possible single digit and thus blocks the game from granting any further extends. Otherwise, you'd get additional ones at 110 million, 120 million, 210 million, …

    If you edit memory and set extends_gained from any nonzero value to 0, the game will therefore think that you haven't received any extends yet, and newly grant you as many of them as you're supposed to have with P1's current score: one additional extend if the score is ≥10 million, and two if your score is ≥20 million. There are no other checks that prevent the game from granting extends under this condition, so you could repeat this process until you've reached 255 lives, the maximum possible value.
    The "blocking" value of 255 also explains how this glitch can exist in the first place. It can only work if the game sets extends_gained to 0 on its own as a result of the Extra Attack charge check, and the regular values of 0, 1, and 2 are all smaller than the smallest possible Extra Attack charge value of 3. 255, on the other hand, is larger than any possible charge value. Thus, this glitch can only happen if you have ≥20 million points, but then, it will consistently happen whenever its other conditions are met.
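The chain_slot ranges listed above follow directly from the memory layout. The sketch below recomputes them; the two base addresses come from the table, while the addresses of the score digits (0x4B64 to 0x4B73) and of extends_gained (0x4B7C) are derived from the stated ranges rather than read from the game, and all names are made up.

```python
# Sketch of which out-of-bounds chain_slot values reach which data.
# The score digit and extends_gained addresses are derived from the ranges in
# the text, not taken from the game's code. Names are hypothetical.

CHARGE_EXATT_BASE = {1: 0x4B3E, 2: 0x4B4E}  # DS offsets per player
CHAIN_SLOTS = 16                             # valid slots are 0 to 15
SCORE_DIGITS = range(0x4B64, 0x4B74)         # P1 digits, then P2 digits
EXTENDS_GAINED = range(0x4B7C, 0x4B7D)       # the single ☯️ byte


def slots_hitting(player: int, addresses: range) -> list:
    """All out-of-bounds byte-sized chain_slot values that land in addresses."""
    base = CHARGE_EXATT_BASE[player]
    return [
        slot for slot in range(CHAIN_SLOTS, 0x100)
        if (base + slot) in addresses
    ]


for player in (1, 2):
    score = slots_hitting(player, SCORE_DIGITS)
    extend = slots_hitting(player, EXTENDS_GAINED)
    print(player, hex(score[0]), hex(score[-1]), [hex(s) for s in extend])
```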

But how likely could it possibly be for the chain_slot to fall within these exact ranges? After all, this bug is reported to only happen very rarely.
Turns out that explaining how and why we actually get these effects is not trivial in the slightest. Let's look at the stack layout across the relevant functions during a random frame of gameplay, and try to reason about the value we will actually find in this uninitialized local variable at runtime:

Function BP | SP after prolog Stack
far main() ├── (SEG1:MAI) 1000 | 0FFE INST
near round_main() ├┬─ (SEG1:RMN) 0FFA | 0FFA INST →MAI 1000
far enemies_render() │├┬ (SEG4:ENR) 0FF4 | 0FF0 INST →MAI 1000 SEG1 →RMN 0FFA INST INED
near enemy_put() / enemy_explosion_put() ││└ (SEG4:ENP) 0FEC | 0FE4 INST →MAI 1000 SEG1 →RMN 0FFA INST INED →ENR 0FF4 en_x en_y en_s en_p
far fireballs_hittest_and_render() │├┬ (SEG4:FHR) 0FF4 | 0FF2 INST →MAI 1000 SEG1 →RMN 0FFA INST INED →ENR 0FF4 en_x en_y en_s en_p
near fireballs_hittest() ││├ (SEG4:FHT) 0FEE | 0FEA INST →MAI 1000 SEG1 →RMN 0FFA INST →FHR 0FF4 0FF4 INST en_y en_s en_p

Where:

That only leaves us with a handful of explicit numbers in this table, all of which are copies of the previous function's base pointer (BP). These get saved onto the stack as part of the typical x86 function prolog, which works like this:

  1. The CALL instruction pushes the caller's code segment (for far calls) and the offset of the next instruction on the call site, before jumping to the address indicated by CALL's operand.
  2. The new function then calls either ENTER <stack size>, 0 or the more RISC-y and consistently faster equivalent sequence of PUSH BP / MOV BP, SP / SUB SP, <stack size>, saving the previous function's base pointer and making room for all required local variables. The key insight here is that SP is simply subtracted. This is exactly where the "garbage values" of uninitialized variables come from, since they will start out with the value of whatever was previously written to their corresponding location on the stack.
  3. If the function needs to modify SI or DI, it also pushes those registers, which further decreases SP.
  4. Upon returning, the function executes the epilog counterpart of these instructions in reverse order:
    1. POP DI and POP SI if needed
    2. LEAVE or the equivalent sequence of MOV SP, BP / POP BP
    3. RETN (near) or RETF (far) to pop the instruction pointer and (for RETF) the caller's code segment
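
To make the "SP is simply subtracted" point tangible, here's a minimal sketch of that prolog/epilog sequence as a toy stack machine. The `Stack16` class, the `near_call()` helper, and all addresses and values are made up for illustration and have nothing to do with TH03's actual code; the model also ignores SI/DI and far calls.

```python
class Stack16:
    """Toy model of a 16-bit x86 stack: memory plus the SP and BP registers."""

    def __init__(self, top=0x1000):
        self.mem = {}  # offset -> 16-bit word; popping never clears anything
        self.sp = top
        self.bp = 0

    def push(self, word):
        self.sp -= 2
        self.mem[self.sp] = word & 0xFFFF

    def pop(self):
        word = self.mem[self.sp]
        self.sp += 2  # the popped word stays in memory below SP
        return word

    def prolog(self, local_bytes):
        # PUSH BP / MOV BP, SP / SUB SP, <stack size>
        self.push(self.bp)
        self.bp = self.sp
        self.sp -= local_bytes  # locals are reserved, but not initialized

    def epilog(self):
        # MOV SP, BP / POP BP
        self.sp = self.bp
        self.bp = self.pop()

    def local(self, offset):
        """Reads the word at [BP - offset]."""
        return self.mem.get(self.bp - offset, 0)

    def set_local(self, offset, word):
        """Writes the word at [BP - offset]."""
        self.mem[self.bp - offset] = word & 0xFFFF


def near_call(stack, return_ip, local_bytes, body):
    """CALL (near) + prolog + function body + epilog + RETN."""
    stack.push(return_ip)
    stack.prolog(local_bytes)
    result = body(stack)
    stack.epilog()
    stack.pop()
    return result


stack = Stack16()
# The first function writes to its only local variable…
near_call(stack, 0x0100, 2, lambda s: s.set_local(2, 0x1234))
# …and the second one reads its own local without ever writing to it.
stale = near_call(stack, 0x0200, 2, lambda s: s.local(2))
print(hex(stale))
```

Since both calls start from the same SP and reserve the same amount of stack space, the second function's "uninitialized" local lands on the exact word that the first one wrote to.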

And it's these explicit numbers that reveal that ZUN got incredibly lucky:

This is also why it matters that the game renders at least one enemy or enemy-originating explosion on every frame of gameplay. The only time it renders neither is during the round start animation, but at that time, there also aren't any fireballs to be destroyed. Therefore, this exact tree of function calls is indeed the only one we need to look at, and proves that the formally undefined initial value of chain_slot is indeed deterministic on the one and only code path where it would be read from.

But from this alone, there should be no chance of the game ever writing outside the bounds of the Extra Attack charge array. So what could we possibly be missing here?
Well, just looking at MAIN.EXE's code flow doesn't actually give us the full picture. This is still a PC-98 game, and needs a full PC-98 system to run in the first place. And once we consider all the subsystems involved in running the game, we notice that three of them need to take control of the x86 instruction pointer at regular intervals:

If any of the corresponding IRQs fires, the CPU has to immediately call the corresponding interrupt handler. This requires saving the current CPU flags, code segment, and offset onto the stack, very much like a call to a regular function. While this would only affect the "inactive" area of the stack below SP at the time of the interrupt call, this area exactly controls the "uninitialized" value of future local variables. Thus, interrupts can easily modify these values beyond the state they would have as a result of the normal flow of code.
This theory is supported by the fact that players have reported that the extend glitch in particular occurs much more frequently on underclocked PC-98 systems. If each clock cycle takes up a bigger fraction of a second and the three interrupts have to run at regular intervals, we'd get more of them on every frame of gameplay. Thus, slower clock speeds increase the chance of such an interrupt falling within the exact window of instructions that influences the initial value of the chain_slot variable.
Usually, running PC-98 Touhou over and over in underclocked emulators would make for an awful research and debugging experience. Thankfully, DOSBox-X's Turbo (fast-forward) option counteracts the slowdown in exactly the right way: While the reduced clock speed will stretch each 📝 logical frame to multiple real frames, fast-forwarding the emulation will then display all these frames as fast as possible and without blocking at the emulated VSync signal. The result runs very close to the intended 56.423 FPS, but with way more of those IRQs firing per second.

With this setup, let's see what happens if PMD's timer interrupt happens to fire somewhere near the end of enemies_render:

Function BP | SP after prolog Stack
far main()├── (SEG1:MAI) 1000 | 0FFE INST 
near round_main()├┬─ (SEG1:RMN) 0FFA | 0FFA INST →MAI 1000 
far enemies_render()│├┬ (SEG4:ENR) 0FF4 | 0FF0 INST →MAI 1000 SEG1 →RMN 0FFA INST INED 
near enemy_put() / enemy_explosion_put()││├ (SEG4:ENP) 0FEC | 0FE4 INST →MAI 1000 SEG1 →RMN 0FFA INST INED →ENR 0FF4 en_x en_y en_s en_p 
interrupt opnint()││└ (PMD_:INT) 0FF4 | 0FEA INST →MAI 1000 SEG1 →RMN 0FFA INST INED FLAG SEG4 →ENR en_y en_s en_p
far fireballs_hittest_and_render()│├┬ (SEG4:FHR) 0FF4 | 0FF2 INST →MAI 1000 SEG1 →RMN 0FFA INST INED FLAG SEG4 →ENR en_y en_s en_p
near fireballs_hittest()││├ (SEG4:FHT) 0FEE | 0FEA INST →MAI 1000 SEG1 →RMN 0FFA INST →FHR 0FF4 SEG4 INST en_y en_s en_p
With FLAG being the x86's 16-bit flag register that gets saved to the stack in addition to the current CS and IP registers when calling interrupt functions.

And there we have it. If any of those three interrupts is serviced within the very specific window of x86 instructions between the last rendered enemy and the call to fireballs_hittest(), the call to the interrupt handler will modify the stack in such a way that the future chain_slot variable will start out with the top 8 bits of the address of MAIN.EXE's fourth code segment, instead of its usually guaranteed value of 15. This will be more likely if the last living or exploding enemy was spawned into a slot close to the beginning of the 48-element enemy array, as TH03 will then spend more time looping over the rest of the structure without calling either of the two *_put() functions.
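
The two tables can be condensed into a small simulation. This is only a sketch under a few assumptions of mine: The `Stack` class and `stale_chain_slot()` are hypothetical names, the return offsets and the FLAGS value are arbitrary placeholders, and placing chain_slot at the upper byte of the word at SP - 4 is inferred from the "top 8 bits" wording and the addresses in the tables rather than taken from the game's code.

```python
import struct


class Stack:
    """Byte-addressable stack memory with a stack pointer."""

    def __init__(self, sp):
        self.mem = bytearray(0x10000)
        self.sp = sp

    def push(self, word):
        self.sp -= 2
        struct.pack_into("<H", self.mem, self.sp, word)  # little-endian

    def release(self, words):
        """Models returning: SP moves back up, memory stays untouched."""
        self.sp += 2 * words


def stale_chain_slot(seg4, interrupt_fires):
    """Returns the byte that a later uninitialized 8-bit local would read."""
    stack = Stack(sp=0x0FF0)  # SP inside enemies_render(), as in the tables
    slot = stack.sp - 4       # word that later overlaps the local variable

    # Near call to a *_put() function: return offset, then the saved BP.
    stack.push(0x1111)  # return offset (placeholder)
    stack.push(0x0FF4)  # BP of enemies_render()
    stack.release(2)

    if interrupt_fires:
        # Interrupt serviced after the last *_put() call: FLAGS, CS, IP.
        stack.push(0x0202)  # FLAGS (placeholder)
        stack.push(seg4)    # code segment of the interrupted function
        stack.push(0x2222)  # return offset (placeholder)
        stack.release(3)

    return stack.mem[slot + 1]  # upper byte of the stale word


print(stale_chain_slot(0x2C91, interrupt_fires=False))      # 15
print(hex(stale_chain_slot(0x2C91, interrupt_fires=True)))  # 0x2c
```

Without the interrupt, the upper byte of the saved BP value 0FF4 yields the usual 15; with it, the code segment lands on the same word and its upper byte takes over.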

Alright, now we know what the exact variant of the glitch depends on. But how likely is it for MAIN.EXE's fourth code segment to actually fall within the affected ranges of memory, and how can we influence this placement?
First off, since DOS predates ASLR, any specific DOS system will always load its kernel, the shell, and any other drivers at deterministic addresses every time, as long as you don't change any of the parameters that influence this placement. Consequently, the system will also end up with the exact same free regions of memory every time, causing DOS to load a game at the exact same address as well. Of course, since DOS is an open platform where you can get arbitrary code execution by just, uh, writing code, building, and running it, you can easily push the game higher in memory by writing a TSR that reserves the desired amount of memory and executing it before TH03. Pushing TH03 lower, however, would require either

  1. the infamous oldskool wizardry of freeing up as much conventional RAM as possible,
  2. changing the DOS kernel, or
  3. changing the 📝 version or type of the PMD driver loaded from GAME.BAT. (Note that this won't apply to the debloated or Anniversary Editions if you directly launch them via the debloat or anniv binaries because 📝 the integrated TSR spawning code pushes these drivers above the TH03 process).

Option 2) is particularly impactful, especially if we compare earlier DOS kernels with later ones:

And this trend matches exactly with the addresses for segment #4 that we can observe when we go out into the wild and compare various distributions of TH03 against each other:

| Game distribution / Kernel | PC-9801-26K (PMD.COM) | PC-9801-86 (PMD86.COM) | PC-9801-73 (PMDB2.COM) |
|---|---|---|---|
| That one old widely circulating .HDI*, MS-DOS 3.3 | 0x3927 | 0x3A83 | 0x3A2B |
| Custom†, MS-DOS 6.20 | 0x2CCC | 0x2E28 | 0x2DD0 |
| Raw files, current DOSBox-X | 0x2C91 | 0x2DED | n/a |
| 2021 Ultimate Collection††, MS-DOS 7, DOSBox-X | 0x2C82 | n/a | 0x2D86 |
| 2021 Ultimate Collection††, MS-DOS 7, Neko Project | 0x2D98 | 0x2EF4 | 0x2E9C |

* The Touhou98 Experience v3.00 release is also built around this .HDI.
† Created by copying the TH03 files onto that one old widely circulating TH04 .HDI that needed a later DOS version due to 📝 the no-EMS crash bug. Representative of other 5-game .HDI setups that might be floating around the Internet or that people have built for themselves, or real-hardware setups.
†† The bold columns indicate the default setting the package came with.

Cross-referencing these with the memory map above then gives us the following affected addresses:

I guess we can thank spaztron64 for accidentally building just the right setup to reveal the extend glitch in the first place. I certainly needed to be told of its existence by the gameplay community!
| Slot | destroyed by P1 | destroyed by P2 |
|---|---|---|
| 0x2C | P1's 10-million digit | temporary data |
| 0x2D | P1's 100-million digit | temporary data |
| 0x2E | P2's 10's digit | Extends gained |
| 0x39 | temporary data | not yet RE'd |
| 0x3A | temporary data | not yet RE'd |
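
For reference, the slot values in this table are simply the top 8 bits of the segment #4 addresses measured above. A quick sketch of that arithmetic, with the dictionary keys being my own shorthand labels for the measured setups:

```python
# Segment #4 addresses as measured in the distribution table above.
SEGMENT_4 = {
    "old .HDI, MS-DOS 3.3, PMD.COM": 0x3927,
    "old .HDI, MS-DOS 3.3, PMD86.COM": 0x3A83,
    "old .HDI, MS-DOS 3.3, PMDB2.COM": 0x3A2B,
    "custom, MS-DOS 6.20, PMD.COM": 0x2CCC,
    "custom, MS-DOS 6.20, PMD86.COM": 0x2E28,
    "custom, MS-DOS 6.20, PMDB2.COM": 0x2DD0,
    "raw files, DOSBox-X, PMD.COM": 0x2C91,
    "raw files, DOSBox-X, PMD86.COM": 0x2DED,
    "Ultimate Collection, DOSBox-X, PMD.COM": 0x2C82,
    "Ultimate Collection, DOSBox-X, PMDB2.COM": 0x2D86,
    "Ultimate Collection, Neko Project, PMD.COM": 0x2D98,
    "Ultimate Collection, Neko Project, PMD86.COM": 0x2EF4,
    "Ultimate Collection, Neko Project, PMDB2.COM": 0x2E9C,
}


def glitched_chain_slot(segment):
    """Top 8 bits of a 16-bit segment address."""
    return (segment >> 8) & 0xFF


slots = {setup: glitched_chain_slot(seg) for setup, seg in SEGMENT_4.items()}
for setup, slot in slots.items():
    print(f"{setup}: 0x{slot:02X}")
```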

The fix

So, we've got two glitches that may or may not appear on any given PC-98 setup, and whose effects depend on the combination of operating system, sound card model, and machine speed. That should make it abundantly clear that we can classify the underlying code issue as a landmine. There is nothing worth preserving about system-specific behavior on any ReC98 branch other than master, especially since we want to get away from the architecture in the long term.
It's also obvious that we must fix this to retain gameplay integrity in any build that wants to support netplay:

But how do we fix it? The offending call to chain_fire_charged_exatt() might look redundant at first, but it has definite effects on gameplay on both branches:

The ideal fix is equally simple, though. Just initializing chain_slot to 15 removes all undefined behavior while simultaneously locking down the deterministic and observable effect of normal code flow. We can also easily implement this without breaking the original position-dependent binary by taking the 4 necessary additional bytes from the function's own bloat. And boy is there a lot of it; removing just even the most obvious single piece of bloat in this one function freed up 23 bytes, leaving 19 bytes unused. Any concerns about memory budgets for minor mods are vastly overblown.
This means that we're already up to the second release of TH03's debloated and Anniversary Editions!

Richard Stallman cosplaying as a shrine maiden ReC98 (version P0335) 2026-03-16-ReC98.zip


So that was 3,374 words to explain the ramifications of a single uninitialized local variable, leaving us at 5.375 pushes in total. What better way to round out this one than to finish the unfinished subproject from last time:

Completing the Fediverse migration

📝 As of last time, my Fediverse presence was only missing three pieces of data to function as a full-on replacement for Twitter:

  1. All posts since 2023-07-02, which had twice as much media attached to them as the previous 8½ years combined
  2. Alt text for images
  3. Polls and their results, if this is even possible without assigning each vote to an existing Fediverse user

With the latter two not being part of Twitter's data archive, I already expected that I'd have to cobble together some very awkward code to obtain this data. But I didn't expect that this would cast an even worse light on Twitter:

And I am certainly not paying just to efficiently retrieve data that should have been part of my data archive all along if I can at all avoid it.

Aren't there a few evil third-party projects that could help here? Nitter, for example, is well-known for retrieving Twitter timelines and displaying them on efficient, server-rendered, and easily scrapeable pages, bypassing the need for both activated JavaScript and a Twitter account just for reading posts. Sadly, it didn't render alt text on its frontend at the time in early February when I wanted to complete the Fediverse import. But it's open-source – and although I couldn't comprehend the exact mechanisms of the unofficial Twitter API they use, it might be the best starting point.

And sure enough, someone else previously needed alt text as well, found the string in the API response that Nitter was already processing, and wrote the code to pass it on to the frontend. All I had to do then was to cherry-pick these commits, adjust them for the current API's response schema, and run my own Nitter instance. This was the first time I hacked around in anything written in Nim, but I encountered no issues when building the project on Linux, although the build system is definitely on the slower end as far as systems languages are concerned.
Since this was ridiculous enough for what I wanted to do, I then pushed the updated versions of these commits, in the hope that someone else could save those 10-20 minutes of fixing merge conflicts. Since that issue was languishing for over three years, I certainly didn't expect that Nitter's maintainer would actually merge these commits 1½ weeks later. Even better, though! Thanks to ReC98, alt text is now shown on the main nitter.net instance. As of this blog post, the widget in Nitter's frontend still suffers from a visual overlapping bug in its unexpanded form when displaying multi-line alt text, but having that text in the server-rendered DOM at all is all I was asking for…

…except that my GoToSocial importer can't just automatically scrape this text from nitter.net because they prevent scraping, and have put a lot of effort into keeping it prevented. Specifically, any request made with the Python Requests library will return a 200 OK response code but an empty response body.
The Nitter wiki links a few other public instances, but none of those would work for us either. Some of these are still running a Nitter version older than 1c06a67, as of this blog post. And the ones that do run a current version employ similar scraping prevention techniques: They either respond with a 403 Forbidden, a redirect to a JavaScript challenge page, or just outright close the HTTP connection. Working around that would be way beyond reasonable, considering the budget that this task was supposed to take up… and if they don't want people to scrape them, I should respect that. Yet another example of AI crawlers ruining everything, I guess…
zedeus did recommend Twikit for programmatic access to Twitter data in the issue I linked above, but that library is currently unusable as well. A simple get_tweet_by_id() call returns 403 Forbidden, suggesting that you have to log in first, but any attempt to login() gets blocked by CloudFlare.

So yeah. You might only need to run such an import script once, but if you want a complete migration with alt text and polls, there's no way around temporarily self-hosting a Nitter instance, as far as I can tell. Lovely.
Speaking of polls, though…

Polls

With a scraping setup in place, retrieving poll data and calculating the exact votes from the given percentages on Nitter's frontend was no big deal. Instead, all of the annoyance here lies in then getting that data onto a Fediverse server:

  1. Very understandably, GoToSocial lacks an API to pre-fill poll results,
  2. but it also refuses to create backdated statuses with polls altogether.

These two points defeat any lasting automated migration code I could add to my GoToSocial importer. Printing out the imported data to stdout is the best I can do; any efficient method of importing backdated polls relies on removing the condition from 2) and compiling your own build of GoToSocial.
Thankfully, this is no big deal even when cross-compiling from Windows to Linux. With such a build, your POST /api/v1/statuses requests can at least include the poll's question labels and return the database ID of the newly created poll for further manipulation at the SQL level. Afterwards, you can reformat the stdout output into a single query and run it manually:

UPDATE polls SET
	voters = 87,
	votes = '[25, 49, 11, 2]',
	expires_at = '2022-04-16 23:53:20.000000+00:00'
WHERE id = {poll_id};
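
As for the vote counts in such a query, here's a rough sketch of how they could be recovered from percentages. The function name is hypothetical, and the integer percentages in the example are back-computed from the numbers in the query above; they only illustrate the calculation and are not necessarily what Nitter's frontend displays.

```python
def votes_from_percentages(percentages, voters):
    """Turns per-option percentages back into absolute vote counts.

    Raises a ValueError if rounding does not reproduce the voter total, in
    which case the percentages are too imprecise for this simple approach.
    """
    votes = [round(p * voters / 100) for p in percentages]
    if sum(votes) != voters:
        raise ValueError(
            f"votes {votes} add up to {sum(votes)}, expected {voters}"
        )
    return votes


# Illustrative input: 29 % + 56 % + 13 % + 2 % of 87 voters.
votes = votes_from_percentages([29, 56, 13, 2], voters=87)
print(votes)  # [25, 49, 11, 2]
```

The sum check is the important part: Once the rounded counts no longer add up to the voter total, you'd need either more decimal places or some manual judgment.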

But wait, really? You can manipulate a poll's results, both inside GoToSocial and on all other clients, with just a single SQL query? If these results don't need to be associated with specific users or at least their originating servers, doesn't this mean that polls on the Fediverse are inherently untrustworthy?
Of course, it makes little sense to verify poll results before displaying them on a server's own frontend. You can always just place a reverse proxy in front of the server and rig a poll in that way. But the inbox mechanism of ActivityPub already technically breaks anonymity across servers from the point of view of the server admin, so I would have expected both clients and other federated servers to at least run some sort of validation upon receiving a poll. Although that would definitely cause heavy load on servers and clients as these polls receive more and more answers, within a protocol that is already quite chatty. :thonk:

In any case, this kind of validation would have to exist on an entirely different layer that has nothing to do with ActivityPub. That protocol – or rather, the underlying ActivityStreams vocabulary – only specifies how Questions and their associated answers are sent between servers. But once these votes are on a server, they only need to be stored in the minimal way expected by the usual client API. Which, as these words suggest, actually isn't even part of ActivityPub: Although that standard does attempt to specify a client API in addition to the server-to-server protocol, its development and adoption have stalled. Instead, most clients in common use simply implement the API that Mastodon once came up with, as a de facto standard. Mastodon's poll feature was quickly implemented in the most basic way, and no one has ever even proposed adding any validation features on top, as far as I can tell. And so, the GET /api/v1/statuses/{id} endpoint of any Mastodon-compatible Fediverse server is simply expected to return

{
	…
	"poll": {
		"options": [
			{ "title": "Option 1", "votes_count": 0 },
			{ "title": "Option 2", "votes_count": 0 },
			{ "title": "Option 2", "votes_count": 0 }
		]
	}
	…
}

without any reference to where these votes are coming from, expecting the client to fully trust the server as far as the legitimacy of these votes is concerned. Of course, I'm a pretty trustworthy person if I may say so myself, but this detail turns the Fediverse into a pretty bad place to conduct even just semi-serious polls…
This is also why I haven't pointed any of the old links to Twitter polls on this website to my ActivityPub server yet. As long as the @ReC98Project account still exists and you can still view individual tweets without a Twitter account, it remains the authoritative and ultimately more trustworthy home of these poll results.

So in the end, centralized systems, and Twitter in particular, would still be indispensable for at least the occasional poll I tend to run. Bluesky, unfortunately, still doesn't natively support polls and instead forces people to build custom poll systems in the meantime. While slightly more trustworthy than Fediverse clients on paper, the ones I've seen are just Strawpoll clones that don't come with any kind of authentication, and often even allow people to place multiple votes by simply opening a new private browser window. Unsurprisingly, I therefore received several random and apparently botted answers within seconds of creating a poll on both the systems I've used so far, making their results even more unusable. One of them also went offline within roughly 6 months of me using it, taking its one associated post with it.
Or maybe I should just conduct polls by asking for thought-out replies and rewarding people with, like, of ReC98 budget going to a goal of their choice…

On the topic of Bluesky, though…

The hell is this video quality? You upload lossless AV1 through the bsky.app frontend, and they blow up the video to 15.5× its original size by applying every single one of the worst possible processing steps:

  1. For starters, they re-encode your video as H.264 using the High profile with YUV420P chroma subsampling. This is the same state of the art from 2005 that Twitter still expects you to produce yourself 21 years later. I'd expect any competitor to be better than that, but with Bluesky, that's just the start.
  2. Because then, they enforce 30 FPS, duplicating or dropping frames as needed. This is inexcusable and disgusting.
  3. They also bilinearly scale your video to 720 pixels in its shortest dimension. At least they maintain its original aspect ratio…
  4. But this scaling process also targets a constant bitrate of 3,000 kbit/s, regardless of whether the video actually needs that much. I'm getting flashbacks of a certain dumb and uninformed "if MP3, then 320kbps CBR, everything below sucks" mentality.
  5. Those bits would have been much better invested in keyframes. Of which we get exactly zero, causing seeking lags at consistent spots throughout these videos.

But maybe that's just bsky.app's backend doing its own video conversion. Maybe its developers did collect statistics and then made the conscious and informed decision that those exact processing steps would deliver the optimal trade-off between quality, traffic, and compatibility for the kind of HD content that people typically want to upload to such a platform nowadays. Maybe, my use case is just a rare exception… wait, gamers are unhappy too? 🤨
Surely then, the AT Protocol network still stores the original lossless video I uploaded, and an alternate client could play it directly, right? That's certainly what I would expect from a platform that advertises self-hosting, control over one's own data, and account portability.

So let's check Bluesky's data archive, and… the hell is a CAR file? A proprietary binary format in place of the standard .zip file you'd get from even the most evil corporations on the planet? Why the hell would they confront people trying to access their own data with a blog post containing Go code, directly linked from the data export modal? Usually, platforms are hard to leave due to network effects and people's inherent laziness to open another browser tab, but I've never seen a service also put up such a deliberate technical hurdle. How expensive could it have possibly been to set up a server that zips up all of a user's data on request? Why did they have to externalize these costs to their users?
The biggest joke, though: All code in that blog post is taken from this repository with example code, but the README in that directory says:

Check out the goat command line tool, which does the same thing and is actively maintained.

Except that the goat subcommand that actually downloads post-attached blobs, goat blob export, downloads directly from the AT Protocol network and doesn't even need the CAR file! Which, by the way, you could also download via goat repo export. And why wouldn't you? You sure need this tool anyway to do anything, because no existing tooling works with this format!
Anyway. Let's look at the blob of one of those videos, and…

$ ffprobe -hide_banner 03_source.webm
Input #0, matroska,webm, from '03_source.webm':
  Metadata:
    ENCODER         : Lavf62.3.100
  Duration: 00:00:05.12, start: 0.000000, bitrate: 454 kb/s
  Stream #0:0:
    Video:
      av1 (libdav1d) (High), gbrp(pc, gbr/bt709/iec61966-2-1, progressive),
      288x368, SAR 1:1 DAR 18:23, 56.42 fps, 56.42 tbr, 1k tbn
    Metadata:
      ENCODER         : Lavc62.11.100 libaom-av1
      DURATION        : 00:00:05.122000000
$ ffprobe -hide_banner bafkreid7mp5u7mb32o2a5n7zl2hbfxkiiaiebh7x77vik27alu2hfvuq5q
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'bafkreid7mp5u7mb32o2a5n7zl2hbfxkiiaiebh7x77vik27alu2hfvuq5q':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf61.1.100
  Duration: 00:00:05.17, start: 0.000000, bitrate: 193 kb/s
  Stream #0:0[0x1](und):
    Video:
      h264 (High) (avc1 / 0x31637661), yuv420p(tv, unknown/bt709/iec61966-2-1, progressive),
      288x368 [SAR 1:1 DAR 18:23], 189 kb/s, 30 fps, 30 tbr, 15360 tbn (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
      encoder         : Lavc61.3.100 libx264

That's basically the same video before the scaling. Still YUV420P, still 30 FPS, still no keyframes.
I hate everything about this – and most importantly, the fact that I even have to bother with this complete waste of human engineering effort just because these kinds of platform choices are decided by politics and marketing for the vast majority of people. If I didn't land the occasional viral hit on there, I would have discontinued Bluesky immediately after this discovery. Maybe I should just trade one corporate platform for another and start looking into Tumblr, as touhou-memories recommended…

Oh well. Next up: Headless Shuusou Gyoku, preparing the automated replay validation I've wanted for years. Since Shuusou Gyoku's new replay system is long overdue now, I will put all future budget from Ember2528 into that system until it's done. If you'd like more TH03 done sooner, please fund it separately – but with of free budget left, there should be plenty of room for everyone.
That said, my life priorities do seem to have shifted for good by now. So I might need to reduce the cap if all of you really end up commissioning that much… But let's see how fast things will actually progress in the future, now that this worst possible piece of reverse-engineering is finally done.

📝 Posted:
💰 Funded by:
Ember2528, Arandui
🏷️ Tags:

Here we go, the finale of the Shuusou Gyoku Linux port, culminating in packages for the Arch Linux AUR and Flathub! No intro, this is huge enough as it is.

  1. Compiling with C++ Standard Library Modules for Linux
  2. Porting the remaining logic code to Clang
  3. Picking a free MS Gothic replacement
  4. Reasons for using the standard Linux text library stack
  5. The individual Linux text rendering libraries
  6. Debugging vertical placement issues
  7. The new icon
  8. Packaging
  9. Future work

Before we could compile anything for Linux, I still needed to add GCC/Clang support to my Tup building blocks, in what's hopefully the last piece of build system-related work for a while. Of course, the decision to use one compiler over the other for the Linux build hinges entirely on their respective support for C++ standard library modules. I 📝 rolled out import std; for the Windows build last time and absolutely do not want to code without it anymore. According to the cppreference compiler support table at the time I started development, we had the choice between

  1. experimental support in the not-yet-released GCC 15, and
  2. partial support as of Clang 17, two versions ago.

GCC's current implementation does compile in current snapshot builds, but still throws lots of errors when used within the Shuusou Gyoku codebase. Clang's allegedly partial support, on the other hand, turned out just fine for our purposes. So for now, Clang it is, despite not being the preferred C/C++ compiler on most Linux distributions. In the meantime, please forgive the additional run-time dependency on libc++, its C++ standard library implementation. 🙇 Let's hope that it all will actually work in GCC 15 once that version comes out sometime in 2025.

At a high level, my Tup building blocks only have to do a single thing to support standard library modules with a given compiler: Finding the std and std.compat module interface units at the compiler's standard locations, and compiling them with the same compiler flags used for the rest of the project. Visual Studio got the right idea about this: If you compile on its command prompts, you're already using a custom shell with environment variables that define the necessary paths and parameters for your target platform. Therefore, it makes sense to store these module units at such an easily reachable path – and sure enough, you can reliably find the std module unit at %VCToolsInstallDir%\modules\std.ixx. While this is hands down the optimal way of locating this file, I can understand why GCC and Clang would want module lookup to work in generic shells without polluting environment variables. In this case, asking some compiler binary for that path is a decent second-best option.
Unfortunately, that would have been way too simple. Instead, these two compilers approached the problem from the angle of general module usage within the common build systems out there:

Wonderful. Just what we wanted to do all along, only with an additional layer of indirection that now forces every build system to include a JSON parser somewhere in its architecture. 🤦
In CMake's defense, they did try to get other build systems, including Tup, involved in these proposals. Can't really complain now if that was the consensus of everybody who wanted to engage in this discussion at the time. Still, what a sad irony that they reached out to Tup users on the exact day in 2019 at which I retired from thcrap and shelved all my plans of using Tup for modern C++ code…

So, to locate the interface units of standard library modules on Clang and GCC, a build system must do the following:

  1. Ask the compiler for the path to the modules.json file, using the 30-year-old -print-file-name option.
    GCC and Clang implement this option in the worst possible way by basically conditionally prepending a path to the argument and then printing it back out again. If the compiler can't find the given file within its inscrutable list of paths or you made a typo, you can only detect this by string-comparing its output with your parameter. I can't imagine any use case that wouldn't prefer an error instead.
    Clang was supposed to offer the conceptually saner -print-library-module-manifest-path option, but of course, this is modern C++, and every single good idea must be accompanied by at least one other half-baked design or implementation decision.

  2. Load the JSON file with the returned file name.

  3. Parse the JSON file.

  4. Scan the "modules" array for an entry whose "logical-name" matches the name of the standard module you're looking for.

  5. Discover that the "source-path" is actually relative and will need to be turned into an absolute one for your compilation command line. Thankfully, it's just relative to the path of the JSON file we just parsed.

Sure, you can turn everything into a one-liner on Linux shells, but at what cost?

clang++ -stdlib=libc++ -c -Wno-reserved-module-identifier -std=c++2c --precompile $(dirname $(clang -print-file-name=libc++.modules.json))/$(jq -r '.["modules"][] | select(."logical-name"=="std")."source-path"' $(clang -print-file-name=libc++.modules.json))
You might argue that Tup rules are a rather contrived case. Tup by itself can't store the output of processes in variables because rule generation and rule execution are two separate phases, so we need to call clang -print-file-name at both of the places in the command line where we need the file name. But, uh, CMake's implementation is 170 lines long.
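
Outside of a one-liner, the five steps above could look roughly like the following sketch. The function names are mine, the manifest file name and JSON keys are the ones mentioned above, and error handling is limited to the "compiler echoes the argument back" case.

```python
import json
import subprocess
from pathlib import Path


def locate_manifest(compiler="clang", manifest="libc++.modules.json"):
    """Step 1: asks the compiler where the module manifest lives."""
    output = subprocess.run(
        [compiler, f"-print-file-name={manifest}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # A compiler that can't find the file just prints the argument back.
    if output == manifest:
        raise FileNotFoundError(f"{compiler} does not know about {manifest}")
    return Path(output)


def module_source(manifest_path, logical_name="std"):
    """Steps 2 to 5: returns the absolute path to a module interface unit."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    for module in manifest["modules"]:
        if module["logical-name"] == logical_name:
            # "source-path" is relative to the directory of the JSON file.
            return (manifest_path.parent / module["source-path"]).resolve()
    raise KeyError(f"no module named {logical_name!r} in {manifest_path}")
```

With a working Clang and libc++ installation, `module_source(locate_manifest(), "std")` should then return the path to pass to the `--precompile` command line, and `"std.compat"` works the same way.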

At least it's pretty straightforward to then use these compiled modules. As far as our Tup building blocks are concerned, it's just another explicit input and a set of command-line flags, indistinguishable from a library. For Clang, the -fmodule-file=module_name=path option is all that's required for mapping the logical module names to the respective compiled debug or release version.
GCC, however, decided to tragically over-engineer this mapping by devising a plaintext protocol for a microservice like it's 2014. Reading the usage documentation is truly soul-crushing as GCC tries everything in its power to not be like Clang and just have simple parameters. Fortunately, this mapper does support files as the closest alternative to parameters, which we can just echo from Tup for some 📝 90's response file nostalgia. At least I won't have to entertain this folly for a moment longer after the Lua code is written and working…


So modules are justifiably hard and we should cut compiler writers some slack for having to come up with an entirely new way of serializing C++ code that still works with headers. But surely, there won't be any problems with the smaller new C++ features I've started using. If they've been working in MSVC, they surely do in Clang as well, right? Right…?
Once again, C++ standard versions are proven to be utterly meaningless to anyone outside the committee and the CppCon presenters who try to convince you they matter. Here's the list of features that still don't work in Clang in early 2025:

It almost looked like it'd finally be time for my long-drafted rant about the state of modern C++, but the language just barely redeemed itself with the last two sentences there. Some other time, then…
On the bright side, all my portability work on game logic code had exactly the effect I was hoping for: Everything just worked after the first successful compilation, with zero weird run-time bugs resulting from the move from a 32-bit MSVC build to 64-bit Clang. 🎉


Before we can tackle text rendering as the last subsystem that still needs to be ported away from Windows, we need to take a quick look at the font situation. Even if we don't care about pixel-perfectly matching the game's text rendering on Windows, MS Gothic seems to be the only font that fits the game's design at all:

However, MS Gothic is non-free and any use of the font outside of a Windows system violates Microsoft's EULA. In spite of that, the AUR offers three ways of installing this font regardless:

  1. The ttf-ms-*auto-* packages download a Windows 10 or 11 ISO from a somewhat official download link on Microsoft's CDN and extract the font files from there. Probably good enough if downloading 5 GB only to scrape a single 9 MB font file out of that image doesn't somehow feel wrong to you.
  2. The ttf-ms-win10-cdn-* packages download just the font files from… somewhere on IPFS. :thonk:
  3. The regular, non-auto or -cdn ttf-ms-win* packages leave it up to you where exactly you get the files from. While these are the clearest options in how they let you manually perform the EULA infringement, this manual nature breaks automated AUR helpers. And honestly, requiring you to copy over all 141 font files shipped with modern Windows is massively overkill when we only need a single one of them. At that point, you might as well just copy msgothic.ttc to ~/.local/share/fonts and not bother with any package. Which, by the way, works for every distro as well as Flatpaks, which can freely access fonts on the host system.

You might want to go the extra mile and use any of these methods for perfectly accurate text rendering on Linux, and supporting MS Gothic should definitely be part of the intended scope of this port. But we can't expect this from everyone, and we need to find something that we can bundle as part of the Flatpak.

So, we need an alternative free Japanese font that fits the metric constraints of MS Gothic, has embedded bitmaps at the exact sizes we need, and ideally looks somewhat close. Checking all these boxes is not too easy; Japanese fonts with a full set of all Kanji in Shift-JIS are a niche to begin with, and nobody within this niche advertises embedded bitmaps. As the DPI resolutions of all our screens only get higher, well-designed modern fonts are increasingly unlikely to have them, thus further limiting the pool to old fonts that have long been abandoned and probably only survived on websites that barely function anymore.
Ultimately, the ideal alternative turned out to be a font named IPAMonaGothic, which I found while digging through the Winetricks source code. While its embedded bitmaps only cover MS Gothic's first half for font heights between 10 and 16 pixels rather than going all the way to 22 pixels, it happens to be exactly the range we need for this game.


Alright then, how are we going to get these fonts onto the screen with something that isn't GDI? With all the emphasis on embedded bitmaps, you might come to the conclusion that all we want to do is to place these bitmap glyphs next to each other on a monospaced grid. Thus, all we'd need is a TTF/OTF library that gives us the bitmap for a given Unicode code point. Why should we use any potentially system-specific API then?
But if we instead approach this from the point of view of GDI's feature set, it does seem better to match a standard Windows text rendering API with the equivalent stack of text rendering libraries that are typically used by Linux desktop environments. And indeed, there are also solid reasons why this is a better idea for now:

Let's look at what this stack consists of and how the libraries interact with each other:

In the end, a typical desktop Linux program requires every single one of these 8 libraries to end up with a combined API that resembles Ye Olde Win32 GDI in terms of functionality and abstraction level. Sure, the combination of these eight is more powerful than GDI, offering e.g. affine transformations and text rendering along a curved path. But you can't remove any of these libraries without falling behind GDI.

Even then, my Linux implementation of text rendering for Shuusou Gyoku still ended up slightly longer than the GDI one due to all the Pango and Cairo contexts we have to manually manage. But I did come up with a nice trick to reduce at least our usage of Cairo: Since GDI needs to be used together with DirectDraw, the GDI implementation must keep a system-memory copy of the entire 📝 text surface due to 📝 DirectDraw's possibility of surface loss. But since we only use Cairo with SDL, the Cairo surface in system memory does not actually need to match the SDL-managed GPU texture. Thus, we can reduce the Cairo surface to the role of a merely temporary system-memory buffer that only is as large as the single largest text rectangle, and then copy this single rectangle to the intended packed place within the texture. I probably wouldn't have realized this if the seemingly most simple way to limit rendering to a fixed rectangle within a Cairo surface didn't involve creating another Cairo surface, which turned out to be quite cumbersome.
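The scratch-buffer idea can be illustrated without Cairo or SDL. The following Python sketch uses plain bytearrays as stand-ins for the Cairo surface and the texture; `scratch_size` and `blit_rect` are hypothetical names, and the real implementation naturally goes through the respective library APIs instead.

```python
def scratch_size(rects):
    """Return the (width, height) a scratch buffer needs to fit every rect.

    `rects` is a list of (width, height) tuples of all text rectangles.
    """
    return (max(w for w, _ in rects), max(h for _, h in rects))


def blit_rect(scratch, scratch_pitch, texture, texture_pitch,
              dst_x, dst_y, w, h, bpp=4):
    """Copy the top-left w×h pixels of `scratch` to (dst_x, dst_y) in `texture`.

    Both buffers are row-major bytearrays; pitches are given in bytes.
    """
    row_bytes = w * bpp
    for y in range(h):
        src = y * scratch_pitch
        dst = (dst_y + y) * texture_pitch + dst_x * bpp
        texture[dst:dst + row_bytes] = scratch[src:src + row_bytes]
```

Every text rectangle is rendered at the top-left corner of the scratch buffer and then copied to its packed position, so the scratch buffer never has to mirror the layout of the texture.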


But can this stack deliver the pixel-perfect rendering we'd like to have? Well, almost:

Cue hours of debugging to find the cause behind these vertical shifts. The overview above already suggested it, but this bug hunt really drove home how this entire stack of libraries is a huge pile of redundantly implemented functionality that interacts with and overrides each other in undocumented and mostly unconfigurable ways. Normally, I don't have much of a problem with that as long as I can step through the code, but stepping through Cairo and especially Pango is a special kind of awful. Both libraries implement dynamic typing and object-oriented paradigms in C, thus hiding their actually interesting algorithms under layers and layers of "clean" management functions. But the worst part is a particularly unexpected piece of recursion: To lay out a paragraph of text, Pango requires a few font metrics, which it calculates by laying out a language-specific paragraph of example text. No, I do not like stepping through functions that much, please don't put a call to the text layout function into the text layout function to make me debug while I debug, dawg…
It'll probably take many more years until most of this stack has been displaced with the planned Rust rewrites. But honestly, I don't have great hopes as long as they stay with this pile-of-libraries approach. This pile doesn't even deserve to be called a stack given the circular dependency between FreeType and HarfBuzz.

Ultimately, these are the bugs we're seeing here:

  1. When rendering strings that contain both Japanese and Latin characters with MS Gothic, the Japanese characters are pushed down by about 1/8th of the font height. This one was already reported in June 2023 and is a bug in either HarfBuzz, Pango, or MS Gothic. With the main HarfBuzz developer confused and without an idea for a clean solution, the bug has remained unfixed for 1½ years.
    For now, the best workaround would be to revert the commit that introduced the baseline shift. Since the Flatpak release can bundle whatever special version of whatever library it needs, I can patch this bug away there, but distro-specific packages or self-compiled builds would have to patch Pango themselves. LD_LIBRARY_PATH is a clean way of opting into the patched library without interfering with the regular updates of your distro, but there's still a definite hurdle to setting it up.

  2. The remaining 1-pixel vertical shift is, weirdly enough, caused by hinting. Now why would a technique intended for improving the sharpness of outline fonts even apply to bitmap fonts to begin with? As you might have guessed, the pile-of-libraries approach strikes once more:

Don't you love it when the concerns are so separated that they end up overlapping again? I'm so looking forward to writing my own bitmap font renderer for the multilingual PC-98 translations, where the memory constraints of conventional DOS RAM make it infeasible to use any libraries of this pile to begin with 😛


Before we can package this port for Flathub, there's one more obstacle we have to deal with. Flathub mandates that any published and publicly listed app must come with an icon that's at least 128×128 pixels in size. pbg did not include the game's original 32×32 icon in the MIT-licensed source code release, but even if he did, just taking that icon and upscaling it by 4× would simultaneously look lame and more official than it perhaps should.
So, the backers decided to commission a new one, depicting VIVIT in her title screen pose but drawn in a different style so as not to look too official. Mr. Tremolo Measure quickly responded to our search and Ember2528 liked his PC-98-esque pixel art style, so that's what we went for:

The 16×16, 32×32, 48×48, and 128×128 versions of the new Shuusou Gyoku icon commissioned from Mr. Tremolo Measure.
Mr. Tremolo Measure on Bluesky.
The repo also contains textless and boxless variants.

However, the problem with pixel art icons is that they're strongly tied to specific resolutions. This clashes with modern operating system UIs that want to almost arbitrarily scale icons depending on the context they appear in. You can still go for pixel art, and it sure looks gorgeous if an icon's resolution exactly matches the size a GUI wants to display it at. But that's a big if – if the size doesn't match and the icon gets scaled, the resulting blurry mess lacks all the definition you typically expect from pixel art. Even nearest-neighbor integer upscaling looks cheap rather than stylized, as the coarse pixel grid of the icon clashes with the finer pixel grid of everything surrounding it.

So you'd want multiple versions of your icon that cover all the exact sizes it will appear at, which is definitely more expensive than a single smooth piece of scalable vector artwork. On a cursory look through Windows 11, I found no fewer than 7 different sizes that icons are displayed at:

And that's just at 1× display scaling and the default zooming factors in Explorer.

But it gets worse. Adding our commissioned multi-resolution icon to an .exe seems simple enough:

  1. Bundle the individual images into a single .ico file using magick in1.png in2.png … out.ico
  2. Write a small resource script, call rc, and add the resulting .res file to the link command line
  3. Be amazed as that icon appears in the title and task bars without you writing a single line of code, thanks to SDL's window creation code automatically setting the first icon it finds inside the executable

But what's going on in Explorer?

An .ico file of the new Shuusou Gyoku icon commissioned from Mr. Tremolo Measure. Explorer's extra large icon mode shows the highest-resolution 128×128-pixel variant in a 128×128-pixel box, as expected.
An .exe binary with the same .ico file embedded. Strangely, Explorer's extra large icon mode shows the 48×48-pixel variant in the center of a 256×256-pixel box.
Same Extra large icons setting for both.

That's the 48×48 variant sitting all tiny in the center of a 256×256 box, in a context where we expect exactly what we get for the .ico file. Did I just stumble right into the next underdocumented detail? What was the point of having a different set of rules for icons in .exe files? Make that 📝 another Raymond Chen explanation I'm dying to hear…
Until then, here's what the rules appear to be:

Oh well, let's nearest-neighbor-scale our 128×128 icon by 2× and move on to Linux, where we won't have such archaic restrictions…

…which is not to say that pixel art icons don't come with their own issues there. 🥲
On Linux, this kind of metadata is not part of the ELF format, but is typically stored in separate Desktop Entry files, which are analogous to .lnk shortcuts on Windows. Their plaintext nature already suggests that icon assignment is refreshingly sane compared to the craziness we've seen above, and indeed, you simply refer to PNG or even SVG files in a separate directory tree that supports arbitrary size variants and even different themes. For non-SVG icons, menus and panels can then pick the best size variant depending on how many pixels they allot to an icon. The overwhelming majority of the ones I've seen do a good job at picking exactly the icon you'd expect, and bugs are rare.
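As a rough illustration of what "picking the best size variant" can mean, here is one simple policy in Python. This is explicitly not the lookup algorithm from the freedesktop Icon Theme Specification, and not what any particular menu or panel implements; it's just a minimal sketch of the idea, with a made-up function name.

```python
def pick_icon_size(available, wanted):
    """Pick the size variant to show in a slot of `wanted` pixels.

    Prefers an exact match, then the smallest larger variant (which only
    has to be scaled down), then the largest variant available.
    """
    if not available:
        raise ValueError("no icon sizes available")
    if wanted in available:
        return wanted
    larger = [size for size in available if size > wanted]
    return min(larger) if larger else max(available)
```

With the four sizes of the commissioned icon, a 32-pixel slot gets the exact 32×32 variant, while a 24-pixel slot would have to scale down the 32×32 one, which is exactly the situation that hurts pixel art.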

But how would this work for title and task bars once you started the app? If you launched it through a Desktop Entry, a smart window manager might remember that you did and automatically use the entry's icon for every window spawned by the app's process. Apparently though, this feature is rather rare, maybe because it only covers this single use case. What about just directly starting an app's binary from a shell-like environment without going through a Desktop Entry? You wouldn't expect window managers to maintain a reverse mapping from binaries to Desktop Entries just to also support icons in this other case.

So, there must be some way for a program to tell the window manager which icon it's supposed to use. Let's see what SDL has to offer… and the documentation only lists a single function that takes a single image buffer and transfers its pixels to the X11 or Wayland server, overriding any previous icon. 😶
Well great, another piece of modern technology that works against pixel art icons. How can we know which size variant we should pick if icon sizing is the job of the window manager? For the same reason, this function used to be unimplemented in the Wayland backend until the committee of Wayland stakeholders agreed on the xdg-toplevel-icon protocol last year.
Now, we could query the size of the window decorations at all four edges to at least get an approximation, but that approach creates even more problems:

Most importantly though: What if that icon is also used in a taskbar whose icons have a different size than the ones in title bars? Both X11's _NET_WM_ICON property and Wayland's xdg-toplevel-icon-v1 protocol support multiple size variants, but SDL's function does not expose this possibility. It might look as if SDL 3 supports this use case via its new support for alternate images in surfaces, but this feature is currently only used for mouse cursors. That sounds like a pull request waiting to happen though, I can't think of a reason not to do the same for icons. contribution-ideas?

But if SDL 2's single window icon function used to be unsupported on Wayland, did SDL 2 apps just not have icons on Wayland before October 2024?
Digging deeper reveals the tragically undocumented SDL_VIDEO_X11_WMCLASS environment variable, which does what we were hoping to find all along. If you set it to the name of your program's Desktop Entry file, the window manager is supposed to locate the file, parse it, read out the Icon value, and perform the usual icon and size lookup. Window class names are a standard property in both X11 and Wayland, and since SDL helpfully falls back on this variable even on Wayland, it will work on both of them.
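The window manager's side of this lookup could look roughly like the following Python sketch. It is a hypothetical illustration of the "locate, parse, read Icon" steps and not taken from any actual window manager; the function name is made up, and the search path handling is simplified to `$XDG_DATA_DIRS` with its conventional default.

```python
import configparser
import os
from pathlib import Path


def icon_for_window_class(wm_class, data_dirs=None):
    """Return the Icon value of `<wm_class>.desktop`, or None if not found."""
    if data_dirs is None:
        # Simplified assumption: real implementations also search the
        # user-specific data directory.
        data_dirs = os.environ.get(
            "XDG_DATA_DIRS", "/usr/local/share:/usr/share"
        ).split(":")

    for data_dir in data_dirs:
        entry_path = Path(data_dir) / "applications" / f"{wm_class}.desktop"
        if not entry_path.is_file():
            continue
        # Desktop Entry keys are case-sensitive and values may contain "%".
        parser = configparser.ConfigParser(
            interpolation=None, strict=False, delimiters=("=",)
        )
        parser.optionxform = str
        parser.read(entry_path, encoding="utf-8")
        icon = parser.get("Desktop Entry", "Icon", fallback=None)
        if icon is not None:
            return icon
    return None
```

The returned name would then go through the usual icon theme and size lookup, which is where the individual window managers start to differ.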

Or at least it should. Ultimately, it's up to the window manager to actually implement class-derived icons, and sadly, correct support is not as widespread as you would expect.
How would I know this? Because I've tested them all. 🥲 That is, all non-AUR options listed on the Arch Wiki's Desktop environment and Window manager pages that provide something vaguely resembling a desktop you can launch arbitrary programs from:

| WM / DE | Manually transferred pixels | Class-derived icons | Notes |
|---|---|---|---|
| awesome | ✔️ | | Does not report border sizes back to SDL immediately after window creation |
| Blackbox | | | |
| bspwm | | | No title bars |
| Budgie | ✔️ | ✔️ | Title bars have no icons. Taskbar falls back on the icon from the Desktop Entry file the app was launched with. |
| Cinnamon | ✔️ | ✔️ | Title bars have no icons, but they work fine in the taskbar. Points out the difference between native and Flatpak apps! |
| COSMIC | ✔️ | ✔️ | Title bars have no icons, but they work fine in the taskbar. Points out the difference between native and Flatpak apps! |
| Cutefish | | | Title bars have no icons. The status bar only seems to support the X11 _NET_WM_ICON property, and not the older XWMHints mechanism used by e.g. xterm. |
| Deepin | | | Did not start |
| Enlightenment | ✔️ | | Taskbar falls back on the icon from the Desktop Entry file the app was launched with. Only picks the correctly scaled icon variant in about half of the places, and just scales the largest one in the other half. |
| Fluxbox | ✔️ | | |
| GNOME Flashback / Metacity | ✔️ | | Title bars have no icons |
| GNOME | ✔️ | ✔️ | Title bars have no icons |
| GNOME Classic | | | How do you get this running? The variables just start regular GNOME. |
| herbstluftwm | | | No title bars |
| i3 | ✔️ | | |
| IceWM | ✔️ | | Only doesn't work for Flatpaks because it uses a hardcoded list of icon paths rather than $XDG_DATA_DIRS |
| KDE (Plasma) | ✔️ | ✔️ | Taskbar (but not window) falls back on the icon from the Desktop Entry file the app was launched with |
| LXDE | ✔️ | | |
| LXQt | ✔️ | | |
| MATE | ✔️ | | Title bars have no icons |
| MWM | | | |
| Notion | | | No title bars |
| Openbox | ✔️ | | |
| Pantheon | ✔️ | ✔️ | |
| PekWM | | | |
| Qtile | | | No title bars |
| Stumpwm | | | Did not start |
| Sway | | | Architected in a way that made icons too complex to bother with. Might get easier once they take a look at the xdg-toplevel-icon protocol. |
| twm | | | |
| UKUI | | | Window decorations and taskbar didn't work |
| Weston | | | Only supports client-side decorations |
| Xfce | ✔️ | | Taskbar only supports manually transferred icons. Scaling of class-derived icons in title bars is broken. |
| xmonad | | | No title bars |
I tested all window managers, compositors, and/or desktop environments at their latest version as of January 2025 in their default configuration. There were no differences between the X11 and Wayland versions for the ones that offer both.
Yes, you can probably rice title bars and icons onto WMs that don't have them by default. I don't have the time.

That's only 6 out of 33 window managers with a bug-free implementation of class-derived icons, and still 6 out of 28 if we disregard all the tiling window managers where icons are not in scope. If you actually want icons in the title bar, the number drops to just 2, KDE and Pantheon. I'm really impressed by IceWM there though, beating all other similarly old and minimal window managers by shipping with an almost correct implementation.
For now, we'll stay with class-derived icons for budget reasons, but we could add a pixel transfer solution in the future. And that was the 2,000-word story behind this single line of code… 📕


On to packaging then, starting with Arch! Writing my first PKGBUILD was a breeze; as you'd expect from the Arch Wiki, the format and process are very well documented, and the AUR provides tons of examples in case you still need any.
The PKGBUILD guidelines have some opinions about how to handle submodules, but applying them would complicate the PKGBUILD quite a bit while bringing us nowhere close to the 📝 nirvana of shallow and sparse submodules I scripted earlier. But since PKGBUILDs are just shell scripts that can naturally call other shell scripts, we can just ignore these guidelines, run build.sh, and end up with a simpler PKGBUILD and the intended shorter and less bloated package creation process.

Sadly, PKGBUILDs don't easily support specifying a dependency on either one of two packages, which we would need to codify the font situation. Due to the way the AUR packages both IPAMonaGothic and MS Gothic together with their Mincho and proportional variants, either of them would be Shuusou Gyoku's largest individual dependency. So you'd only want to install one or the other, but probably not both. We could resolve this by editing the PKGBUILDs of both font packages and adding a provides entry for a new and potentially controversial virtual package like ttf-japanese-14-and-16-pixel-bitmap that Shuusou Gyoku could then depend on. But with both of the packages being exclusive to the AUR, this dependency would still be annoying to resolve and you'd have no context about the difference.
Thus, the best we can do is to turn both MS Gothic and IPAMonaGothic into optional dependencies with a short one-line description of the difference, and to elaborate on this difference in a comment at the top of the PKGBUILD. Thankfully, the culture around Arch makes this a non-issue because you can reasonably expect people to read your PKGBUILD if they build something from the AUR to begin with. You do always read the PKGBUILD, right? :tannedcirno:


Flatpak, on the other hand… I'm not at all opposed to the fundamental idea of installing another distro on top of an already existing distro for wider ABI compatibility; heck, Flatpak is basically no different from Wine or WSL in this regard. It's just that this particular ABI-widening distro works in a rather… unnatural way that crosses the border into utter cringe at times.
There are enough rants about Flatpak from a user's perspective out there, criticizing the bloat relative to native packages, the security implications of bundling libraries, and the questionable utility of its sandbox. But something I rarely see people talk about is just how awful Flatpak is from a developer's point of view:

If that's the supposed future of shipping programs on Linux, they've sure made this dev look back into the past with newfound fondness. I'm now more motivated than ever to separately package Shuusou Gyoku for every distribution, if only to see whether there's just a single distro out there whose packaging system is worse than Flatpak. But then again, packaging this game for other distros is one of the most obvious contribution-ideas there is.
In the end though, the fact that we need to patch Pango to correctly render MS Gothic means that there is a point to shipping Shuusou Gyoku as a Flatpak, beyond just having a single package that works on every distro. And with a download size of 3.4 MiB and an installed size of 6.4 MiB, Shuusou Gyoku almost exemplifies the ideal use case of Flatpak: Apart from miniaudio, BLAKE3, the IPAMonaGothic font, the temporary libc++, and the patched Pango, all other dependencies of the Linux port happen to be part of the Freedesktop runtime and don't add more bloat to the system.


And so, we finally have a 100% native Linux port of Shuusou Gyoku, working and packaged, after 36 pushes! 🎉 But as usual, there's always that last bit of optional work left. The three biggest remaining portability gaps are

Despite 📝 spending 10 pushes on accurate waveform BGM, MIDI support seems to be the most worthwhile feature out of the three. The whole point of the BGM work was that Linux doesn't have a native MIDI synth, so why should packagers or even the users themselves jump through the hoops of setting up some kind of softsynth if it most likely won't sound remotely close to an SC-88Pro? But if you already did, the lack of support might indeed seem unexpected.
But as described in the issue, MIDI support can also mean "a Windows-like plug-and-play" experience, without downloading a BGM pack. Despite the resulting unauthentic sound, this might also be a worthwhile thing to fund if we consider that 14 of the 17 YouTube channels that have uploaded Shuusou Gyoku videos since P0275 still had MIDI playing through the Microsoft GS Wavetable Synth and didn't bother to set up a BGM pack.

Finally, we might want to patch IPAMonaGothic at some point down the line. While a fix for the ascent and descent values that achieves perfect glyph placement without relying on hinting hacks would merely be nice to have, matching the Unicode coverage of its embedded bitmaps with MS Gothic will be crucial for non-ASCII Latin script translations. IPAMonaGothic's outlines do cover the entire Latin-1 Supplement block, but the font is missing embedded bitmaps for all of this block's small letters. Since the existing outlines prevent any glyph fallback in both Fontconfig and GDI, letters like ä, ö, ü, and ñ currently render as spaces.

FontForge screenshot of MS Gothic's embedded 7×14px glyphs in the Basic Latin and Latin-1 Supplement blocks, showing full coverage of both blocks.
FontForge screenshot of IPAMonaGothic's embedded 7×14px glyphs in the Basic Latin and Latin-1 Supplement blocks, showing missing small letters in the latter block.
Not pictured here is the fact that IPAMonaGothic also suffers from Greek and Cyrillic glyphs being full-width, like most Japanese fonts from the Shift-JIS era. If we ever translate Shuusou Gyoku into those scripts, we'd probably just hunt for a different font altogether. But it's not worth going on such a hunt for Latin scripts that are only missing a few special characters.

Ideally, I'd like to apply these edits by modifying the embedded bitmaps in a more controlled, documented, and diffable way and then recompiling the font using a pipeline of some sort. The whole field of fonts often feels impenetrable because the usual editing workflow involves throwing a binary file into a bulky GUI tool and writing out a new binary file, and it doesn't have to be this way. But it looks like I'd have to write key parts of that pipeline myself:

That would increase the price of translations by about one extra push if you all agree that this is a good idea. If not, then we just go for the usual way of patching the .ttf file after all. In any case, we then get to host the edited font at a much nicer place than the Wayback Machine.

But for now, here's the new build:

Next up: TH02 bullets! Here's to 2025 bringing less build system and maintenance work and more actual progress.

📝 Posted:
💰 Funded by:
GhostPhanom, [Anonymous], Blue Bolt, Yanga
🏷️ Tags:

I'm 13 days late, but 🎉 ReC98 is now 10 years old! 🎉 On June 26, 2014, I first tried exporting IDA's disassembly of TH05's OP.EXE and reassembling and linking the resulting file back into a binary, and was amazed that it actually yielded an identical binary. Now, this doesn't actually mean that I've spent 10 years working on this project; priorities have been shifting and continue to shift, and time-consuming mistakes were certainly made. Still, it's a good occasion to finally fully realize the good future for ReC98 that GhostPhanom invested in with the very first financial contribution back in 2018, deliver the last three of the first four reserved pushes, cross another piece of time-consuming maintenance off the list, and prepare the build process for hopefully the next 10 years.
But why did it take 8 pushes and over two months to restore feature parity with the old system? 🥲

  1. The previous build system(s)
  2. Migrating the 16-bit build part to Tup
  3. Optimizing MS-DOS Player
  4. Continued support for building on 32-bit Windows
  5. The new tier list of supported build platforms
  6. Cleaning up #include lists
  7. TH02's High Score menu

The original plan for ReC98's good future was quite different from what I ended up shipping here. Before I started writing the code for this website in August 2019, I focused on feature-completing the experimental 16-bit DOS build system for Borland compilers that I'd been developing since 2018, and which would form the foundation of my internal development work in the following years. Eventually, I wanted to polish and publicly release this system as soon as people stopped throwing money at me. But as of November 2019, just one month after launch, the store kept selling out with everyone investing into all the flashier goals, so that release never happened.

In theory, this build system remains the optimal way of developing with old Borland compilers on a real PC-98 (or any other 32-bit single-core system) and outside of Borland's IDE, even after the changes introduced by this delivery. In practice though, you're soon going to realize that there are lots of issues I'd have to revisit in case any PC-98 homebrew developers are interested in funding me to finish and release this tool…

The main idea behind the system still has its charm: Your build script is a regular C++ program that #includes the build system as a static library and passes fixed structures with names of source files and build flags. By employing static structure initialization, even a 1994 Turbo C++ would let you define the whole build at compile time, although this certainly requires some dank preprocessor magic to remain anywhere near readable at ReC98 scale. 🪄 While this system does require a bootstrapping process, the resulting binary can then use the same dependency-checking mechanisms to recompile and overwrite itself if you change the C++ build code later. Since DOS simply loads an entire binary into RAM before executing it, there is no lock to worry about, and overwriting the originating binary is something you can just do.
Later on, the system also made use of batched compilation: By passing more than one source file to TCC.EXE, you get to avoid TCC's quite noticeable startup times, thus speeding up the build proportionally to the number of translation units in each batch. Of course, this requires that every passed source file is supposed to be compiled with the same set of command-line flags, but that's a generally good complexity-reducing guideline to follow in a build script. I went even further and enforced this guideline in the system itself, thus truly making per-file compiler command line switches considered harmful. Thanks to Turbo C++'s #pragma option, changing the command line isn't even necessary for the few unfortunate cases where parts of ZUN's code were compiled with inconsistent flags.
I combined all these ideas with a general approach of "targeting DOSBox": By maximizing DOS syscalls and minimizing algorithms and data structures, we spend as much time as possible in DOSBox's native-code DOS implementation, which should give us a performance advantage over DOS-native implementations of MAKE that typically follow the opposite approach.
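The batching idea boils down to grouping translation units by their flags and spawning the compiler once per group. Here is a minimal Python sketch of that grouping; the original system was a C++ program for DOS that enforced a single flag set rather than grouping, so treat the function name and the flag strings used with it as placeholders.

```python
from collections import defaultdict


def batch_by_flags(units):
    """Group (source_file, flags) pairs into one compiler call per flag set.

    Returns a list of (flags, [source files]) tuples in first-seen order.
    """
    batches = defaultdict(list)
    for source, flags in units:
        batches[tuple(flags)].append(source)
    return [(list(flags), sources) for flags, sources in batches.items()]
```

If every unit shares the same flags, this yields a single batch, and the compiler's startup time is paid only once.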

Of course, all this only matters if the system is correct and reliable at its core. Tup teaches us that it's fundamentally impossible to have a reliable generic build system without

  1. augmenting the build graph with all actual files read and written by each invoked build tool, which involves tracing all file-related syscalls, and
  2. persistently serializing the full build graph every time the system runs, allowing later runs to detect every possible kind of change in the build script and rebuild or clean up accordingly.

Unfortunately, the design limitations of my system only allowed half-baked attempts at solving both of these prerequisites:

  1. If your build system is not supposed to be generic and only intended to work with specific tools that emit reliable dependency information, you can replace syscall tracing with a parser for those specific formats. This is what my build system was doing, reading dependency information out of each .OBJ file's OMF COMENT record.
  2. Since DOS command lines are limited to 127 bytes, DOS compilers support reading additional arguments from response files, typically indicated with an @ next to their path on the command line. If we now put every parameter passed to TCC or TLINK into a response file and leave these files on disk afterward, we've effectively serialized all command-line arguments of the entire build into a makeshift database. In later builds, the system can then detect changed command-line arguments by comparing the existing response files from the previous run with the new contents it would write based on the current build structures. This way, we still only recompile the parts of the codebase that are affected by the changed arguments, which is fundamentally impossible with Makefiles.
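The response file trick from the second point can be sketched in a few lines. This is a minimal Python illustration rather than the code of the actual system; the function name and the one-line file layout are assumptions made for this example:

```python
from pathlib import Path


def sync_response_file(path: Path, args: list[str]) -> bool:
    """Writes `args` into the response file at `path` and leaves it on disk.

    Returns True if the file was missing or its previous contents differed,
    i.e., if everything built from these arguments has to be considered stale.
    """
    new_contents = " ".join(args) + "\n"
    try:
        old_contents = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        old_contents = None  # First build, or the "database" entry was deleted
    if old_contents == new_contents:
        return False  # Same arguments as in the previous run
    path.write_text(new_contents, encoding="utf-8")
    return True
```

Because the comparison happens per response file, a changed flag only invalidates the outputs that were built with that particular file.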

But this strategy only covers changes within each binary's compile or link arguments, and ignores the required deletions in "the database" when removing binaries between build runs. This is a non-issue as long as we keep decompiling on master, but as soon as we switch between master and similarly old commits on the debloated/anniversary branches, we can get very confusing errors:

Screenshot of a seemingly weird error in my 16-bit build system that complains about TH01's vector functions being undefined when linking REIIDEN.EXE, shown when switching between the `anniversary` and `master` branches.
The symptom is a calling convention mismatch: The two vector functions use __cdecl on master and pascal on debloated/anniversary. We've switched from anniversary (which compiles to ANNIV.EXE) back to master (which compiles to REIIDEN.EXE) here, so the .obj file on disk still uses the pascal calling convention. The build system, however, only checks the response files associated with the current target binary (REIIDEN.EXE) and therefore assumes that the .obj files still reflect the (unchanged) command-line flags in the TCC response file associated with this binary. And if none of the inputs of these .obj files changed between the two branches, they aren't rebuilt after switching, even though they would need to be.

Apparently, there's also such a thing as "too much batching", because TCC would suddenly stop applying certain compiler optimizations at very specific places if too many files were compiled within a single process? At least you quickly remember which source files you then need to manually touch and recompile to make the binaries match ZUN's original ones again…

But the final nail in the coffin was something I'd notice on every single build: 5 years down the line, even the performance argument wasn't convincing anymore. The strategy of minimizing emulated code still left me with an 𝑂(𝑛) algorithm, and with this entire thing still being single-threaded, there was no force to counteract the dependency check times as they grew linearly with the number of source files.
At P0280, each build run would perform a total of 28,130 file-related DOS syscalls to figure out which source files have changed and need to be rebuilt. At some point, this was bound to become noticeable despite these syscalls being native, not to mention that they're still surrounded by emulator code that must convert their parameters and results to and from the DOS ABI. And with the increasing delays before TCC would do its actual work, the entire thing started feeling increasingly janky.

While this system was waiting to be eventually finished, the public master branch kept using the Makefile that dates back to early 2015. Back then, it didn't take long for me to abandon raw dumb batch files because Make was simply the most straightforward way of ensuring that the build process would abort on the first compile error.
The following years also proved that Makefile syntax is quite well-suited for expressing the build rules of a codebase at this scale. The built-in support for automatically turning long commands into response files was especially helpful because of how naturally it works together with batched compilation. Both of these advantages culminate in this wonderfully arcane incantation of ASCII special characters and syntactically significant linebreaks:

tcc … @&&|
$**
|

Which translates to "take the filenames of all dependents of this explicit rule, write them into a temporary file with an autogenerated name, insert this filename into the tcc … @ command line, and delete the file after the command finished executing". The @ is part of TCC's command-line interface, the rest is all MAKE syntax.
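For readers who don't speak Borland MAKE, here is roughly the same sequence of steps as a Python sketch. The function name and the injectable `run` parameter are assumptions of this example; only the `@` prefix is taken from TCC's interface as described above:

```python
import os
import subprocess
import tempfile


def run_with_response_file(command, dependents, run=subprocess.run):
    """Mimics `command @&&| $** |`: temp file in, command, temp file gone."""
    # Temporary file with an autogenerated name
    fd, path = tempfile.mkstemp(suffix=".rsp", text=True)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(" ".join(dependents))
        # Insert the filename into the command line, prefixed with @
        return run(command + ["@" + path])
    finally:
        # Delete the file once the command has finished, even on errors
        os.remove(path)
```

Passing a different `run` callable makes it possible to inspect the generated command line without having a DOS compiler at hand.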

But 📝 as we all know by now, these surface-level niceties change nothing about Makefiles inherently being unreliable trash due to implementing none of the aforementioned two essential properties of a generic build system. Borland got so close to a correct and reliable implementation of autodependencies, but that would have just covered one of the two properties. Due to this unreliability, the old build16b.bat called Borland's MAKER.EXE with the -B flag, recompiling everything all the time. Not only did this leave modders with a much worse build process than I was using internally, but it also eventually got old for me to merge my internal branch onto master before every delivery. Let's finally rectify that and work towards a single good build process for everyone.


As you would expect by now, I've once again migrated to Tup's Lua syntax. Rewriting it all makes you realize once again how complex the PC-98 Touhou build process is: It has to cover 2 programming languages, 2 pipeline steps, and 3 third-party libraries, and currently generates a total of 39 executables, including the small programs I wrote for research. The final Lua code comprises over 1,300 lines – but then again, if I had written it in 📝 Zig, it would certainly be as long or even longer due to manual memory management. The Tup building blocks I constructed for Shuusou Gyoku quickly turned out to be the wrong abstraction for a project that has no debug builds, but their 📝 basic idea of a branching tree of command-line options remained at the foundation of this script as well.
This rewrite also provided an excellent opportunity for finally dumping all the intermediate compilation outputs into a separate dedicated obj/ subdirectory, finally leaving bin/ nice and clean with only the final executables. I've also merged this new system into most of the public branches of the GitHub repo.

As soon as I first tried to build it all though, I was greeted with a particularly nasty Tup bug. Due to how DOS specified file metadata mutation, MS-DOS Player has to open every file in a way that current Tup treats as a write access… but since unannotated file writes introduce the risk of a malformed build graph if these files are read by another build command later on, Tup providently deletes these files after the command finished executing. And by these files, I mean TCC.EXE as well as every one of its C library header files opened during compilation. :tannedcirno:
Due to a minor unsolved question about a failing test case, my fix has not been merged yet. But even if it was, we're now faced with a problem: If you previously chose to set up Tup for ReC98 or 📝 Shuusou Gyoku and are maybe still running 📝 my 32-bit build from September 2020, running the new build.bat would in fact delete the most important files of your Turbo C++ 4.0J installation, forcing you to reinstall it or restore it from a backup. So what do we do?

The easiest solution, however, is to just put a fixed Tup binary directly into the ReC98 repo. This not only allows me to make Tup mandatory for 64-bit builds, but also cuts out one step in the build environment setup that at least one person previously complained about. :onricdennat: *nix users might not like this idea all too much (or do they?), but then again, TASM32 and the Windows-exclusive MS-DOS Player require Wine anyway. Running Tup through Wine as well means that there's only one PATH to worry about, and you get to take advantage of the tool checks in the surrounding batch file.
If you're one of those people who doesn't trust binaries in Git repos, the repo also links to instructions for building this binary yourself. Replicating this specific optimized binary is slightly more involved than the classic ./configure && make && make install trinity, so having these instructions is a good idea regardless of the fact that Tup's GPL license requires it.

One particularly interesting aspect of the Lua code is the way it handles sprite dependencies:

th04:branch(MODEL_LARGE):link("main", {
	{ "th04_main.asm", extra_inputs = {
		th02_sprites["pellet"],
		th02_sprites["sparks"],
		th04_sprites["pelletbt"],
		th04_sprites["pointnum"],
	} },
	-- …
})

If build commands read from files that were created by other build commands, Tup requires these input dependencies to be spelled out so that it can arrange the build graph and parallelize the build correctly. We could simply put every sprite into a single array and automatically pass that as an extra input to every source file, but that would effectively split the build into a "sprite convert" and "code compile" phase. Spelling out every individual dependency allows such source files to be compiled as soon as possible, before (and in parallel to) the rest of the sprites they don't depend on. Similarly, code files without sprite dependencies can compile before the first sprite got converted, or even before the sprite converter itself got compiled and linked, maximizing the throughput of the overall build process.
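The effect of spelling out individual dependencies can be illustrated with a toy scheduler. This Python sketch assumes unlimited parallel workers and entirely made-up task names and durations; it is not how Tup schedules anything internally:

```python
def finish_times(durations, deps):
    """Earliest finish time of every task if each one starts as soon as all
    of its dependencies are done. Assumes an acyclic dependency graph."""
    done = {}

    def visit(task):
        if task not in done:
            start = max((visit(d) for d in deps.get(task, ())), default=0)
            done[task] = start + durations[task]
        return done[task]

    for task in durations:
        visit(task)
    return done


# Hypothetical durations in arbitrary time units
durations = {"converter": 3, "pellet": 1, "big_sprite": 6, "a.asm": 4, "b.cpp": 5}

# Precise: a.asm only waits for the one sprite it includes, b.cpp for nothing
precise = {"pellet": ["converter"], "big_sprite": ["converter"], "a.asm": ["pellet"]}

# Phased: every source file waits for every sprite
all_sprites = ["pellet", "big_sprite"]
phased = {**precise, "a.asm": all_sprites, "b.cpp": all_sprites}

precise_times = finish_times(durations, precise)
phased_times = finish_times(durations, phased)
```

In this made-up example, both source files finish earlier with precise dependencies because neither of them has to wait for the slow sprite.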

Running a 30-year-old DOS toolchain in a parallel build system also introduces new issues, though. The easiest and recommended way of compiling and linking a program in Turbo C++ is a single tcc invocation:

tcc … main.cpp utils.cpp master.lib

This performs a batched compilation of main.cpp and utils.cpp within a single TCC process, and then launches TLINK to link the resulting .obj files into main.exe, together with the C++ runtime library and any needed objects from master.lib. The linking step works by TCC generating a TLINK command line and writing it into a response file with the fixed name turboc.$ln… which obviously can't work in a parallel build where multiple TCC processes will want to link different executables via the same response file.
Therefore, we have to launch TLINK with a custom response file ourselves. This file is echo'd as a separate parallel build rule, and the Lua code that constructs its contents has to replicate TCC's logic for picking the correct C++ runtime .lib file for the selected memory model.

	-c -s -t c0t.obj obj\th02\zun_res1.obj obj\th02\zun_res2.obj, bin\th02\zun_res.com, obj\th02\zun_res.map, bin\masters.lib emu.lib maths.lib ct.lib
The response file for TH02's ZUN_RES.COM, consisting of the C++ standard library, two files of ZUN code, and master.lib.
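The string formatting behind such a line is simple enough to sketch. The following Python snippet is an illustration and not the Lua code from the repo; the function and table names are made up, and the runtime table only contains the one memory model whose files appear in the example above:

```python
# Startup object and runtime libraries per memory model. Only the tiny model
# is taken from the ZUN_RES.COM example; other models would need their own
# entries, which are not listed here.
RUNTIME = {
    "tiny": ("c0t.obj", ["emu.lib", "maths.lib", "ct.lib"]),
}


def tlink_response(flags, startup_obj, objs, exe, map_file, libs):
    """Joins the four comma-separated groups of a TLINK command line:
    flags and object files, output binary, map file, and libraries."""
    return ", ".join([
        " ".join(flags + [startup_obj] + objs),
        exe,
        map_file,
        " ".join(libs),
    ])


startup_obj, runtime_libs = RUNTIME["tiny"]
line = tlink_response(
    flags=["-c", "-s", "-t"],
    startup_obj=startup_obj,
    objs=[r"obj\th02\zun_res1.obj", r"obj\th02\zun_res2.obj"],
    exe=r"bin\th02\zun_res.com",
    map_file=r"obj\th02\zun_res.map",
    libs=[r"bin\masters.lib"] + runtime_libs,
)
```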

While this does add more string formatting logic, not relying on TCC to launch TLINK actually removes the one possible PATH-related error case I previously documented in the README. Back in 2021 when I first stumbled over the issue, it took a few hours of RE to figure this out. I'd hate for these hours to go to waste, so here's a Gist, and here's the text replicated for SEO reasons:

Issue: TCC compiles, but fails to link, with Unable to execute command 'tlink.exe'

Cause: This happens when invoking TCC as a compiler+linker, without the -c flag. To locate TLINK, TCC needlessly copies the PATH environment variable into a statically allocated 128-byte buffer. It then constructs absolute tlink.exe filenames for each of the semicolon- or \0-terminated paths, writing these into a buffer that immediately follows the 128-byte PATH buffer in memory. The search is finished as soon as TCC finds an existing file, which gives precedence to earlier paths in the PATH. If the search didn't complete until a potential "final" path that runs past the 128 bytes, the final attempted filename will consist of the part that still managed to fit into the buffer, followed by the previously attempted path.

Workaround: Make sure that the BIN\ path to Turbo C++ is fully contained within the first 127 bytes of the PATH inside your DOS system. (The 128th byte must either be a separating ; or the terminating \0 of the PATH string.)
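If you'd rather check this condition than count bytes by hand, the workaround boils down to a small offset calculation. This is a Python sketch that assumes a pure ASCII PATH (so that characters equal bytes); the function name and the example directory are hypothetical:

```python
def tlink_dir_is_reachable(path_var: str, bin_dir: str, limit: int = 127) -> bool:
    """Returns True if `bin_dir` appears in `path_var` in a way that keeps it
    fully within the first `limit` bytes, with its terminator no later than
    at byte `limit + 1`."""
    wanted = bin_dir.rstrip("\\").lower()
    offset = 0
    for entry in path_var.split(";"):
        end = offset + len(entry)  # Offset of the ; or \0 after this entry
        if entry.rstrip("\\").lower() == wanted and end <= limit:
            return True
        offset = end + 1
    return False


# Hypothetical installation directory
assert tlink_dir_is_reachable(r"C:\TC\BIN;C:\DOS", r"C:\TC\BIN")
```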

Now that DOS emulation is an integral component of the single-part build process, it even makes sense to compile our pipeline tools as 16-bit DOS executables and then emulate them as part of the build. Sure, it's technically slower, but realistically it doesn't matter: Our only current pipeline tools are 📝 the converter for hardcoded sprites and the 📝 ZUN.COM generators, both of which involve very little code and are rarely run during regular development after the initial full build. In return, we get to drop that awkward dependency on the separate Borland C++ 5.5 compiler for Windows and yet another additional manual setup step. 🗑️ Once PC-98 Touhou becomes portable, we're probably going to require a modern compiler anyway, so you can now delete that one as well.

That gives us perfect dependency tracking and minimal parallel rebuilds across the whole codebase! While MS-DOS Player is noticeably slower than DOSBox-X, it's not going to matter all too much; unless you change one of the more central header files, you're rarely if ever going to cause a full rebuild. Then again, given that I'm going to use this setup for at least a couple of years, it's worth taking a closer look at why exactly the compilation performance is so underwhelming …


On the surface, MS-DOS Player seems like the right tool for our job, with a lot of advantages over DOSBox:

But once I began integrating it, I quickly noticed two glaring flaws:

Granted, even the DOSBox-X performance is much slower than we would like it to be. Most of it can be blamed on the awkward time in the early-to-mid-90s when Turbo C++ 4.0J came out. This was the time when DOS applications had long grown past the limitations of the x86 Real Mode and required DOS extenders or even sillier hacks to actually use all the RAM in a typical system of that period, but Win32 didn't exist yet to put developers out of this misery. As such, this compiler not only requires at least a 386 CPU, but also brings its own DOS extender (DPMI16BI.OVL) plus a loader for said extender (RTM.EXE), both of which need to be emulated alongside the compiler, to the great annoyance of emulator maintainers 30 years later. Even MS-DOS Player's README file notes how Protected Mode adds a lot of complexity and slowdown:

8086 binaries are much faster than 80286/80386/80486/Pentium4/IA32 binaries. If you don't need the protected mode or new mnemonics added after 80286, I recommend i86_x86 or i86_x64 binary.

The immediate reaction to these performance numbers is obvious: Let's just put DOSBox-X's dynamic recompiler into MS-DOS Player, right?! 🙌 Except that once you look at DOSBox-X, you immediately get why Takeda Toshiya might have preferred to start from scratch. Its codebase is a historically grown tangled mess, requiring intimate familiarity and a significant engineering effort to isolate the dynamic core in the first place. I did spend a few days trying to untangle and copy it all over into MS-DOS Player… only to be greeted with an infinite loop as soon as everything compiled for the first time. 😶 Yeah, no, that's bound to turn into a budget-exceeding maintenance nightmare.

Instead, let's look at squeezing at least some additional performance out of what we already have. A generic emulator for the entire CISCy instruction set of the 80386, with complete support for Protected Mode, but it's only supposed to run the subset of instructions and features used by a specific compiler and linker as fast as possible… wait a moment, that sounds like a use case for profile-guided optimization! This is the first time I've encountered a situation that would justify the required 2-phase build process and lengthy profile collection – after all, writing into some sort of database for every function call does slow down MS-DOS Player by roughly 15×. However, profiling just the compilation of our most complex translation unit (📝 TH01 YuugenMagan) and the linking of our largest executable (TH01's REIIDEN.EXE) should be representative enough.
I'll get to the performance numbers later, but even the build output is quite intriguing. Based on this profile, Visual Studio chooses to optimize only 104 out of MS-DOS Player's 1976 functions for speed and the rest for size, shaving off a nice 109 KiB from the binary. Presumably, keeping rare code small is also considered kind of fast these days because it takes up less space in your CPU's instruction cache once it does get executed?

With PGO as our foundation, let's run a performance profile and see if there are any further code-level optimizations worth trying out:

So, what do we get?

| MS-DOS Player build | Full build (Pipeline + 5 games + research code), Generic | Full build, PGO | Median translation unit + median link, Generic | Median translation unit + median link, PGO | 📝 YuugenMagan compile + link, Generic | 📝 YuugenMagan compile + link, PGO |
|---|---|---|---|---|---|---|
| MAME x86 core | 46.522s / 50.854s | 32.162s / 34.885s | 1.346s / 1.429s | 0.966s / 0.963s | 6.975s / 7.155s | 4.024s / 3.981s |
| NP21/W core, before optimizations | 34.620s / 36.151s | 30.218s / 31.318s | 1.031s / 1.065s | 0.885s / 0.916s | 5.294s / 5.330s | 4.260s / 4.299s |
| No initial memset() | 31.886s / 34.398s | 27.151s / 29.184s | 0.945s / 1.009s | 0.802s / 0.852s | 5.094s / 5.266s | 4.104s / 4.190s |
| Limited instructions | 32.404s / 34.276s | 26.602s / 27.833s | 0.963s / 1.001s | 0.783s / 0.819s | 5.086s / 5.182s | 3.886s / 3.987s |
| No paging | 29.836s / 31.646s | 25.124s / 26.356s | 0.865s / 0.918s | 0.748s / 0.769s | 4.611s / 4.717s | 3.500s / 3.572s |
| No cycle counting | 25.407s / 26.691s | 21.461s / 22.599s | 0.735s / 0.752s | 0.617s / 0.625s | 3.747s / 3.868s | 2.873s / 2.979s |
| 2024-06-27 build | 26.297s / 27.629s | 21.014s / 22.143s | 0.771s / 0.779s | 0.612s / 0.632s | 4.372s / 4.506s | 3.253s / 3.272s |
| Risky optimizations | 23.168s / 24.193s | 20.711s / 21.782s | 0.658s / 0.663s | 0.582s / 0.603s | 3.269s / 3.414s | 2.823s / 2.805s |
Measured on a 6-year-old 6-core Intel Core i5 8400T on Windows 11. The first number in each column represents the codebase before the #include cleanup explained below, and the second one corresponds to this commit. All builds are 64-bit, 32-bit builds were ≈5% slower across the board. I kept the fastest run within three attempts; as Tup parallelizes the build process across all CPU cores, it's common for the long-running full build to take up to a few seconds longer depending on what else is running on your system. Tup's standard output is also redirected to a file here; its regular terminal output and nice progress bar will add more slowdown on top.

The key takeaways:

But how does this compare to DOSBox-X's dynamic core? Dynamic recompilers need some kind of cache to ensure that every block of original ASM gets recompiled only once, which gives them an advantage in long-running processes after the initial warmup. As a result, DOSBox-X compiles and links YuugenMagan in , ≈92% faster than even our optimized MS-DOS Player build. That percentage resembles the slowdown we were initially getting when comparing full rebuilds between DOSBox-X and MS-DOS Player, as if we hadn't optimized anything.
On paper, this would mean that DOSBox-X barely lost any of its huge advantage when it comes to single-threaded compile+link performance. In practice, though, this metric is supposed to measure a typical decompilation or modding workflow that focuses on repeatedly editing a single file. Thus, a more appropriate comparison would also have to add the aforementioned constant 28,130 syscalls that my old build system required to detect that this is the one file/binary that needs to be recompiled/relinked. The video at the top of this blog post happens to capture the best time () I got for the detection process on DOSBox-X. This is almost as slow as the compilation and linking itself, and would have only gotten slower as we continue decompiling the rest of the games. Tup, on the other hand, performs its filesystem scan in a near-constant , matching the claim in Section 4.7 of its paper, and thus shrinking the performance difference to ≈14% after all. Sure, merging the dynamic core would have been even better (contribution-ideas, anyone?), but this is good enough for now.
Just like with Tup, I've also placed this optimized binary directly into the ReC98 repo and added the specific build instructions to the GitHub release page.

I do have more far-reaching ideas for further optimizing Neko Project 21/W's x86 core for this specific case of repeated switches between Real Mode and Protected Mode while still retaining the interpreted nature of this core, but these already strained the budget enough.
The perhaps more important remaining bottleneck, however, is hiding in the actual DOS emulation. Right now, a Tup-driven full rebuild spawns a total of 361 MS-DOS Player processes, which means that we're booting an emulated DOS 361 times. This isn't as bad as it sounds, as "booting DOS" basically just involves initializing a bunch of internal DOS structures in conventional memory to meaningful values. However, these structures also include a few environment variables like PATH, APPEND, or TEMP/TMP, which MS-DOS Player seamlessly integrates by translating them from their value on the Windows host system to the DOS 8.3 format. This could be one of the main reasons why MS-DOS Player is a native Windows program rather than being cross-platform:

However, the NT kernel doesn't actually use drive letters either, and views them as just a legacy abstraction over its reality of volume GUIDs. Converting paths back and forth between these two views therefore requires it to communicate with a mount point manager service, which can coincidentally also be observed in debug builds of Tup.
As a result, calling any path-retrieving API is a surprisingly expensive operation on modern Windows. When running a small sprite through our 📝 sprite converter, MS-DOS Player's boot process makes up 56% of the runtime, with 64% of that boot time (or 36% of the entire runtime) being spent on path translation. The actual x86 emulation to run the program only takes up 6.5% of the runtime, with the remaining 37.5% spent on initializing the multithreaded C++ runtime.

But then again, the truly optimal solution would not involve MS-DOS Player at all. If you followed general video game hacking news in May, you'll probably remember the N64 community putting the concept of statically recompiled game ports on the map. In case you're wondering where this seemingly sudden innovation came from and whether a reverse-engineered decompilation project like ReC98 is obsolete now, I wrote a new FAQ entry about why this hype, although justified, is at least in part misguided. tl;dr: None of this can be meaningfully applied to PC-98 games at the moment.
On the other hand, recompiling our compiler would not only be a reasonable thing to attempt, but exactly the kind of problem that recompilation solves best. A 16-bit command-line tool has none of the pesky hardware factors that drag down the usefulness of recompilations when it comes to game ports, and a recompiled port could run even faster than it would on 32-bit Windows. Sure, it's not as flashy as a recompiled game, but if we got a few generous backers, it would still be a great investment into improving the state of static x86 recompilation by simply having another open-source project in that space. Not to mention that it would be a great foundation for improving Turbo C++ 4.0J's code generation and optimizations, which would allow us to simplify lots of awkward pieces of ZUN code… 🤩


That takes care of building ReC98 on 64-bit platforms, but what about the 32-bit ones we used to support? The previous split of the build process into a Tup-driven 32-bit part and a Makefile-driven 16-bit part sure was awkward and I'm glad it's gone, but it did give you the choice between 1) emulating the 16-bit part or 2) running both parts natively on 32-bit Windows. While Tup's upstream Windows builds are 64-bit-only, it made sense to 📝 compile a custom 32-bit version and thus turn any 32-bit Windows ≥Vista into the perfect build platform for ReC98. Older Windows versions that can't run Tup had to build the 32-bit part using a separately maintained dumb batch script created by tup generate, but again, due to Make being trash, they were fully rebuilding the entire codebase every time anyway.
Driving the entire build via Tup changes all of that. Now, it makes little sense to continue using 32-bit Tup:

This means that we could now only support 32-bit Windows via an even larger tup generated batch file. We'd have to move the MS-DOS Player prefix of the respective command lines into an environment variable to make Tup use the same rules for both itself and the batch file, but the result seems to work…

…but it's really slow, especially on Windows 9x. 🐌 If we look back at the theory behind my previous custom build system, we can already tell why: Efficiently building ReC98 requires a completely different approach depending on whether you're running a typical modern multi-core 64-bit system or a vintage single-core 32-bit system. On the former, you'd want to parallelize the slow emulation as much as you can, so you maximize the number of TCC processes to keep all CPU cores as busy as possible. But on the latter, you'd want the exact opposite – there, the biggest annoyance is the repeated startup and shutdown of the VDM, TCC, and its DOS extender, so you want to continue batching translation units into as few TCC processes as possible.

CMake fans will probably feel vindicated now, thinking "that sounds exactly like you need a meta build system 🤪". Leaving aside the fact that the output vomited by all of CMake's Makefile generators is a disgusting monstrosity that's far removed from addressing any performance concerns, we sure could solve this problem by adding another layer of abstraction. But then, I'd have to rewrite my working Lua script into either C++ or (heaven forbid) Batch, which are the only options we'd have for bootstrapping without adding any further dependencies, and I really wouldn't want to do that. Alternatively, we could fork Tup and modify tup generate to rewrite the low-level build rules that end up in Tup's database.
But why should we go for any of these if the Lua script already describes the build in a high-level declarative way? The most appropriate place for transforming the build rules is the Lua script itself…

… if there wasn't the slight problem of Tup forbidding file writes from Lua. 🥲 Presumably, this limitation exists because there is no way of replicating these writes in a tup generated dumb shell script, and it does make sense from that point of view.
But wait, printing to stdout or stderr works, and we always invoke Tup from a batch file anyway. You can now tell where this is going. :tannedcirno: Hey, exfiltrating commands from a build script to the build system via standard I/O streams works for Rust's Cargo too!

Just like Cargo, we want to add a sufficiently unique prefix to every line of the generated batch script to distinguish it from Tup's other output. Since Tup only reruns the Lua script – and would therefore print the batch file – if the script changed between the previous and current build run, we only want to overwrite the batch file if we got one or more lines. Getting all of this to work wasn't all too easy; we're once again entering the more awful parts of Batch syntax here, which apparently are so terrible that Wine doesn't even bother to correctly implement parts of it. 😩
Most importantly, we don't really want to redirect any of Tup's standard I/O streams. Redirecting stdout disables console output coloring and the pretty progress bar at the bottom, and looping over stderr instead of stdout in Batch is incredibly awkward. Ideally, we'd run a second Tup process with a sub-command that would just evaluate the Lua script if it changed – and fortunately, tup parse does exactly that. 😌
In the end, the optimally fast and ERRORLEVEL-preserving solution involves two temporary files. But since creating files between two Tup runs causes it to reparse the Lua code, which would print the batch file to the unfiltered stdout, we have to hide these temporary files from Tup by placing them into its .tup/ database directory. 🤪
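Stripped of all Batch awkwardness, the filtering logic amounts to the following. This Python sketch only shows the idea and is not the Batch code that ended up in the repo; the prefix string and the function name are made up for this example:

```python
PREFIX = "BATCH| "  # Hypothetical marker; any sufficiently unique string works


def update_batch_file(tool_output: str, batch_path) -> bool:
    """Extracts all prefixed lines from `tool_output` and writes them to
    `batch_path`. Leaves the existing file alone if no such line was found,
    which is the case whenever the build script was not re-evaluated."""
    lines = [
        line[len(PREFIX):]
        for line in tool_output.splitlines()
        if line.startswith(PREFIX)
    ]
    if not lines:
        return False
    # Batch files want CRLF line endings
    with open(batch_path, "w", newline="\r\n") as f:
        f.write("\n".join(lines) + "\n")
    return True
```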

On a more positive note, programmatically generating batches from single-file TCC rules turned out to be a great idea. Since the Lua code maps command-line flags to arrays of input files, it can also batch across binaries, surpassing my old system in this regard. This works especially well on the debloated and anniversary branches, which replace ZUN's little command-line flag inconsistencies with a single set of good optimization flags that every translation unit is compiled with.
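The grouping itself is a plain map from flag sets to file lists. Here's a minimal Python sketch of that idea; the flag sets and file names are purely illustrative and don't correspond to actual ReC98 translation units:

```python
from collections import defaultdict


def batch_by_flags(rules):
    """Turns (flags, source) single-file rules into one batch per distinct
    set of flags, regardless of which binary a source file belongs to."""
    batches = defaultdict(list)
    for flags, source in rules:
        batches[tuple(flags)].append(source)
    return dict(batches)


# Illustrative rules from two different binaries
rules = [
    (["-ml"], "th01/a.cpp"),
    (["-ms"], "th02/b.cpp"),
    (["-ml"], "th02/c.cpp"),
    (["-ml"], "th01/d.cpp"),
]
batches = batch_by_flags(rules)
```

Each resulting batch can then be turned into a single compiler invocation, so four rules collapse into two processes here.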

Time to fire up some VMs then… only to see the build failing on Windows 9x with multiple unhelpful Bad command or file name errors. Clearly, the long echo lines that write our response files run up against some length limit in command.com and need to be split into multiple ones. Windows 9x's limit is larger than the 127 characters of DOS, that's for sure, and the exact number should just be one search away…
…except that it's not the 1024 characters recounted in a surviving newsgroup post. Sure, lines are truncated to 1023 bytes and that off-by-one error is no big deal in this context, but that's not the whole story:

: This not unrealistic command line is 137 bytes long and fails on Windows 9x?!
> echo -DA=1 2 3 a/b/c/d/1 a/b/c/d/2 a/b/c/d/3 a/b/c/d/4 a/b/c/d/5 a/b/c/d/6 a/b/c/d/7 a/b/c/d/8 a/b/c/d/9 a/b/c/d/10 a/b/c/d/11 a/b/c/d/12
Bad command or file name

Wait, what, something about / being the SWITCHAR? And not even just that…

: Down to 132 bytes… and 32 "assignments"?
> echo a=0 b=1 c=2 d=3 e=4 f=5 g=6 h=7 i=8 j=9 k=0 l=1 m=2 n=3 o=4 p=5 q=6 r=7 s=8 t=9 u=0 v=1 w=2 x=3 y=4 z=5 a=0 b=1 c=2 d=3 e=4 f=5
Bad command or file name

And what's perhaps the worst example:

: 64 slashes. Works on DOS, works on `cmd.exe`, fails on 9x.
> echo ////////////////////////////////////////////////////////////////
Bad command or file name

My complete set of test cases: 2024-07-09-Win9x-batch-tokenizer-tests.bat

So, time to load command.com into DOSBox-X's debugger and step through some code. 🤷 The earliest NT-based Windows versions were ported to a variety of CPUs and therefore received the then-all-new cmd.exe shell written in C, whereas Windows 9x's command.com was still built on top of the dense hand-written ASM code that originated in the very first DOS versions. Fortunately though, Microsoft open-sourced one of the later DOS versions in April. This made it somewhat easier to cross-reference the disassembly even though the Windows 9x version significantly diverged in the parts we're interested in.
And indeed: After truncating to 1023 bytes and parsing out any redirectors, each line is split into tokens around whitespace and = signs and before every occurrence of the SWITCHAR. These tokens are written into a statically allocated 64-element array, and once the code tries to write the 65th element, we get the Bad command or file name error instead.

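| # | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| String | echo | -DA | 1 | 2 | 3 | a | /B | /C | /D | /1 | a | /B | /C | /D | /2 |
| Switch flag | | | | | | | 🚩 | 🚩 | 🚩 | 🚩 | | 🚩 | 🚩 | 🚩 | 🚩 |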
The first few elements of command.com's internal argument array after calling the Windows 9x equivalent of parseline with my initial example string. Note how all the "switches" got capitalized and annotated with a flag, whereas the = sign no longer appears in either string or flag form.
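The described behavior is easy to model. The following Python sketch is a simplified reimplementation based purely on the description above, not a translation of command.com's code: It ignores the earlier truncation and redirector parsing, and any separators beyond whitespace and = are not modeled.

```python
MAX_TOKENS = 64  # Size of the statically allocated array
SWITCHAR = "/"


def tokenize(line: str):
    """Splits `line` around whitespace and = signs, and before every
    SWITCHAR. Returns (text, is_switch) tuples, with switches capitalized."""
    tokens = []
    current = ""

    def flush():
        nonlocal current
        if current:
            is_switch = current.startswith(SWITCHAR)
            tokens.append((current.upper() if is_switch else current, is_switch))
        current = ""

    for ch in line:
        if ch.isspace() or ch == "=":
            flush()  # Separators end a token and are dropped
        elif ch == SWITCHAR:
            flush()  # A switch character starts a new token...
            current = ch  # ...and remains part of it
        else:
            current += ch
    flush()
    return tokens


def fits(line: str) -> bool:
    """False if the line would require a 65th array element."""
    return len(tokenize(line)) <= MAX_TOKENS
```

Under this model, all three failing examples come out at exactly 65 tokens, with the leading echo counting as the first one.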

Needless to say, this makes no sense. Both DOS and Windows pass command lines as a single string to newly created processes, and since this tokenization is lossy, command.com will just have to pass the original string anyway. If your shell wants to handle tokenization at a central place, it should happen after it decided that the command matches a builtin that can actually make use of a pointer to the resulting token array – or better yet, as the first call of each builtin's code. Doing it before is patently ridiculous.
I don't know what's worse – the fact that Windows 9x blindly grinds each batch line through this tokenizer, or the fact that no documentation of this behavior has survived on today's Internet, if any even ever existed. The closest thing I found was this page that doesn't exist anymore, and it also just contains a mere hint rather than a clear description of the issue. Even the usual Batch experts who document everything else seem to have a blind spot when it comes to this specific issue. As do emulators: DOSBox and FreeDOS only reimplement the sane DOS versions of command.com, and Wine only reimplements cmd.exe.

Oh well. 71 lines of Lua later, the resulting batch file does in fact work everywhere:

The clear performance winner at 11.15 seconds after the initial tool check, though sadly bottlenecked by strangely long TASM32 startup times. As for TCC though, even this performance is the slowest a recompiled port would be. Modern compiler optimizations are probably going to shave off another second or two, and implementing support for #pragma once into the recompiled code will get us the aforementioned 5% on top.
If you run this on VirtualBox on modern Windows, make sure to disable Hyper-V to avoid the slower snail execution mode. 🐢
Building in Windows XP under Hyper-V exchanges Windows 98's slow TASM32 startup times for slightly slower DOS performance, resulting in a still decent 13.4 seconds.
29.5 seconds?! Surely something is getting emulated here. And this is the best time I randomly got; my initial preview recording took 55 seconds which is closer to DOSBox-X's dynamic core than it is to Windows 9x. Given how poorly 32-bit Windows 10 performs, Microsoft should have probably discontinued 32-bit Windows after 8 already. If any 16-bit program you could possibly want to run is either too slow or likely to exhibit other compatibility issues (📝 Shuusou Gyoku, anyone?), the existence of 32-bit Windows 10 is nothing but a maintenance burden. Especially because Windows 10 simultaneously overhauled the console subsystem, which is bound to cause compatibility issues anyway. It sure did for me back in 2019 when I tried to get my build system to work…

But wait, there's more! The codebase now compiles on all 32-bit Windows systems I've tested, and yields binaries that are equivalent to ZUN's… except on 32-bit Windows 10. 🙄 Suddenly, we're facing the exact same batched compilation bug from my custom build system again, with REIIDEN.EXE being 16 bytes larger than it's supposed to be.
Looks like I have to look into that issue after all, but figuring out the exact cause by debugging TCC would take ages again. Thankfully, trial and error quickly revealed a functioning workaround: Separating translation unit filenames in the response file with two spaces rather than one. Really, I couldn't make this up. This is the most ridiculous workaround for a bug I've encountered in a long time.

echo -c  -I.  -O  -b-  -3  -Z  -d  -DGAME=4  -ml  -nobj/th04/  th04/op_main.cpp  th04/input_w.cpp  th04/vector.cpp  th04/snd_pmdr.c  th04/snd_mmdr.c  th04/snd_kaja.cpp  th04/snd_mode.cpp  th04/snd_dlym.cpp  th04/snd_load.cpp  th04/exit.cpp  th04/initop.cpp  th04/cdg_p_na.cpp  th04/snd_se.cpp  th04/egcrect.cpp  th04/bgimage.cpp  th04/op_setup.cpp  th04/zunsoft.cpp  th04/op_music.cpp  th04/m_char.cpp  th04/slowdown.cpp  th04/demo.cpp  th04/ems.cpp  th04/tile_set.cpp  th04/std.cpp  th04/tile.cpp>obj\batch014.@c
echo th04/playfld.cpp  th04/midboss4.cpp  th04/f_dialog.cpp  th04/dialog.cpp  th04/boss_exp.cpp  th04/stages.cpp  th04/player_m.cpp  th04/player_p.cpp  th04/hud_ovrl.cpp  th04/cfg_lres.cpp  th04/checkerb.cpp  th04/mb_inv.cpp  th04/boss_bd.cpp  th04/mpn_free.cpp  th04/mpn_l_i.cpp  th04/initmain.cpp  th04/gather.cpp  th04/scrolly3.cpp  th04/midboss.cpp  th04/hud_hp.cpp  th04/mb_dft.cpp  th04/grcg_3.cpp  th04/it_spl_u.cpp  th04/boss_4m.cpp  th04/bullet_u.cpp  th04/bullet_a.cpp  th04/boss.cpp  th04/boss_4r.cpp  th04/boss_x2.cpp  th04/maine_e.cpp  th04/cutscene.cpp>>obj\batch014.@c
echo th04/staff.cpp>>obj\batch014.@c
The TCC response file generation code for all current decompiled TH04 code, split into multiple echo calls based on the Windows 9x batch tokenizer rules and with double spaces between each parameter for added "safety". Would this also have been the solution for the batched compilation bugs I was experiencing with my old build system in DOSBox? I suddenly was unable to reproduce these bugs, so we won't know for the time being…
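The actual generator is written in Lua and not shown here, so the following C++ sketch is only a guess at the general idea: Count how many tokenizer array elements each argument would take up, and start a new echo line before the 64-element limit is exceeded. `token_count_sketch()` and `echo_lines_sketch()` are hypothetical names, and the 1023-byte line limit is deliberately ignored. Counting this way, the first echo line above comes out at 64 tokens, which at least matches the limit.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of tokenizer array elements a single whitespace-free argument
// would take up: `=` ends a token, and every `/` starts a new one.
std::size_t token_count_sketch(const std::string& arg)
{
	std::size_t count = 0;
	bool in_token = false;
	for(const char c : arg) {
		if(c == '=') {
			in_token = false;
		} else if((c == '/') || !in_token) {
			count++;
			in_token = true;
		}
	}
	return count;
}

// Packs [args] into `echo` lines that each stay within 64 tokens, and
// separates arguments with two spaces.
std::vector<std::string> echo_lines_sketch(
	const std::vector<std::string>& args, const std::string& response_fn
)
{
	constexpr std::size_t MAX_TOKENS = 64;
	std::vector<std::string> lines;
	std::string line = "echo";
	std::size_t tokens = 1; // The `echo` itself takes up one element.
	bool line_has_args = false;

	auto flush = [&]() {
		// `>` creates the response file, `>>` appends to it.
		lines.push_back(line + (lines.empty() ? ">" : ">>") + response_fn);
		line = "echo";
		tokens = 1;
		line_has_args = false;
	};
	for(const std::string& arg : args) {
		const std::size_t arg_tokens = token_count_sketch(arg);
		if(line_has_args && ((tokens + arg_tokens) > MAX_TOKENS)) {
			flush();
		}
		line += (line_has_args ? "  " : " ");
		line += arg;
		tokens += arg_tokens;
		line_has_args = true;
	}
	if(line_has_args) {
		flush();
	}
	return lines;
}
```

Every `dir/file.cpp` argument costs two elements under this model, which would explain why the lines above end up so much shorter than the 1023-byte limit alone would allow.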

Hopefully, you've now got the impression that supporting any kind of 32-bit Windows build is way more of a liability than an asset these days, at least for this specific project. "Real hardware", "motivating a TCC recompilation", and "not dropping previous features" really were the only reasons for putting up with the sheer jank and testing effort I had to go through. And I wouldn't even be surprised if real-hardware developers told me that the first reason doesn't actually hold up because compiling ReC98 on actual PC-98 hardware is slow enough that they'd rather compile it on their main machine and then transfer the binaries over some kind of network connection. :onricdennat:
I guess it also made for some mildly interesting blog content, but this was definitely the last time I bothered with such a wide variety of Windows versions without being explicitly funded to do so. If I ever get to recompile TCC, it will be 64-bit only by default as well.

Instead, let's have a tier list of supported build platforms that clearly defines what I am maintaining, with just the most convincing 32-bit Windows version in Tier 1. Initially, that was supposed to be Windows 98 SE due to its superior performance, but that's just unreasonable if key parts of the OS remain undocumented and make no sense. So, XP it is.
*nix fans will probably once again be disappointed to see their preferred OS in Tier 2. But at least, all we'd need for that to move up to Tier 1 is a CI configuration, contributed either via funding me or sending a PR. (Look, even more contribution-ideas!)
Getting rid of the Wine requirement for a fully cross-platform build process wouldn't be too unrealistic either, but would require us to make a few quality decisions, as usual:

Y'know what I think would be the best idea for right now, though? Savoring this new build system and spending an extended amount of time doing actual decompilation or modding for a change. :tannedcirno:


Now that even full rebuilds are decently fast, let's make use of that productivity boost by doing some urgent and far-reaching code cleanup that touches almost every single C++ source file. The most immediately annoying quirk of this codebase was the silly way each translation unit #included the headers it needed. Many years ago, I measured that repeatedly including the same header did significantly impact Turbo C++ 4.0J's compilation times, regardless of any include guards inside. As a consequence of this discovery, I slightly overreacted and decided to just not use any include guards, ever. After all, this emulated build process is slow enough, and we don't want it to needlessly slow down even more! :onricdennat: This way, redundantly including any file that adds more than just a few #define macros won't even compile, throwing lots of Multiple definition errors.
Consequently, the headers themselves #included almost nothing. Starting a new translation unit therefore always involved figuring out and spelling out the transitive dependencies of the headers the new unit actually wants to use, in a short trial-and-error process. While not too bad by itself, this was bound to become quite counterproductive once we get closer to porting these games: If some inlined function in a header needed access to, let's say, PC-98-specific I/O ports as an implementation detail, the header would have externalized this dependency to the top-level translation unit, which in turn made that unit appear to contain PC-98-native code even if the unit's code itself was perfectly portable.

But once we start making some of these implicit transitive dependencies optional, it all stops being justifiable. Sometimes, a.hpp declared things that required declarations from b.hpp but these things are used so rarely that it didn't justify adding #include "b.hpp" to all translation units that #include "a.hpp". So how about conditionally declaring these things based on previously #included headers? :tannedcirno:

#if (defined(SUBPIXEL_HPP) && defined(PLANAR_H))
	// Sets the [tile_ring] tile at (x, y) to the given VRAM offset.
	void tile_ring_set_vo(subpixel_t x, subpixel_t y, vram_offset_t image_vo);
#endif
You can maybe do this in a project that consistently sorts the #include lists in every translation unit… err, no, don't do this, ever, it's awful. Just separate that declaration out into another header.

Now that we've measured that the sane alternative of include guards comes with a performance cost of just 5% and we've further reduced its effective impact by parallelizing the build, it's worth it to take that cost in exchange for a tidy codebase without such surprises. From now on, every header file will #include its own dependencies and be a valid translation unit that must compile on its own without errors. In turn, this allows us to remove at least 1,000 #includes of transitive dependencies from .cpp files. 🗑️
However, that 5% number was only measured after I reduced these redundant #includes to their absolute minimum. So it still makes sense to only add include guards where they are absolutely necessary – i.e., transitively dependent headers included from more than one other file – and continue to (ab)use the Multiple definition compiler errors as a way of communicating "you're probably #including too many headers, try removing a few". Certainly a less annoying error than Undefined symbol.


Since all of this went way over the 7-push mark, we've got some small bits of RE and PI work to round it all out. The .REC loader in TH04 and TH05 is completely unremarkable, but I've got at least a bit to say about TH02's High Score menu. I already decompiled MAINE.EXE's post-Staff Roll variant in 2015, so we were only missing the almost identical MAIN.EXE variant shown after a Game Over or when quitting out of the game. The two variants are similar enough that it mostly needed just a small bit of work to bring my old 2015 code up to current standards, and allowed me to quickly push TH02 over the 40% RE mark.
Functionally, the two variants only differ in two assignments, but ZUN once again chose to copy-paste the entire code to handle them. :zunpet: This was one of ZUN's better copy-pasting jobs though – and honestly, I can't even imagine how you would mess up a menu that's entirely rendered on the PC-98's text RAM. It almost makes you wonder whether ZUN actually used the same #if ENDING preprocessor branching that my decompilation uses… until the visual inconsistencies in the alignment of the place numbers and the POINT and ST labels clearly give it away as copy-pasted:

Screenshot of TH02's High Score screen as seen in MAIN.EXE when quitting out of the game, with scores initialized to show off the maximum number of digits and the incorrect alignment of the POINT and ST headers
Screenshot of TH02's High Score screen as seen in MAINE.EXE when entering a new high score after the Staff Roll, with scores initialized to show off the maximum number of digits and the incorrect alignment of the POINT header

Next up: Starting the big Seihou summer! Fortunately, waiting two more months was worth it: In mid-June, Microsoft released a preview version of Visual Studio that, in response to my bug report, finally, finally makes C++ standard library modules fully usable. Let's clean up that codebase for real, and put this game into a window.

📝 Posted:
💰 Funded by:
[Anonymous]
🏷️ Tags:

Technical debt, part 10… in which two of the PMD-related functions came with such complex ramifications that they required one full push after all, leaving no room for the additional decompilations I wanted to do. At least, this did end up being the final one, completing all SHARED segments for the time being.


The first one of these functions determines the BGM and sound effect modes, combining the resident type of the PMD driver with the Option menu setting. The TH04 and TH05 version is apparently coded quite smartly, as PC-98 Touhou only needs to distinguish "OPN- / PC-9801-26K-compatible sound sources handled by PMD.COM" from "everything else", since all other PMD varieties are OPNA- / PC-9801-86-compatible.
Therefore, I only documented those two results returned from PMD's AH=09h function. I'll leave a comprehensive, fully documented enum to interested contributors, since that would involve research into basically the entire history of the PC-9800 series, and even the clearly out-of-scope PC-88VA. After all, distinguishing between more versions of the PMD driver in the Option menu (and adding new sprites for them!) is strictly mod territory.


The honor of being the final decompiled function in any SHARED segment went to TH04's snd_load(). TH04 contains by far the sanest version of this function: Readable C code, no new ZUN bugs (and still missing file I/O error handling, of course)… but wait, what about that actual file read syscall, using the INT 21h, AH=3Fh DOS file read API? Reading up to a hardcoded number of bytes into PMD's or MMD's song or sound effect buffer, 20 KiB in TH02-TH04, 64 KiB in TH05… that's kind of weird. About time we looked closer into this. :thonk:

Turns out that no, KAJA's driver doesn't give you the full 64 KiB of one memory segment for these, as especially TH05's code might suggest to anyone unfamiliar with these drivers. :zunpet: Instead, you can customize the size of these buffers on its command line. In GAME.BAT, ZUN allocates 8 KiB for FM songs, 2 KiB for sound effects, and 12 KiB for MMD files in TH02… which means that the hardcoded sizes in snd_load() are completely wrong, no matter how you look at them. :onricdennat: Consequently, this read syscall will overflow PMD's or MMD's song or sound effect buffer if the given file is larger than the respective buffer size.
Now, ZUN could have simply hardcoded the sizes from GAME.BAT instead, and it would have been fine. As it also turns out though, PMD has an API function (AH=22h) to retrieve the actual buffer sizes, provided for exactly that purpose. There is little excuse not to use it, as it also gives you PMD's default sizes if you don't specify any yourself.
(Unless your build process enumerates all PMD files that are part of the game, and bakes the largest size into both snd_load() and GAME.BAT. That would even work with MMD, which doesn't have an equivalent for AH=22h.)

What'd be the consequence of loading a larger file then? Well, since we don't get a full segment, let's look at the theoretical limit first.
PMD prefers to keep both its driver code and the data buffers in a single memory segment. As a result, the limit for the combined size of the song, instrument, and sound effect buffer is determined by the amount of code in the driver itself. In PMD86 version 4.8o (bundled with TH04 and TH05) for example, the remaining size for these buffers is exactly 45,555 bytes. Being an actually good programmer who doesn't blindly trust user input, KAJA thankfully validates the sizes given via the /M, /V, and /E command-line options before letting the driver reside in memory, and shuts down with an error message if they exceed 40 KiB. Would have been even better if he calculated the exact size – even in the current PMD version 4.8s from January 2020, it's still a hardcoded value (see line 8581).
Either way: If the file is larger than this maximum, the concrete effect is down to the INT 21h, AH=3Fh implementation in the underlying DOS version. DOS 3.3 treats the destination address as linear and reads past the end of the segment, DOS 5.0 and DOSBox-X truncate the number of bytes to not exceed the remaining space in the segment, and maybe there's even a DOS that wraps around and ends up overwriting the PMD driver code. In any case: You will overwrite what's after the driver in memory – typically, the game .EXE and its master.lib functions.
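As a small worked example of the second of these behaviors, here's how the truncated byte count would come out if a DOS simply caps the read at the end of the destination segment. This is only arithmetic on the description above, not a model of any actual DOS kernel, and `read_size_truncated_to_segment()` is a made-up name.

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint32_t SEGMENT_SIZE = 0x10000;

// Number of bytes a read into [offset] of a 64 KiB segment would transfer
// if the request is capped at the end of that segment.
uint16_t read_size_truncated_to_segment(uint16_t offset, uint16_t requested)
{
	const uint32_t remaining = (SEGMENT_SIZE - offset);
	return static_cast<uint16_t>(
		std::min(static_cast<uint32_t>(requested), remaining)
	);
}
```

With a buffer that starts 4 KiB before the end of the segment, a hardcoded 20 KiB request would be cut down to 4 KiB under this assumption. Everything between the end of the actual buffer and the end of the segment would still get overwritten.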

It almost feels like a happy accident that this doesn't cause issues in the original games. The largest PMD file in any of the 4 games, the -86 version of 幽夢 ~ Inanimate Dream, takes up 8,099 bytes, just under the 8,192 byte limit for BGM. For modders, I'd really recommend implementing this properly, with PMD's AH=22h function and error handling, once position independence has been reached.
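For what it's worth, the core of such a fix is nothing more than a size check before the copy. The following portable C++ sketch assumes that the buffer size has already been retrieved from the driver (via AH=22h in PMD's case, as mentioned above) and is simply passed in as a parameter. It doesn't perform any DOS or PMD calls, and `snd_load_checked_sketch()` and its result enum are hypothetical names rather than anything from the ReC98 codebase.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class snd_load_result_t {
	SUCCESS,
	FILE_TOO_LARGE,
};

// Copies [file] into the driver's song or sound effect buffer, but only if
// it actually fits. [buf_size] stands in for the size reported by the
// driver.
snd_load_result_t snd_load_checked_sketch(
	uint8_t* buf, std::size_t buf_size, const std::vector<uint8_t>& file
)
{
	if(file.size() > buf_size) {
		return snd_load_result_t::FILE_TOO_LARGE;
	}
	std::copy(file.begin(), file.end(), buf);
	return snd_load_result_t::SUCCESS;
}
```

With the 8,192-byte BGM buffer from GAME.BAT, the 8,099-byte file mentioned above would still load, while anything approaching the hardcoded 20 KiB would be rejected instead of silently trashing memory.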

Whew, didn't think I'd be doing more research into KAJA's drivers during regular ReC98 development! That's probably been the final time though, as all involved functions are now decompiled, and I'm unlikely to iterate over them again.


And that's it! Repaid the biggest chunk of technical debt, time for some actual progress again. Next up: Reopening the store tomorrow, and waiting for new priorities. If we got nothing by Sunday, I'm going to put the pending [Anonymous] pushes towards some work on the website.

📝 Posted:
💰 Funded by:
[Anonymous]
🏷️ Tags:

Whoops, the build was broken again? Since P0127 from mid-November 2020, on TASM32 version 5.3, which also happens to be the one in the DevKit… That version changed the alignment for the default segments of certain memory models when requesting .386 support. And since redefining segment alignment apparently is highly illegal and absolutely has to be a build error, some of the stand-alone .ASM translation units didn't assemble anymore on this version. I've only spotted this on my own because I casually compiled ReC98 somewhere else – on my development system, I happened to have TASM32 version 5.0 in the PATH during all this time.
At least this was a good occasion to get rid of some weird segment alignment workarounds from 2015, and replace them with the superior convention of using the USE16 modifier for the .MODEL directive.

ReC98 would highly benefit from a build server – both in order to immediately spot issues like this one, and as a service for modders. Even more so than the usual open-source project of its size, I would say. But that might be exactly because it doesn't seem like something you can trivially outsource to one of the big CI providers for open-source projects, and quickly set it up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular 64-bit Windows 10 and DOSBox running the exact build tools from the DevKit. Ideally, though, such a server should really run the optimal configuration of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step to run natively, which already is something that no popular CI service out there offers. Then, we'd optimally expand to Linux, every other Windows version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd be a lot. An experimental project all on its own, with additional hosting costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest there is once the store reopens (which will be at the beginning of May, at the latest). That aside, it would 📝 also be a great project for outside contributors!


So, technical debt, part 8… and right away, we're faced with TH03's low-level input function, which 📝 once 📝 again 📝 insists on being word-aligned in a way we can't fake without duplicating translation units. Being undecompilable isn't exactly the best property for a function that has been interesting to modders in the past: In 2018, spaztron64 created an ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human multiplayer mode: 2021-04-04-TH03-WASD-2player.zip
However, this remapping attempt remained quite limited, since we hadn't (and still haven't) reached full position independence for TH03 yet. There's quite some potential for size optimizations in this function, which would allow more BIOS key groups to already be used right now, but it's not all that obvious to modders who aren't intimately familiar with x86 ASM. Therefore, I really wouldn't want to keep such a long and important function in ASM if we don't absolutely have to…

… and apparently, that's all the motivation I needed? So I took the risk, and spent the first half of this push on reverse-engineering TCC.EXE, to hopefully find a way to get word-aligned code segments out of Turbo C++ after all.

And there is! The -WX option, used for creating DPMI applications, messes up all sorts of code generation aspects in weird ways, but does in fact mark the code segment as word-aligned. We can consider ourselves quite lucky that we get to use Turbo C++ 4.0, because this feature isn't available in any previous version of Borland's C++ compilers.
That allowed us to restore all the decompilations I previously threw away… well, two of the three, that lookup table generator was too much of a mess in C. :tannedcirno: But what an abuse this is. The subtly different code generation has basically required one creative workaround per usage of -WX. For example, enabling that option causes the regular PUSH BP and POP BP prolog and epilog instructions to be wrapped with INC BP and DEC BP, for some reason:

a_function_compiled_with_wx proc
	inc 	bp    	; ???
	push	bp
	mov 	bp, sp
	    	      	; [… function code …]
	pop 	bp
	dec 	bp    	; ???
	ret
a_function_compiled_with_wx endp

Luckily again, all the functions that currently require -WX don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty close: snd_se_reset(void) is one of the functions that require word alignment. Previously, it shared a translation unit with the immediately following snd_se_play(int new_se), which does take a parameter, and therefore would have had its prolog and epilog code messed up by -WX. Since the latter function has a consistent (and thus, fakeable) alignment, I simply split that code segment into two, with a new -WX translation unit for just snd_se_reset(void). Problem solved – after all, two C++ translation units are still better than one ASM translation unit. :onricdennat: Especially with all the previous #include improvements.

The rest was more of the usual, getting us 74% done with repaying the technical debt in the SHARED segment. A lot of the remaining 26% is TH04 needing to catch up with TH03 and TH05, which takes comparatively little time. With some good luck, we might get this done within the next push… that is, if we aren't confronted with all too many more disgusting decompilations, like the two functions that ended this push. If we are, we might be needing 10 pushes to complete this after all, but that piece of research was definitely worth the delay. Next up: One more of these.

📝 Posted:
💰 Funded by:
Ember2528, Yanga
🏷️ Tags:

Well, make that three days. Trying to figure out all the details behind the sprite flickering was absolutely dreadful…
It started out easy enough, though. Unsurprisingly, TH01 had a quite limited pellet system compared to TH04 and TH05:

As expected from TH01, the code comes with its fair share of smaller, insignificant ZUN bugs and oversights. As you would also expect though, the sprite flickering points to the biggest and most consequential flaw in all of this.


Apparently, it started with ZUN getting the impression that it's only possible to use the PC-98 EGC for fast blitting of all 4 bitplanes in one CPU instruction if you blit 16 horizontal pixels (= 2 bytes) at a time. Consequently, he only wrote one function for EGC-accelerated sprite unblitting, which can only operate on a "grid" of 16×1 tiles in VRAM. But wait, pellets are not only just 8×8, but can also be placed at any unaligned X position…

… yet the game still insists on using this 16-dot-aligned function to unblit pellets, forcing itself into using a super sloppy 16×8 rectangle for the job. 🤦 ZUN then tried to mitigate the resulting flickering in two hilarious ways that just make it worse:

  1. An… "interlaced rendering" mode? This one's activated for all Stage 15 and 20 fights, and separates pellets into two halves that are rendered on alternating frames. Collision detection with the Yin-Yang Orb and the player is only done for the visible half, but collision detection with player shots is still done for all pellets every frame, as are motion updates – so that pellets don't end up moving half as fast as they should.
    So yeah, your eyes weren't deceiving you. The game does effectively drop its perceived frame rate in the Elis, Kikuri, Sariel, and Konngara fights, and it does so deliberately.
  2. 📝 Just like player shots, pellets are also unblitted, moved, and rendered in a single function. Thanks to the 16×8 rectangle, there's now the (completely unnecessary) possibility of accidentally unblitting parts of a sprite that was previously drawn into the 8 pixels right of a pellet. And this is where ZUN went full :tannedcirno: and went "oh, I know, let's test the entire 16 pixels, and in case we got an entity there, we simply make the pellet invisible for this frame! Then we don't even have to unblit it later!" :zunpet:

    Except that this is only done for the first 3 elements of the player shot array…?! Which don't even necessarily have to contain the 3 shots fired last. It's not done for the player sprite, the Orb, or, heck, other pellets that come earlier in the pellet array. (At least we avoided going 𝑂(𝑛²) there?)

    Actually, and I'm only realizing this now as I type this blog post: This test is done even if the shots at those array elements aren't active. So, pellets tend to be made invisible based on comparisons with garbage data. :onricdennat:

    And then you notice that the player shot unblit​/​move​/​render function is actually only ever called from the pellet unblit​/​move​/​render function on the one global instance of the player shot manager class, after pellets were unblitted. So, we end up with a sequence of

    Pellet unblit → Pellet move → Shot unblit → Shot move → Shot render → Pellet render

    which means that we can't ever unblit a previously rendered shot with a pellet. Sure, as terrible as this one function call is from a software architecture perspective, it was enough to fix this issue. Yet we don't even get the intended positive effect, and walk away with pellets that are made temporarily invisible for no reason at all. So, uh, maybe it all just was an attempt at increasing the frame rate on lower spec PC-98 models?

Yup, that's it, we've found the most stupid piece of code in this game, period. It'll be hard to top this.


I'm confident that it's possible to turn TH01 into a well-written, fluid PC-98 game, with no flickering, and no perceived lag, once it's position-independent. With some more in-depth knowledge and documentation on the EGC (remember, there's still 📝 this one TH03 push waiting to be funded), you might even be able to continue using that piece of blitter hardware. And no, you certainly won't need ASM micro-optimizations – just a bit of knowledge about which optimizations Turbo C++ does on its own, and what you'd have to improve in your own code. It'd be very hard to write worse code than what you find in TH01 itself.

(Godbolt for Turbo C++ 4.0J when? Seriously though, that would 📝 also be a great project for outside contributors!)


Oh well. In contrast to TH04 and TH05, where 4 pushes only covered all the involved data types, 4 pushes were enough to completely cover all of the pellet code in TH01. Everything's already decompiled, and we never have to look at it again. 😌 And with that, TH01 has also gone from by far the least RE'd to the most RE'd game within ReC98, in just half a year! 🎉
Still, that was enough TH01 game logic for a while. :tannedcirno: Next up: Making up for the delay with some more relaxing and easy pieces of TH01 code, that hopefully make just a bit more sense than all this garbage. More image formats, mainly.

📝 Posted:
💰 Funded by:
Ember2528
🏷️ Tags:

Sadly, we've already reached the end of fast triple-speed TH01 progress with 📝 the last push, which decompiled the last segment shared by all three of TH01's executables. There's still a bit of double-speed progress left though, with a small number of code segments that are shared between just two of the three executables.

At the end of the first one of these, we've got all the code for the .GRZ format – which is yet another run-length encoded image format, but this time storing up to 16 full 640×400 16-color images with an alpha bit. This one is exclusively used to wastefully store Konngara's sword slash and kuji-in kill animations. Due to… suboptimal code organization, the code for the format is also present in OP.EXE, despite not being used there. But hey, that brings TH01 to over 20% in RE!

Decoupling the RLE command stream from the pixel data sounds like a nice idea at first, allowing the format to efficiently encode a variety of animation frames displayed all over the screen… if ZUN actually made use of it. The RLE stream also has quite some ridiculous overhead, starting with 1 byte to store the 1-bit command (putting a single 8×1 pixel block, or entering a run of N such blocks). Run commands then store another 1-byte run length, which has to be followed by another command byte to identify the run as putting N blocks, or skipping N blocks. And the pixel data is just a sequence of these blocks for all 4 bitplanes, in uncompressed form…
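To illustrate where that overhead comes from, here's a rough C++ sketch of how such a command stream could be expanded into one "put or skip" flag per 8×1 block. The concrete byte values of the commands are not given above, so the four constants below are pure placeholders, as are the function name and the handling of truncated or unknown input. Only the structure (command byte, optional run length byte, second command byte for runs) follows the description.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder values, NOT the real ones from the .GRZ format.
constexpr uint8_t CMD_SINGLE = 0; // Put a single 8×1 block
constexpr uint8_t CMD_RUN = 1;    // Enter a run of N blocks
constexpr uint8_t RUN_PUT = 0;    // Run type: put N blocks
constexpr uint8_t RUN_SKIP = 1;   // Run type: skip N blocks

// Expands the RLE command stream into one flag per 8×1 block position:
// `true` consumes the next block from the pixel data, `false` skips the
// position.
std::vector<bool> grz_expand_rle_sketch(const std::vector<uint8_t>& rle)
{
	std::vector<bool> put;
	std::size_t i = 0;
	while(i < rle.size()) {
		const uint8_t cmd = rle[i++];
		if(cmd == CMD_SINGLE) {
			put.push_back(true);
		} else if(cmd == CMD_RUN) {
			if((i + 2) > rle.size()) {
				break; // Truncated run
			}
			const uint8_t length = rle[i++];
			const bool is_put = (rle[i++] == RUN_PUT);
			put.insert(put.end(), length, is_put);
		} else {
			break; // Unknown command
		}
	}
	return put;
}
```

In this layout, every run costs 3 bytes in the command stream and every single block costs a full byte for what is effectively 1 bit of information.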

Also, have some rips of all the images this format is used for:

boss8.grz, images 1/16 through 16/16

To make these, I just wrote a small viewer, calling the same decompiled TH01 code: 2020-03-07-grzview.zip
Obviously, this means that it not only must be run on a PC-98, but also discards the alpha information. If any backers are really interested in having a proper converter to and from PNG, I can implement that in an upcoming push… although that would be the perfect thing for outside contributors to do.

Next up, we got some code for the PI format… oh, wait, the actual files are called "GRP" in TH01.