Blog

Showing all posts tagged

📝 Posted:
💰 Funded by:
[Anonymous], Ember2528
🏷️ Tags:

Alright! I've been announcing a big look at TH03's in-game systems all throughout 2025, and I technically still made it before the end of the year. :tannedcirno: TH03's enemy, fireball, and explosion systems are a great fit for this occasion: They fulfill both of the netplay-relevant criteria I mentioned 📝 at the end of the previous blog post, but also unfortunately share the same structure and overload some of their fields with vastly different meanings, much like 📝 TH04's and TH05's custom entities. Hence, they will take rather long to untangle, which ensures that the resulting look will be appropriately big.

Until I noticed that explosions spawn bullets in a not-so-straightforward way that basically requires complete knowledge of the bullet system. What a great discovery to make 2½ pushes into development… Oh well, we have the budget, and bullets also happen to match our netplay-relevant criteria, so let's get those done first.

  1. A cursory look at character-specific attacks
  2. TH03's bullet logic
  3. TH03's bullet renderer
  4. Migrating away from Twitter

As usual, we'd also like to identify and name all character-specific functions in the ASM code so that we can immediately correlate certain interesting features of the bullet system with the characters and attacks that use them. In TH03, this is particularly worthwhile because it's all we need for a 100% complete overview of how bullets are used. Apart from the transferred pellets fired from exploding enemies outside of Gauge or Boss Attacks, every bullet pattern in the game is part of such a hardcoded and character-specific Extra, Gauge, or Boss Attack, since enemy scripts cannot fire bullets in this game.
This quick look also showed how ZUN implemented the 9 characters in a highly consistent manner. Gauge Attacks in particular follow a predictable convention:

The funny part in all of this: All characters follow these conventions, yet ZUN still architected TH03 as if they don't. Each of the two Gauge Attack levels gets a separate per-player function pointer, but every character just uses these two functions to call a single common function with a flag that indicates Level 2 or Level 3. This common function then follows a similar structure for all 9 characters as well. The same trend continues with the Boss Attacks, where we find 9 copies of the more or less unchanged update and rendering boilerplate… so yeah, ZUN basically copy-pasted the same code 9 times with minor variations.
And now we can be very hyped for the future of TH03 decompilation. Lots of duplicates of the same functionality means that I'll basically only have to decompile them once, which means that TH03 decompilation is very likely to progress very quickly in terms of absolute numbers once I get the basic gameplay systems done. And there's not a lot of that code left either: After this delivery, we're left with a mere 128 undecompiled foundational functions that are not related to any specific character. After the next delivery, that number will drop to 95. I'm expecting a return to the glorious days of 2020, where the 3 copies of TH01's foundational graphics code allowed me to decompile 10% of its entire code within weeks. With 9 copies tripling that speed, we may even get to finish this game next year?


Onto bullets then! As you'd expect, TH03's bullet system is based on 📝 TH02's system we looked at earlier this year, which in turn was based on TH01's system. In some respects, it's a minor iteration of TH02's system adapted to the new features in TH03, but some of these new features also form the missing link between TH02 and 📝 TH04. The high-level overview:


Three features of the bullet system deserve a deeper look:

Trail sprites

These work by remembering the last 6 positions of a bullet and rendering the sprites at the 2nd, 4th, and 6th position, respectively:

Obviously, this requires (6 × 2 × 2) = 24 additional bytes per bullet. Adding these to the regular bullet structure would waste (24 × 320) = 7,680 bytes of conventional RAM, which would in no way be justified for a feature that ZUN used in a grand total of two patterns. For once, ZUN agreed, and instead provided a single ring buffer that can hold these 6 additional positions for up to 48 bullets. This allowed ZUN to reduce the per-bullet cost to 3 bytes: 1 byte for the has trail flag, and 2 bytes for a near pointer into the ring buffer. That's still two more bytes than absolutely needed, and the debloated branch will definitely free up these wasted 640 bytes for portability reasons alone.

This trail sprite cap of 48 seems a bit random at first. Unlike the regular bullet cap that the game enforces by just not spawning any new bullets if all 320 slots are occupied, the trail sprite cap is not enforced or even just checked in any way. Due to the circular nature of the buffer, the 49th simultaneously active bullet with a trail sprite will then share its position memory with the 1st trail sprite bullet, leading to one additional position memory update per frame and trail sprites appearing in wrong positions.
Thus, it's the game design's responsibility to make minimal use of trail sprites to avoid these glitches. On the surface, it certainly looks as if ZUN was careful here:

  1. The cap happens to exactly match the 48 bullets fired as part of the ring group in Chiyuri's pattern, which is definitely the more bullet-intensive pattern of the two.
  2. Both trail-using patterns are part of Boss Attacks, and only a single player's Boss Attack can be active at any given time.
  3. The ring groups in Chiyuri's pattern move fast and are separated by a 96-frame delay, as captured in the video above. By the time Chiyuri spawns the next group, every bullet of the previous one should have long been removed due to flying past the edges of the playfield.
  4. Mima fires her alternating 5- and 4-spreads at a much shorter interval, but that interval is still long enough to never leave more than 5 of these 9-bullet subpatterns on screen at once.

Until you test Chiyuri's pattern on the (📝 announced) Boss Attack level 1 and notice that the bullets move slow enough for 2) to no longer apply. The result:

Missing bullets on every group beyond the first, and even sporadic trail sprites at mixed-up X and Y coordinates. This nicely demonstrates how these trail sprites are not just cosmetic, but also take control of clipping and affect gameplay as a result. For obvious optical reasons, trail sprite bullets will only get removed after the 6th remembered position lies outside of the clipping area – i.e., 6 frames later than bullets without trail sprites. If the remembered positions are then shared with a second bullet, the game would also clip that bullet if the clipping condition of the first one is met – regardless of the fact that the second bullet's main sprite might be nowhere close to the boundaries of the playfield. This clipping then either happens on the same frame if the second bullet's slot number within the 320-element bullet array is higher than the slot number of the first one, or on the next frame if the second bullet's slot number is lower.
The mixed-up X and Y positions on frames 139 and 235 can also be explained by clipping. The update function processes the X and Y coordinates independently from each other: It starts with the horizontal clipping checks, updates the position memory for the X coordinate, and then repeats both steps for the Y coordinate, immediately removing the bullet and moving on to the next one if it failed the respective clipping check. If the vertical clipping checks fail in a situation where two bullets share the same position memory, you'll end up with a mismatched X/Y pair where X comes from a clipped bullet and Y comes from an active one… for a single frame, until the same clipping check is applied to the other bullet and removes it as well. Hence, this is the only "fixable" bug in the bullet system that won't affect gameplay, as the mixed-up positions are unrelated to the result of the clipping condition that ultimately removes both bullets.

It's rare for Chiyuri's Boss Attack to launch the same pattern multiple times in a row, and once the (announced) Boss Attack level is ≥3, bullets already move fast enough to prevent this quirk from happening. But it's definitely possible to run into it during regular gameplay.

For a clearer and more extreme demonstration of the resulting glitches, let's turn Mima's trail sprite pattern into a 64-ring:

Explaining every single quirk in this hypothetical video is left as an exercise to the reader.

Rings and other groups

If there's one aspect where TH03's bullet system shows its TH02 lineage most clearly, it's the set of predefined bullet groups. The 2-, 3-, 4-, and 5-spreads with fixed narrow, medium, and wide arc angles, as well as the multi-bullet groups with randomized angles and speeds, are not only available in TH03 once again, but reuse the exact same code from TH02.
Instead, TH03's main innovation can be found in its ring system. Rings can now have any number of bullets between 0 and 255, and are no longer limited to the first six powers of 2. This allowed ZUN to fine-tune most ring groups based on the Gauge or Boss Attack level, and to also just have a few static ring patterns with non-power-of-two bullet counts. Chiyuri's aforementioned 48-ring trail sprite pattern falls in this category, and the rotating 5-ring pattern seen in Kana's Boss Attack is another example.

Screenshot of the fixed 5-ring pattern used in TH03 Kana's Boss Attack

And yes, storing the number of ring bullets in a regular uint8_t field now also allows patterns to spawn 0-rings. And sure enough, the bug that would later cause 📝 Kurumi's Divide Error crash in TH04 was actually introduced in TH03! The underlying code wasn't modified between the two games, which further proves that TH04's bullet system also traces back to TH03 and wasn't rewritten from scratch, at least concerning this aspect. TH03 just doesn't have any (known) way of triggering the bug in the unmodded original game.
Interestingly, TH02's ring system with predefined power-of-2 bullet counts is still part of TH03, and ZUN does use it for some ring groups in a few Boss Attack patterns. Did he do this because it's shorter than adding a second line of code that sets bullet_template.count? Did he deliberately need to preserve the previous value of bullet_template.count across groups? Or did he code these patterns at an earlier time in development when the arbitrary ring system didn't exist yet? Until I've decompiled every single bullet pattern in this game, we can only guess.

However, ZUN also removed two of TH02's group-related features from TH03:

  1. The eight special motion types have been reduced to a single gravity type. While gravity is now a separate flag in the bullet structure and template that can now be applied to any group, this vast removal of options still severely limits the expressivity of bullet patterns in TH03. This means that every non-gravity bullet in the game moves at a constant velocity.
    Gravity is also exclusively used by Kana, in both her Extra attack with the alternative 16×16 bullet sprites as well as in one certain pattern of her Boss Attack, which demonstrates gravity in combination with a ring group.

    One of the rare patterns that arguably looks prettier on Easy, where the slower bullet speeds leave more room for the gravity effect to accelerate the fall.

  2. The auto-stacking system was removed without any direct replacement. With TH03's more 📝 numeric method of defining difficulty, ZUN no longer needed this quick mechanism to 📝 distinguish Easy and Normal from Hard and Lunatic. This was one of the better changes between the two games though; the auto-stacking system added a quite annoying asterisk to the documentation of the random groups that is no longer needed in TH03.
    Manually creating stacks is obviously still possible by spawning separate versions of the same group with gradually reduced speed. This might be considered another practical advantage of the global bullet template, since you only need to mutate a single field before calling bullets_add() again. But really, nothing justifies global data.

Transferred pellets

This is the final gameplay feature that deserves its own section. Let's follow the pellet's X coordinate on its way from the spawn point to its destination on the other playfield:

  1. ZUN calculates the destination coordinate on the target playfield as a random Q12.4 X coordinate between 0 and 288.
  2. This subpixel coordinate is translated to screen space, adding either 16 for the left playfield or 336 for the right one. ZUN does this using the regular pixel-space conversion function that is typically used to calculate blitting coordinates, losing subpixel precision in the process and forming a very minor quirk.
  3. The pellet's movement angle is calculated in screen space, aiming a screen-space version of the pellet's origin point at the coordinate from 2).
  4. The screen-space pixel coordinate from 2) is translated back to a Q12.4 subpixel coordinate on the pellet's originating playfield. The result will deliberately lie outside the boundaries of this playfield: For a pellet flying from left to right, it will be between 320 and 608, while it will be between -320 and -32 for a pellet flying from right to left.
  5. The pellet then flies to this out-of-bounds coordinate while internally staying on the playfield it originated on. This means that neither the update nor the rendering code can clip the pellet at the borders of its originating playfield. Once it flew past the border, it only visually appears on the other playfield because that's what the out-of-bounds X coordinate translates to when the renderer converts it to screen space.
  6. Once the pellet's X coordinate has approached or flown past this relative target coordinate from its respective movement direction, the pellet is removed and respawned as a delay cloud.

This shows that the 32-pixel border between the two playfields is not just visual, but an actual part of the simulated game world. We can visualize this by removing the black cells on the text layer:

Also, we need to clear these 32 border pixels in VRAM on every frame to nicely visualize just these pellet transfers. TH03 obviously doesn't do that for performance reasons and lets partially clipped sprites accumulate below the border, 📝 just like the other shmups do.
This video also demonstrates another minor quirk: Transferred pellets are aimed at a random center Y coordinate between 0 and 16, but the subsequent delay cloud is always spawned at center_y = 2.0.

Speaking of, there's also a lot to cover in…

TH03's bullet renderer

And at first, it looks pretty good! TH03 retains the best idea from TH02 and batches rendering into three passes:

  1. 16×16 bullets, rendered normally via SPRITE16
  2. 32×32 delay clouds, rendered via SPRITE16's monochrome mode
  3. 8 pellets, rendered from hardcoded sprites via the GRCG

The second pass is skipped if the first pass didn't detect at least a single active delay cloud. However, this skip can only remove 320 out of the 960 iterations over the entire bullet array, every frame. Combine that with the most unlucky allocation of registers, and the resulting instructions end up wasting a low 5-digit number of CPU cycles per frame on a 486 in the worst case of no bullets being active. Same game that wrote large parts of its bullet update function in ASM, by the way. :zunpet:
While that number is still an order of magnitude away from causing significant performance problems, this issue became serious enough in TH04 for ZUN to introduce a display list for at least pellets that would cut down the number of iterations.

And then, we look at…

Pellet rendering

… and are greeted by the single strangest set of hardcoded sprites across all of PC-98 Touhou so far. TH03's pellets are not only the first time we see a doubly-preshifted sprite sheet, but 2 of the 16 variants for the transfer pellet sprites are also shifted incorrectly:

You can see this bug all over the video above, for example in frame 97, 105, 107, 109, 111…
TH03's pellet sprite sheet, with byte boundaries as red vertical linesTH03's pellet sprite sheet, with byte boundaries as red vertical lines and the bugged transfer sprites at bit offsets 6 and 14 highlighted in red

📝 Looks familiar? Now we know that ZUN still didn't have a tool for automatically preshifting sprites by the time he developed TH03 – something that 📝 I considered a necessity 5½ years ago. :onricdennat:

The doubly-preshifted nature of this sprite sheet, on the other hand, raises a whole lot of PC-98 blitting performance questions. This only possibly makes sense as an attempt at optimizing away the unaligned 16-bit VRAM writes you'd naturally run into when shifting an 8-wide sprite to cover two bytes.
Let's look at a regular 8-wide pellet sprite that was singly-preshifted to 16 pixels/bits. If we want to blit such a sprite to a left X position of 12 with the minimum amount of instructions, we would perform a single 2-byte write to VRAM address 0x0001, which itself is not divisible by 2:

Visualization of blitting a preshifted 8×8 pellet (16 pixels wide) to horizontal pixel #12, halfway within an odd byte address and crossing a 16-bit word boundary

On most 16-bit architectures, unaligned memory writes like these are either slower than aligned writes or entirely unsupported. The x86 MOV and MOVS instructions fall into the first category, so it makes sense to think that the GRCG might add a performance penalty of its own on top of the already higher latency of these instructions.
The natural workaround, then, is to add a second set of preshifted sprites to cover the remaining 8 possible start bit positions within a 16-pixel VRAM word. This would expand pellet sprites to a total width of 23 pixels. Understandably, ZUN also wanted to optimize for the low instruction counts, so he had to round up the physical width of the sprite to 32 pixels. Then, every preshifted variant could be blitted with a single MOVSD instruction:

Visualization of blitting a doubly-preshifted 8×8 pellet (32 pixels wide) to horizontal pixel #12, replacing the unaligned 16-bit word write with an aligned 32-bit write

But does this actually matter for the PC-98 and the GRCG? Are unaligned writes actually slow enough to justify writing 2× as much sprite data per frame and hardcoding 4× as many bytes? Unfortunately, I don't know of any hardware-level documentation about the GRCG that would conclusively answer this question. All the usual books and text files are disappointingly surface-level and only document the same programmer interfaces over and over, and hardware researchers are still waiting for EGC and GRCG die shots to even get started.
There are a few signs that this might be a good idea:

But without documentation or benchmarks, none of this means anything.
This is also why I haven't yet explored this whole field of optimizing VRAM writes for alignment. It would always involve branching to alignment-respecting code similar to how master.lib does it, but code like this is at odds with the more tangible goal of minimizing instruction counts in the generic case. Not to mention that we'll once again have to test this across every PC-98 hardware generation and possibly even GRCG revision if we ever go down to that level of optimization…

But even if alignment matters, ZUN's unconditional MOVSD instructions approach still appears to be slower on average. Consider the optimal 56.25% of cases where the sprite does lie within a single 16-bit word:

Visualization of blitting a doubly-preshifted 8×8 pellet (32 pixels wide) to horizontal pixel #0, demonstrating how doubly-preshifted sprites wastefully write the second 16-bit word in the optimal 56.25% of casesVisualization of blitting a doubly-preshifted 8×8 pellet (32 pixels wide) to horizontal pixel #8, demonstrating how doubly-preshifted sprites wastefully write the second 16-bit word in the optimal 56.25% of cases
8 start positions within the first byte + 1 start position on the second byte = 9/16 = 56.25%. The 9th variant for (x % 16) == 8 wouldn't be part of a regular singly-preshifted sprite sheet where the renderer blits the (x % 8)th = 0th variant. But it would definitely be worth adding if alignment does matter at all.

Keep in mind that we still use the GRCG here, and that it will also have to perform its fast-but-not-entirely-free four-plane Read-Modify-Write operation for the empty sprite bytes 3 and 4. Unconditional 32-bit writes would only be worth it if the GRCG somehow optimizes away empty writes at the microarchitecture level. That assumption is even more of a stretch, because 📝 why would master.lib even check for emptiness if that were true?

In the end, doubly-preshifted sprites slow down 56.25% of all blitting operations in a dubious attempt to speed up the other 43.75%. Unaligned 16-bit writes would have to be really slow to justify this approach – and judging from the fact that TH04 went back to single-byte preshifting, this is not the case. Maybe I'll write a benchmark for this someday, but honestly, this is the least interesting PC-98 benchmark question I've encountered so far. There are slowdown issues at 📝 our performance target of 66 MHz in Neko Project, but pellet sprite alignment is unlikely to significantly contribute to those.


16×16 bullets are simply rendered using standard SPRITE16 calls, nothing special there. That only leaves…

Pellet delay clouds

Animation frame #1 of TH03's pellet delay cloudsAnimation frame #2 of TH03's pellet delay cloudsAnimation frame #3 of TH03's pellet delay clouds

Just like the 48×48 📝 hit circle, these 32×32 sprites are rendered using SPRITE16's single-color render-path, which uses the EGC's GRCG-equivalent mode. Last year, I took a very brief look at this mode and wondered whether this was actually faster than just using the GRCG. 1½ years and 📝 one benchmark won by the EGC later, it certainly seems so, especially since we want to blit these to unaligned X positions. The EGC's hardware-accelerated pixel shifting seems highly preferable once sprite widths exceed 24 pixels and you can't fit a row of pixels in a 32-bit register anymore.
Stepping through SPRITE16 reveals that this GRCG-equivalent mode matches the GRCG even in how it doesn't read monochrome sprite data from VRAM, but from SPRITE16's 1bpp alpha mask buffer in conventional RAM.

But that only raises the question of why you'd want to use SPRITE16 over the raw EGC. It makes sense why SPRITE16 would have this feature; flashing existing sprites in a single color every once in a while is a useful thing to have in a game-focused rendering API. But using this feature for sprites that are only rendered in this monochrome mode just wastes the VRAM that these sprites occupy in SPRITE16's sprite area. You still blit such a sprite by passing a byte offset into the sprite area, which then gets interpreted as an offset into SPRITE16's alpha mask buffer.
If SPRITE16 had a function for directly blitting from a pointer to 1bpp data, ZUN could have freed up quite a bit of VRAM and maybe even added more sprites for character-specific attacks. Conceptually, it makes sense why SPRITE16 would restrict itself to a single sprite source, but it is quite an unfortunate omission, I'd say.

TH03's SPRITE16 sprite area, with monochrome sprites highlighted
13,568 pixels, to be exact. And yeah, you could technically overwrite the affected portions of VRAM after generating alpha masks via INT 42h, AH=01h. But since SPRITE16 only stores one such alpha mask buffer, you still couldn't reuse this space for other SPRITE16 sprites.

And that was the last PC-98 Touhou bullet system we were still missing! But at a little over 2 pushes, I have to find something else to do to round out the third one… wait, what about that one incident?


Migrating away from Twitter

On November 6, Twitter was hit by an automated ban wave that suspended all accounts that were using the OldTweetDeck extension. After Twitter discontinued the official free TweetDeck frontend on 2023-08-17, I quickly switched to OldTweetDeck – not just because it was free, but because it supported multiple accounts and was simply more performant than X's own premium offering at the time. In return, I gladly paid my €10 a month to dimden instead, who deserved it much more for continuously updating OldTweetDeck to all of Twitter's API changes over the years. It's very impressive how he kept it running for 2⅓ years without any such critical issues and still keeps maintaining it to this day.
Aside from Touhou Patch Center and all of my accounts, the ban wave affected enough people that Twitter decided to gradually revert it a day later. But without any public postmortem or excuse, this feels more like an act of gratitude that we shouldn't take for granted.

Ever since Elon took over, the Internet has been full of sensationalist doomposts about Twitter's imminent downfall any moment now OMG. For the longest time, I could ignore all these pundits because nothing of what they were complaining about was affecting my little corner. But sudden account suspensions are an existential threat to my business, and finally provided the first actual technical and non-political argument to get my data off Twitter in the medium term. I've put too much effort into all of the content there to let it be exclusively controlled by any one company.

Hilariously, things only got worse from there. Until two days ago, Twitter's data download option was inaccessible due to an infinite redirection bug. Call it malice or incompetence, but leaving such an issue unfixed for weeks is a definite sign of a platform in decline. Thus, I had to run the import on an older archive I happened to request on 2023-07-02.
And then I looked inside that archive and noticed that it was missing at least three key pieces of data that Twitter demonstrably stores for my account:

  1. Poll options
  2. Alt text for images. (Also known as the actually most annoying and time-consuming part of every tweet if you actually want to properly explain an image with all its context and implications. AI won't help with that as long as its context window doesn't span every piece of knowledge related to this project. 🤷)
  3. The original version of each uploaded image, which they do have for a fact because it's shown in the /status view. The archive only contains the processed versions shown in the timeline, which were resized to at most 1200 pixels along their larger dimension and which may or may not have been converted to JPEG based on rules I didn't bother to reverse-engineer.

That's a rather selective interpretation of Art. 20 GDPR. If the argument is that you can just scrape that data out of the HTML yourself, why are they even bothering with sending me anything more than a nested list of tweet IDs, then? 🤨 Someone with more time and care could probably turn this into a lawsuit…
Presenting all media in its original quality is one of the more important reasons for moving to a self-hosted service as far as I'm concerned, especially since mainstream media conversion pipelines are infamous for destroying pixel art. So I went through my hard drives and replaced Twitter's images with the original versions of all 167 non-retweeted images I had uploaded to Twitter until July 2023. The videos also desperately needed to be replaced with their original AV1 versions; Twitter's enforced x264 YUV420P format has been the single worst implementation detail of that entire platform…

Trying a Bluesky PDS

So, time to complete the 📝 Bluesky self-hosting plan from last year? Setting up a PDS isn't all too annoying, but the import process leaves a lot to be desired:

Figuring out and confirming that first issue required remote debugging of the PDS server written in Node.js. Visual Studio Code's LSP quickly ran up against my server's low amount of RAM, which forced me to upgrade my server just to efficiently navigate through the source code…
Typical Node.js criticisms aside, the architecture of the PDS server is quite bizarre. A whole lot of the apparent API surface is never directly called, but generically proxied to some other node in the AT Protocol network at the byte level. If you log into your PDS via bsky.app, it seems as if the AppView calls API endpoints like /xrpc/app.bsky.unspecced.getPostThreadV2 on your PDS, but good luck meaningfully intercepting any of these requests, or even just getting your debugger to break on them.
Together with lots of bulky API schema descriptions in the form of lexicons, all this XRPC code makes up a big proportion of the code in the @atproto/pds package. But for… what exactly? Why would the PDS server need a thick layer of type safety and validation for payloads it doesn't look at, and that the relays will have to verify anyway? Why do they install all this dead code that will confuse most people who are trying to understand this system? :thonk:
In the end, we just can't thoroughly backdate our imported posts because the crawl timestamps are set by the relays, whose code we have no control over. Now, I could ignore all these issues and still upload some sort of full archive to the platform that now houses 1/6 of my following, but this just doesn't match the quality I expect from the canonical, definitive source of my short-form news posts.

Trying Mastodon

That leaves the Fediverse as the only remaining alternative for a service where people can still follow, like, and repost my content using relatively commonly used clients. Among the various ActivityPub implementations, Misskey is particularly popular among the Japanese Touhou community, but I've only heard bad things about its resource usage. Mastodon isn't the most lightweight option either – as aptly implied by its name – but you can make the argument that it's become the default option across the Fediverse over the years. Thus, there'll be at least a slight chance that people will be familiar with the web UI of what I'm about to self-host.

Too bad that I didn't even get through the first page of the setup guide before being stumped by obscure asset precompilation errors that apparently no one else has ever faced. In a way, it's commendable that a project would exclusively explain a bare-metal from-source setup in the Docker-dominated DevOps seascape of 2025. But why would you want to do this for a project that requires servers to be infested with npm and Postgres and a bleeding-edge self-compiled version of Ruby and several -dev packages for C dependencies of certain Ruby gems? Unsurprisingly, Japanese Python behaves just like Dutch Ruby in how the community effectively treats every minor version as a major version because there are no adults left in the room to put all the children and Ph.Ds in their place…

Fortunately, ActivityPub is relatively simple to implement and there are plenty of existing servers that are better suited to the kind of PR channel I'm actually looking for. After a very quick search, I settled on…

GoToSocial

…which immediately impresses in pretty much every single area:

And just in case you ever want to import a Twitter archive onto a GoToSocial instance, here is the no-nonsense importer I used:

So if you've got an account on Misskey, Mastodon, or another ActivityPub server, please follow @rec98@nmlgc.net. I'll keep posting everything to both Twitter and Bluesky for the time being, but will no longer advertise either of them. If they ever go down, I'll make no attempt at restoring them.

And that was 2025! It surely brought lots of words, breaking even last year's record by an additional 37% of blog post content. 😮 Here's to 2026 bringing more of the actual reverse-engineering we've been sorely lacking with all the modding and porting projects over the past few years. And with at least four TH03 gameplay pushes queued up, things are already looking quite promising…
Next up: Enemies! Formation scripts! Fireballs! Explosions! Combos, or at least the first part of them! And a slightly more common glitch that players have been wondering about for many years…

📝 Posted:
💰 Funded by:
[Anonymous], Ember2528
🏷️ Tags:

Part 4! Let's get this over with, fix all these landmines, and conclude 📝 this 4-post series about the big 2025 PC-98 Touhou portability subproject. Unless I find something big in TH03, this is likely to be the final long blog post for this year. I wanna code again!

  1. The scope
  2. Retaining menu backgrounds in conventional RAM
  3. Resolving screen tearing landmines
  4. Intermission: Handling unused code on fork branches
  5. Merging TH02-TH05's OP.EXE and MAINE.EXE/MAINL.EXE
  6. Replicating TH02-TH05's GAME.BAT in C++
  7. Topping it off with an actual feature

As you can already tell by this table of contents, this "initial" cleanup work was quite larger in scope than its counterpart for 📝 the first TH01 Anniversary Edition release. Even that already took unexpectedly long 2½ years ago, and now imagine doing that across four games simultaneously while keeping all the little required inconsistencies in place. Then you'll get why this has taken over four months…
With an overall goal of "general portability", it's very tempting to escalate the scope towards covering everything in these menu and cutscene binaries. So I had to draw at least some boundaries:

Even then, this was way premature. Not only because we still need to maintain the memory layout of TH02's and TH03's MAIN.EXE, but also because of all the undecompilable ASM code in all four games that blocks certain architectural simplifications.
The biggest problem, however: I haven't quite decided on how to use static libraries within my build environment yet. Since Turbo C++ 4.0J's linker just blindly includes every explicitly named object file without eliminating dead code, static libraries are essential for reducing bloat by providing a layer of optional files to be included on demand. However, the Windows and DOS versions of TLIB are easily confused, TLIB's usual paradigm of mutating existing library files goes against Tup's explicit dependency graph, and should we really depend on an ancient proprietary tool for a job that I could reimplement in a few hundred lines? Famous last words, I know. But since I 📝 didn't want to do any dedicated build system work this year, I also didn't want to sort out these questions in a 12th or even 13th push. Leaving the build environment woefully ill-equipped for the complexity of this task was probably a mistake; while the resulting workaround of feature bundles does the job, it's very silly and hopefully won't last very long. I'm definitely going to spend some time sorting out the static library situation before I ever attempt something like this again. Or at some general point before we hit the overall 100% finalization mark, because we've still got that long-awaited librarization of ZUN's master.lib fork ahead of us.


Let's get to it then, starting with the feature that will remove lag in menus by removing PC-98-specific page-flipping and EGC code:

Retaining menu backgrounds in conventional RAM

At first, this seems to be no problem. We just swap out master.lib's .PI functions with our forked PiLoad and our generic blitter, and make sure to keep the images allocated. master.lib's graph_pi_load_pack() has always loaded .PI images into one big contiguous buffer in conventional RAM, so this shouldn't negatively affect the heap layout. If anything, we'd be saving memory by 📝 not allocating these extra two rows, right?
Unfortunately, it's that second goal that would turn out to be a massive problem. 📝 The end of part 1 already hinted at how the majority of menu backgrounds are only rendered to VRAM a single time before ZUN immediately frees them from memory. These cases are so common that I defined a macro for them:

#define pi_fullres_load_palette_apply_put_free(slot, fn) { \
	pi_load(slot, fn); \
	pi_palette_apply(slot); \
	pi_put_8(0, 0, slot); \
	pi_free(slot); \
}
At the current state of decompilation, this macro is used 16 times across TH02-TH05, and it will appear an additional 12 times by the time decompilation is done.

In these cases, the games only need that single 128 KiB block temporarily, and then get to reuse that memory for other, more dynamic graphics. Consequently, ZUN probably dimensioned the master.lib heap sizes for TH02-TH05 to leave ample headroom with this fact in mind. I wasn't so sure about deliberately limiting the amount of heap memory 📝 in late 2021 when I fixed the one out-of-memory landmine that remained in TH04, but I've begun to appreciate these memory limits quite a lot as the scope of my research has deepened. Specify the right amount of bytes, perform the single allocation from the DOS heap at startup, and if that allocation succeeds, you've removed an entire class of out-of-memory bugs from consideration. Sure, modders might prefer mem_assign_all() for simplicity during development, but it does make sense to return to a static limit when shipping. For once, ZUN was right, and there is no excuse. :godzun:

On the surface, this macro is equivalent to PiLoad's original direct-to-VRAM approach. And indeed, we can replace this code with a call to PiLoad's original code path in the few cases where we just want to show a static image without unblitting any of its regions later on, removing even the requirement for that temporary 128 KiB block in the process. But in the majority of cases, we do need these images in RAM, and ZUN's original heap sizes simply weren't intended for that.
But how much of a problem are ZUN's limits in practice? Well, there's at least one instance where retaining all images would require significantly more memory than ZUN anticipated. TH02's OP.EXE requests a 256,000-byte master.lib heap, but then wants to do this:

If you step through this video, you'll see that this effect indeed page-flips between the later menu background (128,000 bytes) and the three images with the resized text (3 × 54,720 = 164,160 bytes). Hence, we must have already loaded all four of them into the heap before this animation starts. While the menu's main text is rendered on the text layer, its shadow is part of the graphics layer and must be unblitted when switching to the option menu and back, thus requiring this image in at least some portion of memory.
Since VRAM page 1 always shows the unmodified menu background image, we could cheat, use PiLoad's direct-to-VRAM code path, and then do a second load of the image to conventional RAM on frame 19, before the white-in palette effect. 📝 Frame-perfection rule #2 would also allow us to do that. But adding a second load time is lame, especially because the white-in effect wouldn't even hide that many expensive calls in the debloated build anymore. After replacing the original final graph_pack_put_8() call for the main menu image with a faster planar blit, it only hides the draw calls for the ©ZUN text and some minor file and port I/O to load and apply the menu's final 48-byte palette from OP.RGB.

There's really no reason against just increasing the size of the master.lib heap to incorporate all of these four images at the same time. But how much additional memory do we actually need? Obviously, these four images are not the only allocations on the heap, which also needs to fit at least the following buffers:

We can bypass the super_buffer allocation through a dumb trick. But the single worst aspect hides between all these individual allocations:

Fighting heap fragmentation

Let's look at TH05's OP.EXE, whose heap limit of 336,000 bytes is much more lenient. This limit should be more than enough to fit the new additional 128,000-byte buffer for the background image in addition to the original heap contents on every menu screen, and we can indeed enter the main menu without any issue. But then, we're still greeted with an out-of-memory crash after entering and leaving the Music Room? Let's take a look into the master.lib heap, with the retained background images we'd like to have:

Step Heap layout Total remaining Largest free block Fragmentation loss
Entering the main menu op1.pi (128,000) sft*.cd2 + car.cd2 (7,360) sl*.cdg (90,240) scnum.bft + hi_m.bft (7,680) 76,352 66,240 10,112
Leaving the main menu scnum.bft + hi_m.bft (7,680) 301,984 234,608 67,376
Entering the Music Room music.pi (128,000) 📝 nopoly_B (32,000) scnum.bft + hi_m.bft (7,680) 139,536 66,240 73,296
Leaving the Music Room music.pi (!) (128,000) scnum.bft + hi_m.bft (7,680) 165,552 81,920 83,632
Re-entering the main menu .PI load buffer (16,384) sft*.cd2 + car.cd2 (7,360) sl*.cdg (90,240) scnum.bft + hi_m.bft (7,680) 179,760 115,648 64,112

Something gradually shreds our heap into tiny pieces that ultimately prevent us from allocating the main menu background image a second time. We can surely blame this on ZUN's suboptimal order of load calls that doesn't prioritize larger images over smaller ones, or on the 16 KiB .PI load buffer that we maybe should have allocated statically. But the biggest hidden offender turns out to be… master.lib's packfile implementation?!

Yup. Every time you load a file out of an archive, master.lib heap-allocates a 31-byte state structure and a file read buffer that master.lib originally dimensioned at 520 bytes. TH03's MAIN.EXE and MAINL.EXE then increased the size of that buffer to 4,104 bytes, before TH04 and TH05 went up to 8,200 bytes. :zunpet:
Ultimately though, it's not the size that's the problem here, but the fact that we repeatedly allocate any memory that could have been allocated once when setting up the INT 21h handlers. You'd think that master.lib went for dynamic allocations in order to support the fact that the INT 21h file API lets you open multiple file handles simultaneously, which would point to different RLE-compressed files within the archive. But no, master.lib doesn't even support this case, and even incorrectly returns FileNotFound if you attempt to open a second file from an archive before closing the first one! :onricdennat:

After I identified this whole issue, I immediately wanted to replace master.lib's packfile code with the much saner and more explicit C++ implementation from TH01. Sooner or later, we'll have to do this anyway because we can't just hook file syscalls on other operating systems the way we can hook them on DOS.
However, TH01's implementation would quickly turn out to have its own share of heap fragmentation issues. Every time the game loads an RLE-compressed file from 東方靈異.伝, the archived file is completely decompressed into a newly allocated temporary buffer, from where the game then copies out parts into the actual game structures. The resulting fragmentation is at least easily fixable though, and that's what the TH01 part of the very first push assigned to part 1 went to. Switching to a zero-copy architecture basically only required persisting the RLE state and brought a significant improvement: 15,776 bytes of heap memory during Stage 1 freed up by that switch alone, as reported by the coreleft() output seen in debug mode?! That much for just removing temporary allocations that the game was freeing anyway?
Let's check the Borland C++ DOS Reference for how this value is actually calculated. Turns out that it is simply intended to be a measure of unused RAM memory, and sure enough:

In the large data models, coreleft returns the amount of memory between the highest allocated block and the end of memory.

That's a reasonably meaningful measurement that can be determined in constant time, compared to the 𝑂(𝑛) operation of finding the true total size of available heap memory by walking over every node.

In the end though, rolling out this C++ implementation to the other four games was way premature and would have pushed this delivery way above 12 pushes. After all, both ZUN's and master.lib's code is still full of INT 21h file syscalls that would all need to be replaced. Conditionally, even, given the two binaries that are not yet position-independent…
Fortunately, .PI loading itself is just as much of an issue and can be worked around in the much simpler way I already spoiled when explaining 📝 the API changes I made to PiLoad: We simply hold on to the menu background pixel buffer for as long as possible. Ideally, we only allocate these 128 KiB once, decode every new menu background into that same buffer, and only explicitly free it when we really need to. That's why the platform layer logic requires full control over .PI buffer allocation.
This was enough to keep the required amount of additional heap memory to a more than acceptable level:

And after a few rewritten function calls, we've indeed removed every single EGC-powered inter-page copy from all menus and cutscenes of TH02-TH05! On to the next goal…


Resolving screen tearing landmines

…which requires individual solutions for every case that merely follow a common pattern. If things are already done close to a VSync wait loop and just in a slightly wrong order, the solution is easy and we just have to shift around a few function calls.
But what can we do if ZUN mutates visible pixels at some place far away from the last VSync wait loop? After all, a lot of these landmines result from confusing the current CRT beam position across multiple functions. Often, it's impossible to see at a glance where these menu-specific subfunctions are called within a frame without tracing execution back to the last VSync delay loop at the call site. For starters, it would be nice to clearly formalize that a specific section of code must be run in VBLANK.

master.lib's vsync_Proc function pointer already gets us most of the way there. Its VSync subsystem automatically calls any non-nullptr shortly after the VSync interrupt fires, and our task function would then set vsync_Proc back to a nullptr to ensure the intended one-shot behavior.
However, this approach can at best defer a task to the next VBLANK interval, which might leave us one frame behind the original game and hurt our frame-perfection goals. What we actually want is a conditional approach for timing-sensitive tasks, as a common operation that only requires a single line of code:

Now we're only missing that crucial one bit of information, which is delivered by Bit 5 of the graphics GDC's status register at I/O port 0xA0. In fact, ZUN uses the same bit in all hand-written VSync wait code throughout TH01 and in the bouncing-ball ZUN Soft logo:

void vsync_wait_via_gdc_polling(void)
{
	// Bit 5 of the graphics GDC register indicates VBLANK. Wait until this bit
	// is set.
	while((inportb(0xA0) & 0x20) != 0) {
	}

	// Once Bit 5 is no longer set, the CRT has started drawing the next frame.
	// I have no idea why you would ever want to throw away all your precious
	// vertical retrace time, but ZUN does this all throughout TH01.
	while((inportb(0xA0) & 0x20) == 0) {
	}
}

Of course, this only solves the problem in theory, as the tasks themselves don't come with any real-time guarantees. It's entirely possible for the resulting vblank_run() function to get called near the end of VBLANK, start the task immediately, and return long after the CRT beam has started drawing again. Heck, if the system is slow enough, the task might not even complete within the VBLANK interval if we run it immediately after VSync. But this is a much more complex problem to solve, requiring upfront measurements of both the VBLANK interval and the execution time for each potential task, which can then be factored into the run-now-or-defer decision. We definitely don't need to go there as long as we're mainly targeting emulated 66 MHz systems.

In easy cases, vblank_run() can then resolve screen tearing landmines completely by itself. Towards the end of PC-98 Touhou, ZUN's menus made more and more use of master.lib's blocking palette fading functions, which delay themselves to the next VSync signal and thus avoid any tearing issues. Hence, TH04's and TH05's screen tearing landmines are limited to the very few sudden palette changes that remained in these games:

void return_from_other_screen_to_main_menu(void)
{
	// Loads the .PI image into our persistent menu image background buffer,
	// and overwrites master.lib's 8-bit palette. Takes a few frames and
	// probably won't return during VBLANK.
	GrpSurface_LoadPI(bgimage, &Palettes, "op1.pi");

	graph_accesspage(0);
	bgimage.write(0, 0); // Planar blit
	PaletteTone = 100;   // Use original brightness in palette_show()

-	// ZUN landmine: Updating the hardware palette right now will most likely
-	// cause screen tearing.
-	palette_show();
+	vblank_run(palette_show);

	[…]
}

The fixes for the landmines in TH03 and TH02, however, require much more thought and care to stay as close to ZUN's defined logical frame sequence as possible. TH03's character selection screen, which prompted this whole subproject in the first place, houses one of the harder groups of landmines:

Recorded on DOSBox-X with 375,000 cycles, since exact machine specifications are not important to demonstrate these landmines.

Very finicky work, where every single branch has the potential to introduce an off-by-one-frame error, and vblank_run() doesn't help at all.

And then you reach TH02, which asks for way too much to happen within a single frame, in plain sight, and with no palette tricks to hide it. The screen transitions into and out of its HiScore screen are by far the worst example:

Recorded at 1.9968 × 17 = 33.9456 MHz for a change to magnify the jank that you perhaps wouldn't see at higher clock speeds.

These screen transitions exhibit no less than 6 landmines and 2 bugs:

  1. 💣 Frame 1 shows how TRAM (containing the actual menu text as gaiji) gets cleared immediately, but VRAM (containing the shadow) remains untouched as ZUN decides to load HUUHI.DAT first.

  2. 💣 While the following VRAM clear appears to produce a well-defined black frame 2, it's anything but well-defined, as the load operation only happens to conclude within VBLANK by sheer chance in this recording.

  3. 💣 Frame 4 is wild. First of all, the code still hasn't waited for a single VBLANK signal ever since entering the menu, and therefore shouldn't be writing to TRAM to begin with.
    But even then, you wouldn't expect to see only the name and nothing else on the scanlines of a score record in such a partial rendering. How can TRAM operations possibly be that slow? This almost seemed as if I was missing some crucial timing-related detail about the hardware. But in the end, what we're seeing here is simply Neko Project not actually using scanline rendering for the text layer. If you write to a TRAM cell, Neko Project just marks the entire 16-pixel row to be redrawn during the next screen refresh event.

  4. 🐞 Frame 5, then, is the first well-defined frame that actually renders the way it's defined in the code. The green 東方封魔録 logo is indeed only meant to be visible from the next frame onwards. This certainly meets all criteria for a bug, but the debloated build isn't allowed to fix those. In fact, it needs a dedicated conditional branch to preserve this bug.

  5. Once you leave the menu, you'll first have to sit through a stylistic and non-productive 20-frame delay, before… the screen switches back to the last frame rendered before the delay on frame 77?
    💣 By that point, we're technically already back to the main menu, where the first thing ZUN does is to switch from double-buffering back to single-buffering with VRAM page 0 shown. If you happened to leave the menu by hitting a key on the 50% of frames where VRAM page 1 is shown, the screen will therefore flip back to the frame rendered before the 20-frame delay, and keep it visible while master.lib decodes the title screen image.

  6. 💣 This decoding process finishes after ~20.4 frames in this recording, near the middle of frame 98. Clearly, we then have to immediately switch the hardware palette to the one we just loaded. Let's completely disregard that we're probably not in VBLANK, or that the screen is still showing the last High Score menu frame… :zunpet:

  7. Then, we need to get the image onto both VRAM pages. 📝 As we found out in Part 2, a low-clocked 386 is pretty much the most suboptimal system for master.lib's packed→planar conversion code, and 12 frames exactly match the performance we would expect from Neko Project at 33 MHz.
    💣 But that only rendered the image to the invisible VRAM page 1. We could now temporarily show page 1 after the next VSync signal to hide the pretty much guaranteed multi-frame VRAM writes… but nah, who cares except for some researcher 28 years later. By leaving VRAM page 0 on screen, ZUN doesn't even attempt to hide the jank that is about to occur. Once again, he reaches for master.lib's graph_copy_page(), 📝 whose slowness I already talked about in Part 2. At 33 MHz, Neko Project takes 3 frames to copy one page after another, leaving us with two frames of mixed pixels. This can be even worse on real hardware: On 📝 spaztron64's K6-2-upgraded and southbridge-bottlenecked PC-9821V166 model, this copy took 100 ms. I was able to watch every single bitplane getting individually copied in the recording. Unpacking the .PI image a second time would have been faster on that machine. :onricdennat:

  8. 🐞 Also, ZUN should have definitely cleared TRAM before the page copy instead of deferring this responsibility to the main menu rendering code. Since we then return to the main menu's VSync-timed loop and regularly wait for VSync while the scoreboard remains on screen and part of the current logical frame, this is not a landmine.

Compare this with the debloated version:

Then, I did that 13 more times for the other screen tearing landmines fixed in this build. And no, these new builds don't even fix every instance of this issue…


Intermission: Handling unused code on fork branches

Given that all of these improvements are taking place on the debloated branch, it's time to decide on how to handle the biggest unneeded obstacle in the way of our portability efforts, after 📝 I procrastinated this question 2½ years ago.
In Shuusou Gyoku, I've been trying to retain every single line of unused code in a dedicated directory, not least because that game has 📝 some very wild effects that should be reasonably preserved. The problem with this approach is that all this unused code quickly stopped compiling as I started to refactor the game into its current cross-platform state. For discoverability, this is still better than outright deleting the code and expecting people to read pbg's original codebase, but it's not all too practical either. :thonk:

In the ReC98 codebase, we have a different situation: All the unused code doesn't just exist at some old commit that maybe won't even compile going forward, but is an integral part of the master branch. Therefore, removing this code from fork branches is not only in line with their goals, but also completely non-destructive, since its compilable form on master keeps getting maintained for a handful of building platforms.
Then again, I like the added overview and discoverability of the Shuusou Gyoku approach. So let's meet in the middle: From now on, the debloated branch will only keep unused code in the form of its declarations and some short explanatory comments, in files within the unused/ directory whose names point to the actual implementations on the master branch.

Funnily enough, unused code wasn't even the main reason why TH01's ANNIV.EXE lost 10,834 bytes between the previous and current builds. Although TH01 is the one game with by far the most unused engine code, that code only made up 3,728 bytes of that difference. The rest came from the work surrounding the zero-copy unpacker and the few portability features that already made sense to be rolled out for this game. Yes, TH01 really is that bloated.


Merging TH02-TH05's OP.EXE and MAINE.EXE/MAINL.EXE

Onto the second most exciting feature, 📝 as motivated by the blog post from May! A true single-executable build 📝 never looked that viable for TH04 and TH05 to begin with, so let's just go for the one viable partial merge that makes sense for all of the four games. With all of MAINE.EXE/MAINL.EXE being position-independent, the remaining bunch of ASM code there isn't much of an obstacle either.

And once again, this merge means that we have to resolve all 📝 binary-specific inconsistencies at once. While ZUN thankfully eliminated most of them by the end of the PC-98 era, the scorefile code remained inconsistent until the very end, 📝 as Part 3 already mentioned. Hopefully, this is the second-to-last time I have to mention these formats…
Funnily enough, all of their most noteworthy inconsistencies are found in how these formats deal with corrupted files:

The actual merge then indeed delivers what we were hoping for: In three of the four games, the added unique code from OP.EXE and MAINE.EXE/MAINL.EXE comes in at far below the 20,512 bytes we freed by removing 📝 Borland's C++ exception handler, both in the binaries themselves and in their loaded in-memory state.
But it's TH05 where both OP.EXE's expanded Music Room and MAINE.EXE's Staff Roll and All Cast sequence add so much unique data that the initial merge ended up slightly larger than the size of the original MAINE.EXE. Getting the binary and run-time size of the new DEBLOAT.EXE below that point required every trick in the book and then some. The more critical tricks were good ideas in their own right:

But the rest of them definitely crossed over into silly micro-optimization territory:


Alright, another idealistic bonus goal reached! That means we're only missing a single aspect to reach feature parity with the debloated TH01 build:

Replicating TH02-TH05's GAME.BAT in C++

In TH03, this is slightly more involved. We not only need to launch PMD using this technique, but also apply it to the INTvector set program and SPRITE16. 📝 You know the way this goes:

…actually, wait a moment! TH02-TH05 don't even use the C heap, and the master.lib heap works in a completely different way. Borland implemented their heap in a traditional sbrk(2) style that dynamically resizes a program's main DOS memory block as needed, which is how we end up with the whole concept of the heap growing in one direction. master.lib, however, needs to place its heap entirely within a single pre-allocated and fixed-size block of memory. And since mem_assign_dos() simply allocates this block via the usual DOS INT 21h, AH=48h API, it doesn't matter where on the DOS heap this block is located. This means that we don't even have to do the error-prone witchcraft of pushing these TSRs to the top of conventional memory that we had to do for TH01! Right?

Just like last time, these visualizations are not even remotely to scale.
Also, note the small gap between SPRITE16 and PMD. That's supposed to be PMD's environment segment, and DOS does allocate it, but PMD explicitly frees it before going resident. Sure, it's just a few bytes on real DOS, but still, very considerate!

Unfortunately though, the fixed position of all these TSRs would still prevent the game allocation from being replaced with a binary that asks for more memory than the one this block was initially allocated for. In TH01, this would have been a minor issue because it only applied to hot-reloading the single DEBLOAT.EXE or ANNIV.EXE that contains all game code. For the other four games, however, we still keep the larger MAIN.EXE as a separate binary, and most likely will do so for the foreseeable future. And we're surely not getting into the business of moving already allocated TSRs…
So we're back to the technique from two years ago after all. Let's precalculate the size of each TSR, push that TSR to the top of conventional RAM by temporarily claiming all free memory minus its expected size, and then we get…

…a ZUN.COM spawn failure from DOS as we try to start the ZUNINIT sub-binary. :zunpet:
Yup. Thanks to ZUN's fantastic idea of bundling these small utility tools and TSRs into a single binary that's larger than each individual TSR, we can't just reuse the strategy that worked for TH01. DOS must load the entirety of ZUN.COM into conventional RAM before the bundling code gets a chance to shift the selected sub-binary to the top of the program's memory block and then reduce the size of that block.
So how are we going to solve this?

Sure, we could allocate the resident structure into the gap left by ZUNINIT's instance of ZUN.COM, but that's just a few bytes. Keep in mind that the (env.) blocks are much smaller than they appear here. Thus, the free blocks are also a lot larger in reality, but still not large enough to fit PMD and its music buffer.

But if we can't get rid of ZUN.COM's high load-time memory requirements, how about using that memory more productively? Is there a way we could maybe spawn the other TSRs into the hole left by ZUN.COM after it went resident?
Let's take a step back from individual TSRs and instead look at the full picture of spawning a bundle of TSRs in a defined order. First, we determine both the binary size (file size of the .COM binary + Program Segment Prefix + 256 bytes of stack) and the resident size (the size of its memory block after it goes resident) of each TSR. With these metrics, we can calculate a minimum and resident size for the full bundle by simulating the TSR spawns in order:

uint32_t bundle_size_min = 0;
uint32_t bundle_size_resident = 0;

for(const auto& tsr : tsrs) {
	// Since DOS has freed all excess binary memory before we get to spawn a new TSR, the new
	// one will end up next to the previous resident allocations. We only need to consider
	// the previous minimum size because it might be larger than the one we calculate here.
	bundle_size_min = std::max((bundle_size_resident + tsr.size_binary), bundle_size_min);

	bundle_size_resident += tsr.size_resident;
}

Let's step through the bundle construction for TH03:

TSR Binary Resident Bundle minimum Bundle resident Naive
ZUNINIT (ZUN.COM) 23,276 1,056 23,276 1,056 23,276
SPRITE16 (ZUN.COM) 23,276 36,528 24,332 37,584 59,804
PMD86.COM 29,295 30,144 66,879 67,728 89,948

Then, we only need to resize our main memory block a single time to leave a gap at the top of conventional RAM whose size matches the larger of the minimum or resident bundle sizes. If we then spawn the TSRs into this gap, we indeed save 22,220 bytes over the naive approach! Let's visualize the resulting memory layout with TH02 because there's a nice detail with MMD and PMD:

MMD also frees its environment segment before going resident, so we can subtract the size of one environment segment from the reserved heap size if we spawn both PMD and MMD.
Unfortunately, one of these gaps will always remain with this approach. We could get rid of it by spawning MMD and PMD first, which would merge it into the remaining free memory. But then, we'd have nothing to spawn into the hole left by the ZUN.COM binary for ZUNINIT, and we'd be back to the naive fragmented situation above. Thus, this trick remains unique to TH02.

However, there's one crucial detail in all of this that would prove to be more complicated:

Calculating correct resident sizes

In TH01, this was no big deal. MDRV98 was the only TSR we had to care about, and there was no reason not to just replicate its simple resident size calculation within the code. After all, people would either run the version bundled with the game or the smaller previous version if they played on a real-hardware CanBe model. No one really cares about MDRV98 beyond that level; the driver is almost universally disliked for just not being PMD, which managed to attract a sizable community, documentation, and even new developments over the years. A PMD port of TH01 has been one of the most common mod requests as well.
The TSRs in later games, however, are much more flexible. We compile both ZUNINIT and SPRITE16 from source and should therefore expect people to mod them, but these two in particular might just be considered uninteresting and static enough to justify hardcoding their sizes. But this approach utterly breaks with PMD, whose chip-specific variants come in multiple versions depending on the game:

:th02: TH02 :th03: TH03 :th04: TH04 :th05: TH05
PMD.COM 4.8l (1996-12-28)
14,336 bytes
PMDB2.COM
(ADPCM)
4.8l (1996-12-28)
18,496 bytes
4.8o (1997-06-19)
18,592 bytes
PMD86.COM
(86PCM)
4.8l (1996-12-28)
19,904 bytes
4.8o (1997-06-19)
19,984 bytes
PMDPPZ.COM
(PPZ8/CanBe)
4.8l (1996-12-28)
20,768 bytes
4.8o (1997-06-19)
21,024 bytes
The PMD versions that ZUN shipped with each game. The byte size refers to the in-memory TSR size without any music, voice, or effect data added on top.

In theory, nothing stops us from hardcoding these sizes for each game as well. But these physical details about specific PMD versions are even less of a property of the game. There's no reason why modders shouldn't be able to replace any of the hardware-specific driver versions with any other – and given the sizable PMD composer and arranger community, this is a much more likely kind of mod to happen. SSG-EG, anyone?

But how could we figure out the required resident size of arbitrary PMD versions without hardcoding anything? From the outside, we can only really know for sure by running the driver and seeing how much memory it keeps resident…
…so that's exactly what we need to do. The merged binaries spawn each driver three times during setup – once to figure out its size, a second time to remove this test TSR, and a third time to respawn the TSR at its designated place at the top of conventional memory. And if we have such a system in place, nothing stops us from applying it to all other TSRs as well, removing the need to precalculate or hardcode any size… well, except for SPRITE16, which still needs a hack to factor in its extra two blocks on the DOS heap. In TH03, these 2×3 additional processes do slow down startup by about 6 frames on our target 66 MHz Neko Project configuration when compared to the batch file, which should still be tolerable relative to the .PI load times we removed by switching to PiLoad.

The whole feature has a few other nice properties as well:

And then you test with the actual ZUN.COM and notice that you're still not done:

TSR Binary Resident Bundle minimum Bundle resident
ZUNINIT (ZUN.COM) 13,394 784 13,394 784
PMD86.COM 29,383 30,224 30,167 31,008
TSR Binary Resident Bundle minimum Bundle resident
ZUNINIT (ZUN.COM) 65,968 784 65,968 784
PMD86.COM 29,383 30,224 66,752 31,008

It's this TH04 issue that raises the question of whether this whole TSR bundling solution was even worthwhile in the first place. It sure was an interesting problem to solve, but it'd be much simpler and less bloated to just integrate the INTvector set program into every binary. For TH03, we could similarly integrate all SPRITE16 functionality directly into DEBLOATM.EXE/ANNIVM.EXE and still end up with a smaller-than-original binary after removing Borland's C++ exception handler. That would leave PMD and MMD as the only TSRs we'd need to spawn from C++, and those do have good reasons to be separate from game code.
Oh well, gotta get TH03's MAIN.EXE position-independent first…

Also, the usual caveats from two years ago still apply. This whole trick of pushing TSRs to the top of conventional RAM still relies on witchcraft that may not work on certain DOS kernels. For developers, tinkerers, and people who know what they're doing, it does succeed at nicely decluttering the game directory. But for… ahem, distributors, I still recommend shipping the modified version of GAME.BAT and GAMECB.BAT in the package below to defend against any potential stability issues.


Finally, if the performance improvements aren't enough of a reason to upgrade to these new builds, how about an actual new feature? TH03's Anniversary Edition now lets you quit out of the VS Start menu via either ESC or a new menu item, without going through the Select screen. 🙌

Screenshot of the VS Start menu in the P0323 build of TH03's Anniversary Edition, showing off the new Quit option and the version text
Matching the style of the version text to the style of ☪ The Phantasmagoria of Dim. Dream on the other side seemed like the least bad option here. That outline is indeed created by rendering every line 9 times…

And with that, I'm finally done with 2025's most indulgent subproject! Let's quickly check the overall impact on the codebase:

$ git diff --stat debloated~193 debloated -- . ":(exclude)Tupfile.lua" ":(exclude)build_dumb.bat" ":(exclude)unused/"
[…]
259 files changed, 4145 insertions(+), 8099 deletions(-)

That's almost 4,000 lines of ad-hoc PC-98-native graphics code, bloat, landmines, bloat- and landmine-documenting comments, and binary-specific inconsistencies removed from game code, in exchange for…

git diff --stat master~203 master -- platform/
[…]
28 files changed, 2213 insertions(+), 258 deletions(-)

…not even half that many additional lines in the platform layer. And here's what all of this compiles to:

Richard Stallman cosplaying as a shrine maiden ReC98 (version P0323) 2025-09-29-ReC98.zip

After the Shuusou Gyoku debacle and the many last-minute fixes that cropped up while I was writing this post, I'm not particularly confident in these builds, despite the weeks of testing that went into them. Still, we've got to start somewhere. At least for TH03, we're bound to quickly find any issues that slipped through the cracks while I'm implementing netplay into the Anniversary Edition.

Next up: The very quick round of 📝 Shuusou Gyoku maintenance and forward compatibility I announced in April, to clear out the backlog a bit. This whole series also really stretched the concept of what 11 pushes should be, so I'll charge 2 pushes for that maintenance round to compensate. In exchange, I'll also incorporate a small bit of new Windows 98 feature work, since it fits nicely with the cleanup work.

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous], Yanga, Ember2528
🏷️ Tags:

Part 3 of 📝 the 4-post series about the big 2025 PC-98 Touhou portability subproject, and we actually get to move some percentages on the front page with this one! For once, there truly isn't a lot to mention about most of these five disconnected small-feature decompilations, so let's go for more of a touhou-memories style and string together a few shorter bullet points and paragraphs. For even greater brevity, I'll also use the ZUN code issue emoji you might already know from Twitter or Bluesky: 🐞 denotes a bug, 💣 denotes a landmine, and 🎺 denotes a quirk.

  1. Revising TH02's main menu
  2. Finishing TH03's High Score menu
  3. TH04's title screen animation
  4. TH05's All Cast sequence
  5. Finishing TH03/TH04/TH05 scorefiles
  6. A typo in TH04 Reimu A's Good Ending

Revising TH02's main menu

This was one of those old decompilations from 2015 that I really wanted to bring up to current standards before the debloated branch would roll out the new more portable and performant blitting code. Replacing the magic-number coordinates with constants and calculations revealed 📝 the usual off-by-one text positioning bugs in the Option menu, despite ZUN still using monospaced text in this game…
As for more unique and exciting details in this screen: ZUN's defined gaiji strings contain an unused adaptation of TH01's blinking HIT KEY text. On screen, it might have looked something like this:

Mockup of the TH01's &HIT KEY& string in TH02
This string is so unused that we don't even know its intended position, though.

Finishing TH03's High Score menu

At the end of 2021, 📝 I already decompiled most of this menu, but left two functions in ASM due to push scope constraints. Originally, I thought that this menu would need a few changes to address a certain scorefile inconsistency I'll mention in Part 4, but I ended up finding a better solution. Still, we got one interesting discovery per function out of it:


TH04's title screen animation

This decompilation was necessary because its palette manipulation code did the very dubious thing of accessing the palette in a freed .PI slot. I don't think that the stylish effect of separately whiting in the image's black outlines is appreciated enough. And yes, that formally was the last non-RE'd tiny bit of any OP.EXE binary!

Also note that single black pixel in Reimu's gohei. :zunpet:

TH05's All Cast sequence

This sequence contained the last not yet decompiled instance of 📝 masked crossfading, which the debloated branch wants to replace with our single optimized implementation.
Most picture and text cues in this sequence are synced to the BGM, using PMD's AH=05h function to retrieve the current measure. And yes, that's measures, which is indeed the only time unit you get from PMD. The cues appear to be timed based on beats rather than measures, but the secret there is that ZUN simply wrote Peaceful Romancer in the internal time signature of 1/4. Just in case anyone tries to mod this BGM and starts wondering why the sequence suddenly progresses more slowly. I'll just use beats below since it's shorter.
Any cues that don't appear synced only do so because of – you guessed it – weird ZUN code issues.


Finishing TH03/TH04/TH05 scorefiles

Well, at least as far as decompilation is concerned. Cleaning up all these binary-specific inconsistencies on the debloated branch will be just as annoying as reconstructing them in the first place, and I won't even get it all the way done within these 11 pushes. TH05 made this even worse by continuing its general trend of taking TH04's slightly bloated but overall fine C++ code and needlessly rewriting it in micro-optimized and only semi-decompilable ASM. If you still believe that the master branch is a good foundation for any kind of serious work, this file should convince you otherwise.
Two more discoveries here:


Finally, I stumbled over a script bug in TH04's Good Ending for Reimu A:

Screenshot of the &,4魔理沙& script bug in TH04's Good Ending for Reimu A
The 2014 static English patch fixes this issue. That's probably why this isn't talked about anywhere.

This looks unintentional, and the same line in Reimu B's Good Ending confirms that this is indeed a typo:

\p,ed07.pi \=0,4 魔理沙:なんだよ、そりゃ\ga9\s160\c
\p,ed07.pi \==0,4 魔理沙:なんだよ、そりゃ\ga9\s160\c

The 📝 cutscene command reference tells us that the line in the Reimu B variant is preceded by \==, the picture crossfading command, followed by both possible parameters, 0 and 4. Reimu A's script, however, lacks that second = and instead spells out \=, the immediate picture display command, which doesn't take a second parameter. Thus, the command stops reading after the 0 and leaves the trailing ,4 as text to be displayed in the newly started box. The line break is then ignored as usual, causing 魔理沙 to be displayed right next to these two characters.

Whew! Once again, this did turn into more of the typical ReC98 research by the end after all. :godzun: And that was just 75% of the pushes assigned to this post, because the rest already went towards the debloating work. Next up: Concluding this series and actually applying all this research to the games.

📝 Posted:
💰 Funded by:
Splashman, Ember2528
🏷️ Tags:

Talk about a nerd snipe! I just wanted to take the first meaningful step towards getting PC-98 Touhou portable. But then, that step massively escalated and resulted in not only the single biggest subproject of 2025, but also in the most productive dev cycle this project has seen since the beginning of the crowdfunding era. 405 commits over 11 pushes, and touching on so many topics that writing a single blog post would have been way too much for even me to handle. So let's try something new and split this delivery into four "smaller" and thematically more focused posts that I'll release in quick succession:

  1. Deciding on a porting strategy
  2. Accurate slowdown
  3. Picking a CPU clock speed for emulators
  4. Frame-perfect
  5. Port implementation thoughts

So, how do we get the PC-98 Touhou codebase into a portable state? That entirely depends on what kind of port we want in the first place, and how much of ZUN's code we are willing to change. Three particularly efficient options immediately come to mind:

  1. On one end of the spectrum, we have a preconfigured PC-98 emulator with disabled configuration options and a stripped-down UI that tricks people into believing they're playing a port and prevents them from accidentally breaking the working configuration.
    This might sound like a joke, but it's unironically the most efficient and pragmatic solution that will be good enough for the overwhelming majority of players. If you ask people what they expect from a port, they primarily name ease of use and not having to configure emulators. Both of these can be solved with a preconfigured emulator and thus don't justify the monumental engineering effort of the more complex porting methods described below. That effort also wouldn't be justified if people just wanted a port and had no standards regarding its technical implementation, besides maybe no input lag. Someone has to put in the effort to solve every little challenge on the way from PC-98 to modern systems, and if that effort is not appreciated…

    By the way, I have no idea what people are talking about when they claim that PC-98 Touhou has input lag, because there sure is nothing like that in the code that would indicate anything above 1 frame / 17.7 ms for the in-game portions. Any investigation into these issues would therefore have to come from someone else, I'm afraid. Everything points to input lag being the result of misconfigured emulators.

    This is not like Shuusou Gyoku, where a port to modern APIs made sense because almost every subsystem still performs suboptimally on modern Windows even after you set up DxWnd, a better MIDI synth, and whatever people are using to make modern gamepads work with ancient DirectInput these days. If you correctly set up a PC-98 emulator, the games do run at full speed, and are highly likely to continue running fine after emulator and operating system version updates.
    Thus, can we conclude that wishing for ports is primarily a symptom of the Touhou community's past failure and negligence to spread preconfigured emulators to people? Because this surely shouldn't be a problem in this day and age anymore? While I did my part way back in 2013, it would take until spaztron64's 2021 package for the community at large to finally wake up and realize that this was a problem. Nowadays though, we have at least three decent packages made by separate people that have my personal seal of approval. And yes, this even includes the offering you can obtain at a certain mountaintop place of worship. That site used to be infamous for pushing out slop that violated their own mission statement and externalized costs to the tech support departments of their supply chain, but I'm glad to announce that they've leveled up and now provide a decent solution. And once they remove that archive inside their archive, it will be even better!
    Still, if your emulator configuration guides are presented more prominently than your preconfigured emulator downloads, you're doing a disservice to the community. Make guides available, yes, but clearly label them as background information for people who already played the games and then got curious about this old Japanese computer architecture.

  2. OK, but what if you do have standards and would appreciate a technically more solid port that removes layers and maybe even improves the games beyond the limits of the PC-98's architecture? If we take a single step towards native code and native performance, we end up with what people call a "static recompilation" these days. As I explained in the FAQ entry I wrote last year, this kind of port would still emulate the graphics, sound, input, and memory subsystems of a PC-98, but it would cut out CPU emulation.
    For PC-98 Touhou, this is actually quite a huge deal: CPU speed is the single biggest point of contention when configuring PC-98 emulators for Touhou, and the vastly different x86 cores of each emulator result in vastly different performance characteristics once you start to benchmark them all more thoroughly. With no more CPU cycles to count, we'd also lose all the VRAM access latencies that emulators typically strive to replicate, and thus pretty much guarantee 0% slowdown in the resulting port. While the aforementioned kind of modded emulator could theoretically also remove cycle counting and VRAM latencies, it would still interpret x86 instructions and thus have a harder time actually reaching the native performance required for 0% slowdown.

    This kind of port would also find immediate acceptance within the gameplay community. Since it would only take ZUN's original binaries as input and ignore our reconstructed source, we're guaranteed to retain the exact gameplay logic. The entire instruction translation process would be automated, leaving no room for modernizing the codebase by hand 📝 and accidentally breaking gameplay. We'd still have to defuse at least a few landmines to get the port running without issue, but those would be limited to things like filename casing, for example. Nothing even remotely close to gameplay code.

  3. On the other end of the spectrum, we have something like uth05win: A fully native rewrite of the graphics code that takes every liberty and cuts every corner it needs to rework the game into something that naturally renders within a modern graphics API of our choice. Unlike uth05win, however, our ports will be based on complete decompilations and thus retain the original gameplay code instead of freely rewriting certain parts because they look strange. In turn, we would basically scrap all of ZUN's menu and cutscene code and write quirk-free and sane replacements. Part 4 will drive home just how much more relaxing this course of action would have been…
    There's certainly an argument to be had that a modern port should reimagine the game to look and feel as modern as you can get within the original assets, and not stick to PC-98 limitations. After all, the unmodified PC-98 version is always there for you to play on your correctly configured emulator, right? In fact, if we ever wanted to port the games to weaker systems or consoles, this kind of port would be our only option.

But as you might have guessed, we're not going for either of these options:

  1. The first option doesn't even need anything from ReC98. Even the sleekest imaginable release could be done by anyone who either knows about PC-98 emulation or keeps in contact with someone who does, and is comfortable messing around with emulator source code. In fact, I'm not even a particularly qualified person for this job; I frequently mess with emulator configurations for research reasons, and then forget the correct values for certain obscure settings. :tannedcirno:
    This is such an obvious and efficient move that I seriously wonder why nobody has done it so far… but then again, I thought the same about every other idea I ended up doing myself in this space over the past 15 years. If that idea sounds great to you, feel free to go ahead – it represents the opposite of what this project is about, so the resulting fame is yours for the taking. If y'all see "ports" popping up from a place that isn't this project in the not-too-distant future, you can be pretty sure that their developers followed this strategy.

  2. The second option would indeed be an interesting project in its own right, as I've stated in the FAQ entry. But if you remember 📝 the last time I thought about static recompilation, I was way more excited for recompiling the old compiler we use for the PC-98 code rather than the games themselves. Ironically, this is primarily because of how much a recompilation would complicate the new features we plan to add to the games. Since I can only develop new features on top of a previous reverse-engineering effort, they will necessarily remain tied to the PC-98-native version of the codebase at first. How would we port them, then?

    • Do I continue developing these features for the PC-98 and then simply recompile them along with the rest of the game? The issue with that approach is that most features won't have a version that could work with the original ZUN codebase that we'd prefer to recompile. For everyone's sanity, most features will only exist as part of a respective game's anniversary branch, which in turn is based on the rearchitected and de-landmined debloated branch. Recompiling these branches would undermine the entire selling point of delivering the pure, untainted ZUN code that would have probably convinced the gameplay community to invest in this strategy in the first place. It might be good enough for the rest of the community, but if I'm going to rearchitect the PC-98 codebase anyway, would there even be a point in developing the required recompilation techniques on the side? Would this give us ports faster than following a more classical approach?
    • Then again, I could still try slicing out the code for these features in a way that would allow them to be shared between the rearchitected PC-98 and recompiled ZUN codebases. But that's bound to create an unnatural and awkward mess that's probably even worse than the way I have to arrange ZUN's code on the unmodified master branch. I'd definitely charge extra for that.
    • Do I just copy-paste and maintain two versions of the feature code for both platforms, manually transferring all required reverse-engineering to the recompilation? That might feel very dull, but it's probably more efficient than any attempt at sharing that code.
    • Or do I just abandon the PC-98-native codebase? In favor of a pseudo-PC-98 codebase that still very much assumes PC-98 hardware but doesn't actually run on real or conventionally emulated PC-98 hardware… :thonk:

    The last point in particular demonstrates just how little of a help a recompilation would actually be. Since it would continue to emulate the PC-98's graphics system, I'd still have to write any new graphics code against the PC-98's planar and two-page VRAM. Automatically porting the games to a friendlier and more generic rendering paradigm is infeasible for even an advanced recompiler: Every part of the original game expects PC-98 hardware, and a generic rewrite requires engineering decisions at a much higher level than the individual x86 instructions a recompilation operates at.
    And ultimately, it's these individual features that people should be (and mostly are) hyped for. Community-usable replays, translations, and TH03 netplay can all be implemented natively on PC-98. Sure, netplay would be easier to develop and easier to use within a TH03 recompilation since we can just use the native network stack of your host OS 📝 without any intermediaries. But developing both a recompiler and netplay would still take longer than 📝 following through with our current PC-98-native plan.

  3. The third option is actually quite popular, or would at least be acceptable to the majority of the general fandom. This is what non-technical people have in mind anyway when they think about ports, even if they don't confuse ports with remakes.
    To find out just how acceptable such a port would be, I picked screen fade effects as a representative detail for the corners that such a port would cut, and asked how people judge the natural alpha-blended implementation in uth05win against the palette-based method you'd use on a PC-98. Surprisingly, a whopping 79% of respondents don't have any problem with a port using whatever is most natural for the system it runs on. And that's 79% of my audience, which certainly is at least somewhat aware of PC-98 hardware details and the limitations that shaped these games into what they are. Of course, the 21% of die-hard PC-98 supremacists would then loudly complain that such a choice would make the port literally unplayable, but we could easily dismiss them by pointing to the poll where the community decided in favor of the smoother option. After all, ZUN's intention was to have a fade, and manipulation of a 12-bit color palette was simply the only tool he had on a PC-98.

    However, the gameplay community has much higher hopes for ReC98. Both them and I don't just want to supplement the original PC-98 versions with something that's playable on modern systems, but

    > replace the need for the proprietary, PC-98-exclusive original releases and their emulation for even the most conservative fan

    as I wrote back in 2014. Sure, the community can manage spreading pre-configured emulators for a few more years, but wouldn't it be great if they could stop doing that at some point in the far future?

So if all the "easy" solutions either don't have much of a purpose or disappoint in some way, we're only left with the hard one: A classic, manual port done primarily for the sake of solving an engineering challenge. But hey, this means that it'll also produce tons of blog posts for all of you to read, which apparently is at least equally as popular as actually playing the games. :onricdennat:
Here's what we're going to do:

Sure, the main drawback here is the immense development effort required. But in exchange, the port retains readable and moddable code and continues to deliver the insights that this project has always stood for. Imagine stepping through gameplay code using a native C/C++ debugger at your native screen resolution!


But before we can get to how I'm going to do all that, there are two popular misconceptions I have to address.

Accurate slowdown

The initial version of Maribel Hearn's new emulator guide for PC-98 Touhou had the following sentence that spaztron64 and I successfully lobbied against:

Note that none of the emulators have accurate slowdown; the slowdown will not match real hardware.

Objectively, this is a true statement. Neko Project's i386 core is the closest thing to cycle-accurate PC-98 emulation we have, as its per-instruction cycle counts match Intel's documentation. But even its performance characteristics are wildly inaccurate compared to a real PC-98 system with a 386, as we're going to see in the next blog post.
The problem I have with this sentence is that it's very misleading in this specific context. The mere mention of accurate slowdown in a beginner's guide on PC-98 emulation paints said slowdown as something desirable and worthy of preservation. It evokes stories of console speedrunners and emulator developers who deal with fixed, well-defined hardware where the concept of accurate slowdown makes sense. Stories that probably originated from a time before decompilations of classic games became commonplace, when it was hard to say whether a particular instance of slowdown was intended or not. And even with a decompilation, these things remain a matter of interpretation if you can't ask the original developer. Thus, it's completely understandable why observable behavior of real hardware remains the one benchmark of accuracy and quality that people can understand and rally around.
The PC-98, however, is very much not that kind of fixed system, but a computer architecture that spanned 18 years of hardware evolution, from 1982 to 2000. Even if we reduce this list of models to the ones that match ZUN's stated minimum system requirements, we're still looking at 7 years of hardware, running different microarchitectures at different clock speeds and with different resulting bottlenecks. If there's such a big variety of systems, which particular slowdown behavior should the ports even preserve?

The obvious answer is "the one from the exact system ZUN wrote these games on", but we don't know that system. 📝 Last year, I claimed that ZUN developed these games on a PC-9821Xa7, but I didn't add a citation back then and can't find one now. The closest piece of related known info is this note on the Amusement Makers page that hosts the official downloads for the trial versions, listing three PC-98 models that they confirmed to run the games without issues:

なお当サークルでは ・ NEC PC-9821Xs i486DX2 66MHz ・ NEC PC-9821La13 Pentium Processor (P54C) 133MHz ・ EPSON PC-486MS AMD 5x86-P133 換装 などで正常に動くことを確認しています

These models are one whole CPU generation apart and their clock speed differs by 100%. Which one of these is supposed to have the accurate slowdown?
But even if we knew, it doesn't matter. The README is clear about ZUN's intentions:

  PC98、またはその互換機専用です。(EGC搭載機種)   386以上で動作しますが、486ぐらい無いときついかも知れません。実際は   VRAMアクセスが速いことが重要です。   オプションで、処理の重い演出を減らすことも出来ます。   また、MSDOSが必要です   CPUは486(66MHz)でしか動作確認を取っておりませんので、あんま   り遅い機種ですと不幸かもしれません。   ちなみに、486(66MHz)ですといっさい処理落ちや、欠けなどは出ませ   ん。
  PC98、またはその互換機専用です。(EGC搭載機種)   CPU:486(66MHz)以上推奨      (386でも動作はしますがゲームにならないでしょう。       ただし、386命令を使っているので286は不可です。       また、低クロックの486でもかなり処理落ちするかも知れません)   実際は、CPUの他にもVRAMアクセスが速いことも重要です。
  PC98(PC98-NX除く)、またはその互換機専用です。   (EGC搭載機種)   CPU:486(66MHz)以上推奨      (386でも動作はしますがゲームにならないでしょう。       ただし、386命令を使っているので286は不可です。       はっきり言って、486でも66MHz位はないとかなり       処理落ちするかも知れません)   実際は、CPUの他にもVRAMアクセスが速いことも重要です。
  ●PC98(PC98-NX除く)、またはその互換機専用です。    (EGC搭載機種)   ●CPU:486(66MHz)以上推奨      (386でも動作はしますがゲームにならないでしょう。       ただし、386命令を使っているので286は不可です。       はっきり言って、486でも66MHz位はないとかなり       処理落ちするかも知れません)       実際は、CPUの他にもVRAMアクセスが速いことも重要です。

If ZUN recommends a 486 or faster to avoid slowdown, this necessarily means that any unintentional slowdown is indeed unwanted.
Also, note how only TH02's README claims that the game was exclusively tested on a 66 MHz model, which is highly likely to be that PC-9821Xs listed on the Amusement Makers page. Did ZUN switch to a faster PC-98 model for the development of the last three games? That late into the architecture's lifespan? Or did he merely test the game on faster models while the main development still took place on his 66 MHz model?

Picking a CPU clock speed for emulators

Of course, this now creates a problem for everyone wanting to configure emulators for PC-98 Touhou. If the ideal Touhou machine is infinitely fast, we should always pick the fastest possible emulated CPU speed, right? Historically, this has been bad advice: Most emulators will then stick to exactly the amount of cycles per emulated second you specified in the menu, slowing down the emulated system as a result. It's this kind of emulator behavior that gets players to manually look for "the sweet spot" – the maximum possible explicitly specified CPU clock speed that still manages to render without slowdown on their system. This is a tragedy for many reasons:

Ever since 2019, however, SimK has been developing an Async CPU mode for Neko Project 21/W, which finally got stabilized in ver0.86 rev.93, back in April of this year. Activate this mode with the Screen → CPU clock stabilizer and Screen → Dynamic CPU clock adjustment options, and then you should theoretically be able to finally stop worrying: Just specify the maximum possible clock speed in the usual configuration menu, and Neko Project will dynamically reduce the emulated clock speed to the fastest speed your system can handle.

Screenshot of Neko Project 21/W's Async CPU options

Then, the games are supposed to run similarly to how a correctly configured Anex86 has been running them all along, but with an additional 21 years of emulation accuracy improvements.
Sadly, this mode still needs a bit of work. Excessively high clock speeds will result in wildly fluctuating frame rates and even BGM tempos during the first few seconds of a game session as Neko Project 21/W apparently takes a while to find the optimal clock speed. Even afterwards, emulation remains noticeably slower than Anex86:

This is Neko Project 21/W ver.0.86 rev.95 configured with a clock speed of 1 GHz, running on an Intel Core i5-8400T. The fluctuations are not nearly as intense during the rest of a game session, but remain noticeable throughout.

But what about DOSBox-X, the other good emulator recommended these days? This Async CPU mode is very similar to the cycles=max option that DOSBox-X has supported all along. If you try running my 📝 past and future blitting benchmarks using this option, you can observe how DOSBox-X also starts with a low cycle count and then gradually speeds up to accommodate the actual processing load.
In the much less synthetic test case of running PC-98 Touhou, however, DOSBox-X's cycle adjustment reveals itself as much more sophisticated than Neko Project 21/W's implementation. The showdetails=true option reveals that the cycle count does fluctuate quite heavily, which does translate into minor BGM dropouts particularly near the start of a session. But these dropouts are tiny in comparison to what you'd get on Neko Project 21/W, and the framerate remains stable throughout.

As for overall performance, DOSBox-X's simple interpreter core is not nearly as optimized as Neko Project 21/W's interpreter and peaks at roughly half of its speed. The dynamic_nodhfpu core, however, solidly beats Neko Project 21/W by the same 50%. And it's this added bit of performance that makes all the difference: It eradicates slowdown in most of the usual spots in PC-98 Touhou where emulators and even Anex86 typically struggle, and turns DOSBox-X into the first emulator to finally beat Anex86's performance on the same hardware in all the workloads that matter. The dynamic core still doesn't quite reach the speeds of the hypothetical infinitely fast PC-98 on my outdated system, but it remains the most reliable configuration option when it comes to delivering ZUN's intended vision. If we ignore the BGM dropouts. :thonk:
Just make sure to explicitly select the dynamic_nodhfpu variant, not the regular dynamic core. The latter is infamous for recompilation errors in FPU code that break TH01 gameplay. While that specific issue is ostensibly fixed, I still managed to occasionally run into smaller FPU-related bugs in current DOSBox-X versions. Unfortunately, I didn't manage to capture them on video; I would have reopened the issue on the spot if I did.

Screenshot of the ReC98 Shuusou Gyoku build running in a 640×480 window on a Windows 98 desktop, next to the System Properties window (as usual for backport screenshots) and the D3DWindower window (demonstrating how the game was windowed). The game window shows the DOSBox-X with `cycles=max` and `core=simple`Screenshot of the ReC98 Shuusou Gyoku build running in a 640×480 window on a Windows 98 desktop, next to the System Properties window (as usual for backport screenshots) and the D3DWindower window (demonstrating how the game was windowed). The game window shows the Neko Project 21/W in Async CPU mode with a maximum clock speed of 2.4576 × 407 = 1000.2432 MHzScreenshot of the ReC98 Shuusou Gyoku build running in a 640×480 window on a Windows 98 desktop, next to the System Properties window (as usual for backport screenshots) and the D3DWindower window (demonstrating how the game was windowed). The game window shows the DOSBox-X with `cycles=max` and `core=dynamic`Screenshot of the ReC98 Shuusou Gyoku build running in a 640×480 window on a Windows 98 desktop, next to the System Properties window (as usual for backport screenshots) and the D3DWindower window (demonstrating how the game was windowed). The game window shows the Anex86 in Sync 1 mode
Of course, any performance measurement of an emulator with dynamic cycle adjustment can only ever represent a snapshot of the ever-changing adjustment state, and should therefore be taken with a grain of salt. Hence, these screenshots are purely decorative; I just added them because I'm sure that someone would have asked for exact numbers otherwise. Also, the exact relations between emulators are highly dependent on the workload…
And yes, that's a new benchmark! More about this one 📝 in part 2.

(Still, it's remarkable how close Anex86 gets despite its interpreter core, and how it even beats DOSBox-X in MOVS performance. I looked at Anex86's disassembly for 10 minutes and saw big tables of tiny per-instruction functions with custom calling conventions that make remarkably efficient use of the few registers you get in x86. Also, negative offsets? They must have written this entire x86-on-x86 core in ASM.)

While this is great news for players, the whole situation remains very unsatisfying at a technical level. Even if you don't care about the remaining BGM dropouts, running these games at the highest possible emulated clock speed means that you constantly spend 100% of all CPU cores assigned to your emulator just to avoid slowdown and lag in a few particularly CPU-intensive sections. Power saving might be the single best practical argument in favor of a port. :thonk:

Also, all this complexity involved in dynamic cycle adjustment raises one question you might have had all along. Why don't we just leave our emulated CPUs at 66 MHz? After all, ZUN said that 66 MHz is enough to eliminate all slowdown in at least TH02 and TH03, so how about just living with whatever slowdown we'd still experience in TH04 and TH05? This is certainly a healthier approach, much more appropriate for these silly little indie games that were never meant to be obsessed about at this level, and we get rid of those last few BGM dropouts in DOSBox-X!
Well, if that statement was ever correct to begin with, it would have only applied to real hardware and not to emulators. mu021 reported that the final phase of TH02's Mima fight slowed down even at 78 MHz in Neko Project, and part 4 will contain 📝 even more examples of how 66 MHz slows down several effects in menus and cutscenes, and thus paints a wrong picture of them. Hence, choosing 66 MHz for a preconfigured emulator package might have a particularly annoying side effect: If people get used to how slow these effects run on emulators, they might be rather irritated once the modern ports will invariably run them at their intended speed denoted in the code. I can already imagine them yelling too fast!, inaccurate!, and literally unplayable!, oblivious to the fact that they had the wrong idea about these effects all along.
Or maybe it'll all be fine once part 4 has documented these issues in depth. I certainly wouldn't criticize a package for choosing 66 MHz. All choices are unsatisfying at some level…

If only we could optimize the games enough to remove any unwanted slowdown at 66 MHz. Then, people could freely choose one emulator over another for reasons unrelated to performance, because even cycle-limited emulators could then actually deliver on ZUN's statements in the README files… :thonk:
And since we've defined debloating as an integral part of port development earlier, that's exactly what we're going to do.


But can we even do that within our high standards? Obviously, our ports should remain…

Frame-perfect

Since all five games are explicitly timed around VSync, it's immediately clear what we mean by this term:

Everything rendered to a single page of VRAM between two VSync wait loops defines one single logical frame.

If we are double-buffering correctly and the PC-98 system running the game is fast enough to finish rendering such a logical frame to VRAM within two VSync signals, everything is fine: The sequence of frames you can observe on your screen matches the logical sequence of internal frames, and we can easily record this sequence and compare the port against it.
But what about unintentional slowdown? In these cases, ZUN asks the system to do way more work than it can execute between two VSync signals. Notably, this also includes most loading times: Once we add disk access into the mix, we can't guarantee hitting any VSync deadlines anymore, and decompressing all these 640×400 images is quite expensive as well. Obviously, we don't want to abandon our goal of frame-perfection and the comparability of ports just because of this variability, so let's add another rule:

Individual defined frames may be shown on screen for any integer multiple of the frame time.

The reason for the integer restriction is obvious: If we start drawing to the screen in the middle of a frame, we get screen tearing and thus a non-perfect frame – not just because tearing looks bad, but also because the position of the tearing line always depends on the overall performance of the system you run the game on.
The combination of these two rules leads to an immediate consequence:

The games must only ever display complete logical frames.

And now we have a problem. Our rules have just outlawed screen tearing, but nearly every menu and cutscene screen in ZUN's original code has some kind of screen tearing issue. 📝 The Music Room of TH02-TH04 represents probably the worst example as it suffers from screen tearing on every single frame:

Screenshot of TH04's Music Room, demonstrating the screen tearing landmine that the original game exhibits on every frameColored visualization of which lines correspond to which frame in the TH04 Music Room screenshot
Which frame are we even on? 😵 This landmine is the reason why the first rule explicitly restricts rendering to a single VRAM page. 📝 Check the Music Room blog post for an explanation of what went wrong here.

Also, how would you possibly preserve these tearing lines once you've ported the game? After all, modern platforms not only imply much faster CPUs, but also completely different rendering methods, especially once we add scaling into the mix.
This can only mean one thing:

It is fundamentally impossible to port the unmodified codebase of PC-98 Touhou and remain frame-perfect to the original release.

You could maybe get there by throwing out the integer multiple rule and accepting teared frames as legitimate. But then you'd have to decide on a particular model whose slowdown behavior you'd want to replicate and lock down exactly – and as I've stated in the section above, that's quite a silly and impractical proposition.

Resolving screen tearing

So, how do we get back to a comparable sequence of well-defined frames? This can only work if we leave the confines of real hardware and instead reach for the infinitely fast PC-98 that ZUN wanted to have anyway. Such a system would never exhibit screen tearing because it would naturally complete all rendering within the vertical blanking interval preceding each displayed frame. Once our code then ends a frame by entering a busy-waiting loop for the next VSync signal, the screen would then get to draw static and well-defined VRAM contents. This behavior is the whole reason why I get to classify screen tearing issues as landmines that must always be fixed, as opposed to bugs that a port could potentially retain.
If we actually had such an infinitely fast PC-98, we could just run ZUN's unmodified code on that system and be done now. But as we've seen above, not even DOSBox-X's dynamic core manages to run PC-98 Touhou at the infinitely fast level we'd need. Also, we wanted to get rid of relying on specific emulators and have already planned to optimize all this code anyway…

So let's defuse each screen tearing landmine one by one by rewriting its code to match the output of an infinitely fast PC-98. This is a lot more feasible than it sounds because these landmines aren't actually caused by a lack of CPU power. Every screen tearing issue comes down to ZUN misplacing certain screen-affecting operations within the hellscape of imperative hardware state mutations that is his menu and cutscene code. You can either hide the issue by throwing an infinite amount of processing power at the problem so that the order of mutations no longer observably matters, or you can just write good code.
In theory, we only have to follow a few rules:

The difficulty of actually pulling this off, however, can range anywhere from Easy to Lunatic, depending on the screen, because of course every one of them is different. Even after these 11 pushes, I'll be far from done. But in the end, we'll have perfect and easily verifiable frame parity between the PC-98 versions and the future ports, even though we had to bend the code a little. Or a lot. Oh well.


If you only opened this post for the required reading part, you can stop reading now. I've got a few more technical thoughts about a few implementation details of the future ports that tend to come up in discussions, but these aren't as essential as the high-level issues above.

So we've now decided on what to do in order to make the ports good, but what are the basic challenges we have to solve in order to port these games to modern systems in the first place? Let's start with a perhaps surprising list of non-issues that some people might perceive as challenges:

In short: If the feature in question is consistently used through an API, it's not a challenge in itself. The hard parts are all the opposite cases – when ZUN suddenly starts writing to VRAM segments or I/O ports in the middle of gameplay code, like he does all over TH01. All of these instances need to be manually cleaned up and abstracted away. Conversely, this is also why 📝 TH02 remains by far the easiest individual game to port – it has the least amount of hand-written blitting code and mostly sticks to master.lib functions.

Instead, the biggest immediate challenge is something far more basic:

🎨 Palettized and planar graphics 🎨

After all, PC-98 Touhou doesn't just view the PC-98's graphics subsystem as an obstacle to overcome, but occasionally makes creative use of both palettes and individual bitplanes. How would we possibly cover these effects in a modern graphics API that will be far removed from these concepts? Three challenges immediately come to mind in that regard:

  1. The whole concept of enforcing a single 16-color palette across the entire screen in a world where 32-bit RGBA is the only reliably available texture format. Shaders offer a simple solution: We simply wouldn't use traditional textures, and just write our own sampler that takes both the original palettized 16-color+alpha image and the global palette as input, and performs a lookup for each texel. But what are we supposed to do in SDL_Renderer's fixed-function pipeline? Use the CPU to update all loaded textures on every palette color change? Split each sprite into a separate texture for each color and consume 16× the amount of VRAM just so that we can use vertex colors for each individual color layer? :onricdennat: Or break down every sprite into a point list to save the VRAM? :tannedcirno::onricdennat:
  2. Any kind of sprite-shaped palette color bit flipping effect, such as 📝 the falling polygons in the Music Room. Effects like these could potentially be hardware-rendered even in a fixed-function pipeline if we split the background image into two and render the polygons using regular triangles with their UV coordinates matched to the pixel coordinates on every frame. But would all the involved interpolation reliably give us the original sharp edges without reaching for a shader to ensure that it does? In any case, this solution would need a completely different implementation for a modern port than it currently uses in ZUN's PC-98-native code, which gets by with less per-frame redraw than you'd think that this effect would need.
    uth05win didn't even get to port the Music Room, which is probably not without reason.
  3. TH01's square-shaped inverting effects used during bomb and entrance animations. Flipping a given bit of a pixel's palette index? Based on what's there before? No way around a shader for this one…
    Note how the flipped cards rip holes into the square trails. I'm not even sure what the TH01 Anniversary Edition would change about the effect, or whether it even should change anything about it. Good luck porting this effect pixel-perfectly without pixel-level access.

However, writing all this custom graphics code for the modern port would run against my previously stated goal of sharing as much code as possible between PC-98 and modern platforms. While shaders are the conceptually simpler solution for all of these challenges, they aren't easy in practical terms, and I already 📝 decided against using them for Shuusou Gyoku for good reasons. Also, is all of this really worth the effort if these games demonstrably don't even need the performance of GPU rendering?
But that only leaves one conclusion:

The future ports of PC-98 Touhou to modern systems will software-render the graphics layer on the CPU.

I know, that sounds very shocking and probably disappointing at first. But at a closer look, it's really not all that bad. These games have been software-rendered all along by not only PC-98 emulators, but by real hardware at mid-90's CPU speeds. You might point to the GRCG and EGC chips as evidence for at least some capacity of hardware acceleration, but I see them more as workarounds for the unfortunate planar nature of VRAM on this Japanese business computer architecture. In the end, "software rendering" only means that the CPU receives access to every pixel in the framebuffer. Once all graphical functionality is neatly abstracted away and the game no longer directly accesses the four physical bitplanes, the ports can store sprites and the rendered graphics layer in the most performant way.
Also, note how I only said "graphics layer". Besides 📝 the obvious candidate of framebuffer scaling, the ports will use the GPU for two more important aspects:

As a result, the software renderer of our hand-crafted ports would still internally produce a graphics and text layer that persists across frames and receives minimal redraws, just like the PC-98 originals did. In fact, it would have to produce the exact same graphics layer if we wanted to port the non-Anniversary Edition, including the tile source area. There's no technical need to keep tiles on the graphics layer in a port, but certain intense shake effects temporarily reveal individual tiles below the HUD:

This definitely counts as a bug to be fixed in this game's Anniversary Edition, but how would we fix this one on PC-98 where we do need the tile area in VRAM? Moving the tiles to another place and patching the 📝 .MAP at runtime?

Applying the palette to produce the final rendered image then raises another set of exciting engineering questions. Would we actually use a palettized 4bpp buffer in memory, storing two pixels in a byte? Perhaps with an 8-bit palette that maps each possible pair of pixels to a pair of 32-bit RGBA values, halving the amount of per-frame palette lookups? Or would we always store an RGBA image and merely offer a palettized API around it? As far as I'm concerned, these challenges are way more exciting than the prospect of locking ourselves into some shader language.

📃 Page flipping 📃

But wait. If the port produces a persistent graphics layer, shouldn't it produce two, one for each VRAM page on the PC-98? From the point of view of a modern port, we really don't need to. We only ever upload one "VRAM page" to the GPU anyway, which is then scrolled and scaled onto one of the GPU's backbuffers inside the swapchain. Then, the game can immediately continue drawing onto the same software-rendered VRAM buffer in the next frame without affecting the GPU output.

Obviously, this rendering paradigm doesn't translate back to the PC-98. There, we must render each frame to either the invisible or the visible page. Also, minimal redraw is crucial because we can neither afford the memory nor the performance to regularly copy an entire 128 KB of pixel data from whatever place to VRAM. As a result, page flips are a common sight in even the highest levels of menu and cutscene code, adding yet another unsightly piece of state you have to keep track of while reviewing and modding the code. I've grown to hate them quite a lot over the past four months because of just how often they are associated with bad code: In most menu and cutscene screens, ZUN just uses the second VRAM page as pixel storage for inter-page copies using the EGC, 📝 whose slowness is a regular topic on this blog. Once you've replaced these copies with optimized blits from conventional RAM, you've not only removed all these page flips and clearly revealed these screens as the single-buffered affairs they've always been, but you've also accelerated them enough to remove any screen tearing issues they might have had at 66 MHz.

Unfortunately, things are not that easy everywhere:

Can we rewrite all of these cases in a way that high-level game code no longer has to care about pages? Can we perhaps even banish page flipping to a new lower level of the architecture that all menus and cutscenes are built on top of, and thus unconditionally double-buffer every screen while still maintaining minimal redraw? Or is none of this worth it and we'll just live with two VRAM pages on all platforms? I'm honestly not sure. And that's just a small preview of the porting challenges that still await us and were far beyond the scope of even these 11 pushes…


As for the commits that are formally assigned to this blog post: It was all maintenance, build system setup, and some debloating work on TH01 around its packfile support that I thought would be necessary but thankfully didn't yet need after all. More about that in, you guessed it, 📝 part 4.

Alright! Improving performance, fixing screen tearing issues, establishing better cross-platform interfaces, and cleaning up ZUN's code to facilitate all of that… I've got a lot to do now. Next up: Getting closer to our performance goals by optimizing all PC-98-native code surrounding the .PI files used for backgrounds and cutscene pictures, since we later want to draw our TH03 netplay menus on top.

📝 Posted:
💰 Funded by:
32th System, [Anonymous], iruleatgames, Blue Bolt
🏷️ Tags:

Surprise! The last missing main menu in PC-98 Touhou was, in fact, not that hard. Finishing the rest of TH03's OP.EXE took slightly shorter than the expected 2 pushes, which left enough room to uncover an unexpected mystery and take important leaps in position independence…

  1. TH03's main/option menu
  2. Picking opponents for Story Mode
  3. Demo Play difficulty?
  4. More variables from TH02's Stage 3
  5. Technical position independence for TH03's MAINL.EXE

For TH03, ZUN stepped up the visual quality of the main menu items by exchanging TH02's monospaced font with fixed, pre-composited strings of proportional text. While TH04 would later place its menu text in VRAM, TH03 still wanted to stay with TH02's approach of using gaiji to display the menu items on the PC-98 text layer. Since gaiji have a fixed size of 16×16 pixels, this requires the pre-composited bitmaps to be cut into blocks of that size and padded with blank pixels as necessary:

The proportional text section from TH03's MIKOFT.BFT, with the 16×16 gaiji grid overlaid

If your combined amount of text is short enough to fit into the PC-98's 256 gaiji slots, this is a nice way of using hardware features to replace the need for a proportional text renderer. It especially simplifies transitions between menus – simply wiping the entire TRAM is both cheap and certainly less error-prone than (un)blitting pixels in VRAM, which 📝 ZUN was always kind of sloppy at.
However, all this text still needs to be composited and cut into gaiji somewhere. If you do that manually, it's easy to lose sight of how the text is supposed to appear on screen, especially if you decide to horizontally center it. Then, you're in for some awkward coordinate fiddling as you try to place these 16-pixel bricks into the 8-pixel text grid to somehow make it all appear centered:

TH03's main menu box as it appears in the original gameTH03's main menu box with opaque gaiji to highlight their exact locations in TRAMTH03's main menu box with correct centering
TH03's Option menu box as it appears in the original gameTH03's Option menu box with opaque gaiji to highlight their exact locations in TRAMTH03's Option menu box with correct centering
The VS Start menu actually is correctly centered.

Then again, did ZUN actually want to center the Option menu like this? :thonk: Even the main menu looks kind of uncanny with perfect centering, probably because I'm so used to the original. Imperfect centering usually counts as a bug, but this case is quirky enough to leave it as is. We might want to perfectly center any future translations, but that would definitely cost a bit as I'd then actually need to write that proportional text renderer.
Apart from that, we're left with only a very short list of actual bugs and landmines:

While the rest of the code is not free of the usual nitpicks, those don't matter in the grand scheme of things. The code for the sliding 東方時空 animation is even better: it makes decent use of the EGC and page flipping, and places the 📝 loading calls for the character selection portraits at sensible points where the animation naturally wants to have a delay anyway. We're definitely ending the main menus of PC-98 Touhou on a high note here.


You might have already spotted some unfamiliar text in the gaiji above, and indeed, we've got three pieces of unused text in these two menus! Starting from the top, the Watch label is entirely unused as none of its gaiji IDs are referenced anywhere in the final code. The label's placement within the gaiji IDs would imply that this option was once part of the main menu, but nothing in the game suggests that the main menu ever had a bigger box that could fit a 7th element. On the contrary, every piece of menu code assumes that the box sprites loaded from OPWIN.BFT are exactly 128 pixels high:

TH03's OPWIN.BFT
Fun fact: The code doesn't even use the 16 pixels in the middle, and instead just assumes that the pixels between the X coordinates of [8; 16[ and [32; 40[ are identical.

The unused MIDI music option has already been widely documented elsewhere. Changing the first byte in YUME.CFG to 02 has no functional effect because ZUN removed most MIDI-related code before release. He did forget a few instances though, and the surviving dedicated switch case in the Option menu is now the entire reason why you can reveal this text without modifying the binary. Changing the option will always flip its value back to either off or FM(86).
Last but not least, we have the Type label and its associated numbers. These are the most interesting ones in my eyes; nobody talks about them, even though we have definite proof that they were used for the KeyConfig options at some earlier point in development:

But how exactly can we prove this? The code does string together the respective gaiji IDs and defines the resulting arrays right next to the final KeyConfig types, but doesn't use these arrays anywhere. By itself, this only means that these labels were associated with some option that may have existed at one point in development. The proof must therefore come from outside the code – and in this case, it comes from both 夢時空.TXT and 時空_1.TXT, which still refer to the KeyConfig options as numbered types:

 ■6.操作方法 […]    FM音源のジョイスティックが無い場合は、TYPE1にしてください。    ○TYPE1 Key vs Key […]    ○TYPE2 Joy vs Key […]    ○TYPE3 Key vs Joy

That's all I've got about the menus, so let's talk characters and gameplay! When playing Story Mode, OP.EXE picks the opponents for all stages immediately after the 📝 Select screen has faded out. Each character fights a fixed and hardcoded opponent in Stage 7's Decisive Match:

PlayerStage 7 opponent
Reimu ReimuMima Mima
Mima MimaReimu Reimu
Marisa MarisaReimu Reimu
Ellen EllenMarisa Marisa
Kotohime KotohimeReimu Reimu
Kana KanaEllen Ellen
Rikako RikakoKana Kana
Chiyuri ChiyuriKotohime Kotohime
Yumemi YumemiRikako Rikako

The opponents for the first 6 stages, however, are indeed completely random, and picked by master.lib's reimplementation of the Borland RNG. The game only needs to ensure that no character is picked twice, which it does like this:

const int stage_7_opponent = HARDCODED_STAGE_7_OPPONENT_FOR[playchar];
bool opponent_seen[7] = { false };

for(int stage = 0; stage < 6; stage++) {
	int candidate;
	do {
		// Pick a random character between Reimu and Rikako
		candidate = (irand() % 7);
	} while(opponent_seen[candidate] || (stage_7_opponent == candidate));
	opponent_seen[candidate] = true;
	story_opponent[stage] = candidate;
}
Characters are numbered from 0 (Reimu Reimu) to 8 (Yumemi Yumemi), following the order in the Stage 7 table above.

Yup. For every stage, ZUN re-rolls until the RNG returns a character who hasn't yet been seen in a previous stage – even in Stage 6 where there's only one possible character left. :zunpet: Since each successive stage makes it harder for the inner loop to find a valid answer, you start to wonder if there is some unlucky combination of seed and player character that causes the game to just hang forever.
So I tested all possible 232 seed values for all 9 player characters and… nope, Borland's RNG is good enough to eventually always return the only possible answer. The inner loop for Stage 6 does occasionally run for a disproportionate number of iterations, with the worst case being 134 re-rolls when playing Rikako Rikako's Story Mode with a seed value of 0x099BDA86. But even that is many orders of magnitude away from manifesting as any kind of noticeable delay. And on average, it just takes 17.15 iterations to determine all 6 random opponents.


The attract demos are another intriguing aspect that I initially didn't even have on my radar for the main menu. touhou-memories raises an interesting question: The demos start at Gauge and Boss Attack level 9, which would imply Lunatic difficulty, but the enemy formations don't match what you'd normally get on Lunatic. So, which difficulty were they recorded on?
Our already RE'd code clears up the first part of that question. TH03's demos are not recordings, but simply regular VS rounds in CPU vs. CPU mode that automatically quit back to the title screen after 7,000 frames. They can only possibly appear pre-recorded because the game cycles through a mere four hardcoded character pairings with fixed RNG seeds:

Demo #P1P2Seed
1Mima MimaReimu Reimu600
2Marisa MarisaRikako Rikako1000
3Ellen EllenKana Kana3200
4Kotohime KotohimeMarisa Marisa500
Certainly an odd choice if your game already had the feature to let arbitrary CPU-controlled characters fight each other. That would have even naturally worked for the trial version, which doesn't contain demos at all.

Then again, even a "random" character selection would have appeared deterministic to an outside observer. As usual for PC-98 Touhou, the RNG seed is initialized to 0 at startup and then simply increments after every frame you spend on the title screen and inside the top-level main, Option, and character selection menus – and yes, it does stay constant inside the VS Start menu. But since these demos always start after waiting exactly 520 frames on the title screen without pressing any key to enter the main menu, there's no actual source of randomness anywhere. ZUN could have classically initialized the RNG with the current system time, which is what we used to do back in the day before operating systems had easily accessible APIs for true randomness, but he chose not to, for whatever reason.

The difficulty question, however, is not so easy to answer. The demo startup code in the main menu doesn't override the configured difficulty, and neither does any other of the binaries depending on the demo ID. This seems to suggest that the demos simply run at the difficulty you last configured in the Option menu, just like regular VS matches. But then, you'd expect them to run differently depending on that difficulty, which they demonstrably don't. They always start on Gauge and Boss Attack level 9, and their last frame before the exit animation is always identical, right down to the score, reinforcing the pre-recorded impression:

Screenshot of the last frame (#7,000) of TH03's first demo (Mima vs. Reimu).Screenshot of the last frame (#7,000) of TH03's second demo (Marisa vs. Rikako).Screenshot of the last frame (#7,000) of TH03's third demo (Ellen vs. Kana).Screenshot of the last frame (#7,000) of TH03's fourth demo (Kotohime vs. Marisa).
Note that it takes much longer than the expected 2:04 minutes for the game to reach this end state. Each WARNING!! You are forced to evade / Your life is in peril popup freezes gameplay for 26 frames which don't count toward the demo frame counter. That's why these popups will provide such a great 📝 resynchronization opportunity for netplay. It's almost as if Versus Touhou was designed from the start with rollback netcode in mind! :godzun:

With quite a bit of time left over in the second push, it made sense to look at a bit of code around the Gauge and Boss Attack levels to hopefully get a better idea of what's going on there. The Gauge Attack levels are very straightforward – they can range from 1 to 16 inclusive, which matches the range that the game can depict with its gaiji, and all parts of the game agree about how they're interpreted:

The 16 proportional digit gaiji from TH03's GAMEFT.BFT
Stored in GAMEFT.BFT.

The same can't be said about the Boss Attack level though, as the gauge and the WARNING!! popup interpret the same internal variable as two different levels?

Darkened screenshot of a TH03 Boss Attack fired off near the beginning of a match at Lunatic difficulty, highlighting the discrepancy between the Boss Attack level shown in the gauge at the bottom of each playfield (10) and the one shown in the WARNING!! popup (9)

This apparent inconsistency raises quite a few questions. After all, these gaiji have to be addressed by adding an offset from 0 to 15 to the ID of the 1 gaiji, but the levels are supposed to range from 1 to 16. Does this mean that one of these two displays has an off-by-one error? You can't fire a Level 0 Boss Attack because the level always increments before every attack, but would 0 still be a technically valid Boss Attack level?
Decompiling the static HUD code debunks at least the first question as ZUN resolves the apparent off-by-one error by explicitly capping the displayed level to 16. And indeed, if a round lasts until the maximum Boss Attack level, the two numbers end up matching:

Darkened screenshot of a TH03 Boss Attack fired off near the end of a match, highlighting how both the gauge and the WARNING!! popup agree on the level once it reaches 16

This suggests that the popup indicates the level of the incoming attack while the gauge indicates the level of the next one to be fired by any player. That said, this theory not only needs tons of comments to explain it within the code, but also contradicts 夢時空.TXT, which explicitly describes the level next to the gauge as the 現在のBOSSアタックのレベル. Still, it remains our best bet until we've decompiled a few of the Boss Attacks and saw how they actually use this single variable.


So, what does this tell us about the demo difficulty? Now that we can search the code for these variables, we quickly come across the dedicated demo-specific branch that initializes these levels to the observable fixed values, along with two other variables I haven't researched so far. This confirms that demos run at a custom difficulty, as the two other variables receive slightly different values in regular gameplay.

However, it's still a good idea to check the code for any other potential effects of the difficulty setting. Maybe they're just hard to spot in demos? Doesn't difficulty typically affect a whole lot of other things in Touhou game code? Well, not in TH03 – MAIN.EXE only ever looks at the configured difficulty in three places, and all of them are part of the code that initializes a new round.
This reveals the true nature of difficulty in TH03: It's exclusively specified in terms of these five variables, and the Easy/Normal/Hard/Lunatic/"Demo" settings can be thought of as simply being presets for them. Story Mode adds 📝 the AI's number of safety frames to the list of variables and factors the current stage number into their values, but the concept stays the same. In this regard, TH03's design is unusually clean, making it perhaps the only Touhou game with not even a single "if difficulty is this, then do that" branch in script code. It's certainly the only PC-98 Touhou game with this property.

But it gets even better if we consider what this means for netplay. We now know that the configured difficulty is part of the match-defining parameters that must be synced between both players, just like the selected characters and the RNG seed. But why stop there? How about letting players not just choose between the presets, but allowing them to customize each of the five variables independently? Boom, we've just skyrocketed the replay value of netplay. 🚀 It's discoveries like these that justify my decision to start the road toward netplay by decompiling all of OP.EXE: In-engine menus are the cleanest and most friendly way of allowing players to configure all these variables, and now they're also the easiest and most natural choice from a technical point of view.


But wait, there's still some time left in that second push! The remaining fraction of the OP.EXE reverse-engineering contribution had repeating decimals, so let's do some quick TH02 PI work to remove the matching instance of repeating decimals from the backlog. This was very much a continuation of 📝 last year's light PI work; while the regular TH02 decompilation progress has focused and will continue to focus on the big features, it still left plenty of low-hanging PI fruit in boss code.
The first animation frame of TH02's Stage 3 midboss, taken from STAGE2.BMT Back then, we left with the positions of the Five Magic Stones, where ZUN's choice of storing them in arrays was almost revolutionary compared to what we saw in TH01. The same now applies to the state flags and total damage amount of not just the boss of Stage 3, but also the two independently damageable entities of the stage's midboss. In total, all of the newly identified arrays made up 3.36% of all memory references in TH02, and we're not even done with Stage 3. The first animation frame of TH02's Stage 3 midboss, taken from STAGE2.BMT


Actually, you know what, let's round out that second push with even more low-hanging PI fruit and ensure 📝 technical position independence for TH03's MAINL.EXE. This was very helpful considering that I'm going to build netplay into the anniversary branch, whose debloated foundation 📝 aims to merge every game into as few executables as possible. Due to TH03's overall lower level of bloat and the dedicated SPRITE16-based rendering code in MAIN.EXE, it might not make as much sense to merge all three of TH03's .EXE binaries as it did for TH01, and MAIN.EXE's lack of position independence currently prevents this anyway. However, merging just OP.EXE and MAINL.EXE makes tremendous sense not just for TH03, but for the other three games as well. These binaries have a much smaller ratio of ZUN code to library code, and use the same file formats and subsystems.
But that's not even the best part! Once we've factored out all the invisible inconsistencies between the games, we get to share all of this code across all of the four games. Hence, technical position independence for TH03's MAINL.EXE also was the final obstacle in the way of a single consistent and ultimately portable version of all of this code. 🙌

So, how do we go from here to 📝 the short-term half-PC-98/half-modern netplay option that Ember2528 is now funding? Most of the netcode will be unrelated to TH03 in particular, but we'd obviously still want to reverse-engineer more of MAIN.EXE to ensure a high-quality integration. So how about alternating the upcoming deliveries between pure RE work and any new or modded code? Next up, therefore, I'll go for the latter and debloat OP.EXE so that I can later add the netplay features without pulling my hair out. At that point, it also makes sense to take the first steps into portability; I've got some initial ideas I'm excited to implement, and Congrio's tiny bit of funding just begs to be removed from the backlog. :tannedcirno:
(And I'm definitely going to defuse all the tearing landmines because my goodness are they infuriating when slowing down the game or working with screen recordings.)

📝 Posted:
💰 Funded by:
Yanga, iruleatgames, nrook, [Anonymous]
🏷️ Tags:

Sometimes, the gameplay community will come up with the most outlandish theories before they even begin to consider the idea that certain safespots might not be intentional and only work by accident to begin with. Want more details? Read on…

  1. Overview of TH02's bullet system
  2. The TH02 Boss Decompilation Order Announcement
  3. Hitboxes
  4. The fundamental inaccuracy in PC-98 Touhou trigonometry
  5. The myth of death-induced hitbox shifting

So, TH02's bullet system! At a high level, it marks an interesting transitional point: It's still very much based on TH01's design with its predefined static or aimed spreads, but also introduces a few features that would later return in TH04 and TH05. By transplanting the TH01 system into a double-buffered environment, ZUN eliminated the 📝 worst 📝 unblitting-related parts that plagued TH01, ending up with the simplest and cleanest implementation of bullets I've seen so far. That's not to say it's good-code – far from it – but it also hasn't reached the messy levels that TH04 and especially TH05 would bring later. Of course, there's still TH03's system left to be done until I can say for sure, but TH02's is a pretty strong contender.

The more detailed overview of the system:


To find out where all these bullet types are used, I of course had to label all the individual pattern functions and assign them to their (mid)boss owners. As a side effect, we now also know the preferred boss decompilation order for this game!

  1. Marisa
  2. Mima
  3. Evil Eye Σ
  4. Meira
  5. Rika
  6. 5 Magic Stones

Quite a satisfying order, if I may say so myself – burning off the big fireworks right in the beginning, getting slightly more unexciting later on, but then ending on arguably the best Touhou character ever conceived. :onricdennat:
Each of these decompilations will be preceded by the stage's respective midboss. This includes the Extra Stage – you might not think that this stage has a midboss, but it technically does, in the form of this combination of patterns:

Lasting exactly these 420 frames.

There's nothing in TH02's code that mandates midbosses to have sprite-like entities or even something like an HP bar. Instead, the code-level definition of a midboss is all about these properties:

Stage 5, on the other hand, indeed doesn't have anything that can be interpreted as a midboss.


Finally, and probably most importantly, hitboxes! The raw decompilation of TH02's bullet collision detection code looks like this:

// 8×8 pellets
(pellet_left >= (player_left +  7)) &&
(pellet_left <  (player_left + 17)) &&
(pellet_top  >= (player_top  + 12)) &&
(pellet_top  <= (player_top  + 22))

// 16×16 bullets
(bullet_left >= (player_left -  3)) &&
(bullet_left <  (player_left + 19)) &&
(bullet_top  >= (player_top  +  4)) &&
(bullet_top  <= (player_top  + 24))

However, if you aren't deeply familiar with the sizes of all involved sprites, these top-left positions slightly obscure the actual position of the hitbox. That top-left point might also not be where you think it is:

Sprite #0 of TH02's MIKO.BFTSprite #2 of TH02's MIKO.BFTSprite #3 of TH02's MIKO.BFT
It's the red point.

So let's transform these checks to a more useful comparison of the respective center points against each other, and also fix that inconsistency of the right coordinates being compared with < instead of <= like the other values:

// 8×8 pellets
(pellet_center_x >= (player_center_x - 5)) &&
(pellet_center_x <= (player_center_x + 4)) &&
(pellet_center_y >= (player_center_y - 8)) &&
(pellet_center_y <= (player_center_y + 2))

// 16×16 bullets
(bullet_center_x >= (player_center_x - 11)) &&
(bullet_center_x <= (player_center_x + 10)) &&
(bullet_center_y >= (player_center_y - 12)) &&
(bullet_center_y <= (player_center_y +  8))
Now also revealing the horizontal asymmetry that ZUN's code was sneakily hiding.

TH02 has only 5 different bullet shapes and no directional or vector bullets, so we can exactly visualize all of them:

Hitbox of TH02's 8×8 pelletsHitbox of TH02's 16×16 ball bulletsHitbox of TH02's 呪 bulletsHitbox of TH02's billiard bulletsHitbox of TH02's star bullets
📝 As 📝 usual, a bullet sprite has to be fully surrounded by the blue box for a hit to be registered.

Yup. Quite asymmetric indeed, and probably surprising no one.


While experimenting with the various hardcoded group types, I stumbled over a quite surprising quirk that you might have already noticed in the spread showcase video further above. For some reason, none of these spreads are perfectly symmetric, what the…?

A 2-spread pattern using a base angle of 0x40, TH02's hardcoded medium spread angle, and spawned at the center of the playfield to trap the player at its spawn pointVisualization of the asymmetry in this 2-spread pattern if the right lane moved at the correct angle; the cyan area shows the symmetric triangle the pattern is expected to form, and the red area shows the inaccurate extra amount of space covered by the left laneVisualization of the asymmetry in this 2-spread pattern if the left lane moved at the correct angle; the cyan area shows the asymmetric triangle being formed by the pattern as it is, and the red area shows the extra amount of space missing from the right lane to make the pattern actually symmetric
By the time the bullets have reached the bottom of the playfield, the inaccuracy has compounded so much that the right lane ends up 6 pixels closer to the player's center position than the left lane. Depending on which of the two lanes actually gets the correct angle, this either means that the left lane is moving too far (2️⃣) or that the right lane is not moving far enough (3️⃣).

This is very weird because the angles that go into the velocity calculations are demonstrably correct. You'd therefore get this asymmetry for not only the hardcoded spreads, but also for code that does its own angle calculations and spawns each bullet manually. It's not something that can arise from the other known issue of 📝 Q12.4 quantization either, because that would affect all parts of a pattern equally.
Instead, the inaccuracy originates in the conversion from the polar coordinates of angles and speeds into the per-frame X/Y pixel velocities that the game uses for actual movement. The integer math algorithm that ZUN uses here is pretty much the single most fundamental piece of code shared by all 5 games:

// Using 📝 typical 8-bit angles.
int16_t polar_x(int16_t center, int16_t radius, uint8_t angle)
{
	// Ensure that the multiplication below doesn't overflow
	int32_t radius32 = radius;

	// Get the cosine value from master.lib's lookup table, which scales the
	// real-number range of [-1; +1] to the integer range of [-256; +256].
	int16_t cosine = CosTable8[angle];

	// The multiplication will include master.lib's 256× scaling factor, so
	// divide the result to bring it within the intended radius.
	return (((radius * cosine) >> 8) + center);
}
This exact algorithm is even recommended in the master.lib manual.

The pattern above uses TH02's medium delta angle for 2-spreads and moves at a Q12.4 subpixel speed of 2.5, which corresponds to a radius of 40 in the context of polar coordinate calculation. Let's step through it:

Angle Cosine Multiplied In hex Shift result In decimal In Q12.4
(0x40 - 6) 38 1520 000005F0 00000005 5 0.3125
(0x40 + 6) -38 -1520 FFFFFA10 FFFFFFFA -6 -0.3750

Whoa, talk about getting a basic lesson about how computers work! PC-98 Touhou has just taught us that signedness-preserving arithmetic bitshifts are not equivalent to the apparently corresponding division by a power of two, because the typical two's complement representation of negative numbers causes the result to effectively get rounded away from zero rather than toward zero like the corresponding positive value. In our example, this means that the right lane is correct and moves at the angle we passed in, while the left lane moves 1/16 pixels per frame further to the left than intended. Since we're talking about the most basic piece of trigonometry code here, this inaccuracy also applies to every other entity in PC-98 Touhou that moves left relative to its origin point – and/or up, because Y coordinates are calculated analogously. Imagine that… it's been 10 years since I decompiled the first variant of this function, and I'm only now noticing how fundamentally broken it is.:godzun:
It's understandable why master.lib's manual recommends bitshifts instead of the more correct division here. On a 486, a single 32-bit IDIV takes a whopping >33 cycles, and it would have been even slower on the 286 systems that master.lib is geared toward. But there's no need to go that far: By simply rounding up negative numbers, we can emulate the rounding behavior of regular division while still using a bitshift:

int16_t polar_x(int16_t center, int16_t radius, uint8_t angle)
{
	int32_t ret = (static_cast<int32_t>(radius) * CosTable8[angle]);
+	if(ret < 0) {
+		// Round the multiplication result so that the shift below will yield a number
+		// that's 1 closer to 0, thus rounding toward zero rather than away from zero as
+		// bitshifts with negative numbers would usually do. This ensures that we return
+		// the same absolute value after the bitshift that we would return if [ret] were
+		// positive, thus repairing certain broken symmetries in PC-98 Touhou.
+		ret += 255;
+	}
	return ((ret >> 8) + center);
}
You could also do this in a branchless way, which is coincidentally very close to what current Clang would generate if you just wrote a regular division by 256. This branchless way does seem slightly slower on a 486 though, as it adds a constant >8 cycles worth of instructions. The branching implementation only adds >4 cycles for positive numbers and >3 for negative ones.

But that would be deep quirk-fixing territory. uth05win just uses floating-point math for this transformation, exchanging master.lib's 8-bit lookup tables for the C library's regular sin() and cos() functions, but bypassing the issue like this also forms the single biggest source of porting inaccuracy. Can't really win here… 🤷
Now it will be interesting to see whether ZUN worked around this inaccuracy in certain places by using slightly lower left- or up-pointing angles…


Alright, but aren't we still missing the single biggest quirk about bullets in TH02? What's with Reimu's hitbox misaligning when dying? I can't release a blog post about TH02's bullet system without solving the single most infamous bullet-related mystery that this game has to offer. So, time to start a third push for looking at all the player movement, rendering, and death sequence code…

If you remember the code above, there is no way that a hitbox defined using hardcoded numbers can ever shift in response to anything. Any so-called hitbox misalignment would therefore be a player position misalignment, which sounds even harder to believe. :thonk: And sure enough, after decompiling all of it, there's nothing of that sort to be found in the player code either.
If we take player position misalignment literally, we're only left with one other place where it could possibly somehow come from: the strange vertical shaking you can observe right in the first few frames of most stages. So let's visualize the hitbox and… nope, the shaking is purely a scrolling bug, nothing about it changes the internal player position used for collision detection.

So, uh, what are people even talking about? It doesn't help that no one cites any source for this claim and just presents it as a natural and seemingly self-evident fact, as if it was the most obvious and most easily verified property about the game.
Thankfully though, there have been two relatively recent videos about the issue, but both of them only showcase the supposed hitbox shifting in relation to a specific safespot at the end of the Extra Stage midboss. So is that what's been going on here? The community taking the game's behavior in just a single instance of collision detection within a single stage, and extending it to a general claim about the game as a whole? :thonk:
But indeed, the described behavior cleanly reproduces every time. Enter the spot with 2 remaining lives and you survive, but enter with 1 remaining life and you die:

Whatever this is about, it's not due to a difference in hitboxes because Reimu's position demonstrably stays identical. But if we switch between these two videos, we can easily spot that it's the patterns that are different! With 1 life left, the pattern moves at an ever so slightly slower speed, which apparently adds up to a life-or-death difference at that specific spot.
And that's what the supposed hitbox shifting ultimately boils down to: The natural impact of rank on patterns, adjusting bullet speed with a factor of ((playperf + 48) / 48) times 1/16 pixels. And nothing else.
Let's visualize the hitbox and also track one of the bullets:

If we look at the respective frames in the playperf = +2 case, we see that the bullet misses the hitbox by either one or two pixels on three successive frames:

Frame 120 of the playperf = +2 video above, cropped to Reimu's position at the bottom-right edge of the playfieldFrame 121 of the playperf = +2 video above, cropped to Reimu's position at the bottom-right edge of the playfieldFrame 122 of the playperf = +2 video above, cropped to Reimu's position at the bottom-right edge of the playfieldFrame 123 of the playperf = +2 video above, cropped to Reimu's position at the bottom-right edge of the playfield
That's not a safespot, that's Reimu barely surviving only thanks to rounding.

So, for once, this is not a quirk, and doesn't even qualify as a "funny ZUN code moment" if you ask me. This is the game working exactly as designed, and it's the players who are instead making wild assumptions about safespots that only hold when the rank system plugs very specific numbers into the game's fixed-point math.
If anything, you could make the stronger case that this safespot should not work under any circumstance. If the game tested the whole parallelogram covered by a bullet's trajectory between two successive frames instead of just looking at a bullet's current position, it would consistently detect this collision regardless of rank. But even the later games don't go to these lengths.

Visualization of potential collision detection with parallelograms
By testing with parallelograms, the game would not only look at the distinct bullet positions in green, but also detect that the bullet traveled through the position highlighted in cyan, which does lie fully within the hitbox.

Amusingly, if you die twice before this pattern and reach a rank of -2, bullet speed drops enough for the safespot to work again:

It's even the same bullet that fails to hit Reimu, although coming in 5 frames later.

If you're now sad because you liked the idea of ZUN deliberately putting hitbox-shifting code into the game, you don't have to be! You might have already noticed it in the 1-life videos above, but TH02 does have one funny but inconsequential instance of death-induced player position shifting. In the 19 frames between the end of the animation and Reimu respawning at the bottom of the playfield, ZUN just adds 4 pixels to Reimu's Y position. You don't really notice it because the game doesn't render Reimu's sprite during these frames, but this modified position still partakes in collision detection, causing bullets to be removed accordingly.

Hilariously, ZUN was well aware that this shift could move the player's Y position beyond the bottom of the playfield, and thus cause sparks to be spawned at Y coordinates larger than 400. So he just… wrapped these spark spawn coordinates back into the visible range of VRAM, thus moving them to the top of the playfield… :zunpet:
The off-center spawn point of these sparks was the only actual bug in this delivery, by the way.

To round out the third push, I took some of the Anything budget towards finalizing random bits of previously RE'd TH04 and TH05 code that wouldn't add anything more to this blog post. These posts aren't really meant to be a reference – that's the job of the code, the actual primary source of the facts discussed here – but people have still started to use them as such. So it makes sense to try focusing them a bit more in the future, and not bundle all too many topics into a single one.
This finalization work was mostly centered on some tile rendering and .STD file loading boilerplate, but it also covered some of TH05's unfortunately undecompilable HUD number display code. The irony is that it's actually quite good ASM code that makes smart register choices and uses secondary side effects of certain instructions in a way that's clever but not overly incomprehensible. Too bad that these optimizations have no right to exist in logic code that is called way less than once per frame…

Next up: An unexpected quick return to the Shuusou Gyoku Linux port, as Arch Linux is bullying us onto SDL 3 faster than I would have liked.

📝 Posted:
💰 Funded by:
[Anonymous], 32th System
🏷️ Tags:

Remember when ReC98 was about researching the PC-98 Touhou games? After over half a year, we're finally back with some actual RE and decompilation work. The 📝 build system improvement break was definitely worth it though, the new system is a pure joy to use and injected some newfound excitement into day-to-day development.
And what game would be better suited for this occasion than TH03, which currently has the highest number of individual backers interested in it. Funding the full decompilation of TH03's OP.EXE is the clearest signal you can send me that 📝 you want your future TH03 netplay to be as seamlessly integrated and user-friendly as possible. We're just two menu screens away from reaching that goal anyway, and the character selection screen fits nicely into a single push.

  1. TH03's character selection screen
  2. Improved blog navigability

The code of a menu typically starts with loading all its graphics, and TH03's character selection already stands out in that regard due to the sheer amount of image data it involves. Each of the game's 9 selectable characters comes with

  1. a 192×192-pixel portrait (??SL.CD2),
  2. a 32×44-pixel pictogram describing her Extra Attack (in SLEX.CD2), and
  3. a 128×16-pixel image of her name (in CHNAME.BFT). While this image just consists of regular boldfaced versions of font ROM glyphs that the game could just render procedurally, pre-rendering these names and keeping them around in memory does make sense for performance reasons, as we're soon going to see. What doesn't make sense, though, is the fact that this is a 16-color BFNT image instead of a monochrome one, wasting both memory and rendering time.

Luckily, ZUN was sane enough to draw each character's stats programmatically. If you've ever looked through this game's data, you might have wondered where the game stores the sprite for an individual stat star. There's SLWIN.CDG, but that file just contains a full stat window with five stars in all three rows. And sure enough, ZUN renders each character's stats not by blitting sprites, but by painting (5 - value) yellow rectangles over the existing stars in that image. :tannedcirno:

TH03's SLWIN.CDG, showing off how ZUN baked all 15 possible stat stars into the image
The only stat-related image you will find as part of the game files. The number of stat stars per character is hardcoded and not based on any other internal constant we know about.

Together with the EXTRA🎔 window and the question mark portrait for Story Mode, all of this sums up to 255,216 bytes of image data across 14 files. You could remove the unnecessary alpha plane from SLEX.CD2 (-1,584 bytes) or store CHNAME.BFT in a 1-bit format (-6,912 bytes), but using 3.3% less memory barely makes a difference in the grand scheme of things.
From the code, we can assume that loading such an amount of data all at once would have led to a noticeable pause on the game's target PC-98 models. The obvious alternative would be to just start out with the initially visible images and lazy-load the data for other characters as the cursors move through the menu, but the resulting mini-latencies would have been bound to cause minor frame drops as well. Instead, ZUN opted for a rather creative solution: By segmenting the loading process into four parts and moving three of these parts ahead into the main menu, we instead get four smaller latencies in places where they don't stick out as much, if at all:

  1. The loading process starts at the logo animation, with Ellen's, Kotohime's, and Kana's portraits getting loaded after the 東方時空 letters finished sliding in. Why ZUN chose to start with characters #3, #4, and #5 is anyone's guess. :zunpet:
  2. Reimu's, Mima's, and Marisa's portraits as well as all 9 EXTRA🎔 attack pictograms are loaded at the end of the flash animation once the full title image is shown on screen and before the game is waiting for the player to press a key.
  3. The stat and EXTRA🎔 windows are loaded at the end of the main menu's slide-in animation… together with the question mark portrait for Story Mode, even though the player might not actually want to play Story Mode.
  4. Finally, the game loads Rikako's, Chiyuri's, and Yumemi's portraits after it cleared VRAM upon entering the Select screen, regardless of whether the latter two are even unlocked.

I don't like how ZUN implemented this split by using three separately named standalone functions with their own copy-pasted character loop, and the load calls for specific files could have also been arranged in a more optimal order. But otherwise, this has all the ingredients of good-code. As usual, though, ZUN then definitively ruins it all by counteracting the intended latency hiding with… deliberately added latency frames:

Sure, maybe loading the fourth part's 69,120 bytes from a highly fragmented hard drive might have even taken longer than 30 frames on a period-correct PC-98, but the point still stands that these delays don't solve the problem they are supposed to solve.


But the unquestionable main attraction of this menu is its fancy background animation. Mathematically, it consists of Lissajous curves with a twist: Instead of calculating each point as x = sin((fx·t)+ẟx) y = sin((fy·t)+ẟy) , TH03 effectively calculates its points as x = cos(fx·((t+ẟx) % 0xFF)) y = sin(fy·((t+ẟy) % 0xFF)) , due to t and being 📝 8-bit angles. Since the result of the addition remains 8-bit as well, it can and will regularly overflow before the frequency scaling factors fx and fy are applied, thus leading to sudden jumps between both ends of the 8-bit value range. The combination of this overflow and the gradual changes to fx and fy create all these interesting splits along the 360° of the curve:

At a high level, there really is just one big curve and one small curve, plus an array of trailing curves that approximate motion blur by subtracting from ẟx and ẟy.

In a rather unusual display of mathematical purity, ZUN fully re-calculates all variables and every point on every frame from just the single byte of state that indicates the current time within the animation's 128-frame cycle. However, that beauty is quickly tarnished by the sheer cost of fully recalculating these curves every frame:

This is decidedly more than the 1.17 million cycles we have between each VSync on the game's target 66 MHz CPUs. So it's not surprising that this effect is not rendered at 56.4 FPS, but instead drops the frame rate of the entire menu by targeting a hardcoded 1 frame per 3 VSync interrupts, or 18.8 FPS. Accordingly, I reduced the frame rate of the video above to represent the actual animation cycle as cleanly as possible.
Apparently, ZUN also tested the game on the 33 MHz PC-98 model that he targeted with TH01, and realized that 4,096 points were way too much even at 18.8 FPS. So he also added a mechanism that decrements the number of trailing curves if the last frame took ≥5 VSync interrupts, down to a minimum of only a single extra curve. You can see this in action by underclocking the CPU in your Neko Project fork of choice.

But were any of these measures really necessary? Couldn't ZUN just have allocated a 12 KiB ring buffer to keep the coordinates of previous curves, thus reducing per-frame calculations to just 512 points? Well, he could have, but we now can't use such a buffer to optimize the original animation. The 8-bit main angle offset/animation cycle variable advances by 0x02 every frame, but some of the trailing curves subtract odd numbers from this variable and thus fall between two frames of the main curves.
So let's shelve the idea of high-level algorithmic optimizations. In this particular case though, even micro-optimizations can have massive benefits. The sheer number of points magnifies the performance impact of every suboptimal code generation decision within the inner point loop:

Multiplied by the number of points, even these low-hanging fruit already save a whopping ≥729,088 cycles per frame on an i486, without writing a single line of ASM! On Pentium CPUs such as the one in the PC-9821Xa7 that ZUN supposedly developed this game on, the savings are slightly smaller because far calls are much faster, but still come in at a hefty ≥466,944 cycles. Thus, this animation easily beats 📝 TH01's sprite blitting and unblitting code, which just barely hit the 6-digit mark of wasted cycles, and snatches the crown of being the single most unoptimized code in all of PC-98 Touhou.
The incredible irony here is that TH03 is the point where ZUN 📝 really 📝 started 📝 going 📝 overboard with useless ASM micro-optimizations, yet he didn't even begin to optimize the one thing that would have actually benefitted from it. Maybe he 📝 once again went for the 📽️ cinematic look 📽️ on purpose?

Unlike TH01's sprites though, all this wasted performance doesn't really matter much in the end. Sure, optimizing the animation would give us more trailing curves on slower PC-98 models, but any attempt to increase the frame rate by interpolating angles would send us straight into fanfiction territory. Due to the 0x02/2.8125° increment per cycle, tripling the frame rate of this animation would require a change to a very awkward (log2384) = 8.58-bit angle format, complete with a new 384-entry sine/cosine lookup table. And honestly, the effect does look quite impressive even at 18.8 FPS.


There are three more bugs and quirks in this animation that are unrelated to performance:

Now with the full 18 curves, a direction change of the smaller trailing curves at the end of the loop that only looks slightly odd, and a reversed and more natural plotting order.

If you want to play with the math in a more user-friendly and high-res way, here's a Desmos graph of the full animation, converted to 360° angles and with toggles for the discontinuity and trail count fixes.


Now that we fully understand how the curve animation works, there's one more issue left to investigate. Let's actually try holding the Z key to auto-select Reimu on the very first frame of the Story Mode Select screen:

The confirmation flash even happens before the menu's first page flip.

Stepping through the individual frames of the video above reveals quite a bit of tearing, particularly when VRAM is cleared in frame 1 and during the menu's first page flip in frame 49. This might remind you of 📝 the tearing issues in the Music Rooms – and indeed, this tearing is once again the expected result of ZUN landmines in the code, not an emulation bug. In fact, quite the contrary: Scanline-based rendering is a mark of quality in an emulator, as it always requires more coding effort and processing power than not doing it. Everyone's favorite two PC-98 emulators from 20 years ago might look nicer on a per-frame basis, but only because they effectively hide ZUN's frequent confusion around VRAM page flips.
To understand these tearing issues, we need to consider two more code details:

  1. If a frame took longer than 3 VSync interrupts to render, ZUN flips the VRAM pages immediately without waiting for the next VSync interrupt.
  2. The hardware palette fade-out is the last thing done at the end of the per-frame rendering loop, but before busy-waiting for the VSync interrupt.

The combination of 1) and the aforementioned 30-frame delay quirk explains Frame 49. There, the page flip happens within the second frame of the three-frame chunk while the electron beam is drawing row #156. DOSBox-X doesn't try to be cycle-accurate to specific CPUs, but 1 menu frame taking 1.39 real-time frames at 56.4 FPS is roughly in line with the cycle counting we did earlier.
Frame 97 is the much more intriguing one, though. While it's mildly amusing to see the palette actually go brighter for a single frame before it fades out, the interesting aspect here is that 2) practically guarantees its palette changes to happen mid-frame. And since the CRT's electron beam might be anywhere at that point… yup, that's how you'd get more than 16 colors out of the PC-98's 16-color graphics mode. 🎨
Let's exaggerate the brightness difference a bit in case the original difference doesn't come across too clearly on your display:

Frame 97 of the video above, with a brighter initial palette to highlight the mid-frame palette change
Probably not too much of a reason for demosceners to get excited; generic PC-98 code that doesn't try to target specific CPUs would still need a way of reliably timing such mid-frame palette changes. Bit 6 (0x40) of I/O port 0xA0 indicates HBlank, and the usual documentation suggests that you could just busy-wait for that bit to flip, but an HBlank interrupt would be much nicer.

This reproduces on both DOSBox-X and Neko Project 21/W, although the latter needs the Screen → Real palettes option enabled to actually emulate a CRT electron beam. Unfortunately, I couldn't confirm it on real hardware because my PC-9821Nw133's screen vinegar'd at the beginning of the year. But just as with the image loading times, TH03's remaining code sorts of indicate that mid-frame palette changes were noticeable on real hardware, by means of this little flag I RE'd way back in March 2019. Sure, palette_show() takes >2,850 cycles on a 486 to downconvert master.lib's 8-bit palette to the GDC's 4-bit format and send it over, and that might add up with more than one palette-changing effect per frame. But tearing is a way more likely explanation for deferring all palette updates until after VSync and to the next frame.

And that completes another menu, placing us a very likely 2 pushes away from completing TH03's OP.EXE! Not many of those left now…


To balance out this heavy research into a comparatively small amount of code, I slotted in 2024's Part 2 of my usual bi-annual website improvements. This time, they went toward future-proofing the blog and making it a lot more navigable. You've probably already noticed the changes, but here's the full changelog:

Speaking of microblogging platforms, I've now also followed a good chunk of the Touhou community to Bluesky! The algorithms there seem to treat my posts much more favorably than Twitter has been doing lately, despite me having less than 1/10 of mostly automatically migrated followers there. For now, I'm going to cross-post new stuff to both platforms, but I might eventually spend a push to migrate my entire tweet history over to a self-hosted PDS to own the primary source of this data.

Next up: Staying with main menus, but jumping forward to TH04 and TH05 and finalizing some code there. Should be a quick one.

📝 Posted:
💰 Funded by:
Blue Bolt, JonathKane, [Anonymous]
🏷️ Tags:

TH03 gameplay! 📝 It's been over two years. People have been investing some decent money with the intention of eventually getting netplay, so let's cover some more foundations around player movement… and quickly notice that there's almost no overlap between gameplay RE and netplay preparations?

  1. The plan for TH03 netplay
  2. TH03's single hit circle

That makes for a fitting opportunity to think about what TH03 netplay would look like. Regardless of how we implement them into TH03 in particular, these features should always be part of the netcode:

The integration of all of this into TH03 can be approached from several angles. Of course, as long as we don't port the game, netplay will still require a PC-98 emulator to run on modern systems. PC-98 emulation is typically regarded as difficult to set up and the additional configuration required for some of these methods would only make it harder. However, yksoft1 demonstrates that it doesn't have to be: By compiling the (potentially modified) PC-98 emulator to WebAssembly, running any of these non-native methods becomes as simple as opening a website. To stay legally safe, I wouldn't host the game myself, so you'd still have to drag your th03.hdi onto that browser tab. But if you're happy with playing in a browser, this would be as user-friendly as it gets.

Here's an overview of the various approaches with their most important pros and cons:

Requirements Pretty in-game menus? Supports non-Touhou PC-98 games? Works on PC-98 hardware? Netcode location Timeframe
Generic PC-98 netcode for emulators Modded emulator + good CPU/RAM ✔️ Emulator Months
Emulator-level netcode with game-specific hooks Modded emulator ✔️ ✔️ Emulator Months
Pipes + standalone netplay tool
  • PC-98 DOS with EMS/XMS
  • Emulator with named pipe support + separate netplay tool running on host or real PC-98 + null modem cable + separate PC with netplay tool
✔️ ✔️ External bridge Months
Native PC-98 Windows 9x netcode
  • PC-98 Windows 9x
  • Emulator with network support + TAP driver running on host or real PC-98 + network card
✔️ ✔️ Windows 9x bridge Months
Native PC-98 DOS netcode
  • PC-98 DOS with EMS/XMS
  • Emulator with network support + TAP driver running on host or real PC-98 + network card
✔️ ✔️ Game logic Months
Porting the game first Any halfway modern Windows or Linux system ✔️ Game logic Years
Depending on what the backers prefer, we can go for one, a few, or all of these.
  1. Generic PC-98 netcode for one or more emulators

    This is the most basic and puristic variant that implements generic netplay for PC-98 games in general by effectively providing remote control of the emulated keyboard and joypad. The emulator will be unaware of the game, and the game will be unaware of being netplayed, which makes this solution particularly interesting for the non-Touhou PC-98 scene, or competitive players who absolutely insist on using ZUN's original binaries and won't trust any of my modded game builds.
    Applied to TH03, this means that players would select the regular hot-seat 1P vs 2P mode and then initiate a match through a new menu in the emulator UI. The same UI must then provide an option to manually remap incoming key and button presses to the 2P controls (newly introducing remapping to the emulator if necessary), as well as blocking any non-2P keys. The host then sends an initial savestate to the guest to ensure an identical starting state, and starts synchronizing and rolling back inputs at VSync boundaries.

    This generic nature means that we don't get to include any of the TH03-specific rollback optimizations mentioned above, leading to the highest CPU and memory requirements out of all the variants. It sure is the easiest to implement though, as we get to freely use modern C++ WebRTC libraries that are designed to work with the network stack of the underlying OS.
    I can try to build this netcode as a generic library that can work with any PC-98 emulator, but it would ultimately be up to the respective upstream developers to integrate it into official releases. Therefore, expect this variant to require separate funding and custom builds for each individual emulator codebase that we'd like to support.

  2. Emulator-level netcode with game-specific hooks

    Takes the generic netcode developed in 1) and adds the possibility for the game to control it via a special interrupt API. This enables several improvements:

    • Online matches could be initiated through new options in TH03's main menu rather than the emulator's UI.
    • The game could communicate the memory region that should be backed up every frame, cutting down memory usage as described above.
    • The exchanged input data could use the game's internal format instead of keyboard or joypad inputs. This removes the need for key remapping at the emulator level and naturally prevents the inherent issue of remote control where players could mess with each other's controls.
    • The game could be aware of the rollbacks, allowing it to jump over its rendering code while processing the queue of remote inputs and thus gain some performance as explained above.
    • The game could add synchronization points that block gameplay until both players have reached them, preventing the rollback queue from growing infinitely. This solves the issue of 1) not having any inherent way of working around desyncs and the resulting growth of the rollback queue. As an example, if one of the two emulators in 1) took, say, 2 seconds longer to load the game due to a random CPU spike caused by some bloatware on their system, the two players would be out of sync by 2 seconds for the rest of the session, forcing the faster system to render 113 frames every time an input prediction turned out to be incorrect.
      Good places for synchronization points include the beginning of each round, the WARNING!! You are forced to evade / Your life is in peril popups that pause the game for a few frames anyway, and whenever the game is paused via the ESC key.
    • During such pauses, the game could then also block the resuming ESC key of the player who didn't pause the game.
  3. Emulated serial port communicating over named pipes with a standalone netplay tool

    This approach would take the netcode developed in 2) out of the emulator and into a separate application running on the (modern) host OS, just like Ju.N.Owen or Adonis. The previous interrupt API would then be turned into a binary protocol communicated over the PC-98's serial port, while the rollback snapshots would be stored inside the emulated PC-98 in EMS or XMS/Protected Mode memory. Netplay data would then move through these stages:

    🖥️ PC-98 game logic ⇄ Serial port ⇄ Emulator ⇄ Named pipe ⇄ Netcode logic ⇄ WebRTC Data Channel ⇄ Internet 🛜
    All green steps run natively on the host OS.

    Sending serial port data over named pipes is only a semi-common feature in PC-98 emulators, and would currently restrict netplay to Neko Project 21/W and NP2kai on Windows. This is a pretty clean and generally useful feature to have in an emulator though, and emulator maintainers will be much more likely to include this than the custom netplay code I proposed in 1) and 2). DOSBox-X has an open issue that we could help implement, and the NP2kai Linux port would probably also appreciate a mkfifo(3) implementation.
    This could even work with emulators that only implement PC-98 serial ports in terms of, well, native Windows serial ports. This group currently includes Neko Project II fmgen, SL9821, T98-Next, and rare bundles of Anex86 that replace MIDI support with COM port emulation. These would require separately installed and configured virtual serial port software in place of the named pipe connection, as well as support for actual serial ports in the netplay tool itself. In fact, this is the only way that die-hard Anex86 and T98-Next fans could enjoy any kind of netplay on these two ancient emulators.

    If it works though, it's the optimal solution for the emulated use case if we don't want to fork the emulator. From the point of view of the PC-98, the serial port is the cheapest way to send a couple of bytes to some external thing, and named pipes are one of many native ways for two Windows/Linux applications to efficiently communicate.
    The only slight drawback of this approach is the expected high DOS memory requirement for rollback. Unless we find a way to really compress game state snapshots to just a few KB, this approach will require a more modern DOS setup with EMS/XMS support instead of the pre-installed MS-DOS 3.30C on a certain widely circulated .HDI copy. But apart from that, all you'd need to do is run the separate netplay tool, pick the same pipe name in both the tool and the emulator, and you're good to go.

    Screenshot of Neko Project 21/W's Serial option menu, with COM1 being configured to send over a named pipe

    It could even work for real hardware, but would require the PC-98 to be linked to the separately running modern system via a null modem cable.

  4. Native PC-98 Windows 9x netcode (only for real PC-98 hardware equipped with an Ethernet card)

    Equivalent in features to 2), but pulls the netcode into the PC-98 system itself. The tool developed in 3) would then as a separate 32-bit or 16-bit Windows application that somehow communicates with the game running in a DOS window. The handful of real-hardware owners who have actually equipped their PC-98 with a network card such as the LGY-98 would then no longer require the modern PC from 3) as a bridge in the middle.
    This specific card also happens to be low-level-emulated by the 21/W fork of Neko Project. However, it makes little sense to use this technique in an emulator when compared to 3), as NP21/W requires a separately installed and configured TAP driver to actually be able to access your native Windows Internet connection. While the setup is well-documented and I did manage to get a working Internet connection inside an emulated Windows 95, it's definitely not foolproof. Not to mention DOSBox-X, which currently emulates the apparently hardware-compatible NE2000 card, but disables its emulation in PC-98 mode, most likely because its I/O ports clash with the typical peripherals of a PC-98 system.

    And that's not the end of the drawbacks:

    • Netplay would depend on the PC-98 versions of Windows 9x and its full network stack, nothing of which is required for the game itself.
    • Porting libdatachannel (and especially the required transport encryption) to Windows 95 will probably involve a bit of effort as well.
    • As would actually finding a way to access V86 mode memory from a 32-bit or 16-bit Windows process, particularly due to how isolated DOS processes are from the rest of the system and even each other. A quick investigation revealed three potential approaches:
      • A 32-bit process could read the memory out of the address space of the console host process (WINOA32.MOD). There seems to be no way of locating the specific base address of a DOS process, but you could always do a brute-force search through the memory map.
      • If started before Windows, TSRs will share their resident memory with both DOS and Win16 processes. The segment pointer would then be retrieved through a typical interrupt API.
      • Writing a VxD driver 😩
    • Correctly setting up TH03 to run within Windows 95 to begin with can be rather tricky. The GDC clock speed check needs to be either patched out or overridden using mode-setting tools, Windows needs to be blocked from accessing the FM chip, and even then, MAIN.EXE might still immediately crash during the first frame and leave all of VRAM corrupted:
      Screenshot of the TH03 crash on a Windows 95 system emulated in Neko Project 21/W ver0.86 rev92β3
      This is probably a bug in the latest ver0.86 rev92β3 version of Neko Project 21/W; I got it to work fine on real hardware. 📝 StormySpace did run on the same emulated Windows 95 system without any issues, though. Regardless, it's still worth mentioning as a symbol of everything that can go wrong.
    • A matchmaking server would be much more of a requirement than in any of the emulator variants. Players are unlikely to run their favorite chat client on the same PC-98 system, and the signaling codes are way too unwieldy to type them in manually. (Then again, IRC is always an option, and the people who would fund this variant are probably the exact same people who are already running IRC clients on their PC-98.)
  5. Native PC-98 DOS netcode (only for real PC-98 hardware equipped with an Ethernet card)

    Conceptually the same as 4), but going yet another level deeper, replacing the Windows 9x network stack with a DOS-based one. This might look even more intimidating and error-prone, but after I got ping and even Telnet working, I was pleasantly surprised at how much simpler it is when compared to the Windows variant. The whole stack consists of just one LGY-98 hardware information tool, a LGY-98 packet driver TSR, and a TSR that implements TCP/IP/UDP/DNS/ICMP and is configured with a plaintext file. I don't have any deep experience with these protocols, so I was quite surprised that you can implement all of them in a single 40 KiB binary. Installed as TSRs, the entire stack takes up an acceptable 82 KiB of conventional memory, leaving more than enough space for the game itself. And since both of the TSRs are open-source, we can even legally bundle them with the future modified game binaries.
    The matchmaking issue from the Windows 9x approach remains though, along with the following issues:

    • Porting libdatachannel and the required transport encryption to the TEEN stack seems even more time-consuming than a Windows 95 port.
    • The TEEN stack has no UI for specifying the system's or gateway's IP addresses outside of its plaintext configuration file. This provides a nice opportunity for adding a new Internet settings menu with great error feedback to the game itself. Great for UX, but it's another thing I'd have to write.
    • The LGY-98 is not the only network card for the PC-98. Others might have more complicated DOS drivers that might not work as seamlessly with the TEEN stack, or have no preserved DOS drivers at all. Heck, the most time-consuming part of the DOS setup was finding the correct download link for the LGY-98 packet driver, as the one link that appears in a lot of places only throws an access denied error these days. Edit (2024-04-30): spaztron64 is now hosting both the LGY-98 packet driver and the entire TEEN bundle on his homepage.
      If you're interested in funding this variant and are using a non-LGY-98 card on real hardware, make sure you get general Internet working on DOS first.
  6. Porting the game first

    As always, this is the premium option. If the entire game already runs as a standalone executable on a modern system, we can just put all the netcode into the same binary and have the most seamless integration possible.

That leaves us with these prerequisites:

Once we've reached any of these prerequisites, I'll set up a separate campaign funding method that runs parallel to the cap. As netplay is one of those big features where incremental progress makes little sense and we can expect wide community support for the idea, I'll go for a more classic crowdfunding model with a fixed goal for the minimum feature set and stretch goals for optional quality-of-life features. Since I've still got two other big projects waiting to be finished, I'd like to at least complete the Shuusou Gyoku Linux port before I start working on TH03 netplay, even if we manage to hit any of the funding goals before that.


For the first time in a long while, the actual content of this push can be listed fairly quickly. I've now RE'd:

It's also the third TH03 gameplay push in a row that features inappropriate ASM code in places that really, really didn't need any. As usual, the code is worse than what Turbo C++ 4.0J would generate for idiomatic C code, and the surrounding code remains full of untapped and quick optimization opportunities anyway. This time, the biggest joke is the sprite offset calculation in the hit circle rendering code:

_BX = (circle->age - 1);
_BX >>= 2;
_BX *= 2;
uint16_t sprite_offset_in_sprite16_area = (0x1910 + _BX + _BX + _BX);
A multiplication with 6 would have compiled into a single IMUL instruction. This compiles into 4 MOVs, one IMUL (with 2), and two ADDs. :zunpet: This surely must have been left in on purpose for us to laugh about it one day?

But while we've all come to expect the usual share of ZUN bloat by now, this is also the first push without either a ZUN bug or a landmine since I started using these terms! 🎉 It does contain a single ZUN quirk though, which can also be found in the hit circles. This animation comes in two types with different caps: 12 animation slots across both playfields for the enemy circles shown in alternating bright/dark yellow colors, whereas the white animation for the player characters has a cap of… 1? P2 takes precedence over P1 because its update code always runs last, which explains what happens when both players get hit within the 16 frames of the animation:

If they both get hit on the exact same frame, the animation for P1 never plays, as P2 takes precedence.
If the other player gets hit within 16 frames of an active white circle animation, the animation is reinitialized for the other player as there's only a single slot to hold it. Is this supposed to telegraph that the other player got hit without them having to look over to the other playfield? After all, they're drawn on top of most other entities, but below the player. :onricdennat:

SPRITE16 uses the PC-98's EGC to draw these single-color sprites. If the EGC is already set up, it can be set into a GRCG-equivalent RMW mode using the pattern/read plane register (0x4A2) and foreground color register (0x4A6), together with setting the mode register (0x4A4) to 0x0CAC. Unlike the typical blitting operations that involve its 16-dot pattern register, the EGC even supports 8- or 32-bit writes in this mode, just like the GRCG. 📝 As expected for EGC features beyond the most ordinary ones though, T98-Next simply sets every written pixel to black on a 32-bit write. :tannedcirno: Comparing the actual performance of such writes to the GRCG would be 📝 yet another interesting question to benchmark.

Next up: I think it's time for ReC98's build system to reach its final form. For almost 5 years, I've been using an unreleased sane build system on a parallel private branch that was just missing some final polish and bugfixes. Meanwhile, the public repo is still using the project's initial Makefile that, 📝 as typical for Makefiles, is so unreliable that BUILD16B.BAT force-rebuilds everything by default anyway. While my build system has scaled decently over the years, something even better happened in the meantime: MS-DOS Player, a DOS emulator exclusively meant for seamless integration of CLI programs into the Windows console, has been forked and enhanced enough to finally run Turbo C++ 4.0J at an acceptable speed. So let's remove DOSBox from the equation, merge the 32-bit and 16-bit build steps into a single 32-bit one, set all of this up in a user-friendly way, and maybe squeeze even more performance out of MS-DOS Player specifically for this use case.

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous], iruleatgames
🏷️ Tags:

Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.

But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!

  1. The Music Room background masking effect
  2. The GRCG's plane disabling flags
  3. Text color restrictions
  4. The entire messy rest of the Music Room code
  5. TH04's partially consistent congratulation picture on Easy Mode
  6. TH02's boss position and damage variables

As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.

The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:

TH02's Music Room background in its on-disk state TH03's Music Room background in its on-disk state TH04's Music Room background in its on-disk state TH05's Music Room background in its on-disk state
Let's call this "the blank image".

That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right? :thonk:
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:

TH02's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH03's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH04's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH05's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom
The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.

On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.


Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:

// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);

// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);

// Regular GRCG blitting code to any VRAM segment…
pokeb(0xA8000, offset, …);

// We're done, turn off the GRCG.
outportb(0x7C, 0x00);

This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck.

Three overlapping Music Room polygons rendered using master.lib's grcg_polygon_c() function with a disabled GRCGThree overlapping Music Room polygons rendered as in the original game, with the GRCG enabled
An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.

As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:

The contents of nopoly_B with each game's first track selected.

Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:

And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier… :onricdennat:


For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, Try to Normal Rank!! could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".

With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌


The remaining 1/6th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.

Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there! Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.

Screenshot of TH02's Five Magic Stones, with the first two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the second two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the last one (both internally and in the order you fight them in) alive and activated

This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!

These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just 1/6th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.

Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous], Yanga, Splashman
🏷️ Tags:

And we're back to PC-98 Touhou for a brief interruption of the ongoing Shuusou Gyoku Linux port. Let's clear some of the Touhou-related progress from the backlog, and use the unconstrained nature of these contributions to prepare the 📝 upcoming non-ASCII translations commissioned by Touhou Patch Center. The current budget won't cover all of my ambitions, but it would at least be nice if all text in these games was feasibly translatable by the time I officially start working on that project.

At a little over 3 pushes, it might be surprising to see that this took longer than the 📝 TH03/TH04/TH05 cutscene system. It's obvious that TH02 started out with a different system for in-game dialog, but while TH04 and TH05 look identical on the surface, they only actually share 30% of their dialog code. So this felt more like decompiling 2.4 distinct systems, as opposed to one identical base with tons of game-specific differences on top.

The table of contents was pretty popular last time around, so let's have another one:

  1. Overview of TH04's dialog system
  2. Changes introduced in TH05
  3. Command reference for the TH04 and TH05 systems
  4. Overview of TH02's dialog system
  5. TH02's face portrait images
  6. Bugs during TH02's dialog box slide-in animation
  7. Bugs and quirks in Mima's defeat dialog (might be lore-relevant)
  8. TH03 win messages

Let's start with the ones from TH04 and TH05, since they are not that broken. For TH04, ZUN started out by copy-pasting the cutscene system, causing the result to inherit many of the caveats I already described in the cutscene blog post:

Then, however, he greatly simplified the system. Mainly, this was done by moving text rendering from the PC-98 graphics chip to the text chip, which avoids the need for any text-related unblitting code, but ZUN also added a bunch of smaller changes:

While it would seem that TH05 has no issues with ASCII 0x20 spaces, the text as a whole is still blindly processed two bytes at a time, and any commands can only appear at even byte positions within a line. I dimmed the VRAM pixels to 25% of their original brightness to make the text easier to read.
The same text backported to TH04, additionally demonstrating how that game's dialog system inherited the whitespace skipping behavior of TH03's cutscene system. Just like there, ASCII 0x20 spaces only work at odd byte positions because the game treats them as the trailing byte of a full-width Shift-JIS codepoint. I don't know how large the budget for the upcoming non-ASCII translations will be, but I'm going to fix this even in the very basic fully static variant. I dimmed the VRAM pixels to 25% of their original brightness to make the text easier to read.
Demonstrating the lack of automatic line or box breaks in TH05's dialog systemDemonstrating the lack of automatic line or box breaks in TH04's dialog system, in addition to its lack of support for ASCII 0x20 spaces carried over from TH03's cutscene system

TH05 then moved from TH04's plaintext scripts to the binary .TX2 format while removing all the unused commands copy-pasted from the cutscene system. Except for a single additional command intended to clear a text box, TH05's dialog system only supports a strict subset of the features of TH04's system.
This change also introduced the following differences compared to TH04:

Writing the 0x02 byte to text RAM results in an SX character, which is simply the PC-98 font ROM's glyph for that Shift-JIS codepoint.
Also note how each face change is now preceded by two frames of delay.
No problem in TH04. Note how the dialog also runs a bit faster – TH04 only adds the aforementioned one frame of delay to each face change, and has fewer two-byte chunks of text to display overall.

For modding these files, you probably want to use TXDEF from -Tom-'s MysticTK. It decodes these files into a text representation, and its encoder then takes care of the character-specific byte offsets in the 10-byte header. This text representation simplifies the format a lot by avoiding all corner cases and landmines you'd experience during hex-editing – most notably by interpreting the box-starting 0x0D as a command to show text that takes a string parameter, avoiding the broken calls to script commands in the middle of text. However, you'd still have to manually ensure an even number of bytes on every line of text.

In the entry function of TH05's dialog loop, we also encounter the hack that is responsible for properly handling 📝 ZUN's hidden Extra Stage replay. Since the dialog loop doesn't access the replay inputs but still requires key presses to advance through the boxes, ZUN chose to just skip the dialog altogether in the specific case of the Extra Stage replay being active, and replicated all sprite management commands from the dialog script by just hardcoding them.
And you know what? Not only do I not mind this hack, but I would have preferred it over the actual dialog system! The aforementioned sprite management commands effectively boil down to manual memory management, deallocating all stage enemy and midboss sprites and thus ensuring that the boss sprites end up at specific master.lib sprite IDs (patnums). The hardcoded boss rendering function then expects these sprites to be available at these exact IDs… which means that the otherwise hardcoded bosses can't render properly without the dialog script running before them. :zunpet:
There is absolutely no excuse for the game to burden dialog scripts with this functionality. Sure, delayed deallocation would allow them to blit stage-specific sprites, but the original games don't do that; probably because none of the two games feature an unblitting command. And even if they did, it would have still been cleaner to expose the boss-specific sprite setup as a single script command that can then also be called from game code if the script didn't do so. Commands like these just are a recipe for crashes, especially with parsers that expect fullwidth Shift-JIS text and where misaligned ASCII text can easily cause these commands to be skipped.

But then again, it does make for funny screenshot material if you accidentally the deallocation and then see bosses being turned into stage enemies:

TH04's dialog before the Stage 4 Marisa fight without deallocating the stage sprites inside the script, causing Marisa to be turned into one of the stage enemiesTH04's dialog before the Stage 6 Yuuka fight without deallocating the stage sprites inside the script, causing Yuuka to be turned into two different cels of the same stage enemyTH05's dialog before the Louise fight without deallocating the stage sprites inside the script, causing Louise to be turned into one of the ice enemies from TH05's Stage 2TH05's dialog before the Louise fight without deallocating the stage sprites inside the script, causing Mai and Yuki to be turned into a windmill and fairy/demon enemy, respectively
Some of the more amusing consequences of not calling the sprite-deallocating :th04: \c /  :th05: 0x04 command inside a dialog script.
In the case of 4️⃣, the game then even crashes on this frame at the end of the dialog, in a way that resembles the infamous 📝 TH04 crash before Stage 5 Yuuka if no EMS driver is loaded. Both the stage- and boss-specific BFNT sprites are loaded into memory at this point, leaving no room for the 256×256-pixel background image on the size-limited master.lib heap.

With all the general details out of the way, here's the command reference:

:th04: :th05:
0
1
0x00
0x01
Selects either the player character (0) or the boss (1) as the currently speaking character, and moves the cursor to the beginning of the text box. In TH04, this command also directly starts the new dialog box, which is probably why it's not prefixed with a \ as it only makes sense outside of text. TH05 requires a separate 0x0D command to do the same.
\=1 0x02 0x!! Replaces the face portrait of the currently active speaking character with image #1 within her .CD2 file.
\=255 0x02 0xFF Removes the face portrait from the currently active text box.
\l,filename 0x03 filename 0x00 Calls master.lib's super_entry_bfnt() function, which loads sprites from a BFNT file to consecutive IDs starting at the current patnum write cursor.
\c 0x04 Deallocates all stage-specific BFNT sprites (i.e., stage enemies and midbosses), freeing up conventional RAM for the boss sprites and ensuring that master.lib's patnum write cursor ends up at :th04: 128 / :th05: 180.
In TH05's Extra Stage, this command also replaces 📝 the sprites loaded from MIKO16.BFT with the ones from ST06_16.BFT.
\d Deallocates all face portrait images.
The game automatically does this at the end of each dialog sequence. However, ZUN wanted to load Stage 6 Yuuka's 76 KiB of additional animations inside the script via \l, and would have once again run up against the master.lib heap size limit without that extra free memory.
\m,filename 0x05 filename 0x00 Stops the currently playing BGM, loads a new one from the given file, and starts playback.
\m$ 0x05 $ 0x00 Stops the currently playing BGM.
Note that TH05 interprets $ as a null-terminated filename as well.
\m* Restarts playback of the currently loaded BGM from the beginning.
\b0,0,0 0x06 0x!!!! 0x!!!! 0x!! Blits the master.lib patnum with the ID indicated by the third parameter to the current VRAM page at the top-left screen position indicated by the first two parameters.
\e0 Plays the sound effect with the given ID.
\t100 Sets palette brightness via master.lib's palette_settone() to any value from 0 (fully black) to 200 (fully white). 100 corresponds to the palette's original colors.
\fo1
\fi1
Calls master.lib's palette_black_out() or palette_black_in() to play a hardware palette fade animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
\wo1
\wi1
0x09 0x!!
0x0A 0x!!
Calls master.lib's palette_white_out() or palette_white_in() to play a hardware palette fade animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
The TH05 version of 0x09 also clears the text in both boxes before the animation.
\n 0x0B Starts a new line by resetting the X coordinate of the TRAM cursor to the left edge of the text area and incrementing the Y coordinate.
The new line will always be the next one below the last one that was properly started, regardless of whether the text previously wrapped to the next TRAM row at the edge of the screen.
\g8 Plays a blocking 8-frame screen shake animation. Copy-pasted from the cutscene parser, but actually used right at the end of the dialog shown before TH04's Bad Ending.
\ga0 0x0C 0x!! Shows the gaiji with the given ID from 0 to 255 at the current cursor position, ignoring the per-glyph delay.
\k0 Waits 0 frames (0 = forever) for any key to be pressed before continuing script execution.
0x0D Starts a new dialog box with the previously selected speaker. All text until the next 0xFF command will appear on screen.
Inside dialogs, this is a no-op.
0x0E Takes the current dialog cursor as the top-left corner of a 240×48-pixel rectangle, and replaces all text RAM characters within that rectangle with whitespace.
This is only used to clear the player character's text box before Shinki's final いくよ‼ box. Shinki has two consecutive text boxes in all 4 scripts here, and ZUN probably wanted to clear the otherwise blue text to imply a dramatic pause before Shinki's final sentence. Nice touch.
(You could, however, also use it after a box-ending 0xFF command to mess with text RAM in general.)
\# Quits the currently running loop. This returns from either the text loop to the command loop, or it ends the dialog sequence by returning from the command loop back to gameplay. If this stage of the game later starts another dialog sequence, it will start at the next script byte.
\$ Like \#, but first waits for any key to be pressed.
0xFF Behaves like TH04's \$ in the text loop, and like \# in the command loop. Hence, it's not possible in TH05 to automatically end a text box and advance to the next one without waiting for a key press.
Unused commands are in gray.

At the end of the day, you might criticize the system for how its landmines make it annoying to mod in ASCII text, but it all works and does what it's supposed to. ZUN could have written the cleanest single and central Shift-JIS iterator that properly chunks a byte buffer into halfwidth and fullwidth codepoints, and I'd still be throwing it out for the upcoming non-ASCII translations in favor of something that either also supports UTF-8 or performs dictionary lookups with a full box of text.
The only actual bug can be found in the input detection, which once again doesn't correctly handle the infamous key up/key down scancode quirk of PC-98 keyboards. All it takes is one wrongly placed input polling call, and suddenly you have to think about how the update cycle behind the PC-98 keyboard state bytes might cause the game to run the regular 2-frame delay for a single 2-byte chunk of text before it shows the full text of a box after all… But even this bug is highly theoretical and could probably only be observed very, very rarely, and exclusively on real hardware.


The same can't be said about TH02 though, but more on that later. Let's first take a look at its data, which started out much simpler in that game. The STAGE?.TXT files contain just raw Shift-JIS text with no trace of commands or structure. Turning on the whitespace display feature in your editor reveals how the dialog system even assumes a fixed byte length for each box: 36 bytes per line which will appear on screen, followed by 4 bytes of padding, which the original files conveniently use to visually split the lines via a CR/LF newline sequence. Make sure to disable trimming of trailing whitespace in your editor to not ruin the file when modding the text… :onricdennat:

靈夢:あんた、まだ名前も聞いてないの··
······に覚えられないわよ。・・・・・··
里香:あたいは、里香よ。覚えときなさ··
・・・い。・・・・・・················
Two boxes from TH02's STAGE5.TXT with visualized whitespace. These also demonstrate how the CR/LF newlines only make up 2 of the 4 padding bytes, and require each line to be padded with two more bytes; you could not use these trailing spaces for actual text. Also note how the exquisite mixture of fullwidth and halfwidth spaces demands the text to be viewed with only the most metrically consistent monospace fonts to preserve the intended alignment. 🍷 It appears quite misaligned on my phone.

Consequently, everything else is hardcoded – every effect shown between text boxes, the face portrait shown for each box, and even how many boxes are part of each dialog sequence. Which means that the source code now contains a long hardcoded list of face IDs for most of the text boxes in the game, with the rest being part of the dedicated hardcoded dialog scripts for 2/3 of the game's stages.
Without the restriction to a fixed set of scripting commands, TH02 naturally gravitated to having the most varied dialog sequences of all PC-98 Touhou games. This flexibility certainly facilitated Mima's grand entrance animation in Stage 4, or the different lines in Stage 4 and 5 depending on whether you already used a continue or not. Marisa's post-boss dialog even inserts the number of continues into the text itself – by, you guessed it, writing to hardcoded byte offsets inside the dialog text before printing it to the screen. :godzun: But once again, I have nothing to criticize here – not even the fact that the alternate dialog scripts have to mutate the "box cursor" to jump to the intended boxes within the file. I know that some people in my audience like VMs, but I would have considered it more bloated if ZUN had implemented a full-blown scripting language just to handle all these special cases.


Another unique aspect of TH02 is the way it stores its face portraits, which are infamous for how hard they are to find in the original data files. These sprites are actually map tiles, stored in MIKO_K.MPN, and drawn using the same functions used to blit the regular map tiles to the 📝 tile source area in VRAM. We can only guess why ZUN chose this one out of the three graphics formats he used in TH02:

TH02's MIKO_K.PTN, arranged into a 16×16-tile layout that reveals how these tiles are combined into face portraits.
MPNDEF from -Tom-'s MysticTK conveniently uses this exact layout in its .BMP output. Earlier MPNDEF versions crashed when converting this file as its 256 tiles led to an 8-bit overflow bug, so make sure you've updated to the current version from the end of October 2023 if you want to convert this file yourself. The format stores the 4 bitplanes of each 16×16 tile in order, so good luck finding a different planar image viewer that would support both such a tiled layout and a custom palette. Sometimes, a weird internal format is the best type of obfuscation. :tannedcirno:
TH02's MIKO_K.PTN with the 16×16 tile grid overlaid

And since you're certainly wondering about all these black tiles at the edges: Yes, these are not only part of the file and pad it from the required 240×192 pixels to 256×256, but also kept in memory during a stage, wasting 9.5 KiB of conventional RAM. That's 172 seconds of potential input replay data, just for those people who might still think that we need EMS for replays.


Alright, we've got the text, we've got the faces, let's slide in the box and display it all on screen. Apparently though, we also have to blit the player and option sprites using raw, low-level master.lib function calls in the process? :thonk: This can't be right, especially because ZUN always blits the option sprite associated with the Reimu-A shot type, regardless of which one the player actually selected. And if you keep moving above the box area before the dialog starts, you get to see exactly how wrong this is:

Let's look closer at Reimu's sprite during the slide-in animation, and in the two frames before:

Zoomed-in area around Reimu's sprite from frame 35 of the video aboveZoomed-in area around Reimu's sprite from frame 36 of the video aboveZoomed-in area around Reimu's sprite from frame 37 of the video above

This one image shows off no less than 4 bugs:

  1. ZUN blits the stationary player sprite here, regardless of whether the player was previously moving left or right. This is a nice way of indicating that Reimu stops moving once the dialog starts, but maybe ZUN should have unblitted the old sprite so that the new one wouldn't have appeared on top. The game only unblits the 384×64 pixels covered by the dialog box on every frame of the slide-in animation, so Reimu would only appear correctly if her sprite happened to be entirely located within that area.
  2. All sprites are shifted up by 1 pixel in frame 2️⃣. This one is not a bug in the dialog system, but in the main game loop. The game runs the relevant actions in the following order:

    1. Invalidate any map tiles covered by entities
    2. Redraw invalidated tiles
    3. Decrement the Y coordinate at the top of VRAM according to the scroll speed
    4. Update and render all game entities
    5. Scroll in new tiles as necessary according to the scroll speed, and report whether the game has scrolled one pixel past the end of the map
    6. If that happened, pretend it didn't by incrementing the value calculated in #3 for all further frames and skipping to #8.
    7. Issue a GDC SCROLL command to reflect the line calculated in #3 on the display
    8. Wait for VSync
    9. Flip VRAM pages
    10. Start boss if we're past the end of the map

    The problem here: Once the dialog starts, the game has already rendered an entire new frame, with all sprites being offset by a new Y scroll offset, without adjusting the graphics GDC's scroll registers to compensate. Hence, the Y position in 3️⃣ is the correct one, and the whole existence of frame 2️⃣ is a bug in itself. (Well… OK, probably a quirk because speedrunning exists, and it would be pretty annoying to synchronize any video regression tests of the future TH02 Anniversary Edition if it renders one fewer frame in the middle of a stage.)

  3. ZUN blits the option sprites to their position from frame 1️⃣. This brings us back to 📝 TH02's special way of retaining the previous and current position in a two-element array, indexed with a VRAM page ID. Normally, this would be equivalent to using dedicated prev and cur structure fields and you'd just index it with the back page for every rendering call. But if you then decide to go single-buffered for dialogs and render them onto the front page instead… :zunpet:
    Note that fixing bug #2 would not cancel out this one – the sprites would then simply be rendered to their position in the frame before 1️⃣.

  4. And of course, the fixed option sprite ID also counts as a bug.

As for the boxes themselves, it's yet another loop that prints 2-byte chunks of Shift-JIS text at an even slower fixed interval of 3 frames. In an interesting quirk though, ZUN assumes that every box starts with the name of the speaking character in its first two fullwidth Shift-JIS characters, followed by a fullwidth colon. These 6 bytes are displayed immediately at the start of every box, without the usual delay. The resulting alignment looks rather janky with Genjii, whose single right-padded kanji looks quite awkward with the fullwidth space between the name and the colon. Kind of makes you wonder why ZUN just didn't spell out his proper name, 玄爺, instead, but I get the stylistic difference.
In Stage 4, the two-kanji assumption then breaks with Marisa's three-kanji name, which causes the full-width colon to be printed as the first delayed character in each of her boxes:


That's all the issues and quirks in the system itself. The scripts themselves don't leave much room for bugs as they basically just loop over the hardcoded face ID array at this level… until we reach the end of the game. Previously, the slide-in animation could simply use the tile invalidation and re-rendering system to unblit the box on each frame, which also explained why Reimu had to be separately rendered on top. But this no longer works with a custom-rendered boss background, and so the game just chooses to flood-fill the area with graphics chip color #0:

Then again, transferring pixels from the back page would be just as wrong as they lag one frame behind. No way around capturing these 384×64 pixels to main memory here… Oh well, this flood-fill at least adds even more legibility on top of the already half-transparent text box. A property that the following dialog sequence unfortunately lacks…

For Mima's final defeat dialog though, ZUN chose to not even show the box. He might have realized the issue by that point, or simply preferred the more dramatic effect this had on the lines. The resulting issues, however, might even have ramifications for such un-technical things as lore and character dynamics. :zunpet: As it turns out, the code for this dialog sequence does in fact render Mima's smiling face for all boxes?! You only don't see it in the original game because it's rendered to the other VRAM page that remains invisible during the dialog sequence:

Caution, flashing lights.

Here's how I interpret the situation:

So, the future TH02 Anniversary Edition will fix the bug by showing the back page, but retain the quirk by rewriting the dialog code to not blit the face.


And with that, we've secured all in-game dialog for the upcoming non-ASCII translations! The remaining 2/3 of the last push made for a good occasion to also decompile the small amount of code related to TH03's win messages, stored in the @0?TX.TXT files. Similar to TH02's dialog format, these files are also split into fixed-size blocks of 3×60 bytes. But this time, TH03 loads all 60 bytes of a line, including the CR/LF line breaking codepoints in the original files, into the statically allocated buffer that it renders from. These control characters are then only filtered to whitespace by ZUN's graph_putsa_fx() function. If you remove the line breaks, you get to use the full 60 bytes on every line.
The final commits went to the MIKO.CFG loading and saving functions used in TH04's and TH05's OP.EXE, as well as TH04's game startup code to finally catch up with 📝 TH05's counterpart from over 3 years ago. This brought us right in front of the main menu rendering code in both TH04 and TH05, which is identical in both games and will be tackled in the next PC-98 Touhou delivery.

Next up, though: Returning to Shuusou Gyoku, and adding support for SC-88Pro recordings as BGM. Which may or may not come with a slight controversy…

📝 Posted:
💰 Funded by:
JonathKane, Blue Bolt, [Anonymous]
🏷️ Tags:

Well, well. My original plan was to ship the first step of Shuusou Gyoku OpenGL support on the next day after this delivery. But unfortunately, the complications just kept piling up, to a point where the required solutions definitely blow the current budget for that goal. I'm currently sitting on over 70 commits that would take at least 5 pushes to deliver as a meaningful release, and all of that is just rearchitecting work, preparing the game for a not too Windows-specific OpenGL backend in the first place. I haven't even written a single line of OpenGL yet… 🥲
This shifts the intended Big Release Month™ to June after all. Now I know that the next round of Shuusou Gyoku features should better start with the SC-88Pro recordings, which are much more likely to get done within their current budget. At least I've already completed the configuration versioning system required for that goal, which leaves only the actual audio part.

So, TH04 position independence. Thanks to a bit of funding for stage dialogue RE, non-ASCII translations will soon become viable, which finally presents a reason to push TH04 to 100% position independence after 📝 TH05 had been there for almost 3 years. I haven't heard back from Touhou Patch Center about how much they want to be involved in funding this goal, if at all, but maybe other backers are interested as well.
And sure, it would be entirely possible to implement non-ASCII translations in a way that retains the layout of the original binaries and can be easily compared at a binary level, in case we consider translations to be a critical piece of infrastructure. This wouldn't even just be an exercise in needless perfectionism, and we only have to look to Shuusou Gyoku to realize why: Players expected that my builds were compatible with existing SpoilerAL SSG files, which was something I hadn't even considered the need for. I mean, the game is open-source 📝 and I made it easy to build. You can just fork the code, implement all the practice features you want in a much more efficient way, and I'd probably even merge your code into my builds then?
But I get it – recompiling the game yields just yet another build that can't be easily compared to the original release. A cheat table is much more trustworthy in giving players the confidence that they're still practicing the same original game. And given the current priorities of my backers, it'll still take a while for me to implement proof by replay validation, which will ultimately free every part of the community from depending on the original builds of both Seihou and PC-98 Touhou.

However, such an implementation within the original binary layout would significantly drive up the budget of non-ASCII translations, and I sure don't want to constantly maintain this layout during development. So, let's chase TH04 position independence like it's 2020, and quickly cover a larger amount of PI-relevant structures and functions at a shallow level. The only parts I decompiled for now contain calculations whose intent can't be clearly communicated in ASM. Hitbox visualizations or other more in-depth research would have to wait until I get to the proper decompilation of these features.
But even this shallow work left us with a large amount of TH04-exclusive code that had its worst parts RE'd and could be decompiled fairly quickly. If you want to see big TH04 finalization% gains, general TH04 progress would be a very good investment.


The first push went to the often-mentioned stage-specific custom entities that share a single statically allocated buffer. Back in 2020, I 📝 wrongly claimed that these were a TH05 innovation, but the system actually originated in TH04. Both games use a 26-byte structure, but TH04 only allocates a 32-element array rather than TH05's 64-element one. The conclusions from back then still apply, but I also kept wondering why these games used a static array for these entities to begin with. You know what they call an area of memory that you can cleanly repurpose for things? That's right, a heap! :tannedcirno: And absolutely no one would mind one additional heap allocation at the start of a stage, next to the ones for all the sprites and portraits.
However, we are still running in Real Mode with segmented memory. Accessing anything outside a common data segment involves modifying segment registers, which has a nonzero CPU cycle cost, and Turbo C++ 4.0J is terrible at optimizing away the respective instructions. Does this matter? Probably not, but you don't take "risks" like these if you're in a permanent micro-optimization mindset… :godzun:

In TH04, this system is used for:

  1. Kurumi's symmetric bullet spawn rays, fired from her hands towards the left and right edges of the playfield. These are rather infamous for being the last thing you see before 📝 the Divide Error crash that can happen in ZUN's original build. Capped to 6 entities.

  2. The 4 📝 bits used in Marisa's Stage 4 boss fight. Coincidentally also related to the rare Divide Error crash in that fight.

  3. Stage 4 Reimu's spinning orbs. Note how the game uses two different sets of sprites just to have two different outline colors. This was probably better than messing with the palette, which can easily cause unintended effects if you only have 16 colors to work with. Heck, I have an entire blog post tag just to highlight these cases. Capped to the full 32 entities.

  4. The chasing cross bullets, seen in Phase 14 of the same Stage 6 Yuuka fight. Featuring some smart sprite work, making use of point symmetry to achieve a fluid animation in just 4 frames. This is good-code in sprite form. Capped to 31 entities, because the 32nd custom entity during this fight is defined to be…

  5. The single purple pulsating and shrinking safety circle, seen in Phase 4 of the same fight. The most interesting aspect here is actually still related to the cross bullets, whose spawn function is wrongly limited to 32 entities and could theoretically overwrite this circle. :zunpet: This is strictly landmine territory though:

    • Yuuka never uses these bullets and the safety circle simultaneously
    • She never spawns more than 24 cross bullets
    • All cross bullets are fast enough to have left the screen by the time Yuuka restarts the corresponding subpattern
    • The cross bullets spawn at Yuuka's center position, and assign its Q12.4 coordinates to structure fields that the safety circle interprets as raw pixels. The game does try to render the circle afterward, but since Yuuka's static position during this phase is nowhere near a valid pixel coordinate, it is immediately clipped.
  6. The flashing lines seen in Phase 5 of the Gengetsu fight, telegraphing the slightly random bullet columns.

    The spawn column lines in the TH05 Gengetsu fight, in the first of their two flashing colors.The spawn column lines in the TH05 Gengetsu fight, in the second of their two flashing colors.

These structures only took 1 push to reverse-engineer rather than the 2 I needed for their TH05 counterparts because they are much simpler in this game. The "structure" for Gengetsu's lines literally uses just a single X position, with the remaining 24 bytes being basically padding. The only minor bug I found on this shallow level concerns Marisa's bits, which are clipped at the right and bottom edges of the playfield 16 pixels earlier than you would expect:


The remaining push went to a bunch of smaller structures and functions:


To top off the second push, we've got the vertically scrolling checkerboard background during the Stage 6 Yuuka fight, made up of 32×32 squares. This one deserves a special highlight just because of its needless complexity. You'd think that even a performant implementation would be pretty simple:

  1. Set the GRCG to TDW mode
  2. Set the GRCG tile to one of the two square colors
  3. Start with Y as the current scroll offset, and X as some indicator of which color is currently shown at the start of each row of squares
  4. Iterate over all lines of the playfield, filling in all pixels that should be displayed in the current color, skipping over the other ones
  5. Count down Y for each line drawn
  6. If Y reaches 0, reset it to 32 and flip X
  7. At the bottom of the playfield, change the GRCG tile to the other color, and repeat with the initial value of X flipped

The most important aspect of this algorithm is how it reduces GRCG state changes to a minimum, avoiding the costly port I/O that we've identified time and time again as one of the main bottlenecks in TH01. With just 2 state variables and 3 loops, the resulting code isn't that complex either. A naive implementation that just drew the squares from top to bottom in a single pass would barely be simpler, but much slower: By changing the GRCG tile on every color, such an implementation would burn a low 5-digit number of CPU cycles per frame for the 12×11.5-square checkerboard used in the game.
And indeed, ZUN retained all important aspects of this algorithm… but still implemented it all in ASM, with a ridiculous layer of x86 segment arithmetic on top? :zunpet: Which blows up the complexity to 4 state variables, 5 nested loops, and a bunch of constants in unusual units. I'm not sure what this code is supposed to optimize for, especially with that rather questionable register allocation that nevertheless leaves one of the general-purpose registers unused. :onricdennat: Fortunately, the function was still decompilable without too many code generation hacks, and retains the 5 nested loops in all their goto-connected glory. If you want to add a checkerboard to your next PC-98 demo, just stick to the algorithm I gave above.
(Using a single XOR for flipping the starting X offset between 32 and 64 pixels is pretty nice though, I have to give him that.)


This makes for a good occasion to talk about the third and final GRCG mode, completing the series I started with my previous coverage of the 📝 RMW and 📝 TCR modes. The TDW (Tile Data Write) mode is the simplest of the three and just writes the 8×1 GRCG tile into VRAM as-is, without applying any alpha bitmask. This makes it perfect for clearing rectangular areas of pixels – or even all of VRAM by doing a single memset():

// Set up the GRCG in TDW mode.
outportb(0x7C, 0x80);

// Fill the tile register with color #7 (0111 in binary).
outportb(0x7E, 0xFF); // Plane 0: (B): (********)
outportb(0x7E, 0xFF); // Plane 1: (R): (********)
outportb(0x7E, 0xFF); // Plane 2: (G): (********)
outportb(0x7E, 0x00); // Plane 3: (E): (        )

// Set the 32 pixels at the top-left corner of VRAM to the exact contents of
// the tile register, effectively repeating the tile 4 times. In TDW mode, the
// GRCG ignores the CPU-supplied operand, so we might as well just pass the
// contents of a register with the intended width. This eliminates useless load
// instructions in the compiled assembly, and even sort of signals to readers
// of this code that we do not care about the source value.
*reinterpret_cast<uint32_t far *>(MK_FP(0xA800, 0)) = _EAX;

// Fill the entirety of VRAM with the GRCG tile. A simple C one-liner that will
// probably compile into a single `REP STOS` instruction. Unfortunately, Turbo
// C++ 4.0J only ever generates the 16-bit `REP STOSW` here, even when using
// the `__memset__` intrinsic and when compiling in 386 mode. When targeting
// that CPU and above, you'd ideally want `REP STOSD` for twice the speed.
memset(MK_FP(0xA800, 0), _AL, ((640 / 8) * 400));

However, this might make you wonder why TDW mode is even necessary. If it's functionally equivalent to RMW mode with a CPU-supplied bitmask made up entirely of 1 bits (i.e., 0xFF, 0xFFFF, or 0xFFFFFFFF), what's the point? The difference lies in the hardware implementation: If all you need to do is write tile data to VRAM, you don't need the read and modify parts of RMW mode which require additional processing time. The PC-9801 Programmers' Bible claims a speedup of almost 2× when using TDW mode over equivalent operations in RMW mode.
And that's the only performance claim I found, because none of these old PC-98 hardware and programming books did any benchmarks. Then again, it's not too interesting of a question to benchmark either, as the byte-aligned nature of TDW blitting severely limits its use in a game engine anyway. Sure, maybe it makes sense to temporarily switch from RMW to TDW mode if you've identified a large rectangular and byte-aligned section within a sprite that could be blitted without a bitmask? But the necessary identification work likely nullifies the performance gained from TDW mode, I'd say. In any case, that's pretty deep micro-optimization territory. Just use TDW mode for the few cases it's good at, and stick to RMW mode for the rest.

So is this all that can be said about the GRCG? Not quite, because there are 4 bits I haven't talked about yet…


And now we're just 5.37% away from 100% position independence for TH04! From this point, another 2 pushes should be enough to reach this goal. It might not look like we're that close based on the current estimate, but a big chunk of the remaining numbers are false positives from the player shot control functions. Since we've got a very special deadline to hit, I'm going to cobble these two pushes together from the two current general subscriptions and the rest of the backlog. But you can, of course, still invest in this goal to allow the existing contributions to go to something else.
… Well, if the store was actually open. :thonk: So I'd better continue with a quick task to free up some capacity sooner rather than later. Next up, therefore: Back to TH02, and its item and player systems. Shouldn't take that long, I'm not expecting any surprises there. (Yeah, I know, famous last words…)

📝 Posted:
💰 Funded by:
rosenrose, Blue Bolt, Splashman, -Tom-, Yanga, Enderwolf, 32th System
🏷️ Tags:

More than three months without any reverse-engineering progress! It's been way too long. Coincidentally, we're at least back with a surprising 1.25% of overall RE, achieved within just 3 pushes. The ending script system is not only more or less the same in TH04 and TH05, but actually originated in TH03, where it's also used for the cutscenes before stages 8 and 9. This means that it was one of the final pieces of code shared between three of the four remaining games, which I got to decompile at roughly 3× the usual speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The Music Room is largely equivalent in all three remaining games as well, and the sound device selection, ZUN Soft logo screens, and main/option menus are the same in TH04 and TH05. A lot of that code is in the "technically RE'd but not yet decompiled" ASM form though, so it would shift Finalized% more significantly than RE%. Therefore, make sure to order the new Finalization option rather than Reverse-engineering if you want to make number go up.

  1. General overview
  2. Game-specific differences
  3. Command reference
  4. Thoughts about translation support

So, cutscenes. On the surface, the .TXT files look simple enough: You directly write the text that should appear on the screen into the file without any special markup, and add commands to define visuals, music, and other effects at any place within the script. Let's start with the basics of how text is rendered, which are the same in all three games:


Superficially, the list of game-specific differences doesn't look too long, and can be summarized in a rather short table:

:th03: TH03 :th04: TH04 :th05: TH05
Script size limit 65536 bytes (heap-allocated) 8192 bytes (statically allocated)
Delay between every 2 bytes of text 1 frame by default, customizable via \v None
Text delay when holding ESC Varying speed-up factor None
Visibility of new text Immediately typed onto the screen Rendered onto invisible VRAM page, faded in on wait commands
Visibility of old text Unblitted when starting a new box Left on screen until crossfaded out with new text
Key binding for advancing the script Any key ⏎ Return, Shot, or ESC
Animation while waiting for an advance key None ⏎⃣, past right edge of current row
Inexplicable delays None 1 frame before changing pictures and after rendering new text boxes
Additional delay per interpreter loop 614.4 µs None 614.4 µs
The 614.4 µs correspond to the necessary delay for working around the repeated key up and key down events sent by PC-98 keyboards when holding down a key. While the absence of this delay significantly speeds up TH04's interpreter, it's also the reason why that game will stop recognizing a held ESC key after a few seconds, requiring you to press it again.

It's when you get into the implementation that the combined three systems reveal themselves as a giant mess, with more like 56 differences between the games. :zunpet: Every single new weird line of code opened up another can of worms, which ultimately made all of this end up with 24 pieces of bloat and 14 bugs. The worst of these should be quite interesting for the general PC-98 homebrew developers among my audience:


That brings us to the individual script commands… and yes, I'm going to document every single one of them. Some of their interactions and edge cases are not clear at all from just looking at the code.

Almost all commands are preceded by… well, a 0x5C lead byte. :thonk: Which raises the question of whether we should document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded ¥ yen sign. From a gaijin perspective, it seems obvious that it's a backslash, as it's consistently displayed as one in most of the editors you would actually use nowadays. But interestingly, iconv -f shift-jis -t utf-8 does convert any 0x5C lead bytes to actual ¥ U+00A5 YEN SIGN code points :tannedcirno:.
Ultimately, the distinction comes down to the font. There are fonts that still render 0x5C as ¥, but mainly do so out of an obvious concern about backward compatibility to JIS X 0201, where this mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho, the old Japanese fonts from Windows 3.1, but even Meiryo and Yu Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much every other modern font, and freely licensed ones in particular, render this code point as \, even if you set your editor to Shift-JIS. And while ZUN most definitely saw it as a ¥, documenting this code point as \ is less ambiguous in the long run. It can only possibly correspond to one specific code point in either Shift-JIS or UTF-8, and will remain correct even if we later mod the cutscene system to support full-blown Unicode.

Now we've only got to clarify the parameter syntax, and then we can look at the big table of commands:

:th03: :th04: :th05: \@ Clears both VRAM pages by filling them with VRAM color 0.
🐞 In TH03 and TH04, this command does not update the internal text area background used for unblitting. This bug effectively restricts usage of this command to either the beginning of a script (before the first background image is shown) or its end (after no more new text boxes are started). See the image below for an example of using it anywhere else.
:th03: :th04: :th05: \b2 Sets the font weight to a value between 0 (raw font ROM glyphs) to 3 (very thicc). Specifying any other value has no effect.
:th04: :th05: 🐞 In TH04 and TH05, \b3 leads to glitched pixels when rendering half-width glyphs due to a bug in the newly micro-optimized ASM version of 📝 graph_putsa_fx(); see the image below for an example.
In these games, the parameter also directly corresponds to the graph_putsa_fx() effect function, removing the sanity check that was present in TH03. In exchange, you can also access the four dissolve masks for the bold font (\b2) by specifying a parameter between 4 (fewest pixels) to 7 (most pixels). Demo video below.
:th03: :th04: :th05: \c15 Changes the text color to VRAM color 15.
:th05: \c=,15 Adds a color map entry: If is the first code point inside the name area on a new line, the text color is automatically set to 15. Up to 8 such entries can be registered before overflowing the statically allocated buffer.
🐞 The comma is assumed to be present even if the color parameter is omitted.
:th03: :th04: :th05: \e0 Plays the sound effect with the given ID.
:th03: :th04: :th05: \f (no-op)
:th03: :th04: :th05: \fi1
\fo1
Calls master.lib's palette_black_in() or palette_black_out() to play a hardware palette fade animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
:th03: :th04: :th05: \fm1 Fades out BGM volume via PMD's AH=02h interrupt call, in a non-blocking way. The fade speed can range from 1 (slowest) to 127 (fastest).
Values from 128 to 255 technically correspond to AH=02h's fade-in feature, which can't be used from cutscene scripts because it requires BGM volume to first be lowered via AH=19h, and there is no command to do that.
:th03: :th04: :th05: \g8 Plays a blocking 8-frame screen shake animation.
:th03: :th04: \ga0 Shows the gaiji with the given ID from 0 to 255 at the current cursor position. Even in TH03, gaiji always ignore the text delay interval configured with \v.
:th05: @3 TH05's replacement for the \ga command from TH03 and TH04. The default ID of 3 corresponds to the ♫ gaiji. Not to be confused with \@, which starts with a backslash, unlike this command.
:th05: @h Shows the 🎔 gaiji.
:th05: @t Shows the 💦 gaiji.
:th05: @! Shows the ! gaiji.
:th05: @? Shows the ? gaiji.
:th05: @!! Shows the ‼ gaiji.
:th05: @!? Shows the ⁉ gaiji.
:th03: :th04: :th05: \k0 Waits 0 frames (0 = forever) for an advance key to be pressed before continuing script execution. Before waiting, TH05 crossfades in any new text that was previously rendered to the invisible VRAM page…
🐞 …but TH04 doesn't, leaving the text invisible during the wait time. As a workaround, \vp1 can be used before \k to immediately display that text without a fade-in animation.
:th03: :th04: :th05: \m$ Stops the currently playing BGM.
:th03: :th04: :th05: \m* Restarts playback of the currently loaded BGM from the beginning.
:th03: :th04: :th05: \m,filename Stops the currently playing BGM, loads a new one from the given file, and starts playback.
:th03: :th04: :th05: \n Starts a new line at the leftmost X coordinate of the box, i.e., the start of the name area. This is how scripts can "change" the name of the currently speaking character, or use the entire 480×64 pixels without being restricted to the non-name area.
Note that automatic line breaks already move the cursor into a new line. Using this command at the "end" of a line with the maximum number of 30 full-width glyphs would therefore start a second new line and leave the previously started line empty.
If this command moved the cursor into the 5th line of a box, \s is executed afterward, with any of \n's parameters passed to \s.
:th03: :th04: :th05: \p (no-op)
:th03: :th04: :th05: \p- Deallocates the loaded .PI image.
:th03: :th04: :th05: \p,filename Loads the .PI image with the given file into the single .PI slot available to cutscenes. TH04 and TH05 automatically deallocate any previous image, 🐞 TH03 would leak memory without a manual prior call to \p-.
:th03: :th04: :th05: \pp Sets the hardware palette to the one of the loaded .PI image.
:th03: :th04: :th05: \p@ Sets the loaded .PI image as the full-screen 640×400 background image and overwrites both VRAM pages with its pixels, retaining the current hardware palette.
:th03: :th04: :th05: \p= Runs \pp followed by \p@.
:th03: :th04: :th05: \s0
\s-
Ends a text box and starts a new one. Fades in any text rendered to the invisible VRAM page, then waits 0 frames (0 = forever) for an advance key to be pressed. Afterward, the new text box is started with the cursor moved to the top-left corner of the name area.
\s- skips the wait time and starts the new box immediately.
:th03: :th04: :th05: \t100 Sets palette brightness via master.lib's palette_settone() to any value from 0 (fully black) to 200 (fully white). 100 corresponds to the palette's original colors. Preceded by a 1-frame delay unless ESC is held.
:th03: \v1 Sets the number of frames to wait between every 2 bytes of rendered text.
:th04: Sets the number of frames to spend on each of the 4 fade steps when crossfading between old and new text. The game-specific default value is also used before the first use of this command.
:th05: \v2
:th03: :th04: :th05: \vp0 Shows VRAM page 0. Completely useless in TH03 (this game always synchronizes both VRAM pages at a command boundary), only of dubious use in TH04 (for working around a bug in \k), and the games always return to their intended shown page before every blitting operation anyway. A debloated mod of this game would just remove this command, as it exposes an implementation detail that script authors should not need to worry about. None of the original scripts use it anyway.
:th03: :th04: :th05: \w64
  • \w and \wk wait for the given number of frames
  • \wm and \wmk wait until PMD has played back the current BGM for the total number of measures, including loops, given in the first parameter, and fall back on calling \w and \wk with the second parameter as the frame number if BGM is disabled.
    🐞 Neither PMD nor MMD reset the internal measure when stopping playback. If no BGM is playing and the previous BGM hasn't been played back for at least the given number of measures, this command will deadlock.
Since both TH04 and TH05 fade in any new text from the invisible VRAM page, these commands can be used to simulate TH03's typing effect in those games. Demo video below.
Contrary to \k and \s, specifying 0 frames would simply remove any frame delay instead of waiting forever.
The TH03-exclusive k variants allow the delay to be interrupted if ⏎ Return or Shot are held down. TH04 and TH05 recognize the k as well, but removed its functionality.
All of these commands have no effect if ESC is held.
\wm64,64
:th03: \wk64
\wmk64,64
:th03: :th04: :th05: \wi1
\wo1
Calls master.lib's palette_white_in() or palette_white_out() to play a hardware palette fade animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
:th03: :th04: :th05: \=4 Immediately displays the given quarter of the loaded .PI image in the picture area, with no fade effect. Any value ≥ 4 resets the picture area to black.
:th03: :th04: :th05: \==4,1 Crossfades the picture area between its current content and quarter #4 of the loaded .PI image, spending 1 frame on each of the 4 fade steps unless ESC is held. Any value ≥ 4 is replaced with quarter #0.
:th03: :th04: :th05: \$ Stops script execution. Must be called at the end of each file; otherwise, execution continues into whatever lies after the script buffer in memory.
TH05 automatically deallocates the loaded .PI image, TH03 and TH04 require a separate manual call to \p- to not leak its memory.
Bold values signify the default if the parameter is omitted; \c is therefore equivalent to \c15.
Using the \@ command in the middle of a TH03 or TH04 cutscene script
The \@ bug. Yes, the ¥ is fake. It was easier to GIMP it than to reword the sentences so that the backslashes landed on the second byte of a 2-byte half-width character pair. :onricdennat:
Cutscene font weights in TH03Cutscene font weights in TH05, demonstrating the <code>\b3</code> bug that also affects TH04Cutscene font weights in TH03, rendered at a hypothetical unaligned X positionCutscene font weights in TH05, rendered at a hypothetical unaligned X position
The font weights and effects available through \b, including the glitch with \b3 in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels at the left side of each glyph, which would extend into the previous VRAM byte. Ironically, the TH04/TH05 version is more correct in this regard: For half-width glyphs, it preserves any further pixel columns generated by the weight functions in the high byte of the 16-dot glyph variable. Unlike TH03, which still cuts them off when rendering text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them towards their correct place (4️⃣). It's only at byte-aligned X positions (2️⃣) where they remain at their internally calculated place, and appear on screen as these glitched pixel columns, 15 pixels away from the glyph they belong to. It's easy to blame bugs like these on micro-optimized ASM code, but in this instance, you really can't argue against it if the original C++ version was equally incorrect.
Combining \b and s- into a partial dissolve animation. The speed can be controlled with \v.
Simulating TH03's typing effect in TH04 and TH05 via \w. Even prettier in TH05 where we also get an additional fade animation after the box ends.

So yeah, that's the cutscene system. I'm dreading the moment I will have to deal with the other command interpreter in these games, i.e., the stage enemy system. Luckily, that one is completely disconnected from any other system, so I won't have to deal with it until we're close to finishing MAIN.EXE… that is, unless someone requests it before. And it won't involve text encodings or unblitting…


The cutscene system got me thinking in greater detail about how I would implement translations, being one of the main dependencies behind them. This goal has been on the order form for a while and could soon be implemented for these cutscenes, with 100% PI being right around the corner for the TH03 and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching for Latin-script languages could be implemented fairly quickly:

  1. Establish basic UTF-8 parsing for less painful manual editing of the source files
  2. Procedurally generate glyphs for the few required additional letters based on existing font ROM glyphs. For example, we'd generate ä by painting two short lines on top of the font ROM's a glyph, or generate ¿ by vertically flipping the question mark. This way, the text retains a consistent look regardless of whether the translated game is run with an NEC or EPSON font ROM, or the hideous abomination that Neko Project II auto-generates if you don't provide either.
  3. (Optional) Change automatic line breaks to work on a per-word basis, rather than per-glyph

That's it – script editing and distribution would be handled by your local translation group. It might seem as if this would also work for Greek and Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not sure if I want to attempt procedurally shrinking these glyphs from 16×16 to 8×16… For any more thorough solution, we'd need to go for a more "Chad" kind of full-blown translation support:

  1. Implement text subdivisions at a sensible granularity while retaining automatic line and box breaks
  2. Compile translatable text into a Japanese→target language dictionary (I'm too old to develop any further translation systems that would overwrite modded source text with translations of the original text)
  3. Implement a custom Unicode font system (glyphs would be taken from GNU Unifont unless translators provide a different 8×16 font for their language)
  4. Combine the text compiler with the font compiler to only store needed glyphs as part of the translation's font file (dealing with a multi-MB font file would be rather ugly in a Real Mode game)
  5. Write a simple install/update/patch stacking tool that supports both .HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to warrant a separate tool – each patch stack would be statically compiled into a single package file in the game's directory)
  6. Add a nice language selection option to the main menu
  7. (Optional) Support proportional fonts

Which sounds more like a separate project to be commissioned from Touhou Patch Center's Open Collective funds, separate from the ReC98 cap. This way, we can make sure that the feature is completely implemented, and I can talk with every interested translator to make sure that their language works.
It's still cheaper overall to do this on PC-98 than to first port the games to a modern system and then translate them. On the other hand, most of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with the difficulty of getting arbitrary Unicode characters to work natively in a PC-98 DOS game at all, and would be either unnecessary or trivial if we had already ported the game. Depending on where the patrons' interests lie, it may not be worth it. So let's see what all of you think about which way we should go, or whether it's worth doing at all. (Edit (2022-12-01): With Splashman's order towards the stage dialogue system, we've pretty much confirmed that it is.) Maybe we want to meet in the middle – using e.g. procedural glyph generation for dynamic translations to keep text rendering consistent with the rest of the PC-98 system, and just not support non-Latin-script languages in the beginning? In any case, I've added both options to the order form.
Edit (2023-07-28): Touhou Patch Center has agreed to fund a basic feature set somewhere between the Virgin and Chad level. Check the 📝 dedicated announcement blog post for more details and ideas, and to find out how you can support this goal!


Surprisingly, there was still a bit of RE work left in the third push after all of this, which I filled with some small rendering boilerplate. Since I also wanted to include TH02's playfield overlay functions, 1/15 of that last push went towards getting a TH02-exclusive function out of the way, which also ended up including that game in this delivery. :tannedcirno:
The other small function pointed out how TH05's Stage 5 midboss pops into the playfield quite suddenly, since its clipping test thinks it's only 32 pixels tall rather than 64:

Good chance that the pop-in might have been intended.
Edit (2023-06-30): Actually, it's a 📝 systematic consequence of ZUN having to work around the lack of clipping in master.lib's sprite functions.
There's even another quirk here: The white flash during its first frame is actually carried over from the previous midboss, which the game still considers as actively getting hit by the player shot that defeated it. It's the regular boilerplate code for rendering a midboss that resets the responsible damage variable, and that code doesn't run during the defeat explosion animation.

Next up: Staying with TH05 and looking at more of the pattern code of its boss fights. Given the remaining TH05 budget, it makes the most sense to continue in in-game order, with Sara and the Stage 2 midboss. If more money comes in towards this goal, I could alternatively go for the Mai & Yuki fight and immediately develop a pretty fix for the cheeto storage glitch. Also, there's a rather intricate pull request for direct ZMBV decoding on the website that I've still got to review…

📝 Posted:
💰 Funded by:
nrook, -Tom-, [Anonymous]
🏷️ Tags:

The important things first:

So, Shinki! As far as final boss code is concerned, she's surprisingly economical, with 📝 her background animations making up more than ⅓ of her entire code. Going straight from TH01's 📝 final 📝 bosses to TH05's final boss definitely showed how much ZUN had streamlined danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there is still room for improvement: TH05 not only 📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month, but also uses them 4× as often, and even for midbosses. Most importantly though, defining danmaku patterns using a single global instance of the group template structure is just bad no matter how you look at it:

Declaring a separate structure instance with the static data for every pattern would be both safer and more space-efficient, and there's more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow. The "devil" pattern is significantly more complex than the others, but still far from TH01's final bosses at their worst. I especially like the clear architectural separation between "one-shot pattern" functions that return true once they're done, and "looping pattern" functions that run as long as they're being called from a boss's main function. Not many all too interesting things in these pattern functions for the most part, except for two pieces of evidence that Shinki was coded after Yumeko:


Speaking about that wing sprite: If you look at ST05.BB2 (or any other file with a large sprite, for that matter), you notice a rather weird file layout:

Raw file layout of TH05's ST05.BB2, demonstrating master.lib's supposed BFNT width limit of 64 pixels
A large sprite split into multiple smaller ones with a width of 64 pixels each? What's this, hardware sprite limitations? On my PC-98?!

And it's not a limitation of the sprite width field in the BFNT+ header either. Instead, it's master.lib's BFNT functions which are limited to sprite widths up to 64 pixels… or at least that's what MASTER.MAN claims. Whatever the restriction was, it seems to be completely nonexistent as of master.lib version 0.23, and none of the master.lib functions used by the games have any issues with larger sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the game that expects Shinki's winged form to consist of 4 physical sprites, not just 1. Any conversion from another, more logical sprite sheet layout back into BFNT+ must therefore replicate the original number of sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly loaded sprite no longer match ZUN's hardcoded IDs, causing the game to crash. This is exactly what used to happen with -Tom-'s MysticTK automation scripts, which combined these exact sprites into a single large one. This issue has now been fixed – just in case there are some underground modders out there who used these scripts and wonder why their game crashed as soon as the Shinki fight started.


And then the code quality takes a nosedive with Shinki's main function. :onricdennat: Even in TH05, these boss and midboss update functions are still very imperative:

The biggest WTF in there, however, goes to using one of the 16 state bytes as a "relative phase" variable for differentiating between boss phases that share the same branch within the switch(boss.phase) statement. While it's commendable that ZUN tried to reduce code duplication for once, he could have just branched depending on the actual boss.phase variable? The same state byte is then reused in the "devil" pattern to track the activity state of the big jerky lasers in the second half of the pattern. If you somehow managed to end the phase after the first few bullets of the pattern, but before these lasers are up, Shinki's update function would think that you're still in the phase before the "devil" pattern. The main function then sequence-breaks right to the defeat phase, skipping the final pattern with the burning Makai background. Luckily, the HP boundaries are far away enough to make this impossible in practice.
The takeaway here: If you want to use the state bytes for your custom boss script mods, alias them to your own 16-byte structure, and limit each of the bytes to a clearly defined meaning across your entire boss script.

One final discovery that doesn't seem to be documented anywhere yet: Shinki actually has a hidden bomb shield during her two purple-wing phases. uth05win got this part slightly wrong though: It's not a complete shield, and hitting Shinki will still deal 1 point of chip damage per frame. For comparison, the first phase lasts for 3,000 HP, and the "devil" pattern phase lasts for 5,800 HP.

And there we go, 3rd PC-98 Touhou boss script* decompiled, 28 to go! 🎉 In case you were expecting a fix for the Shinki death glitch: That one is more appropriately fixed as part of the Mai & Yuki script. It also requires new code, should ideally look a bit prettier than just removing cheetos between one frame and the next, and I'd still like it to fit within the original position-dependent code layout… Let's do that some other time.
Not much to say about the Stage 1 midboss, or midbosses in general even, except that their update functions have to imperatively handle even more subsystems, due to the relative lack of helper functions.


The remaining ¾ of the third push went to a bunch of smaller RE and finalization work that would have hardly got any attention otherwise, to help secure that 50% RE mark. The nicest piece of code in there shows off what looks like the optimal way of setting up the 📝 GRCG tile register for monochrome blitting in a variable color:

mov ah, palette_index ; Any other non-AL 8-bit register works too.
                      ; (x86 only supports AL as the source operand for OUTs.)

rept 4                ; For all 4 bitplanes…
    shr ah,  1        ; Shift the next color bit into the x86 carry flag
    sbb al,  al       ; Extend the carry flag to a full byte
                      ; (CF=0 → 0x00, CF=1 → 0xFF)
    out 7Eh, al       ; Write AL to the GRCG tile register
endm

Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles into a surprisingly nice one-liner. What a beautiful micro-optimization, at a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there, becoming increasingly dumb and undecompilable. Was it really necessary to save 4 x86 instructions in the highly unlikely case of a new spark sprite being spawned outside the playfield? That one 2D polar→Cartesian conversion function then pointed out Turbo C++ 4.0J's woefully limited support for 32-bit micro-optimizations. The code generation for 32-bit 📝 pseudo-registers is so bad that they almost aren't worth using for arithmetic operations, and the inline assembler just flat out doesn't support anything 32-bit. No use in decompiling a function that you'd have to entirely spell out in machine code, especially if the same function already exists in multiple other, more idiomatic C++ variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC replay file reading code, which should finally prove that nothing about the game's original replay system could serve as even just the foundation for community-usable replays. Just in case anyone was still thinking that.


Next up: Back to TH01, with the Elis fight! Got a bit of room left in the cap again, and there are a lot of things that would make a lot of sense now:

📝 Posted:
💰 Funded by:
Lmocinemod, [Anonymous], Yanga
🏷️ Tags:

Been 📝 a while since we last looked at any of TH03's game code! But before that, we need to talk about Y coordinates.

During TH03's MAIN.EXE, the PC-98 graphics GDC runs in its line-doubled 640×200 resolution, which gives the in-game portion its distinctive stretched low-res look. This lower resolution is a consequence of using 📝 Promisence Soft's SPRITE16 driver: Its performance simply stems from the fact that it expects sprites to be stored in the bottom half of VRAM, which allows them to be blitted using the same EGC-accelerated VRAM-to-VRAM copies we've seen again and again in all other games. Reducing the visible resolution also means that the sprites can be stored on both VRAM pages, allowing the game to still be double-buffered. If you force the graphics chip to run at 640×400, you can see them:

The full VRAM contents during TH03's in-game portion, as seen when forcing the system into a 640×400 resolution.
TH03's VRAM at regular line-doubled 640×200 resolutionTH03's VRAM at full 640×400 resolution, including the SPRITE16 sprite areaTH03's text layer during an in-game round.

Note that the text chip still displays its overlaid contents at 640×400, which means that TH03's in-game portion technically runs at two resolutions at the same time.

But that means that any mention of a Y coordinate is ambiguous: Does it refer to undoubled VRAM pixels, or on-screen stretched pixels? Especially people who have known about the line doubling for years might almost expect technical blog posts on this game to use undoubled VRAM coordinates. So, let's introduce a new formatting convention for both on-screen 640×400 and undoubled 640×200 coordinates, and always write out both to minimize the confusion.


Alright, now what's the thing gonna be? The enemy structure is highly overloaded, being used for enemies, fireballs, and explosions with seemingly different semantics for each. Maybe a bit too much to be figured out in what should ideally be a single push, especially with all the functions that would need to be decompiled? Bullet code would be easier, but not exactly single-push material either. As it turns out though, there's something more fundamental left to be done first, which both of these subsystems depend on: collision detection!

And it's implemented exactly how I always naively imagined collision detection to be implemented in a fixed-resolution 2D bullet hell game with small hitboxes: By keeping a separate 1bpp bitmap of both playfields in memory, drawing in the collidable regions of all entities on every frame, and then checking whether any pixels at the current location of the player's hitbox are set to 1. It's probably not done in the other games because their single data segment was already too packed for the necessary 17,664 bytes to store such a bitmap at pixel resolution, and 282,624 bytes for a bitmap at Q12.4 subpixel resolution would have been prohibitively expensive in 16-bit Real Mode DOS anyway. In TH03, on the other hand, this bitmap is doubly useful, as the AI also uses it to elegantly learn what's on the playfield. By halving the resolution and only tracking tiles of 2×2 / 2×1 pixels, TH03 only requires an adequate total of 6,624 bytes of memory for the collision bitmaps of both playfields.

So how did the implementation not earn the good-code tag this time? Because the code for drawing into these bitmaps is undecompilable hand-written x86 assembly. :zunpet: And not just your usual ASM that was basically compiled from C and then edited to maybe optimize register allocation and maybe replace a bunch of local variables with self-modifying code, oh no. This code is full of overly clever bit twiddling, abusing the fact that the 16-bit AX, BX, CX, and DX registers can also be accessed as two 8-bit registers, calculations that change the semantic meaning behind the value of a register, or just straight-up reassignments of different values to the same small set of registers. Sure, in some way it is impressive, and it all does work and correctly covers every edge case, but come on. This could have all been a lot more readable in exchange for just a few CPU cycles.

What's most interesting though are the actual shapes that these functions draw into the collision bitmap. On the surface, we have:

  1. vertical slopes at any angle across the whole playfield; exclusively used for Chiyuri's diagonal laser EX attack
  2. straight vertical lines, with a width of 1 tile; exclusively used for the 2×2 / 2×1 hitboxes of bullets
  3. rectangles at arbitrary sizes

But only 2) actually draws a full solid line. 1) and 3) are only ever drawn as horizontal stripes, with a hardcoded distance of 2 vertical tiles between every stripe of a slope, and 4 vertical tiles between every stripe of a rectangle. That's 66-75% of each rectangular entity's intended hitbox not actually taking part in collision detection. Now, if player hitboxes were ≤ 6 / 3 pixels, we'd have one possible explanation of how the AI can "cheat", because it could just precisely move through those blank regions at TAS speeds. So, let's make this two pushes after all and tell the complete story, since this is one of the more interesting aspects to still be documented in this game.


And the code only gets worse. :godzun: While the player collision detection function is decompilable, it might as well not have been, because it's just more of the same "optimized", hard-to-follow assembly. With the four splittable 16-bit registers having a total of 20 different meanings in this function, I would have almost preferred self-modifying code…

In fact, it was so bad that it prompted some maintenance work on my inline assembly coding standards as a whole. Turns out that the _asm keyword is not only still supported in modern Visual Studio compilers, but also in Clang with the -fms-extensions flag, and compiles fine there even for 64-bit targets. While that might sound like amazing news at first ("awesome, no need to rewrite this stuff for my x86_64 Linux port!"), you quickly realize that almost all inline assembly in this codebase assumes either PC-98 hardware, segmented 16-bit memory addressing, or is a temporary hack that will be removed with further RE progress.
That's mainly because most of the raw arithmetic code uses Turbo C++'s register pseudovariables where possible. While they certainly have their drawbacks, being a non-standard extension that's not supported in other x86-targeting C compilers, their advantages are quite significant: They allow this code to stay in the same language, and provide slightly more immediate portability to any other architecture, together with 📝 readability and maintainability improvements that can get quite significant when combined with inlining:

// This one line compiles to five ASM instructions, which would need to be
// spelled out in any C compiler that doesn't support register pseudovariables.
// By adding typed aliases for these registers via `#define`, this code can be
// both made even more readable, and be prepared for an easier transformation
// into more portable local variables.
_ES = (((_AX * 4) + _BX) + SEG_PLANE_B);

However, register pseudovariables might cause potential portability issues as soon as they are mixed with inline assembly instructions that rely on their state. The lazy way of "supporting pseudo-registers" in other compilers would involve declaring the full set as global variables, which would immediately break every one of those instances:

_DI = 0;
_AX = 0xFFFF;

// Special x86 instruction doing the equivalent of
//
// 	*reinterpret_cast<uint16_t far *>(MK_FP(_ES, _DI)) = _AX;
// 	_DI += sizeof(uint16_t);
//
// Only generated by Turbo C++ in very specific cases, and therefore only
// reliably available through inline assembly.
asm { movsw; }

What's also not all too standardized, though, are certain variants of the asm keyword. That's why I've now introduced a distinction between the _asm keyword for "decently sane" inline assembly, and the slightly less standard asm keyword for inline assembly that relies on the contents of pseudo-registers, and should break on compilers that don't support them.
So yeah, have some minor portability work in exchange for these two pushes not having all that much in RE'd content.

With that out of the way and the function deciphered, we can confirm the player hitboxes to be a constant 8×8 / 8×4 pixels, and prove that the hit stripes are nothing but an adequate optimization that doesn't affect gameplay in any way.


And what's the obvious thing to immediately do if you have both the collision bitmap and the player hitbox? Writing a "real hitbox" mod, of course:

  1. Reorder the calls to rendering functions so that player and shot sprites are rendered after bullets
  2. Blank out all player sprite pixels outside an 8×8 / 8×4 box around the center point
  3. After the bullet rendering function, turn on the GRCG in RMW mode and set the tile register set to the background color
  4. Stretch the negated contents of collision bitmap onto each playfield, leaving only collidable pixels untouched
  5. Do the same with the actual, non-negated contents and a white color, for extra contrast against the background. This also makes sure to show any collidable areas whose sprite pixels are transparent, such as with the moon enemy. (Yeah, how unfair.) Doing that also loses a lot of information about the playfield, such as enemy HP indicated by their color, but what can you do:
A decently busy TH03 in-game frame.The underlying content of the collision bitmap, showing off all three different shapes together with the player hitboxes.
A decently busy TH03 in-game frame and its underlying collision bitmap, showing off all three different collision shapes together with the player hitboxes.

2022-02-18-TH03-real-hitbox.zip The secret for writing such mods before having reached a sufficient level of position independence? Put your new code segment into DGROUP, past the end of the uninitialized data section. That's why this modded MAIN.EXE is a lot larger than you would expect from the raw amount of new code: The file now actually needs to store all these uninitialized 0 bytes between the end of the data segment and the first instruction of the mod code – normally, this number is simply a part of the MZ EXE header, and doesn't need to be redundantly stored on disk. Check the th03_real_hitbox branch for the code.

And now we know why so many "real hitbox" mods for the Windows Touhou games are inaccurate: The games would simply be unplayable otherwise – or can you dodge rapidly moving 2×2 / 2×1 blocks as an 8×8 / 8×4 rectangle that is smaller than your shot sprites, especially without focused movement? I can't. :tannedcirno: Maybe it will feel more playable after making explosions visible, but that would need more RE groundwork first.
It's also interesting how adding two full GRCG-accelerated redraws of both playfields per frame doesn't significantly drop the game's frame rate – so why did the drawing functions have to be micro-optimized again? It would be possible in one pass by using the GRCG's TDW mode, which should theoretically be 8× faster, but I have to stop somewhere. :onricdennat:

Next up: The final missing piece of TH04's and TH05's bullet-moving code, which will include a certain other type of projectile as well.

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous]
🏷️ Tags:

TH03 finally passed 20% RE, and the newly decompiled code contains no serious ZUN bugs! What a nice way to end the year.

There's only a single unlockable feature in TH03: Chiyuri and Yumemi as playable characters, unlocked after a 1CC on any difficulty. Just like the Extra Stages in TH04 and TH05, YUME.NEM contains a single designated variable for this unlocked feature, making it trivial to craft a fully unlocked score file without recording any high scores that others would have to compete against. So, we can now put together a complete set for all PC-98 Touhou games: 2021-12-27-Fully-unlocked-clean-score-files.zip It would have been cool to set the randomly generated encryption keys in these files to a fixed value so that they cancel out and end up not actually encrypting the file. Too bad that TH03 also started feeding each encrypted byte back into its stream cipher, which makes this impossible.

The main loading and saving code turned out to be the second-cleanest implementation of a score file format in PC-98 Touhou, just behind TH02. Only two of the YUME.NEM functions come with nonsensical differences between OP.EXE and MAINL.EXE, rather than 📝 all of them, as in TH01 or 📝 too many of them, as in TH04 and TH05. As for the rest of the per-difficulty structure though… well, it quickly becomes clear why this was the final score file format to be RE'd. The name, score, and stage fields are directly stored in terms of the internal REGI*.BFT sprite IDs used on the high score screen. TH03 also stores 10 score digits for each place rather than the 9 possible ones, keeps any leading 0 digits, and stores the letters of entered names in reverse order… yeah, let's decompile the high score screen as well, for a full understanding of why ZUN might have done all that. (Answer: For no reason at all. :zunpet:)


And wow, what a breath of fresh air. It's surely not good-code: The overlapping shadows resulting from using a 24-pixel letterspacing with 32-pixel glyphs in the name column led ZUN to do quite a lot of unnecessary and slightly confusing rendering work when moving the cursor back and forth, and he even forgot about the EGC there. But it's nowhere close to the level of jank we saw in 📝 TH01's high score menu last year. Good to see that ZUN had learned a thing or two by his third game – especially when it comes to storing the character map cursor in terms of a character ID, and improving the layout of the character map:

The alphabet available for TH03 high score names.

That's almost a nicely regular grid there. With the question mark and the double-wide SP, BS, and END options, the cursor movement code only comes with a reasonable two exceptions, which are easily handled. And while I didn't get this screen completely decompiled, one additional push was enough to cover all important code there.

The only potential glitch on this screen is a result of ZUN's continued use of binary-coded decimal digits without any bounds check or cap. Like the in-game HUD score display in TH04 and TH05, TH03's high score screen simply uses the next glyph in the character set for the most significant digit of any score above 1,000,000,000 points – in this case, the period. Still, it only really gets bad at 8,000,000,000 points: Once the glyphs are exhausted, the blitting function ends up accessing garbage data and filling the entire screen with garbage pixels. For comparison though, the current world record is 133,650,710 points, so good luck getting 8 billion in the first place.

Next up: Starting 2022 with the long-awaited decompilation of TH01's Sariel fight! Due to the 📝 recent price increase, we now got a window in the cap that is going to remain open until tomorrow, providing an early opportunity to set a new priority after Sariel is done.

📝 Posted:
💰 Funded by:
[Anonymous]
🏷️ Tags:

Back after taking way too long to get Touhou Patch Center's MediaWiki update feature complete… I'm still waiting for more translators to test and review the new translation interface before delivering and deploying it all, which will most likely lead to another break from ReC98 within the next few months. For now though, I'm happy to have mostly addressed the nagging responsibility I still had after willing that site into existence, and to be back working on ReC98. 🙂

As announced, the next few pushes will focus on TH04's and TH05's bullet spawning code, before I get to put all that accumulated TH01 money towards finishing all of konngara's code in TH01. For a full picture of what's happening with bullets, we'd really also like to have the bullet update function as readable C code though.
Clearing all bullets on the playfield will trigger a Bonus!! popup, displayed as 📝 gaiji in that proportional font. Unfortunately, TLINK refused to link the code as soon as I referenced the function for animating the popups at the top of the playfield? Which can only mean that we have to decompile that function first… :thonk:

So, let's turn that piece of technical debt into a full push, and first decompile another random set of previously reverse-engineered TH04 and TH05 functions. Most of these are stored in a different place within the two MAIN.EXE binaries, and the tried-and-true method of matching segment names would therefore have introduced several unnecessary translation units. So I resorted to a segment splitting technique I should have started using way earlier: Simply creating new segments with names derived from their functions, at the exact positions they're needed. All the new segment start and end directives do bloat the ASM code somewhat, and certainly contributed to this push barely removing any actual lines of code. However, what we get in return is total freedom as far as decompilation order is concerned, 📝 which should be the case for any ReC project, really. And in the end, all these tiny code segments will cancel out anyway.
If only we could do the same with the data segment…


The popup function happened to be the final one I RE'd before my long break in the spring of 2019. Back then, I didn't even bother looking into that 64-frame delay between changing popups, and what that meant for the game.
Each of these popups stays on screen for 128 frames, during which, of course, another popup-worthy event might happen. Handling this cleanly without removing previous popups too early would involve some sort of event queue, whose size might even be meaningfully limited to the number of distinct events that can happen. But still, that'd be a data structure, and we're not gonna have that! :zunpet: Instead, ZUN simply keeps two variables for the new and current popup ID. During an active popup, any change to that ID will only be committed once the current popup has been shown for at least 64 frames. And during that time, that new ID can be freely overwritten with a different one, which drops any previous, undisplayed event. But surely, there won't be more than two events happening within 63 frames, right? :tannedcirno:

The rest was fairly uneventful – no newly RE'd functions in this push, after all – until I reached the widely used helper function for applying the current vertical scrolling offset to a Y coordinate. Its combination of a function parameter, the pascal calling convention, and no stack frame was previously thought to be undecompilable… except that it isn't, and the decompilation didn't even require any new workarounds to be developed? Good thing that I already forgot how impossible it was to decompile the first function I looked at that fell into this category!
Oh well, this discovery wasn't too groundbreaking. Looking back at all the other functions with that combination only revealed a grand total of 1 additional one where a decompilation made sense: TH05's version of snd_kaja_interrupt(), which is now compiled from the same C++ file for all 4 games that use it. And well, looks like some quirks really remain unnoticed and undocumented until you look at a function for the 11th time: Its return value is undefined if BGM is inactive – that is, if the user disabled it, or if no FM board is installed. Not that it matters for the original code, which never uses this function to retrieve anything from KAJA's drivers. But people apparently do copy ReC98 code into their own projects, so it is something to keep in mind.


All in all, nothing quite at jank level in this one, but we were surely grazing that tag. Next up, with that out of the way: The bullet update/step function! Very soon in fact, since I've mostly got it done already.

📝 Posted:
💰 Funded by:
[Anonymous]
🏷️ Tags:

Technical debt, part 10… in which two of the PMD-related functions came with such complex ramifications that they required one full push after all, leaving no room for the additional decompilations I wanted to do. At least, this did end up being the final one, completing all SHARED segments for the time being.


The first one of these functions determines the BGM and sound effect modes, combining the resident type of the PMD driver with the Option menu setting. The TH04 and TH05 version is apparently coded quite smartly, as PC-98 Touhou only needs to distinguish "OPN- / PC-9801-26K-compatible sound sources handled by PMD.COM" from "everything else", since all other PMD varieties are OPNA- / PC-9801-86-compatible.
Therefore, I only documented those two results returned from PMD's AH=09h function. I'll leave a comprehensive, fully documented enum to interested contributors, since that would involve research into basically the entire history of the PC-9800 series, and even the clearly out-of-scope PC-88VA. After all, distinguishing between more versions of the PMD driver in the Option menu (and adding new sprites for them!) is strictly mod territory.


The honor of being the final decompiled function in any SHARED segment went to TH04's snd_load(). TH04 contains by far the sanest version of this function: Readable C code, no new ZUN bugs (and still missing file I/O error handling, of course)… but wait, what about that actual file read syscall, using the INT 21h, AH=3Fh DOS file read API? Reading up to a hardcoded number of bytes into PMD's or MMD's song or sound effect buffer, 20 KiB in TH02-TH04, 64 KiB in TH05… that's kind of weird. About time we looked closer into this. :thonk:

Turns out that no, KAJA's driver doesn't give you the full 64 KiB of one memory segment for these, as especially TH05's code might suggest to anyone unfamiliar with these drivers. :zunpet: Instead, you can customize the size of these buffers on its command line. In GAME.BAT, ZUN allocates 8 KiB for FM songs, 2 KiB for sound effects, and 12 KiB for MMD files in TH02… which means that the hardcoded sizes in snd_load() are completely wrong, no matter how you look at them. :onricdennat: Consequently, this read syscall will overflow PMD's or MMD's song or sound effect buffer if the given file is larger than the respective buffer size.
Now, ZUN could have simply hardcoded the sizes from GAME.BAT instead, and it would have been fine. As it also turns out though, PMD has an API function (AH=22h) to retrieve the actual buffer sizes, provided for exactly that purpose. There is little excuse not to use it, as it also gives you PMD's default sizes if you don't specify any yourself.
(Unless your build process enumerates all PMD files that are part of the game, and bakes the largest size into both snd_load() and GAME.BAT. That would even work with MMD, which doesn't have an equivalent for AH=22h.)

What'd be the consequence of loading a larger file then? Well, since we don't get a full segment, let's look at the theoretical limit first.
PMD prefers to keep both its driver code and the data buffers in a single memory segment. As a result, the limit for the combined size of the song, instrument, and sound effect buffer is determined by the amount of code in the driver itself. In PMD86 version 4.8o (bundled with TH04 and TH05) for example, the remaining size for these buffers is exactly 45,555 bytes. Being an actually good programmer who doesn't blindly trust user input, KAJA thankfully validates the sizes given via the /M, /V, and /E command-line options before letting the driver reside in memory, and shuts down with an error message if they exceed 40 KiB. Would have been even better if he calculated the exact size – even in the current PMD version 4.8s from January 2020, it's still a hardcoded value (see line 8581).
Either way: If the file is larger than this maximum, the concrete effect is down to the INT 21h, AH=3Fh implementation in the underlying DOS version. DOS 3.3 treats the destination address as linear and reads past the end of the segment, DOS 5.0 and DOSBox-X truncate the number of bytes to not exceed the remaining space in the segment, and maybe there's even a DOS that wraps around and ends up overwriting the PMD driver code. In any case: You will overwrite what's after the driver in memory – typically, the game .EXE and its master.lib functions.

It almost feels like a happy accident that this doesn't cause issues in the original games. The largest PMD file in any of the 4 games, the -86 version of 幽夢 ~ Inanimate Dream, takes up 8,099 bytes, just under the 8,192 byte limit for BGM. For modders, I'd really recommend implementing this properly, with PMD's AH=22h function and error handling, once position independence has been reached.

Whew, didn't think I'd be doing more research into KAJA's drivers during regular ReC98 development! That's probably been the final time though, as all involved functions are now decompiled, and I'm unlikely to iterate over them again.


And that's it! Repaid the biggest chunk of technical debt, time for some actual progress again. Next up: Reopening the store tomorrow, and waiting for new priorities. If we got nothing by Sunday, I'm going to put the pending [Anonymous] pushes towards some work on the website.

📝 Posted:
💰 Funded by:
[Anonymous], Blue Bolt
🏷️ Tags:

Technical debt, part 9… and as it turns out, it's highly impractical to repay 100% of it at this point in development. 😕

The reason: graph_putsa_fx(), ZUN's function for rendering optionally boldfaced text to VRAM using the font ROM glyphs, in its ridiculously micro-optimized TH04 and TH05 version. This one sets the "callback function" for applying the boldface effect by self-modifying the target of two CALL rel16 instructions… because there really wasn't any free register left for an indirect CALL, eh? The necessary distance, from the call site to the function itself, has to be calculated at assembly time, by subtracting the target function label from the call site label.
This usually wouldn't be a problem… if ZUN didn't store the resulting lookup tables in the .DATA segment. With code segments, we can easily split them at pretty much any point between functions because there are multiple of them. But there's only a single .DATA segment, with all ZUN and master.lib data sandwiched between Borland C++'s crt0 at the top, and Borland C++'s library functions at the bottom of the segment. Adding another split point would require all data after that point to be moved to its own translation unit, which in turn requires EXTERN references in the big .ASM file to all that moved data… in short, it would turn the codebase into an even greater mess.
Declaring the labels as EXTERN wouldn't work either, since the linker can't do fancy arithmetic and is limited to simply replacing address placeholders with one single address. So, we're now stuck with this function at the bottom of the SHARED segment, for the foreseeable future.


We can still continue to separate functions off the top of that segment, though. Pretty much the only thing noteworthy there, so far: TH04's code for loading stage tile images from .MPN files, which we hadn't reverse-engineered so far, and which nicely fit into one of Blue Bolt's pending ⅓ RE contributions. Yup, we finally moved the RE% bars again! If only for a tiny bit. :tannedcirno:
Both TH02 and TH05 simply store one pointer to one dynamically allocated memory block for all tile images, as well as the number of images, in the data segment. TH04, on the other hand, reserves memory for 8 .MPN slots, complete with their color palettes, even though it only ever uses the first one of these. There goes another 458 bytes of conventional RAM… I should start summing up all the waste we've seen so far. Let's put the next website contribution towards a tagging system for these blog posts.

At 86% of technical debt in the SHARED segment repaid, we aren't quite done yet, but the rest is mostly just TH04 needing to catch up with functions we've already separated. Next up: Getting to that practical 98.5% point. Since this is very likely to not require a full push, I'll also decompile some more actual TH04 and TH05 game code I previously reverse-engineered – and after that, reopen the store!

📝 Posted:
💰 Funded by:
[Anonymous]
🏷️ Tags:

Whoops, the build was broken again? Since P0127 from mid-November 2020, on TASM32 version 5.3, which also happens to be the one in the DevKit… That version changed the alignment for the default segments of certain memory models when requesting .386 support. And since redefining segment alignment apparently is highly illegal and absolutely has to be a build error, some of the stand-alone .ASM translation units didn't assemble anymore on this version. I've only spotted this on my own because I casually compiled ReC98 somewhere else – on my development system, I happened to have TASM32 version 5.0 in the PATH during all this time.
At least this was a good occasion to get rid of some weird segment alignment workarounds from 2015, and replace them with the superior convention of using the USE16 modifier for the .MODEL directive.

ReC98 would highly benefit from a build server – both in order to immediately spot issues like this one, and as a service for modders. Even more so than the usual open-source project of its size, I would say. But that might be exactly because it doesn't seem like something you can trivially outsource to one of the big CI providers for open-source projects, and quickly set it up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular 64-bit Windows 10 and DOSBox running the exact build tools from the DevKit. Ideally, though, such a server should really run the optimal configuration of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step to run natively, which already is something that no popular CI service out there offers. Then, we'd optimally expand to Linux, every other Windows version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd be a lot. An experimental project all on its own, with additional hosting costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest there is once the store reopens (which will be at the beginning of May, at the latest). That aside, it would 📝 also be a great project for outside contributors!


So, technical debt, part 8… and right away, we're faced with TH03's low-level input function, which 📝 once 📝 again 📝 insists on being word-aligned in a way we can't fake without duplicating translation units. Being undecompilable isn't exactly the best property for a function that has been interesting to modders in the past: In 2018, spaztron64 created an ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human multiplayer mode: 2021-04-04-TH03-WASD-2player.zip However, this remapping attempt remained quite limited, since we hadn't (and still haven't) reached full position independence for TH03 yet. There's quite some potential for size optimizations in this function, which would allow more BIOS key groups to already be used right now, but it's not all that obvious to modders who aren't intimately familiar with x86 ASM. Therefore, I really wouldn't want to keep such a long and important function in ASM if we don't absolutely have to…

… and apparently, that's all the motivation I needed? So I took the risk, and spent the first half of this push on reverse-engineering TCC.EXE, to hopefully find a way to get word-aligned code segments out of Turbo C++ after all.

And there is! The -WX option, used for creating DPMI applications, messes up all sorts of code generation aspects in weird ways, but does in fact mark the code segment as word-aligned. We can consider ourselves quite lucky that we get to use Turbo C++ 4.0, because this feature isn't available in any previous version of Borland's C++ compilers.
That allowed us to restore all the decompilations I previously threw away… well, two of the three, that lookup table generator was too much of a mess in C. :tannedcirno: But what an abuse this is. The subtly different code generation has basically required one creative workaround per usage of -WX. For example, enabling that option causes the regular PUSH BP and POP BP prolog and epilog instructions to be wrapped with INC BP and DEC BP, for some reason:

a_function_compiled_with_wx proc
	inc 	bp    	; ???
	push	bp
	mov 	bp, sp
	    	      	; [… function code …]
	pop 	bp
	dec 	bp    	; ???
	ret
a_function_compiled_with_wx endp

Luckily again, all the functions that currently require -WX don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty close: snd_se_reset(void) is one of the functions that require word alignment. Previously, it shared a translation unit with the immediately following snd_se_play(int new_se), which does take a parameter, and therefore would have had its prolog and epilog code messed up by -WX. Since the latter function has a consistent (and thus, fakeable) alignment, I simply split that code segment into two, with a new -WX translation unit for just snd_se_reset(void). Problem solved – after all, two C++ translation units are still better than one ASM translation unit. :onricdennat: Especially with all the previous #include improvements.

The rest was more of the usual, getting us 74% done with repaying the technical debt in the SHARED segment. A lot of the remaining 26% is TH04 needing to catch up with TH03 and TH05, which takes comparatively little time. With some good luck, we might get this done within the next push… that is, if we aren't confronted with all too many more disgusting decompilations, like the two functions that ended this push. If we are, we might be needing 10 pushes to complete this after all, but that piece of research was definitely worth the delay. Next up: One more of these.

📝 Posted:
💰 Funded by:
[Anonymous]
🏷️ Tags:

Alright, no more big code maintenance tasks that absolutely need to be done right now. Time to really focus on parts 6 and 7 of repaying technical debt, right? Except that we don't get to speed up just yet, as TH05's barely decompilable PMD file loading function is rather… complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98 Touhou, I first consult the disassembly of Wolfenstein 3D. That game was originally compiled with the quite similar Borland C++ 3.0, so it's quite helpful to compare its ASM to the officially released source code. If I find the instructions in question, they mostly come from that game's ASM code, leading to the amusing realization that "even John Carmack was unable to get these instructions out of this compiler" :onricdennat: This time though, Wolfenstein 3D did point me to Borland's intrinsics for common C functions like memcpy() and strchr(), available via #pragma intrinsic. Bu~t those unfortunately still generate worse code than what ZUN micro-optimized here. Commenting how these sequences of instructions should look in C is unfortunately all I could do here.
The conditional branches in this function did compile quite nicely though, clarifying the control flow, and clearly exposing a ZUN bug: TH05's snd_load() will hang in an infinite loop when trying to load a non-existing -86 BGM file (with a .M2 extension) if the corresponding -26 BGM file (with a .M extension) doesn't exist either.

Unsurprisingly, the PMD channel monitoring code in TH05's Music Room remains undecompilable outside the two most "high-level" initialization and rendering functions. And it's not because there's data in the middle of the code segment – that would have actually been possible with some #pragmas to ensure that the data and code segments have the same name. As soon as the SI and DI registers are referenced anywhere, Turbo C++ insists on emitting prolog code to save these on the stack at the beginning of the function, and epilog code to restore them from there before returning. Found that out in September 2019, and confirmed that there's no way around it. All the small helper functions here are quite simply too optimized, throwing away any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate that I do try.


Within that same 6th push though, we've finally reached the one function in TH05 that was blocking further progress in TH04, allowing that game to finally catch up with the others in terms of separated translation units. Feels good to finally delete more of those .ASM files we've decompiled a while ago… finally!

But since that was just getting started, the most satisfying development in both of these pushes actually came from some more experiments with macros and inline functions for near-ASM code. By adding "unused" dummy parameters for all relevant registers, the exact input registers are made more explicit, which might help future port authors who then maybe wouldn't have to look them up in an x86 instruction reference quite as often. At its best, this even allows us to declare certain functions with the __fastcall convention and express their parameter lists as regular C, with no additional pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even more amazing than previously thought when it comes to returning pseudo-registers from inline functions. A nice example for how this can improve readability can be found in this piece of TH02 code for polling the PC-98 keyboard state using a BIOS interrupt:

inline uint8_t keygroup_sense(uint8_t group) {
	_AL = group;
	_AH = 0x04;
	geninterrupt(0x18);
	// This turns the output register of this BIOS call into the return value
	// of this function. Surprisingly enough, this does *not* naively generate
	// the `MOV AL, AH` instruction you might expect here!
	return _AH;
}

void input_sense(void)
{
	// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
	// never emits as such, giving us only the three instructions we need.
	_AH = keygroup_sense(8);

	// Whereas this one gives us the one additional `MOV BH, AH` instruction
	// we'd expect, and nothing more.
	_BH = keygroup_sense(7);

	// And now it's obvious what both of these registers contain, from just
	// the assignments above.
	if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
		key_det |= INPUT_UP;
	}
	// […]
}

I love it. No inline assembly, as close to idiomatic C code as something like this is going to get, yet still compiling into the minimum possible number of x86 instructions on even a 1994 compiler. This is how I keep this project interesting for myself during chores like these. :tannedcirno: We might have even reached peak inline already?

And that's 65% of technical debt in the SHARED segment repaid so far. Next up: Two more of these, which might already complete that segment? Finally!

📝 Posted:
💰 Funded by:
[Anonymous]
🏷️ Tags:

Wow, 31 commits in a single push? Well, what the last push had in progress, this one had in maintenance. The 📝 master.lib header transition absolutely had to be completed in this one, for my own sanity. And indeed, it reduced the build time for the entirety of ReC98 to about 27 seconds on my system, just as expected in the original announcement. Looking forward to even faster build times with the upcoming #include improvements I've got up my sleeve! The port authors of the future are going to appreciate those quite a bit.

As for the new translation units, the funniest one is probably TH05's function for blitting the 1-color .CDG images used for the main menu options. Which is so optimized that it becomes decompilable again, by ditching the self-modifying code of its TH04 counterpart in favor of simply making better use of CPU registers. The resulting C code is still a mess, but what can you do. :tannedcirno:
This was followed by even more TH05 functions that clearly weren't compiled from C, as evidenced by their padding bytes. It's about time I've documented my lack of ideas of how to get those out of Turbo C++. :onricdennat:

And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…

In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?

📝 Posted:
💰 Funded by:
[Anonymous]
🏷️ Tags:

Now that's the amount of translation unit separation progress I was looking for! Too bad that RL is keeping me more and more occupied these days, and ended up delaying this push until 2021. Now that Touhou Patch Center is also commissioning me to update their infrastructure, it's going to take a while for ReC98 to return to full speed, and for the store to be reopened. Should happen by April at the latest, though!

With everything related to this separation of translation units explained earlier, we've really got a push with nothing to talk about, this time. Except, maybe, for the realization that 📝 this current approach might not be the best fit for TH02 after all: Not only did it force us to 📝 throw away the previous decompilation of the sound effect playback functions, but OP.EXE also contains obviously copy-pasted code in addition to the common, shared set of library functions. How was that game even built, originally??? No way around compiling that one instance of the "delay until given BGM measure" function separately then, if it insists on using its own instance of the VSync delay function…
Oh well, this separated layout still works better for the later games, and consistency is good. Smooth sailing with all of the other functions, at least.

Next up: One more of these, which might even end up completing the 📝 transition to our own master.lib header file. In terms of the total number of ASM code left in the SHARED code segments, we're now 30% done after 3 dedicated pushes. It really shouldn't require 7 more pushes, though!

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous]
🏷️ Tags:

Alright, back to continuing the master.hpp transition started in P0124, and repaying technical debt. The last blog post already announced some ridiculous decompilations… and in fact, not a single one of the functions in these two pushes was decompilable into idiomatic C/C++ code.

As usual, that didn't keep me from trying though. The TH04 and TH05 version of the infamous 16-pixel-aligned, EGC-accelerated rectangle blitting function from page 1 to page 0 was fairly average as far as unreasonable decompilations are concerned.
The big blocker in TH03's MAIN.EXE, however, turned out to be the .MRS functions, used to render the gauge attack portraits and bomb backgrounds. The blitting code there uses the additional FS and GS segment registers provided by the Intel 386… which

  1. are not supported by Turbo C++'s inline assembler, and
  2. can't be turned into pointers, due to a compiler bug in Turbo C++ that generates wrong segment prefix opcodes for the _FS and _GS pseudo-registers.

Apparently I'm the first one to even try doing that with this compiler? I haven't found any other mention of this bug…
Compiling via assembly (#pragma inline) would work around this bug and generate the correct instructions. But that would incur yet another dependency on a 16-bit TASM, for something honestly quite insignificant.

What we can always do, however, is using __emit__() to simply output x86 opcodes anywhere in a function. Unlike spelled-out inline assembly, that can even be used in helper functions that are supposed to inline… which does in fact allow us to fully abstract away this compiler bug. Regular if() comparisons with pseudo-registers wouldn't inline, but "converting" them into C++ template function specializations does. All that's left is some C preprocessor abuse to turn the pseudo-registers into types, and then we do retain a normal-looking poke() call in the blitting functions in the end. 🤯

Yeah… the result is batshit insane. I may have gone too far in a few places…


One might certainly argue that all these ridiculous decompilations actually hurt the preservation angle of this project. "Clearly, ZUN couldn't have possibly written such unreasonable C++ code. So why pretend he did, and not just keep it all in its more natural ASM form?" Well, there are several reasons:

Unfortunately, these pushes also demonstrated a second disadvantage in trying to decompile everything possible: Since Turbo C++ lacks TASM's fine-grained ability to enforce code alignment on certain multiples of bytes, it might actually be unfeasible to link in a C-compiled object file at its intended original position in some of the .EXE files it's used in. Which… you're only going to notice once you encounter such a case. Due to the slightly jumbled order of functions in the 📝 second, shared code segment, that might be long after you decompiled and successfully linked in the function everywhere else.

And then you'll have to throw away that decompilation after all 😕 Oh well. In this specific case (the lookup table generator for horizontally flipping images), that decompilation was a mess anyway, and probably helped nobody. I could have added a dummy .OBJ that does nothing but enforce the needed 2-byte alignment before the function if I really insisted on keeping the C version, but it really wasn't worth it.


Now that I've also described yet another meta-issue, maybe there'll really be nothing to say about the next technical debt pushes? :onricdennat: Next up though: Back to actual progress again, with TH01. Which maybe even ends up pushing that game over the 50% RE mark?

📝 Posted:
💰 Funded by:
Lmocinemod, Blue Bolt, [Anonymous]
🏷️ Tags:

Finally, after a long while, we've got two pushes with barely anything to talk about! Continuing the road towards 100% PI for TH05, these were exactly the two pushes that TH05 MAINE.EXE PI was estimated to additionally cost, relative to TH04's. Consequently, they mostly went to TH05's unique data structures in the ending cutscenes, the score name registration menu, and the staff roll.

A unique feature in there is TH05's support for automatic text color changes in its ending scripts, based on the first full-width Shift-JIS codepoint in a line. The \c=codepoint,color commands at the top of the _ED??.TXT set up exactly this codepoint→color mapping. As far as I can tell, TH05 is the only Touhou game with a feature like this – even the Windows Touhou games went back to manually spelling out each color change.

The orb particles in TH05's staff roll also try to be a bit unique by using 32-bit X and Y subpixel variables for their current position. With still just 4 fractional bits, I can't really tell yet whether the extended range was actually necessary. Maybe due to how the "camera scrolling" through "space" was implemented? All other entities were pretty much the usual fare, though.
12.4, 4.4, and now a 28.4 fixed-point format… yup, 📝 C++ templates were definitely the right choice.

At the end of its staff roll, TH05 not only displays the usual performance verdict, but then scrolls in the scores at the end of each stage before switching to the high score menu. The simplest way to smoothly scroll between two full screens on a PC-98 involves a separate bitmap… which is exactly what TH05 does here, reserving 28,160 bytes of its global data segment for just one overly large monochrome 320×704 bitmap where both the screens are rendered to. That's… one benefit of splitting your game into multiple executables, I guess? :tannedcirno:
Not sure if it's common knowledge that you can actually scroll back and forth between the two screens with the Up and Down keys before moving to the score menu. I surely didn't know that before. But it makes sense – might as well get the most out of that memory.


The necessary groundwork for all of this may have actually made TH04's (yes, TH04's) MAINE.EXE technically position-independent. Didn't quite reach the same goal for TH05's – but what we did reach is ⅔ of all PC-98 Touhou code now being position-independent! Next up: Celebrating even more milestones, as -Tom- is about to finish development on his TH05 MAIN.EXE PI demo…

📝 Posted:
💰 Funded by:
Lmocinemod
🏷️ Tags:

Alright, tooling and technical debt. Shouldn't be really much to talk about… oh, wait, this is still ReC98 :tannedcirno:

For the tooling part, I finished up the remaining ergonomics and error handling for the 📝 sprite converter that Jonathan Campbell contributed two months ago. While I familiarized myself with the tool, I've actually ran into some unreported errors myself, so this was sort of important to me. Still got no command-line help in there, but the error messages can now do that job probably even better, since we would have had to write them anyway.

So, what's up with the technical debt then? Well, by now we've accumulated quite a number of 📝 ASM code slices that need to be either decompiled or clearly marked as undecompilable. Since we define those slices as "already reverse-engineered", that decision won't affect the numbers on the front page at all. But for a complete decompilation, we'd still have to do this someday. So, rather than incorporating this work into pushes that were purchased with the expectation of measurable progress in a certain area, let's take the "anything goes" pushes, and focus entirely on that during them.

The second code segment seemed like the best place to start with this, since it affects the largest number of games simultaneously. Starting with TH02, this segment contains a set of random "core" functions needed by the binary. Image formats, sounds, input, math, it's all there in some capacity. You could maybe call it all "libzun" or something like that? But for the time being, I simply went with the obvious name, seg2. Maybe I'll come up with something more convincing in the future.


Oh, but wait, why were we assembling all the previous undecompilable ASM translation units in the 16-bit build part? By moving those to the 32-bit part, we don't even need a 16-bit TASM in our list of dependencies, as long as our build process is not fully 16-bit.
And with that, ReC98 now also builds on Windows 95, and thus, every 32-bit Windows version. 🎉 Which is certainly the most user-visible improvement in all of these two pushes. :onricdennat:


Back in 2015, I already decompiled all of TH02's seg2 functions. As suggested by the Borland compiler, I tried to follow a "one translation unit per segment" layout, bundling the binary-specific contents via #include. In the end, it required two translation units – and that was even after manually inserting the original padding bytes via #pragma codestring… yuck. But it worked, compiled, and kept the linker's job (and, by extension, segmentation worries) to a minimum. And as long as it all matched the original binaries, it still counted as a valid reconstruction of ZUN's code. :zunpet:

However, that idea ultimately falls apart once TH03 starts mixing undecompilable ASM code inbetween C functions. Now, we officially have no choice but to use multiple C and ASM translation units, with maybe only just one or two #includes in them…

…or we finally start reconstructing the actual seg2 library, turning every sequence of related functions into its own translation unit. This way, we can simply reuse the once-compiled .OBJ files for all the binaries those functions appear in, without requiring that additional layer of translation units mirroring the original segmentation.
The best example for this is TH03's almost undecompilable function that generates a lookup table for horizontally flipping 8 1bpp pixels. It's part of every binary since TH03, but only used in that game. With the previous approach, we would have had to add 9 C translation units, which would all have just #included that one file. Now, we simply put the .OBJ file into the correct place on the linker command line, as soon as we can.

💡 And suddenly, the linker just inserts the correct padding bytes itself.

The most immediate gains there also happened to come from TH03. Which is also where we did get some tiny RE% and PI% gains out of this after all, by reverse-engineering some of its sprite blitting setup code. Sure, I should have done even more RE here, to also cover those 5 functions at the end of code segment #2 in TH03's MAIN.EXE that were in front of a number of library functions I already covered in this push. But let's leave that to an actual RE push 😛


All in all though, I was just getting started with this; the real gains in terms of removed ASM files are still to come. But in the meantime, the funding situation has become even better in terms of allowing me to focus on things nobody asked for. 🙂 So here's a slightly better idea: Instead of spending two more pushes on this, let's shoot for TH05 MAINE.EXE position independence next. If I manage to get it done, we'll have a 100% position-independent TH05 by the time -Tom- finishes his MAIN.EXE PI demo, rather than the 94% we'd get from just MAIN.EXE. That's bound to make a much better impression on all the people who will then (re-)discover the project.

📝 Posted:
💰 Funded by:
[Anonymous], Blue Bolt
🏷️ Tags:

… and just as I explained 📝 in the last post how decompilation is typically more sensible and efficient than ASM-level reverse-engineering, we have this push demonstrating a counter-example. The reason why the background particles and lines in the Shinki and EX-Alice battles contributed so much to position dependence was simply because they're accessed in a relatively large amount of functions, one for each different animation. Too many to spend the remaining precious crowdfunded time on reverse-engineering or even decompiling them all, especially now that everyone anticipates 100% PI for TH05's MAIN.EXE.

Therefore, I only decompiled the two functions of the line structure that also demonstrate best how it works, which in turn also helped with RE. Sadly, this revealed that we actually can't 📝 overload operator =() to get that nice assignment syntax for 12.4 fixed-point values, because one of those new functions relies on Turbo C++'s built-in optimizations for trivially copyable structures. Still, impressive that this abstraction caused no other issues for almost one year.

As for the structures themselves… nope, nothing to criticize this time! Sure, one good particle system would have been awesome, instead of having separate structures for the Stage 2 "starfield" particles and the one used in Shinki's battle, with hardcoded animations for both. But given the game's short development time, that was quite an acceptable compromise, I'd say.
And as for the lines, there just has to be a reason why the game reserves 20 lines per set, but only renders lines #0, #6, #12, and #18. We'll probably see once we get to look at those animation functions more closely.

This was quite a 📝 TH03-style RE push, which yielded way more PI% than RE%. But now that that's done, I can finally not get distracted by all that stuff when looking at the list of remaining memory references. Next up: The last few missing structures in TH05's MAIN.EXE!

📝 Posted:
💰 Funded by:
[Anonymous], -Tom-, Splashman
🏷️ Tags:

Well, that took twice as long as I thought, with the two pushes containing a lot more maintenance than actual new research. Spending some time improving both field names and types in 32th System's TH03 resident structure finally gives us all of those structures. Which means that we can now cover all the remaining decompilable ZUN.COM parts at once…

Oh wait, their main() functions have stayed largely identical since TH02? Time to clean up and separate that first, then… and combine two recent code generation observations into the solution to a decompilation puzzle from 4½ years ago. Alright, time to decomp-

Oh wait, we'd kinda like to properly RE all the code in TH03-TH05 that deals with loading and saving .CFG files. Almost every outside contributor wanted to grab this supposedly low-hanging fruit a lot earlier, but (of course) always just for a single game, while missing how the format evolved.

So, ZUN.COM. For some reason, people seem to consider it particularly important, even though it contains neither any game logic nor any code specific to PC-98 hardware… All that this decompilable part does is to initialize a game's .CFG file, allocate an empty resident structure using master.lib functions, release it after you quit the game, error-check all that, and print some playful messages~ (OK, TH05's also directly fills the resident structure with all data from MIKO.CFG, which all the other games do in OP.EXE.) At least modders can now freely change and extend all the resident structures, as well as the .CFG files? And translators can translate those messages that you won't see on a decently fast emulator anyway? Have fun, I guess 🤷‍

And you can in fact do this right now – even for TH04 and TH05, whose ZUN.COM currently isn't rebuilt by ReC98. There is actually a rather involved reason for this:

So yeah, no meaningful RE and PI progress at any of these levels. Heck, even as a modder, you can just replace the zun zun_res (TH02), zun -5 (TH03), or zun -s (TH04/TH05) calls in GAME.BAT with a direct call to your modified *RES*.COM. And with the alternative being "manually typing 0 and 1 bits into a text file", editing the sprites in TH05's GJINIT.COM is way more comfortable in a binary sprite editor anyway.

For me though, the best part in all of this was that it finally made sense to throw out the old Borland C++ run-time assembly slices 🗑 This giant waste of time became obvious 5 years ago, but any ASM dump of a .COM file would have needed rather ugly workarounds without those slices. Now that all .COM binaries that were originally written in C are compiled from C, we can all enjoy slightly faster grepping over the entire repository, which now has 229 fewer files. Productivity will skyrocket! :tannedcirno:

Next up: Three weeks of almost full-time ReC98 work! Two more PI-focused pushes to finish this TH05 stretch first, before switching priorities to TH01 again.

📝 Posted:
💰 Funded by:
KirbyComment, -Tom-
🏷️ Tags:

Turns out that covering TH03's 128-byte player structure was way more insightful than expected! And while it doesn't include every bit of per-player data, we still got to know quite a bit about the game from just trying to name its members:

The next TH03 pushes can now cover all the functions that reference this structure in one way or another, and actually commit all this research and translate it into some RE%. Since the non-TH05 priorities have become a bit unclear after the last 50 € RE contribution though (as of this writing, it's still 10 € to decide on what game to cover in two RE pushes!), I'll be returning to TH05 until that's decided.

📝 Posted:
💰 Funded by:
KirbyComment
🏷️ Tags:

As noted in 📝 P0061, TH03 gameplay RE is indeed going to progress very slowly in the beginning. A lot of the initial progress won't even be reflected in the RE% – there are just so many features in this game that are intertwined into each other, and I only consider functions to be "reverse-engineered" once we understand every involved piece of code and data, and labeled every absolute memory reference in it. (Yes, that means that the percentages on the front page are actually underselling ReC98's progress quite a bit, and reflect a pretty low bound of our actual understanding of the games.)

So, when I get asked to look directly at gameplay code right now, it's quite the struggle to find a place that can be covered within a push or two and that would immediately benefit scoreplayers. The basics of score and combo handling themselves managed to fit in pretty well, though:

Oh well, at least we still got a bit of PI% out of this one. From this point though, the next push (or two) should be enough to cover the big 128-byte player structure – which by itself might not be immediately interesting to scoreplayers, but surely is quite a blocker for everything else.

📝 Posted:
💰 Funded by:
Touhou Patch Center
🏷️ Tags:

A~nd resident structures ended up being exactly the right thing to start off the new year with. WindowsTiger and spaztron64 have already been pushing for them with their own reverse-engineering, and together with my own recent GENSOU.SCR RE work, we've clarified just enough context around the harder-to-explain values to make both TH04's and TH05's structures fit nicely into the typical time frame of a single push.

With all the apparently obvious and seemingly just duplicated values, it has always been easy to do a superficial job for most of the structure, then lose motivation for the last few unknown fields. Pretty glad to got this finally covered; I've heard that people are going to write trainer tools now?

Also, where better to slot in a push that, in terms of figures, seems to deliver 0% RE and only miniscule PI progress, than at the end of Touhou Patch Center's 5-push order that already had multiple pushes yielding above-average progress? :onricdennat: As usual, we'll be reaping the rewards of this work in the next few TH04/TH05 pushes…

…whenever they get funded, that is, as for January, the backers have shifted the priorities towards TH01 and TH03. TH01 especially is something I'm quite excited about, as we're finally going to see just how fast this bloated game is really going to progress. Are you excited?

📝 Posted:
💰 Funded by:
Touhou Patch Center
🏷️ Tags:

🎉 TH04's and TH05's OP.EXE are now fully position-independent! 🎉

What does this mean?

You can now add any data or code to the main menus of the two games, by simply editing the ReC98 source, writing your mod in ASM or C/C++, and recompiling the code. Since all absolute memory addresses have now been converted to labels, this will work without causing any instability. See the position independence section in the FAQ for a more thorough explanation about why this was a problem.

What does this not mean?

The original ZUN code hasn't been completely reverse-engineered yet, let alone decompiled. Pretty much all of that is still ASM, which might make modding a bit inconvenient right now.

Since this push was otherwise pretty unremarkable, I made a video demonstrating a few basic things you can do with this:

Now, what to do for the last outstanding Touhou Patch Center push? Bullets, or resident structures? :thonk:

📝 Posted:
💰 Funded by:
Touhou Patch Center
🏷️ Tags:

… nope, with a game whose MAIN.EXE is still just 5% reverse-engineered and which naturally makes heavy use of structures, there's still a lot more PI groundwork to be done before RE progress can speed up to the levels that we've now reached with TH05. The good news is that this game is (now) way easier to understand: In contrast to TH04 and TH05, where we needed to work towards player shots over a two-digit number of pushes, TH03 only needed two for SPRITE16, and a half one for the playfield shaking mechanism. After that, I could even already decompile the per-frame shot update and render functions, thanks to TH03's high number of code segments. Now, even the big 128-byte player structure doesn't seem all too far off.

Then again, as TH03 shares no code with any other game, this actually was a completely average PI push. For the remaining three, we'll return to TH04 and TH05 though, which should more than make up for the slight drop in RE speed after this one.

In other news, we've now also reached peak C++, with the introduction of templates! TH03 stores movement speeds in a 4.4 fixed-point format, which is an 8-bit spin on the usual 16-bit, 12.4 fixed-point format.

📝 Posted:
💰 Funded by:
Touhou Patch Center
🏷️ Tags:

So, where to start? Well, TH04 bullets are hard, so let's procrastinate start with TH03 instead :tannedcirno: The 📝 sprite display functions are the obvious blocker for any structure describing a sprite, and therefore most meaningful PI gains in that game… and I actually did manage to fit a decompilation of those three functions into exactly the amount of time that the Touhou Patch Center community votes alloted to TH03 reverse-engineering!

And a pretty amazing one at that. The original code was so obviously written in ASM and was just barely decompilable by exclusively using register pseudovariables and a bit of goto, but I was able to abstract most of that away, not least thanks to a few helpful optimization properties of Turbo C++… seriously, I can't stop marveling at this ancient compiler. The end result is both readable, clear, and dare I say portable?! To anyone interested in porting TH03, take a look. How painful would it be to port that away from 16-bit x86?

However, this push is also a typical example that the RE/PI priorities can only control what I look at, and the outcome can actually differ greatly. Even though the priorities were 65% RE and 35% PI, the progress outcome was +0.13% RE and +1.35% PI. But hey, we've got one more push with a focus on TH03 PI, so maybe that one will include more RE than PI, and then everything will end up just as ordered? :onricdennat:

📝 Posted:
💰 Funded by:
rosenrose, [Anonymous]
🏷️ Tags:

No priorities, again…?! Please don't do this to me… 😕

Well, let's not continue with TH05 then 😛 And instead use the occasion to commit this interesting discovery, made by @m1yur1 last year. Yup, TH03's "ZUNSP" sprite driver is actually a "rebranded" version of Promisence Soft's SPRITE16.COM. Sure, you were allowed to use this driver in your own game, but replacing the copyright with your own isn't exactly the nicest thing to do… That now makes three library programmers that ZUN didn't credit. Makes me wonder what makes M. Kajihara so special. Probably the fact that Touhou has always been about the music for ZUN, first and foremost.

But what makes this more than a piece of trivia is the fact that Promiscence Soft's SPRITE16 sample game StormySpace was bundled with documentation on the driver. Shoutout to the Neo Kobe PC-98 collection for preserving he original release!

That means more documented third-party code that we don't necessarily have to reverse-engineer, just like master.lib or KAJA's PMD driver. However, the PC-98 EGC is rather complex and definitely not designed for alpha-tested 16-color sprite blitting. So it (once again) took quite a while to make sense of SPRITE16's code and the available documentation on the EGC, to come up with satisfying function names. As a result, I'm going to distribute the entire RE work related to TH03's SPRITE16 interface across a total of three pushes, this one being the first of them.

Next up, we do have more TH05 progress though, now for the first time specifically with position independence in mind. Pretty excited for this one!

📝 Posted:
💰 Funded by:
-Tom-
🏷️ Tags:

Boss explosions! And… urgh, I really also had to wade through that overly complicated HUD rendering code. Even though I had to pick -Tom-'s 7th push here as well, the worst of that is still to come. TH04 and TH05 exclusively store the current and high score internally as unpacked little-endian BCD, with some pretty dense ASM code involving the venerable x86 BCD instructions to update it.

So, what's actually the goal here. Since I was given no priorities :onricdennat:, I still haven't had to (potentially) waste time researching whether we really can decompile from anywhere else inside a segment other than backwards from the end. So, the most efficient place for decompilation right now still is the end of TH05's main_01_TEXT segment. With maybe 1 or 2 more reverse-engineering commits, we'd have everything for an efficient decompilation up to sub_123AD. And that mass of code just happens to include all the shot type control functions, and makes up 3,007 instructions in total, or 12% of the entire remaining unknown code in MAIN.EXE.

So, the most reasonable thing would be to actually put some of the upcoming decompilation pushes towards reverse-engineering that missing part. I don't think that's a bad deal since it will allow us to mod TH05 shot types in C sooner, but zorg and qp might disagree :thonk:

Next up: thcrap TL notes, followed by finally finishing GhostPhanom's old ReC98 future-proofing pushes. I really don't want to decompile without a proper build system.

📝 Posted:
💰 Funded by:
-Tom-
🏷️ Tags:

Turns out I had only been about half done with the drawing routines. The rest was all related to redrawing the scrolling stage backgrounds after other sprites were drawn on top. Since the PC-98 does have hardware-accelerated scrolling, but no hardware-accelerated sprites, everything that draws animated sprites into a scrolling VRAM must then also make sure that the background tiles covered by the sprite are redrawn in the next frame, which required a bit of ZUN code. And that are the functions that have been in the way of the expected rapid reverse-engineering progress that uth05win was supposed to bring. So, looks like everything's going to go really fast now?

📝 Posted:
💰 Funded by:
zorg
🏷️ Tags:

… yeah, no, we won't get very far without figuring out these drawing routines.
Which process data that comes from the .STD files. Which has various arrays related to the background… including one to specify the scrolling speed. And wait, setting that to 0 actually is what starts a boss battle?

So, have a TH05 Boss Rush patch: 2018-12-26-TH05BossRush.zip Theoretically, this should have also worked for TH04, but for some reason, the Stage 3 boss gets stuck on the first phase if we do this?

Here's the diff for the Boss Rush. Turning it into a thcrap-style Skipgame patch is left as an exercise for the reader.

📝 Posted:
💰 Funded by:
DTM
🏷️ Tags:

While we're waiting for Bruno to release the next thcrap build with ANM header patching, here are the resulting commits of the ReC98 CDG/CD2 special offer purchased by DTM, reverse-engineering all code that covers these formats.

📝 Posted:
💰 Funded by:
zorg
🏷️ Tags:
> OK, let's do a quick ReC98 update before going back to thcrap, shouldn't take long > Hm, all that input code is kind of in the way, would be nice to cover that first to ease comparisons with uth05win's source code > What the hell, why does ZUN do this? Need to do more research > … > OK, research done, wait, what are those other functions doing? > Wha, everything about this is just ever so slightly awkward

Which ended up turning this one update into 2/10, 3/10, 4/10 and 5/10 of zorg's reverse-engineering commits. But at least we now got all shared input functions of TH02-TH05 covered and well understood.