P0327
TH03 RE (Character-specific attack function pointers / Bullet dependencies / Bullet structure)
P0328
TH03 decompilation (Bullets, part 1)
P0329
TH03 decompilation (Bullets, part 2) / Twitter→Fediverse migration, part 1
💰 Funded by:
[Anonymous], Ember2528
Alright! I've been announcing a big look at TH03's in-game systems all throughout 2025, and I technically still made it before the end of the year. TH03's enemy, fireball, and explosion systems are a great fit for this occasion: They fulfill both of the netplay-relevant criteria I mentioned 📝 at the end of the previous blog post, but also unfortunately share the same structure and overload some of their fields with vastly different meanings, much like 📝 TH04's and TH05's custom entities. Hence, they will take rather long to untangle, which ensures that the resulting look will be appropriately big.
Until I noticed that explosions spawn bullets in a not-so-straightforward way that basically requires complete knowledge of the bullet system. What a great discovery to make 2½ pushes into development… Oh well, we have the budget, and bullets also happen to match our netplay-relevant criteria, so let's get those done first.
As usual, we'd also like to identify and name all character-specific functions in the ASM code so that we can immediately correlate certain interesting features of the bullet system with the characters and attacks that use them. In TH03, this is particularly worthwhile because it's all we need for a 100% complete overview of how bullets are used. Apart from the transferred pellets fired from exploding enemies outside of Gauge or Boss Attacks, every bullet pattern in the game is part of such a hardcoded and character-specific Extra, Gauge, or Boss Attack, since enemy scripts cannot fire bullets in this game.
This quick look also showed how ZUN implemented the 9 characters in a highly consistent manner. Gauge Attacks in particular follow a predictable convention:
The "Level 2" attack (available at 50% gauge and consuming 25% gauge) always fires 8×8 pellets.
The "Level 3" attack (available at 75% gauge and consuming 50% gauge) always fires 16×16 bullets.
The game only provides a single function pointer for each of the two levels, which gets called as part of game logic and before the game starts rendering the current frame. With no room for custom rendering calls, characters can only define these attacks as patterns that are made up of common entities.
The funny part in all of this: All characters follow these conventions, yet ZUN still architected TH03 as if they don't. Each of the two Gauge Attack levels gets a separate per-player function pointer, but every character just uses these two functions to call a single common function with a flag that indicates Level 2 or Level 3. This common function then follows a similar structure for all 9 characters as well. The same trend continues with the Boss Attacks, where we find 9 copies of the more or less unchanged update and rendering boilerplate… so yeah, ZUN basically copy-pasted the same code 9 times with minor variations.
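To make this dispatch structure more tangible, here is a minimal Python sketch of it. This is an illustration rather than decompiled code, and every name in it (`reimu_gauge_attack()`, `GaugeAttackPointers`, the `calls` log) is hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, List

calls: List[str] = []  # records what got called, purely for illustration


def reimu_gauge_attack(level: int) -> None:
    """Hypothetical common function: one body, parametrized by the level."""
    size = "8x8 pellets" if level == 2 else "16x16 bullets"
    calls.append(f"reimu level {level}: {size}")


# Thin wrappers that only exist to fit the two-pointer interface.
def reimu_gauge_level_2() -> None:
    reimu_gauge_attack(2)


def reimu_gauge_level_3() -> None:
    reimu_gauge_attack(3)


@dataclass
class GaugeAttackPointers:
    """Hypothetical per-player pair of function pointers."""
    level_2: Callable[[], None]
    level_3: Callable[[], None]


player = GaugeAttackPointers(reimu_gauge_level_2, reimu_gauge_level_3)
player.level_2()
player.level_3()
```

A single pointer that takes the level as a parameter would express the same thing without the wrappers, which is the redundancy pointed out above.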
And now we can be very hyped for the future of TH03 decompilation. Lots of duplicates of the same functionality means that I'll basically only have to decompile them once, which means that TH03 decompilation is very likely to progress very quickly in terms of absolute numbers once I get the basic gameplay systems done. And there's not a lot of that code left either: After this delivery, we're left with a mere 128 undecompiled foundational functions that are not related to any specific character. After the next delivery, that number will drop to 95. I'm expecting a return to the glorious days of 2020, where the 3 copies of TH01's foundational graphics code allowed me to decompile 10% of its entire code within 2½ weeks. With 9 copies tripling that speed, we may even get to finish this game next year?
Onto bullets then! As you'd expect, TH03's bullet system is based on 📝 TH02's system we looked at earlier this year, which in turn was based on TH01's system. In some respects, it's a minor iteration of TH02's system adapted to the new features in TH03, but some of these new features also form the missing link between TH02 and 📝 TH04. The high-level overview:
Like most entities in TH03, bullets are stored in a single array that is shared between both players. Each bullet has a structure field that denotes which playfield it is moving on and constrained to.
The total bullet cap shared among both players is 320, slightly more than twice the 150-bullet cap we saw in TH02.
Just like in TH02, this system covers both 8×8 pellets and 16×16 sprite bullets. The former are once again hardcoded and rendered using the GRCG, while the latter are rendered by SPRITE16.
The system defines a default set of four adjacent 16×16 bullet sprites, starting at (64, 0) within 📝 SPRITE16's sprite area. These can be used in two ways:
Four animation frames for a single bullet type, animated at the maximum speed of 1 cel per frame. This is how they are used by most characters.
Alternatively, they can represent one non-animated bullet type and three trail sprites, as seen in Mima's and Chiyuri's Boss Attacks.
Patterns can override the default 16×16 sprites with an arbitrary other set of four adjacent sprites, but this feature is only used in Kana's Extra Attack.
[Table: the default 16×16 bullet sprites of each character (Reimu, Mima, Marisa, Ellen, Kotohime, Kana, Rikako, Chiyuri, Yumemi), shown as animated images in the original post]
The SPRITE16 area of certain characters might contain other bullet-like sprites, but these are used by a different gameplay system than the one described in this post.
I've slowed these animations down to 1/8 of their original speed because the original frame duration of 18 ms would look very obnoxious in the context of a web page. Run webpmux -duration 18ms on the animated WebP files to restore the original speed.
And yes, Yumemi's sprite is animated very subtly. Click the animated sprites for the raw sprite sheet.
As we already found out 📝 in 2022, both 8×8 pellets and 16×16 bullets have the same "hitbox" – a single 2×2-pixel tile in the game's collision bitmap that gets compared against the 8×8 square surrounding the player's center.
Delay clouds are back after their absence in TH02. They are still limited to pellets, though.
The 📝 bullet template makes its debut in this game, replacing TH01's and TH02's spawn function parameters with a single piece of global data. This introduces the usual trade-offs with this sort of thing: Code size savings in patterns that spawn multiple groups with minor variations to their parameters, in exchange for the usual confusion that comes with widely mutated global state. As a result, code quality suffers greatly, especially when it comes to the derived transfer pellets fired within the bullet system itself. Sure, there are lots of patterns where retained state comes in handy, but local per-pattern template instances would have solved that as well.
This decline in code quality also extends to the rest of the logic code. Continuing his general trend of micro-optimizing based purely on vibes we've seen 📝 time and 📝 time again, ZUN wrote a significant part of TH03's bullet logic in ASM, especially in the update function. And once again, it's the same tragic conclusion: While TH04's and TH05's full-on ASM approach might later bring some measurable runtime benefits to bullet logic via self-modifying code and whatnot, TH03 is left with pretty much only the downsides of its partial ASM approach. If ZUN just wrote idiomatic C++ code without any optimization tricks and inlined just one function, the whole bullet update code would have been 87 lines of C++ shorter and exactly as large when compiled. And that's with all the redundant code still in place! TH02's not-great-but-passable implementation of bullets indeed marked the high point for the PC-98 series, and it only went downhill from there.
Three features of the bullet system deserve a deeper look:
Trail sprites
These work by remembering the last 6 positions of a bullet and rendering the sprites at the 2nd, 4th, and 6th position, respectively:
Obviously, this requires (6 × 2 × 2) = 24 additional bytes per bullet. Adding these to the regular bullet structure would waste (24 × 320) = 7,680 bytes of conventional RAM, which would in no way be justified for a feature that ZUN used in a grand total of two patterns. For once, ZUN agreed, and instead provided a single ring buffer that can hold these 6 additional positions for up to 48 bullets. This allowed ZUN to reduce the per-bullet cost to 3 bytes: 1 byte for the has trail flag, and 2 bytes for a near pointer into the ring buffer. That's still two more bytes than absolutely needed, and the debloated branch will definitely free up these wasted 640 bytes for portability reasons alone.
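The memory math can be double-checked with a few lines of Python. This sketch reads the three factors as 6 positions × 2 coordinates × 2 bytes per coordinate, and assumes that a single flag byte per bullet is the "absolutely needed" minimum. Both readings are assumptions, and all constant names are made up for this example:

```python
BULLET_CAP = 320        # shared bullet cap, from the text
TRAIL_POSITIONS = 6     # remembered positions per trail bullet
COORDS = 2              # X and Y (assumed meaning of the first "2")
BYTES_PER_COORD = 2     # 16-bit coordinates (assumed meaning of the second "2")
TRAIL_CAP = 48          # number of bullets the ring buffer can hold

bytes_per_trail = TRAIL_POSITIONS * COORDS * BYTES_PER_COORD

# Naive approach: position memory embedded into every bullet
naive_total = bytes_per_trail * BULLET_CAP

# Ring buffer approach: one buffer, plus flag and near pointer per bullet
ring_buffer = bytes_per_trail * TRAIL_CAP
per_bullet_overhead = (1 + 2) * BULLET_CAP
ring_total = ring_buffer + per_bullet_overhead

# Hypothetical minimum with only a 1-byte flag per bullet
minimal_total = ring_buffer + 1 * BULLET_CAP

print(naive_total, ring_total, minimal_total, ring_total - minimal_total)
```

Under these assumptions, the ring buffer design ends up at less than a third of the naive 7,680 bytes, and the difference to the hypothetical minimum is exactly the 640 bytes mentioned above.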
This trail sprite cap of 48 seems a bit random at first. Unlike the regular bullet cap that the game enforces by just not spawning any new bullets if all 320 slots are occupied, the trail sprite cap is not enforced or even just checked in any way. Due to the circular nature of the buffer, the 49th simultaneously active bullet with a trail sprite will then share its position memory with the 1st trail sprite bullet, leading to one additional position memory update per frame and trail sprites appearing in wrong positions.
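The aliasing can be reproduced with a tiny model of such an unchecked ring buffer. The text only establishes that the buffer is circular and that nothing checks for free space; the allocation scheme with an incrementing cursor and the `TrailRing` class are assumptions made for this sketch:

```python
TRAIL_CAP = 48


class TrailRing:
    """Hands out position memory slots without ever checking for free space."""

    def __init__(self, capacity: int = TRAIL_CAP) -> None:
        self.capacity = capacity
        self.next_slot = 0

    def allocate(self) -> int:
        slot = self.next_slot
        # Silently wraps around, no matter whether the slot is still in use
        self.next_slot = (self.next_slot + 1) % self.capacity
        return slot


ring = TrailRing()
slots = [ring.allocate() for _ in range(49)]
print(slots[0], slots[48])
```

The 49th allocation returns the same slot as the 1st one, so both bullets would write to and read from the same six remembered positions from that point on.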
Thus, it's the game design's responsibility to make minimal use of trail sprites to avoid these glitches. On the surface, it certainly looks as if ZUN was careful here:
The cap happens to exactly match the 48 bullets fired as part of the ring group in Chiyuri's pattern, which is definitely the more bullet-intensive pattern of the two.
Both trail-using patterns are part of Boss Attacks, and only a single player's Boss Attack can be active at any given time.
The ring groups in Chiyuri's pattern move fast and are separated by a 96-frame delay, as captured in the video above. By the time Chiyuri spawns the next group, every bullet of the previous one should have long been removed due to flying past the edges of the playfield.
Mima fires her alternating 5- and 4-spreads at a much shorter interval, but that interval is still long enough to never leave more than 5 of these 9-bullet subpatterns on screen at once.
Until you test Chiyuri's pattern on the (📝 announced) Boss Attack level 1 and notice that the bullets move slowly enough that the previous ring group has not yet left the playfield when the next one spawns. The result:
Missing bullets on every group beyond the first, and even sporadic trail sprites at mixed-up X and Y coordinates. This nicely demonstrates how these trail sprites are not just cosmetic, but also take control of clipping and affect gameplay as a result. For obvious optical reasons, trail sprite bullets will only get removed after the 6th remembered position lies outside of the clipping area – i.e., 6 frames later than bullets without trail sprites. If the remembered positions are then shared with a second bullet, the game would also clip that bullet if the clipping condition of the first one is met – regardless of the fact that the second bullet's main sprite might be nowhere close to the boundaries of the playfield. This clipping then either happens on the same frame if the second bullet's slot number within the 320-element bullet array is higher than the slot number of the first one, or on the next frame if the second bullet's slot number is lower.
The mixed-up X and Y positions on frames 139 and 235 can also be explained by clipping. The update function processes the X and Y coordinates independently from each other: It starts with the horizontal clipping checks, updates the position memory for the X coordinate, and then repeats both steps for the Y coordinate, immediately removing the bullet and moving on to the next one if it failed the respective clipping check. If the vertical clipping checks fail in a situation where two bullets share the same position memory, you'll end up with a mismatched X/Y pair where X comes from a clipped bullet and Y comes from an active one… for a single frame, until the same clipping check is applied to the other bullet and removes it as well. Hence, this is the only "fixable" bug in the bullet system that won't affect gameplay, as the mixed-up positions are unrelated to the result of the clipping condition that ultimately removes both bullets.
It's rare for Chiyuri's Boss Attack to launch the same pattern multiple times in a row, and once the (announced) Boss Attack level is ≥3, bullets already move fast enough to prevent this quirk from happening. But it's definitely possible to run into it during regular gameplay.
For a clearer and more extreme demonstration of the resulting glitches, let's turn Mima's trail sprite pattern into a 64-ring:
Explaining every single quirk in this hypothetical video is left as an exercise to the reader.
Rings and other groups
If there's one aspect where TH03's bullet system shows its TH02 lineage most clearly, it's the set of predefined bullet groups. The 2-, 3-, 4-, and 5-spreads with fixed narrow, medium, and wide arc angles, as well as the multi-bullet groups with randomized angles and speeds, are not only available in TH03 once again, but reuse the exact same code from TH02.
Instead, TH03's main innovation can be found in its ring system. Rings can now have any number of bullets between 0 and 255, and are no longer limited to the first six powers of 2. This allowed ZUN to fine-tune most ring groups based on the Gauge or Boss Attack level, and to also just have a few static ring patterns with non-power-of-two bullet counts. Chiyuri's aforementioned 48-ring trail sprite pattern falls in this category, and the rotating 5-ring pattern seen in Kana's Boss Attack is another example.
And yes, storing the number of ring bullets in a regular uint8_t field now also allows patterns to spawn 0-rings. And sure enough, the bug that would later cause 📝 Kurumi's Divide Error crash in TH04 was actually introduced in TH03! The underlying code wasn't modified between the two games, which further proves that TH04's bullet system also traces back to TH03 and wasn't rewritten from scratch, at least concerning this aspect. TH03 just doesn't have any (known) way of triggering the bug in the unmodded original game.
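To illustrate why a 0-ring is dangerous, here is one plausible way of spreading `count` bullets over a full circle. This is not the game's actual formula: the 0x100-unit angle system and the point where the division happens are assumptions, and `ring_angles()` is a made-up name. Python's `ZeroDivisionError` stands in for the x86 Divide Error exception:

```python
from typing import List

FULL_CIRCLE = 0x100  # assumption: 8-bit angles, 0x100 units per full turn


def ring_angles(count: int, start_angle: int = 0) -> List[int]:
    """Returns the angles of a ring with `count` evenly spaced bullets."""
    # Dividing by the bullet count is what blows up for a 0-ring.
    step = FULL_CIRCLE // count
    return [(start_angle + (i * step)) % FULL_CIRCLE for i in range(count)]


print(ring_angles(5))

try:
    ring_angles(0)
except ZeroDivisionError:
    print("0-ring: division by zero")
```

Any count between 1 and 255 works with this kind of code, which matches how a plain uint8_t count field gives patterns the freedom described above, but also leaves 0 as a representable value that nothing guards against.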
Interestingly, TH02's ring system with predefined power-of-2 bullet counts is still part of TH03, and ZUN does use it for some ring groups in a few Boss Attack patterns. Did he do this because it's shorter than adding a second line of code that sets bullet_template.count? Did he deliberately need to preserve the previous value of bullet_template.count across groups? Or did he code these patterns at an earlier time in development when the arbitrary ring system didn't exist yet? Until I've decompiled every single bullet pattern in this game, we can only guess.
However, ZUN also removed two of TH02's group-related features from TH03:
The eight special motion types have been reduced to a single gravity type. While gravity is now a separate flag in the bullet structure and template that can now be applied to any group, this vast removal of options still severely limits the expressivity of bullet patterns in TH03. This means that every non-gravity bullet in the game moves at a constant velocity.
Gravity is also exclusively used by Kana, both in her Extra Attack with the alternative 16×16 bullet sprites and in one certain pattern of her Boss Attack, which demonstrates gravity in combination with a ring group.
One of the rare patterns that arguably looks prettier on Easy, where the slower bullet speeds leave more room for the gravity effect to accelerate the fall.
The auto-stacking system was removed without any direct replacement. With TH03's more 📝 numeric method of defining difficulty, ZUN no longer needed this quick mechanism to 📝 distinguish Easy and Normal from Hard and Lunatic. This was one of the better changes between the two games though; the auto-stacking system added a quite annoying asterisk to the documentation of the random groups that is no longer needed in TH03.
Manually creating stacks is obviously still possible by spawning separate versions of the same group with gradually reduced speed. This might be considered another practical advantage of the global bullet template, since you only need to mutate a single field before calling bullets_add() again. But really, nothing justifies global data.
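A manual stack built on top of a global template could look roughly like the following Python sketch. Only the names `bullet_template`, `count`, and `bullets_add()` come from the text above; the remaining fields, the `spawned` list, and `fire_stack()` are hypothetical:

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class BulletTemplate:
    count: int = 1
    angle: int = 0      # hypothetical field
    speed: float = 2.0  # hypothetical field


bullet_template = BulletTemplate()  # the single global instance
spawned: List[BulletTemplate] = []


def bullets_add() -> None:
    """Spawns a group from whatever the global template currently holds."""
    spawned.append(replace(bullet_template))  # snapshot of the current state


def fire_stack(layers: int, speed_step: float) -> None:
    for _ in range(layers):
        bullets_add()
        bullet_template.speed -= speed_step  # the only field that changes
    # The template stays in its mutated state for whoever fires next.


bullet_template.count = 5
bullet_template.speed = 3.0
fire_stack(layers=3, speed_step=0.5)
print([group.speed for group in spawned], bullet_template.speed)
```

The convenience and the problem are both visible here: the loop body is a single line, but the template leaves `fire_stack()` with a speed that no caller explicitly asked for. Passing a local template instance to `bullets_add()` would keep the short loop without the leftover state.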
Transferred pellets
This is the final gameplay feature that deserves its own section. Let's follow the pellet's X coordinate on its way from the spawn point to its destination on the other playfield:
1) ZUN calculates the destination coordinate on the target playfield as a random Q12.4 X coordinate between 0 and 288.
2) This subpixel coordinate is translated to screen space, adding either 16 for the left playfield or 336 for the right one. ZUN does this using the regular pixel-space conversion function that is typically used to calculate blitting coordinates, losing subpixel precision in the process and forming a very minor quirk.
3) The pellet's movement angle is calculated in screen space, aiming a screen-space version of the pellet's origin point at the coordinate from 2).
4) The screen-space pixel coordinate from 2) is translated back to a Q12.4 subpixel coordinate on the pellet's originating playfield. The result will deliberately lie outside the boundaries of this playfield: For a pellet flying from left to right, it will be between 320 and 608, while it will be between -320 and -32 for a pellet flying from right to left.
5) The pellet then flies to this out-of-bounds coordinate while internally staying on the playfield it originated on. This means that neither the update nor the rendering code can clip the pellet at the borders of its originating playfield. Once it flew past the border, it only visually appears on the other playfield because that's what the out-of-bounds X coordinate translates to when the renderer converts it to screen space.
6) Once the pellet's X coordinate has approached or flown past this relative target coordinate from its respective movement direction, the pellet is removed and respawned as a delay cloud.
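The coordinate conversions in this list can be verified with a short Python sketch. The playfield width of 288 pixels and the screen offsets of 16 and 336 are taken from the text; the function names, the player ID convention (0 = left, 1 = right), and the exclusive upper bound of the random number are assumptions. The angle calculation is left out:

```python
import random

PLAYFIELD_W = 288              # pixels
SCREEN_LEFT = {0: 16, 1: 336}  # screen-space X of each playfield's left edge
SUBPIXEL_BITS = 4              # Q12.4


def to_q12_4(pixels: int) -> int:
    return pixels << SUBPIXEL_BITS


def to_pixels(q: int) -> int:
    return q >> SUBPIXEL_BITS  # drops the 4 subpixel bits


def transfer_target_x(origin_pid: int, rng: random.Random) -> int:
    """Returns the Q12.4 target X, relative to the originating playfield."""
    target_pid = 1 - origin_pid
    # Random Q12.4 destination on the target playfield
    dest_q = rng.randrange(to_q12_4(PLAYFIELD_W))
    # Conversion to screen space, which loses the subpixel part
    screen_x = SCREEN_LEFT[target_pid] + to_pixels(dest_q)
    # Back to Q12.4, but relative to the originating playfield
    return to_q12_4(screen_x - SCREEN_LEFT[origin_pid])


rng = random.Random(0)
left_to_right = [to_pixels(transfer_target_x(0, rng)) for _ in range(1000)]
right_to_left = [to_pixels(transfer_target_x(1, rng)) for _ in range(1000)]
print(min(left_to_right), max(left_to_right))
print(min(right_to_left), max(right_to_left))
```

The low 4 bits of the returned coordinate are always zero, which is the lost subpixel precision referred to above, and all results fall into the out-of-bounds ranges of 320 to 608 and -320 to -32.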
This shows that the 32-pixel border between the two playfields is not just visual, but an actual part of the simulated game world. We can visualize this by removing the black cells on the text layer:
Also, we need to clear these 32 border pixels in VRAM on every frame to nicely visualize just these pellet transfers. TH03 obviously doesn't do that for performance reasons and lets partially clipped sprites accumulate below the border, 📝 just like the other shmups do.
This video also demonstrates another minor quirk: Transferred pellets are aimed at a random center Y coordinate between 0 and 16, but the subsequent delay cloud is always spawned at center_y = 2.0.
Speaking of, there's also a lot to cover in…
TH03's bullet renderer
And at first, it looks pretty good! TH03 retains the best idea from TH02 and batches rendering into three passes:
16×16 bullets, rendered normally via SPRITE16
32×32 delay clouds, rendered via SPRITE16's monochrome mode
8×8 pellets, rendered from hardcoded sprites via the GRCG
The second pass is skipped if the first pass didn't detect at least a single active delay cloud. However, this skip can only remove 320 out of the 960 iterations over the entire bullet array, every frame. Combine that with the most unlucky allocation of registers, and the resulting instructions end up wasting a low 5-digit number of CPU cycles per frame on a 486 in the worst case of no bullets being active. Same game that wrote large parts of its bullet update function in ASM, by the way.
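The loop structure behind these numbers can be modeled as follows. This is a simplified Python sketch of a three-pass renderer over a fixed-size array, not the game's code; the slot states and the returned draw list are hypothetical:

```python
from typing import List, Tuple

BULLET_CAP = 320
FREE, BULLET_16, DELAY_CLOUD, PELLET = range(4)  # hypothetical slot states


def render(slots: List[int]) -> Tuple[List[Tuple[str, int]], int]:
    """Returns the draw calls in order, plus the number of loop iterations."""
    drawn: List[Tuple[str, int]] = []
    iterations = 0
    saw_cloud = False

    for i, kind in enumerate(slots):  # pass 1: 16×16 bullets
        iterations += 1
        if kind == BULLET_16:
            drawn.append(("sprite16", i))
        elif kind == DELAY_CLOUD:
            saw_cloud = True

    if saw_cloud:  # pass 2: delay clouds, only if pass 1 saw any
        for i, kind in enumerate(slots):
            iterations += 1
            if kind == DELAY_CLOUD:
                drawn.append(("cloud", i))

    for i, kind in enumerate(slots):  # pass 3: pellets
        iterations += 1
        if kind == PELLET:
            drawn.append(("pellet", i))

    return drawn, iterations


print(render([FREE] * BULLET_CAP))
```

Even with an entirely empty array, this model still walks over all 320 slots twice, which is where the wasted cycles in the no-bullets case come from.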
While that number is still an order of magnitude away from causing significant performance problems, this issue became serious enough in TH04 for ZUN to introduce a display list for at least pellets that would cut down the number of iterations.
And then, we look at…
Pellet rendering
… and are greeted by the single strangest set of hardcoded sprites across all of PC-98 Touhou so far. TH03's pellets are not only the first time we see a doubly-preshifted sprite sheet, but 2 of the 16 variants for the transfer pellet sprites are also shifted incorrectly:
You can see this bug all over the video above, for example in frames 97, 105, 107, 109, 111…
The doubly-preshifted nature of this sprite sheet, on the other hand, raises a whole lot of PC-98 blitting performance questions. This only possibly makes sense as an attempt at optimizing away the unaligned 16-bit VRAM writes you'd naturally run into when shifting an 8-wide sprite to cover two bytes.
Let's look at a regular 8-wide pellet sprite that was singly-preshifted to 16 pixels/bits. If we want to blit such a sprite to a left X position of 12 with the minimum amount of instructions, we would perform a single 2-byte write to VRAM address 0x0001, which itself is not divisible by 2:
On most 16-bit architectures, unaligned memory writes like these are either slower than aligned writes or entirely unsupported. The x86 MOV and MOVS instructions fall into the first category, so it makes sense to think that the GRCG might add a performance penalty of its own on top of the already higher latency of these instructions.
The natural workaround, then, is to add a second set of preshifted sprites to cover the remaining 8 possible start bit positions within a 16-pixel VRAM word. This would expand pellet sprites to a total width of 23 pixels. Understandably, ZUN also wanted to optimize for the low instruction counts, so he had to round up the physical width of the sprite to 32 pixels. Then, every preshifted variant could be blitted with a single MOVSD instruction:
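As a sketch of what doubly-preshifting means for a single 8-pixel sprite row, the following Python code generates all 16 variants and picks the one to blit for a given X position. The function names are made up for this example, and the byte strings simply list pixels from left to right; how these bytes would be arranged in a register for an actual MOVSD is deliberately left out:

```python
from typing import List, Tuple


def preshift_row_32(row: int) -> List[bytes]:
    """Returns 16 variants of an 8-pixel row, each as 4 bytes in VRAM order.

    Bit 7 of `row` is the leftmost pixel. Variant `s` starts `s` pixels to
    the right of a 16-pixel word boundary.
    """
    variants = []
    for shift in range(16):
        bits = (row << 24) >> shift  # 32-pixel row, leftmost pixel = MSB
        variants.append(bits.to_bytes(4, "big"))
    return variants


def blit_params(left_x: int) -> Tuple[int, int]:
    """Returns (word-aligned byte offset within the VRAM row, variant)."""
    return (left_x // 16) * 2, left_x % 16


variants = preshift_row_32(0xFF)
offset, variant = blit_params(12)
print(offset, variant, variants[variant].hex())
```

For the example X position of 12, the write now starts at the even byte offset 0 instead of the odd offset 1. The price is visible in the data: the last byte of every variant is always empty, and only 23 of the 32 pixel columns can ever be set.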
But does this actually matter for the PC-98 and the GRCG? Are unaligned writes actually slow enough to justify writing 2× as much sprite data per frame and hardcoding 4× as many bytes? Unfortunately, I don't know of any hardware-level documentation about the GRCG that would conclusively answer this question. All the usual books and text files are disappointingly surface-level and only document the same programmer interfaces over and over, and hardware researchers are still waiting for EGC and GRCG die shots to even get started.
There are a few signs that this might be a good idea:
Any VRAM-reading EGC operation must use aligned 16-bit accesses, which probably has a deeper reason that goes beyond the size of its internal shift register. And since you activate the EGC by first activating the GRCG in TDW mode…
Neko Project spends the same number of clock cycles on both 8- and 16-bit GRCG writes. The absence of a dedicated 32-bit write handler suggests that real hardware breaks down 32-bit writes into two 16-bit writes, implying that we don't also have to worry about 32-bit alignment of our single MOVSD instruction.
Shouldn't byte access be a given? Clearly, this would only deserve special mention if it wasn't because the previous contents of this book heavily implied some sort of 16-bit nature and I just missed it.
But without documentation or benchmarks, none of this means anything.
This is also why I haven't yet explored this whole field of optimizing VRAM writes for alignment. It would always involve branching to alignment-respecting code similar to how master.lib does it, but code like this is at odds with the more tangible goal of minimizing instruction counts in the generic case. Not to mention that we'll once again have to test this across every PC-98 hardware generation and possibly even GRCG revision if we ever go down to that level of optimization…
But even if alignment matters, ZUN's approach of unconditional MOVSD instructions still appears to be slower on average. Consider the optimal 56.25% of cases where the sprite does lie within a single 16-bit word:
8 start positions within the first byte + 1 start position on the second byte = 9 out of 16 possible start positions = 56.25%. The 9th variant for (x % 16) == 8 wouldn't be part of a regular singly-preshifted sprite sheet where the renderer blits the (x % 8)th = 0th variant. But it would definitely be worth adding if alignment does matter at all.
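This percentage follows directly from counting the start positions at which an 8-pixel sprite does not cross a 16-pixel word boundary. A quick check in Python, with made-up names:

```python
SPRITE_W = 8   # pellet width in pixels
WORD_W = 16    # pixels per aligned 16-bit VRAM word

# Start positions within a word where the sprite ends before the next word
single_word = [s for s in range(WORD_W) if (s + SPRITE_W) <= WORD_W]
share = len(single_word) / WORD_W

print(single_word, share)
```

Start positions 0 to 8 qualify, and the remaining 7 positions (9 to 15) are the ones where the sprite spills into the next word.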
Keep in mind that we still use the GRCG here, and that it will also have to perform its fast-but-not-entirely-free four-plane Read-Modify-Write operation for the empty sprite bytes 3 and 4. Unconditional 32-bit writes would only be worth it if the GRCG somehow optimizes away empty writes at the microarchitecture level. That assumption is even more of a stretch, because 📝 why would master.lib even check for emptiness if that were true?
In the end, doubly-preshifted sprites slow down 56.25% of all blitting operations in a dubious attempt to speed up the other 43.75%. Unaligned 16-bit writes would have to be really slow to justify this approach – and judging from the fact that TH04 went back to single-byte preshifting, this is not the case. Maybe I'll write a benchmark for this someday, but honestly, this is the least interesting PC-98 benchmark question I've encountered so far. There are slowdown issues at 📝 our performance target of 66 MHz in Neko Project, but pellet sprite alignment is unlikely to significantly contribute to those.
16×16 bullets are simply rendered using standard SPRITE16 calls, nothing special there. That only leaves…
Pellet delay clouds
Just like the 48×48 📝 hit circle, these 32×32 sprites are rendered using SPRITE16's single-color render-path, which uses the EGC's GRCG-equivalent mode. Last year, I took a very brief look at this mode and wondered whether this was actually faster than just using the GRCG. 1½ years and 📝 one benchmark won by the EGC later, it certainly seems so, especially since we want to blit these to unaligned X positions. The EGC's hardware-accelerated pixel shifting seems highly preferable once sprite widths exceed 24 pixels and you can't fit a row of pixels in a 32-bit register anymore.
Stepping through SPRITE16 reveals that this GRCG-equivalent mode matches the GRCG even in how it doesn't read monochrome sprite data from VRAM, but from SPRITE16's 1bpp alpha mask buffer in conventional RAM.
But that only raises the question of why you'd want to use SPRITE16 over the raw EGC. It makes sense why SPRITE16 would have this feature; flashing existing sprites in a single color every once in a while is a useful thing to have in a game-focused rendering API. But using this feature for sprites that are only rendered in this monochrome mode just wastes the VRAM that these sprites occupy in SPRITE16's sprite area. You still blit such a sprite by passing a byte offset into the sprite area, which then gets interpreted as an offset into SPRITE16's alpha mask buffer.
If SPRITE16 had a function for directly blitting from a pointer to 1bpp data, ZUN could have freed up quite a bit of VRAM and maybe even added more sprites for character-specific attacks. Conceptually, it makes sense why SPRITE16 would restrict itself to a single sprite source, but it is quite an unfortunate omission, I'd say.
13,568 pixels, to be exact. And yeah, you could technically overwrite the affected portions of VRAM after generating alpha masks via INT 42h, AH=01h. But since SPRITE16 only stores one such alpha mask buffer, you still couldn't reuse this space for other SPRITE16 sprites.
And that was the last PC-98 Touhou bullet system we were still missing! But at a little over 2 pushes, I have to find something else to do to round out the third one… wait, what about that one incident?
Migrating away from Twitter
On November 6, Twitter was hit by an automated ban wave that suspended all accounts that were using the OldTweetDeck extension. After Twitter discontinued the official free TweetDeck frontend on 2023-08-17, I quickly switched to OldTweetDeck – not just because it was free, but because it supported multiple accounts and was simply more performant than X's own premium offering at the time. In return, I gladly paid my €10 a month to dimden instead, who deserved it much more for continuously updating OldTweetDeck to all of Twitter's API changes over the years. It's very impressive how he kept it running for 2⅓ years without any such critical issues and still keeps maintaining it to this day.
Aside from Touhou Patch Center and all of my accounts, the ban wave affected enough people that Twitter decided to gradually revert it a day later. But without any public postmortem or apology, this feels more like an act of grace that we shouldn't take for granted.
Ever since Elon took over, the Internet has been full of sensationalist doomposts about Twitter's imminent downfall any moment now OMG. For the longest time, I could ignore all these pundits because nothing of what they were complaining about was affecting my little corner. But sudden account suspensions are an existential threat to my business, and finally provided the first actual technical and non-political argument to get my data off Twitter in the medium term. I've put too much effort into all of the content there to let it be exclusively controlled by any one company.
Hilariously, things only got worse from there. Until two days ago, Twitter's data download option was inaccessible due to an infinite redirection bug. Call it malice or incompetence, but leaving such an issue unfixed for weeks is a definite sign of a platform in decline. Thus, I had to run the import on an older archive I happened to request on 2023-07-02.
And then I looked inside that archive and noticed that it was missing at least three key pieces of data that Twitter demonstrably stores for my account:
Poll options
Alt text for images. (Also known as the most annoying and time-consuming part of every tweet if you actually want to properly explain an image with all its context and implications. AI won't help with that as long as its context window doesn't span every piece of knowledge related to this project. 🤷)
The original version of each uploaded image, which they do have for a fact because it's shown in the /status view. The archive only contains the processed versions shown in the timeline, which were resized to at most 1200 pixels along their larger dimension and which may or may not have been converted to JPEG based on rules I didn't bother to reverse-engineer.
That's a rather selective interpretation of Art. 20 GDPR. If the argument is that you can just scrape that data out of the HTML yourself, why are they even bothering with sending me anything more than a nested list of tweet IDs, then? 🤨 Someone with more time and care could probably turn this into a lawsuit…
Presenting all media in its original quality is one of the more important reasons for moving to a self-hosted service as far as I'm concerned, especially since mainstream media conversion pipelines are infamous for destroying pixel art. So I went through my hard drives and replaced Twitter's images with the original versions of all 167 non-retweeted images I had uploaded to Twitter until July 2023. The videos also desperately needed to be replaced with their original AV1 versions; Twitter's enforced x264 YUV420P format has been the single worst implementation detail of that entire platform…
You can backdate posts by modifying their creation time, but Bluesky's crawlers will also record the indexed time when they first saw each post on the network. Unfortunately, the bsky.app frontend that everyone uses will then present this indexed time as a post's main timestamp, demoting your intended creation time to an archival time that Bluesky can't confirm the authenticity of:
The PDS database schema does track an indexedAt timestamp in addition to the createdAt timestamp you specify during the import, but indexedAt might as well not exist because it doesn't seem to be used anywhere.
There is a PR that would slightly improve the UI in this case, but it's been languishing unmerged throughout most of 2025. Probably because it has to be merged by the same people who came up with the current UI in the first place, and who prioritized resilience against pranks and disinformation campaigns.
But even if the UI is fixed, these imports would spam the timeline of everyone who follows the existing Bluesky account that we obviously want to import into.
Figuring out and confirming that first issue required remote debugging of the PDS server written in Node.js. Visual Studio Code's LSP quickly ran up against my server's low amount of RAM, which forced me to upgrade my server just to efficiently navigate through the source code…
Typical Node.js criticisms aside, the architecture of the PDS server is quite bizarre. A whole lot of the apparent API surface is never directly called, but generically proxied to some other node in the AT Protocol network at the byte level. If you log into your PDS via bsky.app, it seems as if the AppView calls API endpoints like /xrpc/app.bsky.unspecced.getPostThreadV2 on your PDS, but good luck meaningfully intercepting any of these requests, or even just getting your debugger to break on them.
Together with lots of bulky API schema descriptions in the form of lexicons, all this XRPC code makes up a big proportion of the code in the @atproto/pds package. But for… what exactly? Why would the PDS server need a thick layer of type safety and validation for payloads it doesn't look at, and that the relays will have to verify anyway? Why do they install all this dead code that will confuse most people who are trying to understand this system?
In the end, we just can't thoroughly backdate our imported posts because the crawl timestamps are set by the relays, whose code we have no control over. Now, I could ignore all these issues and still upload some sort of full archive to the platform that now houses 1/6 of my following, but this just doesn't match the quality I expect from the canonical, definitive source of my short-form news posts.
Trying Mastodon
That leaves the Fediverse as the only remaining alternative for a service where people can still follow, like, and repost my content using relatively commonly used clients. Among the various ActivityPub implementations, Misskey is particularly popular among the Japanese Touhou community, but I've only heard bad things about its resource usage. Mastodon isn't the most lightweight option either – as aptly implied by its name – but you can make the argument that it's become the default option across the Fediverse over the years. Thus, there'll be at least a slight chance that people will be familiar with the web UI of what I'm about to self-host.
Too bad that I didn't even get through the first page of the setup guide before being stumped by obscure asset precompilation errors that apparently no one else has ever faced. In a way, it's commendable that a project would exclusively explain a bare-metal from-source setup in the Docker-dominated DevOps seascape of 2025. But why would you want to do this for a project that requires servers to be infested with npm and Postgres and a bleeding-edge self-compiled version of Ruby and several -dev packages for C dependencies of certain Ruby gems? Unsurprisingly, Japanese Python behaves just like Dutch Ruby in how the community effectively treats every minor version as a major version because there are no adults left in the room to put all the children and Ph.Ds in their place…
Fortunately, ActivityPub is relatively simple to implement and there are plenty of existing servers that are better suited to the kind of PR channel I'm actually looking for. After a very quick search, I settled on…
GoToSocial
…which immediately impresses in pretty much every single area:
After the two previous bloatfests, it's very refreshing to see a single binary next to a bunch of static assets. Sure, 87.4 MiB is certainly way more bulky than necessary, but still much smaller than either of our two competitors.
The documentation is extremely well-organized and polished, especially for a project that's on version 0.20.2.
I can write Markdown in posts!
WebP and AV1? Just work too, without any attempt to convert the main image or video that gets attached to a post. Sure, the thumbnailer does convert images, but that's way less critical…
…and you can effectively bypass it by passing some five-digit size to media-thumb-max-pixels.
The whole thing works exactly like a lightweight server should work: A single binary serving posts from a SQLite database and media attachments from static files lying next to it. With easy access to every piece of data, fixing typos and import errors after the fact is trivial. Applying these might need a server restart for caching reasons, but they're immediately reflected in whatever app is accessing the data.
Drawbacks? The database schema is highly redundant, poster image conversion for videos results in weirdly green images for every one of my AV1 source files, and the paginated timeline view could use just a few more navigation options and customizability. Other seemingly missing features like posting and search are handled by third-party clients like the very admirable Pinafore. And except for the first issue, these are all relatively minor, and I might even fix them myself one day. That's how you get new contributors to your free software project.
And just in case you ever want to import a Twitter archive onto a GoToSocial instance, here is the no-nonsense importer I used:
So if you've got an account on Misskey, Mastodon, or another ActivityPub server, please follow @rec98@nmlgc.net. I'll keep posting everything to both Twitter and Bluesky for the time being, but will no longer advertise either of them. If they ever go down, I'll make no attempt at restoring them.
And that was 2025! It surely brought lots of words, breaking even last year's record by an additional 37% of blog post content. 😮 Here's to 2026 bringing more of the actual reverse-engineering we've been sorely lacking with all the modding and porting projects over the past few years. And with at least four TH03 gameplay pushes queued up, things are already looking quite promising…
Next up: Enemies! Formation scripts! Fireballs! Explosions! Combos, or at least the first part of them! And a slightly more common glitch that players have been wondering about for many years…
As you can already tell by this table of contents, this "initial" cleanup work was considerably larger in scope than its counterpart for 📝 the first TH01 Anniversary Edition release. Even that already took unexpectedly long 2½ years ago, and now imagine doing that across four games simultaneously while keeping all the little required inconsistencies in place. Then you'll get why this has taken over four months…
With an overall goal of "general portability", it's very tempting to escalate the scope towards covering everything in these menu and cutscene binaries. So I had to draw at least some boundaries:
No big feature work in TH01
No work on any graphics formats besides PI
Graphical text will still get rendered directly to VRAM
Even then, this was way premature. Not only because we still need to maintain the memory layout of TH02's and TH03's MAIN.EXE, but also because of all the undecompilable ASM code in all four games that blocks certain architectural simplifications.
The biggest problem, however: I haven't quite decided on how to use static libraries within my build environment yet. Since Turbo C++ 4.0J's linker just blindly includes every explicitly named object file without eliminating dead code, static libraries are essential for reducing bloat by providing a layer of optional files to be included on demand. However, the Windows and DOS versions of TLIB are easily confused, TLIB's usual paradigm of mutating existing library files goes against Tup's explicit dependency graph, and should we really depend on an ancient proprietary tool for a job that I could reimplement in a few hundred lines? Famous last words, I know. But since I 📝 didn't want to do any dedicated build system work this year, I also didn't want to sort out these questions in a 12th or even 13th push. Leaving the build environment woefully ill-equipped for the complexity of this task was probably a mistake; while the resulting workaround of feature bundles does the job, it's very silly and hopefully won't last very long. I'm definitely going to spend some time sorting out the static library situation before I ever attempt something like this again. Or at some general point before we hit the overall 100% finalization mark, because we've still got that long-awaited librarization of ZUN's master.lib fork ahead of us.
Let's get to it then, starting with the feature that will remove lag in menus by removing PC-98-specific page-flipping and EGC code:
Retaining menu backgrounds in conventional RAM
At first, this seems to be no problem. We just swap out master.lib's .PI functions with our forked PiLoad and our generic blitter, and make sure to keep the images allocated. master.lib's graph_pi_load_pack() has always loaded .PI images into one big contiguous buffer in conventional RAM, so this shouldn't negatively affect the heap layout. If anything, we'd be saving memory by 📝 not allocating these extra two rows, right?
Unfortunately, it's that second goal that would turn out to be a massive problem. 📝 The end of part 1 already hinted at how the majority of menu backgrounds are only rendered to VRAM a single time before ZUN immediately frees them from memory. These cases are so common that I defined a macro for them:
At the current state of decompilation, this macro is used 16 times across TH02-TH05, and it will appear an additional 12 times by the time decompilation is done.
In these cases, the games only need that single 128 KiB block temporarily, and then get to reuse that memory for other, more dynamic graphics. Consequently, ZUN probably dimensioned the master.lib heap sizes for TH02-TH05 to leave ample headroom with this fact in mind. I wasn't so sure about deliberately limiting the amount of heap memory 📝 in late 2021 when I fixed the one out-of-memory landmine that remained in TH04, but I've begun to appreciate these memory limits quite a lot as the scope of my research has deepened. Specify the right amount of bytes, perform the single allocation from the DOS heap at startup, and if that allocation succeeds, you've removed an entire class of out-of-memory bugs from consideration. Sure, modders might prefer mem_assign_all() for simplicity during development, but it does make sense to return to a static limit when shipping. For once, ZUN was right, and there is no excuse.
On the surface, this macro is equivalent to PiLoad's original direct-to-VRAM approach. And indeed, we can replace this code with a call to PiLoad's original code path in the few cases where we just want to show a static image without unblitting any of its regions later on, removing even the requirement for that temporary 128 KiB block in the process. But in the majority of cases, we do need these images in RAM, and ZUN's original heap sizes simply weren't intended for that.
But how much of a problem are ZUN's limits in practice? Well, there's at least one instance where retaining all images would require significantly more memory than ZUN anticipated. TH02's OP.EXE requests a 256,000-byte master.lib heap, but then wants to do this:
There's really no reason against just increasing the size of the master.lib heap to incorporate all of these four images at the same time. But how much additional memory do we actually need? Obviously, these four images are not the only allocations on the heap, which also needs to fit at least the following buffers:
The 8,192-byte snapshot of the gaiji loaded before the game, which are restored upon quitting the game… as well as when switching between game binaries. Yup – since the master.lib heap is not retained across binary switches, every one of the three binaries restores the system's previous gaiji before being switched out, before the new binary reads those same gaiji from the character generator back onto the master.lib heap. It would have been much smarter to keep them in a separate persistent allocation on the DOS heap instead.
The 16,384-byte .PI load buffer. This one has to be allocated before we allocate any .PI, so it will necessarily fragment away that memory.
The 2,560 bytes of High Score menu sprites loaded from op_h.bft. We might have shown the High Score menu if we came from a demo, and ZUN wants to keep these sprites loaded across the lifetime of the process.
The 9,216-byte super_buffer that master.lib pre-allocates upon loading the first BFNT sprite, just in case you later want to call super_convert_tiny().
We can bypass the super_buffer allocation through a dumb trick. But the single worst aspect hides between all these individual allocations:
Fighting heap fragmentation
Let's look at TH05's OP.EXE, whose heap limit of 336,000 bytes is much more lenient. This limit should be more than enough to fit the new additional 128,000-byte buffer for the background image in addition to the original heap contents on every menu screen, and we can indeed enter the main menu without any issue. But then, we're still greeted with an out-of-memory crash after entering and leaving the Music Room? Let's take a look into the master.lib heap, with the retained background images we'd like to have:
Something gradually shreds our heap into tiny pieces that ultimately prevent us from allocating the main menu background image a second time. We can surely blame this on ZUN's suboptimal order of load calls that doesn't prioritize larger images over smaller ones, or on the 16 KiB .PI load buffer that we maybe should have allocated statically. But the biggest hidden offender turns out to be… master.lib's packfile implementation?!
Yup. Every time you load a file out of an archive, master.lib heap-allocates a 31-byte state structure and a file read buffer that master.lib originally dimensioned at 520 bytes. TH03's MAIN.EXE and MAINL.EXE then increased the size of that buffer to 4,104 bytes, before TH04 and TH05 went up to 8,200 bytes.
Ultimately though, it's not the size that's the problem here, but the fact that we repeatedly allocate any memory that could have been allocated once when setting up the INT 21h handlers. You'd think that master.lib went for dynamic allocations in order to support the fact that the INT 21h file API lets you open multiple file handles simultaneously, which would point to different RLE-compressed files within the archive. But no, master.lib doesn't even support this case, and even incorrectly returns FileNotFound if you attempt to open a second file from an archive before closing the first one!
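To make the effect of such transient per-open buffers more tangible, here is a minimal simulation in portable C++. Everything in it is an assumption for illustration: `ToyHeap` is a generic first-fit allocator and not master.lib's, and the 80,000-byte and 4,000-byte long-lived allocations are made-up sizes. Only the 336,000-byte heap, the 128,000-byte background, and the 31 + 8,200 bytes per opened archive file are taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy first-fit heap with coalescing. Purely illustrative; this is not how
// master.lib's allocator is implemented.
struct ToyHeap {
    struct Block { std::size_t offset, size; bool used; };
    std::vector<Block> blocks; // sorted by offset, always covers the whole heap

    explicit ToyHeap(std::size_t capacity) {
        blocks.push_back(Block{0, capacity, false});
    }

    // Returns the offset of the new allocation, or -1 if no single free
    // block is large enough.
    long alloc(std::size_t size) {
        assert(size > 0);
        for (std::size_t i = 0; i < blocks.size(); i++) {
            if (blocks[i].used || (blocks[i].size < size)) {
                continue;
            }
            const std::size_t offset = blocks[i].offset;
            const std::size_t rest = (blocks[i].size - size);
            blocks[i].size = size;
            blocks[i].used = true;
            if (rest > 0) {
                blocks.insert(
                    (blocks.begin() + i + 1), Block{(offset + size), rest, false}
                );
            }
            return static_cast<long>(offset);
        }
        return -1;
    }

    // Frees the block at [offset] and merges it with free neighbors.
    void release(long offset) {
        for (std::size_t i = 0; i < blocks.size(); i++) {
            if (!blocks[i].used || (static_cast<long>(blocks[i].offset) != offset)) {
                continue;
            }
            blocks[i].used = false;
            if (((i + 1) < blocks.size()) && !blocks[i + 1].used) {
                blocks[i].size += blocks[i + 1].size;
                blocks.erase(blocks.begin() + i + 1);
            }
            if ((i > 0) && !blocks[i - 1].used) {
                blocks[i - 1].size += blocks[i].size;
                blocks.erase(blocks.begin() + i);
            }
            return;
        }
    }
};

// Returns whether the 128,000-byte background can be allocated a second time
// after a hypothetical sequence of menu loads.
bool background_reload_succeeds(bool per_open_buffer) {
    ToyHeap heap(336000);
    const std::size_t packfile = (31 + 8200); // state structure + read buffer

    long buf = (per_open_buffer ? heap.alloc(packfile) : -1);
    const long bg = heap.alloc(128000); // decoded while the file is open
    if (buf != -1) heap.release(buf);   // closing the file leaves a hole
    heap.alloc(80000);                  // hypothetical long-lived sprites
    heap.release(bg);                   // leaving the screen

    buf = (per_open_buffer ? heap.alloc(packfile) : -1);
    heap.alloc(4000);                   // hypothetical small long-lived data
    if (buf != -1) heap.release(buf);
    return (heap.alloc(128000) != -1);  // re-entering the first screen
}
```

In this particular sequence, `background_reload_succeeds(false)` finds a large enough block while `background_reload_succeeds(true)` does not, even though the total amount of free memory is the same 252,000 bytes in both cases.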
After I identified this whole issue, I immediately wanted to replace master.lib's packfile code with the much saner and more explicit C++ implementation from TH01. Sooner or later, we'll have to do this anyway because we can't just hook file syscalls on other operating systems the way we can hook them on DOS.
However, TH01's implementation would quickly turn out to have its own share of heap fragmentation issues. Every time the game loads an RLE-compressed file from 東方靈異.伝, the archived file is completely decompressed into a newly allocated temporary buffer, from where the game then copies out parts into the actual game structures. The resulting fragmentation is at least easily fixable though, and that's what the TH01 part of the very first push assigned to part 1 went to. Switching to a zero-copy architecture basically only required persisting the RLE state and brought a significant improvement: 15,776 bytes of heap memory during Stage 1 freed up by that switch alone, as reported by the coreleft() output seen in debug mode?! That much for just removing temporary allocations that the game was freeing anyway?
Let's check the Borland C++ DOS Reference for how this value is actually calculated. Turns out that it is simply intended to be a measure of unused RAM, and sure enough:
In the large data models, coreleft returns the amount of memory between the highest allocated block and the end of memory.
That's a reasonably meaningful measurement that can be determined in constant time, compared to the 𝑂(𝑛) operation of finding the true total size of available heap memory by walking over every node.
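The difference between the two measurements can be sketched with a toy heap model. This is only an illustration in portable C++ with made-up block sizes, not Borland's actual heap structure; in a real heap, the space above the highest allocated block would come from a tracked top-of-heap pointer rather than a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One heap block in address order. Hypothetical model, not Borland's layout.
struct Block { std::size_t size; bool used; };

// The "true" amount of free memory: requires visiting every block.
std::size_t total_free(const std::vector<Block>& heap) {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < heap.size(); i++) {
        assert(heap[i].size > 0);
        if (!heap[i].used) {
            sum += heap[i].size;
        }
    }
    return sum;
}

// What a coreleft()-style measurement reports in the large data models:
// only the free space between the highest used block and the end of memory.
std::size_t free_above_highest_used(const std::vector<Block>& heap) {
    std::size_t sum = 0;
    for (std::size_t i = heap.size(); i-- > 0; ) {
        if (heap[i].used) {
            break;
        }
        sum += heap[i].size;
    }
    return sum;
}

// A freed temporary buffer left a hole below a long-lived allocation.
std::vector<Block> layout_with_temp_hole() {
    return {{1000, false}, {2000, true}, {7000, false}};
}

// Same long-lived allocation, but no temporary buffer was ever allocated.
std::vector<Block> layout_without_temp() {
    return {{2000, true}, {8000, false}};
}
```

Both layouts have 8,000 bytes of free memory in total, but the coreleft()-style figure is 7,000 for the first and 8,000 for the second. That is how removing temporary allocations can raise the reported number even though those allocations were always freed again.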
In the end though, rolling out this C++ implementation to the other four games was way premature and would have pushed this delivery way above 12 pushes. After all, both ZUN's and master.lib's code is still full of INT 21h file syscalls that would all need to be replaced. Conditionally, even, given the two binaries that are not yet position-independent…
Fortunately, .PI loading itself is just as much of an issue and can be worked around in the much simpler way I already spoiled when explaining 📝 the API changes I made to PiLoad: We simply hold on to the menu background pixel buffer for as long as possible. Ideally, we only allocate these 128 KiB once, decode every new menu background into that same buffer, and only explicitly free it when we really need to. That's why the platform layer logic requires full control over .PI buffer allocation.
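A minimal sketch of this "allocate once, decode into the same buffer" idea, in portable C++. `MenuBackground` and its members are hypothetical names and not the actual platform layer API, `malloc()` stands in for the master.lib heap, and the `memset()` call is a placeholder for real .PI decoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical owner of the single retained menu background buffer.
struct MenuBackground {
    // 640×400 pixels at 4 bits per pixel.
    static const std::size_t SIZE = ((640 * 400) / 2);

    unsigned char* pixels;
    unsigned int allocations; // how often we actually had to allocate

    MenuBackground() : pixels(nullptr), allocations(0) {}
    MenuBackground(const MenuBackground&) = delete;
    MenuBackground& operator=(const MenuBackground&) = delete;
    ~MenuBackground() { release(); }

    // "Decodes" a new image into the retained buffer. Only allocates if we
    // don't hold a buffer yet.
    bool load(unsigned char fill) {
        if (!pixels) {
            pixels = static_cast<unsigned char*>(std::malloc(SIZE));
            if (!pixels) {
                return false;
            }
            allocations++;
        }
        std::memset(pixels, fill, SIZE); // placeholder for .PI decoding
        return true;
    }

    // Explicit free for the screens that really need the memory.
    void release() {
        std::free(pixels);
        pixels = nullptr;
    }

    unsigned char first_byte() const {
        assert(pixels != nullptr);
        return pixels[0];
    }
};
```

Loading a second image through the same object overwrites the pixels in place, so the heap never sees a free/allocate pair between two menu screens unless `release()` is called in between.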
This was enough to keep the required amount of additional heap memory to a more than acceptable level:
None of the MAINE.EXE/MAINL.EXE binaries needed a larger memory limit.
TH04's OP.EXE also got to keep its original 336,000-byte heap.
As did TH03's OP.EXE. That one seems particularly surprising if you remember 📝 its 255,216-byte character selection screen. But since ZUN only preloads 186,096 bytes before or during the main menu, we can nicely fit the title screen image into the original 352,000-byte heap.
TH05's OP.EXE needed a slight increase by 4,768 bytes to a new limit of 340,768 bytes, for the mere sake of accommodating the heap fragmentation caused by entering and leaving the Music Room and then entering its character selection screen. An increase at that level would have been fine even if it wasn't temporary, as we're still 52,832 bytes short of reaching the amount of memory required by MAIN.EXE:
| (original) | Static | Heap | Total |
| --- | --- | --- | --- |
| OP.EXE | 88,064 | 336,000 | 424,064 |
| MAIN.EXE | 190,464 | 291,200 | 481,664 |
That only left TH02's OP.EXE, which did require a whopping additional 80,416 bytes up to a very similar new limit of 336,416 bytes. But again, an increase at that level is also fine for this game when compared against its MAIN.EXE, which would allow OP.EXE to go up to 383,232 bytes of heap without increasing the original memory requirements:
| (original) | Static | Heap | Total |
| --- | --- | --- | --- |
| OP.EXE | 69,632 | 256,000 | 325,632 |
| MAIN.EXE | 164,864 | 288,000 | 452,864 |
Not to mention that our final merged DEBLOAT.EXE will only require 63,488 bytes of static memory, and that's despite TH02 being the one game that received the quickest and sloppiest merge job with a lot of corners cut for budget reasons.
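The arithmetic behind these comparisons is simple enough to write down. The helper names below are made up for this sketch; the byte counts are the ones from the two tables and the new heap limits mentioned in the text.

```cpp
#include <cassert>

// Static and heap memory of one binary, in bytes.
struct Binary {
    long static_bytes;
    long heap_bytes;
};

long total(const Binary& b) {
    assert((b.static_bytes >= 0) && (b.heap_bytes >= 0));
    return (b.static_bytes + b.heap_bytes);
}

// How many bytes [op] stays below the total memory requirement of [main_exe].
long headroom(const Binary& op, const Binary& main_exe) {
    return (total(main_exe) - total(op));
}

// Largest heap [op] could request without exceeding [main_exe]'s total.
long max_heap_within(const Binary& op, const Binary& main_exe) {
    return (total(main_exe) - op.static_bytes);
}

// TH05 with the increased OP.EXE heap limit.
const Binary TH05_OP_NEW = { 88064, 340768 };
const Binary TH05_MAIN = { 190464, 291200 };

// TH02 with the increased OP.EXE heap limit.
const Binary TH02_OP_NEW = { 69632, 336416 };
const Binary TH02_MAIN = { 164864, 288000 };
```

`headroom(TH05_OP_NEW, TH05_MAIN)` yields the 52,832 bytes quoted for TH05, and `max_heap_within(TH02_OP_NEW, TH02_MAIN)` yields the 383,232-byte ceiling quoted for TH02.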
And after a few rewritten function calls, we've indeed removed every single EGC-powered inter-page copy from all menus and cutscenes of TH02-TH05! On to the next goal…
Resolving screen tearing landmines
…which requires individual solutions for every case that merely follow a common pattern. If things are already done close to a VSync wait loop and just in a slightly wrong order, the solution is easy and we just have to shift around a few function calls.
But what can we do if ZUN mutates visible pixels at some place far away from the last VSync wait loop? After all, a lot of these landmines result from confusing the current CRT beam position across multiple functions. Often, it's impossible to see at a glance where these menu-specific subfunctions are called within a frame without tracing execution back to the last VSync delay loop at the call site. For starters, it would be nice to clearly formalize that a specific section of code must be run in VBLANK.
master.lib's vsync_Proc function pointer already gets us most of the way there. If vsync_Proc is not a nullptr, master.lib's VSync subsystem automatically calls the function it points to shortly after the VSync interrupt fires, and our task function would then set vsync_Proc back to a nullptr to ensure the intended one-shot behavior.
However, this approach can at best defer a task to the next VBLANK interval, which might leave us one frame behind the original game and hurt our frame-perfection goals. What we actually want is a conditional approach for timing-sensitive tasks, as a common operation that only requires a single line of code:
If we're within VBLANK, run the task right now.
If we aren't, set up a VSync proc that runs the task immediately after the next VSync and then removes the proc.
Now we're only missing that crucial one bit of information, which is delivered by Bit 5 of the graphics GDC's status register at I/O port 0xA0. In fact, ZUN uses the same bit in all hand-written VSync wait code throughout TH01 and in the bouncing-ball ZUN Soft logo:
void vsync_wait_via_gdc_polling(void)
{
	// Bit 5 of the graphics GDC register indicates VBLANK. Wait until this bit
	// is set.
	while((inportb(0xA0) & 0x20) == 0) {
	}
	// Once Bit 5 is no longer set, the CRT has started drawing the next frame.
	// I have no idea why you would ever want to throw away all your precious
	// vertical retrace time, but ZUN does this all throughout TH01.
	while((inportb(0xA0) & 0x20) != 0) {
	}
}
Of course, this only solves the problem in theory, as the tasks themselves don't come with any real-time guarantees. It's entirely possible for the resulting vblank_run() function to get called near the end of VBLANK, start the task immediately, and return long after the CRT beam has started drawing again. Heck, if the system is slow enough, the task might not even complete within the VBLANK interval if we run it immediately after VSync. But this is a much more complex problem to solve, requiring upfront measurements of both the VBLANK interval and the execution time for each potential task, which can then be factored into the run-now-or-defer decision. We definitely don't need to go there as long as we're mainly targeting emulated 66 MHz systems.
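The run-now-or-defer rule can be sketched as follows. This is a simulation in portable C++ and not the actual implementation: the hardware check `(inportb(0xA0) & 0x20)` is replaced with a plain variable, `sim_vsync_proc` merely plays the role of master.lib's vsync_Proc, and all `sim_`-prefixed names as well as `vblank_run_sketch()` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

typedef void (*task_t)(void);

// Hypothetical stand-ins for hardware and master.lib state.
static bool sim_in_vblank = false;   // would be ((inportb(0xA0) & 0x20) != 0)
static task_t sim_vsync_proc = NULL; // plays the role of vsync_Proc
static task_t deferred_task = NULL;

// One-shot trampoline: runs the deferred task and then removes itself.
static void run_deferred_once(void) {
    const task_t task = deferred_task;
    deferred_task = NULL;
    sim_vsync_proc = NULL;
    if (task) {
        task();
    }
}

// Runs [task] right now if we're within VBLANK, or immediately after the
// next VSync otherwise.
void vblank_run_sketch(task_t task) {
    assert(task != NULL);
    if (sim_in_vblank) {
        task();
        return;
    }
    deferred_task = task;
    sim_vsync_proc = run_deferred_once;
}

// Simulates the VSync interrupt: VBLANK begins and the proc gets called.
void sim_vsync_interrupt(void) {
    sim_in_vblank = true;
    if (sim_vsync_proc) {
        sim_vsync_proc();
    }
}

// Example task that counts how often the "hardware palette" was updated.
static int palette_updates = 0;
void fake_palette_show(void) {
    palette_updates++;
}
```

Calling `vblank_run_sketch(fake_palette_show)` outside of VBLANK leaves `palette_updates` untouched until `sim_vsync_interrupt()` fires, and a second interrupt does not run the task again. Note that this sketch only holds a single deferred task at a time; a later call would replace an earlier one that hasn't run yet.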
In easy cases, vblank_run() can then resolve screen tearing landmines completely by itself. Towards the end of PC-98 Touhou, ZUN's menus made more and more use of master.lib's blocking palette fading functions, which delay themselves to the next VSync signal and thus avoid any tearing issues. Hence, TH04's and TH05's screen tearing landmines are limited to the very few sudden palette changes that remained in these games:
void return_from_other_screen_to_main_menu(void)
{
	// Loads the .PI image into our persistent menu image background buffer,
	// and overwrites master.lib's 8-bit palette. Takes a few frames and
	// probably won't return during VBLANK.
	GrpSurface_LoadPI(bgimage, &Palettes, "op1.pi");
	graph_accesspage(0);
	bgimage.write(0, 0); // Planar blit
	PaletteTone = 100; // Use original brightness in palette_show()
-	// ZUN landmine: Updating the hardware palette right now will most likely
-	// cause screen tearing.
-	palette_show();
+	vblank_run(palette_show);
	[…]
}
The fixes for the landmines in TH03 and TH02, however, require much more thought and care to stay as close to ZUN's defined logical frame sequence as possible. TH03's character selection screen, which prompted this whole subproject in the first place, houses one of the harder groups of landmines:
Recorded on DOSBox-X with 375,000 cycles, since exact machine specifications are not important to demonstrate these landmines.
💣 Landmine #1 is caused by loading the Selection BGM and the character name sprites before clearing VRAM, without a frame delay in between. The fix is obvious.
💣 Landmine #3 looks conceptually identical to landmine #2, being another mid-screen-frame page flip caused by vsync_Count1 being ≥3 by the time execution reaches the frame-rate-dropping delay loop. However, this one is caused by the game's response to a selection-confirming input and therefore needs a dedicated fix. Here's what's going on:
The game renders the next frame to the invisible VRAM page, as it usually does. We won't see this frame for a while.
The game checks for input and sees the Shot key. It then immediately runs the palette flash effect, using master.lib's blocking palette_white_in() while still displaying the previously rendered frame.
palette_white_in() itself avoids screen tearing issues by running its own VSync busy-waiting loops using the (inportb(0xA0) & 0x20) check, without mutating vsync_Count1. This is why the game immediately page-flips once execution is back to the main loop.
However, this is not the cause of this landmine. palette_white_in() also stops within VBLANK, so you'd also expect the immediate page flip to not cause any screen tearing.
Except that ZUN then (re-)loads the .CDG portrait for the selected or automatically assigned palette variant of the confirmed character, immediately after palette_white_in(), and only then drops back into the main loop, without any further delay. Hence, we have file I/O in our logical frame, and thus can't guarantee anything.
On the hypothetical infinitely fast PC-98, the .CDG load call completes instantly and turns this into a non-issue. On real systems, however, we would need some way of hiding this load call to stick to ZUN's defined logical frames:
Leaving it after palette_white_in() is completely wrong because it messes with the defined sequence of logical frames. Even just maintaining the original number of frames requires inserting an additional delay frame and compensating for that by cutting that one frame from the next iteration of the loop. This was my original solution, and the realization of how wrong it was certainly delayed this blog post by about a day…
Moving it in front of palette_white_in() might work since the effect starts with a VSync wait, but it might also insert an additional screen frame of delay. Keep in mind that we're still on the same logical frame that rendered the very expensive curve effect.
That only leaves one answer: Running both the white-in effect and the .CDG load concurrently. 💡 Using vsync_Proc, we can implement a non-blocking version of palette_white_in() that runs one iteration of its palette-manipulating loop during VBLANK. Meanwhile, the "main thread" gets 16 frames to load a single 22.5 KiB character portrait and then simply waits for the white-in effect to complete. And since our VSync proc also always signals completion during VBLANK, we then get to immediately page-flip and retain ZUN's intended 3-frame timing. With this solution, we don't even have to optimize away the .CDG load in the usual case where the game just reloads the character's regular palette variant.
💣 Landmine #4 is the palette tearing issue that got an entire section in the post from last year. After moving the palette-mutating branch from before VSync to immediately after VSync, we also have to adjust the calculation of the palette brightness value to match ZUN's original values.
Very finicky work, where every single branch has the potential to introduce an off-by-one-frame error, and vblank_run() doesn't help at all.
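The concurrent solution for landmine #3 boils down to turning the blocking fade loop into a small state machine that advances by one step per VSync. The following is a simulation in portable C++ under a few assumptions: `WhiteInTask` and `run_concurrently()` are made-up names, the tone range of 200 (white) down to 100 (original brightness) follows the PaletteTone convention, and the linear tone formula is a placeholder rather than master.lib's exact stepping.

```cpp
#include <cassert>

// Hypothetical non-blocking white-in effect, stepped once per VSync.
struct WhiteInTask {
    int frames_total;
    int frames_done;
    int tone; // 200 = fully white, 100 = original brightness

    explicit WhiteInTask(int frames)
        : frames_total(frames), frames_done(0), tone(200) {
        assert(frames > 0);
    }

    bool done() const {
        return (frames_done >= frames_total);
    }

    // Would be called from the VSync proc, and thus within VBLANK.
    void on_vsync() {
        if (done()) {
            return;
        }
        frames_done++;
        tone = (200 - ((100 * frames_done) / frames_total));
    }
};

// Simulates the "main thread" loading a file for [load_frames] frames while
// the fade advances on every VSync. Returns the number of elapsed VSyncs
// until both the load and the fade have completed.
int run_concurrently(WhiteInTask& fade, int load_frames) {
    int vsyncs = 0;
    int load_left = load_frames;
    while (!fade.done() || (load_left > 0)) {
        if (load_left > 0) {
            load_left--; // one frame's worth of file I/O
        }
        fade.on_vsync(); // one palette step during VBLANK
        vsyncs++;
    }
    return vsyncs;
}
```

With a 16-frame fade, any load that takes up to 16 frames is hidden completely: `run_concurrently()` returns 16, compared to the 16 plus the load time that running both back to back would take.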
And then you reach TH02, which asks for way too much to happen within a single frame, in plain sight, and with no palette tricks to hide it. The screen transitions into and out of its HiScore screen are by far the worst example:
Recorded at 1.9968 × 17 = 33.9456 MHz for a change to magnify the jank that you perhaps wouldn't see at higher clock speeds.
These screen transitions exhibit no less than 6 landmines and 2 bugs:
💣 Frame 1 shows how TRAM (containing the actual menu text as gaiji) gets cleared immediately, but VRAM (containing the shadow) remains untouched as ZUN decides to load HUUHI.DAT first.
💣 While the following VRAM clear appears to produce a well-defined black frame 2, it's anything but well-defined, as the load operation only happens to conclude within VBLANK by sheer chance in this recording.
💣 Frame 4 is wild. First of all, the code still hasn't waited for a single VBLANK signal ever since entering the menu, and therefore shouldn't be writing to TRAM to begin with.
But even then, you wouldn't expect to see only the name and nothing else on the scanlines of a score record in such a partial rendering. How can TRAM operations possibly be that slow? This almost seemed as if I was missing some crucial timing-related detail about the hardware. But in the end, what we're seeing here is simply Neko Project not actually using scanline rendering for the text layer. If you write to a TRAM cell, Neko Project just marks the entire 16-pixel row to be redrawn during the next screen refresh event.
🐞 Frame 5, then, is the first well-defined frame that actually renders the way it's defined in the code. The green 東方封魔録 logo is indeed only meant to be visible from the next frame onwards. This certainly meets all criteria for a bug, but the debloated build isn't allowed to fix those. In fact, it needs a dedicated conditional branch to preserve this bug.
Once you leave the menu, you'll first have to sit through a stylistic and non-productive 20-frame delay, before… the screen switches back to the last frame rendered before the delay on frame 77?
💣 By that point, we're technically already back to the main menu, where the first thing ZUN does is to switch from double-buffering back to single-buffering with VRAM page 0 shown. If you happened to leave the menu by hitting a key on the 50% of frames where VRAM page 1 is shown, the screen will therefore flip back to the frame rendered before the 20-frame delay, and keep it visible while master.lib decodes the title screen image.
💣 This decoding process finishes after ~20.4 frames in this recording, near the middle of frame 98. Clearly, we then have to immediately switch the hardware palette to the one we just loaded. Let's completely disregard that we're probably not in VBLANK, or that the screen is still showing the last High Score menu frame…
Then, we need to get the image onto both VRAM pages. 📝 As we found out in Part 2, a low-clocked 386 is pretty much the most suboptimal system for master.lib's packed→planar conversion code, and 12 frames exactly match the performance we would expect from Neko Project at 33 MHz.
💣 But that only rendered the image to the invisible VRAM page 1. We could now temporarily show page 1 after the next VSync signal to hide the pretty much guaranteed multi-frame VRAM writes… but nah, who cares except for some researcher 28 years later. By leaving VRAM page 0 on screen, ZUN doesn't even attempt to hide the jank that is about to occur. Once again, he reaches for master.lib's graph_copy_page(), 📝 whose slowness I already talked about in Part 2. At 33 MHz, Neko Project takes 3 frames to copy one page after another, leaving us with two frames of mixed pixels. This can be even worse on real hardware: On 📝 spaztron64's K6-2-upgraded and southbridge-bottlenecked PC-9821V166 model, this copy took 100 ms. I was able to watch every single bitplane getting individually copied in the recording. Unpacking the .PI image a second time would have been faster on that machine.
🐞 Also, ZUN should have definitely cleared TRAM before the page copy instead of deferring this responsibility to the main menu rendering code. Since we then return to the main menu's VSync-timed loop and regularly wait for VSync while the scoreboard remains on screen and part of the current logical frame, this is not a landmine.
Compare this with the debloated version:
The first three landmines are fixed by running the common "set palette to black and clear TRAM" operation in VBLANK, and deferring both the palette update and the scoreboard rendering to the VBLANK interval preceding frame 5.
Everything between frame 77 and frame 113 inclusive is defined to happen on a single logical frame. Since this screen doesn't allocate its own 640×400 background, we get to keep the title screen image in memory and actually turn this logical frame into a real one. Then, we can use ZUN's defined 20-frame delay constructively:
First, we render the last frame to the other VRAM page to defuse landmine #5. Yes, render – a GRCG-accelerated 385×209-pixel flood fill followed by eight transparent 16-color 128×32 sprites is much faster than copying VRAM pages.
Then, we can unconditionally switch to showing VRAM page 1 and accessing VRAM page 0 on the next VSync, without affecting what's shown on screen.
Then, we have all the time in the world to blit the planar title screen image from memory to VRAM page 0, the only one we still need to touch.
On the VSync that precedes frame 77, we then simply 🫰 flip VRAM pages and the hardware palette to produce exactly the well-defined image that an infinitely fast PC-98 would have produced for ZUN's original code.
Then, I did that 13 more times for the other screen tearing landmines fixed in this build. And no, these new builds don't even fix every instance of this issue…
Intermission: Handling unused code on fork branches
Given that all of these improvements are taking place on the debloated branch, it's time to decide on how to handle the biggest unneeded obstacle in the way of our portability efforts, after 📝 I procrastinated this question 2½ years ago.
In Shuusou Gyoku, I've been trying to retain every single line of unused code in a dedicated directory, not least because that game has 📝 some very wild effects that should be reasonably preserved. The problem with this approach is that all this unused code quickly stopped compiling as I started to refactor the game into its current cross-platform state. For discoverability, this is still better than outright deleting the code and expecting people to read pbg's original codebase, but it's not all too practical either.
In the ReC98 codebase, we have a different situation: All the unused code doesn't just exist at some old commit that maybe won't even compile going forward, but is an integral part of the master branch. Therefore, removing this code from fork branches is not only in line with their goals, but also completely non-destructive, since its compilable form on master keeps getting maintained for a handful of building platforms.
Then again, I like the added overview and discoverability of the Shuusou Gyoku approach. So let's meet in the middle: From now on, the debloated branch will only keep unused code in the form of its declarations and some short explanatory comments, in files within the unused/ directory whose names point to the actual implementations on the master branch.
Funnily enough, unused code wasn't even the main reason why TH01's ANNIV.EXE lost 10,834 bytes between the previous and current builds. Although TH01 is the one game with by far the most unused engine code, that code only made up 3,728 bytes of that difference. The rest came from the work surrounding the zero-copy unpacker and the few portability features that already made sense to be rolled out for this game. Yes, TH01 really is that bloated.
Merging TH02-TH05's OP.EXE and MAINE.EXE/MAINL.EXE
Onto the second most exciting feature, 📝 as motivated by the blog post from May! A true single-executable build 📝 never looked that viable for TH04 and TH05 to begin with, so let's just go for the one viable partial merge that makes sense for all of the four games. With all of MAINE.EXE/MAINL.EXE being position-independent, the remaining bunch of ASM code there isn't much of an obstacle either.
And once again, this merge means that we have to resolve all 📝 binary-specific inconsistencies at once. While ZUN thankfully eliminated most of them by the end of the PC-98 era, the scorefile code remained inconsistent until the very end, 📝 as Part 3 already mentioned. Hopefully, this is the second-to-last time I have to mention these formats…
Funnily enough, all of their most noteworthy inconsistencies are found in how these formats deal with corrupted files:
The TH03 inconsistency I 📝 teased in part 3 is almost not worth mentioning. If the game ends up recreating YUME.NEM while loading the high scores for name registration after a 1CC, the clear flag is written for all difficulties, not just the one you've actually cleared. Our exact definition of observable bugs comes in doubly handy here:
To get a 1CC in the first place, you must have gone through character selection, which also (re-)creates YUME.NEM if necessary. Therefore, MAINL.EXE would only ever recreate YUME.NEM in this "1CC mode" if something outside the game deleted or tampered with the file while the game was running.
TH03 offers no benefits for a 1CC on specific difficulties, and doesn't even visually indicate this flag, unlike the three other games. 1CC'ing any difficulty is all that matters for unlocking Chiyuri and Yumemi.
With no way to observe this per-difficulty state, this is one of the rare landmines where we get total freedom for the fix. Thus, we can just do the right thing and set the clear flag for only the current difficulty, reflecting your actual achievements and paving the way for a future feature that can highlight this per-difficulty clear state in the UI.
TH04's OP.EXE simultaneously loads both Reimu's and Marisa's scores for the currently selected difficulty into two separate structures. This alone is a great source of unnecessary inconsistencies, but it gets even worse when either of the two sections is found to be corrupted during decryption. In that case, the game doesn't decrypt Marisa's section and leaves its encrypted state in the respective structure. However, the High Score viewer still assumes that both sections were decrypted. While Reimu's section will always contain either valid or recreated default data, you probably won't see that under all the garbage sprite data rendered for the still encrypted Marisa:
Corruption with random bytes will look slightly more varied than the zeroed-out example from the previous post.
The original games would recreate the full GENSOU.SCR with its default data if even just one character×difficulty-specific section of the file was found to be corrupted. The debloated build now only resets individual corrupted sections to their default state, preserving as much of the file as possible. This also went hand in hand with removing that separate Marisa score structure in TH04, giving us identical and glitchless corruption repair behavior in both games and saving me from having to mention TH04's corruption behavior in the release notes. Efficiency!
As an added consistency bonus, the debloated builds no longer fully re-encrypt GENSOU.SCR after entering a score after a cutscene. This was dumb for many reasons.
The actual merge then indeed delivers what we were hoping for: In three of the four games, the added unique code from OP.EXE and MAINE.EXE/MAINL.EXE comes in at far below the 20,512 bytes we freed by removing 📝 Borland's C++ exception handler, both in the binaries themselves and in their loaded in-memory state.
But it's TH05 where both OP.EXE's expanded Music Room and MAINE.EXE's Staff Roll and All Cast sequence add so much unique data that the initial merge ended up slightly larger than the size of the original MAINE.EXE. Getting the binary and run-time size of the new DEBLOAT.EXE below that point required every trick in the book and then some. The more critical tricks were good ideas in their own right:
Heap-allocating 📝 the scrollable verdict bitmap shown after the Staff Roll frees up 28,160 bytes of statically allocated memory. The fact that you can just have such large arrays of static data seemed like a great benefit of this binary splitting model 5 years ago, but it really doesn't hold up against just writing the two lines to allocate and free that memory from the heap. MAINE.EXE's 320,000-byte heap memory limit is more than enough to fit that bitmap in addition to all the simultaneously loaded Staff Roll sprites.
Heap-allocating 📝 TH04's and TH05's cutscene script buffer not only does the same at the smaller scale of 8,192 bytes, but also practically saves over half of that memory, as TH05's largest actual script (Reimu's Good Ending, stored in _ED10.TXT) is just 3,152 bytes. And not just that: It also removes the original 8 KiB limit on cutscene scripts in those games, allowing mods to use up to 64 KiB just like TH03.
But the rest of them definitely crossed over into silly micro-optimization territory:
The single biggest reduction came from turning the various statically allocated far pointers to hardcoded strings into near ones. ZUN used the Large memory model for every .EXE binary, where every statically initialized C pointer variable not only gets turned into this 4-byte segment+offset form, but also receives a 2-byte relocation in the MZ header that allows the DOS .EXE loader to adjust the relative segment part to the correct absolute value in conventional RAM. These relocations don't remain in memory after a process has started, but they do have quite an impact on a binary's size if it uses lots of hardcoded strings.
The correct high-level solution is to simply switch to the Medium memory model, which restricts a program to just 64 KiB of statically allocated data and reduces all data pointers to offset-only near pointers by default. Sadly, switching memory models is one of those wide-ranging architectural changes that we absolutely can not realistically do with that much undecompiled and undecompilable ASM left in the codebase:
All ZUN-written ASM code came out of the disassembler in a Large-exclusive form and would have to be manually adapted to work for the Medium model as well.
Due to all the code sharing between the games, we'd pretty much have to flip the Medium model switch for all games at the same time. A gradual transition would take even more effort.
Hence, this will only make sense at that far point in the future when we've even translated the majority of undecompilable ASM back to C++. In the meantime, we're left with manually declaring all such pointers as near. With a total of 471 pointers to hardcoded strings in the merged TH05 executable, this brought the binary size down by 1,884 bytes. 1,356 of those bytes came from the Music Room and its hardcoded track titles and BGM filenames, but we've also got 300 bytes in 📝 the All Cast sequence, 156 bytes in the main menu, and 72 bytes in the sound setup menu.
At startup, Borland's libc must correctly set up buffering for C's stdin and stdout streams. Section 7.21.3/7 of the C standard mandates how this setup must behave in case any of these streams are redirected away from the terminal, but even the "implementation-defined" terminal case must at least set up line buffering for stdin to make scanf() and similar functions behave as expected, just in case you ever want to use these functions. TH01 uses scanf() for the stage selection feature in Debug mode, but other games thankfully stay far away from C's standard I/O functions and use master.lib's text layer functions instead. Disabling this I/O setup in the same way we disable Borland's forced C++ exception handler saves 1,722 bytes in TH02-TH05.
At least C doesn't even pretend to make you not pay for things you don't use in the way C++ does. It just unconditionally throws all the trash your way…
Removing trailing whitespace from the hardcoded Music Room track titles and sound setup menu help texts saved another 862 bytes. Hex-editing translators might disapprove, but come on, we have C++ code now. If you commit and push your edits somewhere, there's at least a chance that we can keep them working into the future.
The explosion sprite structure in the ZUN Soft logo has an unused 2-byte structure field that wastes 512 statically allocated bytes in the game's data segment. That array would have been another prime candidate for heap allocation, but that would have only been feasible with a decompilation, and 📝 someone insisted on keeping this particular animation in ASM for the time being…
Removing unnecessary inlining from game startup saved 64 bytes.
Data-driving the Demo Play characters and stages saved 54 bytes.
The original MAINE.EXE contains a second copy of the スローモードでのプレイでは、スコアは記録されません ("Scores are not recorded when playing in slow mode") string because ZUN didn't use a single optimal set of compiler flags for the entire game. Removing that second copy gives us our final 51 bytes.
Alright, another idealistic bonus goal reached! That means we're only missing a single aspect to reach feature parity with the debloated TH01 build:
Replicating TH02-TH05's GAME.BAT in C++
In TH03, this is slightly more involved. We not only need to launch PMD using this technique, but also apply it to the INTvector set program and SPRITE16. 📝 You know the way this goes:
Unfortunately though, the fixed position of all these TSRs would still prevent the game allocation from being replaced with a binary that asks for more memory than the one this block was initially allocated for. In TH01, this would have been a minor issue because it only applied to hot-reloading the single DEBLOAT.EXE or ANNIV.EXE that contains all game code. For the other four games, however, we still keep the larger MAIN.EXE as a separate binary, and most likely will do so for the foreseeable future. And we're surely not getting into the business of moving already allocated TSRs…
So we're back to the technique from two years ago after all. Let's precalculate the size of each TSR, push that TSR to the top of conventional RAM by temporarily claiming all free memory minus its expected size, and then we get…
…a ZUN.COM spawn failure from DOS as we try to start the ZUNINIT sub-binary.
Yup. Thanks to ZUN's fantastic idea of bundling these small utility tools and TSRs into a single binary that's larger than each individual TSR, we can't just reuse the strategy that worked for TH01. DOS must load the entirety of ZUN.COM into conventional RAM before the bundling code gets a chance to shift the selected sub-binary to the top of the program's memory block and then reduce the size of that block.
So how are we going to solve this?
We could ship the individual small binaries bundled in ZUN.COM. But that would defeat the whole point of reducing clutter in the game directory, being even worse than the batch file we're trying to eliminate.
We could reserve the entire required size of ZUN.COM instead of just the size we expect for each TSR. But that would leave the difference between ZUN.COM and the TSR as an unallocated block we can't do anything with, fragmenting the DOS heap as a result:
But if we can't get rid of ZUN.COM's high load-time memory requirements, how about using that memory more productively? Is there a way we could maybe spawn the other TSRs into the hole left by ZUN.COM after it went resident?
Let's take a step back from individual TSRs and instead look at the full picture of spawning a bundle of TSRs in a defined order. First, we determine both the binary size (file size of the .COM binary + Program Segment Prefix + 256 bytes of stack) and the resident size (the size of its memory block after it goes resident) of each TSR. With these metrics, we can calculate a minimum and resident size for the full bundle by simulating the TSR spawns in order:
uint32_t bundle_size_min = 0;
uint32_t bundle_size_resident = 0;
for(const auto& tsr : tsrs) {
	// Since DOS has freed all excess binary memory before we get to spawn a new TSR, the new
	// one will end up next to the previous resident allocations. We only need to consider
	// the previous minimum size because it might be larger than the one we calculate here.
	bundle_size_min = std::max((bundle_size_resident + tsr.size_binary), bundle_size_min);
	bundle_size_resident += tsr.size_resident;
}
Let's step through the bundle construction for TH03:
| TSR | Binary | Resident | Bundle minimum | Bundle resident | Naive |
|---|---|---|---|---|---|
| ZUNINIT (ZUN.COM) | 23,276 | 1,056 | 23,276 | 1,056 | 23,276 |
| SPRITE16 (ZUN.COM) | 23,276 | 36,528 | 24,332 | 37,584 | 59,804 |
| PMD86.COM | 29,295 | 30,144 | 66,879 | 67,728 | 89,948 |
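To double-check the table, here is a minimal, self-contained sketch of the same loop, fed with the TH03 numbers from above. The `TSR` and `BundleSizes` types and all function names are made up for this illustration. The "Naive" column is not defined in the text, so the sketch assumes that it reserves the larger of the binary and resident size for each TSR, which happens to reproduce all three values in that column.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

struct TSR {
	const char* name;
	uint32_t size_binary;   // .COM file size + PSP + 256 bytes of stack
	uint32_t size_resident; // Size of the memory block after going resident
};

struct BundleSizes {
	uint32_t minimum;  // Memory needed while the bundle is being spawned
	uint32_t resident; // Memory that stays allocated afterward
	uint32_t naive;    // ASSUMPTION: sum of max(binary, resident) per TSR
};

template <std::size_t N>
constexpr BundleSizes bundle_sizes(const std::array<TSR, N>& tsrs)
{
	BundleSizes ret = { 0, 0, 0 };
	for(std::size_t i = 0; i < N; i++) {
		const TSR& tsr = tsrs[i];
		ret.minimum = std::max((ret.resident + tsr.size_binary), ret.minimum);
		ret.resident += tsr.size_resident;
		ret.naive += std::max(tsr.size_binary, tsr.size_resident);
	}
	return ret;
}

// Size of the gap to leave at the top of conventional RAM.
constexpr uint32_t bundle_gap(const BundleSizes& sizes)
{
	return std::max(sizes.minimum, sizes.resident);
}

// TH03's TSRs in spawn order, with the sizes from the table.
constexpr std::array<TSR, 3> TH03_TSRS = {{
	{ "ZUNINIT (ZUN.COM)", 23276, 1056 },
	{ "SPRITE16 (ZUN.COM)", 23276, 36528 },
	{ "PMD86.COM", 29295, 30144 },
}};
```

Since everything is `constexpr`, the final table row can be verified at compile time. The difference between `naive` and `bundle_gap()` is the amount of memory saved by spawning the TSRs as a bundle.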
Then, we only need to resize our main memory block a single time to leave a gap at the top of conventional RAM whose size matches the larger of the minimum or resident bundle sizes. If we then spawn the TSRs into this gap, we indeed save 22,220 bytes over the naive approach! Let's visualize the resulting memory layout with TH02 because there's a nice detail with MMD and PMD:
However, there's one crucial detail in all of this that would prove to be more complicated:
Calculating correct resident sizes
In TH01, this was no big deal. MDRV98 was the only TSR we had to care about, and there was no reason not to just replicate its simple resident size calculation within the code. After all, people would either run the version bundled with the game or the smaller previous version if they played on a real-hardware CanBe model. No one really cares about MDRV98 beyond that level; the driver is almost universally disliked for just not being PMD, which managed to attract a sizable community, documentation, and even new developments over the years. A PMD port of TH01 has been one of the most common mod requests as well.
The TSRs in later games, however, are much more flexible. We compile both ZUNINIT and SPRITE16 from source and should therefore expect people to mod them, but these two in particular might just be considered uninteresting and static enough to justify hardcoding their sizes. But this approach utterly breaks with PMD, whose chip-specific variants come in multiple versions depending on the game:
Games: TH02, TH03, TH04, TH05

| Driver | Version (release date), size |
|---|---|
| PMD.COM | 4.8l (1996-12-28), 14,336 bytes |
| PMDB2.COM (ADPCM) | 4.8l (1996-12-28), 18,496 bytes; 4.8o (1997-06-19), 18,592 bytes |
| PMD86.COM (86PCM) | 4.8l (1996-12-28), 19,904 bytes; 4.8o (1997-06-19), 19,984 bytes |
| PMDPPZ.COM (PPZ8/CanBe) | 4.8l (1996-12-28), 20,768 bytes; 4.8o (1997-06-19), 21,024 bytes |
The PMD versions that ZUN shipped with each game. The byte size refers to the in-memory TSR size without any music, voice, or effect data added on top.
In theory, nothing stops us from hardcoding these sizes for each game as well. But these physical details about specific PMD versions are even less of a property of the game. There's no reason why modders shouldn't be able to replace any of the hardware-specific driver versions with any other – and given the sizable PMD composer and arranger community, this is a much more likely kind of mod to happen. SSG-EG, anyone?
But how could we figure out the required resident size of arbitrary PMD versions without hardcoding anything? From the outside, we can only really know for sure by running the driver and seeing how much memory it keeps resident…
…so that's exactly what we need to do. The merged binaries spawn each driver three times during setup – once to figure out its size, a second time to remove this test TSR, and a third time to respawn the TSR at its designated place at the top of conventional memory. And if we have such a system in place, nothing stops us from applying it to all other TSRs as well, removing the need to precalculate or hardcode any size… well, except for SPRITE16, which still needs a hack to factor in its extra two blocks on the DOS heap. In TH03, these 2×3 additional processes do slow down startup by about 6 frames on our target 66 MHz Neko Project configuration when compared to the batch file, which should still be tolerable relative to the .PI load times we removed by switching to PiLoad.
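The measuring part of this procedure could be sketched as follows. This is only an illustration of the idea and not the code from the merged binaries: `TSRHooks` and its three callbacks are hypothetical stand-ins for whatever DOS calls query the free memory and launch or remove a driver, and the sketch simply assumes that the resident size equals the drop in free memory.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical hooks into the DOS environment. None of these names come from
// the actual codebase.
struct TSRHooks {
	std::function<uint32_t()> free_bytes; // Free conventional memory right now
	std::function<void()> spawn;          // Runs the binary until it has gone resident
	std::function<void()> remove;         // Runs the binary again with its removal option
};

// Spawns #1 and #2: Measure the resident size, then remove the test TSR again.
// Spawn #3 is left to the caller, once the gap at the top of conventional RAM
// has been sized accordingly.
uint32_t tsr_measure_resident_size(const TSRHooks& tsr)
{
	const uint32_t before = tsr.free_bytes();
	tsr.spawn();
	const uint32_t after = tsr.free_bytes();
	tsr.remove();

	// A TSR that went resident can't have increased the amount of free memory.
	assert(after <= before);
	return (before - after);
}
```

Keeping the environment behind callbacks makes the procedure testable with a fake memory counter. A TSR that allocates additional blocks outside of its own process block, as described for SPRITE16, would need extra handling on top of this.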
The whole feature has a few other nice properties as well:
Since this entire GAME.BAT replica should be optional, we need a reliable way of detecting whether we were started from GAME.BAT. Checking whether all of a game's TSRs are already resident is the obvious choice here. But then, we can even do one better and only start the specific TSRs that aren't resident by the time our merged binary is started. Of course, removing any non-ZUN.COM TSR from the bundle will invariably leave gaps in the DOS heap, but we do gain an extra bit of resilience since the game at least starts in case of a messed-up batch file.
If we do see all TSRs in memory though, we also skip TH02's and TH03's bouncing-ball ZUN Soft logo as well as TH05's gaiji upload, 📝 matching the behavior I ended up with in TH01. After all, we can't validate whether those were already run or not. If you remove the zun -g line from an edited version of TH05's GAME.BAT that launches DEBLOAT.EXE instead of OP.EXE, you'd therefore get the same gaiji- and HUD-less game that you'd get with ZUN's original binaries.
We also don't spawn TH04's and TH05's memory checks from C++ for a similar reason. Their hardcoded memory values assume that the checks are run from GAME.BAT before the game gets loaded, which would obviously cause them to fail if all menu and cutscene code is already loaded into conventional RAM. After merging that code into a single binary, there's not much of a point to such an upfront check either:
If there wasn't enough memory to launch DEBLOAT.EXE/ANNIV.EXE in the first place, you'd immediately get to know.
If the single DOS-heap-allocating call to mem_assign_dos() failed, we should probably adopt ZUN's original errors to tell you about it in detail, but the game would also refuse to start immediately. This must necessarily be one of the first function calls made by each binary.
If either of these two issues occurred for just a game's MAIN.EXE, it would be somewhat inconvenient to always go through the title screen animation and the main menu to test any new memory setup, but it wouldn't be a big deal either.
The original games did have the theoretical issue that their MAINE.EXE/MAINL.EXE could have required more memory than either OP.EXE or MAIN.EXE. Without an upfront check for the expected size of MAINE.EXE/MAINL.EXE, a lack of memory could have meant losing a run to an out-of-memory crash upon switching to MAINE.EXE/MAINL.EXE, where scores and clear flags get written to disk. In practice, none of the games actually have this issue, and merging the two binaries avoids it entirely.
These merged binaries also integrate PMDPPZ/CanBe support via the -c or --canbe option. It is quite silly how the community refers to the combination of PMDPPZ.COM and GAMECB.BAT as a CanBe patch, since this is a strict surface-level addition and doesn't modify anything. Now that my package integrates at least one of the two required parts, can we maybe stop calling it like that? You even get a nice error message in case PMDPPZ.COM is still missing from your game directory!
And then you test with the actual ZUN.COM and notice that you're still not done:
The INTvector set program sets up handlers for INT 5 and INT 6, which collide with Turbo C++ 4.0J's implementation of signal(2). If your program only consists of its main process and the TSR you launch from it, this is no problem as long as you shut down the TSR before your process. However, we want to launch DEBLOATM.EXE/ANNIVM.EXE via execl() from the same process that launched the TSR. You'd think that Borland's signal() implementation would then install an atexit() handler to restore the specific hooked interrupt vector at shutdown. But no: execl() unconditionally resets all interrupts that signal() can possibly hook to the original handlers that were captured during libc initialization, even if your program never calls signal(). Hence, execl() would not only remove ZUN's INT 5 and INT 6 handlers if they were set up by a C++-spawned ZUNINIT process, but also leak said process: ZUNINIT's -r command locates the resident process via the segment part of the system-wide INT 6 handler, which obviously no longer works after Borland overwrote that handler.
Thankfully, Borland's function pointers for the original handlers must come with public symbols to remain accessible from two different places in the standard library. Overwriting these pointers after spawning and removing the ZUNINIT TSR is therefore enough to work around this dumb issue.
Bundling all these small utility programs into ZUN.COM was apparently not enough for ZUN, and so he additionally compressed TH03's and TH04's ZUN.COM using Diet. This means that these binaries also have to first decompress themselves before they can unbundle and actually launch the requested sub-binary. Any compressed binary necessarily decompresses into a process larger than the size of its binary file, and the .COM format has no way of expressing that larger size. Dynamically resizing the program's DOS memory block at startup could work, but Diet made the much more reliable choice of turning such .COM binaries into .EXE binaries, which can declaratively request more memory. Although it certainly is questionable how these binaries retain their original .COM extension…
Thus, our TSR size calculation code also needs to support .EXE binaries. The implementation is not complicated at all; you read the MZ header and adapt the single expression for calculating the minimum size from DOSBox-X's source code. But then, we're up for a major disappointment once we see how Diet requests almost one full 64 KiB segment to fit both its compressed and decompressed payload. This doesn't matter for TH03, where SPRITE16 allocates an extra 32 KB for alpha channels that would be placed into that extra bit of memory allocated for Diet before. But TH04 doesn't have a similarly sized third TSR, which leaves us with an unsightly 34,944-byte hole at the top of the DOS heap:
| TSR | Binary | Resident | Bundle minimum | Bundle resident |
|---|---|---|---|---|
| ZUNINIT (ZUN.COM) | 13,394 | 784 | 13,394 | 784 |
| PMD86.COM | 29,383 | 30,224 | 30,167 | 31,008 |
| TSR | Binary | Resident | Bundle minimum | Bundle resident |
|---|---|---|---|---|
| ZUNINIT (ZUN.COM) | 65,968 | 784 | 65,968 | 784 |
| PMD86.COM | 29,383 | 30,224 | 65,968 | 31,008 |
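For reference, the .EXE side of the size calculation mentioned above boils down to a handful of MZ header fields. The following is a sketch of the textbook interpretation of these fields and explicitly not the expression from DOSBox-X's source code, which may round or clamp differently. The struct only contains the four relevant fields, and all names are chosen for this example.

```cpp
#include <cstdint>

// The MZ header fields that determine the load-time memory requirement.
struct MZHeader {
	uint16_t last_page_bytes;   // e_cblp: Used bytes in the last 512-byte page, 0 = full page
	uint16_t pages;             // e_cp: Number of 512-byte pages, including the header
	uint16_t header_paragraphs; // e_cparhdr: Header size in 16-byte paragraphs
	uint16_t min_alloc;         // e_minalloc: Additionally required 16-byte paragraphs
};

constexpr uint32_t PSP_SIZE = 256;

// Size of the header and load image as stored in the file.
constexpr uint32_t mz_file_image_size(const MZHeader& mz)
{
	return ((mz.last_page_bytes == 0)
		? (mz.pages * 512u)
		: (((mz.pages - 1u) * 512u) + mz.last_page_bytes)
	);
}

// Minimum number of bytes DOS has to find for the process to load:
// PSP + load image without the header + the declared extra memory.
constexpr uint32_t mz_min_memory(const MZHeader& mz)
{
	return (
		PSP_SIZE +
		(mz_file_image_size(mz) - (mz.header_paragraphs * 16u)) +
		(mz.min_alloc * 16u)
	);
}
```

The `min_alloc` field is what allows a compressed binary to declaratively request room for its decompressed payload, which a plain .COM file cannot express.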
It's this TH04 issue that raises the question of whether this whole TSR bundling solution was even worthwhile in the first place. It sure was an interesting problem to solve, but it'd be much simpler and less bloated to just integrate the INTvector set program into every binary. For TH03, we could similarly integrate all SPRITE16 functionality directly into DEBLOATM.EXE/ANNIVM.EXE and still end up with a smaller-than-original binary after removing Borland's C++ exception handler. That would leave PMD and MMD as the only TSRs we'd need to spawn from C++, and those do have good reasons to be separate from game code.
Oh well, gotta get TH03's MAIN.EXE position-independent first…
Also, the usual caveats from two years ago still apply. This whole trick of pushing TSRs to the top of conventional RAM still relies on witchcraft that may not work on certain DOS kernels. For developers, tinkerers, and people who know what they're doing, it does succeed at nicely decluttering the game directory. But for… ahem, distributors, I still recommend shipping the modified version of GAME.BAT and GAMECB.BAT in the package below to defend against any potential stability issues.
Finally, if the performance improvements aren't enough of a reason to upgrade to these new builds, how about an actual new feature? TH03's Anniversary Edition now lets you quit out of the VS Start menu via either ESC or a new menu item, without going through the Select screen. 🙌
Matching the style of the version text to the style of ☪ The Phantasmagoria of Dim. Dream on the other side seemed like the least bad option here. That outline is indeed created by rendering every line 9 times…
And with that, I'm finally done with 2025's most indulgent subproject! Let's quickly check the overall impact on the codebase:
That's almost 4,000 lines of ad-hoc PC-98-native graphics code, bloat, landmines, bloat- and landmine-documenting comments, and binary-specific inconsistencies removed from game code, in exchange for…
After the Shuusou Gyoku debacle and the many last-minute fixes that cropped up while I was writing this post, I'm not particularly confident in these builds, despite the weeks of testing that went into them. Still, we've got to start somewhere. At least for TH03, we're bound to quickly find any issues that slipped through the cracks while I'm implementing netplay into the Anniversary Edition.
Next up: The very quick round of 📝 Shuusou Gyoku maintenance and forward compatibility I announced in April, to clear out the backlog a bit. This whole series also really stretched the concept of what 11 pushes should be, so I'll charge 2 pushes for that maintenance round to compensate. In exchange, I'll also incorporate a small bit of new Windows 98 feature work, since it fits nicely with the cleanup work.
P0317
TH02 RE (Main menu overhaul) / TH03 decompilation (High score menu, part 2/2) / TH02/TH03 debloating (Initial screen tearing fixes)
P0318
TH04/TH05 decompilation (TH04 title screen animation / TH05 All Cast sequence / GENSOU.SCR, part 3/3)
💰 Funded by:
Blue Bolt, [Anonymous], Yanga, Ember2528
🏷️ Tags:
Part 3 of 📝 the 4-post series about the big 2025 PC-98 Touhou portability subproject, and we actually get to move some percentages on the front page with this one! For once, there truly isn't a lot to mention about most of these five disconnected small-feature decompilations, so let's go for more of a touhou-memories style and string together a few shorter bullet points and paragraphs. For even greater brevity, I'll also use the ZUN code issue emoji you might already know from Twitter or Bluesky: 🐞 denotes a bug, 💣 denotes a landmine, and 🎺 denotes a quirk.
This was one of those old decompilations from 2015 that I really wanted to bring up to current standards before the debloated branch would roll out the new more portable and performant blitting code. Replacing the magic-number coordinates with constants and calculations revealed 📝 the usual off-by-one text positioning bugs in the Option menu, despite ZUN still using monospaced text in this game…
As for more unique and exciting details in this screen: ZUN's defined gaiji strings contain an unused adaptation of TH01's blinking HIT KEY text. On screen, it might have looked something like this:
This string is so unused that we don't even know its intended position, though.
Finishing TH03's High Score menu
At the end of 2021, 📝 I already decompiled most of this menu, but left two functions in ASM due to push scope constraints. Originally, I thought that this menu would need a few changes to address a certain scorefile inconsistency I'll mention in Part 4, but I ended up finding a better solution. Still, we got one interesting discovery per function out of it:
If you've ever entered a score and were too lazy to type a proper name, you know that TH03 just uses the name of the player character in Romaji if you enter either nothing or AAAAAAAA. But did you know that this happens if you enter any letter 8 times?
🐞 When sorting a new score into the list, ZUN does not look at the 9th digit, i.e., the number of continues used. If you ever manage to enter a score whose most significant 8 digits match an existing entry in the current difficulty's score list, those two scores are considered equal and the new score always gets inserted below the old one. If you enter more than one such score, the list will therefore maintain the order in which the scores were entered:
In this example, I first entered 800-million scores with 0, 3, and 1 continues in exactly this order, before entering this new 2-continue score.
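The sorting bug is easy to reproduce in isolation. The sketch below is not ZUN's code: it assumes a score stored as a plain 9-digit integer whose last digit is the continue count, ignores the fixed length of the real list, and uses invented names throughout. It only demonstrates how comparing the upper 8 digits leads to the order seen in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMPTION for this sketch: a 9-digit decimal number whose last digit is
// the number of continues used.
using Score = uint32_t;

// The bug: Only the 8 most significant digits take part in the comparison.
constexpr Score sort_key(Score score)
{
	return (score / 10);
}

// Inserts [score] into a list sorted in descending order, and returns the
// 0-based place it ended up at.
std::size_t score_insert(std::vector<Score>& list, Score score)
{
	assert(score <= 999999999);
	std::size_t place = 0;
	// >= rather than >: Existing entries with an equal key stay above the new one.
	while((place < list.size()) && (sort_key(list[place]) >= sort_key(score))) {
		place++;
	}
	list.insert((list.begin() + static_cast<std::ptrdiff_t>(place)), score);
	return place;
}
```

Entering 800-million scores with 0, 3, and 1 continues, followed by one with 2 continues, leaves the list in exactly that entry order. Comparing the full 9-digit value would be one way to take the continue digit into account.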
TH04's title screen animation
This decompilation was necessary because its palette manipulation code did the very dubious thing of accessing the palette in a freed .PI slot. I don't think that the stylish effect of separately whiting in the image's black outlines is appreciated enough. And yes, that formally was the last non-RE'd tiny bit of any OP.EXE binary!
Also note that single black pixel in Reimu's gohei.
TH05's All Cast sequence
This sequence contained the last not yet decompiled instance of 📝 masked crossfading, which the debloated branch wants to replace with our single optimized implementation.
Most picture and text cues in this sequence are synced to the BGM, using PMD's AH=05h function to retrieve the current measure. And yes, that's measures, which is indeed the only time unit you get from PMD. The cues appear to be timed based on beats rather than measures, but the secret there is that ZUN simply wrote Peaceful Romancer in the internal time signature of 1/4. Just in case anyone tries to mod this BGM and starts wondering why the sequence suddenly progresses more slowly. I'll just use beats below since it's shorter.
Any cues that don't appear synced only do so because of – you guessed it – weird ZUN code issues.
🐞 But first, what happens if you run the game on a system without an FM chip? PMD does remain resident in that case, but enters a reduced-functionality mode that refuses to even process song data, leaving you with no BGM beats to sync to. Due to the various ways of setting the tempo in a .M file, it's impossible to just parse out the tempo without reimplementing the entire format, so it makes sense why ZUN just hardcoded a fixed replacement delay of 44 frames per beat. However, 44 frames translate to (44/56.423) ≈ 780 ms ≈ 76.94 BPM, which is ~1.9× slower than Peaceful Romancer's actual ~145-147 BPM.
Discoveries like these always start out as quirks until I find evidence that would promote them to bugs. And sure enough: ZUN renders this entire sequence at the halved frame rate of 28.212 FPS, that slowdown factor is suspiciously close to 2, and the code actually specifies 22 frames. This looks as if ZUN simply didn't realize that 22 frames would only translate to the slightly more correct 153.88 BPM at the native frame rate of 56.423 FPS.
This bug also applies if you deactivated BGM in the Option menu, since ZUN treats both cases identically.
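For anyone who wants to retrace the arithmetic, here is a small helper that converts a per-beat frame delay into BPM. The function name and the `frame_divisor` parameter are my own; only the 56.423 FPS frame rate and the 22-frame constant come from the text above.

```cpp
#include <cassert>

// Native PC-98 frame rate quoted above.
const double FRAMES_PER_SECOND = 56.423;

// Converts a fixed delay of [frames_per_beat] rendered frames into beats per
// minute. [frame_divisor] is 1 at the native frame rate and 2 for a sequence
// that renders at the halved frame rate.
double bpm_from_frames_per_beat(int frames_per_beat, int frame_divisor)
{
	assert((frames_per_beat > 0) && (frame_divisor > 0));
	return ((60.0 * FRAMES_PER_SECOND) / (frames_per_beat * frame_divisor));
}
```

`bpm_from_frames_per_beat(22, 2)` comes out at roughly 76.94 BPM, and `bpm_from_frames_per_beat(22, 1)` at roughly 153.88 BPM.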
🎺 The very first crossfading animation doesn't appear to be synced to any beat, though? It starts close to but not exactly on beat 5:
This one is quickly explained: ZUN does enter the first screen within 2 frames of Peaceful Romancer's first downbeat on "beat" 3, but each screen actually starts with a 34-frame fade-out of the previous screen before crossfading in the new picture. Hence, most of this apparent delay is taken up by a fade-out from black to black. The remaining 4 frames between the beat and the first visible on-screen pixels can be attributed to double-buffering at the sequence's halved frame rate.
🎺 Also, why does the crossfading animation only use two of the four mask patterns across its 16 frames? This seems like a typo in the code, but was almost certainly done on purpose to make this sequence feel more languid and relaxed. The dequirked version with all four mask patterns looks almost too hectic, especially compared to the single mask pattern that ZUN used for text.
But even after that initial screen, the first two or three text cues on later screens don't appear in sync with the BGM beats either?
As pointed out by the uneven placement of the Reimu and Rika cues.
These are Yuuka's second and third screens; the fact that each character gets its own sequence of pictures is common knowledge by now, right?
To understand this, we have to look at how ZUN defines the target BGM beat for each cue in the first place. There's only a single variable that defines the target beat for the BGM-syncing delay, and ZUN simply adds a certain number of beats to this variable before every cue. In the case of these text cues, he adds 2 beats, which matches what we can observe for the correctly synced cues in the video above. The very first text cue, however, is placed two beats after… the beat the fade-out was started on, even though we've just spent at least 56 frames on the two fading effects. This means that BGM playback will not only have already reached this beat, but will even have progressed about half a beat beyond. Thus, the game just fades in the text immediately…
💣 …except that it doesn't! All of the above was pretty quirky, but then ZUN adds a definite landmine by loading the .PI file with the picture for the next screen right after the fade-in animation. If you just look at the few lines after that load call, this seems like a productive use of an intended 2-beat delay, but we don't actually get that 2-beat delay, as I explained above. Instead, BGM playback gets to progress even further beyond the target beat, by the CPU-specific amount of frames it takes to load that next .PI image on the system the game happens to run on. I've recorded the video above by running the original game on our target Neko Project 66 MHz configuration, and got an additional 17 frames of cue drift, between frames 101 and 118 inclusive. In the end, it takes the first three text cues for the beat target to catch up with the BGM on this system, and we only return to proper syncing with Meira, where the beat target has finally moved ahead of BGM playback.
That .PI load call would have been much more appropriate before the 30-beat delay in front of the fade-out…
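The single-variable beat target can be modeled as follows. This is a hedged sketch under my own assumptions: `BeatSync` and `wait_for_next_cue()` are hypothetical names, beats are plain integers, and the fading and loading times are only represented through whatever `current_beat` the caller passes in.

```cpp
#include <cassert>

// Hypothetical model of the single beat target described above.
struct BeatSync {
	int target_beat = 0;

	// Moves the target ahead by [beats] and returns how many beats are left
	// to wait, given the beat that BGM playback has currently reached.
	int wait_for_next_cue(int beats, int current_beat)
	{
		assert(beats >= 0);
		target_beat += beats;
		if(current_beat >= target_beat) {
			return 0; // Playback is already past the target: no delay at all
		}
		return (target_beat - current_beat);
	}
};
```

If playback has drifted 3 beats past the old target, the first 2-beat cue returns immediately, and only a later cue moves the target ahead of playback again. That is the catch-up effect seen in the video.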
💣 Even worse, ZUN also loads a new image on the last screen, which defines no next image. This causes the game to unconditionally load from a null pointer, resulting in a landmine in 📝 the classic sense of the word: You can completely ignore it on PC-98 because
Real Mode just lets you read from address 0000:0000 without a segmentation fault
The far pointer to the handler for INT 0 is highly unlikely to actually point to the name of an existing file
That file is even less likely to be a valid .PI file
The game won't display that image anyway, and free its buffer once the sequence ends shortly after
But you wouldn't want to rely on null pointers being filtered by the platform layer.
Finishing TH03/TH04/TH05 scorefiles
Well, at least as far as decompilation is concerned. Cleaning up all these binary-specific inconsistencies on the debloated branch will be just as annoying as reconstructing them in the first place, and I won't even get it all the way done within these 11 pushes. TH05 made this even worse by continuing its general trend of taking TH04's slightly bloated but overall fine C++ code and needlessly rewriting it in micro-optimized and only semi-decompilable ASM. If you still believe that the master branch is a good foundation for any kind of serious work, this file should convince you otherwise.
Two more discoveries here:
If you game over and continue in-game while having a score that would qualify for the current character/difficulty list, the game automatically enters it with a CONTINUE name while staying within MAIN.EXE. Of course, this means that both games get yet another dedicated piece of code to mutate the High Score list…
🐞 And so, the TH04 variant of this code also gets its own distinct version of the 📝 C integer promotion issue that limits the technically supported score to 959 million points. In an unexpected twist though, TH05's ASM rewrite actually manages to fix this issue in a surprisingly natural way by explicitly performing the necessary calculations on 8-bit registers. On the other hand, fixing it within C++ would have still been totally possible and natural and code-simplifying…
The single biggest source of inconsistencies can be found in the code that recreates corrupted scorefiles. During my tests of the cleaned-up and improved rewrite on the debloated branch, I regularly had to corrupt these files on purpose. File contents getting fully or partially overwritten with 00 bytes is the most common kind of corruption you'd encounter with modern operating systems and SSDs, but hilariously enough, that happens to be the exact kind of corruption these games might even fail to detect. If these 00 bytes cover an entire character-/difficulty-specific section, all three games consider such a zeroed section as valid, since it passes checksum validation?
The deobfuscation algorithm explains why:
// [key1] and [key2] are `uint8_t` as well.
decoded_byte[i] = (key1 + (std::rotr<uint8_t>(encoded_byte[i + 1], 3) ^ key2) + encoded_byte[i]);
When saving a section within these files, the games generate new random values for key1 and key2 and store them directly in the file. Without any kind of hardcoded nonce to perturb the input, this obfuscation scheme thus fully relies on the combination of keys and data to generate random-looking output. Set both of them to 0, and deobfuscation turns into a no-op. Then, a buffer of 00 also sums to 0, which also matches the 0 checksum in the file. In contrast, TH02's obfuscation scheme lacked any source of randomness, but it did cover this exact case…
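To see the zero case in isolation, here is a self-contained sketch built around the formula above. Everything beyond that formula is an assumption: the out-of-place buffer, the skipped final byte, the plain byte sum standing in for the checksum, and all function names. The hand-written `rotr8()` replaces `std::rotr` so the sketch also compiles before C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 8-bit rotate right, equivalent to std::rotr<uint8_t>() for counts below 8.
uint8_t rotr8(uint8_t v, unsigned int count)
{
	assert(count < 8);
	return static_cast<uint8_t>((v >> count) | (v << ((8 - count) & 7)));
}

// Applies the formula to every byte that has a successor. The formula does
// not say how the final byte is treated, so this sketch leaves it out.
std::vector<uint8_t> decode(
	const std::vector<uint8_t>& encoded, uint8_t key1, uint8_t key2
)
{
	std::vector<uint8_t> decoded;
	for(size_t i = 0; (i + 1) < encoded.size(); i++) {
		decoded.push_back(static_cast<uint8_t>(
			key1 + (rotr8(encoded[i + 1], 3) ^ key2) + encoded[i]
		));
	}
	return decoded;
}

// Stand-in for the checksum: a plain sum over all decoded bytes.
unsigned int byte_sum(const std::vector<uint8_t>& bytes)
{
	unsigned int sum = 0;
	for(const uint8_t b : bytes) {
		sum += b;
	}
	return sum;
}
```

With both keys at 0, decoding a zeroed buffer yields nothing but 00 bytes, and their sum matches a stored checksum of 0. Any non-zero `key1` would already turn the first decoded byte into a non-zero value.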
Here's what such a fully zeroed-out GENSOU.SCR looks like in TH04's and TH05's High Score viewer:
If you remember how GENSOU.SCR saves scores in 📝 this silly gaiji-offsetted way, these screens almost explain themselves. 0 minus 160 will always be an invalid sprite ID, and since master.lib's super_put() doesn't bounds-check sprite IDs, it blindly accesses invalid sprite data and probably ends up filling every VRAM bitplane with 1 bits. After the game spent way too much time rendering this garbage data, we then only end up seeing the sprites that get rendered after the very last score digit.
The VV characters might look especially weird in place of the usual stage number, but they quickly make sense once you remember that these numbers are gaiji rendered to VRAM. The PC-98's character generator simply can't support a gaiji with an ID of 0, since it would have to be encoded as 0x0056, which is indistinguishable from the halfwidth V in ASCII. And since master.lib assumes that all gaiji are fullwidth, we get two of them next to each other.
The visual result for a zeroed-out YUME.NEM in TH03's High Score screen, however, is much more… well-defined:
Since YUME.NEM stores names, scores, and stage numbers as raw sprite IDs, we get sprite #0 from REGI2.BFT for all of them.
AAAAAAAA AAAAAAAAAA A
Finally, I stumbled over a script bug in TH04's Good Ending for Reimu A:
The 2014 static English patch fixes this issue. That's probably why this isn't talked about anywhere.
This looks unintentional, and the same line in Reimu B's Good Ending confirms that this is indeed a typo:
\p,ed07.pi
\=0,4
魔理沙:なんだよ、そりゃ\ga9\s160\c
\p,ed07.pi
\==0,4
魔理沙:なんだよ、そりゃ\ga9\s160\c
The 📝 cutscene command reference tells us that the line in the Reimu B variant is preceded by \==, the picture crossfading command, followed by both possible parameters, 0 and 4. Reimu A's script, however, lacks that second = and instead spells out \=, the immediate picture display command, which doesn't take a second parameter. Thus, the command stops reading after the 0 and leaves the trailing ,4 as text to be displayed in the newly started box. The line break is then ignored as usual, causing 魔理沙 to be displayed right next to these two characters.
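The difference between the two commands can be reproduced with a toy parser. This is not the game's cutscene interpreter: `leftover_text()` is a hypothetical function that only knows `\=` with one numeric parameter and `\==` with up to two, and returns whatever remains on the line as box text.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Returns the part of [line] that remains as text after a leading \= or \==
// command has consumed its numeric parameters.
std::string leftover_text(const std::string& line)
{
	if((line.size() < 2) || (line[0] != '\\') || (line[1] != '=')) {
		return line; // Not a picture command
	}
	size_t i = 2;
	int max_params = 1; // \= takes a single parameter
	if((i < line.size()) && (line[i] == '=')) {
		max_params = 2; // \== takes up to two
		i++;
	}
	for(int param = 0; param < max_params; param++) {
		while(
			(i < line.size()) &&
			std::isdigit(static_cast<unsigned char>(line[i]))
		) {
			i++;
		}
		// A comma is only consumed if another parameter may follow.
		if(((param + 1) < max_params) && (i < line.size()) && (line[i] == ',')) {
			i++;
		} else {
			break;
		}
	}
	return line.substr(i);
}

// The two script lines from above.
void check_examples()
{
	assert(leftover_text("\\=0,4") == ",4"); // Reimu A: ",4" becomes text
	assert(leftover_text("\\==0,4").empty()); // Reimu B: fully consumed
}
```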
Whew! Once again, this did turn into more of the typical ReC98 research by the end after all. And that was just 75% of the pushes assigned to this post, because the rest already went towards the debloating work. Next up: Concluding this series and actually applying all this research to the games.
Talk about a nerd snipe! I just wanted to take the first meaningful step towards getting PC-98 Touhou portable. But then, that step massively escalated and resulted in not only the single biggest subproject of 2025, but also in the most productive dev cycle this project has seen since the beginning of the crowdfunding era. 405 commits over 11 pushes, and touching on so many topics that writing a single blog post would have been way too much for even me to handle. So let's try something new and split this delivery into four "smaller" and thematically more focused posts that I'll release in quick succession:
Part 1 (this post) describes the various strategies of porting PC-98 Touhou to modern platforms, explains which one I'm going to take and why, and clears up common misconceptions surrounding performance and accuracy. This one is required reading for anyone (yes, anyone) who believes they want to see these games ported. Hence, it's also intended for people who aren't that familiar with ReC98 and its usual ideals, and tries to not go all too far into technical detail. (Hopefully.)
So, how do we get the PC-98 Touhou codebase into a portable state? That entirely depends on what kind of port we want in the first place, and how much of ZUN's code we are willing to change. Three particularly efficient options immediately come to mind:
On one end of the spectrum, we have a preconfigured PC-98 emulator with disabled configuration options and a stripped-down UI that tricks people into believing they're playing a port and prevents them from accidentally breaking the working configuration.
This might sound like a joke, but it's unironically the most efficient and pragmatic solution that will be good enough for the overwhelming majority of players. If you ask people what they expect from a port, they primarily name ease of use and not having to configure emulators. Both of these can be solved with a preconfigured emulator and thus don't justify the monumental engineering effort of the more complex porting methods described below. That effort also wouldn't be justified if people just wanted a port and had no standards regarding its technical implementation, besides maybe no input lag. Someone has to put in the effort to solve every little challenge on the way from PC-98 to modern systems, and if that effort is not appreciated…
By the way, I have no idea what people are talking about when they claim that PC-98 Touhou has input lag, because there sure is nothing like that in the code that would indicate anything above 1 frame / 17.7 ms for the in-game portions. Any investigation into these issues would therefore have to come from someone else, I'm afraid. Everything points to input lag being the result of misconfigured emulators.
This is not like Shuusou Gyoku, where a port to modern APIs made sense because almost every subsystem still performs suboptimally on modern Windows even after you set up DxWnd, a better MIDI synth, and whatever people are using to make modern gamepads work with ancient DirectInput these days. If you correctly set up a PC-98 emulator, the games do run at full speed, and are highly likely to continue running fine after emulator and operating system version updates.
Thus, can we conclude that wishing for ports is primarily a symptom of the Touhou community's past failure and negligence to spread preconfigured emulators to people? Because this surely shouldn't be a problem in this day and age anymore? While I did my part way back in 2013, it would take until spaztron64's 2021 package for the community at large to finally wake up and realize that this was a problem. Nowadays though, we have at least three decent packages made by separate people that have my personal seal of approval. And yes, this even includes the offering you can obtain at a certain mountaintop place of worship. That site used to be infamous for pushing out slop that violated their own mission statement and externalized costs to the tech support departments of their supply chain, but I'm glad to announce that they've leveled up and now provide a decent solution. And once they remove that archive inside their archive, it will be even better!
Still, if your emulator configuration guides are presented more prominently than your preconfigured emulator downloads, you're doing a disservice to the community. Make guides available, yes, but clearly label them as background information for people who already played the games and then got curious about this old Japanese computer architecture.
OK, but what if you do have standards and would appreciate a technically more solid port that removes layers and maybe even improves the games beyond the limits of the PC-98's architecture? If we take a single step towards native code and native performance, we end up with what people call a "static recompilation" these days. As I explained in the FAQ entry I wrote last year, this kind of port would still emulate the graphics, sound, input, and memory subsystems of a PC-98, but it would cut out CPU emulation.
For PC-98 Touhou, this is actually quite a huge deal: CPU speed is the single biggest point of contention when configuring PC-98 emulators for Touhou, and the vastly different x86 cores of each emulator result in vastly different performance characteristics once you start to benchmark them all more thoroughly. With no more CPU cycles to count, we'd also lose all the VRAM access latencies that emulators typically strive to replicate, and thus pretty much guarantee 0% slowdown in the resulting port. While the aforementioned kind of modded emulator could theoretically also remove cycle counting and VRAM latencies, it would still interpret x86 instructions and thus have a harder time actually reaching the native performance required for 0% slowdown.
This kind of port would also find immediate acceptance within the gameplay community. Since it would only take ZUN's original binaries as input and ignore our reconstructed source, we're guaranteed to retain the exact gameplay logic. The entire instruction translation process would be automated, leaving no room for modernizing the codebase by hand 📝 and accidentally breaking gameplay. We'd still have to defuse at least a few landmines to get the port running without issue, but those would be limited to things like filename casing, for example. Nothing even remotely close to gameplay code.
On the other end of the spectrum, we have something like uth05win: A fully native rewrite of the graphics code that takes every liberty and cuts every corner it needs to rework the game into something that naturally renders within a modern graphics API of our choice. Unlike uth05win, however, our ports will be based on complete decompilations and thus retain the original gameplay code instead of freely rewriting certain parts because they look strange. In turn, we would basically scrap all of ZUN's menu and cutscene code and write quirk-free and sane replacements. Part 4 will drive home just how much more relaxing this course of action would have been…
There's certainly an argument to be had that a modern port should reimagine the game to look and feel as modern as you can get within the original assets, and not stick to PC-98 limitations. After all, the unmodified PC-98 version is always there for you to play on your correctly configured emulator, right? In fact, if we ever wanted to port the games to weaker systems or consoles, this kind of port would be our only option.
But as you might have guessed, we're not going for either of these options:
The first option doesn't even need anything from ReC98. Even the sleekest imaginable release could be done by anyone who either knows about PC-98 emulation or keeps in contact with someone who does, and is comfortable messing around with emulator source code. In fact, I'm not even a particularly qualified person for this job; I frequently mess with emulator configurations for research reasons, and then forget the correct values for certain obscure settings.
This is such an obvious and efficient move that I seriously wonder why nobody has done it so far… but then again, I thought the same about every other idea I ended up doing myself in this space over the past 15 years. If that idea sounds great to you, feel free to go ahead – it represents the opposite of what this project is about, so the resulting fame is yours for the taking. If y'all see "ports" popping up from a place that isn't this project in the not-too-distant future, you can be pretty sure that their developers followed this strategy.
The second option would indeed be an interesting project in its own right, as I've stated in the FAQ entry. But if you remember 📝 the last time I thought about static recompilation, I was way more excited for recompiling the old compiler we use for the PC-98 code rather than the games themselves. Ironically, this is primarily because of how much a recompilation would complicate the new features we plan to add to the games. Since I can only develop new features on top of a previous reverse-engineering effort, they will necessarily remain tied to the PC-98-native version of the codebase at first. How would we port them, then?
Do I continue developing these features for the PC-98 and then simply recompile them along with the rest of the game? The issue with that approach is that most features won't have a version that could work with the original ZUN codebase that we'd prefer to recompile. For everyone's sanity, most features will only exist as part of a respective game's anniversary branch, which in turn is based on the rearchitected and de-landmined debloated branch. Recompiling these branches would undermine the entire selling point of delivering the pure, untainted ZUN code that would have probably convinced the gameplay community to invest in this strategy in the first place. It might be good enough for the rest of the community, but if I'm going to rearchitect the PC-98 codebase anyway, would there even be a point in developing the required recompilation techniques on the side? Would this give us ports faster than following a more classical approach?
Then again, I could still try slicing out the code for these features in a way that would allow them to be shared between the rearchitected PC-98 and recompiled ZUN codebases. But that's bound to create an unnatural and awkward mess that's probably even worse than the way I have to arrange ZUN's code on the unmodified master branch. I'd definitely charge extra for that.
Do I just copy-paste and maintain two versions of the feature code for both platforms, manually transferring all required reverse-engineering to the recompilation? That might feel very dull, but it's probably more efficient than any attempt at sharing that code.
Or do I just abandon the PC-98-native codebase? In favor of a pseudo-PC-98 codebase that still very much assumes PC-98 hardware but doesn't actually run on real or conventionally emulated PC-98 hardware…
The last point in particular demonstrates just how little of a help a recompilation would actually be. Since it would continue to emulate the PC-98's graphics system, I'd still have to write any new graphics code against the PC-98's planar and two-page VRAM. Automatically porting the games to a friendlier and more generic rendering paradigm is infeasible for even an advanced recompiler: Every part of the original game expects PC-98 hardware, and a generic rewrite requires engineering decisions at a much higher level than the individual x86 instructions a recompilation operates at.
And ultimately, it's these individual features that people should be (and mostly are) hyped for. Community-usable replays, translations, and TH03 netplay can all be implemented natively on PC-98. Sure, netplay would be easier to develop and easier to use within a TH03 recompilation since we can just use the native network stack of your host OS 📝 without any intermediaries. But developing both a recompiler and netplay would still take longer than 📝 following through with our current PC-98-native plan.
The third option is actually quite popular, or would at least be acceptable to the majority of the general fandom. This is what non-technical people have in mind anyway when they think about ports, even if they don't confuse ports with remakes.
To find out just how acceptable such a port would be, I picked screen fade effects as a representative detail for the corners that such a port would cut, and asked how people judge the natural alpha-blended implementation in uth05win against the palette-based method you'd use on a PC-98. Surprisingly, a whopping 79% of respondents don't have any problem with a port using whatever is most natural for the system it runs on. And that's 79% of my audience, which certainly is at least somewhat aware of PC-98 hardware details and the limitations that shaped these games into what they are. Of course, the 21% of die-hard PC-98 supremacists would then loudly complain that such a choice would make the port literally unplayable, but we could easily dismiss them by pointing to the poll where the community decided in favor of the smoother option. After all, ZUN's intention was to have a fade, and manipulation of a 12-bit color palette was simply the only tool he had on a PC-98.
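For readers unfamiliar with the palette-based method, this is roughly what it boils down to. The sketch assumes a 16-color palette with 4 bits per component and a linear fade to black; `RGB4`, `Palette`, and `faded()` are illustrative names, not taken from ZUN's or master.lib's code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One 12-bit color: 4 bits per component, each ranging from 0 to 15.
struct RGB4 {
	uint8_t r, g, b;
};
using Palette = std::array<RGB4, 16>;

// Returns [pal] scaled toward black. [step] runs from 0 (fully black) to
// [steps] (original colors). No pixel in VRAM is touched; only the 16
// palette entries change.
Palette faded(const Palette& pal, int step, int steps)
{
	assert((steps > 0) && (step >= 0) && (step <= steps));
	Palette ret = pal;
	for(RGB4& c : ret) {
		c.r = static_cast<uint8_t>((c.r * step) / steps);
		c.g = static_cast<uint8_t>((c.g * step) / steps);
		c.b = static_cast<uint8_t>((c.b * step) / steps);
	}
	return ret;
}
```

With only 16 possible values per component, such a fade can never have more than 16 distinct brightness levels, which is where the visible difference to an alpha-blended fade comes from.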
However, the gameplay community has much higher hopes for ReC98. Both they and I don't just want to supplement the original PC-98 versions with something that's playable on modern systems, but
> replace the need for the proprietary, PC-98-exclusive original releases and their emulation for even the most conservative fan
as I wrote back in 2014. Sure, the community can manage spreading pre-configured emulators for a few more years, but wouldn't it be great if they could stop doing that at some point in the far future?
So if all the "easy" solutions either don't have much of a purpose or disappoint in some way, we're only left with the hard one: A classic, manual port done primarily for the sake of solving an engineering challenge. But hey, this means that it'll also produce tons of blog posts for all of you to read, which apparently is at least as popular as actually playing the games.
Here's what we're going to do:
Rearchitect the game to end up with one shared codebase that compiles for both PC-98 and modern systems, avoiding the code duplication drawback of static recompilation approaches.
Accept nothing less than a pixel-perfect port. The PC-98 and modern versions should look identical on every frame. It is not ReC98's job to reimagine the games; as usual, I'm going to do the hard work, and it's up to other modders to throw it all out and simplify it later.
Perform all the automated gameplay validation we possibly can to earn the trust of the gameplay community, avoiding debacles like 📝 the 📝 two recent desyncs in my Shuusou Gyoku build. This forces us to have a lightweight method of recording replays on top of the unmodified master branch before we can start porting – a fact that Ember2528 already somewhat identified within his current roadmap of funding priorities for TH03.
Continue fixing landmines, bugs, and bloat. Many landmines must necessarily be fixed for a port to work at all, bugfixes are highly requested by most fans and backers, and bloat fixes ensure maintainability, moddability, and bring the PC-98 versions closer to the performance a modern port will naturally run at.
Sure, the main drawback here is the immense development effort required. But in exchange, the port retains readable and moddable code and continues to deliver the insights that this project has always stood for. Imagine stepping through gameplay code using a native C/C++ debugger at your native screen resolution!
But before we can get to how I'm going to do all that, there are two popular misconceptions I have to address.
Note that none of the emulators have accurate slowdown; the slowdown will not match real hardware.
Objectively, this is a true statement. Neko Project's i386 core is the closest thing to cycle-accurate PC-98 emulation we have, as its per-instruction cycle counts match Intel's documentation. But even its performance characteristics are wildly inaccurate compared to a real PC-98 system with a 386, as we're going to see in the next blog post.
The problem I have with this sentence is that it's very misleading in this specific context. The mere mention of accurate slowdown in a beginner's guide on PC-98 emulation paints said slowdown as something desirable and worthy of preservation. It evokes stories of console speedrunners and emulator developers who deal with fixed, well-defined hardware where the concept of accurate slowdown makes sense. Stories that probably originated from a time before decompilations of classic games became commonplace, when it was hard to say whether a particular instance of slowdown was intended or not. And even with a decompilation, these things remain a matter of interpretation if you can't ask the original developer. Thus, it's completely understandable why observable behavior of real hardware remains the one benchmark of accuracy and quality that people can understand and rally around.
The PC-98, however, is very much not that kind of fixed system, but a computer architecture that spanned 18 years of hardware evolution, from 1982 to 2000. Even if we reduce this list of models to the ones that match ZUN's stated minimum system requirements, we're still looking at 7 years of hardware, running different microarchitectures at different clock speeds and with different resulting bottlenecks. If there's such a big variety of systems, which particular slowdown behavior should the ports even preserve?
The obvious answer is "the one from the exact system ZUN wrote these games on", but we don't know that system. 📝 Last year, I claimed that ZUN developed these games on a PC-9821Xa7, but I didn't add a citation back then and can't find one now. The closest piece of related known info is this note on the Amusement Makers page that hosts the official downloads for the trial versions, listing three PC-98 models that they confirmed to run the games without issues:
なお当サークルでは
・ NEC PC-9821Xs i486DX2 66MHz
・ NEC PC-9821La13 Pentium Processor (P54C) 133MHz
・ EPSON PC-486MS AMD 5x86-P133 換装
などで正常に動くことを確認しています
These models are one whole CPU generation apart and their clock speed differs by 100%. Which one of these is supposed to have the accurate slowdown?
But even if we knew, it doesn't matter. The README is clear about ZUN's intentions:
If ZUN recommends a 486 or faster to avoid slowdown, this necessarily means that any unintentional slowdown is indeed unwanted.
Also, note how only TH02's README claims that the game was exclusively tested on a 66 MHz model, which is highly likely to be that PC-9821Xs listed on the Amusement Makers page. Did ZUN switch to a faster PC-98 model for the development of the last three games? That late into the architecture's lifespan? Or did he merely test the game on faster models while the main development still took place on his 66 MHz model?
Picking a CPU clock speed for emulators
Of course, this now creates a problem for everyone wanting to configure emulators for PC-98 Touhou. If the ideal Touhou machine is infinitely fast, we should always pick the fastest possible emulated CPU speed, right? Historically, this has been bad advice: Most emulators will then stick to exactly the number of cycles per emulated second you specified in the menu, slowing down the emulated system whenever the host CPU can't keep up. It's this kind of emulator behavior that gets players to manually look for "the sweet spot" – the maximum possible explicitly specified CPU clock speed that still manages to render without slowdown on their system. This is a tragedy for many reasons:
Regular players probably don't analyze performance with any kind of rigor. I certainly have never heard them say how they made sure to record a video at 56.423 FPS and then stepped through its individual frames to confirm the absence of lag.
Instead, they will probably present their clock speed configuration as a general recommendation to others, without realizing that the "sweet spot" they found is specific to their system. If others then try this clock speed on a slower CPU, they get slowdown instead, and thus gain an entirely wrong impression about how fast the game is supposed to run, backed by a presumptive expert on the topic.
Admittedly, this will become less likely as time marches on, CPUs get faster, and emulators keep optimizing their x86 cores.
But really, why are we expecting players to do this?!
Ever since 2019, however, SimK has been developing an Async CPU mode for Neko Project 21/W, which finally got stabilized in ver0.86 rev.93, back in April of this year. Activate this mode with the Screen → CPU clock stabilizer and Screen → Dynamic CPU clock adjustment options, and then you should theoretically be able to finally stop worrying: Just specify the maximum possible clock speed in the usual configuration menu, and Neko Project will dynamically reduce the emulated clock speed to the fastest speed your system can handle.
Then, the games are supposed to run similarly to how a correctly configured Anex86 has been running them all along, but with an additional 21 years of emulation accuracy improvements.
Sadly, this mode still needs a bit of work. Excessively high clock speeds will result in wildly fluctuating frame rates and even BGM tempos during the first few seconds of a game session as Neko Project 21/W apparently takes a while to find the optimal clock speed. Even afterwards, emulation remains noticeably slower than Anex86:
This is Neko Project 21/W ver.0.86 rev.95 configured with a clock speed of 1 GHz, running on an Intel Core i5-8400T. The fluctuations are not nearly as intense during the rest of a game session, but remain noticeable throughout.
But what about DOSBox-X, the other good emulator recommended these days? This Async CPU mode is very similar to the cycles=max option that DOSBox-X has supported all along. If you try running my 📝 past and future blitting benchmarks using this option, you can observe how DOSBox-X also starts with a low cycle count and then gradually speeds up to accommodate the actual processing load.
In the much less synthetic test case of running PC-98 Touhou, however, DOSBox-X's cycle adjustment reveals itself as much more sophisticated than Neko Project 21/W's implementation. The showdetails=true option reveals that the cycle count does fluctuate quite heavily, which does translate into minor BGM dropouts particularly near the start of a session. But these dropouts are tiny in comparison to what you'd get on Neko Project 21/W, and the framerate remains stable throughout.
As for overall performance, DOSBox-X's simple interpreter core is not nearly as optimized as Neko Project 21/W's interpreter and peaks at roughly half of its speed. The dynamic_nodhfpu core, however, solidly beats Neko Project 21/W by the same 50%. And it's this added bit of performance that makes all the difference: It eradicates slowdown in most of the usual spots in PC-98 Touhou where emulators and even Anex86 typically struggle, and turns DOSBox-X into the first emulator to finally beat Anex86's performance on the same hardware in all the workloads that matter. The dynamic core still doesn't quite reach the speeds of the hypothetical infinitely fast PC-98 on my outdated system, but it remains the most reliable configuration option when it comes to delivering ZUN's intended vision. If we ignore the BGM dropouts.
Just make sure to explicitly select the dynamic_nodhfpu variant, not the regular dynamic core. The latter is infamous for recompilation errors in FPU code that break TH01 gameplay. While that specific issue is ostensibly fixed, I still managed to occasionally run into smaller FPU-related bugs in current DOSBox-X versions. Unfortunately, I didn't manage to capture them on video; I would have reopened the issue on the spot if I did.
Of course, any performance measurement of an emulator with dynamic cycle adjustment can only ever represent a snapshot of the ever-changing adjustment state, and should therefore be taken with a grain of salt. Hence, these screenshots are purely decorative; I just added them because I'm sure that someone would have asked for exact numbers otherwise. Also, the exact relations between emulators are highly dependent on the workload…
And yes, that's a new benchmark! More about this one 📝 in part 2.
(Still, it's remarkable how close Anex86 gets despite its interpreter core, and how it even beats DOSBox-X in MOVS performance. I looked at Anex86's disassembly for 10 minutes and saw big tables of tiny per-instruction functions with custom calling conventions that make remarkably efficient use of the few registers you get in x86. Also, negative offsets? They must have written this entire x86-on-x86 core in ASM.)
While this is great news for players, the whole situation remains very unsatisfying at a technical level. Even if you don't care about the remaining BGM dropouts, running these games at the highest possible emulated clock speed means that you constantly spend 100% of all CPU cores assigned to your emulator just to avoid slowdown and lag in a few particularly CPU-intensive sections. Power saving might be the single best practical argument in favor of a port.
Also, all this complexity involved in dynamic cycle adjustment raises one question you might have had all along. Why don't we just leave our emulated CPUs at 66 MHz? After all, ZUN said that 66 MHz is enough to eliminate all slowdown in at least TH02 and TH03, so how about just living with whatever slowdown we'd still experience in TH04 and TH05? This is certainly a healthier approach, much more appropriate for these silly little indie games that were never meant to be obsessed about at this level, and we get rid of those last few BGM dropouts in DOSBox-X!
Well, if that statement was ever correct to begin with, it would have only applied to real hardware and not to emulators. mu021 reported that the final phase of TH02's Mima fight slowed down even at 78 MHz in Neko Project, and part 4 will contain 📝 even more examples of how 66 MHz slows down several effects in menus and cutscenes, and thus paints a wrong picture of them. Hence, choosing 66 MHz for a preconfigured emulator package might have a particularly annoying side effect: If people get used to how slow these effects run on emulators, they might be rather irritated once the modern ports inevitably run them at the intended speed denoted in the code. I can already imagine them yelling "too fast!", "inaccurate!", and "literally unplayable!", oblivious to the fact that they had the wrong idea about these effects all along.
Or maybe it'll all be fine once part 4 has documented these issues in depth. I certainly wouldn't criticize a package for choosing 66 MHz. All choices are unsatisfying at some level…
If only we could optimize the games enough to remove any unwanted slowdown at 66 MHz. Then, people could freely choose one emulator over another for reasons unrelated to performance, because even cycle-limited emulators could then actually deliver on ZUN's statements in the README files…
And since we've defined debloating as an integral part of port development earlier, that's exactly what we're going to do.
But can we even do that within our high standards? Obviously, our ports should remain…
Frame-perfect
Since all five games are explicitly timed around VSync, it's immediately clear what we mean by this term:
Everything rendered to a single page of VRAM between two VSync wait loops defines one single logical frame.
If we are double-buffering correctly and the PC-98 system running the game is fast enough to finish rendering such a logical frame to VRAM within two VSync signals, everything is fine: The sequence of frames you can observe on your screen matches the logical sequence of internal frames, and we can easily record this sequence and compare the port against it.
But what about unintentional slowdown? In these cases, ZUN asks the system to do way more work than it can execute between two VSync signals. Notably, this also includes most loading times: Once we add disk access into the mix, we can't guarantee hitting any VSync deadlines anymore, and decompressing all these 640×400 images is quite expensive as well. Obviously, we don't want to abandon our goal of frame-perfection and the comparability of ports just because of this variability, so let's add another rule:
Individual defined frames may be shown on screen for any integer multiple of the frame time.
The reason for the integer restriction is obvious: If we start drawing to the screen in the middle of a frame, we get screen tearing and thus a non-perfect frame – not just because tearing looks bad, but also because the position of the tearing line always depends on the overall performance of the system you run the game on.
The combination of these two rules leads to an immediate consequence:
The games must only ever display complete logical frames.
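Taken together, these rules also describe how a port could be verified against a recording. Here's a minimal sketch of such a check, assuming that every frame has already been reduced to a hash; `FrameRun`, `collapse_runs()`, and `is_frame_perfect()` are hypothetical names made up for this illustration and not part of any existing codebase:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One run of identical consecutive frames: the frame's hash and how many
// refresh intervals it stayed on screen.
struct FrameRun {
	uint32_t hash;
	size_t length;
};

// Collapses consecutive identical frame hashes into runs.
std::vector<FrameRun> collapse_runs(const std::vector<uint32_t>& frames)
{
	std::vector<FrameRun> runs;
	for(const uint32_t hash : frames) {
		if(!runs.empty() && (runs.back().hash == hash)) {
			runs.back().length++;
		} else {
			runs.push_back({ hash, 1 });
		}
	}
	return runs;
}

// Returns true if [displayed] shows every frame of [logical] in order, each
// for an integer number (>= 1) of refresh intervals. A torn frame would
// produce a hash that doesn't exist in [logical] and fail the comparison.
bool is_frame_perfect(
	const std::vector<uint32_t>& logical,
	const std::vector<uint32_t>& displayed
)
{
	const std::vector<FrameRun> want = collapse_runs(logical);
	const std::vector<FrameRun> got = collapse_runs(displayed);
	if(want.size() != got.size()) {
		return false;
	}
	for(size_t i = 0; i < want.size(); i++) {
		if((want[i].hash != got[i].hash) || (got[i].length < want[i].length)) {
			return false;
		}
	}
	return true;
}
```

Comparing runs rather than individual frames is what allows slowdown and loading times to stretch a frame without failing the check, while still rejecting anything that was skipped or torn.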
And now we have a problem. Our rules have just outlawed screen tearing, but nearly every menu and cutscene screen in ZUN's original code has some kind of screen tearing issue. 📝 The Music Room of TH02-TH04 represents probably the worst example as it suffers from screen tearing on every single frame:
Also, how would you possibly preserve these tearing lines once you've ported the game? After all, modern platforms not only imply much faster CPUs, but also completely different rendering methods, especially once we add scaling into the mix.
This can only mean one thing:
It is fundamentally impossible to port the unmodified codebase of PC-98 Touhou and remain frame-perfect to the original release.
You could maybe get there by throwing out the integer multiple rule and accepting torn frames as legitimate. But then you'd have to decide on a particular model whose slowdown behavior you'd want to replicate and lock down exactly – and as I've stated in the section above, that's quite a silly and impractical proposition.
Resolving screen tearing
So, how do we get back to a comparable sequence of well-defined frames? This can only work if we leave the confines of real hardware and instead reach for the infinitely fast PC-98 that ZUN wanted to have anyway. Such a system would never exhibit screen tearing because it would naturally complete all rendering within the vertical blanking interval preceding each displayed frame. Once our code then ends a frame by entering a busy-waiting loop for the next VSync signal, the screen would then get to draw static and well-defined VRAM contents. This behavior is the whole reason why I get to classify screen tearing issues as landmines that must always be fixed, as opposed to bugs that a port could potentially retain.
If we actually had such an infinitely fast PC-98, we could just run ZUN's unmodified code on that system and be done now. But as we've seen above, not even DOSBox-X's dynamic core manages to run PC-98 Touhou at the infinitely fast level we'd need. Also, we wanted to get rid of relying on specific emulators and have already planned to optimize all this code anyway…
So let's defuse each screen tearing landmine one by one by rewriting its code to match the output of an infinitely fast PC-98. This is a lot more feasible than it sounds because these landmines aren't actually caused by a lack of CPU power. Every screen tearing issue comes down to ZUN misplacing certain screen-affecting operations within the hellscape of imperative hardware state mutations that is his menu and cutscene code. You can either hide the issue by throwing an infinite amount of processing power at the problem so that the order of mutations no longer observably matters, or you can just write good code.
In theory, we only have to follow a few rules:
All VRAM page flips and hardware palette changes must be moved to the vertical blanking interval.
Since TRAM is always single-buffered and ZUN rarely writes to the topmost rows, we can get by with merely moving TRAM writes close to the vertical blanking interval if we don't manage to hit the interval exactly.
On single-buffered screens, the same is true for VRAM. This category mainly includes menu screens whose upper VRAM rows thankfully remain static, so we also get some leeway here. Rewriting these screens to be double-buffered might sound better, but doing so at the high level where these landmines have to be fixed would only create more of a mess, 📝 for reasons I'll explain below.
In rare cases, ZUN placed expensive file load calls and draw calls on the same logical frame within a single-buffered screen. For an infinitely fast PC-98, this is no problem. But since all bets are off once disk access is involved, there is no way we can hide the draw calls and avoid the resulting screen tearing on real hardware and emulators while still sticking to ZUN's defined sequence of logical frames. Thus, we have to make an exception and insert an additional VSync delay loop after the load calls to separate loading and rendering, creating a new logical frame that did not exist in ZUN's original code.
This might sound very controversial. We've just come up with this mental model of an infinitely fast PC-98 to solve frame-perfection, only to now deviate from it again and snap back to reality? However:
As I'm going to describe in 📝 part 2, we're about to speed up loading and blitting by much more than this one added frame.
If we run this logical frame on the actual fastest real-hardware PC-98 system the community has to offer and even that system takes longer than 17.7 ms to render it, it's hard to argue against formalizing a delay you'd be getting on real hardware anyway.
The difficulty of actually pulling this off, however, can range anywhere from Easy to Lunatic, depending on the screen, because of course every one of them is different. Even after these 11 pushes, I'll be far from done. But in the end, we'll have perfect and easily verifiable frame parity between the PC-98 versions and the future ports, even though we had to bend the code a little. Or a lot. Oh well.
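To illustrate the first rule from the list above, here's a rough sketch of what "moving page flips and palette changes to the vertical blanking interval" could look like structurally: High-level code only requests the change, and a single frame-ending function applies all requests right after the VSync wait. Everything here is hypothetical, including `FakeHardware`, which merely stands in for the real I/O port writes so that the ordering can be observed:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct RGB4 {
	uint8_t r, g, b; // 4 bits per component
};
using Palette = std::array<RGB4, 16>;

// Hypothetical stand-in for the hardware. Counts every screen-affecting write
// that happens outside of the (simulated) vertical blanking interval.
struct FakeHardware {
	int shown_page = 0;
	Palette palette = {};
	bool in_vblank = false;
	int writes_outside_vblank = 0;

	void wait_for_vsync() { in_vblank = true; }
	void leave_vblank() { in_vblank = false; }
	void show_page(int page) { note_write(); shown_page = page; }
	void set_palette(const Palette& p) { note_write(); palette = p; }

private:
	void note_write() { writes_outside_vblank += (in_vblank ? 0 : 1); }
};

// Collects screen-affecting state changes during a logical frame and applies
// them in one place, immediately after the VSync wait.
class FrameState {
	bool page_pending = false;
	bool palette_pending = false;
	int pending_page = 0;
	Palette pending_palette = {};

public:
	void request_page(int page)
	{
		assert((page == 0) || (page == 1));
		pending_page = page;
		page_pending = true;
	}

	void request_palette(const Palette& p)
	{
		pending_palette = p;
		palette_pending = true;
	}

	// Ends the logical frame.
	void end_frame(FakeHardware& hw)
	{
		hw.wait_for_vsync();
		if(page_pending) {
			hw.show_page(pending_page);
			page_pending = false;
		}
		if(palette_pending) {
			hw.set_palette(pending_palette);
			palette_pending = false;
		}
		hw.leave_vblank();
	}
};
```

The point of the indirection is that it no longer matters where in the frame a menu decides to flip pages or change colors. The actual difficulty lies in untangling ZUN's code enough to get there, which this sketch conveniently skips.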
If you only opened this post for the required reading part, you can stop reading now. I've got a few more technical thoughts about a few implementation details of the future ports that tend to come up in discussions, but these aren't as essential as the high-level issues above.
So we've now decided on what to do in order to make the ports good, but what are the basic challenges we have to solve in order to port these games to modern systems in the first place? Let's start with a perhaps surprising list of non-issues that some people might perceive as challenges:
Sound. As people of culture, we can all agree that PCM recordings of sequenced sound are sacrilegious, so the ports will always use some kind of emulation here. Therefore, I'll simply ask sound people for the best YM2608 and PMD cores that won't get me canceled. If I still get canceled, we'll just resolve the disagreement with a violent flamew- I mean, a constructive discussion, or just offer multiple options if there are valid arguments for either choice – similar to how you can 📝 choose between real SC-88Pro or virtual Sound Canvas VA recordings for my Shuusou Gyoku build.
TH03's SPRITE16-powered in-game renderer. For a port, it does not matter at all how a sprite driver was originally implemented. ZUN already streamlined regular sprite blitting down to three common functions, which a port would simply need to implement differently. The game code still contains 21 additional calls to SPRITE16 functions for certain special effects, but none of these additional monochrome, masked, or overlapped blitting modes are unique to TH03.
In short: If the feature in question is consistently used through an API, it's not a challenge in itself. The hard parts are all the opposite cases – when ZUN suddenly starts writing to VRAM segments or I/O ports in the middle of gameplay code, like he does all over TH01. All of these instances need to be manually cleaned up and abstracted away. Conversely, this is also why 📝 TH02 remains by far the easiest individual game to port – it has the least amount of hand-written blitting code and mostly sticks to master.lib functions.
Instead, the biggest immediate challenge is something far more basic:
🎨 Palettized and planar graphics 🎨
After all, PC-98 Touhou doesn't just view the PC-98's graphics subsystem as an obstacle to overcome, but occasionally makes creative use of both palettes and individual bitplanes. How would we possibly cover these effects in a modern graphics API that will be far removed from these concepts? Three challenges immediately come to mind in that regard:
The whole concept of enforcing a single 16-color palette across the entire screen in a world where 32-bit RGBA is the only reliably available texture format. Shaders offer a simple solution: We simply wouldn't use traditional textures, and just write our own sampler that takes both the original palettized 16-color+alpha image and the global palette as input, and performs a lookup for each texel. But what are we supposed to do in SDL_Renderer's fixed-function pipeline? Use the CPU to update all loaded textures on every palette color change? Split each sprite into a separate texture for each color and consume 16× the amount of VRAM just so that we can use vertex colors for each individual color layer? Or break down every sprite into a point list to save the VRAM?
Any kind of sprite-shaped palette color bit flipping effect, such as 📝 the falling polygons in the Music Room. Effects like these could potentially be hardware-rendered even in a fixed-function pipeline if we split the background image into two and render the polygons using regular triangles with their UV coordinates matched to the pixel coordinates on every frame. But would all the involved interpolation reliably give us the original sharp edges without reaching for a shader to ensure that it does? In any case, this solution would need a completely different implementation for a modern port than it currently uses in ZUN's PC-98-native code, which gets by with less per-frame redraw than you'd think that this effect would need.
uth05win didn't even get to port the Music Room, which is probably not without reason.
TH01's square-shaped inverting effects used during bomb and entrance animations. Flipping a given bit of a pixel's palette index? Based on what's there before? No way around a shader for this one…
Note how the flipped cards rip holes into the square trails. I'm not even sure what the TH01 Anniversary Edition would change about the effect, or whether it even should change anything about it. Good luck porting this effect pixel-perfectly without pixel-level access.
However, writing all this custom graphics code for the modern port would run against my previously stated goal of sharing as much code as possible between PC-98 and modern platforms. While shaders are the conceptually simpler solution for all of these challenges, they aren't easy in practical terms, and I already 📝 decided against using them for Shuusou Gyoku for good reasons. Also, is all of this really worth the effort if these games demonstrably don't even need the performance of GPU rendering?
But that only leaves one conclusion:
The future ports of PC-98 Touhou to modern systems will software-render the graphics layer on the CPU.
I know, that sounds very shocking and probably disappointing at first. But at a closer look, it's really not all that bad. These games have been software-rendered all along by not only PC-98 emulators, but by real hardware at mid-90's CPU speeds. You might point to the GRCG and EGC chips as evidence for at least some capacity of hardware acceleration, but I see them more as workarounds for the unfortunate planar nature of VRAM on this Japanese business computer architecture. In the end, "software rendering" only means that the CPU receives access to every pixel in the framebuffer. Once all graphical functionality is neatly abstracted away and the game no longer directly accesses the four physical bitplanes, the ports can store sprites and the rendered graphics layer in the most performant way.
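As an example of what that pixel-level access buys us, here's a sketch of how an effect in the style of TH01's inverting squares might look on a software-rendered buffer of palette indices. The one-byte-per-pixel layout, the names, and the choice of which bit gets flipped are all assumptions for illustration; none of this is taken from ZUN's code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int RES_X = 640;
constexpr int RES_Y = 400;

// Hypothetical chunky framebuffer: one palette index (0-15) per byte,
// stored row by row.
struct IndexedBuffer {
	std::vector<uint8_t> pixels = std::vector<uint8_t>((RES_X * RES_Y), 0);

	uint8_t& at(int x, int y) { return pixels[(y * RES_X) + x]; }
};

// Flips the bit that corresponds to the given bitplane (0-3) in the palette
// index of every pixel inside the rectangle, clipped to the screen. The
// result depends on what was there before, and applying the same rectangle
// twice restores the original pixels.
void invert_plane_in_rect(
	IndexedBuffer& buf, int left, int top, int w, int h, int plane
)
{
	assert((plane >= 0) && (plane < 4));
	const uint8_t mask = static_cast<uint8_t>(1 << plane);
	const int x_end = std::min((left + w), RES_X);
	const int y_end = std::min((top + h), RES_Y);
	for(int y = std::max(top, 0); y < y_end; y++) {
		for(int x = std::max(left, 0); x < x_end; x++) {
			buf.at(x, y) ^= mask;
		}
	}
}
```

No shader, no render target readback, just a XOR. On the PC-98 side, the same interface would flip the bits within a single bitplane instead.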
Also, note how I only said "graphics layer". Besides 📝 the obvious candidate of framebuffer scaling, the ports will use the GPU for two more important aspects:
The PC-98's text layer. With 8 fixed colors and glyphs drawn from a more or less static font ROM/gaiji texture, there is no reason not to render this layer entirely on the GPU. Even color reversing is as simple as defining a custom blend mode that inverts the alpha channel, which SDL supports for all of its renderer backends.
Vertical scrolling. 📝 The original games also reach for a PC-98 hardware feature here, and this feature can be replicated within 3D APIs in exactly the same way by adjusting the UV coordinates of the VRAM texture. This insight reduces the software renderer's required per-frame redraw to exactly the same amount as the PC-98 version, and should defeat any remaining concerns you might have about software rendering.
The still image in that post from two years ago doesn't demonstrate the PC-98 way of VRAM scrolling all too well, so here's a longer video that scrolls an entire screen's worth of tiles:
In the game logic, all entity positions represent the scrolled on-screen view, while the sprites are offset by the Y coordinate of the green line (representing the top of the scrolled screen) before they are blitted. Also note how ZUN never redraws the area between the yellow line (representing the bottom of the playfield) and the green line as part of the scrolling process, since it's always covered by a 16-pixel row of black TRAM cells. Any redraws there are a result of regular tile invalidation caused by overlapping sprites, and remain isolated to the VRAM page that the game rendered to when the overlap happened.
The gameplay is taken from 📝 ZUN's hidden TH05 Extra Stage Clear replay.
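The coordinate math behind this kind of scrolling is small enough to sketch. The following assumes that VRAM wraps around after 400 rows and that the scroll line is stored as the VRAM row shown at the top of the screen; the function names are made up for this example:

```cpp
#include <cassert>

constexpr int VRAM_H = 400; // Assumption: scrolling wraps after 400 rows

// Wraps any Y coordinate into the [0; VRAM_H[ range.
int wrap_vram_y(int y)
{
	const int wrapped = (y % VRAM_H);
	return ((wrapped < 0) ? (wrapped + VRAM_H) : wrapped);
}

// Game logic works with on-screen coordinates. Before blitting, a sprite's Y
// coordinate is offset by the scroll line and wrapped back into VRAM.
int vram_y_from_screen_y(int screen_y, int scroll_line)
{
	assert((scroll_line >= 0) && (scroll_line < VRAM_H));
	return wrap_vram_y(screen_y + scroll_line);
}

// In a port, the same scroll line would turn into a vertical texture
// coordinate offset for the quad that displays the VRAM texture, assuming a
// sampler that wraps (repeats) the texture vertically.
float scroll_v_offset(int scroll_line)
{
	assert((scroll_line >= 0) && (scroll_line < VRAM_H));
	return (static_cast<float>(scroll_line) / VRAM_H);
}
```

Either way, scrolling itself never touches any pixel data. Only the rows that newly scroll into view need to be drawn.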
As a result, the software renderer of our hand-crafted ports would still internally produce a graphics and text layer that persists across frames and receives minimal redraws, just like the PC-98 originals did. In fact, it would have to produce the exact same graphics layer if we wanted to port the non-Anniversary Edition, including the tile source area. There's no technical need to keep tiles on the graphics layer in a port, but certain intense shake effects temporarily reveal individual tiles below the HUD:
This definitely counts as a bug to be fixed in this game's Anniversary Edition, but how would we fix this one on PC-98 where we do need the tile area in VRAM? Moving the tiles to another place and patching the 📝 .MAP at runtime?
Applying the palette to produce the final rendered image then raises another set of exciting engineering questions. Would we actually use a palettized 4bpp buffer in memory, storing two pixels in a byte? Perhaps with an 8-bit palette that maps each possible pair of pixels to a pair of 32-bit RGBA values, halving the amount of per-frame palette lookups? Or would we always store an RGBA image and merely offer a palettized API around it? As far as I'm concerned, these challenges are way more exciting than the prospect of locking ourselves into some shader language.
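To make the pair lookup idea a bit more concrete, here's a minimal sketch of it. The nibble order (high nibble = left pixel), the type names, and the function names are assumptions for this example, not a decision about the actual port:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using RGBA = uint32_t;

struct PixelPair {
	RGBA left;
	RGBA right;
};

// Builds a 256-entry table that maps one packed byte (two 4-bit palette
// indices) to two RGBA values. Must be rebuilt whenever the 16-color palette
// changes, which is cheap at only 256 entries.
std::array<PixelPair, 256> build_pair_lut(const std::array<RGBA, 16>& palette)
{
	std::array<PixelPair, 256> lut;
	for(int byte = 0; byte < 256; byte++) {
		lut[byte] = { palette[byte >> 4], palette[byte & 0x0F] };
	}
	return lut;
}

// Converts a packed 4bpp buffer into RGBA with one table lookup per byte,
// i.e., one lookup per two pixels.
void apply_palette(
	const std::vector<uint8_t>& packed,
	const std::array<PixelPair, 256>& lut,
	std::vector<RGBA>& out
)
{
	out.resize(packed.size() * 2);
	for(size_t i = 0; i < packed.size(); i++) {
		const PixelPair& pair = lut[packed[i]];
		out[(i * 2) + 0] = pair.left;
		out[(i * 2) + 1] = pair.right;
	}
}
```

Whether this actually beats a plain per-pixel lookup on modern CPUs is something that would have to be benchmarked; the 2 KiB table fits comfortably into L1 cache, but so does a 64-byte palette.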
📃 Page flipping 📃
But wait. If the port produces a persistent graphics layer, shouldn't it produce two, one for each VRAM page on the PC-98? From the point of view of a modern port, we really don't need to. We only ever upload one "VRAM page" to the GPU anyway, which is then scrolled and scaled onto one of the GPU's backbuffers inside the swapchain. Then, the game can immediately continue drawing onto the same software-rendered VRAM buffer in the next frame without affecting the GPU output.
Obviously, this rendering paradigm doesn't translate back to the PC-98. There, we must render each frame to either the invisible or the visible page. Also, minimal redraw is crucial because we can neither afford the memory nor the performance to regularly copy an entire 128 KB of pixel data from whatever place to VRAM. As a result, page flips are a common sight in even the highest levels of menu and cutscene code, adding yet another unsightly piece of state you have to keep track of while reviewing and modding the code. I've grown to hate them quite a lot over the past four months because of just how often they are associated with bad code: In most menu and cutscene screens, ZUN just uses the second VRAM page as pixel storage for inter-page copies using the EGC, 📝 whose slowness is a regular topic on this blog. Once you've replaced these copies with optimized blits from conventional RAM, you've not only removed all these page flips and clearly revealed these screens as the single-buffered affairs they've always been, but you've also accelerated them enough to remove any screen tearing issues they might have had at 66 MHz.
Unfortunately, things are not that easy everywhere:
Sometimes, menus and cutscenes do require involved page flipping tricks to cleanly switch between two screens without tearing.
But a few of them are genuinely double-buffered. Their minimal redraw code must indeed always keep two alternating states of VRAM in mind, which effectively leaks a hardware detail – the length of the PC-98's "swapchain" – into the highest levels of game code.
Can we rewrite all of these cases in a way that high-level game code no longer has to care about pages? Can we perhaps even banish page flipping to a new lower level of the architecture that all menus and cutscenes are built on top of, and thus unconditionally double-buffer every screen while still maintaining minimal redraw? Or is none of this worth it and we'll just live with two VRAM pages on all platforms? I'm honestly not sure. And that's just a small preview of the porting challenges that still await us and were far beyond the scope of even these 11 pushes…
As for the commits that are formally assigned to this blog post: It was all maintenance, build system setup, and some debloating work on TH01 around its packfile support that I thought would be necessary, but thankfully didn't need yet after all. More about that in, you guessed it, 📝 part 4.
Alright! Improving performance, fixing screen tearing issues, establishing better cross-platform interfaces, and cleaning up ZUN's code to facilitate all of that… I've got a lot to do now. Next up: Getting closer to our performance goals by optimizing all PC-98-native code surrounding the .PI files used for backgrounds and cutscene pictures, since we later want to draw our TH03 netplay menus on top.
Surprise! The last missing main menu in PC-98 Touhou was, in fact, not that hard. Finishing the rest of TH03's OP.EXE took slightly less than the expected 2 pushes, which left enough room to uncover an unexpected mystery and take important leaps in position independence…
For TH03, ZUN stepped up the visual quality of the main menu items by exchanging TH02's monospaced font with fixed, pre-composited strings of proportional text. While TH04 would later place its menu text in VRAM, TH03 still wanted to stay with TH02's approach of using gaiji to display the menu items on the PC-98 text layer. Since gaiji have a fixed size of 16×16 pixels, this requires the pre-composited bitmaps to be cut into blocks of that size and padded with blank pixels as necessary:
If your combined amount of text is short enough to fit into the PC-98's 256 gaiji slots, this is a nice way of using hardware features to replace the need for a proportional text renderer. It especially simplifies transitions between menus – simply wiping the entire TRAM is both cheap and certainly less error-prone than (un)blitting pixels in VRAM, which 📝 ZUN was always kind of sloppy at.
However, all this text still needs to be composited and cut into gaiji somewhere. If you do that manually, it's easy to lose sight of how the text is supposed to appear on screen, especially if you decide to horizontally center it. Then, you're in for some awkward coordinate fiddling as you try to place these 16-pixel bricks into the 8-pixel text grid to somehow make it all appear centered:
The VS Start menu actually is correctly centered.
Then again, did ZUN actually want to center the Option menu like this? Even the main menu looks kind of uncanny with perfect centering, probably because I'm so used to the original. Imperfect centering usually counts as a bug, but this case is quirky enough to leave it as is. We might want to perfectly center any future translations, but that would definitely cost a bit as I'd then actually need to write that proportional text renderer.
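The coordinate fiddling in question boils down to a bit of integer math. Here's a small sketch that assumes the usual 640-pixel-wide screen with 8-pixel TRAM columns, and a composited text that starts at the left edge of its first gaiji; the function names are made up for this example:

```cpp
#include <cassert>

constexpr int RES_X = 640;
constexpr int GAIJI_W = 16;
constexpr int TRAM_CELL_W = 8;

// Number of 16×16 gaiji needed for a pre-composited string of the given
// pixel width, with the last gaiji padded with blank pixels.
int gaiji_count_for(int text_w)
{
	assert(text_w > 0);
	return ((text_w + (GAIJI_W - 1)) / GAIJI_W);
}

// TRAM column (0-79) that places the left edge of the text closest to the
// perfectly centered position.
int centered_tram_column(int text_w)
{
	assert((text_w > 0) && (text_w <= RES_X));
	// Rounds ((RES_X - text_w) / 2) to the nearest multiple of 8 pixels
	// without leaving integer math.
	return (((RES_X - text_w) + TRAM_CELL_W) / (TRAM_CELL_W * 2));
}

// Signed distance in pixels between the snapped and the ideal left edge.
// Positive values mean that the text ends up too far to the right.
double centering_error(int text_w)
{
	const double ideal_left = ((RES_X - text_w) / 2.0);
	return ((centered_tram_column(text_w) * TRAM_CELL_W) - ideal_left);
}
```

With this model, a 100-pixel-wide string needs 7 gaiji and lands on column 34, 2 pixels right of the ideal position. Snapping to an 8-pixel grid can leave an error of up to 4 pixels, unless you shift the text within the gaiji themselves to compensate.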
Apart from that, we're left with only a very short list of actual bugs and landmines:
The Cancel key is not handled inside the VS menu, arrgghh…! 🤬
ZUN almost managed to write a title screen and menu without a 📝 screen 📝 tearing landmine, but a single one still managed to sneak into the first frame of the title screen's short fade-in animation. This one will blow up when returning from the Music Room, and can be entirely blamed on that screen's choice to leave 📝 a purple color in hardware palette slot 0. Replacing that color with black before returning would have completely hidden the potential tearing.
There might be another one in the long sliding animation, but I can only tell for sure once I've fully decompiled MAINL.EXE.
While the rest of the code is not free of the usual nitpicks, those don't matter in the grand scheme of things. The code for the sliding 東方夢時空 animation is even better: it makes decent use of the EGC and page flipping, and places the 📝 loading calls for the character selection portraits at sensible points where the animation naturally wants to have a delay anyway. We're definitely ending the main menus of PC-98 Touhou on a high note here.
You might have already spotted some unfamiliar text in the gaiji above, and indeed, we've got three pieces of unused text in these two menus! Starting from the top, the label is entirely unused as none of its gaiji IDs are referenced anywhere in the final code. The label's placement within the gaiji IDs would imply that this option was once part of the main menu, but nothing in the game suggests that the main menu ever had a bigger box that could fit a 7th element. On the contrary, every piece of menu code assumes that the box sprites loaded from OPWIN.BFT are exactly 128 pixels high:
Fun fact: The code doesn't even use the 16 pixels in the middle, and instead just assumes that the pixels between the X coordinates of [8; 16[ and [32; 40[ are identical.
The unused MIDI music option has already been widely documented elsewhere. Changing the first byte in YUME.CFG to 02 has no functional effect because ZUN removed most MIDI-related code before release. He did forget a few instances though, and the surviving dedicated switch case in the Option menu is now the entire reason why you can reveal this text without modifying the binary. Changing the option will always flip its value back to either off or FM(86).
Last but not least, we have the label and its associated numbers. These are the most interesting ones in my eyes; nobody talks about them, even though we have definite proof that they were used for the KeyConfig options at some earlier point in development:
That's all I've got about the menus, so let's talk characters and gameplay! When playing Story Mode, OP.EXE picks the opponents for all stages immediately after the 📝 Select screen has faded out. Each character fights a fixed and hardcoded opponent in Stage 7's Decisive Match:
| Player | Stage 7 opponent |
|---|---|
| Reimu | Mima |
| Mima | Reimu |
| Marisa | Reimu |
| Ellen | Marisa |
| Kotohime | Reimu |
| Kana | Ellen |
| Rikako | Kana |
| Chiyuri | Kotohime |
| Yumemi | Rikako |
The opponents for the first 6 stages, however, are indeed completely random, and picked by master.lib's reimplementation of the Borland RNG. The game only needs to ensure that no character is picked twice, which it does like this:
const int stage_7_opponent = HARDCODED_STAGE_7_OPPONENT_FOR[playchar];
bool opponent_seen[7] = { false };
for(int stage = 0; stage < 6; stage++) {
	int candidate;
	do {
		// Pick a random character between Reimu and Rikako
		candidate = (irand() % 7);
	} while(opponent_seen[candidate] || (stage_7_opponent == candidate));
	opponent_seen[candidate] = true;
	story_opponent[stage] = candidate;
}
Characters are numbered from 0 (Reimu) to 8 (Yumemi), following the order in the Stage 7 table above.
Yup. For every stage, ZUN re-rolls until the RNG returns a character who hasn't yet been seen in a previous stage – even in Stage 6 where there's only one possible character left. Since each successive stage makes it harder for the inner loop to find a valid answer, you start to wonder if there is some unlucky combination of seed and player character that causes the game to just hang forever.
So I tested all possible 2³² seed values for all 9 player characters and… nope, Borland's RNG is good enough to eventually always return the only possible answer. The inner loop for Stage 6 does occasionally run for a disproportionate number of iterations, with the worst case being 134 re-rolls when playing Rikako's Story Mode with a seed value of 0x099BDA86. But even that is many orders of magnitude away from manifesting as any kind of noticeable delay. And on average, it just takes 17.15 iterations to determine all 6 random opponents.
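If you'd like to experiment with this yourself, here's a self-contained sketch of such a simulation. It uses the well-known constants of Borland C's `rand()` and assumes that master.lib's `irand()` matches them exactly and that the seed is the RNG state right before the first roll. `BorlandRNG`, `StoryOpponents`, and `pick_story_opponents()` are names made up for this example, so don't expect it to reproduce the exact numbers above without verifying these assumptions first:

```cpp
#include <cassert>
#include <cstdint>

constexpr int PLAYCHAR_COUNT = 9;
constexpr int RANDOM_STAGE_COUNT = 6;

// Stage 7 opponents from the table above, with characters numbered from
// 0 (Reimu) to 8 (Yumemi).
constexpr int HARDCODED_STAGE_7_OPPONENT_FOR[PLAYCHAR_COUNT] = {
	1, // Reimu    -> Mima
	0, // Mima     -> Reimu
	0, // Marisa   -> Reimu
	2, // Ellen    -> Marisa
	0, // Kotohime -> Reimu
	3, // Kana     -> Ellen
	5, // Rikako   -> Kana
	4, // Chiyuri  -> Kotohime
	6, // Yumemi   -> Rikako
};

// Borland C's rand(): a 32-bit linear congruential generator that returns
// bits 16-30 of its state.
// Assumption: master.lib's irand() behaves identically.
struct BorlandRNG {
	uint32_t seed;

	int next()
	{
		seed = ((seed * 0x015A4E35u) + 1);
		return static_cast<int>((seed >> 16) & 0x7FFF);
	}
};

struct StoryOpponents {
	int opponent[RANDOM_STAGE_COUNT];
	int rolls; // Total number of RNG calls across all 6 stages
};

// Same selection loop as in the game, plus a roll counter.
StoryOpponents pick_story_opponents(int playchar, uint32_t seed)
{
	assert((playchar >= 0) && (playchar < PLAYCHAR_COUNT));
	const int stage_7_opponent = HARDCODED_STAGE_7_OPPONENT_FOR[playchar];
	bool opponent_seen[7] = { false };
	BorlandRNG rng = { seed };
	StoryOpponents ret = {};
	for(int stage = 0; stage < RANDOM_STAGE_COUNT; stage++) {
		int candidate;
		do {
			// Pick a random character between Reimu and Rikako
			candidate = (rng.next() % 7);
			ret.rolls++;
		} while(opponent_seen[candidate] || (stage_7_opponent == candidate));
		opponent_seen[candidate] = true;
		ret.opponent[stage] = candidate;
	}
	return ret;
}
```

Looping this function over all seeds and all 9 player characters while tracking the maximum of `rolls` is then all it takes for an exhaustive test.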
The attract demos are another intriguing aspect that I initially didn't even have on my radar for the main menu. touhou-memories raises an interesting question: The demos start at Gauge and Boss Attack level 9, which would imply Lunatic difficulty, but the enemy formations don't match what you'd normally get on Lunatic. So, which difficulty were they recorded on?
Our already RE'd code clears up the first part of that question. TH03's demos are not recordings, but simply regular VS rounds in CPU vs. CPU mode that automatically quit back to the title screen after 7,000 frames. They can only possibly appear pre-recorded because the game cycles through a mere four hardcoded character pairings with fixed RNG seeds:
| Demo # | P1 | P2 | Seed |
|---|---|---|---|
| 1 | Mima | Reimu | 600 |
| 2 | Marisa | Rikako | 1000 |
| 3 | Ellen | Kana | 3200 |
| 4 | Kotohime | Marisa | 500 |
Certainly an odd choice if your game already had the feature to let arbitrary CPU-controlled characters fight each other. That would have even naturally worked for the trial version, which doesn't contain demos at all.
Then again, even a "random" character selection would have appeared deterministic to an outside observer. As usual for PC-98 Touhou, the RNG seed is initialized to 0 at startup and then simply increments after every frame you spend on the title screen and inside the top-level main, Option, and character selection menus – and yes, it does stay constant inside the VS Start menu. But since these demos always start after waiting exactly 520 frames on the title screen without pressing any key to enter the main menu, there's no actual source of randomness anywhere. ZUN could have classically initialized the RNG with the current system time, which is what we used to do back in the day before operating systems had easily accessible APIs for true randomness, but he chose not to, for whatever reason.
The difficulty question, however, is not so easy to answer. The demo startup code in the main menu doesn't override the configured difficulty, and none of the other binaries overrides it based on the demo ID either. This seems to suggest that the demos simply run at the difficulty you last configured in the Option menu, just like regular VS matches. But then, you'd expect them to run differently depending on that difficulty, which they demonstrably don't. They always start on Gauge and Boss Attack level 9, and their last frame before the exit animation is always identical, right down to the score, reinforcing the pre-recorded impression:
Note that it takes much longer than the expected 2:04 minutes for the game to reach this end state. Each WARNING!! You are forced to evade / Your life is in peril popup freezes gameplay for 26 frames which don't count toward the demo frame counter. That's why these popups will provide such a great 📝 resynchronization opportunity for netplay. It's almost as if Versus Touhou was designed from the start with rollback netcode in mind!
With quite a bit of time left over in the second push, it made sense to look at a bit of code around the Gauge and Boss Attack levels to hopefully get a better idea of what's going on there. The Gauge Attack levels are very straightforward – they can range from 1 to 16 inclusive, which matches the range that the game can depict with its gaiji, and all parts of the game agree about how they're interpreted:
Stored in GAMEFT.BFT.
The same can't be said about the Boss Attack level though, as the gauge and the WARNING!! popup interpret the same internal variable as two different levels?
This apparent inconsistency raises quite a few questions. After all, these gaiji have to be addressed by adding an offset from 0 to 15 to the ID of the 1 gaiji, but the levels are supposed to range from 1 to 16. Does this mean that one of these two displays has an off-by-one error? You can't fire a Level 0 Boss Attack because the level always increments before every attack, but would 0 still be a technically valid Boss Attack level?
Decompiling the static HUD code debunks at least the first question as ZUN resolves the apparent off-by-one error by explicitly capping the displayed level to 16. And indeed, if a round lasts until the maximum Boss Attack level, the two numbers end up matching:
This suggests that the popup indicates the level of the incoming attack while the gauge indicates the level of the next one to be fired by any player. That said, this theory not only needs tons of comments to explain it within the code, but also contradicts 夢時空.TXT, which explicitly describes the level next to the gauge as the 現在のBOSSアタックのレベル ("current Boss Attack level"). Still, it remains our best bet until we've decompiled a few of the Boss Attacks and seen how they actually use this single variable.
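To make this theory concrete, here is a minimal sketch of how a single variable could feed both displays. All names are hypothetical; only the cap of 16 and the 0-based gaiji offset are taken from the observations above, and the actual HUD code may well be structured differently:

```cpp
// Hypothetical names. Only the cap and the 0-based gaiji offset come from
// the text; the rest is one possible reading of the theory.
constexpr int BOSS_ATTACK_LEVEL_MAX = 16;

// WARNING!! popup: shows the variable as is, which would be the level of the
// attack that was just fired.
constexpr int popup_level(int level_var)
{
    return level_var;
}

// Gauge: shows the level the next attack would have, capped to the maximum
// that the gaiji can depict.
constexpr int gauge_level(int level_var)
{
    return (
        (level_var < BOSS_ATTACK_LEVEL_MAX)
        ? (level_var + 1)
        : BOSS_ATTACK_LEVEL_MAX
    );
}

// Gaiji are addressed by adding an offset from 0 to 15 to the ID of the
// "1" glyph.
constexpr int gaiji_offset(int displayed_level)
{
    return (displayed_level - 1);
}
```

Under this reading, a variable value of 0 would never appear in the popup, but would still show up as a perfectly valid 1 next to the gauge, and both displays would meet at 16 once the cap is reached.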
So, what does this tell us about the demo difficulty? Now that we can search the code for these variables, we quickly come across the dedicated demo-specific branch that initializes these levels to the observable fixed values, along with two other variables I haven't researched so far. This confirms that demos run at a custom difficulty, as the two other variables receive slightly different values in regular gameplay.
However, it's still a good idea to check the code for any other potential effects of the difficulty setting. Maybe they're just hard to spot in demos? Doesn't difficulty typically affect a whole lot of other things in Touhou game code? Well, not in TH03 – MAIN.EXE only ever looks at the configured difficulty in three places, and all of them are part of the code that initializes a new round.
This reveals the true nature of difficulty in TH03: It's exclusively specified in terms of these five variables, and the Easy/Normal/Hard/Lunatic/"Demo" settings can be thought of as simply being presets for them. Story Mode adds 📝 the AI's number of safety frames to the list of variables and factors the current stage number into their values, but the concept stays the same. In this regard, TH03's design is unusually clean, making it perhaps the only Touhou game with not even a single "if difficulty is this, then do that" branch in script code. It's certainly the only PC-98 Touhou game with this property.
But it gets even better if we consider what this means for netplay. We now know that the configured difficulty is part of the match-defining parameters that must be synced between both players, just like the selected characters and the RNG seed. But why stop there? How about letting players not just choose between the presets, but allowing them to customize each of the five variables independently? Boom, we've just skyrocketed the replay value of netplay. 🚀 It's discoveries like these that justify my decision to start the road toward netplay by decompiling all of OP.EXE: In-engine menus are the cleanest and most friendly way of allowing players to configure all these variables, and now they're also the easiest and most natural choice from a technical point of view.
But wait, there's still some time left in that second push! The remaining fraction of the OP.EXE reverse-engineering contribution had repeating decimals, so let's do some quick TH02 PI work to remove the matching instance of repeating decimals from the backlog. This was very much a continuation of 📝 last year's light PI work; while the regular TH02 decompilation progress has focused and will continue to focus on the big features, it still left plenty of low-hanging PI fruit in boss code. Back then, we left with the positions of the Five Magic Stones, where ZUN's choice of storing them in arrays was almost revolutionary compared to what we saw in TH01. The same now applies to the state flags and total damage amount of not just the boss of Stage 3, but also the two independently damageable entities of the stage's midboss. In total, all of the newly identified arrays made up 3.36% of all memory references in TH02, and we're not even done with Stage 3.
Actually, you know what, let's round out that second push with even more low-hanging PI fruit and ensure 📝 technical position independence for TH03's MAINL.EXE. This was very helpful considering that I'm going to build netplay into the anniversary branch, whose debloated foundation 📝 aims to merge every game into as few executables as possible. Due to TH03's overall lower level of bloat and the dedicated SPRITE16-based rendering code in MAIN.EXE, it might not make as much sense to merge all three of TH03's .EXE binaries as it did for TH01, and MAIN.EXE's lack of position independence currently prevents this anyway. However, merging just OP.EXE and MAINL.EXE makes tremendous sense not just for TH03, but for the other three games as well. These binaries have a much smaller ratio of ZUN code to library code, and use the same file formats and subsystems.
But that's not even the best part! Once we've factored out all the invisible inconsistencies between the games, we get to share all of this code across all of the four games. Hence, technical position independence for TH03's MAINL.EXE also was the final obstacle in the way of a single consistent and ultimately portable version of all of this code. 🙌
So, how do we go from here to 📝 the short-term half-PC-98/half-modern netplay option that Ember2528 is now funding? Most of the netcode will be unrelated to TH03 in particular, but we'd obviously still want to reverse-engineer more of MAIN.EXE to ensure a high-quality integration. So how about alternating the upcoming deliveries between pure RE work and any new or modded code? Next up, therefore, I'll go for the latter and debloat OP.EXE so that I can later add the netplay features without pulling my hair out. At that point, it also makes sense to take the first steps into portability; I've got some initial ideas I'm excited to implement, and Congrio's tiny bit of funding just begs to be removed from the backlog.
(And I'm definitely going to defuse all the tearing landmines because my goodness are they infuriating when slowing down the game or working with screen recordings.)
P0304
TH02 RE (Stage / (mid)boss variables) + Decompilation (Bullets, part 1/2)
P0305
TH02 decompilation (Bullets, part 2/2 + Sparks, part 1/2)
P0306
TH02 decompilation (Player, part 1/2: Update/render functions + Miss animation) + Random TH04/TH05 finalization
💰 Funded by:
Yanga, iruleatgames, nrook, [Anonymous]
🏷️ Tags:
Sometimes, the gameplay community will come up with the most outlandish theories before they even begin to consider the idea that certain safespots might not be intentional and only work by accident to begin with. Want more details? Read on…
So, TH02's bullet system! At a high level, it marks an interesting transitional point: It's still very much based on TH01's design with its predefined static or aimed spreads, but also introduces a few features that would later return in TH04 and TH05. By transplanting the TH01 system into a double-buffered environment, ZUN eliminated the 📝 worst 📝 unblitting-related parts that plagued TH01, ending up with the simplest and cleanest implementation of bullets I've seen so far. That's not to say it's good-code – far from it – but it also hasn't reached the messy levels that TH04 and especially TH05 would bring later. Of course, there's still TH03's system left to be done until I can say for sure, but TH02's is a pretty strong contender.
The more detailed overview of the system:
TH02 introduces the distinction between the white 8×8 pellets and the 16×16 sprite bullets that TH04 and TH05 would later expand upon.
The game has a single cap of 150 that is shared among both 8×8 and 16×16 bullets, unlike TH04 and TH05 where the cap is split for optimization reasons.
In 封魔録.TXT, ZUN claims that TH02 could even compete with DoDonPachi in terms of bullet amounts:
怒首領蜂もびっくりな判定の小ささ、弾の量。 ("Hitboxes so small and bullets so numerous that even DoDonPachi would be startled.")
Can it really, though? DoDonPachi spawns decidedly more bullets than TH02 throughout all of the game, and this pattern definitely exceeds 150 bullets. Hence, we can immediately debunk this claim as marketing hyperbole rather than a factual statement about the game. It would be nice to have a specific bullet cap number for DoDonPachi as well, but I can't find a decompilation project or annotated disassembly. Nor for any other CAVE game either, for that matter… 👀
TH01's decay and delay cloud effects were removed for TH02. Slightly unfortunate as it leaves bullets completely without any sprite effect, but hey, less code surface to mess up!
All bullets lose 0.625 pixels of per-frame speed on Easy and gain an extra 0.75 pixels of per-frame speed on Lunatic. Each bullet is clamped to a minimum speed of at least 1 pixel per frame; on Easy, the game also filters every second bullet that would have been slower. This mechanism mainly kicks in with the blob enemies at minimum rank during Stage 4.
TH02 sticks with the fixed 2-, 3-, 4-, and 5-way spreads that TH01 introduced, but adds a third delta angle variant on top of TH01's two "narrow" and "wide" ones. 2-spreads even get a fourth "ultrawide" angle, which Evil Eye Σ uses in the pellet corridor pattern during its last phase.
TH02 also adds predefined 4-, 8-, 16-, and 32-ring groups, all of which are used by bosses.
The game does not yet offer predefined stack groups, but has an auto-stacking system that automatically turns every spawned group into a potential 2-stack on Hard and Lunatic. This system forms the main way in which these difficulties differ from the easier ones, and is exactly why going from Normal to Hard roughly doubles the number of bullets fired. On Hard, the second bullet in each stack moves at half the speed of the primary bullet, while Lunatic adds another 0.5 pixels per frame onto that halved speed.
The game also has a function to apply a further multiplier on top of the difficulty-specific stack count, but only uses it to temporarily disable stacking during three patterns, one of them used by the Five Magic Stones and two of them used by Mima.
Just like all other games, TH02 offers a variety of special bullet motion types. For some reason, ZUN limited these to single 16×16 bullets in TH02; they are not supported for either 8×8 pellets or any of the multi-pellet groups. There is no technical reason for this, so ZUN likely did this as a deliberate game design choice. The upside is that you as a player can be certain that every 8×8 pellet moves in a straight line, which may or may not help reading patterns.
Chase bullets adjust their X/Y velocity by a configurable amount on every frame relative to the player's location. These are exclusively used by the 呪 bullets fired by the Stage 2 midboss.
Homing bullets work in a very similar way, re-aiming at the player more properly for a customizable number of frames after a bullet was spawned. These are completely unused.
Decelerating bullets reduce their speed to 0 by halving their velocity every 8 frames, and then turn and repeat this process a fixed number of times. In TH02, this movement type is only used in a symmetric green-ball pattern used by the eastern and western Magic Stones, but it would become really popular later on, showing up in 6 of TH04's midboss and/or boss patterns and 9 of TH05's.
Gravity bullets add a customizable acceleration factor to their Y position on every frame. Another movement type exclusive to a single green-ball pattern by the northern Magic Stone, and interestingly special-cased to bypass any difficulty- or rank-based speed tuning.
Drift bullets either add a remote-controlled angle and speed delta value to a bullet's angle and speed on every frame, or use that remote-controlled angle to chase toward the player using the same algorithm as the 呪 bullets. These two types are criminally underutilized and could have created some widely inventive patterns that you wouldn't have expected out of the first PC-98 Touhou shmup. Instead, they're only used for two of Marisa's rotating star patterns.
And finally, of course, we have bullets that bounce and flip their direction near the edge of the playfield. In this game, the bounce edges actually lie 8 pixels inside the playfield:
The velocity flip only happens on the frame in which a bullet enters the red bounce margin zone. So, faster bullets might still travel a good deal toward the actual edge of the playfield before getting flipped.
This type is not only used by Meira's and Evil Eye Σ's red and purple billiard ball bullets, but also by some star bullet patterns during the Mima fight.
Pellet rendering is batched! For the first time, ZUN preserves the GRCG state for successively blitted pellets, avoiding the extra >168 cycles per pellet that master.lib's grcg_setcolor() and grcg_off() would cost on a 486. The caveat, however, lies in the words successively blitted. Without an architectural split between pellets and sprite bullets, the rendering code ends up looking like this:
While this definitely is suboptimal once you start mixing the two size types, it's not too bad in context. The actual bullet scripts in TH02 mostly stick to one of the two sprite types, and once the script switches from one to the other, the old and new bullets will occupy mostly contiguous areas of the bullet array anyway. The game doesn't actually mix 8×8 and 16×16 bullets within the same pattern until literally the last pattern of Mima's second form.
The four other ZUN quirks in the system are all related to clipping and aim point calculations. ZUN tries very hard to use constants that are supposed to work for both 8×8 and 16×16 bullets, but they never perfectly fit either of the two.
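The difficulty-based speed adjustments and the auto-stacking speeds from the overview above translate into Q12.4 numbers as follows. This is just a sketch with hypothetical names: It assumes that the minimum speed clamp is applied after the adjustment, leaves out rank as well as the Easy-specific filtering of every second slow bullet, and does not clamp the speed of the stacked bullet, none of which is confirmed by the text:

```cpp
#include <cstdint>

// Q12.4 fixed-point: 16 units per pixel.
using subpixel_t = int16_t;
constexpr subpixel_t SUBPIXEL_FACTOR = 16;

enum class Difficulty { Easy, Normal, Hard, Lunatic };

constexpr subpixel_t to_sp(double pixels)
{
    return static_cast<subpixel_t>(pixels * SUBPIXEL_FACTOR);
}

// Per-frame speed of a bullet after difficulty tuning.
// Assumption: the 1-pixel minimum is enforced after the adjustment.
constexpr subpixel_t tune_speed(subpixel_t speed, Difficulty difficulty)
{
    if(difficulty == Difficulty::Easy) {
        speed -= to_sp(0.625);
    } else if(difficulty == Difficulty::Lunatic) {
        speed += to_sp(0.75);
    }
    return ((speed < to_sp(1.0)) ? to_sp(1.0) : speed);
}

// Speed of the second bullet in an auto-generated 2-stack. Only meaningful
// on Hard and Lunatic, as the easier difficulties don't stack.
constexpr subpixel_t stacked_speed(subpixel_t primary, Difficulty difficulty)
{
    subpixel_t ret = (primary / 2);
    if(difficulty == Difficulty::Lunatic) {
        ret += to_sp(0.5);
    }
    return ret;
}
```

For a bullet with a base speed of 2.5 pixels per frame (40 in Q12.4), this would yield 30 on Easy and 52 on Lunatic, with the stacked bullet moving at 20 on Hard and 28 on Lunatic.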
To find out where all these bullet types are used, I of course had to label all the individual pattern functions and assign them to their (mid)boss owners. As a side effect, we now also know the preferred boss decompilation order for this game!
Marisa
Mima
Evil Eye Σ
Meira
Rika
5 Magic Stones
Quite a satisfying order, if I may say so myself – burning off the big fireworks right in the beginning, getting slightly more unexciting later on, but then ending on arguably the best Touhou character ever conceived.
Each of these decompilations will be preceded by the stage's respective midboss. This includes the Extra Stage – you might not think that this stage has a midboss, but it technically does, in the form of this combination of patterns:
Lasting exactly these 420 frames.
There's nothing in TH02's code that mandates midbosses to have sprite-like entities or even something like an HP bar. Instead, the code-level definition of a midboss is all about these properties:
It assigns control functions to the same function pointers that the other stages use for their midbosses.
These functions are activated at a fixed, specific point throughout the stage.
Regular stage enemy spawns are deactivated until these control functions signal completion.
If a pattern manipulates stage tiles, it can only be part of a boss or midboss with custom C code, as this is not supported for regular stage enemy scripts.
Stage 5, on the other hand, indeed doesn't have anything that can be interpreted as a midboss.
Finally, and probably most importantly, hitboxes! The raw decompilation of TH02's bullet collision detection code looks like this:
However, if you aren't deeply familiar with the sizes of all involved sprites, these top-left positions slightly obscure the actual position of the hitbox. That top-left point might also not be where you think it is:
It's the red point.
So let's transform these checks to a more useful comparison of the respective center points against each other, and also fix that inconsistency of the right coordinates being compared with < instead of <= like the other values:
Now also revealing the horizontal asymmetry that ZUN's code was sneakily hiding.
TH02 has only 5 different bullet shapes and no directional or vector bullets, so we can exactly visualize all of them:
📝 As 📝 usual, a bullet sprite has to be fully surrounded by the blue box for a hit to be registered.
Yup. Quite asymmetric indeed, and probably surprising no one.
While experimenting with the various hardcoded group types, I stumbled over a quite surprising quirk that you might have already noticed in the spread showcase video further above. For some reason, none of these spreads are perfectly symmetric, what the…?
By the time the bullets have reached the bottom of the playfield, the inaccuracy has compounded so much that the right lane ends up 6 pixels closer to the player's center position than the left lane. Depending on which of the two lanes actually gets the correct angle, this either means that the left lane is moving too far (2️⃣) or that the right lane is not moving far enough (3️⃣).
This is very weird because the angles that go into the velocity calculations are demonstrably correct. You'd therefore get this asymmetry for not only the hardcoded spreads, but also for code that does its own angle calculations and spawns each bullet manually. It's not something that can arise from the other known issue of 📝 Q12.4 quantization either, because that would affect all parts of a pattern equally.
Instead, the inaccuracy originates in the conversion from the polar coordinates of angles and speeds into the per-frame X/Y pixel velocities that the game uses for actual movement. The integer math algorithm that ZUN uses here is pretty much the single most fundamental piece of code shared by all 5 games:
// Using 📝 typical 8-bit angles.
int16_t polar_x(int16_t center, int16_t radius, uint8_t angle)
{
// Ensure that the multiplication below doesn't overflow
int32_t radius32 = radius;
// Get the cosine value from master.lib's lookup table, which scales the
// real-number range of [-1; +1] to the integer range of [-256; +256].
int16_t cosine = CosTable8[angle];
// The multiplication will include master.lib's 256× scaling factor, so
// divide the result to bring it within the intended radius.
return (((radius32 * cosine) >> 8) + center);
}
This exact algorithm is even recommended in the master.lib manual.
The pattern above uses TH02's medium delta angle for 2-spreads and moves at a Q12.4 subpixel speed of 2.5, which corresponds to a radius of 40 in the context of polar coordinate calculation. Let's step through it:
| Angle | Cosine | Multiplied | In hex | Shift result | In decimal | In Q12.4 |
|---|---|---|---|---|---|---|
| (0x40 - 6) | 38 | 1520 | 000005F0 | 00000005 | 5 | 0.3125 |
| (0x40 + 6) | -38 | -1520 | FFFFFA10 | FFFFFFFA | -6 | -0.3750 |
Whoa, talk about getting a basic lesson about how computers work! PC-98 Touhou has just taught us that signedness-preserving arithmetic bitshifts are not equivalent to the apparently corresponding division by a power of two, because the typical two's complement representation of negative numbers causes the result to effectively get rounded away from zero rather than toward zero like the corresponding positive value. In our example, this means that the right lane is correct and moves at the angle we passed in, while the left lane moves 1/16 pixels per frame further to the left than intended. Since we're talking about the most basic piece of trigonometry code here, this inaccuracy also applies to every other entity in PC-98 Touhou that moves left relative to its origin point – and/or up, because Y coordinates are calculated analogously. Imagine that… it's been 10 years since I decompiled the first variant of this function, and I'm only now noticing how fundamentally broken it is.
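The difference is easy to reproduce in isolation with the two multiplication results from the example. The function names below are made up for illustration, and the sketch assumes that right-shifting a negative value is an arithmetic shift, which is guaranteed since C++20 and is what practically every earlier compiler does as well:

```cpp
#include <cstdint>

// The two candidate ways of removing master.lib's 256× scaling factor.
constexpr int32_t scale_by_shift(int32_t v)
{
    return (v >> 8); // rounds toward negative infinity
}

constexpr int32_t scale_by_division(int32_t v)
{
    return (v / 256); // rounds toward zero
}
```

Both functions turn 1520 into 5, but only the division maps -1520 to the mirrored -5. The shift returns -6, which is where the extra 1/16 pixel per frame for the left lane comes from.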
It's understandable why master.lib's manual recommends bitshifts instead of the more correct division here. On a 486, a single 32-bit IDIV takes a whopping >33 cycles, and it would have been even slower on the 286 systems that master.lib is geared toward. But there's no need to go that far: By simply rounding up negative numbers, we can emulate the rounding behavior of regular division while still using a bitshift:
int16_t polar_x(int16_t center, int16_t radius, uint8_t angle)
{
int32_t ret = (static_cast<int32_t>(radius) * CosTable8[angle]);
+ if(ret < 0) {
+ // Round the multiplication result so that the shift below will yield a number
+ // that's 1 closer to 0, thus rounding toward zero rather than away from zero as
+ // bitshifts with negative numbers would usually do. This ensures that we return
+ // the same absolute value after the bitshift that we would return if [ret] were
+ // positive, thus repairing certain broken symmetries in PC-98 Touhou.
+ ret += 255;
+ }
return ((ret >> 8) + center);
}
You could also do this in a branchless way, which is coincidentally very close to what current Clang would generate if you just wrote a regular division by 256. This branchless way does seem slightly slower on a 486 though, as it adds a constant >8 cycles worth of instructions. The branching implementation only adds >4 cycles for positive numbers and >3 for negative ones.
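For reference, one common branchless formulation looks like the sketch below. It is not necessarily identical to what any particular compiler emits, the function name is made up, and it again assumes an arithmetic right shift for negative values:

```cpp
#include <cstdint>

// Branchless variant of the fix: shifts by 8 bits while rounding toward zero.
constexpr int32_t shift_toward_zero(int32_t v)
{
    // (v >> 31) is all ones for negative values and 0 otherwise. Masking it
    // with 255 therefore adds the rounding bias only to negative values.
    return ((v + ((v >> 31) & 255)) >> 8);
}
```

This returns the same results as a division by 256 for the values from the example, as well as for the interesting edge cases right around multiples of 256.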
But that would be deep quirk-fixing territory. uth05win just uses floating-point math for this transformation, exchanging master.lib's 8-bit lookup tables for the C library's regular sin() and cos() functions, but bypassing the issue like this also forms the single biggest source of porting inaccuracy. Can't really win here… 🤷
Now it will be interesting to see whether ZUN worked around this inaccuracy in certain places by using slightly lower left- or up-pointing angles…
Alright, but aren't we still missing the single biggest quirk about bullets in TH02? What's with Reimu's hitbox misaligning when dying? I can't release a blog post about TH02's bullet system without solving the single most infamous bullet-related mystery that this game has to offer. So, time to start a third push for looking at all the player movement, rendering, and death sequence code…
If you remember the code above, there is no way that a hitbox defined using hardcoded numbers can ever shift in response to anything. Any so-called hitbox misalignment would therefore be a player position misalignment, which sounds even harder to believe. And sure enough, after decompiling all of it, there's nothing of that sort to be found in the player code either.
If we take player position misalignment literally, we're only left with one other place where it could possibly somehow come from: the strange vertical shaking you can observe right in the first few frames of most stages. So let's visualize the hitbox and… nope, the shaking is purely a scrolling bug, nothing about it changes the internal player position used for collision detection.
So, uh, what are people even talking about? It doesn't help that no one cites any source for this claim and just presents it as a natural and seemingly self-evident fact, as if it was the most obvious and most easily verified property about the game.
Thankfully though, there have been two relatively recent videos about the issue, but both of them only showcase the supposed hitbox shifting in relation to a specific safespot at the end of the Extra Stage midboss. So is that what's been going on here? The community taking the game's behavior in just a single instance of collision detection within a single stage, and extending it to a general claim about the game as a whole?
But indeed, the described behavior cleanly reproduces every time. Enter the spot with 2 remaining lives and you survive, but enter with 1 remaining life and you die:
Whatever this is about, it's not due to a difference in hitboxes because Reimu's position demonstrably stays identical. But if we switch between these two videos, we can easily spot that it's the patterns that are different! With 1 life left, the pattern moves at an ever so slightly slower speed, which apparently adds up to a life-or-death difference at that specific spot.
And that's what the supposed hitbox shifting ultimately boils down to: The natural impact of rank on patterns, adjusting bullet speed with a factor of ((playperf + 48) / 48) times 1/16 pixels. And nothing else.
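In integer terms, such a factor could be applied as in the following sketch. The function name is hypothetical, and both the multiply-before-divide order and the truncating division are assumptions, since the exact operation order in the game is not spelled out here:

```cpp
// Hypothetical sketch of rank-based speed tuning on a Q12.4 speed value.
// Assumes multiplication before division, and C++'s truncating division.
constexpr int rank_tuned_speed(int speed_q12_4, int playperf)
{
    return ((speed_q12_4 * (playperf + 48)) / 48);
}
```

A bullet with a base speed of 3 pixels per frame (48 in Q12.4) would then move at 50/16 pixels per frame at a rank of +2, 48/16 at a rank of 0, and 46/16 at a rank of -2. Differences this small are enough to decide whether a bullet lands inside the hitbox on a given frame or skips past it.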
Let's visualize the hitbox and also track one of the bullets:
If we look at the respective frames in the playperf = +2 case, we see that the bullet misses the hitbox by either one or two pixels on three successive frames:
That's not a safespot, that's Reimu barely surviving only thanks to rounding.
So, for once, this is not a quirk, and doesn't even qualify as a "funny ZUN code moment" if you ask me. This is the game working exactly as designed, and it's the players who are instead making wild assumptions about safespots that only hold when the rank system plugs very specific numbers into the game's fixed-point math.
If anything, you could make the stronger case that this safespot should not work under any circumstance. If the game tested the whole parallelogram covered by a bullet's trajectory between two successive frames instead of just looking at a bullet's current position, it would consistently detect this collision regardless of rank. But even the later games don't go to these lengths.
By testing with parallelograms, the game would not only look at the distinct bullet positions in green, but also detect that the bullet traveled through the position highlighted in cyan, which does lie fully within the hitbox.
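For illustration, such a swept test could look roughly like this. To be clear, this is not what any of the games do, and all names are made up. The sketch simplifies the bullet to a single reference point and assumes that the box has already been shrunk by the sprite size, so that "sprite fully inside the hitbox" turns into "point inside the box":

```cpp
struct Box {
    double left, top, right, bottom;
};

// Narrows [t_min, t_max] down to the fraction of the frame during which one
// coordinate lies within [lo, hi]. Returns false once the range is empty.
constexpr bool clip_axis(
    double p0, double p1, double lo, double hi, double& t_min, double& t_max
)
{
    const double delta = (p1 - p0);
    if(delta == 0.0) {
        // No movement on this axis: either always or never inside.
        return ((p0 >= lo) && (p0 <= hi));
    }
    double t_enter = ((lo - p0) / delta);
    double t_exit = ((hi - p0) / delta);
    if(t_enter > t_exit) {
        const double tmp = t_enter;
        t_enter = t_exit;
        t_exit = tmp;
    }
    if(t_enter > t_min) {
        t_min = t_enter;
    }
    if(t_exit < t_max) {
        t_max = t_exit;
    }
    return (t_min <= t_max);
}

// Returns true if a point moving in a straight line from (x0, y0) to
// (x1, y1) within a single frame is inside [box] at any time.
constexpr bool swept_point_hits(
    double x0, double y0, double x1, double y1, const Box& box
)
{
    double t_min = 0.0;
    double t_max = 1.0;
    return (
        clip_axis(x0, x1, box.left, box.right, t_min, t_max) &&
        clip_axis(y0, y1, box.top, box.bottom, t_min, t_max)
    );
}
```

A point that moves from (12, 5) to (12, 20) skips over a box spanning (10, 10) to (14, 14) when only looking at its two discrete positions, but `swept_point_hits()` would still report the collision.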
Amusingly, if you die twice before this pattern and reach a rank of -2, bullet speed drops enough for the safespot to work again:
It's even the same bullet that fails to hit Reimu, although coming in 5 frames later.
If you're now sad because you liked the idea of ZUN deliberately putting hitbox-shifting code into the game, you don't have to be! You might have already noticed it in the 1-life videos above, but TH02 does have one funny but inconsequential instance of death-induced player position shifting. In the 19 frames between the end of the animation and Reimu respawning at the bottom of the playfield, ZUN just adds 4 pixels to Reimu's Y position. You don't really notice it because the game doesn't render Reimu's sprite during these frames, but this modified position still partakes in collision detection, causing bullets to be removed accordingly.
Hilariously, ZUN was well aware that this shift could move the player's Y position beyond the bottom of the playfield, and thus cause sparks to be spawned at Y coordinates larger than 400. So he just… wrapped these spark spawn coordinates back into the visible range of VRAM, thus moving them to the top of the playfield…
The off-center spawn point of these sparks was the only actual bug in this delivery, by the way.
To round out the third push, I took some of the Anything budget towards finalizing random bits of previously RE'd TH04 and TH05 code that wouldn't add anything more to this blog post. These posts aren't really meant to be a reference – that's the job of the code, the actual primary source of the facts discussed here – but people have still started to use them as such. So it makes sense to try focusing them a bit more in the future, and not bundle all too many topics into a single one.
This finalization work was mostly centered on some tile rendering and .STD file loading boilerplate, but it also covered some of TH05's unfortunately undecompilable HUD number display code. The irony is that it's actually quite good ASM code that makes smart register choices and uses secondary side effects of certain instructions in a way that's clever but not overly incomprehensible. Too bad that these optimizations have no right to exist in logic code that is called way less than once per frame…
Next up: An unexpected quick return to the Shuusou Gyoku Linux port, as Arch Linux is bullying us onto SDL 3 faster than I would have liked.
Remember when ReC98 was about researching the PC-98 Touhou games? After over half a year, we're finally back with some actual RE and decompilation work. The 📝 build system improvement break was definitely worth it though, the new system is a pure joy to use and injected some newfound excitement into day-to-day development.
And what game would be better suited for this occasion than TH03, which currently has the highest number of individual backers interested in it. Funding the full decompilation of TH03's OP.EXE is the clearest signal you can send me that 📝 you want your future TH03 netplay to be as seamlessly integrated and user-friendly as possible. We're just two menu screens away from reaching that goal anyway, and the character selection screen fits nicely into a single push.
The code of a menu typically starts with loading all its graphics, and TH03's character selection already stands out in that regard due to the sheer amount of image data it involves. Each of the game's 9 selectable characters comes with
a 192×192-pixel portrait (??SL.CD2),
a 32×44-pixel pictogram describing her Extra Attack (in SLEX.CD2), and
a 128×16-pixel image of her name (in CHNAME.BFT). While this image just consists of regular boldfaced versions of font ROM glyphs that the game could just render procedurally, pre-rendering these names and keeping them around in memory does make sense for performance reasons, as we're soon going to see. What doesn't make sense, though, is the fact that this is a 16-color BFNT image instead of a monochrome one, wasting both memory and rendering time.
Luckily, ZUN was sane enough to draw each character's stats programmatically. If you've ever looked through this game's data, you might have wondered where the game stores the sprite for an individual stat star. There's SLWIN.CDG, but that file just contains a full stat window with five stars in all three rows. And sure enough, ZUN renders each character's stats not by blitting sprites, but by painting (5 - value) yellow rectangles over the existing stars in that image.
The only stat-related image you will find as part of the game files. The number of stat stars per character is hardcoded and not based on any other internal constant we know about.
Together with the EXTRA🎔 window and the question mark portrait for Story Mode, all of this sums up to 255,216 bytes of image data across 14 files. You could remove the unnecessary alpha plane from SLEX.CD2 (-1,584 bytes) or store CHNAME.BFT in a 1-bit format (-6,912 bytes), but using 3.3% less memory barely makes a difference in the grand scheme of things.
From the code, we can assume that loading such an amount of data all at once would have led to a noticeable pause on the game's target PC-98 models. The obvious alternative would be to just start out with the initially visible images and lazy-load the data for other characters as the cursors move through the menu, but the resulting mini-latencies would have been bound to cause minor frame drops as well. Instead, ZUN opted for a rather creative solution: By segmenting the loading process into four parts and moving three of these parts ahead into the main menu, we instead get four smaller latencies in places where they don't stick out as much, if at all:
The loading process starts at the logo animation, with Ellen's, Kotohime's, and Kana's portraits getting loaded after the 東方夢時空 letters finished sliding in. Why ZUN chose to start with characters #3, #4, and #5 is anyone's guess.
Reimu's, Mima's, and Marisa's portraits as well as all 9 EXTRA🎔 attack pictograms are loaded at the end of the flash animation once the full title image is shown on screen and before the game is waiting for the player to press a key.
The stat and EXTRA🎔 windows are loaded at the end of the main menu's slide-in animation… together with the question mark portrait for Story Mode, even though the player might not actually want to play Story Mode.
Finally, the game loads Rikako's, Chiyuri's, and Yumemi's portraits after it cleared VRAM upon entering the Select screen, regardless of whether the latter two are even unlocked.
I don't like how ZUN implemented this split by using three separately named standalone functions with their own copy-pasted character loop, and the load calls for specific files could have also been arranged in a more optimal order. But otherwise, this has all the ingredients of good-code. As usual, though, ZUN then definitively ruins it all by counteracting the intended latency hiding with… deliberately added latency frames:
The entire initialization process of the character selection screen, including Step #4 of image loading, is enforced to take at least 30 frames, with the count starting before the switch to the Selection theme. Presumably, this is meant to give the player enough time to release the Z key that entered this menu, because holding it would immediately select Reimu (in Story mode) or the previously selected 1P character (in VS modes) on the very first frame. But this is a workaround at best – and a completely unnecessary one at that, given that regular navigation in this menu already needs to lock keys until they're released. In the end, you can still auto-select the default choice by just not releasing the Z key.
And if that wasn't enough, the 1P vs. 2P variant of the menu adds 16 more frames of startup delay on top.
Sure, maybe loading the fourth part's 69,120 bytes from a highly fragmented hard drive might have even taken longer than 30 frames on a period-correct PC-98, but the point still stands that these delays don't solve the problem they are supposed to solve.
But the unquestionable main attraction of this menu is its fancy background animation. Mathematically, it consists of Lissajous curves with a twist: Instead of calculating each point as
x = sin((fx·t)+ẟx), y = sin((fy·t)+ẟy), TH03 effectively calculates its points as
x = cos(fx·((t+ẟx) % 0x100)), y = sin(fy·((t+ẟy) % 0x100)), due to t and ẟ being 📝 8-bit angles. Since the result of the addition remains 8-bit as well, it can and will regularly overflow before the frequency scaling factors fx and fy are applied, thus leading to sudden jumps between both ends of the 8-bit value range. The combination of this overflow and the gradual changes to fx and fy creates all these interesting splits along the 360° of the curve:
At a high level, there really is just one big curve and one small curve, plus an array of trailing curves that approximate motion blur by subtracting from ẟx and ẟy.
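To make the effect of that 8-bit overflow a bit more tangible, here's a minimal Python sketch of the point calculation. It's an approximation under stated assumptions rather than the game's code: it uses floating-point `sin`/`cos` instead of a lookup table and plain floats instead of Q8.8 factors, and the names `curve_points` and `max_gap`, the `radius` parameter, and all frequency values are made up for illustration.

```python
import math

def curve_points(fx, fy, dx, dy, radius=1.0):
    """Returns the 256 points of one curve.

    fx, fy: frequency factors (hypothetical floats; the game uses Q8.8).
    dx, dy: 8-bit angle offsets (ẟx and ẟy in the text).
    """
    points = []
    for t in range(256):
        # The addition stays 8-bit and wraps *before* the frequency scaling.
        ax = (t + dx) & 0xFF
        ay = (t + dy) & 0xFF
        # Convert the scaled 8-bit angle (256 units = 360°) to radians.
        x = radius * math.cos(fx * ax * math.tau / 256)
        y = radius * math.sin(fy * ay * math.tau / 256)
        points.append((x, y))
    return points

def max_gap(points):
    """Largest distance between neighboring points, including last to first."""
    n = len(points)
    return max(math.dist(points[i], points[(i + 1) % n]) for i in range(n))

# Integer frequencies hide the wrap-around because sin/cos are periodic…
smooth = max_gap(curve_points(1, 2, 0x40, 0))
# …while a fractional frequency turns it into a visible jump.
split = max_gap(curve_points(1.5, 1.0, 0x40, 0))
```

With these made-up values, `smooth` stays at the size of a single angle step, while `split` jumps across almost the entire width of the curve at the point where `t + dx` wraps from 0xFF to 0x00.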
In a rather unusual display of mathematical purity, ZUN fully re-calculates all variables and every point on every frame from just the single byte of state that indicates the current time within the animation's 128-frame cycle. However, that beauty is quickly tarnished by the sheer cost of fully recalculating these curves every frame:
In total, the effect calculates, clips, and plots 16 curves: 2 main ones, with up to 7×2 = 14 darker trailing curves.
Each of these curves is made up of the 256 maximum possible points you can get with 8-bit angles, giving us 4,096 points in total.
Each of these points takes at least 333 cycles on a 486 if it passes all clipping checks, not including VRAM latencies or the performance impact of the 📝 GRCG's RMW mode.
Due to the larger curve's diameter of 440 pixels, a few of the points at its edges are needlessly calculated only to then be discarded by the clipping checks as they don't fit within the 400 VRAM rows. Still, >1.3 million cycles for a single frame remains a reasonable ballpark assumption.
This is decidedly more than the 1.17 million cycles we have between each VSync on the game's target 66 MHz CPUs. So it's not surprising that this effect is not rendered at 56.4 FPS, but instead drops the frame rate of the entire menu by targeting a hardcoded 1 frame per 3 VSync interrupts, or 18.8 FPS. Accordingly, I reduced the frame rate of the video above to represent the actual animation cycle as cleanly as possible.
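For reference, the budget math from the list above can be replayed in a few lines of Python. This is just the arithmetic with the numbers quoted in this post; it ignores clipping, VRAM latencies, and everything else the menu does in a frame, so treat the result as a rough estimate.

```python
CURVES = 2 + 7 * 2        # 2 main curves + up to 14 trailing ones
POINTS_PER_CURVE = 256    # one point per 8-bit angle
CYCLES_PER_POINT = 333    # lower bound on a 486, as quoted above
CPU_HZ = 66_000_000       # target CPU clock
VSYNC_HZ = 56.4           # PC-98 refresh rate used throughout this post

points = CURVES * POINTS_PER_CURVE
cycles_per_frame = points * CYCLES_PER_POINT   # without any clipped points
cycles_per_vsync = CPU_HZ / VSYNC_HZ
vsyncs_needed = cycles_per_frame / cycles_per_vsync
menu_fps = VSYNC_HZ / 3                        # 1 frame per 3 VSync interrupts

print(points, cycles_per_frame, round(cycles_per_vsync), round(vsyncs_needed, 2))
```

Even this lower bound needs more than one VSync interval's worth of cycles for the curves alone, which is consistent with the hardcoded 3-VSync target.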
Apparently, ZUN also tested the game on the 33 MHz PC-98 model that he targeted with TH01, and realized that 4,096 points were way too much even at 18.8 FPS. So he also added a mechanism that decrements the number of trailing curves if the last frame took ≥5 VSync interrupts, down to a minimum of only a single extra curve. You can see this in action by underclocking the CPU in your Neko Project fork of choice.
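In Python terms, this adaptive mechanism could look roughly like the following sketch. The function and constant names are hypothetical, and I'm assuming that the count refers to trailing curves per main curve and that it is never raised again, since nothing described here suggests otherwise.

```python
MIN_TRAILING_CURVES = 1   # "a single extra curve"
SLOW_FRAME_VSYNCS = 5     # threshold quoted above

def adjust_trailing_curves(trailing_curves, vsyncs_last_frame):
    """Drops one trailing curve after a slow frame, never below the minimum."""
    if (vsyncs_last_frame >= SLOW_FRAME_VSYNCS
            and trailing_curves > MIN_TRAILING_CURVES):
        return trailing_curves - 1
    return trailing_curves

# A system that keeps missing the target gradually loses its trailing curves:
count = 7
for vsyncs in [6, 6, 3, 5, 7, 7, 7, 7, 7]:
    count = adjust_trailing_curves(count, vsyncs)
```

Since the count only ever goes down in this sketch, a single slow frame permanently costs one trailing curve for the rest of the menu session.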
But were any of these measures really necessary? Couldn't ZUN just have allocated a 12 KiB ring buffer to keep the coordinates of previous curves, thus reducing per-frame calculations to just 512 points? Well, he could have, but we now can't use such a buffer to optimize the original animation. The 8-bit main angle offset/animation cycle variable advances by 0x02 every frame, but some of the trailing curves subtract odd numbers from this variable and thus fall between two frames of the main curves.
So let's shelve the idea of high-level algorithmic optimizations. In this particular case though, even micro-optimizations can have massive benefits. The sheer number of points magnifies the performance impact of every suboptimal code generation decision within the inner point loop:
Frequency scaling works by multiplying the 8-bit angles with a fixed-point Q8.8 factor. The result is then scaled back to regular integers via… two divisions by 256 rather than two bitshifts? That's another ≥46 cycles where ≥10 would have sufficed. Edit (2025-08-29): The initial version of this post miscounted the number of required cycles as ≥4, or 2× the cycle count of a single SAR instruction. That number didn't consider that the frequency scaling multiplication occasionally produces negative numbers, which 📝 must be conditionally rounded up when replacing signed divisions with arithmetic bitshifts to still produce the exact original animation. This conditional rounding adds ≥8 cycles in the more common positive case, and ≥6 in the rarer negative case.
The biggest gains, however, would come from inlining the two far calls to the 5-instruction function that calculates one dimension of a polar coordinate, saving another ≥100 cycles.
Multiplied by the number of points, even these low-hanging fruit already save a whopping ≥729,088 cycles per frame on an i486, without writing a single line of ASM! On Pentium CPUs such as the one in the PC-9821Xa7 that ZUN supposedly developed this game on, the savings are slightly smaller because far calls are much faster, but still come in at a hefty ≥466,944 cycles. Thus, this animation easily beats 📝 TH01's sprite blitting and unblitting code, which just barely hit the 6-digit mark of wasted cycles, and snatches the crown of being the single most unoptimized code in all of PC-98 Touhou.
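The conditional rounding mentioned in the edit above is easy to get wrong, so here's a small Python sketch of the equivalence. It models C's truncating signed division by hand, because Python's own `//` operator floors instead; both function names are made up for this example.

```python
def c_div256(x):
    """Signed division by 256 as C does it: truncation toward zero."""
    q = abs(x) // 256
    return -q if x < 0 else q

def shift_div256(x):
    """Same result via an arithmetic shift with conditional rounding."""
    if x < 0:
        x += 255   # bias negative values so that the shift rounds toward zero
    return x >> 8  # Python's >> on ints is an arithmetic (flooring) shift

# A plain shift alone differs for negative values that aren't multiples of 256:
naive = -1 >> 8         # -1
exact = c_div256(-1)    # 0
```

The extra comparison and addition are where the additional cycles in the corrected count come from, but both variants still avoid the much slower division instruction.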
The incredible irony here is that TH03 is the point where ZUN 📝 really📝 started📝 going📝 overboard with useless ASM micro-optimizations, yet he didn't even begin to optimize the one thing that would have actually benefitted from it. Maybe he 📝 once again went for the 📽️ cinematic look 📽️ on purpose?
Unlike TH01's sprites though, all this wasted performance doesn't really matter much in the end. Sure, optimizing the animation would give us more trailing curves on slower PC-98 models, but any attempt to increase the frame rate by interpolating angles would send us straight into fanfiction territory. Due to the 0x02/2.8125° increment per cycle, tripling the frame rate of this animation would require a change to a very awkward log₂(384) ≈ 8.58-bit angle format, complete with a new 384-entry sine/cosine lookup table. And honestly, the effect does look quite impressive even at 18.8 FPS.
There are three more bugs and quirks in this animation that are unrelated to performance:
If you've tried counting the number of trailing dots in the video above, you might have noticed that the very first frame actually renders 8×2 trailing curves instead of 7×2, thus rendering an even higher 4,608 points. What's going on there is that ZUN actually requested 8 trailing curves, but then forgot to reset the VSync counter after the initial 30-frame delay. As a result, the game always thinks that the first frame of the menu took ≥30 VSync interrupts to render, thus causing the decrement mechanism to kick in and deterministically reduce the trailing curve count to 7.
This is a textbook example of my definition of a ZUN bug: The code unmistakably says 8, and we only don't get 8 because ZUN forgot to mutate a piece of global state.
The small trailing curves have a noticeable discontinuity where they suddenly get rotated by ±90° between the last and first frame of the animation cycle.
This quirk comes down to the small curve's ẟy angle offset being calculated as ((c/2)-i), with i being the number of the trailing curve. Halving the main cycle variable effectively restricts this smaller curve to only the first half of the sine oscillation, between [0x00, 0x80[. For the main curve, this is fine as i is always zero. But once the trailing curves leave us with a negative value after the subtraction, the resulting angle suddenly flips over into the second half of the sine oscillation that the regular curve never touches. And if you recall how a sine wave looks, the resulting visual rotation immediately makes sense:
Removing the division would be the most obvious fix, but that would double the speed of the sine oscillation and change the shape of the curve way beyond ZUN's intentions. The second-most obvious fix involves matching the trailing curves to the movement of the main one by restricting the subtraction to the first half of the oscillation, i.e., calculating ẟy as (((c/2)-i) % 0x80) instead. With c increasing by 0x02 on each frame of the animation, this fix would only affect the first 8 frames.
ZUN decided to plot the darker trailing curves on top of the lighter main ones. Maybe it should have been the other way round?
Now with the full 18 curves, a direction change of the smaller trailing curves at the end of the loop that only looks slightly odd, and a reversed and more natural plotting order.
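The discontinuity of the small trailing curves and its proposed fix can be reproduced with two one-liners. This sketch assumes that the cycle variable `c` is an unsigned 8-bit value and that the result is wrapped to 8 bits; the function names are hypothetical. Note that Python's `%` always returns a non-negative result for a positive divisor, whereas C's `%` can return negative values, so a C version of the fix would need a bitwise AND or an equivalent adjustment.

```python
def dy_original(c, i):
    """ẟy of the small curve as described above: ((c/2) - i), wrapped to 8 bits."""
    return ((c // 2) - i) & 0xFF

def dy_fixed(c, i):
    """The proposed fix: keep the result within the first half, [0x00, 0x80[."""
    return ((c // 2) - i) % 0x80

# First frame of the cycle, third trailing curve:
before = dy_original(0x00, 3)   # 0xFD, in the second half of the oscillation
after = dy_fixed(0x00, 3)       # 0x7D, back in the first half

# Number of frames within a cycle on which the 8th trailing curve differs:
affected = sum(
    dy_original(c, 8) != dy_fixed(c, 8) for c in range(0x00, 0x100, 0x02)
)
```

For the 8th trailing curve, the two variants differ on exactly the first 8 frames of the cycle, matching the number given above.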
Now that we fully understand how the curve animation works, there's one more issue left to investigate. Let's actually try holding the Z key to auto-select Reimu on the very first frame of the Story Mode Select screen:
The confirmation flash even happens before the menu's first page flip.
Stepping through the individual frames of the video above reveals quite a bit of tearing, particularly when VRAM is cleared in frame 1 and during the menu's first page flip in frame 49. This might remind you of 📝 the tearing issues in the Music Rooms – and indeed, this tearing is once again the expected result of ZUN landmines in the code, not an emulation bug. In fact, quite the contrary: Scanline-based rendering is a mark of quality in an emulator, as it always requires more coding effort and processing power than not doing it. Everyone's favorite two PC-98 emulators from 20 years ago might look nicer on a per-frame basis, but only because they effectively hide ZUN's frequent confusion around VRAM page flips.
To understand these tearing issues, we need to consider two more code details:
If a frame took longer than 3 VSync interrupts to render, ZUN flips the VRAM pages immediately without waiting for the next VSync interrupt.
The hardware palette fade-out is the last thing done at the end of the per-frame rendering loop, but before busy-waiting for the VSync interrupt.
The combination of 1) and the aforementioned 30-frame delay quirk explains Frame 49. There, the page flip happens within the second frame of the three-frame chunk while the electron beam is drawing row #156. DOSBox-X doesn't try to be cycle-accurate to specific CPUs, but 1 menu frame taking 1.39 real-time frames at 56.4 FPS is roughly in line with the cycle counting we did earlier.
Frame 97 is the much more intriguing one, though. While it's mildly amusing to see the palette actually go brighter for a single frame before it fades out, the interesting aspect here is that 2) practically guarantees its palette changes to happen mid-frame. And since the CRT's electron beam might be anywhere at that point… yup, that's how you'd get more than 16 colors out of the PC-98's 16-color graphics mode. 🎨
Let's exaggerate the brightness difference a bit in case the original difference doesn't come across too clearly on your display:
Probably not too much of a reason for demosceners to get excited; generic PC-98 code that doesn't try to target specific CPUs would still need a way of reliably timing such mid-frame palette changes. Bit 6 (0x40) of I/O port 0xA0 indicates HBlank, and the usual documentation suggests that you could just busy-wait for that bit to flip, but an HBlank interrupt would be much nicer.
This reproduces on both DOSBox-X and Neko Project 21/W, although the latter needs the Screen → Real palettes option enabled to actually emulate a CRT electron beam. Unfortunately, I couldn't confirm it on real hardware because my PC-9821Nw133's screen vinegar'd at the beginning of the year. But just as with the image loading times, TH03's remaining code sort of indicates that mid-frame palette changes were noticeable on real hardware, by means of this little flag I RE'd way back in March 2019. Sure, palette_show() takes >2,850 cycles on a 486 to downconvert master.lib's 8-bit palette to the GDC's 4-bit format and send it over, and that might add up with more than one palette-changing effect per frame. But tearing is a way more likely explanation for deferring all palette updates until after VSync and to the next frame.
And that completes another menu, placing us a very likely 2 pushes away from completing TH03's OP.EXE! Not many of those left now…
To balance out this heavy research into a comparatively small amount of code, I slotted in 2024's Part 2 of my usual bi-annual website improvements. This time, they went toward future-proofing the blog and making it a lot more navigable. You've probably already noticed the changes, but here's the full changelog:
The Progress blog link in the main navigation bar now points to a new list page with just the post headers and each post's table of contents, instead of directly overwhelming your browser with a view of every blog post ever on a single page.
If you've been reading this blog regularly, you've probably been starting to dread clicking this link just as much as I've been. 14 MB of initially loaded content isn't too bad for 136 posts with an increasing amount of media content, but laying out the now 2 MB of HTML sure takes a while, leaving you with a sluggish and unresponsive browser in the meantime. The old one-page view is still available at a dedicated URL in case you want to Ctrl-F over the entire history from time to time, but it's no longer the default.
The new 🔼 and 🔽 buttons now allow quick jumps between blog posts without going through the table of contents or the old one-page view. These work as expected on all views of the blog: On single-post pages, the buttons link to the adjacent single-post pages, whereas they jump up and down within the same page on the list of posts or the tag-filtered and one-page views.
The header section of each post now shows the individual goals of each push that the post documents, providing a sort of title. This is much more useful than wasting space with meaningless commit hashes; just like in the log, links to the commit diffs don't need to be longer than a GitHub icon.
The web feeds that 📝 handlerug implemented two years ago are now prominently displayed in the new blog navigation sub-header. Listing them using <link rel="alternate"> tags in the HTML <head> is usually enough for integrated feed reader extensions to automatically discover their presence, but it can't hurt to draw more attention to them. Especially now that Twitter has been locking out unregistered users for quite some time…
Speaking of microblogging platforms, I've now also followed a good chunk of the Touhou community to Bluesky! The algorithms there seem to treat my posts much more favorably than Twitter has been doing lately, despite me having less than 1/10 of mostly automatically migrated followers there. For now, I'm going to cross-post new stuff to both platforms, but I might eventually spend a push to migrate my entire tweet history over to a self-hosted PDS to own the primary source of this data.
Next up: Staying with main menus, but jumping forward to TH04 and TH05 and finalizing some code there. Should be a quick one.
P0280
TH03 RE (Coordinate transformations / Player entity movement / Global shared hitbox / Hit circles)
💰 Funded by:
Blue Bolt, JonathKane, [Anonymous]
🏷️ Tags:
TH03 gameplay! 📝 It's been over two years. People have been investing some decent money with the intention of eventually getting netplay, so let's cover some more foundations around player movement… and quickly notice that there's almost no overlap between gameplay RE and netplay preparations?
That makes for a fitting opportunity to think about what TH03 netplay would look like. Regardless of how we implement them into TH03 in particular, these features should always be part of the netcode:
You'd want UDP rather than TCP for both its low latency and its NAT hole-punching ability
However, raw UDP does not guarantee that the packets arrive in order, or that they even arrive at all
WebRTC implements these reliability guarantees on top of UDP in a modern package, providing the best of both worlds
NAT traversal via public or self-hosted STUN/TURN servers is built into the connection establishment protocol and APIs, so you don't even have to understand the underlying issue
I'm not too deep into networking to argue here, and it clearly works for Ju.N.Owen. If we do explore other options, it would mainly be because I can't easily get something as modern as WebRTC to natively run on Windows 9x or DOS, if we decide to go for that route.
Matchmaking: I like Ju.N.Owen's initial way of copy-pasting signaling codes into chat clients to establish a peer-to-peer connection without a dedicated matchmaking server. progre eventually implemented rooms on the AWS cloud, but signaling codes are still used for spectating and the Pure P2P mode. We'll probably copy the same evolution, with a slight preference for Pure P2P – if only because you would have to check a GDPR consent box before I can put the combination of your room name and IP address into a database. Server costs shouldn't be an issue at the scale I expect this to have.
Rollback: In emulators, rollback netcode can be and has been implemented by keeping savestates of the last few frames together with the local player's inputs and then replaying the emulation with updated inputs of the remote player if a prediction turned out to be incorrect. This technique is a great fit for TH03 for two reasons:
All game state is contained within a relatively small bit of memory. The only heap allocations done in MAIN.EXE are the 📝 .MRS images for gauge attack portraits and bomb backgrounds, and the enemy scripts and formations, both of which remain constant throughout a round. All other state is statically allocated, which can reduce per-frame snapshots from the naive 640 KiB of conventional DOS memory to just the 37 KiB of MAIN.EXE's data segment. And that's the upper bound – this number is only going to go down as we move towards 100% PI, figure out how TH03 uses all its static data, and get to consolidate all mutated data into an even smaller block of memory.
For input prediction, we could even let the game's existing AI play the remote player until the actual inputs come in, guaranteeing perfect play until the remote inputs prove otherwise. Then again… probably only while the remote player is not moving, because the chance for a human to replicate the AI's infamous erratic dodging is fairly low.
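As a rough illustration of the savestate-based technique, here's a minimal Python sketch of a rollback session. Everything in it is hypothetical: the toy `step()` function stands in for one deterministic frame of game logic, the dictionary stands in for the snapshotted data segment, and a real implementation would also prune old snapshots and handle inputs that arrive out of order.

```python
import copy

def step(state, p1_input, p2_input):
    """Hypothetical deterministic game update: inputs move two counters."""
    return {
        "frame": state["frame"] + 1,
        "p1_x": state["p1_x"] + p1_input,
        "p2_x": state["p2_x"] + p2_input,
    }

class RollbackSession:
    def __init__(self, initial_state):
        self.state = initial_state
        self.snapshots = {}      # frame number -> state before that frame ran
        self.local_inputs = {}   # frame number -> local input
        self.remote_inputs = {}  # frame number -> predicted or confirmed input

    def advance(self, local_input, predicted_remote):
        """Runs one frame with a predicted remote input."""
        frame = self.state["frame"]
        # The deep copy stands in for copying the game's data segment.
        self.snapshots[frame] = copy.deepcopy(self.state)
        self.local_inputs[frame] = local_input
        self.remote_inputs[frame] = predicted_remote
        self.state = step(self.state, local_input, predicted_remote)

    def confirm_remote(self, frame, actual_remote):
        """Called when the real remote input for a past frame arrives."""
        if self.remote_inputs[frame] == actual_remote:
            return  # prediction was correct, nothing to redo
        self.remote_inputs[frame] = actual_remote
        current = self.state["frame"]
        # Roll back to the snapshot and re-simulate up to the present.
        self.state = copy.deepcopy(self.snapshots[frame])
        for f in range(frame, current):
            self.snapshots[f] = copy.deepcopy(self.state)
            self.state = step(
                self.state, self.local_inputs[f], self.remote_inputs[f]
            )

session = RollbackSession({"frame": 0, "p1_x": 0, "p2_x": 0})
for _ in range(3):
    session.advance(local_input=1, predicted_remote=0)
session.confirm_remote(frame=1, actual_remote=5)  # misprediction on frame 1
```

The cost of a misprediction is the number of frames between the corrected frame and the present, which is why both the snapshot size and the cost of a single simulated frame matter so much for this approach.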
The only issue with rollback specifically in a PC-98 emulator is its implications for performance. Rendering is way more computationally expensive on PC-98 than it is on consoles with hardware sprites, involving lots of memory writes to the disjointed 4 bitplane segments that make up the 128 KB framebuffer, and just as many reads and bitshift operations on sprite data. TH03 lessens the impact somewhat thanks to most of its rendering being EGC-accelerated and thus running inside the emulator as optimized native code, but we'd still be emulating all the x86 code surrounding the EGC accesses – from the emulator's point of view, it looks no different than game logic. Let's take my aging i5 system for example:
With the Screen → No wait option, Neko Project 21/W can emulate TH03 gameplay at 260 FPS, or 4.6× its regular speed.
This leaves room for each frame to contain 3.6 frames of rollback in addition to the frame that's supposed to be displayed,
which results in a maximum safe network latency of ≈63 ms, or a ping of ≈126 ms. According to this site, that's enough for a smooth connection from Germany to any other place in Europe and even out to the US Midwest. At this ping, my system could still run the game without slowdown even if every single frame required a rollback, which is highly unlikely.
Any higher ping, however, could occasionally lead to a rollback queue that's too large for my system to process within a single frame at the intended 56.4 FPS rate. As a result, me playing anyone in the western US is highly likely to involve at least occasional slowdowns. Delaying inputs on purpose is the usual workaround, but isn't Touhou that kind of game series where people use vpatch to get rid of even the default input delay in the Windows games?
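The latency budget above follows from a few lines of arithmetic, reproduced here in Python with the numbers from my system. Depending on where the intermediate values get rounded, the results land within a couple of milliseconds of the figures quoted above.

```python
NATIVE_FPS = 56.4       # TH03's regular frame rate
UNTHROTTLED_FPS = 260   # measured with the No wait option

speedup = UNTHROTTLED_FPS / NATIVE_FPS        # emulated frames per real frame
rollback_frames = speedup - 1                 # one frame is the displayed one
frame_ms = 1000 / NATIVE_FPS
max_latency_ms = rollback_frames * frame_ms   # one-way
max_ping_ms = 2 * max_latency_ms              # round trip

print(round(speedup, 1), round(rollback_frames, 1),
      round(max_latency_ms), round(max_ping_ms))
```

Plugging in the unthrottled frame rate of a different system gives the corresponding budget for that system.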
So we'd ideally want to put TH03 into an update-only mode that skips all rendering calls during re-simulation of rolled-back frames. Ironically, this means that netplay-focused RE would actually focus on the game's rendering code and ensure that it doesn't mutate any statically allocated data, allowing it to be freely skipped without affecting the game. Imagine palette-based flashing animations that are implemented by gradually mutating statically allocated values – these would cause wrong colors for the rest of the game if the animation doesn't run on every frame.
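Here's a small Python sketch of that hazard. The `Flash` class and the `simulate()` function are entirely hypothetical and only model the pattern, not any specific TH03 effect: a render call that mutates state produces different results depending on how many re-simulated frames skipped it.

```python
class Flash:
    """Hypothetical palette flash whose state is advanced by the render call."""

    def __init__(self):
        self.brightness = 15

    def update(self):
        pass  # game logic that doesn't touch the palette

    def render_mutating(self):
        # Hazard: rendering changes state that later frames depend on.
        self.brightness = max(self.brightness - 1, 0)
        return self.brightness

def simulate(frames, skip_render_until):
    """Runs `frames` frames, skipping rendering on the first few of them."""
    flash = Flash()
    shown = None
    for frame in range(frames):
        flash.update()
        if frame >= skip_render_until:  # re-simulated frames skip rendering
            shown = flash.render_mutating()
    return shown

every_frame_rendered = simulate(10, skip_render_until=0)   # 5
four_frames_skipped = simulate(10, skip_render_until=4)    # 9
```

Both runs simulate the same 10 frames, yet end up with different colors on screen. Moving the mutation into `update()` would make the two runs agree, which is the kind of guarantee an update-only mode would need.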
The integration of all of this into TH03 can be approached from several angles. Of course, as long as we don't port the game, netplay will still require a PC-98 emulator to run on modern systems. PC-98 emulation is typically regarded as difficult to set up and the additional configuration required for some of these methods would only make it harder. However, yksoft1 demonstrates that it doesn't have to be: By compiling the (potentially modified) PC-98 emulator to WebAssembly, running any of these non-native methods becomes as simple as opening a website. To stay legally safe, I wouldn't host the game myself, so you'd still have to drag your th03.hdi onto that browser tab. But if you're happy with playing in a browser, this would be as user-friendly as it gets.
Here's an overview of the various approaches with their most important pros and cons:
Depending on what the backers prefer, we can go for one, a few, or all of these.
Generic PC-98 netcode for one or more emulators
This is the most basic and puristic variant that implements generic netplay for PC-98 games in general by effectively providing remote control of the emulated keyboard and joypad. The emulator will be unaware of the game, and the game will be unaware of being netplayed, which makes this solution particularly interesting for the non-Touhou PC-98 scene, or competitive players who absolutely insist on using ZUN's original binaries and won't trust any of my modded game builds.
Applied to TH03, this means that players would select the regular hot-seat 1P vs 2P mode and then initiate a match through a new menu in the emulator UI. The same UI must then provide an option to manually remap incoming key and button presses to the 2P controls (newly introducing remapping to the emulator if necessary), as well as blocking any non-2P keys. The host then sends an initial savestate to the guest to ensure an identical starting state, and starts synchronizing and rolling back inputs at VSync boundaries.
This generic nature means that we don't get to include any of the TH03-specific rollback optimizations mentioned above, leading to the highest CPU and memory requirements out of all the variants. It sure is the easiest to implement though, as we get to freely use modern C++ WebRTC libraries that are designed to work with the network stack of the underlying OS.
I can try to build this netcode as a generic library that can work with any PC-98 emulator, but it would ultimately be up to the respective upstream developers to integrate it into official releases. Therefore, expect this variant to require separate funding and custom builds for each individual emulator codebase that we'd like to support.
Emulator-level netcode with game-specific hooks
Takes the generic netcode developed in 1) and adds the possibility for the game to control it via a special interrupt API. This enables several improvements:
Online matches could be initiated through new options in TH03's main menu rather than the emulator's UI.
The game could communicate the memory region that should be backed up every frame, cutting down memory usage as described above.
The exchanged input data could use the game's internal format instead of keyboard or joypad inputs. This removes the need for key remapping at the emulator level and naturally prevents the inherent issue of remote control where players could mess with each other's controls.
The game could be aware of the rollbacks, allowing it to jump over its rendering code while processing the queue of remote inputs and thus gain some performance as explained above.
The game could add synchronization points that block gameplay until both players have reached them, preventing the rollback queue from growing infinitely. This solves the issue of 1) not having any inherent way of working around desyncs and the resulting growth of the rollback queue. As an example, if one of the two emulators in 1) took, say, 2 seconds longer to load the game due to a random CPU spike caused by some bloatware on their system, the two players would be out of sync by 2 seconds for the rest of the session, forcing the faster system to render 113 frames every time an input prediction turned out to be incorrect.
Good places for synchronization points include the beginning of each round, the WARNING!! You are forced to evade / Your life is in peril popups that pause the game for a few frames anyway, and whenever the game is paused via the ESC key.
During such pauses, the game could then also block the resuming ESC key of the player who didn't pause the game.
Emulated serial port communicating over named pipes with a standalone netplay tool
This approach would take the netcode developed in 2) out of the emulator and into a separate application running on the (modern) host OS, just like Ju.N.Owen or Adonis. The previous interrupt API would then be turned into a binary protocol communicated over the PC-98's serial port, while the rollback snapshots would be stored inside the emulated PC-98 in EMS or XMS/Protected Mode memory. Netplay data would then move through these stages:
🖥️ PC-98 game logic ⇄ Serial port ⇄ Emulator ⇄ Named pipe ⇄ Netcode logic ⇄ WebRTC Data Channel ⇄ Internet 🛜
All green steps run natively on the host OS.
Sending serial port data over named pipes is only a semi-common feature in PC-98 emulators, and would currently restrict netplay to Neko Project 21/W and NP2kai on Windows. This is a pretty clean and generally useful feature to have in an emulator though, and emulator maintainers will be much more likely to include this than the custom netplay code I proposed in 1) and 2). DOSBox-X has an open issue that we could help implement, and the NP2kai Linux port would probably also appreciate a mkfifo(3) implementation.
This could even work with emulators that only implement PC-98 serial ports in terms of, well, native Windows serial ports. This group currently includes Neko Project II fmgen, SL9821, T98-Next, and rare bundles of Anex86 that replace MIDI support with COM port emulation. These would require separately installed and configured virtual serial port software in place of the named pipe connection, as well as support for actual serial ports in the netplay tool itself. In fact, this is the only way that die-hard Anex86 and T98-Next fans could enjoy any kind of netplay on these two ancient emulators.
If it works though, it's the optimal solution for the emulated use case if we don't want to fork the emulator. From the point of view of the PC-98, the serial port is the cheapest way to send a couple of bytes to some external thing, and named pipes are one of many native ways for two Windows/Linux applications to efficiently communicate.
The only slight drawback of this approach is the expected high DOS memory requirement for rollback. Unless we find a way to really compress game state snapshots to just a few KB, this approach will require a more modern DOS setup with EMS/XMS support instead of the pre-installed MS-DOS 3.30C on a certain widely circulated .HDI copy. But apart from that, all you'd need to do is run the separate netplay tool, pick the same pipe name in both the tool and the emulator, and you're good to go.
It could even work for real hardware, but would require the PC-98 to be linked to the separately running modern system via a null modem cable.
Native PC-98 Windows 9x netcode (only for real PC-98 hardware equipped with an Ethernet card)
Equivalent in features to 2), but pulls the netcode into the PC-98 system itself. The tool developed in 3) would then run as a separate 32-bit or 16-bit Windows application that somehow communicates with the game running in a DOS window. The handful of real-hardware owners who have actually equipped their PC-98 with a network card such as the LGY-98 would then no longer require the modern PC from 3) as a bridge in the middle.
This specific card also happens to be low-level-emulated by the 21/W fork of Neko Project. However, it makes little sense to use this technique in an emulator when compared to 3), as NP21/W requires a separately installed and configured TAP driver to actually be able to access your native Windows Internet connection. While the setup is well-documented and I did manage to get a working Internet connection inside an emulated Windows 95, it's definitely not foolproof. Not to mention DOSBox-X, which currently emulates the apparently hardware-compatible NE2000 card, but disables its emulation in PC-98 mode, most likely because its I/O ports clash with the typical peripherals of a PC-98 system.
And that's not the end of the drawbacks:
Netplay would depend on the PC-98 versions of Windows 9x and its full network stack, nothing of which is required for the game itself.
Porting libdatachannel (and especially the required transport encryption) to Windows 95 will probably involve a bit of effort as well.
As would actually finding a way to access V86 mode memory from a 32-bit or 16-bit Windows process, particularly due to how isolated DOS processes are from the rest of the system and even each other. A quick investigation revealed three potential approaches:
A 32-bit process could read the memory out of the address space of the console host process (WINOA32.MOD). There seems to be no way of locating the specific base address of a DOS process, but you could always do a brute-force search through the memory map.
If started before Windows, TSRs will share their resident memory with both DOS and Win16 processes. The segment pointer would then be retrieved through a typical interrupt API.
Writing a VxD driver 😩
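To make the brute-force search from the first approach a bit more concrete, here is a minimal sketch of the scanning step alone. It assumes that one region of the host process has already been copied into a buffer, and that some byte sequence unique to the running game is known. Both `find_signature()` and the idea of such a signature are hypothetical and not taken from any existing tool.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the offset of the first occurrence of `signature` inside `memory`,
// or -1 if it is not present. In the real scenario, `memory` would be one
// readable region of the console host process, and `signature` a byte
// sequence that only appears in the running game (both are assumptions).
long find_signature(
	const std::vector<uint8_t>& memory, const std::vector<uint8_t>& signature
)
{
	if(signature.empty() || (signature.size() > memory.size())) {
		return -1;
	}
	const auto it = std::search(
		memory.begin(), memory.end(), signature.begin(), signature.end()
	);
	return ((it == memory.end()) ? -1 : static_cast<long>(it - memory.begin()));
}
```

A real implementation would additionally have to walk every committed region of the memory map and cope with the DOS memory being relocated, which this sketch leaves out entirely.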
Correctly setting up TH03 to run within Windows 95 to begin with can be rather tricky. The GDC clock speed check needs to be either patched out or overridden using mode-setting tools, Windows needs to be blocked from accessing the FM chip, and even then, MAIN.EXE might still immediately crash during the first frame and leave all of VRAM corrupted:
This is probably a bug in the latest ver0.86 rev92β3 version of Neko Project 21/W; I got it to work fine on real hardware. 📝 StormySpace did run on the same emulated Windows 95 system without any issues, though. Regardless, it's still worth mentioning as a symbol of everything that can go wrong.
A matchmaking server would be much more of a requirement than in any of the emulator variants. Players are unlikely to run their favorite chat client on the same PC-98 system, and the signaling codes are way too unwieldy to type them in manually. (Then again, IRC is always an option, and the people who would fund this variant are probably the exact same people who are already running IRC clients on their PC-98.)
Native PC-98 DOS netcode (only for real PC-98 hardware equipped with an Ethernet card)
Conceptually the same as 4), but going yet another level deeper, replacing the Windows 9x network stack with a DOS-based one. This might look even more intimidating and error-prone, but after I got ping and even Telnet working, I was pleasantly surprised at how much simpler it is when compared to the Windows variant. The whole stack consists of just one LGY-98 hardware information tool, a LGY-98 packet driver TSR, and a TSR that implements TCP/IP/UDP/DNS/ICMP and is configured with a plaintext file. I don't have any deep experience with these protocols, so I was quite surprised that you can implement all of them in a single 40 KiB binary. Installed as TSRs, the entire stack takes up an acceptable 82 KiB of conventional memory, leaving more than enough space for the game itself. And since both of the TSRs are open-source, we can even legally bundle them with the future modified game binaries.
The matchmaking issue from the Windows 9x approach remains though, along with the following issues:
Porting libdatachannel and the required transport encryption to the TEEN stack seems even more time-consuming than a Windows 95 port.
The TEEN stack has no UI for specifying the system's or gateway's IP addresses outside of its plaintext configuration file. This provides a nice opportunity for adding a new Internet settings menu with great error feedback to the game itself. Great for UX, but it's another thing I'd have to write.
As always, this is the premium option. If the entire game already runs as a standalone executable on a modern system, we can just put all the netcode into the same binary and have the most seamless integration possible.
That leaves us with these prerequisites:
1), by definition, needs nothing from ReC98, and I could theoretically start implementing it right now. If you're interested in funding it, just tell me via the usual Twitter or Discord channels.
2) through 5) require at least 100% RE of TH03's OP.EXE to facilitate the new menu code. Reverse-engineering all rendering-related code in MAIN.EXE would be nice for performance, but we don't strictly need all of it before we start. Re-simulated frames can just skip over the few pieces of rendering code we do know, and we can gradually increase the skipped area of code in future pushes. 100% PI won't be a requirement either, as I expect the MAIN.EXE part of the interfacing netcode layer to be thin enough that it can easily fit within the original game's code layout.
Therefore, funding TH03 OP.EXE RE is the clearest way you can signal to me that you want netplay with nice UX.
6), obviously, requires all of TH03 to be RE'd, decompiled, cleaned up, and ported to modern systems. Currently, TH03 appears to be the second-easiest game to port behind TH02:
Although TH03 already has more needlessly micro-optimized ASM code than TH02 and there's even more to come, it still appears to have way less than TH04 or TH05.
Its game logic and rendering code seem to be somewhat neatly separated from each other, unlike TH01 which deeply intertwines them.
Its graphics seem free of obvious bugs, unlike – again – the flicker-fest that is TH01.
But still, it's the game with the least amount of RE%. Decompilation might get easier once I've worked myself up to the higher levels of game code, and even more so if we're lucky and all of the 9 characters are coded in a similar way, but I can't promise anything at this point.
Once we've reached any of these prerequisites, I'll set up a separate campaign funding method that runs parallel to the cap. As netplay is one of those big features where incremental progress makes little sense and we can expect wide community support for the idea, I'll go for a more classic crowdfunding model with a fixed goal for the minimum feature set and stretch goals for optional quality-of-life features. Since I've still got two other big projects waiting to be finished, I'd like to at least complete the Shuusou Gyoku Linux port before I start working on TH03 netplay, even if we manage to hit any of the funding goals before that.
For the first time in a long while, the actual content of this push can be listed fairly quickly. I've now RE'd:
conversions from playfield-relative coordinates to screen coordinates and back (a first in PC-98 Touhou; even TH02 uses screen space for every coordinate I've seen so far),
the low-level code that moves the player entity across the screen,
a copy of the per-round frame counter that, for some reason, resets to 0 at the start of the Win/Lose animation, resetting a bunch of animations with it,
a global hitbox with one variable that sometimes stores the center of an entity, and sometimes its top-left corner,
and the 48×48 hit circles from EN2.PI.
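As a rough illustration of the first item, the conversion between the two coordinate spaces boils down to adding or subtracting the top-left corner of the respective playfield. The constants and names below are placeholders chosen for this sketch. The actual values and identifiers in the game are not stated here.

```cpp
// Hypothetical layout constants for two side-by-side playfields.
const int PLAYFIELD_COUNT = 2;
const int PLAYFIELD_LEFT[PLAYFIELD_COUNT] = { 16, 336 };
const int PLAYFIELD_TOP = 16;

struct Point {
	int x;
	int y;
};

// Playfield-relative → screen space, for the playfield with the given ID.
Point playfield_to_screen(int pid, Point p)
{
	Point ret = { (PLAYFIELD_LEFT[pid] + p.x), (PLAYFIELD_TOP + p.y) };
	return ret;
}

// Screen space → playfield-relative, the exact inverse of the above.
Point screen_to_playfield(int pid, Point p)
{
	Point ret = { (p.x - PLAYFIELD_LEFT[pid]), (p.y - PLAYFIELD_TOP) };
	return ret;
}
```

Keeping entities in playfield-relative space means that the same game logic can run for both players, with only the final blitting step needing to know where a playfield sits on screen.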
It's also the third TH03 gameplay push in a row that features inappropriate ASM code in places that really, really didn't need any. As usual, the code is worse than what Turbo C++ 4.0J would generate for idiomatic C code, and the surrounding code remains full of untapped and quick optimization opportunities anyway. This time, the biggest joke is the sprite offset calculation in the hit circle rendering code:
A multiplication with 6 would have compiled into a single IMUL instruction. This compiles into 4 MOVs, one IMUL (with 2), and two ADDs. This surely must have been left in on purpose for us to laugh about it one day?
But while we've all come to expect the usual share of ZUN bloat by now, this is also the first push without either a ZUN bug or a landmine since I started using these terms! 🎉 It does contain a single ZUN quirk though, which can also be found in the hit circles. This animation comes in two types with different caps: 12 animation slots across both playfields for the enemy circles shown in alternating bright/dark yellow colors, whereas the white animation for the player characters has a cap of… 1? P2 takes precedence over P1 because its update code always runs last, which explains what happens when both players get hit within the 16 frames of the animation:
If they both get hit on the exact same frame, the animation for P1 never plays, as P2 takes precedence.
If the other player gets hit within 16 frames of an active white circle animation, the animation is reinitialized for the other player as there's only a single slot to hold it. Is this supposed to telegraph that the other player got hit without them having to look over to the other playfield? After all, they're drawn on top of most other entities, but below the player.
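Both cases fall out naturally from a single shared slot that is written by per-player update code running in P1 → P2 order. The following is a small model of that behavior, not the game's actual code. The structure, field names, and the exact point at which the counter decrements are assumptions.

```cpp
const int HIT_CIRCLE_FRAMES = 16;

struct PlayerHitCircle {
	int pid = -1;        // Playfield that owns the single slot, -1 if free
	int frames_left = 0; // Remaining frames of the white animation
};

// Simulates one frame. The per-player update runs for P1 (0) first and for
// P2 (1) last, so a hit on P2 overwrites whatever P1 wrote on the same frame.
void hit_circle_frame(PlayerHitCircle& slot, const bool hit[2])
{
	if(slot.frames_left > 0) {
		slot.frames_left--;
		if(slot.frames_left == 0) {
			slot.pid = -1;
		}
	}
	for(int pid = 0; pid < 2; pid++) {
		if(hit[pid]) {
			// Only one slot exists, so any new hit reinitializes it.
			slot.pid = pid;
			slot.frames_left = HIT_CIRCLE_FRAMES;
		}
	}
}
```

Giving each player their own slot would remove the quirk, at the cost of changing what the original game shows.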
SPRITE16 uses the PC-98's EGC to draw these single-color sprites. If the EGC is already set up, it can be set into a GRCG-equivalent RMW mode using the pattern/read plane register (0x4A2) and foreground color register (0x4A6), together with setting the mode register (0x4A4) to 0x0CAC. Unlike the typical blitting operations that involve its 16-dot pattern register, the EGC even supports 8- or 32-bit writes in this mode, just like the GRCG. 📝 As expected for EGC features beyond the most ordinary ones though, T98-Next simply sets every written pixel to black on a 32-bit write. Comparing the actual performance of such writes to the GRCG would be 📝 yet another interesting question to benchmark.
Next up: I think it's time for ReC98's build system to reach its final form.
For almost 5 years, I've been using an unreleased sane build system on a parallel private branch that was just missing some final polish and bugfixes. Meanwhile, the public repo is still using the project's initial Makefile that, 📝 as typical for Makefiles, is so unreliable that BUILD16B.BAT force-rebuilds everything by default anyway. While my build system has scaled decently over the years, something even better happened in the meantime: MS-DOS Player, a DOS emulator exclusively meant for seamless integration of CLI programs into the Windows console, has been forked and enhanced enough to finally run Turbo C++ 4.0J at an acceptable speed. So let's remove DOSBox from the equation, merge the 32-bit and 16-bit build steps into a single 32-bit one, set all of this up in a user-friendly way, and maybe squeeze even more performance out of MS-DOS Player specifically for this use case.
P0264
TH03/TH04/TH05 decompilation (Music Rooms, part 1/2)
P0265
TH03/TH04/TH05 decompilation (Music Rooms, part 2/2 + MAINE.EXE main()) + TH02 PI/RE (Boss damage and position)
💰 Funded by:
Blue Bolt, [Anonymous], iruleatgames
🏷️ Tags:
Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.
But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!
As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.
The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:
Let's call this "the blank image".
That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right?
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:
The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.
On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.
Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
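Expressed on plain byte arrays instead of VRAM, the per-frame logic described above could be sketched as follows. Only `nopoly_B` is a name from the actual code. The function names and the `polygon_mask` parameter are made up for this sketch, and the real game rasterizes the polygons directly into VRAM rather than into a separate mask.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<uint8_t> Bitplane;

// Per-frame update of the first bitplane: Restore the state captured right
// after blitting the .PI image, then set the bits covered by polygons.
void music_frame_plane_B(
	Bitplane& plane_B, const Bitplane& nopoly_B, const Bitplane& polygon_mask
)
{
	for(size_t i = 0; i < plane_B.size(); i++) {
		plane_B[i] = (nopoly_B[i] | polygon_mask[i]);
	}
}

// Palette index of a single pixel. The first bitplane contributes bit 0 and
// thereby selects between the blank (even) and spacey (odd) sub-image.
uint8_t color_index(bool bit_B, uint8_t upper_plane_bits)
{
	return ((upper_plane_bits & 0x0E) | (bit_B ? 1 : 0));
}
```

Caption pixels have their bit set in `nopoly_B`, so they survive the restore step on every frame and always come from the spacey sub-image.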
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:
// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);
// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);
// Regular GRCG blitting code to any VRAM segment…
pokeb(0xA800, offset, …);
// We're done, turn off the GRCG.
outportb(0x7C, 0x00);
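The two mode bytes above follow directly from the bit layout: The lowest four bits list the planes that are disabled, so a set of planes to write to has to be inverted before it goes into the register. A small helper makes this explicit. `grcg_rmw_mode()` and the constant names are hypothetical and not part of master.lib or ReC98.

```cpp
#include <cstdint>

const uint8_t GRCG_ENABLE = 0x80;
const uint8_t GRCG_RMW = 0x40;

// `planes` has bit 0 set for the first bitplane, bit 1 for the second, and
// so on. The mode register stores the *disabled* planes in its lowest four
// bits, which is why the mask is inverted here.
uint8_t grcg_rmw_mode(uint8_t planes)
{
	return (GRCG_ENABLE | GRCG_RMW | (~planes & 0x0F));
}
```

`grcg_rmw_mode(0x0F)` yields 0xC0 for all four planes, and `grcg_rmw_mode(0x01)` yields the 0xCE used in the Music Rooms.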
This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck?
An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.
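The difference can be reproduced on a single VRAM byte. The sketch below models a GRCG write in RMW mode as replacing the masked bits with the corresponding tile bits, which is equivalent to an OR for the all-1 tile used on the first bitplane here. The function names are made up, and this is a simplified model rather than a description of the chip's internals.

```cpp
#include <cstdint>

// GRCG in RMW mode: Bits set in `mask` take their value from the tile
// register, all other bits of the VRAM byte are left as they were.
uint8_t write_grcg_rmw(uint8_t vram, uint8_t tile, uint8_t mask)
{
	return ((vram & ~mask) | (tile & mask));
}

// The same MOV without the GRCG: The mask is stored as a dot pattern and
// replaces the previous contents of the byte.
uint8_t write_mov(uint8_t /* vram */, uint8_t mask)
{
	return mask;
}

// The non-GRCG fix mentioned above: OR instead of MOV.
uint8_t write_or(uint8_t vram, uint8_t mask)
{
	return (vram | mask);
}
```

With a VRAM byte of 0xF0 and a polygon edge mask of 0x03, the GRCG and OR variants both produce 0xF3, while the plain MOV leaves 0x03 and wipes out the four pixels that were previously set.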
As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:
Colors 0 or 1 can't be used, because those don't include any of the bits that can stay constant between frames.
If the lowest bit of a palette color index has no effect on the displayed color, text drawn in either of the two colors won't be visually affected by the polygon animation and will always appear on top. TH04 and TH05 rely on this property with their colors 2/3, 4/5, and 6/7 being identical, but this would work in TH02 and TH03 as well.
But this doesn't apply to TH02 and TH03's palettes, so how do they do it? The secret: They simply include all text pixels in nopoly_B. This allows text to use any color with an odd palette index – the lowest bit then won't be affected by the polygons ORed into the first bitplane, and the other bitplanes remain unchanged.
TH04 is a curious case. Ostensibly, it seems to remove support for odd text colors, probably because the new 10-frame fade-in animation on the comment text would require at least the comment area in VRAM to be captured into nopoly_B on every one of the 10 frames. However, the initial pixels of the tracklist are still included in nopoly_B, which would allow those to still use any odd color in this game. ZUN only removed those from nopoly_B in TH05, where it had to be changed because that game lets you scroll and browse through multiple tracklists.
The contents of nopoly_B with each game's first track selected.
Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:
Due to the polygon animation, the Music Room is one of the few double-buffered menus in PC-98 Touhou, rendering to both VRAM pages on alternate frames instead of using the other page to store a background image. Unfortunately though, this doesn't actually translate to tearing-free rendering because ZUN's initial implementation for TH02 mixed up the order of the required operations. You're supposed to first wait for the GDC's VSync interrupt and then, within the display's vertical blanking interval, write to the relevant I/O ports to flip the accessed and shown pages. Doing it the other way around and flipping as soon as you're finished with the last draw call of a frame means that you'll very likely hit a point where the (real or emulated) electron beam is still traveling across the screen. This ensures that there will be a tearing line somewhere on the screen on all but the fastest PC-98 models that can render an entire frame of the Music Room completely within the vertical blanking interval, causing the very issue that double-buffering was supposed to prevent.
ZUN only fixed this landmine in TH05. Edit (2025-09-06): The 📝 2025-09-06 blog post contains a visualization of this tearing landmine.
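The two orderings can be contrasted with a tiny stand-in for the hardware that merely records the calls it receives. `FakeHardware` and its methods are hypothetical stubs for this sketch. On a real PC-98, they would correspond to waiting for the VSync interrupt and writing to the page-related I/O ports.

```cpp
#include <string>
#include <vector>

// Records the order of operations instead of touching any hardware.
struct FakeHardware {
	std::vector<std::string> log;

	void vsync_wait() { log.push_back("vsync_wait"); }
	void flip_pages() { log.push_back("flip_pages"); }
};

// Intended order: The flip happens inside the vertical blanking interval.
void frame_end_tearing_free(FakeHardware& hw)
{
	hw.vsync_wait();
	hw.flip_pages();
}

// Order described for TH02: The flip happens right after the last draw
// call, most likely while the beam is still traveling across the screen.
void frame_end_landmine(FakeHardware& hw)
{
	hw.flip_pages();
	hw.vsync_wait();
}
```

Both variants spend the same amount of time per frame; only the moment at which the shown page changes is different.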
The polygons have a fixed vertex count and radius depending on their index, everything else is randomized. They are also never reinitialized while OP.EXE is running – if you leave the Music Room and reenter it, they will continue animating from the same position.
TH02 and TH04 don't handle it at all, causing held keys to be processed again after about a second.
TH03 and TH05 correctly work around the quirk, at the usual cost of a 614.4 µs delay per frame. Except that the delay is actually twice as long in frames in which a previously held key is released, because this code is a mess.
But even in 2024, DOSBox-X is the only emulator that actually replicates this detail of real hardware. On anything else, keyboard input will behave as ZUN intended it to. At least I've now mentioned this once for every game, and can just link back to this blog post for the other menus we still have to go through, in case their game-specific behavior matches this one.
TH02 is the only game that
1) separately lists the stage and boss themes of the main game, rather than following the in-game order of appearance,
2) continues playing the selected track when leaving the Music Room,
3) always loads both MIDI and PMD versions, regardless of the currently selected mode, and
4) does not stop the currently playing track before loading the new one into the PMD and MMD drivers.
The combination of 2) and 3) allows you to leave the Music Room and change the music mode in the Option menu to listen to the same track in the other version, without the game changing back to the title screen theme. 4), however, might cause the PMD and MMD drivers to play garbage for a short while if the music data is loaded from a slow storage device that takes longer than a single period of the OPN timer to fill the driver's song buffer. Probably not worth mentioning anymore though, now that people no longer try fitting PC-98 Touhou games on floppy disks.
Exactly 40 (TH02/TH03) / 38 (TH04/TH05) visible bytes per line,
padded with 2 bytes that can hold a CR/LF newline sequence for easier editing.
Every track starts with a title line that mostly just duplicates the names from the hardcoded tracklist,
followed by a fixed 19 (TH02/TH03/TH04) / 9 (TH05) comment lines.
In TH04 and TH05, lines can start with a semicolon (;) to prevent them from being rendered. This is purely a performance hint, and is visually equivalent to filling the line with spaces.
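With fixed line lengths and a fixed number of lines per track, locating any line is simple arithmetic. The sketch below assumes that the file is nothing but a flat sequence of such fixed-size records without any header. The structure and function names are invented for illustration.

```cpp
#include <cstddef>

struct CommentFormat {
	size_t visible_bytes;  // 40 in TH02/TH03, 38 in TH04/TH05
	size_t comment_lines;  // 19 in TH02/TH03/TH04, 9 in TH05
	bool semicolon_skips;  // true in TH04/TH05
};

const size_t PADDING_BYTES = 2; // Room for a CR/LF sequence

size_t line_size(const CommentFormat& f)
{
	return (f.visible_bytes + PADDING_BYTES);
}

// One title line plus the fixed number of comment lines.
size_t track_size(const CommentFormat& f)
{
	return (line_size(f) * (1 + f.comment_lines));
}

// Byte offset of a line within the file. Line 0 is the title line.
size_t line_offset(const CommentFormat& f, size_t track, size_t line)
{
	return ((track * track_size(f)) + (line * line_size(f)));
}

// Lines starting with a semicolon are skipped in the games that support it.
bool line_is_rendered(const CommentFormat& f, const char* line)
{
	return !(f.semicolon_skips && (line[0] == ';'));
}
```

For the TH02/TH03 parameters, every track occupies 42 × 20 = 840 bytes, so the first comment line of the third track would start at byte 1722 under these assumptions.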
All in all, the quality of the code is even slightly below the already poor standard for PC-98 Touhou: More VRAM page copies than necessary, conditional logic that is nested way too deeply, a distinct avoidance of state in favor of loops within loops, and – of course – a couple of gotos to jump around as needed.
In TH05, this gets so bad with the scrolling and game-changing tracklist that it all gives birth to a wonderfully obscure inconsistency: When pressing both ⬆️/⬇️ and ⬅️/➡️ at the same time, the game first processes the vertical input and then the horizontal one in the next frame, making it appear as if the latter takes precedence. Except when the cursor is highlighting the first (⬆️ ) or 12th (⬇️ ) element of the list, and said list element is not the first track (⬆️ ) or the quit option (⬇️ ), in which case the horizontal input is ignored.
And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier…
For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, Try to Normal Rank!! could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".
With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌
The remaining 1/6th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.
Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there!
Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.
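A minimal sketch of why this ID order pays off: If the stones of each phase occupy a contiguous ID range, every phase function can loop over just that range. The phase numbering, the range table, and the function below are assumptions for illustration and do not mirror the game's actual functions.

```cpp
const int STONE_COUNT = 5;

// IDs follow the fight order: 0/1 = inner, 2/3 = outer, 4 = center.
struct StoneRange {
	int begin; // First active stone ID
	int end;   // One past the last active stone ID
};

// Hypothetical phase table: Each phase covers one contiguous subrange.
const StoneRange PHASE_STONES[3] = { {0, 2}, {2, 4}, {4, 5} };

// Sums up the damage dealt to the stones that are active in `phase`.
int phase_damage_total(const int damage[STONE_COUNT], int phase)
{
	const StoneRange& range = PHASE_STONES[phase];
	int sum = 0;
	for(int i = range.begin; i < range.end; i++) {
		sum += damage[i];
	}
	return sum;
}
```

Had the IDs followed the horizontal order on the playfield instead, the inner and outer pairs would not be contiguous, and each phase would need its own list of indices.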
This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!
These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just 1/6th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.
Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.
And we're back to PC-98 Touhou for a brief interruption of the ongoing Shuusou Gyoku Linux port.
Let's clear some of the Touhou-related progress from the backlog, and use the unconstrained nature of these contributions to prepare the 📝 upcoming non-ASCII translations commissioned by Touhou Patch Center. The current budget won't cover all of my ambitions, but it would at least be nice if all text in these games was feasibly translatable by the time I officially start working on that project.
At a little over 3 pushes, it might be surprising to see that this took longer than the 📝 TH03/TH04/TH05 cutscene system. It's obvious that TH02 started out with a different system for in-game dialog, but while TH04 and TH05 look identical on the surface, they only actually share 30% of their dialog code. So this felt more like decompiling 2.4 distinct systems, as opposed to one identical base with tons of game-specific differences on top.
The table of contents was pretty popular last time around, so let's have another one:
Let's start with the ones from TH04 and TH05, since they are not that broken. For TH04, ZUN started out by copy-pasting the cutscene system, causing the result to inherit many of the caveats I already described in the cutscene blog post:
It's still a plaintext format geared exclusively toward full-width
Japanese text.
The parser still ignores all whitespace, forcing ASCII text into hacks
with unassigned Shift-JIS lead bytes outside the second byte of a 2-byte
chunk.
Commands are still preceded by a 0x5C byte, which renders
as either a \ or a ¥ depending on your font and
interpretation of Shift-JIS.
Command parameters are parsed in exactly the same way, with all the same
limits.
Many of the script commands are identical, including 7 of them
that were not used in TH04's original dialog scripts.
Then, however, he greatly simplified the system. Mainly, this was done by
moving text rendering from the PC-98 graphics chip to the text chip, which
avoids the need for any text-related unblitting code, but ZUN also added a
bunch of smaller changes:
The player must advance through every dialog box by releasing any held
keys and then pressing any key mapped to a game action. There are no
timeouts.
The delay for every 2 bytes of text was doubled to 2 frames, and can't
be overridden.
Instead of holding ESC to fast-forward, pressing any key
will immediately print the entire rest of a text box.
Dialogs run in their own single-buffered frame loop, interrupting the
rest of the game. The other VRAM page keeps the background pixels required
for unblitting the face images.
All script commands that affect the graphics layer are preceded by a
1-frame delay. ZUN most likely did this because of the single-buffered
nature, as it prevents tearing on the first frame by waiting for the CRT
beam to return to the top-left corner before changing any pixels.
Both boxes are intended to contain up to 30 half-width characters on
each of their up to 3 lines, but nothing in the code enforces these limits.
There is no support for automatic line breaks or starting new boxes.
While it would seem that TH05 has no issues with ASCII 0x20
spaces, the text as a whole is still blindly processed two bytes at a
time, and any commands can only appear at even byte positions within a
line. I dimmed the VRAM pixels to 25% of their original brightness to make the
text easier to read.
The same text backported to TH04, additionally demonstrating how that
game's dialog system inherited the whitespace skipping behavior of
TH03's cutscene system. Just like there, ASCII 0x20 spaces
only work at odd byte positions because the game treats them as the
trailing byte of a full-width Shift-JIS codepoint. I don't know how
large the budget for the upcoming non-ASCII translations will be, but
I'm going to fix this even in the very basic fully static variant.
I dimmed the VRAM pixels to 25% of their original brightness to make the
text easier to read.
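To make the difference between these two behaviors more tangible, here is a minimal Python sketch of both chunking strategies. It is a model under stated assumptions, not the games' actual code: the function names are made up, script commands are left out entirely, and treating every byte up to 0x20 as skippable whitespace is only my reading of "ignores all whitespace".

```python
def th04_style_chunks(line: bytes) -> list[bytes]:
    """Hypothetical model of a whitespace-skipping parser (TH03 cutscenes, TH04)."""
    chunks = []
    i = 0
    while i < len(line):
        lead = line[i]
        i += 1
        if lead <= 0x20:
            # Assumption: whitespace in the lead position is silently dropped.
            continue
        # The following byte is taken verbatim as the trailing byte of the
        # 2-byte chunk, which is why a 0x20 space survives at odd positions.
        chunks.append(bytes([lead]) + line[i:i + 1])
        i += 1
    return chunks


def th05_style_chunks(line: bytes) -> list[bytes]:
    """Hypothetical model of blind 2-byte chunking (TH05)."""
    return [line[i:i + 2] for i in range(0, len(line), 2)]
```

In this model, `b" AB"` loses its leading space in the TH04-style parser but keeps it in the TH05-style one, while `b"A B C"` comes out the same in both because its spaces sit at odd byte positions.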
TH05 then moved from TH04's plaintext scripts to the binary
.TX2 format while removing all the unused commands copy-pasted
from the cutscene system. Except for a
single additional command intended to clear a text box, TH05's dialog
system only supports a strict subset of the features of TH04's system.
This change also introduced the following differences compared to TH04:
The game now stores the dialog of all 4 playable characters in the same
file, with a (4 + 1)-word header that indicates the byte offset
and length of each character's script. This way, it can load only the one
script for the currently played character.
Since there is no need for whitespace in a binary format, you can now
use ASCII 0x20 spaces even as the first byte of a 2-byte text
chunk! 🥳
All command parameters are now mandatory.
Filenames are now passed directly by pointer to the respective game
function. Therefore, they now need to be null-terminated, but can in turn be
as long as
📝 the number of remaining bytes in the allocated dialog segment.
In practice though, the game still runs on DOS and shares its restriction of
8.3 filenames…
When starting a new dialog box, any existing text in the other box is
now colored blue.
Thanks to ZUN messing up the return values of the command-interpreting
switch function, you can effectively use only line break and gaiji commands in the middle of text. All other
commands do execute, but the interpreter then also treats their command byte
as a Shift-JIS lead byte and places it in text RAM together with whatever
other byte follows in the script.
This is why TH04 can and does put its \= commands into the boxes
started with the 0 or 1 commands, but TH05 has to
put its 0x02 commands before the equivalent 0x0D.
Writing the 0x02 byte to text RAM results in a stray character, which is simply the PC-98 font ROM's glyph for that
Shift-JIS codepoint. Also note how each face change is now
preceded by two frames of delay.
No problem in TH04. Note how the dialog also runs a bit faster – TH04
only adds the aforementioned one frame of delay to each face change, and
has fewer two-byte chunks of text to display overall.
For modding these files, you probably want to use TXDEF from
-Tom-'s MysticTK. It decodes these
files into a text representation, and its encoder then takes care of the
character-specific byte offsets in the 10-byte header. This text
representation simplifies the format a lot by avoiding all corner cases and
landmines you'd experience during hex-editing – most notably by interpreting
the box-starting 0x0D as a
command to show text that takes a string parameter, avoiding the broken
calls to script commands in the middle of text. However, you'd still have to
manually ensure an even number of bytes on every line of text.
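If you'd rather poke at the header by hand, the following Python sketch shows one way to read and write it. Treat it as a guess at the layout rather than a format specification: I'm assuming five little-endian 16-bit words that hold offsets relative to the start of the file, with each script's length derived from the distance to the next offset. The function names are hypothetical and unrelated to TXDEF.

```python
import struct

CHARACTER_COUNT = 4
HEADER_WORDS = CHARACTER_COUNT + 1
HEADER_SIZE = HEADER_WORDS * 2  # 10 bytes


def split_scripts(data: bytes) -> list[bytes]:
    """Return the per-character scripts of a .TX2-like file (assumed layout)."""
    offsets = struct.unpack_from("<5H", data, 0)
    # Assumption: script i spans from its own offset to the next one.
    return [data[offsets[i]:offsets[i + 1]] for i in range(CHARACTER_COUNT)]


def build_file(scripts: list[bytes]) -> bytes:
    """Inverse of split_scripts(): prepend a freshly calculated header."""
    if len(scripts) != CHARACTER_COUNT:
        raise ValueError("expected exactly one script per playable character")
    offsets = [HEADER_SIZE]
    for script in scripts:
        offsets.append(offsets[-1] + len(script))
    return struct.pack("<5H", *offsets) + b"".join(scripts)
```

The point of the extra fifth word is that a loader only needs two adjacent header entries to know where the selected character's script starts and ends.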
In the entry function of TH05's dialog loop, we also encounter the hack that
is responsible for properly handling
📝 ZUN's hidden Extra Stage replay. Since the
dialog loop doesn't access the replay inputs but still requires key presses
to advance through the boxes, ZUN chose to just skip the dialog altogether in the
specific case of the Extra Stage replay being active, and replicated all
sprite management commands from the dialog script by just hardcoding
them.
And you know what? Not only do I not mind this hack, but I would have
preferred it over the actual dialog system! The aforementioned sprite
management commands effectively boil down to manual memory management,
deallocating all stage enemy and midboss sprites and thus ensuring that the
boss sprites end up at specific master.lib sprite IDs (patnums). The
hardcoded boss rendering function then expects these sprites to be available
at these exact IDs… which means that the otherwise hardcoded bosses can't
render properly without the dialog script running before them.
There is absolutely no excuse for the game to burden dialog scripts with
this functionality. Sure, delayed deallocation would allow them to blit
stage-specific sprites, but the original games don't do that; probably
because neither of the two games features an unblitting command. And even if
they did, it would have still been cleaner to expose the boss-specific
sprite setup as a single script command that can then also be called from
game code if the script didn't do so. Commands like these are just a recipe
for crashes, especially with parsers that expect fullwidth Shift-JIS
text and where misaligned ASCII text can easily cause these commands to be
skipped.
But then again, it does make for funny screenshot material if you
accidentally the deallocation and then see bosses being turned into stage
enemies:
Some of the more amusing consequences of not calling the
sprite-deallocating \c / 0x04 command inside a dialog script.
In the case of 4️⃣, the game then even crashes on this frame at the end
of the dialog, in a way that resembles the infamous
📝 TH04 crash before Stage 5 Yuuka if no EMS driver is loaded.
Both the stage- and boss-specific BFNT sprites are loaded into memory at
this point, leaving no room for the 256×256-pixel background image on
the size-limited master.lib heap.
With all the general details out of the way, here's the command reference:
0 1
0x00 0x01
Selects either the player character (0) or the boss (1) as the
currently speaking character, and moves the cursor to the beginning of
the text box. In TH04, this command also directly starts the new dialog
box, which is probably why it's not prefixed with a \ as it
only makes sense outside of text. TH05 requires a separate 0x0D command to do the
same.
\=1
0x02 0x!!
Replaces the face portrait of the currently active speaking
character with image #1 within her .CD2
file.
\=255
0x02 0xFF
Removes the face portrait from the currently active text box.
\l,filename
0x03 filename 0x00
Calls master.lib's super_entry_bfnt() function, which
loads sprites from a BFNT file to consecutive IDs starting at the
current patnum write cursor.
\c
0x04
Deallocates all stage-specific BFNT sprites (i.e., stage enemies and
midbosses), freeing up conventional RAM for the boss sprites and
ensuring that master.lib's patnum write cursor ends up at 128 / 180.
In TH05's Extra Stage, this command also replaces
📝 the sprites loaded from MIKO16.BFT with the ones from ST06_16.BFT.
\d
Deallocates all face portrait images.
The game automatically does this at the end of each dialog sequence.
However, ZUN wanted to load Stage 6 Yuuka's 76 KiB of additional
animations inside the script via \l, and would have once again
run up against the master.lib heap size limit without that extra free
memory.
\m,filename
0x05 filename 0x00
Stops the currently playing BGM, loads a new one from the given
file, and starts playback.
\m$
0x05 $ 0x00
Stops the currently playing BGM.
Note that TH05 interprets $ as a null-terminated filename as
well.
\m*
Restarts playback of the currently loaded BGM from the
beginning.
\b0,0,0
0x06 0x!!!! 0x!!!! 0x!!
Blits the master.lib patnum with the ID indicated by the third
parameter to the current VRAM page at the top-left screen position
indicated by the first two parameters.
\e0
Plays the sound effect with the given ID.
\t100
Sets palette brightness via master.lib's
palette_settone() to any value from 0 (fully black) to 200
(fully white). 100 corresponds to the palette's original colors.
\fo1
\fi1
Calls master.lib's palette_black_out() or
palette_black_in() to play a hardware palette fade
animation to or from black, spending roughly 1 frame on each of the 16 fade steps.
\wo1
\wi1
0x09 0x!!
0x0A 0x!!
Calls master.lib's palette_white_out() or
palette_white_in() to play a hardware palette fade
animation to or from white, spending roughly 1 frame on each of the 16 fade steps. The
TH05 version of 0x09 also clears the text in both boxes
before the animation.
\n
0x0B
Starts a new line by resetting the X coordinate of the TRAM cursor
to the left edge of the text area and incrementing the Y coordinate.
The new line will always be the next one below the last one that was
properly started, regardless of whether the text previously wrapped to
the next TRAM row at the edge of the screen.
\g8
Plays a blocking 8-frame screen shake
animation. Copy-pasted from the cutscene parser, but actually used right
at the end of the dialog shown before TH04's Bad Ending.
\ga0
0x0C 0x!!
Shows the gaiji with the given ID from 0 to 255
at the current cursor position, ignoring the per-glyph delay.
\k0
Waits 0 frames (0 = forever) for any key
to be pressed before continuing script execution.
Takes the current dialog cursor as the top-left corner of a
240×48-pixel rectangle, and replaces all text RAM characters within that
rectangle with whitespace.
This is only used to clear the player character's text box before
Shinki's final いくよ‼ box. Shinki has two
consecutive text boxes in all 4 scripts here, and ZUN probably wanted to
clear the otherwise blue text to imply a dramatic pause before Shinki's
final sentence. Nice touch.
(You could, however, also use it after a
box-ending 0xFF command to mess with text RAM in
general.)
\#
Quits the currently running loop. This returns from either the text
loop to the command loop, or it ends the dialog sequence by returning
from the command loop back to gameplay. If this stage of the game later
starts another dialog sequence, it will start at the next script
byte.
\$
Like \#, but first waits for any key to be
pressed.
0xFF
Behaves like TH04's \$ in the text loop, and like
\# in the command loop. Hence, it's not possible in TH05 to
automatically end a text box and advance to the next one without waiting
for a key press.
Unused commands are in gray.
At the end of the day, you might criticize the system for how its landmines
make it annoying to mod in ASCII text, but it all works and does what it's
supposed to. ZUN could have written the cleanest single and central
Shift-JIS iterator that properly chunks a byte buffer into halfwidth and
fullwidth codepoints, and I'd still be throwing it out for the upcoming
non-ASCII translations in favor of something that either also supports UTF-8
or performs dictionary lookups with a full box of text.
The only actual bug can be found in the input detection, which once
again doesn't correctly handle the infamous key
up/key down scancode quirk of PC-98 keyboards. All it takes
is one wrongly placed input polling call, and suddenly you have to think
about how the update cycle behind the PC-98 keyboard state bytes
might cause the game to run the regular 2-frame delay for a single
2-byte chunk of text before it shows the full text of a box after
all… But even this bug is highly theoretical and could probably only be
observed very, very rarely, and exclusively on real hardware.
The same can't be said about TH02 though, but more on that later. Let's
first take a look at its data, which started out much simpler in that game.
The STAGE?.TXT files contain just raw Shift-JIS text with no
trace of commands or structure. Turning on the whitespace display feature in
your editor reveals how the dialog system even assumes a fixed byte
length for each box: 36 bytes per line which will appear on screen, followed
by 4 bytes of padding, which the original files conveniently use to visually
split the lines via a CR/LF newline sequence. Make sure to disable trimming
of trailing whitespace in your editor to not ruin the file when modding the
text…
Two boxes from TH02's STAGE5.TXT with visualized whitespace.
These also demonstrate how the CR/LF newlines only make up 2 of the 4
padding bytes, and require each line to be padded with two more bytes; you
could not use these trailing spaces for actual text. Also note how
the exquisite mixture of fullwidth and halfwidth spaces demands the text to
be viewed with only the most metrically consistent monospace fonts to
preserve the intended alignment. 🍷 It appears quite misaligned on my phone.
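For modders, the fixed layout boils down to a 40-byte stride per line. Here's a minimal Python sketch that pads a line to that layout and slices the visible text back out. The helper names are hypothetical, and placing the two spaces before the CR/LF pair is my reading of the original files as described above.

```python
LINE_TEXT_BYTES = 36   # bytes per line that appear on screen
LINE_PAD_BYTES = 4     # two spaces plus CR/LF in the original files
LINE_STRIDE = LINE_TEXT_BYTES + LINE_PAD_BYTES


def encode_line(text: str) -> bytes:
    """Encode one line to Shift-JIS and pad it to the fixed 40-byte stride."""
    raw = text.encode("shift_jis")
    if len(raw) > LINE_TEXT_BYTES:
        raise ValueError("line exceeds the 36 visible bytes")
    return raw.ljust(LINE_TEXT_BYTES, b" ") + b"  \r\n"


def visible_lines(data: bytes) -> list[bytes]:
    """Slice the on-screen part out of every 40-byte line."""
    return [
        data[i:i + LINE_TEXT_BYTES]
        for i in range(0, len(data), LINE_STRIDE)
    ]
```

Since every line is addressed purely by its offset, an editor that trims the trailing spaces shifts all following lines and boxes, which is exactly the failure mode warned about above.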
Consequently, everything else is hardcoded – every effect shown between text
boxes, the face portrait shown for each box, and even how many boxes are
part of each dialog sequence. Which means that the source code now contains
a long hardcoded list of face IDs for most of the text boxes in the game,
with the rest being part of the dedicated hardcoded dialog scripts for 2/3 of
the game's stages.
Without the restriction to a fixed set of scripting commands, TH02 naturally
gravitated to having the most varied dialog sequences of all PC-98 Touhou
games. This flexibility certainly facilitated Mima's grand entrance
animation in Stage 4, or the different lines in Stage 4 and 5 depending on
whether you already used a continue or not. Marisa's post-boss dialog even
inserts the number of continues into the text itself – by, you guessed it,
writing to hardcoded byte offsets inside the dialog text before printing it
to the screen. But once again, I have nothing to
criticize here – not even the fact that the alternate dialog scripts have to
mutate the "box cursor" to jump to the intended boxes within the file. I
know that some people in my audience like VMs, but I would have considered
it more bloated if ZUN had implemented a full-blown scripting
language just to handle all these special cases.
Another unique aspect of TH02 is the way it stores its face portraits, which
are infamous for how hard they are to find in the original data files. These
sprites are actually map tiles, stored in MIKO_K.MPN,
and drawn using the same functions used to blit the regular map tiles to the
📝 tile source area in VRAM. We can only guess
why ZUN chose this one out of the three graphics formats he used in TH02:
BFNT supports transparency, but sacrifices one of the 16 colors to do
so. ZUN only used 15 colors for the face portraits, but might have wanted to
keep open the option to use that 16th color. The detailed
backgrounds also suggest that these images were never supposed to be
transparent to begin with.
PI is used for all bigger and non-transparent images, but ZUN would have
had to write a separate small function to blit a 48×48 subsection of such an
image. That certainly wouldn't have stopped him in the TH01 days, but he
probably was already past that point by this game.
That only leaves .MPN. Sure, he did have to slice each face into 9
separate 16×16 "map" tiles to use this format, but that's a small price to
pay in exchange for not having to write any new low-level blitting code,
especially since he must have already had an asset pipeline to generate
these files.
TH02's MIKO_K.MPN, arranged into a 16×16-tile layout that
reveals how these tiles are combined into face portraits. MPNDEF from -Tom-'s MysticTK conveniently uses
this exact layout in its .BMP output. Earlier MPNDEF
versions crashed when converting this file as its 256 tiles led to an
8-bit overflow bug, so make sure you've updated to the current version
from the end of October 2023 if you want to convert this file yourself.
The format stores the 4 bitplanes of each 16×16 tile in order, so good
luck finding a different planar image viewer that would support both
such a tiled layout and a custom palette. Sometimes, a weird
internal format is the best type of obfuscation.
And since you're certainly wondering about all these black tiles at the
edges: Yes, these are not only part of the file and pad it from the required
240×192 pixels to 256×256, but also kept in memory during a stage, wasting
9.5 KiB of conventional RAM. That's 172 seconds of potential input
replay data, just for those people who might still think that we need EMS
for replays.
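These numbers are easy to check. The Python sketch below derives the wasted memory from the tile counts stated above. The replay estimate additionally assumes 1 byte of input data per frame at roughly 56.4 frames per second; neither value is stated here, they are merely what reproduces the 172-second figure.

```python
TILE_W = TILE_H = 16
BITPLANES = 4
TILE_BYTES = TILE_W * TILE_H * BITPLANES // 8

tiles_in_file = (256 // TILE_W) * (256 // TILE_H)
tiles_needed = (240 // TILE_W) * (192 // TILE_H)

wasted_bytes = (tiles_in_file - tiles_needed) * TILE_BYTES
wasted_kib = wasted_bytes / 1024

# Assumptions, not taken from the text above:
ASSUMED_REPLAY_BYTES_PER_FRAME = 1
ASSUMED_FRAMES_PER_SECOND = 56.4

replay_seconds = wasted_bytes / (
    ASSUMED_REPLAY_BYTES_PER_FRAME * ASSUMED_FRAMES_PER_SECOND
)
```

In other words, 76 of the 256 tiles are pure padding at 128 bytes each.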
Alright, we've got the text, we've got the faces, let's slide in the box and
display it all on screen. Apparently though, we also have to blit the player
and option sprites using raw, low-level master.lib function calls in the
process? This can't be right, especially because ZUN
always blits the option sprite associated with the Reimu-A shot type,
regardless of which one the player actually selected. And if you keep moving
above the box area before the dialog starts, you get to see exactly how
wrong this is:
Let's look closer at Reimu's sprite during the slide-in animation, and in
the two frames before:
This one image shows off no less than 4 bugs:
ZUN blits the stationary player sprite here, regardless of whether the
player was previously moving left or right. This is a nice way of indicating
that Reimu stops moving once the dialog starts, but maybe ZUN should
have unblitted the old sprite so that the new one wouldn't have appeared on
top. The game only unblits the 384×64 pixels covered by the dialog box on
every frame of the slide-in animation, so Reimu would only appear correctly
if her sprite happened to be entirely located within that area.
All sprites are shifted up by 1 pixel in frame 2️⃣. This one is not a
bug in the dialog system, but in the main game loop. The game runs the
relevant actions in the following order:
Invalidate any map tiles covered by entities
Redraw invalidated tiles
Decrement the Y coordinate at the top of VRAM according to the
scroll speed
Update and render all game entities
Scroll in new tiles as necessary according to the scroll speed, and
report whether the game has scrolled one pixel past the end of the
map
If that happened, pretend it didn't by incrementing the value
calculated in #3 for all further frames and skipping to
#8.
Issue a GDC SCROLL command to reflect the line
calculated in #3 on the display
Wait for VSync
Flip VRAM pages
Start boss if we're past the end of the map
The problem here: Once the dialog starts, the game has already rendered
an entire new frame, with all sprites being offset by a new Y scroll
offset, without adjusting the graphics GDC's scroll registers to
compensate. Hence, the Y position in 3️⃣ is the correct one, and the
whole existence of frame 2️⃣ is a bug in itself. (Well… OK, probably a
quirk because speedrunning exists, and it would be pretty annoying to
synchronize any video regression tests of the future TH02 Anniversary
Edition if it renders one fewer frame in the middle of a stage.)
ZUN blits the option sprites to their position from frame 1️⃣. This
brings us back to
📝 TH02's special way of retaining the previous and current position in a two-element array, indexed with a VRAM page ID.
Normally, this would be equivalent to using dedicated prev and
cur structure fields and you'd just index it with the back page
for every rendering call. But if you then decide to go single-buffered for
dialogs and render them onto the front page instead…
Note that fixing bug #2 would not cancel out this one – the sprites would
then simply be rendered to their position in the frame before 1️⃣.
And of course, the fixed option sprite ID also counts as a bug.
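The third bug is easier to follow with a toy model of the page-indexed position array. This Python sketch is one plausible reading of the mechanism, with hypothetical names throughout: I'm assuming that the regular game loop writes the new position into the back page's slot before flipping, and that the dialog code keeps indexing with the back page even though it draws onto the front page.

```python
class PageIndexedPos:
    """Toy model of a position stored per VRAM page instead of as prev/cur."""

    def __init__(self):
        self.pos = [None, None]
        self.back = 0  # ID of the page that is currently rendered to

    @property
    def front(self) -> int:
        return self.back ^ 1

    def run_frame(self, new_pos):
        # Update and render onto the back page...
        self.pos[self.back] = new_pos
        # ...then flip pages, turning the back page into the visible one.
        self.back ^= 1


options = PageIndexedPos()
for frame_pos in ["frame 0", "frame 1", "frame 2"]:
    options.run_frame(frame_pos)

shown_on_front_page = options.pos[options.front]
used_by_back_page_index = options.pos[options.back]
```

After frame 2️⃣, the visible front page shows the position from that frame, but the slot for the back page still holds the one from frame 1️⃣. Dropping frame 2️⃣ from the loop shifts both by one frame, matching the note above about fixing bug #2.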
As for the boxes themselves, it's yet another loop that prints 2-byte chunks
of Shift-JIS text at an even slower fixed interval of 3 frames. In an
interesting quirk though, ZUN assumes that every box starts with the name of
the speaking character in its first two fullwidth Shift-JIS characters,
followed by a fullwidth colon. These 6 bytes are displayed immediately at
the start of every box, without the usual delay. The resulting alignment
looks rather janky with Genjii, whose single right-padded 亀
kanji looks quite awkward with the fullwidth space between the name
and the colon. Kind of makes you wonder why ZUN didn't just spell out his
proper name, 玄爺, instead, but I get the stylistic
difference.
In Stage 4, the two-kanji assumption then breaks with Marisa's three-kanji
name, which causes the full-width colon to be printed as the first delayed
character in each of her boxes:
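The timing described above can be sketched as follows. This is a simplified Python model rather than the game's loop: the function name is made up, and showing the first delayed chunk after one full 3-frame interval is an assumption.

```python
NAME_PREFIX_BYTES = 6   # two fullwidth name characters plus a fullwidth colon
CHUNK_BYTES = 2
FRAMES_PER_CHUNK = 3


def reveal_schedule(box: bytes) -> list[tuple[int, bytes]]:
    """Return (frame, bytes) pairs describing when each piece of a box appears."""
    schedule = [(0, box[:NAME_PREFIX_BYTES])]
    rest = box[NAME_PREFIX_BYTES:]
    for n, i in enumerate(range(0, len(rest), CHUNK_BYTES), start=1):
        schedule.append((n * FRAMES_PER_CHUNK, rest[i:i + CHUNK_BYTES]))
    return schedule


genjii = reveal_schedule("亀　：あ".encode("shift_jis"))
marisa = reveal_schedule("魔理沙：あ".encode("shift_jis"))
```

With the padded single-kanji name, the colon is part of the instantly displayed 6 bytes. With the three-kanji name, those 6 bytes are used up by the name itself, and the colon becomes the first chunk that has to wait.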
That's all the issues and quirks in the system itself. The scripts
themselves don't leave much room for bugs as they basically just loop over
the hardcoded face ID array at this level… until we reach the end of the
game. Previously, the slide-in animation could simply use the tile
invalidation and re-rendering system to unblit the box on each frame, which
also explained why Reimu had to be separately rendered on top. But this no
longer works with a custom-rendered boss background, and so the game just
chooses to flood-fill the area with graphics chip color #0:
Then again, transferring pixels from the back page would be just
as wrong as they lag one frame behind. No way around capturing these 384×64
pixels to main memory here… Oh well, this flood-fill at least adds even more
legibility on top of the already half-transparent text box. A property that
the following dialog sequence unfortunately lacks…
For Mima's final defeat dialog though, ZUN chose to not even show the box.
He might have realized the issue by that point, or simply preferred the more
dramatic effect this had on the lines. The resulting issues, however, might
even have ramifications for such un-technical things as lore and
character dynamics. As it turns out, the code
for this dialog sequence does in fact render Mima's smiling face for all
boxes?! You only don't see it in the original game because it's rendered to
the other VRAM page that remains invisible during the dialog sequence:
Caution, flashing lights.
Here's how I interpret the situation:
The function that launches into the final part of the dialog script
starts with dedicated
code to re-render Mima to the back page, on top of the previously
rendered planet background. Since the entire script runs on the front
page (and thus, on top of the previous frame) and the game launches into
the ending immediately after, you don't ever get to see this new partial
frame in the original game.
Showing this partial frame would also ensure that you can actually
read the dialog text without a surrounding box. Then, the white
letters won't ever be put on top of any white bullets – or, worse, be completely invisible if the
dialog is triggered in the middle of Reimu-B's bomb animation, which
fills VRAM with lots of white pixels.
Hence, we've got enough evidence to classify not showing the back page
as a ZUN bug. 🐞
However, Mima's smiling face jars with the words she says here. Adding
the face would deviate more significantly from the original game than
removing the player shot, item, bullet, or spark sprites would. It's
imaginable that ZUN just forgot about the dedicated code that
re-rendered just Mima to the back page, but the faces add
something to the dialog, and ZUN would have clearly noticed and
fixed it if their absence wasn't intended. Heck, ZUN might have just put
something related to Mima into the code because TH02's dialog system has
no way of not drawing a face for a dialog box. Filling the face
area with graphics chip color #0, as seen in the first and third boxes
of the Extra Stage pre-boss dialog, would have been an alternative, but
that would have been equally wrong with regard to the background.
Hence, the invisible face portrait from the original game is a ZUN quirk. 🎺
So, the future TH02 Anniversary Edition will fix the bug by showing
the back page, but retain the quirk by rewriting the dialog code to
not blit the face.
And with that, we've secured all in-game dialog for the upcoming non-ASCII
translations! The remaining 2/3 of the last push made
for a good occasion to also decompile the small amount of code related to
TH03's win messages, stored in the @0?TX.TXT files. Similar to
TH02's dialog format, these files are also split into fixed-size blocks of
3×60 bytes. But this time, TH03 loads all 60 bytes of a line, including the
CR/LF line breaking codepoints in the original files, into the statically
allocated buffer that it renders from. These control characters are then
only filtered to whitespace by ZUN's graph_putsa_fx() function.
If you remove the line breaks, you get to use the full 60 bytes on every
line.
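As a rough illustration, here is how such a file could be sliced in Python. The helper names are hypothetical, and mapping every byte below 0x20 to a space is only an assumption about how the control characters get filtered. I haven't reproduced the actual logic of graph_putsa_fx() here.

```python
LINE_BYTES = 60
LINES_PER_MESSAGE = 3
MESSAGE_BYTES = LINE_BYTES * LINES_PER_MESSAGE


def message_lines(data: bytes, index: int) -> list[bytes]:
    """Return the three raw 60-byte lines of the win message at `index`."""
    block = data[index * MESSAGE_BYTES:(index + 1) * MESSAGE_BYTES]
    return [
        block[i:i + LINE_BYTES]
        for i in range(0, MESSAGE_BYTES, LINE_BYTES)
    ]


def as_rendered(line: bytes) -> bytes:
    """Assumed filter: control characters such as CR/LF turn into spaces."""
    return bytes(0x20 if b < 0x20 else b for b in line)
```

A line that ends in CR/LF therefore renders with two trailing blanks, while a line without them can fill all 60 bytes with text.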
The final commits went to the MIKO.CFG loading and saving
functions used in TH04's and TH05's OP.EXE, as well as TH04's
game startup code to finally catch up with
📝 TH05's counterpart from over 3 years ago.
This brought us right in front of the main menu rendering code in both TH04
and TH05, which is identical in both games and will be tackled in the next
PC-98 Touhou delivery.
Next up, though: Returning to Shuusou Gyoku, and adding support for SC-88Pro
recordings as BGM. Which may or may not come with a slight controversy…
P0240
TH04 PI/RE (Stage 5 star rendering + Stage 6 Yuuka checkerboard + Custom entity structures, part 1/2)
P0241
TH04 PI/RE (Custom entity structures, part 2/2 + Thick laser structure + PI false positives + .STD loading)
💰 Funded by:
JonathKane, Blue Bolt, [Anonymous]
🏷️ Tags:
Well, well. My original plan was to ship the first step of Shuusou Gyoku
OpenGL support on the next day after this delivery. But unfortunately, the
complications just kept piling up, to a point where the required solutions
definitely blow the current budget for that goal. I'm currently sitting on
over 70 commits that would take at least 5 pushes to deliver as a meaningful
release, and all of that is just rearchitecting work, preparing the
game for a not too Windows-specific OpenGL backend in the first place. I
haven't even written a single line of OpenGL yet… 🥲
This shifts the intended Big Release Month™ to June after all. Now I know
that the next round of Shuusou Gyoku features had better start with the
SC-88Pro recordings, which are much more likely to get done within their
current budget. At least I've already completed the configuration versioning
system required for that goal, which leaves only the actual audio part.
So, TH04 position independence. Thanks to a bit of funding for stage
dialogue RE, non-ASCII translations will soon become viable, which finally
presents a reason to push TH04 to 100% position independence after
📝 TH05 had been there for almost 3 years. I
haven't heard back from Touhou Patch Center about how much they want to be
involved in funding this goal, if at all, but maybe other backers are
interested as well.
And sure, it would be entirely possible to implement non-ASCII translations
in a way that retains the layout of the original binaries and can be easily
compared at a binary level, in case we consider translations to be a
critical piece of infrastructure. This wouldn't even just be an exercise in
needless perfectionism, and we only have to look to Shuusou Gyoku to realize
why: Players expected
that my builds were compatible with existing SpoilerAL SSG files, which
was something I hadn't even considered the need for. I mean, the game is
open-source 📝 and I made it easy to build.
You can just fork the code, implement all the practice features you want in
a much more efficient way, and I'd probably even merge your code into my
builds then?
But I get it – recompiling the game yields just yet another build that can't
be easily compared to the original release. A cheat table is much more
trustworthy in giving players the confidence that they're still practicing
the same original game. And given the current priorities of my backers,
it'll still take a while for me to implement proof by replay validation,
which will ultimately free every part of the community from depending on the
original builds of both Seihou and PC-98 Touhou.
However, such an implementation within the original binary layout would
significantly drive up the budget of non-ASCII translations, and I sure
don't want to constantly maintain this layout during development. So, let's
chase TH04 position independence like it's 2020, and quickly cover a larger
amount of PI-relevant structures and functions at a shallow level. The only
parts I decompiled for now contain calculations whose intent can't be
clearly communicated in ASM. Hitbox visualizations or other more in-depth
research would have to wait until I get to the proper decompilation of these
features.
But even this shallow work left us with a large amount of TH04-exclusive
code that had its worst parts RE'd and could be decompiled fairly quickly.
If you want to see big TH04 finalization% gains, general TH04 progress would
be a very good investment.
The first push went to the often-mentioned stage-specific custom entities
that share a single statically allocated buffer. Back in 2020, I
📝 wrongly claimed that these were a TH05 innovation,
but the system actually originated in TH04. Both games use a 26-byte
structure, but TH04 only allocates a 32-element array rather than TH05's
64-element one. The conclusions from back then still apply, but I also kept
wondering why these games used a static array for these entities to begin
with. You know what they call an area of memory that you can cleanly
repurpose for things? That's right, a heap!
And absolutely no one would mind one additional heap allocation at the start
of a stage, next to the ones for all the sprites and portraits.
However, we are still running in Real Mode with segmented memory. Accessing
anything outside a common data segment involves modifying segment registers,
which has a nonzero CPU cycle cost, and Turbo C++ 4.0J is terrible at
optimizing away the respective instructions. Does this matter? Probably not,
but you don't take "risks" like these if you're in a permanent
micro-optimization mindset…
In TH04, this system is used for:
Kurumi's symmetric bullet spawn rays, fired from her hands towards the left
and right edges of the playfield. These are rather infamous for being the
last thing you see before
📝 the Divide Error crash that can happen in ZUN's original build.
Capped to 6 entities.
The 4 📝 bits used in Marisa's Stage 4 boss
fight. Coincidentally also related to the rare Divide Error
crash in that fight.
Stage 4 Reimu's spinning orbs. Note how the game uses two different sets
of sprites just to have two different outline colors. This was probably
better than messing with the palette, which can easily cause unintended
effects if you only have 16 colors to work with. Heck, I have an entire blog post tag just to highlight
these cases. Capped to the full 32 entities.
The chasing cross bullets, seen in Phase 14 of the Stage 6 Yuuka
fight. Featuring some smart sprite work, making use of point symmetry to
achieve a fluid animation in just 4 frames. This is
good-code in sprite form. Capped to 31 entities, because the 32nd custom entity during this fight is defined to be…
The single purple pulsating and shrinking safety circle, seen in Phase 4 of
the same fight. The most interesting aspect here is actually still related
to the cross bullets, whose spawn function is wrongly limited to 32 entities
and could theoretically overwrite this circle. This
is strictly landmine territory though:
Yuuka never uses these bullets and the safety circle
simultaneously
She never spawns more than 24 cross bullets
All cross bullets are fast enough to have left the screen by the
time Yuuka restarts the corresponding subpattern
The cross bullets spawn at Yuuka's center position, and assign its
Q12.4 coordinates to structure fields that the safety circle interprets
as raw pixels. The game does try to render the circle afterward, but
since Yuuka's static position during this phase is nowhere near a valid
pixel coordinate, it is immediately clipped.
The flashing lines seen in Phase 5 of the Gengetsu fight,
telegraphing the slightly random bullet columns.
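To put a number on the unit mismatch from the last cross bullet point: A Q12.4 value is a pixel coordinate multiplied by 16, so reading it back as raw pixels lands far outside the screen. Here's a minimal sketch of that. The helper names are hypothetical and not taken from the game or the ReC98 codebase, and the coordinate used in the note below is just an example, not Yuuka's actual position.

```cpp
#include <cassert>
#include <cstdint>

// Q12.4 fixed-point: 12 integer bits, 4 fractional bits.
// All names in this sketch are hypothetical.
typedef int16_t subpixel_t;

const int SUBPIXEL_FACTOR = 16; // 1 << 4
const int SCREEN_W = 640;       // PC-98 screen width in pixels

// Converts a pixel coordinate to Q12.4.
subpixel_t to_subpixel(int pixel) {
    assert((pixel >= -2048) && (pixel <= 2047)); // must fit into 12 bits
    return static_cast<subpixel_t>(pixel * SUBPIXEL_FACTOR);
}

// Correct interpretation: divide the fractional bits away again.
int to_pixel(subpixel_t sp) {
    return (sp / SUBPIXEL_FACTOR);
}

// What a consumer does if it treats the same field as raw pixels.
int misread_as_pixel(subpixel_t sp) {
    return sp;
}

bool is_on_screen_x(int pixel_x) {
    return ((pixel_x >= 0) && (pixel_x < SCREEN_W));
}
```

With an example X coordinate of 192 pixels, `to_subpixel()` yields 3072. Misread as raw pixels, that is well beyond the 640-pixel screen width, which matches the immediate clipping described above.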
These structures only took 1 push to reverse-engineer rather than the 2 I
needed for their TH05 counterparts because they are much simpler in this
game. The "structure" for Gengetsu's lines literally uses just a single X
position, with the remaining 24 bytes being basically padding. The only
minor bug I found on this shallow level concerns Marisa's bits, which are
clipped at the right and bottom edges of the playfield 16 pixels earlier
than you would expect:
The remaining push went to a bunch of smaller structures and functions:
The structure for the up to 2 "thick" (a.k.a. "Master Spark") lasers. Much
saner than the
📝 madness of TH05's laser system while being
equally customizable in width and duration.
The structure for the various monochrome 16×16 shapes in the background of
the Stage 6 Yuuka fight, drawn on top of the checkerboard.
The rendering code for the three falling stars in the background of Stage 5.
The effect here is entirely palette-related: After blitting the stage tiles,
the 📝 1bpp star image is ORed
into only the 4th VRAM plane, which is equivalent to setting the
highest bit in the palette color index of every pixel within the star-shaped
region. This of course raises the question of what the stage would look like
if it was fully illuminated:
The full tile map of TH04's Stage 5, in both dark and fully
illuminated views. Since the illumination effect depends on two
matching sets of palette colors that are distinguished by a single
bit, the illuminated view is limited to only 8 of the 16 colors. The
dark view, on the other hand, can freely use colors from the
illuminated set, since those are unaffected by the OR
operation.
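In code, the whole effect boils down to a single OR on the 4th plane. On a 4-plane planar display, bit n of a pixel's color index lives in plane n, so setting a bit in plane 3 is the same as ORing 8 into the color index. The following is only a model of that arithmetic with made-up names, operating on a plain struct instead of actual VRAM:

```cpp
#include <cassert>
#include <cstdint>

// The same 8 horizontal pixels, one byte per bitplane (B, R, G, E).
struct PlanarByte {
    uint8_t plane[4];
};

// Returns the 4-bit color index of the pixel at `bit` (7 = leftmost).
int color_at(const PlanarByte& p, int bit) {
    assert((bit >= 0) && (bit <= 7));
    int color = 0;
    for(int i = 0; i < 4; i++) {
        color |= (((p.plane[i] >> bit) & 1) << i);
    }
    return color;
}

// ORs a 1bpp mask into only the 4th plane. Every pixel covered by the
// mask ends up with the highest bit of its color index set.
void illuminate(PlanarByte& p, uint8_t mask_1bpp) {
    p.plane[3] |= mask_1bpp;
}
```

Pixels that already use one of the upper 8 colors come out of `illuminate()` unchanged, which is why the dark view can freely use colors from the illuminated set.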
Most code that modifies a stage's tile map, and directly specifies tiles via
their top-left offset in VRAM.
Thanks to code alignment reasons, this forced a much longer detour into the
.STD format loader. Nothing all too noteworthy there since we're still
missing the enemy script and spawn structures before we can call .STD
"reverse-engineered", but maybe still helpful if you're looking for an
overview of the format. Also features a buffer overflow landmine if a .STD
file happens to contain more than 32 enemy scripts… you know, the usual
stuff.
To top off the second push, we've got the vertically scrolling checkerboard
background during the Stage 6 Yuuka fight, made up of 32×32 squares. This
one deserves a special highlight just because of its needless complexity.
You'd think that even a performant implementation would be pretty simple:
Set the GRCG to TDW mode
Set the GRCG tile to one of the two square colors
Start with Y as the current scroll offset, and X
as some indicator of which color is currently shown at the start of each row
of squares
Iterate over all lines of the playfield, filling in all pixels that
should be displayed in the current color, skipping over the other ones
Count down Y for each line drawn
If Y reaches 0, reset it to 32 and flip X
At the bottom of the playfield, change the GRCG tile to the other color,
and repeat with the initial value of X flipped
The most important aspect of this algorithm is how it reduces GRCG state
changes to a minimum, avoiding the costly port I/O that we've identified
time and time again as one of the main bottlenecks in TH01. With just 2
state variables and 3 loops, the resulting code isn't that complex either. A
naive implementation that just drew the squares from top to bottom in a
single pass would barely be simpler, but much slower: By changing the GRCG
tile on every color, such an implementation would burn a low 5-digit number
of CPU cycles per frame for the 12×11.5-square checkerboard used in the
game.
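For reference, here is how I'd sketch the two-pass algorithm from the list above. This is not a decompilation of anything: It fills a plain one-byte-per-pixel buffer instead of VRAM, and the GRCG tile change is reduced to a counter, so all names and the buffer layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const int PLAYFIELD_W = 384;
const int PLAYFIELD_H = 368;
const int SQUARE = 32;

struct Result {
    std::vector<uint8_t> pixels; // one color per pixel, row-major
    int tile_changes;            // stand-in for GRCG tile port writes
};

// One pass over all lines, drawing only the squares of a single color.
// `x` tells whether this color starts in the odd (true) or even (false)
// squares of the topmost row.
static void pass(Result& r, uint8_t color, int scroll, bool x) {
    r.tile_changes++; // the only state change in this pass
    int y = scroll;   // lines left in the current row of squares
    for(int line = 0; line < PLAYFIELD_H; line++) {
        for(int sq = (x ? 1 : 0); sq < (PLAYFIELD_W / SQUARE); sq += 2) {
            for(int px = 0; px < SQUARE; px++) {
                r.pixels[(line * PLAYFIELD_W) + (sq * SQUARE) + px] = color;
            }
        }
        if(--y == 0) {
            y = SQUARE;
            x = !x;
        }
    }
}

Result checkerboard(int scroll, uint8_t color_a, uint8_t color_b) {
    assert((scroll >= 1) && (scroll <= SQUARE));
    Result r;
    r.pixels.assign(
        (static_cast<std::size_t>(PLAYFIELD_W) * PLAYFIELD_H), uint8_t(0xFF)
    );
    r.tile_changes = 0;
    pass(r, color_a, scroll, false);
    pass(r, color_b, scroll, true); // same loop, initial X flipped
    return r;
}
```

Regardless of the scroll offset, `tile_changes` always ends up at 2, and every pixel of the playfield is covered by exactly one of the two passes.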
And indeed, ZUN retained all important aspects of this algorithm… but still
implemented it all in ASM, with a ridiculous layer of x86 segment arithmetic
on top? Which blows up the complexity to 4 state
variables, 5 nested loops, and a bunch of constants in unusual units. I'm
not sure what this code is supposed to optimize for, especially with that
rather questionable register allocation that nevertheless leaves one of the
general-purpose registers unused. Fortunately,
the function was still decompilable without too many code generation hacks,
and retains the 5 nested loops in all their goto-connected
glory. If you want to add a checkerboard to your next PC-98
demo, just stick to the algorithm I gave above.
(Using a single XOR for flipping the starting X offset between 32 and 64
pixels is pretty nice though, I have to give him that.)
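In case the trick isn't obvious: 32 and 64 differ in exactly two bits, so XORing with `32 ^ 64` (= 96) swaps one value for the other. The function below only shows the general idiom under a made-up name; it is not ZUN's actual instruction sequence.

```cpp
#include <cassert>

// Toggles a value between 32 and 64 with a single XOR.
// Only meaningful if `x` is one of these two values.
int flip_start_x(int x) {
    assert((x == 32) || (x == 64));
    return (x ^ (32 ^ 64));
}
```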
This makes for a good occasion to talk about the third and final GRCG mode,
completing the series I started with my previous coverage of the
📝 RMW and
📝 TCR modes. The TDW (Tile Data Write) mode
is the simplest of the three and just writes the 8×1 GRCG tile into VRAM
as-is, without applying any alpha bitmask. This makes it perfect for
clearing rectangular areas of pixels – or even all of VRAM by doing a single
memset():
// Set up the GRCG in TDW mode.
outportb(0x7C, 0x80);
// Fill the tile register with color #7 (0111 in binary).
outportb(0x7E, 0xFF); // Plane 0: (B): (********)
outportb(0x7E, 0xFF); // Plane 1: (R): (********)
outportb(0x7E, 0xFF); // Plane 2: (G): (********)
outportb(0x7E, 0x00); // Plane 3: (E): (        )
// Set the 32 pixels at the top-left corner of VRAM to the exact contents of
// the tile register, effectively repeating the tile 4 times. In TDW mode, the
// GRCG ignores the CPU-supplied operand, so we might as well just pass the
// contents of a register with the intended width. This eliminates useless load
// instructions in the compiled assembly, and even sort of signals to readers
// of this code that we do not care about the source value.
*reinterpret_cast<uint32_t far *>(MK_FP(0xA800, 0)) = _EAX;
// Fill the entirety of VRAM with the GRCG tile. A simple C one-liner that will
// probably compile into a single `REP STOS` instruction. Unfortunately, Turbo
// C++ 4.0J only ever generates the 16-bit `REP STOSW` here, even when using
// the `__memset__` intrinsic and when compiling in 386 mode. When targeting
// that CPU and above, you'd ideally want `REP STOSD` for twice the speed.
memset(MK_FP(0xA800, 0), _AL, ((640 / 8) * 400));
However, this might make you wonder why TDW mode is even necessary. If it's
functionally equivalent to RMW mode with a CPU-supplied bitmask made up
entirely of 1 bits (i.e., 0xFF, 0xFFFF, or
0xFFFFFFFF), what's the point? The difference lies in the
hardware implementation: If all you need to do is write tile data to
VRAM, you don't need the read and modify parts of RMW mode
which require additional processing time. The PC-9801 Programmers'
Bible claims a speedup of almost 2× when using TDW mode over equivalent
operations in RMW mode.
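The functional equivalence is easy to show in a small model. The sketch below simulates only the result of a single byte-sized write in both modes on a plain struct. It says nothing about timing, which is the part that actually differs in hardware, and the type and function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

struct VramByte { uint8_t plane[4]; }; // 8 pixels across the 4 bitplanes
struct GrcgTile { uint8_t plane[4]; }; // contents of the tile register

// TDW: The tile is written as-is, the CPU-supplied value is ignored.
void write_tdw(VramByte& v, const GrcgTile& t) {
    for(int i = 0; i < 4; i++) {
        v.plane[i] = t.plane[i];
    }
}

// RMW: The CPU-supplied value selects which pixels receive tile data.
// All other pixels keep what was in VRAM before.
void write_rmw(VramByte& v, const GrcgTile& t, uint8_t mask) {
    for(int i = 0; i < 4; i++) {
        v.plane[i] = static_cast<uint8_t>(
            (v.plane[i] & ~mask) | (t.plane[i] & mask)
        );
    }
}
```

With a mask of 0xFF, `write_rmw()` leaves nothing of the old VRAM contents behind and produces the same bytes as `write_tdw()`. The read and modify steps are pure overhead in that case.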
And that's the only performance claim I found, because none of these old
PC-98 hardware and programming books did any benchmarks. Then again, it's
not too interesting of a question to benchmark either, as the byte-aligned
nature of TDW blitting severely limits its use in a game engine anyway.
Sure, maybe it makes sense to temporarily switch from RMW to TDW mode
if you've identified a large rectangular and byte-aligned section within a
sprite that could be blitted without a bitmask? But the necessary
identification work likely nullifies the performance gained from TDW mode,
I'd say. In any case, that's pretty deep
micro-optimization territory. Just use TDW mode for the
few cases it's good at, and stick to RMW mode for the rest.
So is this all that can be said about the GRCG? Not quite, because there are
4 bits I haven't talked about yet…
And now we're just 5.37% away from 100% position independence for TH04! From
this point, another 2 pushes should be enough to reach this goal. It might
not look like we're that close based on the current estimate, but a
big chunk of the remaining numbers are false positives from the player shot
control functions. Since we've got a very special deadline to hit, I'm going
to cobble these two pushes together from the two current general
subscriptions and the rest of the backlog. But you can, of course, still
invest in this goal to allow the existing contributions to go to something
else.
… Well, if the store was actually open. So I'd better
continue with a quick task to free up some capacity sooner rather than
later. Next up, therefore: Back to TH02, and its item and player systems.
Shouldn't take that long, I'm not expecting any surprises there. (Yeah, I
know, famous last words…)
More than three months without any reverse-engineering progress! It's been
way too long. Coincidentally, we're at least back with a surprising 1.25% of
overall RE, achieved within just 3 pushes. The ending script system is not
only more or less the same in TH04 and TH05, but actually originated in
TH03, where it's also used for the cutscenes before stages 8 and 9. This
means that it was one of the final pieces of code shared between three of
the four remaining games, which I got to decompile at roughly 3× the usual
speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The
Music Room is largely equivalent in all three remaining games as well, and
the sound device selection, ZUN Soft logo screens, and main/option menus are
the same in TH04 and TH05. A lot of that code is in the "technically RE'd
but not yet decompiled" ASM form though, so it would shift Finalized% more
significantly than RE%. Therefore, make sure to order the new
Finalization option rather than Reverse-engineering if you
want to make number go up.
So, cutscenes. On the surface, the .TXT files look simple enough: You
directly write the text that should appear on the screen into the file
without any special markup, and add commands to define visuals, music, and
other effects at any place within the script. Let's start with the basics of
how text is rendered, which are the same in all three games:
First off, the text area has a size of 480×64 pixels. This means that it
does not correspond to the tiled area painted into TH05's
EDBK?.PI images:
The yellow area is designated for character names.
Since the font weight can be customized, all text is rendered to VRAM.
This also includes gaiji, despite them ignoring the font weight
setting.
The system supports automatic line breaks on a per-glyph basis, which
move the text cursor to the beginning of the red text area. This might seem like a piece of long-forgotten
ancient wisdom at first, considering the absence of automatic line breaks in
Windows Touhou. However, ZUN probably implemented it more out of pure
necessity: Text in VRAM needs to be unblitted when starting a new box, which
is way more straightforward and performant if you only need to worry
about a fixed area.
The system also automatically starts a new (key press-separated) text
box after the end of the 4th line. However, the text cursor is
also unconditionally moved to the top-left corner of the yellow name
area when this happens, which is almost certainly not what you expect, given
that automatic line breaks stay within the red area. A script author might
as well add the necessary text box change commands manually, if you're
forced to anticipate the automatic ones anyway…
Due to ZUN forgetting an unblitting call during the TH05 refactoring of the
box background buffer, this feature is even completely broken in that game,
as any new text will simply be blitted on top of the old one:
Wait, why are we already talking about game-specific differences after
all? Also, note how the ⏎ animation appears one line below where you'd
expect it.
Overall, the system is geared toward exclusively full-width text. As
exemplified by the 2014 static English patches and the screenshots in this
blog post, half-width text is possible, but comes with a lot of
asterisks attached:
Each loop of the script interpreter starts by looking at the next
byte to distinguish commands from text. However, this step also skips
over every ASCII space and control character, i.e., every byte
≤ 32. If you only intend to display full-width glyphs anyway, this
sort of makes sense: You gain complete freedom when it comes to the
physical layout of these script files, and it especially allows commands
to be freely separated with spaces and line breaks for improved
readability. Still, enforcing commands to be separated exclusively by
line breaks might have been even better for readability, and would have
freed up ASCII spaces for regular text…
Non-command text is blindly processed and rendered two bytes at a
time. The rendering function interprets these bytes as a Shift-JIS
string, so you can use half-width characters here. While the
second byte can even be an ASCII 0x20 space due to the
parser's blindness, all half-width characters must still occur in pairs
that can't be interrupted by commands:
As a workaround for at least the ASCII space issue, you can replace
them with any of the unassigned
Shift-JIS lead bytes – 0x80, 0xA0, or
anything between 0xF0 and 0xFF inclusive.
That's what you see in all screenshots of this post that display
half-width spaces.
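The interaction between these parsing rules is easier to see in code. The following is a heavily simplified model of the loop as described above, not the games' actual interpreter: It assumes that every command consists of just the 0x5C lead byte plus a single name byte without parameters, and it ignores TH05's `@` commands entirely. All names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum TokenKind { TOKEN_COMMAND, TOKEN_TEXT };

struct Token {
    TokenKind kind;
    std::string bytes; // Command: name byte. Text: the 2-byte group.
};

std::vector<Token> tokenize(const std::string& script) {
    std::vector<Token> ret;
    std::size_t i = 0;
    while(i < script.size()) {
        const unsigned char c = static_cast<unsigned char>(script[i]);
        if(c <= 32) {
            // Spaces and control characters are skipped, but only at
            // the start of a loop iteration.
            i++;
            continue;
        }
        Token t;
        if(c == 0x5C) {
            // Simplification: single-byte command name, no parameters.
            t.kind = TOKEN_COMMAND;
            t.bytes = script.substr((i + 1), 1);
        } else {
            // Text is consumed blindly, two bytes at a time.
            t.kind = TOKEN_TEXT;
            t.bytes = script.substr(i, 2);
        }
        ret.push_back(t);
        i += 2;
    }
    return ret;
}
```

In this model, `"a b"` turns into the text pair `"a "` followed by a lone `"b"`, because the space is only skipped when it comes first. Likewise, `"a\n"` swallows the backslash as the second byte of a text pair, so the command is never recognized.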
Finally, did you know that you can hold ESC to fast-forward
through these cutscenes, which skips most frame delays and reduces the rest?
Due to the blocking nature of all commands, the ESC key state is
only updated between commands or 2-byte text groups though, so it can't
interrupt an ongoing delay.
Superficially, the list of game-specific differences doesn't look too long,
and can be summarized in a rather short table:
It's when you get into the implementation that the combined three systems
reveal themselves as a giant mess, with more like 56 differences between the
games. Every single new weird line of code opened up
another can of worms, which ultimately made all of this end up with 24
pieces of bloat and 14 bugs. The worst of these should be quite interesting
for the general PC-98 homebrew developers among my audience:
The final official 0.23 release of master.lib has a bug in
graph_gaiji_put*(). To calculate the JIS X 0208 code point for
a gaiji, it is enough to ADD 5680h onto the gaiji ID. However,
these functions accidentally use ADC instead, which incorrectly
adds the x86 carry flag on top, causing weird off-by-one errors based on the
previous program state. ZUN did fix this bug directly inside master.lib for
TH04 and TH05, but still needed to work around it in TH03 by subtracting 1
from the intended gaiji ID. Anyone up for maintaining a bug-fixed master.lib
repository?
The worst piece of bloat comes from TH03 and TH04 needlessly
switching the visibility of VRAM pages while blitting a new 320×200 picture.
This makes it much harder to understand the code, as the mere existence of
these page switches is enough to suggest a more complex interplay between
the two VRAM pages which doesn't actually exist. Outside this visibility
switch, page 0 is always supposed to be shown, and page 1 is always used
for temporarily storing pixels that are later crossfaded onto page 0. This
is also the only reason why TH03 has to render text and gaiji onto both VRAM
pages to begin with… and because TH04 doesn't, changing the picture in the
middle of a string of text is technically bugged in that game, even though
you only get to temporarily see the new text on very underclocked PC-98
systems.
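Going back to the master.lib bug from the first point for a moment, the difference between the two instructions can be modeled in a few lines. This only mirrors the arithmetic described above, with the x86 carry flag passed in as an explicit parameter. The function names are hypothetical and don't exist in master.lib.

```cpp
#include <cassert>
#include <cstdint>

const uint16_t GAIJI_JIS_BASE = 0x5680;

// Intended calculation: a plain ADD.
uint16_t gaiji_jis_add(uint8_t gaiji_id) {
    return static_cast<uint16_t>(GAIJI_JIS_BASE + gaiji_id);
}

// Buggy calculation: ADC also adds whatever the carry flag happens to
// be at this point.
uint16_t gaiji_jis_adc(uint8_t gaiji_id, bool carry_flag) {
    return static_cast<uint16_t>(
        GAIJI_JIS_BASE + gaiji_id + (carry_flag ? 1 : 0)
    );
}
```

In this model, subtracting 1 from the intended ID only cancels out the error if the carry flag is set when the function runs; with a cleared carry flag, both versions return the same code point.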
These performance implications made me wonder why cutscenes even bother with
writing to the second VRAM page anyway, before copying each crossfade step
to the visible one.
📝 We learned in June how costly EGC-"accelerated" inter-page copies are;
shouldn't it be faster to just blit the image once rather than twice?
Well, master.lib decodes .PI images into a packed-pixel format, and
unpacking such a representation into bitplanes on the fly is just about the
worst way of blitting you could possibly imagine on a PC-98. EGC inter-page
copies are already fairly disappointing at 42 cycles for every 16 pixels, if
we look at the i486 and ignore VRAM latencies. But under the same
conditions, packed-pixel unpacking comes in at 81 cycles for every 8
pixels, or almost 4× slower. On lower-end systems, that can easily sum up to
more than one frame for a 320×200 image. While I'd argue that the resulting
tearing could have been an acceptable part of the transition between two
images, it's understandable why you'd want to avoid it in favor of the
pure effect on a slower framerate.
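To make the comparison concrete, this is the arithmetic for a single 320×200 picture, using nothing but the two cycle figures quoted above (i486, VRAM latencies ignored). The constant and function names are made up.

```cpp
#include <cassert>

const long PICTURE_W = 320;
const long PICTURE_H = 200;
const long PIXELS = (PICTURE_W * PICTURE_H);

// EGC inter-page copy: 42 cycles for every 16 pixels.
long egc_copy_cycles() {
    return ((PIXELS / 16) * 42);
}

// Packed-pixel unpacking: 81 cycles for every 8 pixels.
long unpack_cycles() {
    return ((PIXELS / 8) * 81);
}
```

That's 168,000 cycles for the EGC copy versus 648,000 cycles for unpacking the same picture, a ratio of roughly 3.86.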
Really makes me wonder why master.lib didn't just directly decode .PI images
into bitplanes. The performance impact on load times should have been
negligible? It's such a good format for
the often dithered 16-color artwork you typically see on PC-98, and
deserves better than master.lib's implementation which is both slow to
decode and slow to blit.
That brings us to the individual script commands… and yes, I'm going to
document every single one of them. Some of their interactions and edge cases
are not clear at all from just looking at the code.
Almost all commands are preceded by… well, a 0x5C lead byte.
Which raises the question of whether we should
document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded
¥ yen sign. From a gaijin perspective, it seems obvious that it's a
backslash, as it's consistently displayed as one in most of the editors you
would actually use nowadays. But interestingly, iconv
-f shift-jis -t utf-8 does convert any 0x5C
lead bytes to actual ¥ U+00A5 YEN SIGN code points.
Ultimately, the distinction comes down to the font. There are fonts
that still render 0x5C as ¥, but mainly do so out
of an obvious concern about backward compatibility to JIS X 0201, where this
mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho,
the old Japanese fonts from Windows 3.1, but even Meiryo and Yu
Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much
every other modern font, and freely licensed ones in particular, render this
code point as \, even if you set your editor to Shift-JIS. And
while ZUN most definitely saw it as a ¥, documenting this code
point as \ is less ambiguous in the long run. It can only
possibly correspond to one specific code point in either Shift-JIS or UTF-8,
and will remain correct even if we later mod the cutscene system to support
full-blown Unicode.
Now we've only got to clarify the parameter syntax, and then we can look at
the big table of commands:
Numeric parameters are read as sequences of up to 3 ASCII digits. This
limits them to a range from 0 to 999 inclusive, with 000 and
0 being equivalent. Because there's no further sentinel
character, any further digit from the 4th one onwards is
interpreted as regular text.
Filename parameters must be terminated with a space or newline and are
limited to 12 characters, which translates to 8.3 basenames without any
directory component. Any further characters are ignored and displayed as
text as well.
Each .PI image can contain up to four 320×200 pictures ("quarters") for
the cutscene picture area. In the script commands, they are numbered like
this:
0 1
2 3
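The numeric parameter rule above could be sketched like this. It's a minimal model with a made-up function name and signature, not the games' actual parsing code. The default value is passed in by the caller and returned if no digit is present at all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Reads up to 3 ASCII digits starting at `pos` and advances `pos` past
// them. Returns `default_value` if there is no digit at `pos`.
int read_number(const std::string& script, std::size_t& pos, int default_value) {
    int value = 0;
    int digits = 0;
    while(
        (digits < 3) &&
        (pos < script.size()) &&
        (script[pos] >= '0') &&
        (script[pos] <= '9')
    ) {
        value = ((value * 10) + (script[pos] - '0'));
        pos++;
        digits++;
    }
    return ((digits == 0) ? default_value : value);
}
```

Called on `"1234"`, this returns 123 and leaves `pos` on the `4`, which the interpreter would then treat as regular text.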
\@
Clears both VRAM pages by filling them with VRAM color 0. 🐞
In TH03 and TH04, this command does not update the internal text area
background used for unblitting. This bug effectively restricts usage of
this command to either the beginning of a script (before the first
background image is shown) or its end (after no more new text boxes are
started). See the image below for an
example of using it anywhere else.
\b2
Sets the font weight to a value from 0 (raw font ROM glyphs) to 3
(very thicc). Specifying any other value has no effect.
🐞 In TH04 and TH05, \b3 leads to glitched pixels when
rendering half-width glyphs due to a bug in the newly micro-optimized
ASM version of
📝 graph_putsa_fx(); see the image below for an example.
In these games, the parameter also directly corresponds to the
graph_putsa_fx() effect function, removing the sanity check
that was present in TH03. In exchange, you can also access the four
dissolve masks for the bold font (\b2) by specifying a
parameter from 4 (fewest pixels) to 7 (most
pixels). Demo video below.
\c15
Changes the text color to VRAM color 15.
\c=字,15
Adds a color map entry: If 字 is the first code point
inside the name area on a new line, the text color is automatically set
to 15. Up to 8 such entries can be registered
before overflowing the statically allocated buffer.
🐞 The comma is assumed to be present even if the color parameter is omitted.
\e0
Plays the sound effect with the given ID.
\f
(no-op)
\fi1
\fo1
Calls master.lib's palette_black_in() or
palette_black_out() to play a hardware palette fade
animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
\fm1
Fades out BGM volume via PMD's AH=02h interrupt call,
in a non-blocking way. The fade speed can range from 1 (slowest) to 127 (fastest).
Values from 128 to 255 technically correspond to
AH=02h's fade-in feature, which can't be used from cutscene
scripts because it requires BGM volume to first be lowered via
AH=19h, and there is no command to do that.
\g8
Plays a blocking 8-frame screen shake
animation.
\ga0
Shows the gaiji with the given ID from 0 to 255
at the current cursor position. Even in TH03, gaiji always ignore the
text delay interval configured with \v.
@3
TH05's replacement for the \ga command from TH03 and
TH04. The default ID of 3 corresponds to the
gaiji. Not to be confused with \@, which starts with a backslash,
unlike this command.
@h
Shows the gaiji.
@t
Shows the gaiji.
@!
Shows the gaiji.
@?
Shows the gaiji.
@!!
Shows the gaiji.
@!?
Shows the gaiji.
\k0
Waits 0 frames (0 = forever) for an advance key to be pressed before
continuing script execution. Before waiting, TH05 crossfades in any new
text that was previously rendered to the invisible VRAM page…
🐞 …but TH04 doesn't, leaving the text invisible during the wait time.
As a workaround, \vp1 can be
used before \k to immediately display that text without a
fade-in animation.
\m$
Stops the currently playing BGM.
\m*
Restarts playback of the currently loaded BGM from the
beginning.
\m,filename
Stops the currently playing BGM, loads a new one from the given
file, and starts playback.
\n
Starts a new line at the leftmost X coordinate of the box, i.e., the
start of the name area. This is how scripts can "change" the name of the
currently speaking character, or use the entire 480×64 pixels without
being restricted to the non-name area.
Note that automatic line breaks already move the cursor into a new line.
Using this command at the "end" of a line with the maximum number of 30
full-width glyphs would therefore start a second new line and leave the
previously started line empty.
If this command moved the cursor into the 5th line of a box,
\s is executed afterward, with
any of \n's parameters passed to \s.
\p
(no-op)
\p-
Deallocates the loaded .PI image.
\p,filename
Loads the .PI image with the given file into the single .PI slot
available to cutscenes. TH04 and TH05 automatically deallocate any
previous image, 🐞 TH03 would leak memory without a manual prior call to
\p-.
\pp
Sets the hardware palette to the one of the loaded .PI image.
\p@
Sets the loaded .PI image as the full-screen 640×400 background
image and overwrites both VRAM pages with its pixels, retaining the
current hardware palette.
\p=
Runs \pp followed by \p@.
\s0
\s-
Ends a text box and starts a new one. Fades in any text rendered to
the invisible VRAM page, then waits 0 frames
(0 = forever) for an advance key to be
pressed. Afterward, the new text box is started with the cursor moved to
the top-left corner of the name area. \s- skips the wait time and starts the new box
immediately.
\t100
Sets palette brightness via master.lib's
palette_settone() to any value from 0 (fully black) to 200
(fully white). 100 corresponds to the palette's original colors.
Preceded by a 1-frame delay unless ESC is held.
\v1
TH03: Sets the number of frames to wait between every 2 bytes of rendered
text.
TH04 and TH05: Sets the number of frames to spend on each of the 4 fade
steps when crossfading between old and new text. The game-specific
default value is also used before the first use of this command.
\v2
\vp0
Shows VRAM page 0. Completely useless in
TH03 (this game always synchronizes both VRAM pages at a command
boundary), only of dubious use in TH04 (for working around a bug in \k), and the games always return to
their intended shown page before every blitting operation anyway. A
debloated mod of this game would just remove this command, as it exposes
an implementation detail that script authors should not need to worry
about. None of the original scripts use it anyway.
\w64
\w and \wk wait for the given number
of frames.
\wm and \wmk wait until PMD has played
back the current BGM for the total number of measures, including
loops, given in the first parameter, and fall back on calling
\w and \wk with the second parameter as
the frame number if BGM is disabled.
🐞 Neither PMD nor MMD resets the internal measure when stopping
playback. If no BGM is playing and the previous BGM hasn't been
played back for at least the given number of measures, this command
will deadlock.
Since both TH04 and TH05 fade in any new text from the invisible VRAM
page, these commands can be used to simulate TH03's typing effect in
those games. Demo video below.
Contrary to \k and \s, specifying 0 frames would
simply remove any frame delay instead of waiting forever.
The TH03-exclusive k variants allow the delay to be
interrupted if ⏎ Return or Shot are held down.
TH04 and TH05 recognize the k as well, but removed its
functionality.
All of these commands have no effect if ESC is held.
\wm64,64
\wk64
\wmk64,64
\wi1
\wo1
Calls master.lib's palette_white_in() or
palette_white_out() to play a hardware palette fade
animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
\=4
Immediately displays the given quarter of the loaded .PI image in
the picture area, with no fade effect. Any value ≥ 4 resets the picture area to black.
\==4,1
Crossfades the picture area between its current content and quarter
#4 of the loaded .PI image, spending 1 frame on each of the 4 fade steps unless
ESC is held. Any value ≥ 4 is
replaced with quarter #0.
\$
Stops script execution. Must be called at the end of each file;
otherwise, execution continues into whatever lies after the script
buffer in memory.
TH05 automatically deallocates the loaded .PI image, TH03 and TH04
require a separate manual call to \p- to not leak its memory.
Bold values signify the default if the parameter
is omitted; \c is therefore
equivalent to \c15.
The \@ bug. Yes, the ¥ is fake. It
was easier to GIMP it than to reword the sentences so that the backslashes
landed on the second byte of a 2-byte half-width character pair.
The font weights and effects available through \b, including the glitch with
\b3 in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if
you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels
at the left side of each glyph, which would extend into the previous
VRAM byte. Ironically, the TH04/TH05 version is more correct in
this regard: For half-width glyphs, it preserves any further pixel
columns generated by the weight functions in the high byte of the 16-dot
glyph variable. Unlike TH03, which still cuts them off when rendering
text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them
towards their correct place (4️⃣). It's only at byte-aligned X positions
(2️⃣) where they remain at their internally calculated place, and appear
on screen as these glitched pixel columns, 15 pixels away from the glyph
they belong to. It's easy to blame bugs like these on micro-optimized
ASM code, but in this instance, you really can't argue against it if the
original C++ version was equally incorrect.
Combining \b and \s- into a partial dissolve
animation. The speed can be controlled with \v.
Simulating TH03's typing effect in TH04 and TH05 via \w. Even prettier in TH05 where we
also get an additional fade animation
after the box ends.
So yeah, that's the cutscene system. I'm dreading the moment I will have to
deal with the other command interpreter in these games, i.e., the
stage enemy system. Luckily, that one is completely disconnected from any
other system, so I won't have to deal with it until we're close to finishing
MAIN.EXE… that is, unless someone requests it before. And it
won't involve text encodings or unblitting…
The cutscene system got me thinking in greater detail about how I would
implement translations, being one of the main dependencies behind them. This
goal has been on the order form for a while and could soon be implemented
for these cutscenes, with 100% PI being right around the corner for the TH03
and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching
for Latin-script languages could be implemented fairly quickly:
Establish basic UTF-8 parsing for less painful manual editing of the
source files
Procedurally generate glyphs for the few required additional letters
based on existing font ROM glyphs. For example, we'd generate ä
by painting two short lines on top of the font ROM's a glyph,
or generate ¿ by vertically flipping the question mark. This
way, the text retains a consistent look regardless of whether the translated
game is run with an NEC or EPSON font ROM, or the one that Neko Project II auto-generates if you
don't provide either.
(Optional) Change automatic line breaks to work on a per-word
basis, rather than per-glyph
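As a taste of what the procedural glyph generation step could look like, here is a sketch of the vertical flip mentioned for ¿. It assumes half-width glyphs stored as 16 rows of one byte each, top to bottom. That layout and all names are assumptions made for this example, and a real implementation would probably need to shift the result to sit properly on the baseline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t GLYPH_H = 16;

// 8×16 half-width 1bpp glyph, one byte per row, top to bottom.
struct Glyph8x16 {
    uint8_t row[GLYPH_H];
};

// Returns a copy of `src` with its rows in reverse order.
Glyph8x16 flip_vertically(const Glyph8x16& src) {
    Glyph8x16 ret;
    for(std::size_t y = 0; y < GLYPH_H; y++) {
        ret.row[y] = src.row[GLYPH_H - 1 - y];
    }
    return ret;
}
```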
That's it – script editing and distribution would be handled by your local
translation group. It might seem as if this would also work for Greek and
Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not
sure if I want to attempt procedurally shrinking these glyphs from 16×16 to
8×16… For any more thorough solution, we'd need to go for a more "Chad" kind
of full-blown translation support:
1) Implement text subdivisions at a sensible granularity while retaining
automatic line and box breaks
2) Compile translatable text into a Japanese→target language dictionary
(I'm too old to develop any further translation systems that would overwrite
modded source text with translations of the original text)
3) Implement a custom Unicode font system (glyphs would be taken from GNU
Unifont unless translators provide a different 8×16 font for their
language)
4) Combine the text compiler with the font compiler to only store needed
glyphs as part of the translation's font file (dealing with a multi-MB font
file would be rather ugly in a Real Mode game)
5) Write a simple install/update/patch stacking tool that supports both
.HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to
warrant a separate tool – each patch stack would be statically compiled into
a single package file in the game's directory)
6) Add a nice language selection option to the main menu
7) (Optional) Support proportional fonts
Which sounds more like a separate project to be commissioned from
Touhou Patch Center's Open Collective funds, separate from the ReC98 cap.
This way, we can make sure that the feature is completely implemented, and I
can talk with every interested translator to make sure that their language
works.
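To illustrate the idea of only storing the needed glyphs, here is a rough sketch of the first half of such a combined text and font compiler: decoding the translated UTF-8 strings and collecting every distinct code point, which would then decide which glyphs go into the font file. The function name is hypothetical, and the decoder is deliberately basic: it skips malformed sequences byte by byte and does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Returns every distinct Unicode code point used in the given UTF-8 lines.
std::set<char32_t> used_codepoints(const std::vector<std::string>& lines)
{
	std::set<char32_t> ret;
	for(const auto& line : lines) {
		std::size_t i = 0;
		while(i < line.size()) {
			const auto b0 = static_cast<uint8_t>(line[i]);
			std::size_t len = 0;
			char32_t cp = 0;
			if(b0 < 0x80) {
				len = 1; cp = b0;
			} else if((b0 & 0xE0) == 0xC0) {
				len = 2; cp = (b0 & 0x1F);
			} else if((b0 & 0xF0) == 0xE0) {
				len = 3; cp = (b0 & 0x0F);
			} else if((b0 & 0xF8) == 0xF0) {
				len = 4; cp = (b0 & 0x07);
			}
			// Invalid lead byte or truncated sequence: skip a single byte.
			if((len == 0) || ((line.size() - i) < len)) {
				i++;
				continue;
			}
			bool valid = true;
			for(std::size_t j = 1; j < len; j++) {
				const auto b = static_cast<uint8_t>(line[i + j]);
				if((b & 0xC0) != 0x80) {
					valid = false;
					break;
				}
				cp = ((cp << 6) | (b & 0x3F));
			}
			if(!valid) {
				i++;
				continue;
			}
			ret.insert(cp);
			i += len;
		}
	}
	return ret;
}
```

The size of the resulting set is an upper bound for the number of glyphs the translation's font file would have to carry.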
It's still cheaper overall to do this on PC-98 than to first port the games
to a modern system and then translate them. On the other hand, most
of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with
the difficulty of getting arbitrary Unicode characters to work natively in a
PC-98 DOS game at all, and would be either unnecessary or trivial if we had
already ported the game. Depending on where the patrons' interests lie, it
may not be worth it. So let's see what all of you think about which
way we should go, or whether it's worth doing at all. (Edit
(2022-12-01): With Splashman's
order towards the stage dialogue system, we've pretty much confirmed that it
is.) Maybe we want to meet in the middle – using e.g. procedural glyph
generation for dynamic translations to keep text rendering consistent with
the rest of the PC-98 system, and just not support non-Latin-script
languages in the beginning? In any case, I've added both options to the
order form. Edit (2023-07-28): Touhou Patch Center has agreed to fund
a basic feature set somewhere between the Virgin and Chad level. Check the
📝 dedicated announcement blog post for more
details and ideas, and to find out how you can support this goal!
Surprisingly, there was still a bit of RE work left in the third push after
all of this, which I filled with some small rendering boilerplate. Since I
also wanted to include TH02's playfield overlay functions,
1/15 of that last push went towards getting a
TH02-exclusive function out of the way, which also ended up including that
game in this delivery.
The other small function pointed out how TH05's Stage 5 midboss pops into
the playfield quite suddenly, since its clipping test thinks it's only 32
pixels tall rather than 64:
Good chance that the pop-in might have been intended. Edit (2023-06-30): Actually, it's a
📝 systematic consequence of ZUN having to work around the lack of clipping in master.lib's sprite functions.
There's even another quirk here: The white flash during its first frame
is actually carried over from the previous midboss, which the
game still considers as actively getting hit by the player shot that
defeated it. It's the regular boilerplate code for rendering a
midboss that resets the responsible damage variable, and that code
doesn't run during the defeat explosion animation.
Next up: Staying with TH05 and looking at more of the pattern code of its
boss fights. Given the remaining TH05 budget, it makes the most sense to
continue in in-game order, with Sara and the Stage 2 midboss. If more money
comes in towards this goal, I could alternatively go for the Mai & Yuki
fight and immediately develop a pretty fix for the cheeto storage
glitch. Also, there's a rather intricate
pull request for direct ZMBV decoding on the website that I've still got
to review…
TH05 has passed the 50% RE mark, with both MAIN.EXE and the
game as a whole! With that, we've also reached what -Tom-
wanted out of the project, so he's suspending his discount offer for a
bit.
Curve bullets are now officially called cheetos! 76.7% of
fans prefer this term, and it fits into the 8.3 DOS filename scheme much
better than homing lasers (as they're called in
OMAKE.TXT) or Taito
lasers (which would indeed have made sense as well).
…oh, and I managed to decompile Shinki within 2 pushes after all. That
left enough budget to also add the Stage 1 midboss on top.
So, Shinki! As far as final boss code is concerned, she's surprisingly
economical, with 📝 her background animations
making up more than ⅓ of her entire code. Going straight from TH01's
📝 final 📝 bosses
to TH05's final boss definitely showed how much ZUN had streamlined
danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there
is still room for improvement: TH05 not only
📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month,
but also uses them 4× as often, and even for midbosses. Most importantly
though, defining danmaku patterns using a single global instance of the
group template structure is just bad no matter how you look at it:
The script code ends up rather bloated, with a single MOV
instruction for setting one of the fields taking up 5 bytes. By comparison,
the entire structure for regular bullets is 14 bytes large, while the
template structure for Shinki's 32×32 ball bullets could have easily been
reduced to 8 bytes.
Since it's also one piece of global state, you can easily forget to set
one of the required fields for a group type. The resulting danmaku group
then reuses these values from the last time they were set… which might have
been as far back as another boss fight from a previous stage.
And of course, I wouldn't point this out if it
didn't actually happen in Shinki's pattern code. Twice.
Declaring a separate structure instance with the static data for every
pattern would be both safer and more space-efficient, and there's
more than enough space left for that in the game's data segment.
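A rough sketch of what such per-pattern static instances could look like. The `BulletTemplate` fields, the `Bullet` structure, and `spawn()` are simplified stand-ins invented for this example and don't mirror the game's actual group template structure. The point is only that every pattern names all of its fields in one place, so nothing can leak in from a previous boss fight.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, heavily simplified group template.
struct BulletTemplate {
	uint8_t count; // Number of bullets in the ring
	uint8_t angle; // 8-bit angle of the first bullet, 0x100 = full circle
	uint8_t speed; // Arbitrary unit
};

struct Bullet {
	int16_t x;
	int16_t y;
	uint8_t angle;
	uint8_t speed;
};

// One immutable instance per pattern instead of a single mutable global.
constexpr BulletTemplate RING_8 = { 8, 0x00, 0x20 };

// Spawns an evenly spaced ring of bullets at the given origin.
std::vector<Bullet> spawn(const BulletTemplate& tmpl, int16_t x, int16_t y)
{
	std::vector<Bullet> ret;
	for(unsigned int i = 0; i < tmpl.count; i++) {
		const auto angle = static_cast<uint8_t>(
			tmpl.angle + ((i * 0x100u) / tmpl.count)
		);
		ret.push_back({ x, y, angle, tmpl.speed });
	}
	return ret;
}
```

Calling `spawn(RING_8, x, y)` then reads only from `RING_8` and its arguments, regardless of what any earlier pattern did.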
But all in all, the pattern functions are short, sweet, and easy to follow.
The "devil" pattern is significantly more complex than the others, but still
far from TH01's final bosses at their worst. I especially like the clear
architectural separation between "one-shot pattern" functions that return
true once they're done, and "looping pattern" functions that
run as long as they're being called from a boss's main function. Not many
all too interesting things in these pattern functions for the most part,
except for two pieces of evidence that Shinki was coded after Yumeko:
The gather animation function in the first two phases contains a bullet
group configuration that looks like it's part of an unused danmaku
pattern. It quickly turns out to just be copy-pasted from a similar function
in Yumeko's fight though, where it is turned into actual
bullets.
As one of the two places where ZUN forgot to set a template field, the
lasers at the end of the white wing preparation pattern reuse the 6-pixel
width of Yumeko's final laser pattern. This actually has an effect on
gameplay: Since these lasers are active for the first 8 frames after
Shinki's wings appear on screen, the player can get hit by them in the last
2 frames after they grew to their final width.
Of course, there are more than enough safespots between the lasers.
Speaking about that wing sprite: If you look at ST05.BB2 (or
any other file with a large sprite, for that matter), you notice a rather
weird file layout:
A large sprite split into multiple smaller ones with a width of
64 pixels each? What's this, hardware sprite limitations? On my
PC-98?!
And it's not a limitation of the sprite width field in the BFNT+ header
either. Instead, it's master.lib's BFNT functions which are limited to
sprite widths up to 64 pixels… or at least that's what
MASTER.MAN claims. Whatever the restriction was, it seems to be
completely nonexistent as of master.lib version 0.23, and none of the
master.lib functions used by the games have any issues with larger
sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the
game that expects Shinki's winged form to consist of 4 physical
sprites, not just 1. Any conversion from another, more logical sprite sheet
layout back into BFNT+ must therefore replicate the original number of
sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly
loaded sprite no longer match ZUN's hardcoded IDs, causing the game to
crash. This is exactly what used to happen with -Tom-'s
MysticTK automation scripts,
which combined these exact sprites into a single large one. This issue has
now been fixed – just in case there are some underground modders out there
who used these scripts and wonder why their game crashed as soon as the
Shinki fight started.
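The patnum issue can be modeled in a few lines. This is a toy model under the assumption that patnums are handed out strictly in loading order, starting at 0 here for simplicity. The 256-pixel width follows from the 4 sprites of 64 pixels each mentioned above, and all names are made up.

```cpp
#include <vector>

constexpr int CHUNK_W = 64; // Width limit that ZUN stuck to

// Number of physical sprites needed for one logical sprite of this width.
constexpr int chunk_count(int logical_w)
{
	return ((logical_w + (CHUNK_W - 1)) / CHUNK_W);
}

// Returns the patnum of the first sprite of every loaded file, given the
// number of physical sprites stored in each file.
std::vector<int> first_patnums(const std::vector<int>& sprites_per_file)
{
	std::vector<int> ret;
	int next = 0;
	for(const int count : sprites_per_file) {
		ret.push_back(next);
		next += count;
	}
	return ret;
}
```

With the original 4-sprite layout, a file loaded after the wings starts at patnum 4. If a converter merges the wings into a single sprite, that same file starts at patnum 1, and every hardcoded ID past that point refers to the wrong sprite.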
And then the code quality takes a nosedive with Shinki's main function.
Even in TH05, these boss and midboss update
functions are still very imperative:
The origin point of all bullet types used by a boss must be manually set
to the current boss/midboss position; there is no concept of a bullet type
tracking a certain entity.
The same is true for the target point of a player's homing shots…
… and updating the HP bar. At least the initial fill animation is
abstracted away rather decently.
Incrementing the phase frame variable also must be done manually. TH05
even "innovates" here by giving the boss update function exclusive ownership
of that variable, in contrast to TH04 where that ownership is given out to
the player shot collision detection (?!) and boss defeat helper
functions.
Speaking about collision detection: That is done by calling different
functions depending on whether the boss is supposed to be invincible or
not.
Timeout conditions? No standard way either, and all done with manual
if statements. In combination with the regular phase end
condition of lowering (mid)boss HP to a certain value, this leads to quite a
convoluted control flow.
The manual calls to the score bonus functions for cleared phases at least provide some sense of orientation.
One potentially nice aspect of all this imperative freedom is that
phases can end outside of HP boundaries… by manually incrementing the
phase variable and resetting the phase frame variable to 0.
The biggest WTF in there, however, goes to using one of the 16 state bytes
as a "relative phase" variable for differentiating between boss phases that
share the same branch within the switch(boss.phase)
statement. While it's commendable that ZUN tried to reduce code duplication
for once, he could have just branched depending on the actual
boss.phase variable? The same state byte is then reused in the
"devil" pattern to track the activity state of the big jerky lasers in the
second half of the pattern. If you somehow managed to end the phase after
the first few bullets of the pattern, but before these lasers are up,
Shinki's update function would think that you're still in the phase
before the "devil" pattern. The main function then sequence-breaks
right to the defeat phase, skipping the final pattern with the burning Makai
background. Luckily, the HP boundaries are far away enough to make this
impossible in practice.
The takeaway here: If you want to use the state bytes for your custom
boss script mods, alias them to your own 16-byte structure, and limit each
of the bytes to a clearly defined meaning across your entire boss script.
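A possible shape for such a structure, as a sketch. The field names are invented for illustration, and `BossStateRaw` is only a stand-in for the game's 16 state bytes. In Turbo C++, you would probably alias the bytes directly through a cast or a `#define`. The `memcpy()`-based accessors below are just the variant that is also well-defined in standard C++.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t BOSS_STATE_SIZE = 16;

// Stand-in for the game's generic boss state bytes.
struct BossStateRaw {
	uint8_t bytes[BOSS_STATE_SIZE];
};

// Hypothetical per-boss view: one clearly defined meaning per byte.
struct MyBossState {
	uint8_t relative_phase;
	uint8_t laser_activity;
	uint8_t wave_count;
	uint8_t reserved[BOSS_STATE_SIZE - 3];
};
static_assert(
	(sizeof(MyBossState) == BOSS_STATE_SIZE),
	"custom state must cover exactly the 16 generic bytes"
);

MyBossState load(const BossStateRaw& raw)
{
	MyBossState ret;
	std::memcpy(&ret, raw.bytes, sizeof(ret));
	return ret;
}

void store(BossStateRaw& raw, const MyBossState& state)
{
	std::memcpy(raw.bytes, &state, sizeof(state));
}
```

Giving the laser activity its own named byte would have avoided the overlap between the relative phase and the "devil" pattern described above.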
One final discovery that doesn't seem to be documented anywhere yet: Shinki
actually has a hidden bomb shield during her two purple-wing phases.
uth05win got this part slightly wrong though: It's not a complete
shield, and hitting Shinki will still deal 1 point of chip damage per
frame. For comparison, the first phase lasts for 3,000 HP, and the "devil"
pattern phase lasts for 5,800 HP.
And there we go, 3rd PC-98 Touhou boss
script* decompiled, 28 to go! 🎉 In case you were expecting a fix for
the Shinki death glitch: That one
is more appropriately fixed as part of the Mai & Yuki script. It also
requires new code, should ideally look a bit prettier than just removing
cheetos between one frame and the next, and I'd still like it to fit within
the original position-dependent code layout… Let's do that some other
time.
Not much to say about the Stage 1 midboss, or midbosses in general even,
except that their update functions have to imperatively handle even more
subsystems, due to the relative lack of helper functions.
The remaining ¾ of the third push went to a bunch of smaller RE and
finalization work that would have hardly got any attention otherwise, to
help secure that 50% RE mark. The nicest piece of code in there shows off
what looks like the optimal way of setting up the
📝 GRCG tile register for monochrome blitting
in a variable color:
mov ah, palette_index ; Any other non-AL 8-bit register works too.
; (x86 only supports AL as the source operand for OUTs.)
rept 4 ; For all 4 bitplanes…
shr ah, 1 ; Shift the next color bit into the x86 carry flag
sbb al, al ; Extend the carry flag to a full byte
; (CF=0 → 0x00, CF=1 → 0xFF)
out 7Eh, al ; Write AL to the GRCG tile register
endm
Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles
into a surprisingly nice one-liner. What a beautiful micro-optimization, at
a place where micro-optimization doesn't hurt and is almost expected.
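For readers who don't speak x86, this is what the loop computes, written as portable C++ without any port I/O. It's a sketch and not the actual decompiled one-liner: the function name is made up, and it returns the four bytes instead of writing them to port 0x7E.

```cpp
#include <array>
#include <cstdint>

// Returns the four bytes that the ASM loop above writes to the GRCG tile
// register, one per bitplane, in the order they are written.
std::array<uint8_t, 4> grcg_tile_for(uint8_t palette_index)
{
	std::array<uint8_t, 4> tile{};
	for(auto& plane : tile) {
		// Equivalent of SHR + SBB: bit 0 decides between 0xFF and 0x00.
		plane = ((palette_index & 1) ? 0xFF : 0x00);
		palette_index >>= 1;
	}
	return tile;
}
```

Palette index 5 (binary 0101) yields the bytes `FF 00 FF 00`, which enables the first and third bitplane.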
Unfortunately, the micro-optimizations went all downhill from there,
becoming increasingly dumb and undecompilable. Was it really necessary to
save 4 x86 instructions in the highly unlikely case of a new spark sprite
being spawned outside the playfield? That one 2D polar→Cartesian
conversion function then pointed out Turbo C++ 4.0J's woefully limited
support for 32-bit micro-optimizations. The code generation for 32-bit
📝 pseudo-registers is so bad that they almost
aren't worth using for arithmetic operations, and the inline assembler just
flat out doesn't support anything 32-bit. No use in decompiling a function
that you'd have to entirely spell out in machine code, especially if the
same function already exists in multiple other, more idiomatic C++
variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC
replay file reading code, which should finally prove that nothing about the
game's original replay system could serve as even just the foundation for
community-usable replays. Just in case anyone was still thinking that.
Next up: Back to TH01, with the Elis fight! Got a bit of room left in the
cap again, and there are a lot of things that would make a lot of
sense now:
TH04 would really enjoy a large number of dedicated pushes to catch up
with TH05. This would greatly support the finalization of both games.
Continuing with TH05's bosses and midbosses has shown to be good value
for your money. Shinki would have taken even less than 2 pushes if she
hadn't been the first boss I looked at.
Oh, and I also added Seihou as a selectable goal, for the two people out
there who genuinely like it. If I ever want to quit my day job, I need to
branch out into safer territory that isn't threatened by takedowns, after
all.
Been 📝 a while since we last looked at any of
TH03's game code! But before that, we need to talk about Y coordinates.
During TH03's MAIN.EXE, the PC-98 graphics GDC runs in its
line-doubled 640×200 resolution, which gives the in-game portion its
distinctive stretched low-res look. This lower resolution is a consequence
of using 📝 Promisence Soft's SPRITE16 driver:
Its performance simply stems from the fact that it expects sprites to be
stored in the bottom half of VRAM, which allows them to be blitted using the
same EGC-accelerated VRAM-to-VRAM copies we've seen again and again in all
other games. Reducing the visible resolution also means that the sprites can
be stored on both VRAM pages, allowing the game to still be double-buffered.
If you force the graphics chip to run at 640×400, you can see them:
The full VRAM contents during TH03's in-game portion, as seen when forcing the system into a 640×400 resolution.
Note that the text chip still displays its overlaid contents at 640×400,
which means that TH03's in-game portion technically runs at two
resolutions at the same time.
But that means that any mention of a Y coordinate is ambiguous: Does it
refer to undoubled VRAM pixels, or on-screen stretched pixels? Especially
people who have known about the line doubling for years might almost expect
technical blog posts on this game to use undoubled VRAM coordinates. So,
let's introduce a new formatting convention for both on-screen
640×400 and undoubled 640×200 coordinates,
and always write out both to minimize the confusion.
Alright, now what's the thing gonna be? The enemy structure is highly
overloaded, being used for enemies, fireballs, and explosions with seemingly
different semantics for each. Maybe a bit too much to be figured out in what
should ideally be a single push, especially with all the functions that
would need to be decompiled? Bullet code would be easier, but not exactly
single-push material either. As it turns out though, there's something more
fundamental left to be done first, which both of these subsystems depend on:
collision detection!
And it's implemented exactly how I always naively imagined collision
detection to be implemented in a fixed-resolution 2D bullet hell game with
small hitboxes: By keeping a separate 1bpp bitmap of both playfields in
memory, drawing in the collidable regions of all entities on every frame,
and then checking whether any pixels at the current location of the player's
hitbox are set to 1. It's probably not done in the other games because their
single data segment was already too packed for the necessary 17,664 bytes to
store such a bitmap at pixel resolution, and 282,624 bytes for a bitmap at
Q12.4 subpixel resolution would have been prohibitively expensive in 16-bit
Real Mode DOS anyway. In TH03, on the other hand, this bitmap is doubly
useful, as the AI also uses it to elegantly learn what's on the playfield.
By halving the resolution and only tracking tiles of 2×2 / 2×1 pixels, TH03 only requires an adequate total
of 6,624 bytes of memory for the collision bitmaps of both playfields.
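As a readable counterpart to the original ASM, here's a minimal sketch of such a tile-based collision bitmap. It assumes a playfield of 288×368 on-screen pixels, which isn't stated above but matches the 6,624-byte total for two playfields at 2×2 tiles. The bit order within a byte and all names are assumptions as well.

```cpp
#include <array>
#include <cstdint>

constexpr int TILES_X = (288 / 2); // Assumed playfield width in tiles
constexpr int TILES_Y = (368 / 2); // Assumed playfield height in tiles
constexpr int ROW_BYTES = (TILES_X / 8);
constexpr int BITMAP_BYTES = (ROW_BYTES * TILES_Y);
static_assert(((BITMAP_BYTES * 2) == 6624), "two playfields, 1 bit per tile");

struct CollisionBitmap {
	std::array<uint8_t, BITMAP_BYTES> bits{};

	static bool in_bounds(int tx, int ty)
	{
		return ((tx >= 0) && (tx < TILES_X) && (ty >= 0) && (ty < TILES_Y));
	}

	// Marks a single tile as collidable. Out-of-bounds tiles are clipped.
	void set(int tx, int ty)
	{
		if(in_bounds(tx, ty)) {
			bits[(ty * ROW_BYTES) + (tx / 8)] |= static_cast<uint8_t>(
				0x80 >> (tx % 8)
			);
		}
	}

	bool test(int tx, int ty) const
	{
		return (in_bounds(tx, ty) && (
			(bits[(ty * ROW_BYTES) + (tx / 8)] & (0x80 >> (tx % 8))) != 0
		));
	}

	// Player collision: is any tile within the given tile rectangle set?
	bool hit(int left, int top, int w, int h) const
	{
		for(int ty = top; ty < (top + h); ty++) {
			for(int tx = left; tx < (left + w); tx++) {
				if(test(tx, ty)) {
					return true;
				}
			}
		}
		return false;
	}
};
```

Entities would call `set()` for every tile they cover on each frame, and the player check boils down to a single `hit()` call with the hitbox rectangle.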
So how did the implementation not earn the good-code tag this time? Because the code for drawing into these bitmaps is undecompilable hand-written x86 assembly. And not just your usual ASM that was basically compiled from C and then edited to maybe optimize register allocation and maybe replace a bunch of local variables with self-modifying code, oh no. This code is full of overly clever bit twiddling, abusing the fact that the 16-bit AX,
BX, CX, and DX registers can also be
accessed as two 8-bit registers, calculations that change the semantic
meaning behind the value of a register, or just straight-up reassignments of
different values to the same small set of registers. Sure, in some way it is
impressive, and it all does work and correctly covers every edge
case, but come on. This could have all been a lot more readable in
exchange for just a few CPU cycles.
What's most interesting though are the actual shapes that these functions
draw into the collision bitmap. On the surface, we have:
1) vertical slopes at any angle across the whole playfield; exclusively
used for Chiyuri's diagonal laser EX attack
2) straight vertical lines, with a width of 1 tile; exclusively used for
the 2×2 / 2×1 hitboxes of bullets
3) rectangles at arbitrary sizes
But only 2) actually draws a full solid line. 1) and 3) are only ever drawn
as horizontal stripes, with a hardcoded distance of 2 vertical tiles
between every stripe of a slope, and 4 vertical tiles between every stripe
of a rectangle. That's 66-75% of each rectangular entity's intended hitbox
not actually taking part in collision detection. Now, if player hitboxes
were ≤ 6 / 3 pixels, we'd have one
possible explanation of how the AI can "cheat", because it could just
precisely move through those blank regions at TAS speeds. So, let's make
this two pushes after all and tell the complete story, since this is one of
the more interesting aspects to still be documented in this game.
And the code only gets worse. While the player
collision detection function is decompilable, it might as well not
have been, because it's just more of the same "optimized", hard-to-follow
assembly. With the four splittable 16-bit registers having a total of 20
different meanings in this function, I would have almost preferred
self-modifying code…
In fact, it was so bad that it prompted some maintenance work on my inline
assembly coding standards as a whole. Turns out that the _asm
keyword is not only still supported in modern Visual Studio compilers, but
also in Clang with the -fms-extensions flag, and compiles fine
there even for 64-bit targets. While that might sound like amazing news at
first ("awesome, no need to rewrite this stuff for my x86_64 Linux
port!"), you quickly realize that almost all inline assembly in this
codebase assumes either PC-98 hardware, segmented 16-bit memory addressing,
or is a temporary hack that will be removed with further RE progress.
That's mainly because most of the raw arithmetic code uses Turbo C++'s
register pseudovariables where possible. While they certainly have their
drawbacks, being a non-standard extension that's not supported in other
x86-targeting C compilers, their advantages are quite significant: They
allow this code to stay in the same language, and provide slightly more
immediate portability to any other architecture, together with
📝 readability and maintainability improvements that can get quite significant when combined with inlining:
// This one line compiles to five ASM instructions, which would need to be
// spelled out in any C compiler that doesn't support register pseudovariables.
// By adding typed aliases for these registers via `#define`, this code can be
// both made even more readable, and be prepared for an easier transformation
// into more portable local variables.
_ES = (((_AX * 4) + _BX) + SEG_PLANE_B);
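For comparison, this is roughly where such a line could end up once the registers have been replaced with regular variables. The sketch takes the register contents as parameters, and assumes 0xA800 for `SEG_PLANE_B`, the usual segment of the PC-98's first bitplane. One plausible reading, which is only a guess here, is that both registers hold the same Y coordinate: the sum would then be y × 5 paragraphs, or 80 bytes, which is exactly one 640-pixel row.

```cpp
#include <cstdint>

constexpr uint16_t SEG_PLANE_B = 0xA800; // Assumed value

// Portable equivalent of the pseudo-register expression, with the register
// contents passed in as regular parameters.
uint16_t plane_b_segment(uint16_t ax, uint16_t bx)
{
	return static_cast<uint16_t>(((ax * 4u) + bx) + SEG_PLANE_B);
}

// Guessed interpretation: AX == BX == y, yielding the segment of VRAM row y.
uint16_t plane_b_row_segment(uint16_t y)
{
	return plane_b_segment(y, y);
}
```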
However, register pseudovariables might cause potential portability issues
as soon as they are mixed with inline assembly instructions that rely on
their state. The lazy way of "supporting pseudo-registers" in other
compilers would involve declaring the full set as global variables, which
would immediately break every one of those instances:
_DI = 0;
_AX = 0xFFFF;
// Special x86 instruction doing the equivalent of
//
// *reinterpret_cast<uint16_t far *>(MK_FP(_ES, _DI)) = _AX;
// _DI += sizeof(uint16_t);
//
// Only generated by Turbo C++ in very specific cases, and therefore only
// reliably available through inline assembly.
asm { stosw; }
What's also not all too standardized, though, are certain variants of
the asm keyword. That's why I've now introduced a distinction
between the _asm keyword for "decently sane" inline assembly,
and the slightly less standard asm keyword for inline assembly
that relies on the contents of pseudo-registers, and should break on
compilers that don't support them. So yeah, have some minor
portability work in exchange for these two pushes not having all that much
in RE'd content.
With that out of the way and the function deciphered, we can confirm the
player hitboxes to be a constant 8×8 /
8×4 pixels, and prove that the hit stripes are nothing but
an adequate optimization that doesn't affect gameplay in any way.
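The reasoning can be double-checked with a small brute-force test. This sketch assumes that "a distance of 4 vertical tiles" means one drawn stripe on every 4th tile row, starting at the top row of the rectangle, and it only looks at hitboxes that lie fully inside the rectangle's vertical range. How the original code treats the rows at the very edges is not covered here.

```cpp
constexpr int STRIPE_PERIOD = 4; // Assumed: one stripe every 4 tile rows

// Does a hitbox covering [box_h] tile rows from [box_top] downwards touch
// any stripe row of a rectangle spanning [rect_top, rect_bottom)?
bool touches_stripe(int rect_top, int rect_bottom, int box_top, int box_h)
{
	for(int row = box_top; row < (box_top + box_h); row++) {
		const bool inside = ((row >= rect_top) && (row < rect_bottom));
		if(inside && (((row - rect_top) % STRIPE_PERIOD) == 0)) {
			return true;
		}
	}
	return false;
}

// Tries every vertical hitbox position that is fully inside the rectangle.
bool always_hit_inside(int rect_top, int rect_h, int box_h)
{
	const int rect_bottom = (rect_top + rect_h);
	for(int box_top = rect_top; (box_top + box_h) <= rect_bottom; box_top++) {
		if(!touches_stripe(rect_top, rect_bottom, box_top, box_h)) {
			return false;
		}
	}
	return true;
}
```

A hitbox height of 4 tiles (8 / 4 pixels) can never slip between two stripes, while a height of 3 tiles (6 / 3 pixels) can, which matches the threshold mentioned earlier.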
And what's the obvious thing to immediately do if you have both the
collision bitmap and the player hitbox? Writing a "real hitbox" mod, of
course:
Reorder the calls to rendering functions so that player and shot sprites
are rendered after bullets
Blank out all player sprite pixels outside an
8×8 / 8×4 box around the center
point
After the bullet rendering function, turn on the GRCG in RMW mode and
set the tile register set to the background color
Stretch the negated contents of the collision bitmap onto each playfield,
leaving only collidable pixels untouched
Do the same with the actual, non-negated contents and a white color, for
extra contrast against the background. This also makes sure to show any
collidable areas whose sprite pixels are transparent, such as with the moon
enemy. (Yeah, how unfair.) Doing that also loses a lot of information about
the playfield, such as enemy HP indicated by their color, but what can you
do:
A decently busy TH03 in-game frame and its underlying collision bitmap,
showing off all three different collision shapes together with the
player hitboxes.
2022-02-18-TH03-real-hitbox.zip
The secret for writing such mods before having reached a sufficient level of
position independence? Put your new code segment into DGROUP,
past the end of the uninitialized data section. That's why this modded
MAIN.EXE is a lot larger than you would expect from the raw amount of new code: The file now actually needs to store all these
uninitialized 0 bytes between the end of the data segment and the first
instruction of the mod code – normally, this number is simply a part of the
MZ EXE header, and doesn't need to be redundantly stored on disk. Check the
th03_real_hitbox
branch for the code.
And now we know why so many "real hitbox" mods for the Windows Touhou games
are inaccurate: The games would simply be unplayable otherwise – or can
you dodge rapidly moving 2×2 /
2×1 blocks as an 8×8 /
8×4 rectangle that is smaller than your shot sprites,
especially without focused movement? I can't.
Maybe it will feel more playable after making explosions visible, but that
would need more RE groundwork first.
It's also interesting how adding two full GRCG-accelerated redraws of both
playfields per frame doesn't significantly drop the game's frame rate – so
why did the drawing functions have to be micro-optimized again? It
would be possible in one pass by using the GRCG's TDW mode, which
should theoretically be 8× faster, but I have to stop somewhere.
Next up: The final missing piece of TH04's and TH05's
bullet-moving code, which will include a certain other
type of projectile as well.
TH03 finally passed 20% RE, and the newly decompiled code contains no
serious ZUN bugs! What a nice way to end the year.
There's only a single unlockable feature in TH03: Chiyuri and Yumemi as
playable characters, unlocked after a 1CC on any difficulty. Just like the
Extra Stages in TH04 and TH05, YUME.NEM contains a single
designated variable for this unlocked feature, making it trivial to craft a
fully unlocked score file without recording any high scores that others
would have to compete against. So, we can now put together a complete set
for all PC-98 Touhou games: 2021-12-27-Fully-unlocked-clean-score-files.zip
It would have been cool to set the randomly generated encryption keys in
these files to a fixed value so that they cancel out and end up not actually
encrypting the file. Too bad that TH03 also started feeding each encrypted
byte back into its stream cipher, which makes this impossible.
The main loading and saving code turned out to be the second-cleanest
implementation of a score file format in PC-98 Touhou, just behind TH02.
Only two of the YUME.NEM functions come with nonsensical
differences between OP.EXE and MAINL.EXE, rather
than 📝 all of them, as in TH01 or
📝 too many of them, as in TH04 and TH05. As
for the rest of the per-difficulty structure though… well, it quickly
becomes clear why this was the final score file format to be RE'd. The name,
score, and stage fields are directly stored in terms of the internal
REGI*.BFT sprite IDs used on the high score screen. TH03 also
stores 10 score digits for each place rather than the 9 possible ones, keeps
any leading 0 digits, and stores the letters of entered names in reverse
order… yeah, let's decompile the high score screen as well, for a full
understanding of why ZUN might have done all that. (Answer: For no reason at
all.)
And wow, what a breath of fresh air. It's surely not
good-code: The overlapping shadows resulting from using
a 24-pixel letterspacing with 32-pixel glyphs in the name column led ZUN to
do quite a lot of unnecessary and slightly confusing rendering work when
moving the cursor back and forth, and he even forgot about the EGC there.
But it's nowhere close to the level of jank we saw in
📝 TH01's high score menu last year. Good to
see that ZUN had learned a thing or two by his third game – especially when
it comes to storing the character map cursor in terms of a character ID,
and improving the layout of the character map:
That's almost a nicely regular grid there. With the question mark and the
double-wide SP, BS, and END options, the cursor
movement code only comes with a reasonable two exceptions, which are easily
handled. And while I didn't get this screen completely decompiled,
one additional push was enough to cover all important code there.
The only potential glitch on this screen is a result of ZUN's continued use
of binary-coded
decimal digits without any bounds check or cap. Like the in-game HUD
score display in TH04 and TH05, TH03's high score screen simply uses the
next glyph in the character set for the most significant digit of any score
above 1,000,000,000 points – in this case, the period. Still, it only
really gets bad at 8,000,000,000 points: Once the glyphs are
exhausted, the blitting function ends up accessing garbage data and filling
the entire screen with garbage pixels. For comparison though, the current world record
is 133,650,710 points, so good luck getting 8 billion in the first
place.
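The thresholds above can be reproduced with a toy model of the uncapped top digit. It assumes that the most significant of the 9 possible digits is simply the score divided by 100,000,000 without any clamping. The count of 80 usable glyphs starting at the 0 digit is only inferred from the 8,000,000,000-point threshold, not taken from the sprite data, and all names are made up.

```cpp
#include <cstdint>

constexpr uint64_t TOP_DIGIT_VALUE = 100000000; // Value of the 9th digit
constexpr unsigned int GLYPHS_FROM_ZERO = 80;   // Inferred, see above

enum class TopGlyph {
	Digit,      // Regular 0-9 glyph
	WrongGlyph, // Some later glyph of the character set, e.g. the period
	Garbage,    // Past the end of the glyph data
};

// Uncapped "digit" in the most significant place, which can exceed 9.
unsigned int top_digit_uncapped(uint64_t score)
{
	return static_cast<unsigned int>(score / TOP_DIGIT_VALUE);
}

TopGlyph classify_top_glyph(uint64_t score)
{
	const unsigned int digit = top_digit_uncapped(score);
	if(digit < 10) {
		return TopGlyph::Digit;
	} else if(digit < GLYPHS_FROM_ZERO) {
		return TopGlyph::WrongGlyph;
	}
	return TopGlyph::Garbage;
}
```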
Next up: Starting 2022 with the long-awaited decompilation of TH01's Sariel
fight! Due to the 📝 recent price increase,
we now got a window in the cap that
is going to remain open until tomorrow, providing an early opportunity to
set a new priority after Sariel is done.
P0148
TH04/TH05 decompilation (Text popups, gather circle rendering, player position clamping)
💰 Funded by:
[Anonymous]
🏷️ Tags:
Back after taking way too long to get Touhou Patch Center's MediaWiki
update feature complete… I'm still waiting for more translators to test and
review the new translation interface before delivering and deploying it
all, which will most likely lead to another break from ReC98 within the
next few months. For now though, I'm happy to have mostly addressed the
nagging responsibility I still had after willing that site into existence,
and to be back working on ReC98. 🙂
As announced, the next few pushes will focus on TH04's and TH05's bullet
spawning code, before I get to put all that accumulated TH01 money towards
finishing all of konngara's code in TH01. For a full
picture of what's happening with bullets, we'd really also like to
have the bullet update function as readable C code though.
Clearing all bullets on the playfield will trigger a Bonus!! popup,
displayed as 📝 gaiji in that proportional
font. Unfortunately, TLINK refused to link the code as soon as I referenced
the function for animating the popups at the top of the playfield? Which
can only mean that we have to decompile that function first…
So, let's turn that piece of technical debt into a full push, and first
decompile another random set of previously reverse-engineered TH04 and TH05
functions. Most of these are stored in a different place within the two
MAIN.EXE binaries, and the tried-and-true method of matching
segment names would therefore have introduced several unnecessary
translation units. So I resorted to a segment splitting technique I should
have started using way earlier: Simply creating new segments with names
derived from their functions, at the exact positions they're needed. All
the new segment start and end directives do bloat the ASM code somewhat,
and certainly contributed to this push barely removing any actual lines of
code. However, what we get in return is total freedom as far as
decompilation order is concerned,
📝 which should be the case for any ReC project, really.
And in the end, all these tiny code segments will cancel out anyway.
If only we could do the same with the data segment…
The popup function happened to be the final one I RE'd before my long break
in the spring of 2019. Back then, I didn't even bother looking into that
64-frame delay between changing popups, and what that meant for the game.
Each of these popups stays on screen for 128 frames, during which, of
course, another popup-worthy event might happen. Handling this cleanly
without removing previous popups too early would involve some sort of event
queue, whose size might even be meaningfully limited to the number of
distinct events that can happen. But still, that'd be a data structure, and
we're not gonna have that! Instead, ZUN
simply keeps two variables for the new and current popup ID. During an
active popup, any change to that ID will only be committed once the current
popup has been shown for at least 64 frames. And during that time,
that new ID can be freely overwritten with a different one, which drops any
previous, undisplayed event. But surely, there won't be more than two
events happening within 63 frames, right?
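To make the dropped-event behavior concrete, here is a minimal sketch of such a two-variable scheme. All names are hypothetical and the exact frame at which the original game commits or removes a popup is an assumption; only the two IDs and the 64- and 128-frame constants are taken from the description above.

```cpp
// Hypothetical names; only the two-ID design and the 64/128-frame
// constants are taken from the description above.
const int POPUP_NONE = -1;
const int POPUP_FRAMES_MIN = 64;    // earliest point for switching popups
const int POPUP_FRAMES_TOTAL = 128; // regular on-screen duration

struct PopupState {
	int id_cur;
	int id_new;
	int frames_shown;

	PopupState() : id_cur(POPUP_NONE), id_new(POPUP_NONE), frames_shown(0) {
	}

	// Called for every popup-worthy event. A pending ID that has not been
	// committed yet is silently overwritten, which drops that event.
	void request(int id) {
		id_new = id;
	}

	void commit() {
		id_cur = id_new;
		id_new = POPUP_NONE;
		frames_shown = 0;
	}

	// Called once per frame.
	void update() {
		if(id_cur == POPUP_NONE) {
			if(id_new != POPUP_NONE) {
				commit();
			}
			return;
		}
		frames_shown++;
		if((id_new != POPUP_NONE) && (frames_shown >= POPUP_FRAMES_MIN)) {
			commit();
		} else if(frames_shown >= POPUP_FRAMES_TOTAL) {
			id_cur = POPUP_NONE;
		}
	}
};
```

Requesting popups 2 and 3 in quick succession while popup 1 is still young will only ever show 3 in this sketch, since 2 is overwritten before it can be committed.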
The rest was fairly uneventful – no newly RE'd functions in this push,
after all – until I reached the widely used helper function for applying
the current vertical scrolling offset to a Y coordinate. Its combination of
a function parameter, the pascal calling convention, and no
stack frame was previously thought to be undecompilable… except that it
isn't, and the decompilation didn't even require any new workarounds to be
developed? Good thing that I already forgot how impossible it was to
decompile the first function I looked at that fell into this category!
Oh well, this discovery wasn't too groundbreaking. Looking back at
all the other functions with that combination only revealed a grand total
of 1 additional one where a decompilation made sense: TH05's version of
snd_kaja_interrupt(), which is now compiled from the same C++
file for all 4 games that use it. And well, looks like some quirks really
remain unnoticed and undocumented until you look at a function for the 11th
time: Its return value is undefined if BGM is inactive – that is, if the
user disabled it, or if no FM board is installed. Not that it matters for
the original code, which never uses this function to retrieve anything from
KAJA's drivers. But people apparently do copy ReC98 code into their own
projects, so it is something to keep in mind.
All in all, nothing quite at jank level in this one, but we were surely grazing that tag. Next up, with that out of the way: The bullet update/step function! Very soon in fact, since I've mostly got it done already.
Technical debt, part 10… in which two of the PMD-related functions came
with such complex ramifications that they required one full push after
all, leaving no room for the additional decompilations I wanted to do. At
least, this did end up being the final one, completing all
SHARED segments for the time being.
The first one of these functions determines the BGM and sound effect
modes, combining the resident type of the PMD driver with the Option menu
setting. The TH04 and TH05 version is apparently coded quite smartly, as
PC-98 Touhou only needs to distinguish "OPN- /
PC-9801-26K-compatible sound sources handled by PMD.COM"
from "everything else", since all other PMD varieties are
OPNA- / PC-9801-86-compatible.
Therefore, I only documented those two results returned from PMD's
AH=09h function. I'll leave a comprehensive, fully documented
enum to interested contributors, since that would involve research into
basically the entire history of the PC-9800 series, and even the clearly
out-of-scope PC-88VA. After all, distinguishing between more versions of
the PMD driver in the Option menu (and adding new sprites for them!) is
strictly mod territory.
The honor of being the final decompiled function in any SHARED
segment went to TH04's snd_load(). TH04 contains by far the
sanest version of this function: Readable C code, no new ZUN bugs (and
still missing file I/O error handling, of course)… but wait, what about
that actual file read syscall, using the INT 21h, AH=3Fh DOS
file read API? Reading up to a hardcoded number of bytes into PMD's or
MMD's song or sound effect buffer, 20 KiB in TH02-TH04, 64 KiB in
TH05… that's kind of weird. About time we looked closer into this.
Turns out that no, KAJA's driver doesn't give you the full 64 KiB of one
memory segment for these, as especially TH05's code might suggest to
anyone unfamiliar with these drivers. Instead,
you can customize the size of these buffers on its command line. In
GAME.BAT, ZUN allocates 8 KiB for FM songs, 2 KiB for sound
effects, and 12 KiB for MMD files in TH02… which means that the hardcoded
sizes in snd_load() are completely wrong, no matter how you
look at them. Consequently, this read syscall
will overflow PMD's or MMD's song or sound effect buffer if the
given file is larger than the respective buffer size.
Now, ZUN could have simply hardcoded the sizes from GAME.BAT
instead, and it would have been fine. As it also turns out though,
PMD has an API function (AH=22h) to retrieve the actual
buffer sizes, provided for exactly that purpose. There is little excuse
not to use it, as it also gives you PMD's default sizes if you don't
specify any yourself.
(Unless your build process enumerates all PMD files that are part of the
game, and bakes the largest size into both snd_load() and
GAME.BAT. That would even work with MMD, which doesn't have
an equivalent for AH=22h.)
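As a rough illustration of what a checked load could look like, here is a portable sketch. The helper name and its signature are made up for this example, and the buffer size is passed in as a plain parameter; in the actual game, it would have to come from the driver (AH=22h for PMD) or from a value baked in at build time.

```cpp
#include <cstddef>
#include <cstring>

// Sizes from the text above: GAME.BAT requests 8 KiB for FM songs, while
// snd_load() in TH02-TH04 asks DOS for up to 20 KiB.
const std::size_t BGM_BUFFER_SIZE = (8 * 1024);
const std::size_t HARDCODED_READ_SIZE = (20 * 1024);

// Hypothetical helper. `buf_size` stands in for the value that the real
// game would have to query from the driver.
bool load_checked(
	unsigned char *buf,
	std::size_t buf_size,
	const unsigned char *file,
	std::size_t file_size
)
{
	if(file_size > buf_size) {
		return false; // would overflow the driver's buffer
	}
	std::memcpy(buf, file, file_size);
	return true;
}
```

With this check, a file that fits into the 8 KiB buffer is loaded as before, while anything larger is rejected instead of spilling into whatever follows the driver in memory.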
What'd be the consequence of loading a larger file then? Well, since we
don't get a full segment, let's look at the theoretical limit first.
PMD prefers to keep both its driver code and the data buffers in a single
memory segment. As a result, the limit for the combined size of the song,
instrument, and sound effect buffer is determined by the amount of
code in the driver itself. In PMD86 version 4.8o (bundled with TH04
and TH05) for example, the remaining size for these buffers is exactly
45,555 bytes. Being an actually good programmer who doesn't blindly trust
user input, KAJA thankfully validates the sizes given via the
/M, /V, and /E command-line options
before letting the driver reside in memory, and shuts down with an error
message if they exceed 40 KiB. Would have been even better if he calculated
the exact size – even in the current
PMD version 4.8s from
January 2020, it's still a hardcoded value (see line 8581).
Either way: If the file is larger than this maximum, the concrete effect
is down to the INT 21h, AH=3Fh implementation in the
underlying DOS version. DOS 3.3 treats the destination address as linear
and reads past the end of the segment, DOS 5.0 and DOSBox-X truncate the
number of bytes to not exceed the remaining
space in the segment, and maybe there's even a DOS that wraps around
and ends up overwriting the PMD driver code. In any case: You will
overwrite what's after the driver in memory – typically, the game .EXE and
its master.lib functions.
It almost feels like a happy accident that this doesn't cause issues in
the original games. The largest PMD file in any of the 4 games, the -86
version of 幽夢 ~ Inanimate Dream, takes up 8,099 bytes,
just under the 8,192 byte limit for BGM. For modders, I'd really recommend
implementing this properly, with PMD's AH=22h function and
error handling, once position independence has been reached.
Whew, didn't think I'd be doing more research into KAJA's drivers during
regular ReC98 development! That's probably been the final time though, as
all involved functions are now decompiled, and I'm unlikely to iterate
over them again.
And that's it! Repaid the biggest chunk of technical debt, time for some
actual progress again. Next up: Reopening the store tomorrow, and waiting
for new priorities. If we got nothing by Sunday, I'm going to put the
pending [Anonymous] pushes towards some work on the website.
P0138
Separating translation units, part 9/10 (focused around TH03 / TH04) + TH04 RE (.MPN format)
💰 Funded by:
[Anonymous], Blue Bolt
🏷️ Tags:
Technical debt, part 9… and as it turns out, it's highly impractical to
repay 100% of it at this point in development. 😕
The reason: graph_putsa_fx(), ZUN's function for rendering
optionally boldfaced text to VRAM using the font ROM glyphs, in its
ridiculously micro-optimized TH04 and TH05 version. This one sets the
"callback function" for applying the boldface effect by self-modifying
the target of two CALL rel16 instructions… because
there really wasn't any free register left for an indirect
CALL, eh? The necessary distance, from the call site to the
function itself, has to be calculated at assembly time, by subtracting the
call site label from the target function label.
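For readers less familiar with x86, the arithmetic behind such a patch can be sketched as follows. This is not ZUN's or ReC98's code: the function names are hypothetical, and the sketch only relies on the general encoding rule that a 16-bit `CALL rel16` is the opcode byte E8h followed by a little-endian displacement, measured from the end of that 3-byte instruction.

```cpp
#include <stdint.h>

// Displacement stored in a `CALL rel16` at offset `call_site` that should
// end up at `target`. Wraps around within the 64 KiB code segment.
uint16_t call_rel16_displacement(uint16_t call_site, uint16_t target)
{
	return static_cast<uint16_t>(target - (call_site + 3));
}

// Overwrites the displacement of an existing `CALL rel16` (opcode E8h) in
// a byte image of the code segment. Hypothetical helper for illustration.
void patch_call_rel16(uint8_t *code, uint16_t call_site, uint16_t target)
{
	const uint16_t disp = call_rel16_displacement(call_site, target);
	code[call_site + 1] = static_cast<uint8_t>(disp & 0xFF);
	code[call_site + 2] = static_cast<uint8_t>(disp >> 8);
}
```

A lookup table of such displacements, one per boldface variant, is all the self-modifying version needs at runtime, which is exactly why the table has to be generated by something that knows both addresses.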
This usually wouldn't be a problem… if ZUN didn't store the resulting
lookup tables in the .DATA segment. With code segments, we
can easily split them at pretty much any point between functions because
there are multiple of them. But there's only a single .DATA
segment, with all ZUN and master.lib data sandwiched between Borland C++'s
crt0 at the
top, and Borland C++'s library functions at the bottom of the segment.
Adding another split point would require all data after that point to be
moved to its own translation unit, which in turn requires
EXTERN references in the big .ASM file to all that moved
data… in short, it would turn the codebase into an even greater
mess.
Declaring the labels as EXTERN wouldn't work either, since
the linker can't do fancy arithmetic and is limited to simply replacing
address placeholders with one single address. So, we're now stuck with
this function at the bottom of the SHARED segment, for the
foreseeable future.
We can still continue to separate functions off the top of that segment,
though. Pretty much the only thing noteworthy there, so far: TH04's code
for loading stage tile images from .MPN files, which we hadn't
reverse-engineered so far, and which nicely fit into one of
Blue Bolt's pending ⅓ RE contributions. Yup, we finally moved
the RE% bars again! If only for a tiny bit.
Both TH02 and TH05 simply store one pointer to one dynamically allocated
memory block for all tile images, as well as the number of images, in the
data segment. TH04, on the other hand, reserves memory for 8 .MPN slots,
complete with their color palettes, even though it only ever uses the
first one of these. There goes another 458 bytes of conventional RAM… I
should start summing up all the waste we've seen so far. Let's put the
next website contribution towards a tagging system for these blog posts.
At 86% of technical debt in the SHARED segment repaid, we
aren't quite done yet, but the rest is mostly just TH04 needing to catch
up with functions we've already separated. Next up: Getting to that
practical 98.5% point. Since this is very likely to not require a full
push, I'll also decompile some more actual TH04 and TH05 game code I
previously reverse-engineered – and after that, reopen the store!
P0137
Separating translation units, part 8/10 (focused around TH03) + Segment alignment research
💰 Funded by:
[Anonymous]
🏷️ Tags:
Whoops, the build was broken again? Since
P0127 from
mid-November 2020, on TASM32 version 5.3, which also happens to be the
one in the DevKit… That version changed the alignment for the default
segments of certain memory models when requesting .386
support. And since redefining segment alignment apparently is highly
illegal and absolutely has to be a build error, some of the stand-alone
.ASM translation units didn't assemble anymore on this version. I've only
spotted this on my own because I casually compiled ReC98 somewhere else –
on my development system, I happened to have TASM32 version 5.0 in the
PATH during all this time.
At least this was a good occasion to
get rid of some
weird segment alignment workarounds from 2015, and replace them with the
superior convention of using the USE16 modifier for the
.MODEL directive.
ReC98 would highly benefit from a build server – both in order to
immediately spot issues like this one, and as a service for modders.
Even more so than the usual open-source project of its size, I would say.
But that might be exactly
because it doesn't seem like something you can trivially outsource
to one of the big CI providers for open-source projects, and quickly set
it up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular
64-bit Windows 10 and DOSBox running the exact build tools from the DevKit.
Ideally, though, such a server should really run the optimal configuration
of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step
to run natively, which already is something that no popular CI service out
there offers. Then, we'd optimally expand to Linux, every other Windows
version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd
be a lot. An experimental project all on its own, with additional hosting
costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest
there is once the store reopens (which will be at the beginning of May, at
the latest). That aside, it would 📝 also be
a great project for outside contributors!
So, technical debt, part 8… and right away, we're faced with TH03's
low-level input function, which
📝 once 📝 again 📝 insists on being word-aligned in a way we
can't fake without duplicating translation units.
Being undecompilable isn't exactly the best property for a function that
has been interesting to modders in the past: In 2018,
spaztron64 created an
ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human
multiplayer mode: 2021-04-04-TH03-WASD-2player.zip
However, this remapping attempt remained quite limited, since we hadn't
(and still haven't) reached full position independence for TH03 yet.
There's quite some potential for size optimizations in this function, which
would allow more BIOS key groups to already be used right now, but it's not
all that obvious to modders who aren't intimately familiar with x86 ASM.
Therefore, I really wouldn't want to keep such a long and important
function in ASM if we don't absolutely have to…
… and apparently, that's all the motivation I needed? So I took the risk,
and spent the first half of this push on reverse-engineering
TCC.EXE, to hopefully find a way to get word-aligned code
segments out of Turbo C++ after all.
And there is! The -WX option, used for creating
DPMI
applications, messes up all sorts of code generation aspects in weird
ways, but does in fact mark the code segment as word-aligned. We can
consider ourselves quite lucky that we get to use Turbo C++ 4.0, because
this feature isn't available in any previous version of Borland's C++
compilers.
That allowed us to restore all the decompilations I previously threw away…
well, two of the three, that lookup table generator was too much of a mess
in C. But what an abuse this is. The
subtly different code generation has basically required one creative
workaround per usage of -WX. For example, enabling that option
causes the regular PUSH BP and POP BP prolog and
epilog instructions to be wrapped with INC BP and
DEC BP, for some reason:
a_function_compiled_with_wx proc
inc bp ; ???
push bp
mov bp, sp
; [… function code …]
pop bp
dec bp ; ???
ret
a_function_compiled_with_wx endp
Luckily again, all the functions that currently require -WX
don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty
close: snd_se_reset(void) is one of the functions that require
word alignment. Previously, it shared a translation unit with the
immediately following snd_se_play(int new_se), which does take
a parameter, and therefore would have had its prolog and epilog code messed
up by -WX.
Since the latter function has a consistent (and thus, fakeable) alignment,
I simply split that code segment into two, with a new -WX
translation unit for just snd_se_reset(void). Problem solved –
after all, two C++ translation units are still better than one ASM
translation unit. Especially with all the
previous #include improvements.
The rest was more of the usual, getting us 74% done with repaying the
technical debt in the SHARED segment. A lot of the remaining
26% is TH04 needing to catch up with TH03 and TH05, which takes
comparatively little time. With some good luck, we might get this
done within the next push… that is, if we aren't confronted with all too
many more disgusting decompilations, like the two functions that ended this
push.
If we are, we might be needing 10 pushes to complete this after all, but
that piece of research was definitely worth the delay. Next up: One more of
these.
P0135
Separating translation units, part 6/10 (TH05 PMD loading / Music Room piano)
P0136
Separating translation units, part 7/10 (starting to catch up with TH04)
💰 Funded by:
[Anonymous]
🏷️ Tags:
Alright, no more big code maintenance tasks that absolutely need to be
done right now. Time to really focus on parts 6 and 7 of repaying
technical debt, right? Except that we don't get to speed up just yet, as
TH05's barely decompilable PMD file loading function is rather…
complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98
Touhou, I first consult the disassembly of Wolfenstein 3D. That game was
originally compiled with the quite similar Borland C++ 3.0, so it's quite
helpful to compare its ASM to the
officially released source
code. If I find the instructions in question, they mostly come from
that game's ASM code, leading to the amusing realization that "even John
Carmack was unable to get these instructions out of this compiler".
This time though, Wolfenstein 3D did point me
to Borland's intrinsics for common C functions like memcpy()
and strchr(), available via #pragma intrinsic.
Bu~t those unfortunately still generate worse code than what ZUN
micro-optimized here. Commenting how these sequences of instructions
should look in C is all I could do.
The conditional branches in this function did compile quite nicely
though, clarifying the control flow, and clearly exposing a ZUN
bug: TH05's snd_load() will hang in an infinite loop when
trying to load a non-existing -86 BGM file (with a .M2
extension) if the corresponding -26 BGM file (with a .M
extension) doesn't exist either.
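For comparison, a terminating version of such a fallback could look like the sketch below. Everything in it is an assumption for illustration: the function name, the use of a set of file names in place of actual DOS file I/O, and the idea that the fallback simply goes from .M2 to .M once. It is not a reconstruction of the original control flow.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for the file system: the set of existing file names.
typedef std::set<std::string> FileSet;

// Tries the -86 file first if requested, then the -26 file, and gives up
// with an empty string instead of retrying forever.
std::string bgm_resolve(
	const std::string& basename, bool use_86, const FileSet& files
)
{
	if(use_86) {
		const std::string fn_86 = (basename + ".M2");
		if(files.count(fn_86)) {
			return fn_86;
		}
	}
	const std::string fn_26 = (basename + ".M");
	if(files.count(fn_26)) {
		return fn_26;
	}
	return "";
}
```

The important part is the final `return`: once both candidates are exhausted, the caller gets a clear failure value to handle.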
Unsurprisingly, the PMD channel monitoring code in TH05's Music Room
remains undecompilable outside the two most "high-level" initialization
and rendering functions. And it's not because there's data in the
middle of the code segment – that would have actually been possible with
some #pragmas to ensure that the data and code segments have
the same name. As soon as the SI and DI registers are referenced
anywhere, Turbo C++ insists on emitting prolog code to save these
on the stack at the beginning of the function, and epilog code to restore
them from there before returning.
Found that out in
September 2019, and confirmed that there's no way around it. All the
small helper functions here are quite simply too optimized, throwing away
any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate
that I do try.
Within that same 6th push though, we've finally reached the one function
in TH05 that was blocking further progress in TH04, allowing that game
to finally catch up with the others in terms of separated translation
units. Feels good to finally delete more of those .ASM files we've
decompiled a while ago… finally!
But since that was just getting started, the most satisfying development
in both of these pushes actually came from some more experiments with
macros and inline functions for near-ASM code. By adding
"unused" dummy parameters for all relevant registers, the exact input
registers are made more explicit, which might help future port authors who
then maybe wouldn't have to look them up in an x86 instruction
reference quite as often. At its best, this even allows us to
declare certain functions with the __fastcall convention and
express their parameter lists as regular C, with no additional
pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even
more amazing than previously thought when it comes to returning
pseudo-registers from inline functions. A nice example for
how this can improve readability can be found in this piece of TH02 code
for polling the PC-98 keyboard state using a BIOS interrupt:
inline uint8_t keygroup_sense(uint8_t group) {
_AL = group;
_AH = 0x04;
geninterrupt(0x18);
// This turns the output register of this BIOS call into the return value
// of this function. Surprisingly enough, this does *not* naively generate
// the `MOV AL, AH` instruction you might expect here!
return _AH;
}
void input_sense(void)
{
// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
// never emits as such, giving us only the three instructions we need.
_AH = keygroup_sense(8);
// Whereas this one gives us the one additional `MOV BH, AH` instruction
// we'd expect, and nothing more.
_BH = keygroup_sense(7);
// And now it's obvious what both of these registers contain, from just
// the assignments above.
if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
key_det |= INPUT_UP;
}
// […]
}
I love it. No inline assembly, as close to idiomatic C code as something
like this is going to get, yet still compiling into the minimum possible
number of x86 instructions on even a 1994 compiler. This is how I keep
this project interesting for myself during chores like these.
We might have even reached peak
inline already?
And that's 65% of technical debt in the SHARED segment repaid
so far. Next up: Two more of these, which might already complete that
segment? Finally!
P0133
Separating translation units, part 4/10 (focused around TH02 / TH05)
💰 Funded by:
[Anonymous]
🏷️ Tags:
Wow, 31 commits in a single push? Well, what the last push had in
progress, this one had in maintenance. The
📝 master.lib header transition absolutely
had to be completed in this one, for my own sanity. And indeed,
it reduced the build time for the entirety of ReC98 to about 27 seconds on
my system, just as expected in the original announcement. Looking forward
to even faster build times with the upcoming #include
improvements I've got up my sleeve! The port authors of the future are
going to appreciate those quite a bit.
As for the new translation units, the funniest one is probably TH05's
function for blitting the 1-color .CDG images used for the main menu
options. Which is so optimized that it becomes decompilable again,
by ditching the self-modifying code of its TH04 counterpart in favor of
simply making better use of CPU registers. The resulting C code is still a
mess, but what can you do.
This was followed by even more TH05 functions that clearly weren't
compiled from C, as evidenced by their padding
bytes. It's about time I documented my lack of ideas of how to get
those out of Turbo C++.
And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…
In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?
P0132
Separating translation units, part 3/10 (focused around TH02 / TH03)
💰 Funded by:
[Anonymous]
🏷️ Tags:
Now that's the amount of translation unit separation progress I was
looking for! Too bad that RL is keeping me more and more occupied these
days, and ended up delaying this push until 2021. Now that
Touhou Patch Center is also commissioning me to update their
infrastructure, it's going to take a while for ReC98 to return to full
speed, and for the store to be reopened. Should happen by April at the
latest, though!
With everything related to this separation of translation units explained
earlier, we've really got a push with nothing to talk about, this
time. Except, maybe, for the realization that
📝 this current approach might not be the
best fit for TH02 after all: Not only did it force us to
📝 throw away the previous decompilation of
the sound effect playback functions, but OP.EXE also contains
obviously copy-pasted code in addition to the common, shared set of
library functions. How was that game even built, originally??? No
way around compiling that one instance of the "delay until given BGM
measure" function separately then, if it insists on using its own
instance of the VSync delay function…
Oh well, this separated layout still works better for the later games, and
consistency is good. Smooth sailing with all of the other functions, at
least.
Next up: One more of these, which might even end up completing the
📝 transition to our own master.lib header file.
In terms of the total amount of ASM code left in the SHARED
code segments, we're now 30% done after 3 dedicated pushes. It really
shouldn't require 7 more pushes, though!
P0126
TH03/TH04/TH05 decompilation (EGC-powered blitting + .MRS format, part 1/2)
P0127
TH03 decompilation (.MRS format, part 2/2) + separating translation units, part 2/10
💰 Funded by:
Blue Bolt, [Anonymous]
🏷️ Tags:
Alright, back to continuing the master.hpp transition started
in P0124, and repaying technical debt. The last blog post already
announced some ridiculous decompilations… and in fact, not a single
one of the functions in these two pushes was decompilable into
idiomatic C/C++ code.
As usual, that didn't keep me from trying though. The TH04 and TH05
version of the infamous 16-pixel-aligned, EGC-accelerated rectangle
blitting function from page 1 to page 0 was fairly average as far as
unreasonable decompilations are concerned.
The big blocker in TH03's MAIN.EXE, however, turned out to be
the .MRS functions, used to render the gauge attack portraits and bomb
backgrounds. The blitting code there uses the additional FS and GS segment
registers provided by the Intel 386… which
are not supported by Turbo C++'s inline assembler, and
can't be turned into pointers, due to a compiler bug in Turbo C++ that
generates wrong segment prefix opcodes for the _FS and
_GS pseudo-registers.
Apparently I'm the first one to even try doing that with this compiler? I
haven't found any other mention of this bug…
Compiling via assembly (#pragma inline) would work around
this bug and generate the correct instructions. But that would incur yet
another dependency on a 16-bit TASM, for something honestly quite
insignificant.
What we can always do, however, is using __emit__() to simply
output x86 opcodes anywhere in a function. Unlike spelled-out inline
assembly, that can even be used in helper functions that are supposed to
inline… which does in fact allow us to fully abstract away this compiler
bug. Regular if() comparisons with pseudo-registers
wouldn't inline, but "converting" them into C++ template function
specializations does. All that's left is some C preprocessor abuse
to turn the pseudo-registers into types, and then we do retain a
normal-looking poke() call in the blitting functions in the
end. 🤯
Yeah… the result is batshit insane.
I may have gone too far in a few places…
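The core idea of selecting opcodes through template specializations can be shown in portable C++, without any Turbo C++ specifics. The sketch below is not the ReC98 code: the tag types and function names are hypothetical, and a `std::vector` stands in for `__emit__()`. The opcode bytes themselves are the regular x86 segment override prefixes (26h for ES, 64h for FS, 65h for GS).

```cpp
#include <stdint.h>
#include <vector>

// Tag types standing in for the segment pseudo-registers.
struct SegES {};
struct SegFS {};
struct SegGS {};

// One specialization per register, selected at compile time.
template <class Seg> struct SegPrefix;
template <> struct SegPrefix<SegES> {
	static uint8_t opcode() { return 0x26; }
};
template <> struct SegPrefix<SegFS> {
	static uint8_t opcode() { return 0x64; }
};
template <> struct SegPrefix<SegGS> {
	static uint8_t opcode() { return 0x65; }
};

// Appends the 16-bit machine code for `MOV Seg:[DI], AX` to `code`.
template <class Seg> void emit_poke_di_ax(std::vector<uint8_t>& code)
{
	code.push_back(SegPrefix<Seg>::opcode()); // segment override prefix
	code.push_back(0x89); // MOV r/m16, r16
	code.push_back(0x05); // ModR/M byte: [DI], AX
}
```

Since the register is a template parameter rather than a runtime value, there is no comparison left to evaluate at runtime, which is the property that lets this kind of helper inline cleanly.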
One might certainly argue that all these ridiculous decompilations
actually hurt the preservation angle of this project. "Clearly, ZUN
couldn't have possibly written such unreasonable C++ code.
So why pretend he did, and not just keep it all in its more natural ASM
form?" Well, there are several reasons:
Future port authors will merely have to translate all the
pseudo-registers and inline assembly to C++. For the former, this is
typically as easy as replacing them with newly declared local variables. No
need to bother with function prolog and epilog code, calling conventions, or
the build system.
No duplication of constants and structures in ASM land.
As a more expressive language, C++ can document the code much better.
Meticulous documentation seems to have become the main attraction of ReC98
these days – I've seen it appreciated quite a number of times, and the
continued financial support of all the backers speaks volumes. Mods, on the
other hand, are still a rather rare sight.
Having as few .ASM files in the source tree as possible looks better to
casual visitors who just look at GitHub's repo language breakdown. This way,
ReC98 will also turn from an "Assembly project" to its rightful state
of "C++ project" much sooner.
And finally, it's not like the ASM versions are
gone – they're still part of the Git history.
Unfortunately, these pushes also demonstrated a second disadvantage in
trying to decompile everything possible: Since Turbo C++ lacks TASM's
fine-grained ability to enforce code alignment on certain multiples of
bytes, it might actually be unfeasible to link in a C-compiled object file
at its intended original position in some of the .EXE files it's used in.
Which… you're only going to notice once you encounter such a case. Due to
the slightly jumbled order of functions in the
📝 second, shared code segment, that might
be long after you decompiled and successfully linked in the function
everywhere else.
And then you'll have to throw away that decompilation after all 😕 Oh
well. In this specific case (the lookup table generator for horizontally
flipping images), that decompilation was a mess anyway, and probably
helped nobody. I could have added a dummy .OBJ that does nothing but
enforce the needed 2-byte alignment before the function if I
really insisted on keeping the C version, but it really wasn't
worth it.
Now that I've also described yet another meta-issue, maybe there'll
really be nothing to say about the next technical debt pushes?
Next up though: Back to actual progress
again, with TH01. Which maybe even ends up pushing that game over the 50%
RE mark?
Finally, after a long while, we've got two pushes with barely anything to
talk about! Continuing the road towards 100% PI for TH05, these were
exactly the two pushes that TH05 MAINE.EXE PI was estimated
to additionally cost, relative to TH04's. Consequently, they mostly went
to TH05's unique data structures in the ending cutscenes, the score name
registration menu, and the
staff roll.
A unique feature in there is TH05's support for automatic text color
changes in its ending scripts, based on the first full-width Shift-JIS
codepoint in a line. The \c=codepoint,color
commands at the top of the _ED??.TXT set up exactly this
codepoint→color mapping. As far as I can tell, TH05 is the only Touhou
game with a feature like this – even the Windows Touhou games went back to
manually spelling out each color change.
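A lookup like this could be sketched as follows. The names are hypothetical, the map is assumed to have been filled from the \c commands beforehand, and the sketch interprets "first full-width codepoint" as the first double-byte Shift-JIS character found anywhere in the line, which may or may not match the game's exact rule.

```cpp
#include <stdint.h>
#include <map>
#include <string>

// Maps a full-width Shift-JIS codepoint to a text color.
typedef std::map<uint16_t, int> ColorMap;

// Lead bytes of double-byte Shift-JIS characters.
bool sjis_is_lead_byte(unsigned char c)
{
	return ((c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC));
}

// Returns the color mapped to the first full-width codepoint in `line`, or
// `color_default` if there is no such codepoint or no mapping for it.
int line_color(
	const std::string& line, const ColorMap& colors, int color_default
)
{
	for(std::string::size_type i = 0; i < line.size(); i++) {
		const unsigned char lead = static_cast<unsigned char>(line[i]);
		if(!sjis_is_lead_byte(lead)) {
			continue; // half-width character
		}
		if((i + 1) >= line.size()) {
			break; // truncated double-byte character
		}
		const unsigned char trail = static_cast<unsigned char>(line[i + 1]);
		const uint16_t codepoint = static_cast<uint16_t>((lead << 8) | trail);
		const ColorMap::const_iterator it = colors.find(codepoint);
		return ((it != colors.end()) ? it->second : color_default);
	}
	return color_default;
}
```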
The orb particles in TH05's staff roll also try to be a bit unique by
using 32-bit X and Y subpixel variables for their current position. With
still just 4 fractional bits, I can't really tell yet whether the extended
range was actually necessary. Maybe due to how the "camera scrolling"
through "space" was implemented? All other entities were pretty much the
usual fare, though.
12.4, 4.4, and now a 28.4 fixed-point format… yup,
📝 C++ templates were
definitely the right choice.
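As a simplified sketch of why templates fit here: all three formats share the 4 fractional bits and only differ in their storage type. The type and function names below are hypothetical and much more minimal than what a real codebase would want.

```cpp
#include <stdint.h>

// All formats share 4 fractional bits and only differ in their storage type.
template <class T> struct SubpixelBase {
	T v; // raw value, in units of 1/16 pixel

	static SubpixelBase from_pixels(int pixels) {
		SubpixelBase ret;
		ret.v = static_cast<T>(pixels * 16);
		return ret;
	}

	// Truncates the fractional part toward zero.
	int to_pixels() const {
		return (v / 16);
	}
};

typedef SubpixelBase<int8_t> Subpixel4_4;
typedef SubpixelBase<int16_t> Subpixel12_4;
typedef SubpixelBase<int32_t> Subpixel28_4;
```

Adding the 28.4 format then comes down to a single additional `typedef`, with all conversion logic shared between the three types.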
At the end of its staff roll, TH05 not only displays
the usual performance
verdict, but then scrolls in the scores at the end of each stage
before switching to the high score menu. The simplest way to smoothly
scroll between two full screens on a PC-98 involves a separate bitmap…
which is exactly what TH05 does here, reserving 28,160 bytes of its global
data segment for just one overly large monochrome 320×704 bitmap where
both the screens are rendered to. That's… one benefit of splitting your
game into multiple executables, I guess?
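The 28,160 bytes follow directly from the bitmap dimensions at 1 bit per pixel, as this small helper (with a made-up name) shows:

```cpp
#include <stddef.h>

// Bytes needed for a 1bpp bitmap whose width is a multiple of 8 pixels.
size_t bitmap_1bpp_size(size_t w, size_t h)
{
	return ((w / 8) * h);
}
```

For 320×704 pixels, that's 40 bytes per row times 704 rows.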
Not sure if it's common knowledge that you can actually scroll back and
forth between the two screens with the Up and Down keys before moving to
the score menu. I surely didn't know that before. But it makes sense –
might as well get the most out of that memory.
The necessary groundwork for all of this may have actually made
TH04's (yes, TH04's) MAINE.EXE technically
position-independent. Didn't quite reach the same goal for TH05's – but
what we did reach is ⅔ of all PC-98 Touhou code now being
position-independent! Next up: Celebrating even more milestones, as
-Tom- is about to finish development on his TH05
MAIN.EXE PI demo…
Alright, tooling and technical debt. Shouldn't be really much to talk
about… oh, wait, this is still ReC98.
For the tooling part, I finished up the remaining ergonomics and error
handling for the
📝 sprite converter that Jonathan Campbell contributed two months ago.
While familiarizing myself with the tool, I actually ran into some
unreported errors myself, so this was sort of important to me. Still got
no command-line help in there, but the error messages can now do that job
probably even better, since we would have had to write them anyway.
So, what's up with the technical debt then? Well, by now we've accumulated
quite a number of 📝 ASM code slices that
need to be either decompiled or clearly marked as undecompilable. Since we
define those slices as "already reverse-engineered", that decision won't
affect the numbers on the front page at all. But for a complete
decompilation, we'd still have to do this someday. So, rather than
incorporating this work into pushes that were purchased with the
expectation of measurable progress in a certain area, let's take the
"anything goes" pushes, and focus entirely on that during them.
The second code segment seemed like the best place to start with this,
since it affects the largest number of games simultaneously. Starting with
TH02, this segment contains a set of random "core" functions needed by the
binary. Image formats, sounds, input, math, it's all there in some
capacity. You could maybe call it all "libzun" or something like
that? But for the time being, I simply went with the obvious name,
seg2. Maybe I'll come up with something more convincing in
the future.
Oh, but wait, why were we assembling all the previous undecompilable ASM
translation units in the 16-bit build part? By moving those to the 32-bit
part, we don't even need a 16-bit TASM in our list of dependencies, as
long as our build process is not fully 16-bit.
And with that, ReC98 now also builds on Windows 95, and thus, every 32-bit
Windows version. 🎉 Which is certainly the most user-visible improvement
in all of these two pushes.
Back in 2015, I already decompiled all of TH02's seg2
functions. As suggested by the Borland compiler, I tried to follow a "one
translation unit per segment" layout, bundling the binary-specific
contents via #include. In the end, it required two
translation units – and that was even after manually inserting the
original padding bytes via #pragma codestring… yuck. But it
worked, compiled, and kept the linker's job (and, by extension,
segmentation worries) to a minimum. And as long as it all matched the
original binaries, it still counted as a valid reconstruction of ZUN's
code.
However, that idea ultimately falls apart once TH03 starts mixing
undecompilable ASM code in between C functions. Now, we officially have no
choice but to use multiple C and ASM translation units, with maybe only
just one or two #includes in them…
…or we finally start reconstructing the actual seg2 library,
turning every sequence of related functions into its own translation unit.
This way, we can simply reuse the once-compiled .OBJ files for all the
binaries those functions appear in, without requiring that additional
layer of translation units mirroring the original segmentation.
The best example for this is
TH03's
almost undecompilable function that generates a lookup table for
horizontally flipping 8 1bpp pixels. It's part of every binary since
TH03, but only used in that game. With the previous approach, we would
have had to add 9 C translation units, which would all have just
#included that one file. Now, we simply put the .OBJ file
into the correct place on the linker command line, as soon as we can.
💡 And suddenly, the linker just inserts the correct padding bytes itself.
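As for what such a lookup table contains: flipping 8 horizontally adjacent 1bpp pixels is the same as reversing the bit order of a byte, so the table has 256 entries, one per possible byte. The following is a plain C++ sketch of that idea, not a reconstruction of ZUN's almost undecompilable function; `flip_8_pixels()` and `build_hflip_table()` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Reverses the bit order of one byte, i.e. mirrors 8 1bpp pixels.
inline std::uint8_t flip_8_pixels(std::uint8_t pixels) {
	std::uint8_t flipped = 0;
	for(int bit = 0; bit < 8; bit++) {
		if(pixels & (1 << bit)) {
			flipped |= (0x80 >> bit);
		}
	}
	return flipped;
}

// Precalculates the flipped version of every possible byte.
inline std::array<std::uint8_t, 256> build_hflip_table() {
	std::array<std::uint8_t, 256> table{};
	for(int i = 0; i < 256; i++) {
		table[i] = flip_8_pixels(static_cast<std::uint8_t>(i));
		// Flipping twice must give back the original pixels.
		assert(flip_8_pixels(table[i]) == i);
	}
	return table;
}
```

Flipping a whole sprite row then boils down to looking up each byte in the table and writing the results in reverse byte order.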
The most immediate gains there also happened to come from TH03. Which is
also where we did get some tiny RE% and PI% gains out of this after
all, by reverse-engineering some of its sprite blitting setup code. Sure,
I should have done even more RE here, to also cover those 5 functions at
the end of code segment #2 in TH03's MAIN.EXE that were in
front of a number of library functions I already covered in this push. But
let's leave that to an actual RE push 😛
All in all though, I was just getting started with this; the real
gains in terms of removed ASM files are still to come. But in the
meantime, the funding situation has become even better in terms of
allowing me to focus on things nobody asked for. 🙂 So here's a slightly
better idea: Instead of spending two more pushes on this, let's shoot for
TH05 MAINE.EXE position independence next. If I manage to get
it done, we'll have a 100% position-independent TH05 by the time
-Tom- finishes his MAIN.EXE PI demo, rather
than the 94% we'd get from just MAIN.EXE. That's bound to
make a much better impression on all the people who will then
(re-)discover the project.
P0110
TH05 RE (Shinki and EX-Alice background animation structures)
💰 Funded by:
[Anonymous], Blue Bolt
🏷️ Tags:
… and just as I explained 📝 in the last post
how decompilation is typically more sensible and efficient than ASM-level
reverse-engineering, we have this push demonstrating a counter-example.
The reason why the background particles and lines in the Shinki and
EX-Alice battles contributed so much to position dependence was simply
because they're accessed in a relatively large number of functions, one
for each different animation. Too many to spend the remaining precious
crowdfunded time on reverse-engineering or even decompiling them all,
especially now that everyone anticipates 100% PI for TH05's
MAIN.EXE.
Therefore, I only decompiled the two functions of the line structure that
also demonstrate best how it works, which in turn also helped with RE.
Sadly, this revealed that we actually can't 📝 overload operator =() to get
that nice assignment syntax for 12.4 fixed-point values, because one of
those new functions relies on Turbo C++'s built-in optimizations for
trivially copyable structures. Still, impressive that this abstraction
caused no other issues for almost one year.
As for the structures themselves… nope, nothing to criticize this time!
Sure, one good particle system would have been awesome, instead of having
separate structures for the Stage 2 "starfield" particles and the one used
in Shinki's battle, with hardcoded animations for both. But given the
game's short development time, that was quite an acceptable compromise,
I'd say.
And as for the lines, there just has to be a reason why the game
reserves 20 lines per set, but only renders lines #0, #6, #12, and #18.
We'll probably see once we get to look at those animation functions more
closely.
This was quite a 📝 TH03-style RE push,
which yielded way more PI% than RE%. But now that that's done, I can
finally not get distracted by all that stuff when looking at the
list of remaining memory references. Next up: The last few missing
structures in TH05's MAIN.EXE!
Well, that took twice as long as I thought, with the two pushes containing
a lot more maintenance than actual new research. Spending some time
improving both field names and types in
32th System's
TH03 resident structure finally gives us all of those
structures. Which means that we can now cover all the remaining
decompilable ZUN.COM parts at once…
Oh wait, their main() functions have stayed largely identical
since TH02? Time to clean up and separate that first, then… and combine
two recent code generation observations into the solution to a
decompilation puzzle from 4½ years ago. Alright, time to decomp-
Oh wait, we'd kinda like to properly RE all the code in TH03-TH05
that deals with loading and saving .CFG files. Almost every outside
contributor wanted to grab this supposedly low-hanging fruit a lot
earlier, but (of course) always just for a single game, while missing how
the format evolved.
So, ZUN.COM. For some reason, people seem to consider it
particularly important, even though it contains neither any game logic nor
any code specific to PC-98 hardware… All that this decompilable part does
is to initialize a game's .CFG file, allocate an empty resident structure
using master.lib functions, release it after you quit the game,
error-check all that, and print some playful messages~ (OK, TH05's also
directly fills the resident structure with all data from
MIKO.CFG, which all the other games do in OP.EXE.)
At least modders can now freely change and extend all the resident
structures, as well as the .CFG files? And translators can translate those
messages that you won't see on a decently fast emulator anyway? Have fun,
I guess 🤷
And you can in fact do this right now – even for TH04 and TH05,
whose ZUN.COM currently isn't rebuilt by ReC98. There is
actually a rather involved reason for this:
One of the missing files is TH05's GJINIT.COM.
Which contains all of TH05's gaiji characters in hardcoded 1bpp form,
together with a bit of ASM for writing them to the PC-98's hardware gaiji
RAM
Which means we'd ideally first like to have a sprite compiler, for
all the hardcoded 1bpp sprites
Which must compile to an ASM slice in the meantime, but should also
output directly to an OMF .OBJ file (for performance now), as well as to C
code (for portability later)
Which I won't put in as long as the backlog contains actual
progress to drive up the percentages on the front page.
So yeah, no meaningful RE and PI progress at any of these levels. Heck,
even as a modder, you can just replace the zun zun_res
(TH02), zun -5 (TH03), or zun -s (TH04/TH05)
calls in GAME.BAT with a direct call to your modified
*RES*.COM. And with the alternative being "manually typing 0 and 1
bits into a text file", editing the sprites in TH05's
GJINIT.COM is way more comfortable in a binary sprite editor
anyway.
For me though, the best part in all of this was that it finally made sense
to throw out the old Borland C++ run-time assembly slices 🗑 This giant
waste of time
became obvious 5 years ago, but any ASM dump of a .COM
file would have needed rather ugly workarounds without those slices. Now
that all .COM binaries that were originally written in C are
compiled from C, we can all enjoy slightly faster grepping over the entire
repository, which now has 229 fewer files. Productivity will skyrocket!
Next up: Three weeks of almost full-time ReC98 work! Two more PI-focused
pushes to finish this TH05 stretch first, before switching priorities to
TH01 again.
Turns out that covering TH03's 128-byte player structure was way
more insightful than expected! And while it doesn't include every
bit of per-player data, we still got to know quite a bit about the game
from just trying to name its members:
50 frames of invincibility when starting a new round
110 frames of invincibility when getting hit
64 frames of knockback when getting hit
128 frames before a charged up gauge/boss attack is fired
automatically
The damage a player will take from the next hit starts out at ½ heart
at the beginning of each round, and increases by another ½ heart every
1024 frames, capped at a maximum of 3 hearts. This guarantees that a
player will always survive at least two hits.
In Story Mode, hit damage is biased in favor of the player for the
first 6 stages. The CPU will always take an additional 1½ hearts of damage
in stages 1 and 2, 1 heart in stages 3 and 4, and ½ heart in stages 5 and
6, plus the above frame-based and capped damage amount. So while it's
therefore possible to cause 4½ hearts of damage in Stages 1 and 2 if the
first hit is somehow delayed for at least 5120 frames, you'd still win
faster if the CPU gets hit as soon as possible.
CPU players will charge up a gauge/boss attack as soon as their gauge
has reached a certain level. These levels are now proven to be random; at
the start of every round, the game generates a sequence of 64 gauge level
positions (from 1 to 4), separately for each player. If a round were to
last long enough for a CPU player to fire all 64 of those predetermined
attacks, you'd observe that sequence repeating.
Yes, that means that in theory, these levels can be
RNG-manipulated. More details on that once we've got this game's resident
structure, where the seed is stored.
CPU players follow two main strategies: trying to not get hit, and…
not quite doing that once they've survived for a certain safety threshold
of frames. For the first 2000 frames of a round, this safety frame counter
is reset to 0 every 64 frames, leading the CPU to switch quickly between
the two strategies in the first few Story Mode stages on lower
difficulties, where this safety threshold is less than 64. The calculation
of the actual value is a bit more complex; more on that also once we've got
this game's resident structure.
Section 13 of 夢時空.TXT states that Boss Attacks are only counted
towards the Clear Bonus if they were caused by reaching a certain number
of spell points. This is incorrect; manually charged Level 4 Boss Attacks
are counted as well.
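The hit damage rules from the list above can be condensed into a few lines. This is a sketch in units of ½ hearts, not the game's actual code: all function names are made up, and the exact frame on which each increase happens is an assumption based on the "every 1024 frames" wording.

```cpp
#include <algorithm>
#include <cassert>

// All damage values are in half hearts.

// Frame-based damage: starts at ½ heart, grows by ½ heart every 1024
// frames, and is capped at 3 hearts.
inline int frame_based_damage(long round_frame) {
	assert(round_frame >= 0);
	const long steps = (round_frame / 1024);
	return static_cast<int>(std::min<long>((1 + steps), 6));
}

// Additional damage taken by the CPU in Story Mode, for 1-based stages.
inline int story_cpu_extra_damage(int stage) {
	assert(stage >= 1);
	if(stage <= 2) { return 3; } // 1½ hearts
	if(stage <= 4) { return 2; } // 1 heart
	if(stage <= 6) { return 1; } // ½ heart
	return 0;
}

// Total damage a Story Mode CPU player takes from a hit.
inline int story_cpu_hit_damage(int stage, long round_frame) {
	return (frame_based_damage(round_frame) + story_cpu_extra_damage(stage));
}
```

`story_cpu_hit_damage(1, 5120)` returns 9, which is the 4½ hearts mentioned above: five increases of ½ heart on top of the initial ½ heart reach the 3-heart cap, plus the 1½ hearts of Stage 1 bias.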
The next TH03 pushes can now cover all the functions that reference this
structure in one way or another, and actually commit all this research and
translate it into some RE%. Since the non-TH05 priorities have become a
bit unclear after the last 50 € RE contribution though (as of this
writing, it's still 10 € to decide on what game to cover in two RE
pushes!), I'll be returning to TH05 until that's decided.
As noted in 📝 P0061, TH03 gameplay RE is
indeed going to progress very slowly in the beginning. A lot of the
initial progress won't even be reflected in the RE% – there are just so
many features in this game that are intertwined into each other, and I
only consider functions to be "reverse-engineered" once we understand
every involved piece of code and data, and labeled every absolute
memory reference in it. (Yes, that means that the percentages on the front
page are actually underselling ReC98's progress quite a bit, and reflect a
pretty low bound of our actual understanding of the games.)
So, when I get asked to look directly at gameplay code right now,
it's quite the struggle to find a place that can be covered within a push
or two and that would immediately benefit
scoreplayers. The basics of score and combo handling themselves
managed to fit in pretty well, though:
Just like TH04 and TH05, TH03 stores the current score as 8
binary-coded
decimal digits. Since the last constant 0 is not included, the maximum
score displayable without glitches therefore is 999,999,990 points, but
the game will happily store up to 24,699,999,990 points before the score
wraps back to 0.
There are (surprisingly?) only 6 places where the game actually
adds points to the score. Not quite sure about all of them yet, but they
(of course) include ending a combo, killing enemies, and the bonus at the
end of a round.
Combos can be continued for 80 frames after a 2-hit. The hit counter
can only be increased in the first 48, and effectively resets to 0 for the
last 32, when the Spell Point value starts blinking.
TH03 can track a total of 16 independent "hit combo sources" per
player, simultaneously. These are not related to the number of
actual explosions; rather, each explosion is assigned to one of the 16
slots when it spawns, and all consecutive explosions spawned from that one
will then add to the hit combo in that slot. The hit number displayed in
the top left is simply the largest one among all these.
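To illustrate the score storage from the first point of the list, here is a minimal sketch of an 8-digit BCD score with the implied trailing 0. It assumes one digit per byte with the least significant digit first, and `BcdScore`, `score_add()`, and `score_to_string()` are hypothetical names. The sketch also keeps every digit within 0 to 9 and drops any carry out of the topmost digit, so it deliberately does not reproduce the game's behavior beyond 999,999,990 points.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

struct BcdScore {
	// One decimal digit per byte, digits[0] being the least significant
	// one. The constant last 0 of the displayed score is not stored.
	std::array<std::uint8_t, 8> digits{};
};

// Adds a point value that was already divided by 10.
inline void score_add(BcdScore& score, std::uint32_t points_div_10) {
	unsigned carry = 0;
	for(auto& digit : score.digits) {
		const unsigned sum = (digit + (points_div_10 % 10) + carry);
		digit = static_cast<std::uint8_t>(sum % 10);
		carry = (sum / 10);
		points_div_10 /= 10;
	}
	// A carry out of the topmost digit is simply dropped in this sketch.
}

// Renders the score as it would be displayed, including the constant 0.
inline std::string score_to_string(const BcdScore& score) {
	std::string ret;
	for(auto it = score.digits.rbegin(); it != score.digits.rend(); ++it) {
		assert(*it <= 9);
		ret += static_cast<char>('0' + *it);
	}
	ret += '0';
	return ret;
}
```

Adding 99,999,999 to an empty score and printing it gives `999999990`, the largest value that can be displayed without glitches.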
Oh well, at least we still got a bit of PI% out of this one. From this
point though, the next push (or two) should be enough to cover the big
128-byte player structure – which by itself might not be immediately
interesting to scoreplayers, but surely is quite a blocker for everything
else.
A~nd resident structures ended up being exactly
the right thing to start off the new year with.
WindowsTiger and
spaztron64 have already been
pushing for them with their own reverse-engineering, and together with my
own recent GENSOU.SCR RE work, we've clarified just enough
context around the harder-to-explain values to make both TH04's and TH05's
structures fit nicely into the typical time frame of a single push.
With all the apparently obvious and seemingly just duplicated values, it
has always been easy to do a superficial job for most of the structure,
then lose motivation for the last few unknown fields. Pretty glad to have
this finally covered; I've heard that people are going to write trainer
tools now?
Also, where better to slot in a push that, in terms of figures, seems to
deliver 0% RE and only minuscule PI progress, than at the end of
Touhou Patch Center's 5-push order that already had multiple pushes
yielding above-average progress? As usual,
we'll be reaping the rewards of this work in the next few TH04/TH05
pushes…
…whenever they get funded, that is, as for January, the backers have
shifted the priorities towards TH01 and TH03. TH01 especially is something
I'm quite excited about, as we're finally going to see just how fast this
bloated game is really going to progress. Are you excited?
🎉 TH04's and TH05's OP.EXE are now fully
position-independent! 🎉
What does this mean?
You can now add any data or code to the main menus of the two games, by
simply editing the ReC98 source, writing your mod in ASM or C/C++, and
recompiling the code. Since all absolute memory addresses have now been
converted to labels, this will work without causing any instability. See
the position independence section in the FAQ
for a more thorough explanation about why this was a problem.
What does this not mean?
The original ZUN code hasn't been completely reverse-engineered yet, let
alone decompiled. Pretty much all of that is still ASM, which might make
modding a bit inconvenient right now.
Since this push was otherwise pretty unremarkable, I made a video
demonstrating a few basic things you can do with this:
Now, what to do for the last outstanding Touhou Patch Center push?
Bullets, or resident structures?
… nope, with a game whose MAIN.EXE is still just 5%
reverse-engineered and which naturally makes heavy use of
structures, there's still a lot more PI groundwork to be done before RE
progress can speed up to the levels that we've now reached with TH05. The
good news is that this game is (now) way easier to understand: In contrast
to TH04 and TH05, where we needed to work towards player shots over a
two-digit number of pushes, TH03 only needed two for SPRITE16, and a half
one for the playfield shaking mechanism. After that, I could even already
decompile the per-frame shot update and render functions, thanks to TH03's
high number of code segments. Now, even the big 128-byte player structure
doesn't seem all too far off.
Then again, as TH03 shares no code with any other game, this actually was
a completely average PI push. For the remaining three, we'll return to
TH04 and TH05 though, which should more than make up for the slight drop
in RE speed after this one.
In other news, we've now also reached peak C++, with the introduction of
templates! TH03 stores movement speeds in a 4.4 fixed-point
format, which is an 8-bit spin on the usual 16-bit, 12.4 fixed-point
format.
So, where to start? Well, TH04 bullets are hard, so let's
procrastinate, er, start with TH03 instead.
The 📝 sprite display functions are the
obvious blocker for any structure describing a sprite, and therefore most
meaningful PI gains in that game… and I actually did manage to fit a
decompilation of those three functions into exactly the amount of time
that the Touhou Patch Center community votes allotted to TH03
reverse-engineering!
And a pretty amazing one at that. The original code was so obviously
written in ASM and was just barely decompilable by exclusively using
register pseudovariables and a bit of goto, but I was able to
abstract most of that away, not least thanks to a few helpful optimization
properties of Turbo C++… seriously, I can't stop marveling at this ancient
compiler. The end result is readable, clear, and dare I say
portable?! To anyone interested in porting TH03,
take a look. How painful would it be to port that away from 16-bit
x86?
However, this push is also a typical example that the RE/PI priorities can
only control what I look at, and the outcome can actually differ
greatly. Even though the priorities were 65% RE and 35% PI, the progress
outcome was +0.13% RE and +1.35% PI. But hey, we've got one more push with
a focus on TH03 PI, so maybe that one will include more RE than
PI, and then everything will end up just as ordered?
No priorities, again…?! Please don't do this to me… 😕
Well, let's not continue with TH05 then 😛 And instead use the occasion to
commit this
interesting discovery, made by @m1yur1 last year. Yup, TH03's "ZUNSP"
sprite driver is actually a "rebranded" version of Promisence Soft's
SPRITE16.COM. Sure, you were allowed to use this
driver in your own game, but replacing the copyright with your own isn't
exactly the nicest thing to do… That now makes three library programmers
that ZUN didn't credit. Makes me wonder what makes M. Kajihara so special.
Probably the fact that Touhou has always been about the music for ZUN,
first and foremost.
But what makes this more than a piece of trivia is the fact that
Promisence Soft's SPRITE16 sample game StormySpace was bundled
with documentation on the driver. Shoutout to the Neo Kobe PC-98
collection for preserving the original release!
That means more documented third-party code that we don't necessarily have
to reverse-engineer, just like master.lib or KAJA's PMD driver. However,
the PC-98 EGC is rather complex and definitely not designed
for alpha-tested 16-color sprite blitting. So it (once again) took quite a
while to make sense of SPRITE16's code and the available documentation on
the EGC, to come up with satisfying function names. As a result, I'm going
to distribute the entire RE work related to TH03's SPRITE16 interface
across a total of three pushes, this one being the first of them.
The second one will reverse-engineer the SPRITE16 code reachable from
its interrupt handler, and also come with somewhat detailed English
documentation on the PC-98 EGC raster ops in particular,
Boss explosions! And… urgh, I really also had to wade through that overly complicated HUD rendering code. Even though I had to pick -Tom-'s 7th push here as well, the worst of that is still to come. TH04 and TH05 exclusively store the current and high score internally as unpacked little-endian BCD, with some pretty dense ASM code involving the venerable x86 BCD instructions to update it.
So, what's actually the goal here? Since I was given no priorities, I still haven't had to (potentially) waste time researching whether we really can decompile from anywhere else inside a segment other than backwards from the end. So, the most efficient place for decompilation right now still is the end of TH05's main_01_TEXT segment. With maybe 1 or 2 more reverse-engineering commits, we'd have everything for an efficient decompilation up to sub_123AD. And that mass of code just happens to include all the shot type control functions, and makes up 3,007 instructions in total, or 12% of the entire remaining unknown code in MAIN.EXE.
So, the most reasonable thing would be to actually put some of the upcoming decompilation pushes towards reverse-engineering that missing part. I don't think that's a bad deal since it will allow us to mod TH05 shot types in C sooner, but zorg and qp might disagree.
Next up: thcrap TL notes, followed by finally finishing GhostPhanom's old ReC98 future-proofing pushes. I really don't want to decompile without a proper build system.
P0043
TH04/TH05 RE (Scrolling stage backgrounds, part 1)
P0044
TH04/TH05 RE (Scrolling stage backgrounds, part 2)
P0045
TH04/TH05 RE (Scrolling stage backgrounds, part 3)
Turns out I had only been about half done with the drawing routines. The rest was all related to redrawing the scrolling stage backgrounds after other sprites were drawn on top. Since the PC-98 does have hardware-accelerated scrolling, but no hardware-accelerated sprites, everything that draws animated sprites into a scrolling VRAM must then also make sure that the background tiles covered by the sprite are redrawn in the next frame, which required a bit of ZUN code. And those are the functions that have been in the way of the expected rapid reverse-engineering progress that uth05win was supposed to bring. So, looks like everything's going to go really fast now?
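The general idea of "remember which background tiles a sprite covered, then redraw them on the next frame" can be sketched as follows. This is a generic dirty-tile scheme and not a reconstruction of ZUN's functions: the 16×16 tile size, the 384×368 playfield, and all names are assumptions for illustration, and the sketch ignores scrolling entirely.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumed dimensions, chosen for illustration only.
constexpr int TILE_W = 16;
constexpr int TILE_H = 16;
constexpr int PLAYFIELD_W = 384;
constexpr int PLAYFIELD_H = 368;
constexpr int TILES_X = (PLAYFIELD_W / TILE_W);
constexpr int TILES_Y = (PLAYFIELD_H / TILE_H);

struct DirtyTiles {
	std::array<bool, (TILES_X * TILES_Y)> dirty{};

	// Marks every tile touched by a sprite rectangle in playfield space.
	void mark_rect(int left, int top, int w, int h) {
		assert(w > 0 && h > 0);
		if(
			((left + w) <= 0) || ((top + h) <= 0) ||
			(left >= PLAYFIELD_W) || (top >= PLAYFIELD_H)
		) {
			return; // entirely outside the playfield
		}
		const int x1 = (std::max(left, 0) / TILE_W);
		const int y1 = (std::max(top, 0) / TILE_H);
		const int x2 = (std::min((left + w - 1), (PLAYFIELD_W - 1)) / TILE_W);
		const int y2 = (std::min((top + h - 1), (PLAYFIELD_H - 1)) / TILE_H);
		for(int y = y1; y <= y2; y++) {
			for(int x = x1; x <= x2; x++) {
				dirty[(y * TILES_X) + x] = true;
			}
		}
	}

	// Number of tiles that would have to be redrawn on the next frame.
	int count() const {
		return static_cast<int>(std::count(dirty.begin(), dirty.end(), true));
	}
};
```

A 16×16 sprite that sits exactly on the tile grid dirties a single tile, while the same sprite at (8, 8) straddles four of them, which is why unaligned sprites cost more background redraws.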
… yeah, no, we won't get very far without figuring out these drawing routines.
Which process data that comes from the .STD files.
Which has various arrays related to the background… including one to specify the scrolling speed. And wait, setting that to 0 actually is what starts a boss battle?
So, have a TH05 Boss Rush patch: 2018-12-26-TH05BossRush.zip
Theoretically, this should have also worked for TH04, but for some reason,
the Stage 3 boss gets stuck on the first phase if we do this?
While we're waiting for Bruno to release the next thcrap build with ANM header patching, here are the resulting commits of the ReC98 CDG/CD2 special offer purchased by DTM, reverse-engineering all code that covers these formats.
> OK, let's do a quick ReC98 update before going back to thcrap, shouldn't take long
> Hm, all that input code is kind of in the way, would be nice to cover that first to ease comparisons with uth05win's source code
> What the hell, why does ZUN do this? Need to do more research
> …
> OK, research done, wait, what are those other functions doing?
> Wha, everything about this is just ever so slightly awkward
Which ended up turning this one update into 2/10, 3/10, 4/10 and 5/10 of zorg's reverse-engineering commits. But at least we now got all shared input functions of TH02-TH05 covered and well understood.