⮜ Blog

⮜ List of tags

Showing all posts tagged master.lib-

📝 Posted:
🚚 Summary of:
P0223, P0224, P0225
Commits:
139746c...371292d, 371292d...8118e61, 8118e61...4f85326
💰 Funded by:
rosenrose, Blue Bolt, Splashman, -Tom-, Yanga, Enderwolf, 32th System
🏷 Tags:
rec98+ th02+ th03+ th04+ th05+ cutscene+ blitting+ micro-optimization+ glitch+ master.lib- pc98+ performance+ kaja+ unused+ meta+ midboss+ animation+

More than three months without any reverse-engineering progress! It's been way too long. Coincidentally, we're at least back with a surprising 1.25% of overall RE, achieved within just 3 pushes. The ending script system is not only more or less the same in TH04 and TH05, but actually originated in TH03, where it's also used for the cutscenes before stages 8 and 9. This means that it was one of the final pieces of code shared between three of the four remaining games, which I got to decompile at roughly 3× the usual speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The Music Room is largely equivalent in all three remaining games as well, and the sound device selection, ZUN Soft logo screens, and main/option menus are the same in TH04 and TH05. A lot of that code is in the "technically RE'd but not yet decompiled" ASM form though, so it would shift Finalized% more significantly than RE%. Therefore, make sure to order the new Finalization option rather than Reverse-engineering if you want to make number go up.


So, cutscenes. On the surface, the .TXT files look simple enough: You directly write the text that should appear on the screen into the file without any special markup, and add commands to define visuals, music, and other effects at any place within the script. Let's start with the basics of how text is rendered, which are the same in all three games:


Superficially, the list of game-specific differences doesn't look too long, and can be summarized in a rather short table:

:th03: TH03 :th04: TH04 :th05: TH05
Script size limit 65536 bytes (heap-allocated) 8192 bytes (statically allocated)
Delay between every 2 bytes of text 1 frame by default, customizable via \v None
Text delay when holding ESC Varying speed-up factor None
Visibility of new text Immediately typed onto the screen Rendered onto invisible VRAM page, faded in on wait commands
Visibility of old text Unblitted when starting a new box Left on screen until crossfaded out with new text
Key binding for advancing the script Any key ⏎ Return, Shot, or ESC
Animation while waiting for an advance key None ⏎⃣, past right edge of current row
Inexplicable delays None 1 frame before changing pictures and after rendering new text boxes
Additional delay per interpreter loop 614.4 µs None 614.4 µs
The 614.4 µs correspond to the necessary delay for working around the repeated key up and key down events sent by PC-98 keyboards when holding down a key. While the absence of this delay significantly speeds up TH04's interpreter, it's also the reason why that game will stop recognizing a held ESC key after a few seconds, requiring you to press it again.

It's when you get into the implementation that the combined three systems reveal themselves as a giant mess, with more like 56 differences between the games. :zunpet: Every single new weird line of code opened up another can of worms, which ultimately made all of this end up with 24 pieces of bloat and 14 bugs. The worst of these should be quite interesting for the general PC-98 homebrew developers among my audience:


That brings us to the individual script commands… and yes, I'm going to document every single one of them. Some of their interactions and edge cases are not clear at all from just looking at the code.

Almost all commands are preceded by… well, a 0x5C lead byte. :thonk: Which raises the question of whether we should document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded ¥ yen sign. From a gaijin perspective, it seems obvious that it's a backslash, as it's consistently displayed as one in most of the editors you would actually use nowadays. But interestingly, iconv -f shift-jis -t utf-8 does convert any 0x5C lead bytes to actual ¥ U+00A5 YEN SIGN code points :tannedcirno:.
Ultimately, the distinction comes down to the font. There are fonts that still render 0x5C as ¥, but mainly do so out of an obvious concern about backward compatibility to JIS X 0201, where this mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho, the old Japanese fonts from Windows 3.1, but even Meiryo and Yu Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much every other modern font, and freely licensed ones in particular, render this code point as \, even if you set your editor to Shift-JIS. And while ZUN most definitely saw it as a ¥, documenting this code point as \ is less ambiguous in the long run. It can only possibly correspond to one specific code point in either Shift-JIS or UTF-8, and will remain correct even if we later mod the cutscene system to support full-blown Unicode.

Now we've only got to clarify the parameter syntax, and then we can look at the big table of commands:

:th03: :th04: :th05: \@ Clears both VRAM pages by filling them with VRAM color 0.
🐞 In TH03 and TH04, this command does not update the internal text area background used for unblitting. This bug effectively restricts usage of this command to either the beginning of a script (before the first background image is shown) or its end (after no more new text boxes are started). See the image below for an example of using it anywhere else.
:th03: :th04: :th05: \b2 Sets the font weight to a value between 0 (raw font ROM glyphs) to 3 (very thicc). Specifying any other value has no effect.
:th04: :th05: 🐞 In TH04 and TH05, \b3 leads to glitched pixels when rendering half-width glyphs due to a bug in the newly micro-optimized ASM version of 📝 graph_putsa_fx(); see the image below for an example.
In these games, the parameter also directly corresponds to the graph_putsa_fx() effect function, removing the sanity check that was present in TH03. In exchange, you can also access the four dissolve masks for the bold font (\b2) by specifying a parameter between 4 (fewest pixels) to 7 (most pixels). Demo video below.
:th03: :th04: :th05: \c15 Changes the text color to VRAM color 15.
:th05: \c=,15 Adds a color map entry: If is the first code point inside the name area on a new line, the text color is automatically set to 15. Up to 8 such entries can be registered before overflowing the statically allocated buffer.
🐞 The comma is assumed to be present even if the color parameter is omitted.
:th03: :th04: :th05: \e0 Plays the sound effect with the given ID.
:th03: :th04: :th05: \f (no-op)
:th03: :th04: :th05: \fi1
\fo1
Calls master.lib's palette_black_in() or palette_black_out() to play a hardware palette fade animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
:th03: :th04: :th05: \fm1 Fades out BGM volume via PMD's AH=02h interrupt call, in a non-blocking way. The fade speed can range from 1 (slowest) to 127 (fastest).
Values from 128 to 255 technically correspond to AH=02h's fade-in feature, which can't be used from cutscene scripts because it requires BGM volume to first be lowered via AH=19h, and there is no command to do that.
:th03: :th04: :th05: \g8 Plays a blocking 8-frame screen shake animation.
:th03: :th04: \ga0 Shows the gaiji with the given ID from 0 to 255 at the current cursor position. Even in TH03, gaiji always ignore the text delay interval configured with \v.
:th05: @3 TH05's replacement for the \ga command from TH03 and TH04. The default ID of 3 corresponds to the ♫ gaiji. Not to be confused with \@, which starts with a backslash, unlike this command.
:th05: @h Shows the 🎔 gaiji.
:th05: @t Shows the 💦 gaiji.
:th05: @! Shows the ! gaiji.
:th05: @? Shows the ? gaiji.
:th05: @!! Shows the ‼ gaiji.
:th05: @!? Shows the ⁉ gaiji.
:th03: :th04: :th05: \k0 Waits 0 frames (0 = forever) for an advance key to be pressed before continuing script execution. Before waiting, TH05 crossfades in any new text that was previously rendered to the invisible VRAM page…
🐞 …but TH04 doesn't, leaving the text invisible during the wait time. As a workaround, \vp1 can be used before \k to immediately display that text without a fade-in animation.
:th03: :th04: :th05: \m$ Stops the currently playing BGM.
:th03: :th04: :th05: \m* Restarts playback of the currently loaded BGM from the beginning.
:th03: :th04: :th05: \m,filename Stops the currently playing BGM, loads a new one from the given file, and starts playback.
:th03: :th04: :th05: \n Starts a new line at the leftmost X coordinate of the box, i.e., the start of the name area. This is how scripts can "change" the name of the currently speaking character, or use the entire 480×64 pixels without being restricted to the non-name area.
Note that automatic line breaks already move the cursor into a new line. Using this command at the "end" of a line with the maximum number of 30 full-width glyphs would therefore start a second new line and leave the previously started line empty.
If this command moved the cursor into the 5th line of a box, \s is executed afterward, with any of \n's parameters passed to \s.
:th03: :th04: :th05: \p (no-op)
:th03: :th04: :th05: \p- Deallocates the loaded .PI image.
:th03: :th04: :th05: \p,filename Loads the .PI image with the given file into the single .PI slot available to cutscenes. TH04 and TH05 automatically deallocate any previous image, 🐞 TH03 would leak memory without a manual prior call to \p-.
:th03: :th04: :th05: \pp Sets the hardware palette to the one of the loaded .PI image.
:th03: :th04: :th05: \p@ Sets the loaded .PI image as the full-screen 640×400 background image and overwrites both VRAM pages with its pixels, retaining the current hardware palette.
:th03: :th04: :th05: \p= Runs \pp followed by \p@.
:th03: :th04: :th05: \s0
\s-
Ends a text box and starts a new one. Fades in any text rendered to the invisible VRAM page, then waits 0 frames (0 = forever) for an advance key to be pressed. Afterward, the new text box is started with the cursor moved to the top-left corner of the name area.
\s- skips the wait time and starts the new box immediately.
:th03: :th04: :th05: \t100 Sets palette brightness via master.lib's palette_settone() to any value from 0 (fully black) to 200 (fully white). 100 corresponds to the palette's original colors. Preceded by a 1-frame delay unless ESC is held.
:th03: \v1 Sets the number of frames to wait between every 2 bytes of rendered text.
:th04: Sets the number of frames to spend on each of the 4 fade steps when crossfading between old and new text. The game-specific default value is also used before the first use of this command.
:th05: \v2
:th03: :th04: :th05: \vp0 Shows VRAM page 0. Completely useless in TH03 (this game always synchronizes both VRAM pages at a command boundary), only of dubious use in TH04 (for working around a bug in \k), and the games always return to their intended shown page before every blitting operation anyway. A debloated mod of this game would just remove this command, as it exposes an implementation detail that script authors should not need to worry about. None of the original scripts use it anyway.
:th03: :th04: :th05: \w64
  • \w and \wk wait for the given number of frames
  • \wm and \wmk wait until PMD has played back the current BGM for the total number of measures, including loops, given in the first parameter, and fall back on calling \w and \wk with the second parameter as the frame number if BGM is disabled.
    🐞 Neither PMD nor MMD reset the internal measure when stopping playback. If no BGM is playing and the previous BGM hasn't been played back for at least the given number of measures, this command will deadlock.
Since both TH04 and TH05 fade in any new text from the invisible VRAM page, these commands can be used to simulate TH03's typing effect in those games. Demo video below.
Contrary to \k and \s, specifying 0 frames would simply remove any frame delay instead of waiting forever.
The TH03-exclusive k variants allow the delay to be interrupted if ⏎ Return or Shot are held down. TH04 and TH05 recognize the k as well, but removed its functionality.
All of these commands have no effect if ESC is held.
\wm64,64
:th03: \wk64
\wmk64,64
:th03: :th04: :th05: \wi1
\wo1
Calls master.lib's palette_white_in() or palette_white_out() to play a hardware palette fade animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
:th03: :th04: :th05: \=4 Immediately displays the given quarter of the loaded .PI image in the picture area, with no fade effect. Any value ≥ 4 resets the picture area to black.
:th03: :th04: :th05: \==4,1 Crossfades the picture area between its current content and quarter #4 of the loaded .PI image, spending 1 frame on each of the 4 fade steps unless ESC is held. Any value ≥ 4 is replaced with quarter #0.
:th03: :th04: :th05: \$ Stops script execution. Must be called at the end of each file; otherwise, execution continues into whatever lies after the script buffer in memory.
TH05 automatically deallocates the loaded .PI image, TH03 and TH04 require a separate manual call to \p- to not leak its memory.
Bold values signify the default if the parameter is omitted; \c is therefore equivalent to \c15.
Using the \@ command in the middle of a TH03 or TH04 cutscene script
The \@ bug. Yes, the ¥ is fake. It was easier to GIMP it than to reword the sentences so that the backslashes landed on the second byte of a 2-byte half-width character pair. :onricdennat:
Cutscene font weights in TH03Cutscene font weights in TH05, demonstrating the <code>\b3</code> bug that also affects TH04Cutscene font weights in TH03, rendered at a hypothetical unaligned X positionCutscene font weights in TH05, rendered at a hypothetical unaligned X position
The font weights and effects available through \b, including the glitch with \b3 in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels at the left side of each glyph, which would extend into the previous VRAM byte. Ironically, the TH04/TH05 version is more correct in this regard: For half-width glyphs, it preserves any further pixel columns generated by the weight functions in the high byte of the 16-dot glyph variable. Unlike TH03, which still cuts them off when rendering text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them towards their correct place (4️⃣). It's only at byte-aligned X positions (2️⃣) where they remain at their internally calculated place, and appear on screen as these glitched pixel columns, 15 pixels away from the glyph they belong to. It's easy to blame bugs like these on micro-optimized ASM code, but in this instance, you really can't argue against it if the original C++ version was equally incorrect.
Combining \b and s- into a partial dissolve animation. The speed can be controlled with \v.
Simulating TH03's typing effect in TH04 and TH05 via \w. Even prettier in TH05 where we also get an additional fade animation after the box ends.

So yeah, that's the cutscene system. I'm dreading the moment I will have to deal with the other command interpreter in these games, i.e., the stage enemy system. Luckily, that one is completely disconnected from any other system, so I won't have to deal with it until we're close to finishing MAIN.EXE… that is, unless someone requests it before. And it won't involve text encodings or unblitting…


The cutscene system got me thinking in greater detail about how I would implement translations, being one of the main dependencies behind them. This goal has been on the order form for a while and could soon be implemented for these cutscenes, with 100% PI being right around the corner for the TH03 and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching for Latin-script languages could be implemented fairly quickly:

  1. Establish basic UTF-8 parsing for less painful manual editing of the source files
  2. Procedurally generate glyphs for the few required additional letters based on existing font ROM glyphs. For example, we'd generate ä by painting two short lines on top of the font ROM's a glyph, or generate ¿ by vertically flipping the question mark. This way, the text retains a consistent look regardless of whether the translated game is run with an NEC or EPSON font ROM, or the hideous abomination that Neko Project II auto-generates if you don't provide either.
  3. (Optional) Change automatic line breaks to work on a per-word basis, rather than per-glyph

That's it – script editing and distribution would be handled by your local translation group. It might seem as if this would also work for Greek and Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not sure if I want to attempt procedurally shrinking these glyphs from 16×16 to 8×16… For any more thorough solution, we'd need to go for a more "Chad" kind of full-blown translation support:

  1. Implement text subdivisions at a sensible granularity while retaining automatic line and box breaks
  2. Compile translatable text into a Japanese→target language dictionary (I'm too old to develop any further translation systems that would overwrite modded source text with translations of the original text)
  3. Implement a custom Unicode font system (glyphs would be taken from GNU Unifont unless translators provide a different 8×16 font for their language)
  4. Combine the text compiler with the font compiler to only store needed glyphs as part of the translation's font file (dealing with a multi-MB font file would be rather ugly in a Real Mode game)
  5. Write a simple install/update/patch stacking tool that supports both .HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to warrant a separate tool – each patch stack would be statically compiled into a single package file in the game's directory)
  6. Add a nice language selection option to the main menu
  7. (Optional) Support proportional fonts

Which sounds more like a separate project to be commissioned from Touhou Patch Center's Open Collective funds, separate from the ReC98 cap. This way, we can make sure that the feature is completely implemented, and I can talk with every interested translator to make sure that their language works.
It's still cheaper overall to do this on PC-98 than to first port the games to a modern system and then translate them. On the other hand, most of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with the difficulty of getting arbitrary Unicode characters to work natively in a PC-98 DOS game at all, and would be either unnecessary or trivial if we had already ported the game. Depending on where the patrons' interests lie, it may not be worth it. So let's see what all of you think about which way we should go, or whether it's worth doing at all. (Edit (2022-12-01): With Splashman's order towards the stage dialogue system, we've pretty much confirmed that it is.) Maybe we want to meet in the middle – using e.g. procedural glyph generation for dynamic translations to keep text rendering consistent with the rest of the PC-98 system, and just not support non-Latin-script languages in the beginning? In any case, I've added both options to the order form.


Surprisingly, there was still a bit of RE work left in the third push after all of this, which I filled with some small rendering boilerplate. Since I also wanted to include TH02's playfield overlay functions, 1/15 of that last push went towards getting a TH02-exclusive function out of the way, which also ended up including that game in this delivery. :tannedcirno:
The other small function pointed out how TH05's Stage 5 midboss pops into the playfield quite suddenly, since its clipping test thinks it's only 32 pixels tall rather than 64:

Good chance that the pop-in might have been intended. There's even another quirk here: The white flash during its first frame is actually carried over from the previous midboss, which the game still considers as actively getting hit by the player shot that defeated it. It's the regular boilerplate code for rendering a midboss that resets the responsible damage variable, and that code doesn't run during the defeat explosion animation.

Next up: Staying with TH05 and looking at more of the pattern code of its boss fights. Given the remaining TH05 budget, it makes the most sense to continue in in-game order, with Sara and the Stage 2 midboss. If more money comes in towards this goal, I could alternatively go for the Mai & Yuki fight and immediately develop a pretty fix for the cheeto storage glitch. Also, there's a rather intricate pull request for direct ZMBV decoding on the website that I've still got to review…

📝 Posted:
🚚 Summary of:
P0190, P0191, P0192
Commits:
5734815...293e16a, 293e16a...71cb7b5, 71cb7b5...e1f3f9f
💰 Funded by:
nrook, -Tom-, [Anonymous]
🏷 Tags:
rec98+ th02+ th03+ th04+ th05+ gameplay+ boss+ shinki+ danmaku-pattern+ midboss+ master.lib- file-format+ tcc+ pc98+ micro-optimization+

The important things first:

So, Shinki! As far as final boss code is concerned, she's surprisingly economical, with 📝 her background animations making up more than ⅓ of her entire code. Going straight from TH01's 📝 final 📝 bosses to TH05's final boss definitely showed how much ZUN had streamlined danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there is still room for improvement: TH05 not only 📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month, but also uses them 4× as often, and even for midbosses. Most importantly though, defining danmaku patterns using a single global instance of the group template structure is just bad no matter how you look at it:

Declaring a separate structure instance with the static data for every pattern would be both safer and more space-efficient, and there's more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow. The "devil" pattern is significantly more complex than the others, but still far from TH01's final bosses at their worst. I especially like the clear architectural separation between "one-shot pattern" functions that return true once they're done, and "looping pattern" functions that run as long as they're being called from a boss's main function. Not many all too interesting things in these pattern functions for the most part, except for two pieces of evidence that Shinki was coded after Yumeko:


Speaking about that wing sprite: If you look at ST05.BB2 (or any other file with a large sprite, for that matter), you notice a rather weird file layout:

Raw file layout of TH05's ST05.BB2, demonstrating master.lib's supposed BFNT width limit of 64 pixels
A large sprite split into multiple smaller ones with a width of 64 pixels each? What's this, hardware sprite limitations? On my PC-98?!

And it's not a limitation of the sprite width field in the BFNT+ header either. Instead, it's master.lib's BFNT functions which are limited to sprite widths up to 64 pixels… or at least that's what MASTER.MAN claims. Whatever the restriction was, it seems to be completely nonexistent as of master.lib version 0.23, and none of the master.lib functions used by the games have any issues with larger sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the game that expects Shinki's winged form to consist of 4 physical sprites, not just 1. Any conversion from another, more logical sprite sheet layout back into BFNT+ must therefore replicate the original number of sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly loaded sprite no longer match ZUN's hardcoded IDs, causing the game to crash. This is exactly what used to happen with -Tom-'s MysticTK automation scripts, which combined these exact sprites into a single large one. This issue has now been fixed – just in case there are some underground modders out there who used these scripts and wonder why their game crashed as soon as the Shinki fight started.


And then the code quality takes a nosedive with Shinki's main function. :onricdennat: Even in TH05, these boss and midboss update functions are still very imperative:

The biggest WTF in there, however, goes to using one of the 16 state bytes as a "relative phase" variable for differentiating between boss phases that share the same branch within the switch(boss.phase) statement. While it's commendable that ZUN tried to reduce code duplication for once, he could have just branched depending on the actual boss.phase variable? The same state byte is then reused in the "devil" pattern to track the activity state of the big jerky lasers in the second half of the pattern. If you somehow managed to end the phase after the first few bullets of the pattern, but before these lasers are up, Shinki's update function would think that you're still in the phase before the "devil" pattern. The main function then sequence-breaks right to the defeat phase, skipping the final pattern with the burning Makai background. Luckily, the HP boundaries are far away enough to make this impossible in practice.
The takeaway here: If you want to use the state bytes for your custom boss script mods, alias them to your own 16-byte structure, and limit each of the bytes to a clearly defined meaning across your entire boss script.

One final discovery that doesn't seem to be documented anywhere yet: Shinki actually has a hidden bomb shield during her two purple-wing phases. uth05win got this part slightly wrong though: It's not a complete shield, and hitting Shinki will still deal 1 point of chip damage per frame. For comparison, the first phase lasts for 3,000 HP, and the "devil" pattern phase lasts for 5,800 HP.

And there we go, 3rd PC-98 Touhou boss script* decompiled, 28 to go! 🎉 In case you were expecting a fix for the Shinki death glitch: That one is more appropriately fixed as part of the Mai & Yuki script. It also requires new code, should ideally look a bit prettier than just removing cheetos between one frame and the next, and I'd still like it to fit within the original position-dependent code layout… Let's do that some other time.
Not much to say about the Stage 1 midboss, or midbosses in general even, except that their update functions have to imperatively handle even more subsystems, due to the relative lack of helper functions.


The remaining ¾ of the third push went to a bunch of smaller RE and finalization work that would have hardly got any attention otherwise, to help secure that 50% RE mark. The nicest piece of code in there shows off what looks like the optimal way of setting up the 📝 GRCG tile register for monochrome blitting in a variable color:

mov ah, palette_index ; Any other non-AL 8-bit register works too.
                      ; (x86 only supports AL as the source operand for OUTs.)

rept 4                ; For all 4 bitplanes…
    shr ah,  1        ; Shift the next color bit into the x86 carry flag
    sbb al,  al       ; Extend the carry flag to a full byte
                      ; (CF=0 → 0x00, CF=1 → 0xFF)
    out 7Eh, al       ; Write AL to the GRCG tile register
endm

Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles into a surprisingly nice one-liner. What a beautiful micro-optimization, at a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there, becoming increasingly dumb and undecompilable. Was it really necessary to save 4 x86 instructions in the highly unlikely case of a new spark sprite being spawned outside the playfield? That one 2D polar→Cartesian conversion function then pointed out Turbo C++ 4.0J's woefully limited support for 32-bit micro-optimizations. The code generation for 32-bit 📝 pseudo-registers is so bad that they almost aren't worth using for arithmetic operations, and the inline assembler just flat out doesn't support anything 32-bit. No use in decompiling a function that you'd have to entirely spell out in machine code, especially if the same function already exists in multiple other, more idiomatic C++ variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC replay file reading code, which should finally prove that nothing about the game's original replay system could serve as even just the foundation for community-usable replays. Just in case anyone was still thinking that.


Next up: Back to TH01, with the Elis fight! Got a bit of room left in the cap again, and there are a lot of things that would make a lot of sense now:

📝 Posted:
🚚 Summary of:
P0186, P0187, P0188
Commits:
a21ab3d...bab5634, bab5634...426a531, 426a531...e881f95
💰 Funded by:
Blue Bolt, [Anonymous], nrook
🏷 Tags:
rec98+ th02+ th04+ th05+ gameplay+ boss+ portability+ tcc+ master.lib- bug+ marisa-4+ gengetsu+ score+

Did you know that moving on top of a boss sprite doesn't kill the player in TH04, only in TH05?

Screenshot of Reimu moving on top of Stage 6 Yuuka, demonstrating the lack of boss↔player collision in TH04
Yup, Reimu is not getting hit… yet.

That's the first of only three interesting discoveries in these 3 pushes, all of which concern TH04. But yeah, 3 for something as seemingly simple as these shared boss functions… that's still not quite the speed-up I had hoped for. While most of this can be blamed, again, on TH04 and all of its hardcoded complexities, there still was a lot of work to be done on the maintenance front as well. These functions reference a bunch of code I RE'd years ago and that still had to be brought up to current standards, with the dependencies reaching from 📝 boss explosions over 📝 text RAM overlay functionality up to in-game dialog loading.

The latter provides a good opportunity to talk a bit about x86 memory segmentation. Many aspiring PC-98 developers these days are very scared of it, with some even going as far as to rather mess with Protected Mode and DOS extenders just so that they don't have to deal with it. I wonder where that fear comes from… Could it be because every modern programming language I know of assumes memory to be flat, and lacks any standard language-level features to even express something like segments and offsets? That's why compilers have a hard time targeting 16-bit x86 these days: Doing anything interesting on the architecture requires giving the programmer full control over segmentation, which always comes down to adding the typical non-standard language extensions of compilers from back in the day. And as soon as DOS stopped being used, these extensions no longer made sense and were subsequently removed from newer tools. A good example for this can be found in an old version of the NASM manual: The project started as an attempt to make x86 assemblers simple again by throwing out most of the segmentation features from MASM-style assemblers, which made complete sense in 1996 when 16-bit DOS and Windows were already on their way out. But there was a point to all those features, and that's why ReC98 still has to use the supposedly inferior TASM.

Not that this fear of segmentation is completely unfounded: All the segmentation-related keywords, directives, and #pragmas provided by Borland C++ and TASM absolutely can be the cause of many weird runtime bugs. Even if the compiler or linker catches them, you are often left with confusing error messages that aged just as poorly as memory segmentation itself.
However, embracing the concept does provide quite the opportunity for optimizations. While it definitely was a very crazy idea, there is a small bit of brilliance to be gained from making proper use of all these segmentation features. Case in point: The buffer for the in-game dialog scripts in TH04 and TH05.

// Thanks to the semantics of `far` pointers, we only need a single 32-bit
// pointer variable for the following code.
extern unsigned char far *dialog_p;

// This master.lib function returns a `void __seg *`, which is a 16-bit
// segment-only pointer. Converting to a `far *` yields a full segment:offset
// pointer to offset 0000h of that segment.
dialog_p = (unsigned char far *)hmem_allocbyte(/* … */);

// Running the dialog script involves pointer arithmetic. On a far pointer,
// this only affects the 16-bit offset part, complete with overflow at 64 KiB,
// from FFFFh back to 0000h.
dialog_p += /* … */;
dialog_p += /* … */;
dialog_p += /* … */;

// Since the segment part of the pointer is still identical to the one we
// allocated above, we can later correctly free the buffer by pulling the
// segment back out of the pointer.
hmem_free((void __seg *)dialog_p);

If dialog_p was a huge pointer, any pointer arithmetic would have also adjusted the segment part, requiring a second pointer to store the base address for the hmem_free call. Doing that will also be necessary for any port to a flat memory model. Depending on how you look at it, this compression of two logical pointers into a single variable is either quite nice, or really, really dumb in its reliance on the precise memory model of one single architecture. :tannedcirno:


Why look at dialog loading though, wasn't this supposed to be all about shared boss functions? Well, TH04 unnecessarily puts certain stage-specific code into the boss defeat function, such as loading the alternate Stage 5 Yuuka defeat dialog before a Bad Ending, or initializing Gengetsu after Mugetsu's defeat in the Extra Stage.
That's TH04's second core function with an explicit conditional branch for Gengetsu, after the 📝 dialog exit code we found last year during EMS research. And I've heard people say that Shinki was the most hardcoded fight in PC-98 Touhou… Really, Shinki is a perfectly regular boss, who makes proper use of all internal mechanics in the way they were intended, and doesn't blast holes into the architecture of the game. Even within TH05, it's Mai and Yuki who rely on hacks and duplicated code, not Shinki.

The worst part about this though? How the function distinguishes Mugetsu from Gengetsu. Once again, it uses its own global variable to track whether it is called the first or the second time within TH04's Extra Stage, unrelated to the same variable used in the dialog exit function. But this time, it's not just any newly created, single-use variable, oh no. In a misguided attempt to micro-optimize away a few bytes of conventional memory, TH04 reserves 16 bytes of "generic boss state", which can (and are) freely used for anything a boss doesn't want to store in a more dedicated variable.
It might have been worth it if the bosses actually used most of these 16 bytes, but the majority just use (the same) two, with only Stage 4 Reimu using a whopping seven different ones. To reverse-engineer the various uses of these variables, I pretty much had to map out which of the undecompiled danmaku-pattern functions corresponds to which boss fight. In the end, I assigned 29 different variable names for each of the semantically different use cases, which made up another full push on its own.

Now, 16 bytes of wildly shared state, isn't that the perfect recipe for bugs? At least during this cursory look, I haven't found any obvious ones yet. If they do exist, it's more likely that they involve reused state from earlier bosses – just how the Shinki death glitch in TH05 is caused by reusing cheeto data from way back in Stage 4 – and hence require much more boss-specific progress.
And yes, it might have been way too early to look into all these tiny details of specific boss scripts… but then, this happened:

TH04 crashing to the DOS prompt in the Stage 4 Marisa fight, right as the last of her bits is destroyed

Looks similar to another screenshot of a crash in the same fight that was reported in December, doesn't it? I was too much in a hurry to figure it out exactly, but notice how both crashes happen right as the last of Marisa's four bits is destroyed. KirbyComment has suspected this to be the cause for a while, and now I can pretty much confirm it to be an unguarded division by the number of on-screen bits in Marisa-specific pattern code. But what's the cause for Kurumi then? :thonk:
As for fixing it, I can go for either a fast or a slow option:

  1. Superficially fixing only this crash will probably just take a fraction of a push.
  2. But I could also go for a deeper understanding by looking at TH04's version of the 📝 custom entity structure. It not only stores the data of Marisa's bits, but is also very likely to be involved in Kurumi's crash, and would get TH04 a lot closer to 100% PI. Taking that look will probably need at least 2 pushes, and might require another 3-4 to completely decompile Marisa's fight, and 2-3 to decompile Kurumi's.

OK, now that that's out of the way, time to finish the boss defeat function… but not without stumbling over the third of TH04's quirks, relating to the Clear Bonus for the main game or the Extra Stage:

And after another few collision-related functions, we're now truly, finally ready to decompile bosses in both TH04 and TH05! Just as the anything funds were running out… :onricdennat: The remaining ¼ of the third push then went to Shinki's 32×32 ball bullets, rounding out this delivery with a small self-contained piece of the first TH05 boss we're probably going to look at.

Next up, though: I'm not sure, actually. Both Shinki and Elis seem just a little bit larger than the 2¼ or 4 pushes purchased so far, respectively. Now that there's a bunch of room left in the cap again, I'll just let the next contribution decide – with a preference for Shinki in case of a tie. And if it will take longer than usual for the store to sell out again this time (heh), there's still the 📝 PC-98 text RAM JIS trail word rendering research waiting to be documented.

📝 Posted:
🚚 Summary of:
P0168, P0169
Commits:
c2de6ab...8b046da, 8b046da...479b766
💰 Funded by:
rosenrose, Blue Bolt
🏷 Tags:
rec98+ th04+ th05+ boss+ yuuka-5+ blitting+ bug+ master.lib- waste+ mod+

EMS memory! The infamous stopgap measure between the 640 KiB ("ought to be enough for everyone") of conventional memory offered by DOS from the very beginning, and the later XMS standard for accessing all the rest of memory up to 4 GiB in the x86 Protected Mode. With an optionally active EMS driver, TH04 and TH05 will make use of EMS memory to preload a bunch of situational .CDG images at the beginning of MAIN.EXE:

  1. The "eye catch" game title image, shown while stages are loaded
  2. The character-specific background image, shown while bombing
  3. The player character dialog portraits
  4. TH05 additionally stores the boss portraits there, preloading them at the beginning of each stage. (TH04 instead keeps them in conventional memory during the entire stage.)

Once these images are needed, they can then be copied into conventional memory and accessed as usual.

Uh… wait, copied? It certainly would have been possible to map EMS memory to a regular 16-bit Real Mode segment for direct access, bank-switching out rarely used system or peripheral memory in exchange for the EMS data. However, master.lib doesn't expose this functionality, and only provides functions for copying data from EMS to regular memory and vice versa.
But even that still makes EMS an excellent fit for the large image files it's used for, as it's possible to directly copy their pixel data from EMS to VRAM. (Yes, I tried!) Well… would, because ZUN doesn't do that either, and always naively copies the images to newly allocated conventional memory first. In essence, this dumbs down EMS into just another layer of the memory hierarchy, inserted between conventional memory and disk: Not quite as slow as disk, but still requiring that memcpy() to retrieve the data. Most importantly though: Using EMS in this way does not increase the total amount of memory simultaneously accessible to the game. After all, some other data will have to be freed from conventional memory to make room for the newly loaded data.


The most idiomatic way to define the game-specific layout of the EMS area would be either a struct or an enum. Unfortunately, the total size of all these images exceeds the range of a 16-bit value, and Turbo C++ 4.0J supports neither 32-bit enums (which are silently degraded to 16-bit) nor 32-bit structs (which simply don't compile). That still leaves raw compile-time constants though, you only have to manually define the offset to each image in terms of the size of its predecessor. But instead of doing that, ZUN just placed each image at a nice round decimal offset, each slightly larger than the actual memory required by the previous image, just to make sure that everything fits. :tannedcirno: This results not only in quite a bit of unnecessary padding, but also in technically the single biggest amount of "wasted" memory in PC-98 Touhou: Out of the 180,000 (TH04) and 320,000 (TH05) EMS bytes requested, the game only uses 135,552 (TH04) and 175,904 (TH05) bytes. But hey, it's EMS, so who cares, right? Out of all the opportunities to take shortcuts during development, this is among the most acceptable ones. Any actual PC-98 model that could run these two games comes with plenty of memory for this to not turn into an actual issue.

On to the EMS-using functions themselves, which are the definition of "cross-cutting concerns". Most of these have a fallback path for the non-EMS case, and keep the loaded .CDG images in memory if they are immediately needed. Which totally makes sense, but also makes it difficult to find names that reflect all the global state changed by these functions. Every one of these is also just called from a single place, so inlining them would have saved me a lot of naming and documentation trouble there.
The TH04 version of the EMS allocation code was actually displayed on ZUN's monitor in the 2010 MAG・ネット documentary; WindowsTiger already transcribed the low-quality video image in 2019. By 2015 ReC98 standards, I would have just run with that, but the current project goal is to write better code than ZUN, so I didn't. 😛 We sure ain't going to use magic numbers for EMS offsets.

The dialog init and exit code then is completely different in both games, yet equally cross-cutting. TH05 goes even further in saving conventional memory, loading each individual player or boss portrait into a single .CDG slot immediately before blitting it to VRAM and freeing the pixel data again. People who play TH05 without an active EMS driver are surely going to enjoy the hard drive access lag between each portrait change… :godzun: TH04, on the other hand, also abuses the dialog exit function to preload the Mugetsu defeat / Gengetsu entrance and Gengetsu defeat portraits, using a static variable to track how often the function has been called during the Extra Stage… who needs function parameters anyway, right? :zunpet:

This is also the function in which TH04 infamously crashes after the Stage 5 pre-boss dialog when playing with Reimu and without any active EMS driver. That crash is what motivated this look into the games' EMS usage… but the code looks perfectly fine? Oh well, guess the crash is not related to EMS then. Next u–

OK, of course I can't leave it like that. Everyone is expecting a fix now, and I still got half of a push left over after decompiling the regular EMS code. Also, I've now RE'd every function that could possibly be involved in the crash, and this is very likely to be the last time I'll be looking at them.


Turns out that the bug has little to do with EMS, and everything to do with ZUN limiting the amount of conventional RAM that TH04's MAIN.EXE is allowed to use, and then slightly miscalculating this upper limit. Playing Stage 5 with Reimu is the most asset-intensive configuration in this game, due to the combination of

The star image used in TH04's Stage 5.
The star image used in TH04's Stage 5.

Remove any single one of the above points, and this crash would have never occurred. But with all of them combined, the total amount of memory consumed by TH04's MAIN.EXE just barely exceeds ZUN's limit of 320,000 bytes, by no more than 3,840 bytes, the size of the star image.

But wait: As we established earlier, EMS does nothing to reduce the amount of conventional memory used by the game. In fact, if you disabled TH04's EMS handling, you'd still get this crash even if you are running an EMS driver and loaded DOS into the High Memory Area to free up as much conventional RAM as possible. How can EMS then prevent this crash in the first place?

The answer: It's only because ZUN's usage of EMS bypasses the need to load the cached images back out of the XOR-encrypted 東方幻想.郷 packfile. Leaving aside the general stupidity of any game data file encryption*, master.lib's decryption implementation is also quite wasteful: It uses a separate buffer that receives fixed-size chunks of the file, before decrypting every individual byte and copying it to its intended destination buffer. That really resembles the typical slowness of a C fread() implementation more than it does the highly optimized ASM code that master.lib purports to be… And how large is this well-hidden decryption buffer? 4 KiB. :onricdennat:

So, looking back at the game, here is what happens once the Stage 5 pre-battle dialog ends:

  1. Reimu's bomb background image, which was previously freed to make space for her dialog portraits, has to be loaded back into conventional memory from disk
  2. BB0.CDG is found inside the 東方幻想.郷 packfile
  3. file_ropen() ends up allocating a 4 KiB buffer for the encrypted packfile data, getting us the decisive ~4 KiB closer to the memory limit
  4. The .CDG loader tries to allocate 52 608 contiguous bytes for the pixel data of Reimu's bomb image
  5. This would exceed the memory limit, so hmem_allocbyte() fails and returns a nullptr
  6. ZUN doesn't check for this case (as usual)
  7. The pixel data is loaded to address 0000:0000, overwriting the Interrupt Vector Table and whatever comes after
  8. The game crashes
The final frame rendered before the TH04 Stage 5 Reimu No-EMS crash
The final frame rendered by a crashing TH04.

The 4 KiB encryption buffer would only be freed by the corresponding file_close() call, which of course never happens because the game crashes before it gets there. At one point, I really did suspect the cause to be some kind of memory leak or fragmentation inside master.lib, which would have been quite delightful to fix.
Instead, the most straightforward fix here is to bump up that memory limit by at least 4 KiB. Certainly easier than squeezing in a cdg_free() call for the star image before the pre-boss dialog without breaking position dependence.

Or, even better, let's nuke all these memory limits from orbit because they make little sense to begin with, and fix every other potential out-of-memory crash that modders would encounter when adding enough data to any of the 4 games that impose such limits on themselves. Unless you want to launch other binaries (which need to do their own memory allocations) after launching the game, there's really no reason to restrict the amount of memory available to a DOS process. Heck, whenever DOS creates a new one, it assigns all remaining free memory by default anyway.
Removing the memory limits also removes one of ZUN's few error checks, which end up quitting the game if there isn't at least a given maximum amount of conventional RAM available. While it might be tempting to reserve enough memory at the beginning of execution and then never check any allocation for a potential failure, that's exactly where something like TH04's crash comes from.
This game is also still running on DOS, where such an initial allocation failure is very unlikely to happen – no one fills close to half of conventional RAM with TSRs and then tries running one of these games. It might have been useful to detect systems with less than 640 KiB of actual, physical RAM, but none of the PC-98 models with that little amount of memory are fast enough to run these games to begin with. How ironic… a place where ZUN actually added an error check, and then it's mostly pointless.

Here's an archive that contains both fix variants, just in case. These were compiled from the th04_noems_crash_fix and mem_assign_all branches, and contain as little code changes as possible.
Edit (2022-04-18): For TH04, you probably want to download the 📝 community choice fix package instead, which contains this fix along with other workarounds for the Divide error crashes. 2021-11-29-Memory-limit-fixes.zip

So yeah, quite a complex bug, leaving no time for the TH03 scorefile format research after all. Next up: Raising prices.

📝 Posted:
🚚 Summary of:
P0133
Commits:
045450c...1d5db71
💰 Funded by:
[Anonymous]
🏷 Tags:
rec98+ th01+ th02+ th03+ th04+ th05+ micro-optimization+ master.lib- tcc+

Wow, 31 commits in a single push? Well, what the last push had in progress, this one had in maintenance. The 📝 master.lib header transition absolutely had to be completed in this one, for my own sanity. And indeed, it reduced the build time for the entirety of ReC98 to about 27 seconds on my system, just as expected in the original announcement. Looking forward to even faster build times with the upcoming #include improvements I've got up my sleeve! The port authors of the future are going to appreciate those quite a bit.

As for the new translation units, the funniest one is probably TH05's function for blitting the 1-color .CDG images used for the main menu options. Which is so optimized that it becomes decompilable again, by ditching the self-modifying code of its TH04 counterpart in favor of simply making better use of CPU registers. The resulting C code is still a mess, but what can you do. :tannedcirno:
This was followed by even more TH05 functions that clearly weren't compiled from C, as evidenced by their padding bytes. It's about time I've documented my lack of ideas of how to get those out of Turbo C++. :onricdennat:

And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…

In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?

📝 Posted:
🚚 Summary of:
P0124, P0125
Commits:
72dfa09...056b1c7, 056b1c7...f6a3246
💰 Funded by:
Blue Bolt, [Anonymous]
🏷 Tags:
rec98+ th02+ th04+ pc98+ menu+ waste+ master.lib- waste+

Turns out that TH04's player selection menu is exactly three times as complicated as TH05's. Two screens for character and shot type rather than one, and a way more intricate implementation for saving and restoring the background behind the raised top and left edges of a character picture when moving the cursor between Reimu and Marisa. TH04 decides to backup precisely only the two 256×8 (top) and 8×244 (left) strips behind the edges, indicated in red in the picture below.

Backed-up VRAM area in TH04's player character selection

These take up just 4 KB of heap memory… but require custom blitting functions, and expanding this explicitly hardcoded approach to TH05's 4 characters would have been pretty annoying. So, rather than, uh, not explicitly hardcoding it all, ZUN decided to just be lazy with the backup area in TH05, saving the entire 640×400 screen, and thus spending 128 KB of heap memory on this rather simple selection shadow effect. :zunpet:


So, this really wasn't something to quickly get done during the first half of a push, even after already having done TH05's equivalent of this menu. But since life is very busy right now, I also used the occasion to start addressing another code organization annoyance: master.lib's single master.h header file.

So, time to start a new master.hpp header that would contain just the declarations from master.h that PC-98 Touhou actually needs, plus some semantic (yes, semantic) sugar. Comparing just the old master.h to just the new master.hpp after roughly 60% of the transition has been completed, we get median build times of 319 ms for master.h, and 144 ms for master.hpp on my (admittedly rather slow) DOSBox setup. Nice!
As of this push, ReC98 consists of 107 translation units that have to be compiled with Turbo C++ 4.0J. Fully rebuilding all of these currently takes roughly 37.5 seconds in DOSBox. After the transition to master.hpp is done, we could therefore shave some 10 to 15 seconds off this time, simply by switching header files. And that's just the beginning, as this will also pave the way for further #include optimizations. Life in this codebase will be great!


Unfortunately, there wasn't enough time to repay some of the actual technical debt I was looking forward to, after all of this. Oh well, at least we now also have nice identifiers for the three different boldface options that are used when rendering text to VRAM, after procrastinating that issue for almost 11 months. Next up, assuming the existing subscriptions: More ridiculous decompilations of things that definitely weren't originally written in C, and a big blocker in TH03's MAIN.EXE.