⮜ Blog

⮜ List of tags

Showing all posts tagged
and

📝 Posted:
🚚 Summary of:
P0264, P0265
Commits:
46cd6e7...78728f6, 78728f6...ff19bed
💰 Funded by:
Blue Bolt, [Anonymous], iruleatgames
🏷 Tags:

Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.

But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!

  1. The Music Room background masking effect
  2. The GRCG's plane disabling flags
  3. Text color restrictions
  4. The entire messy rest of the Music Room code
  5. TH04's partially consistent congratulation picture on Easy Mode
  6. TH02's boss position and damage variables

As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.

The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:

TH02's Music Room background in its on-disk state TH03's Music Room background in its on-disk state TH04's Music Room background in its on-disk state TH05's Music Room background in its on-disk state
Let's call this "the blank image".

That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right? :thonk:
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:

TH02's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH03's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH04's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH05's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom
The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.

On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.


Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:

// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);

// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);

// Regular GRCG blitting code to any VRAM segment…
pokeb(0xA8000, offset, …);

// We're done, turn off the GRCG.
outportb(0x7C, 0x00);

This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck.

Three overlapping Music Room polygons rendered using master.lib's grcg_polygon_c() function with a disabled GRCGThree overlapping Music Room polygons rendered as in the original game, with the GRCG enabled
An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.

As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:

The contents of nopoly_B with each game's first track selected.

Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:

And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier… :onricdennat:


For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, Try to Normal Rank!! could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".

With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌


The remaining 1/6th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.

Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there! Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.

Screenshot of TH02's Five Magic Stones, with the first two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the second two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the last one (both internally and in the order you fight them in) alive and activated

This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!

These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just 1/6th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.

Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.

📝 Posted:
🚚 Summary of:
P0198, P0199, P0200
Commits:
48db0b7...440637e, 440637e...5af2048, 5af2048...67e46b5
💰 Funded by:
Ember2528, Lmocinemod, Yanga
🏷 Tags:

What's this? A simple, straightforward, easy-to-decompile TH01 boss with just a few minor quirks and only two rendering-related ZUN bugs? Yup, 2½ pushes, and Kikuri was done. Let's get right into the overview:

So yeah, there's your new timeout challenge. :godzun:


The few issues in this fight all relate to hitboxes, starting with the main one of Kikuri against the Orb. The coordinates in the code clearly describe a hitbox in the upper center of the disc, but then ZUN wrote a < sign instead of a > sign, resulting in an in-game hitbox that's not quite where it was intended to be…

Kikuri's actual hitbox. Since the Orb sprite doesn't change its shape, we can visualize the hitbox in a pixel-perfect way here. The Orb must be completely within the red area for a hit to be registered.
TODO TH01 Kikuri's intended hitboxTH01 Kikuri's actual hitbox

Much worse, however, are the teardrop ripples. It already starts with their rendering routine, which places the sprites from TAMAYEN.PTN at byte-aligned VRAM positions in the ultimate piece of if(…) {…} else if(…) {…} else if(…) {…} meme code. Rather than tracking the position of each of the five ripple sprites, ZUN suddenly went purely functional and manually hardcoded the exact rendering and collision detection calls for each frame of the animation, based on nothing but its total frame counter. :zunpet:
Each of the (up to) 5 columns is also unblitted and blitted individually before moving to the next column, starting at the center and then symmetrically moving out to the left and right edges. This wouldn't be a problem if ZUN's EGC-powered unblitting function didn't word-align its X coordinates to a 16×1 grid. If the ripple sprites happen to start at an odd VRAM byte position, their unblitting coordinates get rounded both down and up to the nearest 16 pixels, thus touching the adjacent 8 pixels of the previously blitted columns and leaving the well-known black vertical bars in their place. :tannedcirno:

OK, so where's the hitbox issue here? If you just look at the raw calculation, it's a slightly confusingly expressed, but perfectly logical 17 pixels. But this is where byte-aligned blitting has a direct effect on gameplay: These ripples can be spawned at any arbitrary, non-byte-aligned VRAM position, and collisions are calculated relative to this internal position. Therefore, the actual hitbox is shifted up to 7 pixels to the right, compared to where you would expect it from a ripple sprite's on-screen position:

Due to the deterministic nature of this part of the fight, it's always 5 pixels for this first set of ripples. These visualizations are obviously not pixel-perfect due to the different potential shapes of Reimu's sprite, so they instead relate to her 32×32 bounding box, which needs to be entirely inside the red area.

We've previously seen the same issue with the 📝 shot hitbox of Elis' bat form, where pixel-perfect collision detection against a byte-aligned sprite was merely a sidenote compared to the more serious X=Y coordinate bug. So why do I elevate it to bug status here? Because it directly affects dodging: Reimu's regular movement speed is 4 pixels per frame, and with the internal position of an on-screen ripple sprite varying by up to 7 pixels, any micrododging (or "grazing") attempt turns into a coin flip. It's sort of mitigated by the fact that Reimu is also only ever rendered at byte-aligned VRAM positions, but I wouldn't say that these two bugs cancel out each other.
Oh well, another set of rendering issues to be fixed in the hypothetical Anniversary Edition – obviously, the hitboxes should remain unchanged. Until then, you can always memorize the exact internal positions. The sequence of teardrop spawn points is completely deterministic and only controlled by the fixed per-difficulty spawn interval.


Aside from more minor coordinate inaccuracies, there's not much of interest in the rest of the pattern code. In another parallel to Elis though, the first soul pattern in phase 4 is aimed on every difficulty except Lunatic, where the pellets are once again statically fired downwards. This time, however, the pattern's difficulty is much more appropriately distributed across the four levels, with the simultaneous spinning circle pellets adding a constant aimed component to every difficulty level.

Kikuri's phase 4 patterns, on every difficulty.


That brings us to 5 fully decompiled PC-98 Touhou bosses, with 26 remaining… and another ½ of a push going to the cutscene code in FUUIN.EXE.
You wouldn't expect something as mundane as the boss slideshow code to contain anything interesting, but there is in fact a slight bit of speculation fuel there. The text typing functions take explicit string lengths, which precisely match the corresponding strings… for the most part. For the "Gatekeeper 'SinGyoku'" string though, ZUN passed 23 characters, not 22. Could that have been the "h" from the Hepburn romanization of 神玉?!
Also, come on, if this text is already blitted to VRAM for no reason, you could have gone for perfect centering at unaligned byte positions; the rendering function would have perfectly supported it. Instead, the X coordinates are still rounded up to the nearest byte.

The hardcoded ending cutscene functions should be even less interesting – don't they just show a bunch of images followed by frame delays? Until they don't, and we reach the 地獄/Jigoku Bad Ending with its special shake/"boom" effect, and this picture:

Picture #2 from ED2A.GRP.

Which is rendered by the following code:

for(int i = 0; i <= boom_duration; i++) { // (yes, off-by-one)
	if((i & 3) == 0) {
		graph_scrollup(8);
	} else {
		graph_scrollup(0);
	}

	end_pic_show(1); // ← different picture is rendered
	frame_delay(2);  // ← blocks until 2 VSync interrupts have occurred

	if(i & 1) {
		end_pic_show(2); // ← picture above is rendered
	} else {
		end_pic_show(1);
	}
}

Notice something? You should never see this picture because it's immediately overwritten before the frame is supposed to end. And yet it's clearly flickering up for about one frame with common emulation settings as well as on my real PC-9821 Nw133, clocked at 133 MHz. master.lib's graph_scrollup() doesn't block until VSync either, and removing these calls doesn't change anything about the blitted images. end_pic_show() uses the EGC to blit the given 320×200 quarter of VRAM from page 1 to the visible page 0, so the bottleneck shouldn't be there either…

…or should it? After setting it up via a few I/O port writes, the common method of EGC-powered blitting works like this:

  1. Read 16 bits from the source VRAM position on any single bitplane. This fills the EGC's 4 16-bit tile registers with the VRAM contents at that specific position on every bitplane. You do not care about the value the CPU returns from the read – in optimized code, you would make sure to just read into a register to avoid useless additional stores into local variables.
  2. Write any 16 bits to the target VRAM position on any single bitplane. This copies the contents of the EGC's tile registers to that specific position on every bitplane.

To transfer pixels from one VRAM page to another, you insert an additional write to I/O port 0xA6 before 1) and 2) to set your source and destination page… and that's where we find the bottleneck. Taking a look at the i486 CPU and its cycle counts, a single one of these page switches costs 17 cycles – 1 for MOVing the page number into AL, and 16 for the OUT instruction itself. Therefore, the 8,000 page switches required for EGC-copying a 320×200-pixel image require 136,000 cycles in total.

And that's the optimal case of using only those two instructions. 📝 As I implied last time, TH01 uses a function call for VRAM page switches, complete with creating and destroying a useless stack frame and unnecessarily updating a global variable in main memory. I tried optimizing ZUN's code by throwing out unnecessary code and using 📝 pseudo-registers to generate probably optimal assembly code, and that did speed up the blitting to almost exactly 50% of the original version's run time. However, it did little about the flickering itself. Here's a comparison of the first loop with boom_duration = 16, recorded in DOSBox-X with cputype=auto and cycles=max, and with i overlaid using the text chip. Caution, flashing lights:

The original animation, completing in 50 frames instead of the expected 34, thanks to slow blitting. Combined with the lack of double-buffering, this results in noticeable tearing as the screen refreshes while blitting is still in progress. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic #1.)
This optimized version completes in the expected 34 frames. No tearing happens to be visible in this recording, but the ドカーン image is still visible on every second loop iteration. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic #1.)

I pushed the optimized code to the th01_end_pic_optimize branch, to also serve as an example of how to get close to optimal code out of Turbo C++ 4.0J without writing a single ASM instruction.
And if you really want to use the EGC for this, that's the best you can do. It really sucks that it merely expanded the GRCG's 4×8-bit tile register to 4×16 bits. With 32 bits, ≥386 CPUs could have taken advantage of their wider registers and instructions to double the blitting performance. Instead, we now know the reason why 📝 Promisence Soft's EGC-powered sprite driver that ZUN later stole for TH03 is called SPRITE16 and not SPRITE32. What a massive disappointment.

But what's perhaps a bigger surprise: Blitting planar images from main memory is much faster than EGC-powered inter-page VRAM copies, despite the required manual access to all 4 bitplanes. In fact, the blitting functions for the .CDG/.CD2 format, used from TH03 onwards, would later demonstrate the optimal method of using REP MOVSD for blitting every line in 32-pixel chunks. If that was also used for these ending images, the core blitting operation would have taken ((12 + (3 × (320 / 32))) × 200 × 4) = 33,600 cycles, with not much more overhead for the surrounding row and bitplane loops. Sure, this doesn't factor in the whole infamous issue of VRAM being slow on PC-98, but the aforementioned 136,000 cycles don't even include any actual blitting either. And as you move up to later PC-98 models with Pentium CPUs, the gap between OUT and REP MOVSD only becomes larger. (Note that the page I linked above has a typo in the cycle count of REP MOVSD on Pentium CPUs: According to the original Intel Architecture and Programming Manual, it's 13+𝑛, not 3+𝑛.)
This difference explains why later games rarely use EGC-"accelerated" inter-page VRAM copies, and keep all of their larger images in main memory. It especially explains why TH04 and TH05 can get away with naively redrawing boss backdrop images on every frame.

In the end, the whole fact that ZUN did not define how long this image should be visible is enough for me to increment the game's overall bug counter. Who would have thought that looking at endings of all things would teach us a PC-98 performance lesson… Sure, optimizing TH01 already seemed promising just by looking at its bloated code, but I had no idea that its performance issues extended so far past that level.

That only leaves the common beginning part of all endings and a short main() function before we're done with FUUIN.EXE, and 98 functions until all of TH01 is decompiled! Next up: SinGyoku, who not only is the quickest boss to defeat in-game, but also comes with the least amount of code. See you very soon!