ReC98

⮜ Blog

⮜ List of tags

Showing all posts tagged
and

📝 Posted:: 2023-05-01 23:56 UTC
🚚 Summary of:: P0238, P0239
⌨ Commits:: (Website) 4698397...edf2926, c5e51e6...P0239
💰 Funded by:: Ember2528
🏷 Tags:

Stripe is now properly integrated into this website as an alternative to PayPal! Now, you can also financially support the project if PayPal doesn't work for you, or if you prefer using a provider out of Stripe's greater variety. It's unfortunate that I had to ship this integration while the store is still sold out, but the Shuusou Gyoku OpenGL backend has turned out way too complicated to be finished next to these two pushes within a month. It will take quite a while until the store reopens and you all can start using Stripe, so I'll just link back to this blog post when it happens.

Integrating Stripe wasn't the simplest task in the world either. At first, the Checkout API seems pretty friendly to developers: The entire payment flow is handled on the backend, in the server language of your choice, and requires no frontend JavaScript except for the UI feedback code you choose to write. Your backend API endpoint initiates the Stripe Checkout session, answers with a redirect to Stripe, and Stripe then sends a redirect back to your server if the customer completed the payment. Superficially, this server-based approach seems much more GDPR-friendly than PayPal, because there are no remote scripts to obtain consent for. In reality though, Stripe shares much more potential personal data about your credit card or bank account with a merchant, compared to PayPal's almost bare minimum of necessary data.
It's also rather annoying how the backend has to persist the order form information throughout the entire Checkout session, because it would otherwise be lost if the server restarts while a customer is still busy entering data into Stripe's Checkout form. Compare that to the PayPal JavaScript SDK, which only POSTs back to your server after the customer completed a payment. In Stripe's case, more JavaScript actually only makes the integration harder: If you trigger the initial payment HTTP request from JavaScript, you will have to improvise a bit to avoid the CORS error when redirecting away to a different domain.

But sure, it's all not too bad… for regular orders at least. With subscriptions, however, things get much worse. Unlike PayPal, Stripe kind of wants to stay out of the way of the payment process as much as possible, and just be a wrapper around its supported payment methods. So if customers aren't really meant to register with Stripe, how would they cancel their subscriptions?
Answer: Through the… merchant? Which I quite dislike in principle, because why should you have to trust me to actually cancel your subscription after you requested it? It also means that I probably should add some sort of UI for self-canceling a Stripe subscription, ideally without adding full-blown user accounts. Not that this solves the underlying trust issue, but it's more convenient than contacting me via email or, worse, going through your bank somehow. Here is how my solution works:

When setting up a Stripe subscription, the server will generate a random ID for authentication. This ID is then used as a salt for a hash of the Stripe subscription ID, linking the two without storing the latter on my server.
The thank you page, which is parameterized with the Stripe Checkout session ID, will use that ID to retrieve the subscription ID via an API call to Stripe, and display it together with the above salt. This works indefinitely – contrary to what the expiry field in the Checkout session object suggests, Stripe sessions are indeed stored forever. After all, Stripe also displays this session information in a merchant's transaction log with an excessive amount of detail. It might have been better to add my own expiration system to these pages, but this had been taking long enough already. For now, be aware that sharing the link to a Stripe thank you page is equivalent to sharing your subscription cancellation password.
The salt is then used as the key for a subscription management page. To cancel, you visit this page and enter the Stripe subscription ID to confirm. The server then checks whether the salt and subscription ID pair belong to each other, and sends the actual cancellation request back to Stripe if they do.

I might have gone a bit overboard with the crypto there, but I liked the idea of not storing any of the Stripe session IDs in the server database. It's not like that makes the system more complex anyway, and it's nice to have a separate confirmation step before canceling a subscription.

But even that wasn't everything I had to keep in mind here. Once you switch from test to production mode for the final tests, you'll notice that certain SEPA-based payment providers take their sweet time to process and activate new subscriptions. The Checkout session object even informs you about that, by including a payment status field. Which initially seems just like another field that could indicate hacking attempts, but treating it as such and rejecting any unpaid session can also reject perfectly valid subscriptions. I don't want all this control… 🥲
Instead, all I can do in this case is to tell you about it. In my test, the Stripe dashboard said that it might take days or even weeks for the initial subscription transaction to be confirmed. In such a case, the respective fraction of the cap will unfortunately need to remain red for that entire time.

And that was 1½ pushes just to replicate the basic functionality of a simple PayPal integration with the simplest type of Stripe integration. On the architectural site, all the necessary refactoring work made me finally upgrade my frontend code to TypeScript at least, using the amazing esbuild to handle transpilation inside the server binary. Let's see how long it will now take for me to upgrade to SCSS…

With the new payment options, it makes sense to go for another slight price increase, from up to per push. The amount of taxes I have to pay on this income is slowly becoming significant, and the store has been selling out almost immediately for the last few months anyway. If demand remains at the current level or even increases, I plan to gradually go up to by the end of the year.
📝 As 📝 usual, I'm going to deliver existing orders in the backlog at the value they were originally purchased at. Due to the way the cap has to be calculated, these contributions now appear to have increased in value by a rather awkward 13.33%.

This left ½ of a push for some more work on the TH01 Anniversary Edition. Unfortunately, this was too little time for the grand issue of removing byte-aligned rendering of bigger sprites, which will need some additional blitting performance research. Instead, I went for a bunch of smaller bugfixes:

ANNIV.EXE now launches ZUNSOFT.COM if MDRV98 wasn't resident before. In hindsight, it's completely obvious why this is the right thing to do: Either you start ANNIV.EXE directly, in which case there's no resident MDRV98 and you haven't seen the ZUN Soft logo, or you have made a single-line edit to GAME.BAT and replaced op with anniv, in which case MDRV98 is resident and you have seen the logo. These are the two reasonable cases to support out of the box. If you are doing anything else, it shouldn't be that hard to adjust though?
You might be wondering why I didn't just include all code of ZUNSOFT.COM inside ANNIV.EXE together with the rest of the game. The reason: ZUNSOFT.COM has almost nothing in common with regular TH01 code. While the rest of TH01 uses the custom image formats and bad rendering code I documented again and again during its RE process, ZUNSOFT.COM fully relies on master.lib for everything about the bouncing-ball logo animation. Its code is much closer to TH02 in that respect, which suggests that ZUN did in fact write this animation for TH02, and just included the binary in TH01 for consistency when he first sold both games together at Comiket 52. Unlike the 📝 various bad reasons for splitting the PC-98 Touhou games into three main executables, it's still a good idea to split off animations that use a completely different set of rendering and file format functions. Combined with all the BFNT and shape rendering code, ZUNSOFT.COM actually contains even more unique code than OP.EXE, and only slightly less than FUUIN.EXE.
The optional AUTOEXEC.BAT is now correctly encoded in Shift-JIS instead of accidentally being UTF-8, fixing the previous mojibake in its final ECHO line.
The command-line option that just adds a stage selection without other debug features (anniv s) now works reliably.
This one's quite interesting because it only ever worked because of a ZUN bug. From a superficial look at the code, it shouldn't: While the presence of an 's' branch proves that ZUN had such a mode during development, he nevertheless forgot to initialize the debug flag inside the resident structure within this branch. This mode only ever worked because master.lib's resdata_create() function doesn't clear the resident structure after allocation. If anything on the system previously happened to write something other than 0x00, 0x01, or 0x03 to the specific byte that then gets repurposed as the debug mode flag, this lack of initialization does in fact result in a distinct non-test and non-debug stage selection mode.
This is what happens on a certain widely circulated .HDI copy of TH01 that boots MS-DOS 3.30C. On this system, the memory that master.lib will allocate to the TH01 resident structure was previously used by DOS as stack for its kernel, which left the future resident debug flag byte at address 9FF6:0012 at a value of 0x12. This might be the entire reason why game s is even widely documented to trigger a stage selection to begin with – on the widely circulated TH04 .HDI that boots MS-DOS 6.20, or on DOSBox-X, the s parameter doesn't work because both DOS systems leave the resident debug flag byte at 0x00. And since ANNIV.EXE pushes MDRV98 into that area of conventional DOS RAM, anniv s previously didn't work even on MS-DOS 3.30C.
Both bugs in the 📝 1×1 particle system during the Mima fight have been fixed. These include the off-by-one error that killed off the very first particle on the 80^th frame and left it in VRAM, and, just like every other entity type, a replacement of ZUN's EGC unblitter with the new pixel-perfect and fast one. Until I've rearchitected unblitting as a whole, the particles will now merely rip barely visible 1×1 holes into the sprites they overlap.
The 📝 score popups for flipped cards are now displayed without the two frames of flicker.
The bomb value shown in the lowest line of the in-game debug mode output is now right-aligned together with the rest of the values. This ensures that the game always writes a consistent number of characters to TRAM, regardless of the magnitude of the bomb value, preventing the seemingly wrong timer values that appeared in the original game whenever the value of the bomb variable changed to a lower number of digits:
Finally, I've streamlined VRAM page access changes, which allowed me to consistently replace ZUN's expensive function call with the optimal two inlined x86 instructions. Interestingly, this change alone removed 2 KiB from the binary size, which is almost all of the difference between 📝 the P0234-1 release and this one. Let's see how much longer we can make each new release of ANNIV.EXE smaller than the previous one.

The final point, however, raised the question of what we're now going to do about 📝 a certain issue in the 地獄/Jigoku Bad Ending. ZUN's original expensive way of switching the accessed VRAM page was the main reason behind the lag frames on slower PC-98 systems, and search-replacing the respective function calls would immediately get us to the optimized version shown in that blog post. But is this something we actually want? If we wanted to retain the lag, we could surely preserve that function just for this one instance…
The discovery of this issue predates the clear distinction between bloat, quirks, and bugs, so it makes sense to first classify what this issue even is. The distinction comes all down to observability, which I defined as changes to rendered frames between explicitly defined frame boundaries. That alone would be enough to categorize any cause behind lag frames as bloat, but it can't hurt to be more explicit here.

Therefore, I now officially judge observability in terms of an infinitely fast PC-98 that can instantly render everything between two explicitly defined frames, and will never add additional lag frames. If we plan to port the games to faster architectures that aren't bottlenecked by disappointing blitter chips, this is the only reasonable assumption to make, in my opinion: The minimum system requirements in the games' README files are minimums, after all, not recommendations. Chasing the exact frame drop behavior that ZUN must have experienced during the time he developed these games can only be a guessing game at best, because how can we know which PC-98 model ZUN actually developed the games on? There might even be more than one model, especially when it comes to TH01 which had been in development for at least two years before ZUN first sold it. It's also not like any current PC-98 emulator even claims to emulate the specific timing of any existing model, and I sure hope that nobody expects me to import a bunch of bulky obsolete hardware just to count dropped frames.

That leaves the tearing, where it's much more obvious how it's a bug. On an infinitely fast PC-98, the ドカーン frame would never be visible, and thus falls into the same category as the 📝 two unused animations in the Sariel fight. With only a single unconditional 2-frame delay inside the animation loop, it becomes clear that ZUN intended both frames of the animation to be displayed for 2 frames each:

No tearing, and 34 frames in total for the first of the two instances of this animation.

TH01 Anniversary Edition, version P0239 2023-05-01-th01-anniv.zip

Next up: Taking the oldest still undelivered push and working towards TH04 position independence in preparation for multilingual translations. The Shuusou Gyoku OpenGL backend shouldn't take that much longer either, so I should have lots of stuff coming up in May afterward.

📝 Posted:: 2022-08-11 21:27 UTC
🚚 Summary of:: P0212, P0213
⌨ Commits:: d398a94...363fd54, 363fd54...158a91e
💰 Funded by:: LeyDud, Lmocinemod, GhostRiderCog, Ember2528
🏷 Tags:

Wow, it's been 3 days and I'm already back with an unexpectedly long post about TH01's bonus point screens? 3 days used to take much longer in my previous projects…

Before I talk about graphics for the rest of this post, let's start with the exact calculations for both bonuses. Touhou Wiki already got these right, but it still makes sense to provide them here, in a format that allows you to cross-reference them with the source code more easily. For the card-flipping stage bonus:

Time	`min((`Stage timer `* 3), 6553)`
Continuous	`min((`Highest card combo `* 100), 6553)`
Bomb&Player	`min(((`Lives `* 200) + (`Bombs `* 100)), 6553)`
STAGE	`min((`(Stage number `- 1) * 200), 6553)`
BONUS Point	Sum of all above values * 10

The boss stage bonus is calculated from the exact same metrics, despite half of them being labeled differently. The only actual differences are in the higher multipliers and in the cap for the stage number bonus. Why remove it if raising it high enough also effectively disables it?

Time	`min((`Stage timer `* 5), 6553)`
Continuous	`min((`Highest card combo `* 200), 6553)`
MIKOsan	`min(((`Lives `* 500) + (`Bombs `* 200)), 6553)`
Clear	`min((`Stage number `* 1000), 65530)`
TOTLE	Sum of all above values * 10

The transition between the gameplay and TOTLE screens is one of the more impressive effects showcased in this game, especially due to how wavy it often tends to look. Aside from the palette interpolation (which is, by the way, the first time ZUN wrote a correct interpolation algorithm between two 4-bit palettes), the core of the effect is quite simple. With the TOTLE image blitted to VRAM page 1:

Shift the contents of a line on VRAM page 0 by 32 pixels, alternating the shift direction between right edge → left edge (even Y values) and the other way round (odd Y values)
Keep a cursor for the destination pixels on VRAM page 1 for every line, starting at the respective opposite edge
Blit the 32 pixels at the VRAM page 1 cursor to the newly freed 32 pixels on VRAM page 0, and advance the cursor towards the other edge
Successive line shifts will then include these newly blitted 32 pixels as well
Repeat (640 / 32) = 20 times, after which all new pixels will be in their intended place

So it's really more like two interlaced shift effects with opposite directions, starting on different scanlines. No trigonometry involved at all.

Horizontally scrolling pixels on a single VRAM page remains one of the few 📝 appropriate uses of the EGC in a fullscreen 640×400 PC-98 game, regardless of the copied block size. The few inter-page copies in this effect are also reasonable: With 8 new lines starting on each effect frame, up to (8 × 20) = 160 lines are transferred at any given time, resulting in a maximum of (160 × 2 × 2) = 640 VRAM page switches per frame for the newly transferred pixels. Not that frame rate matters in this situation to begin with though, as the game is doing nothing else while playing this effect.
What does sort of matter: Why 32 pixels every 2 frames, instead of 16 pixels on every frame? There's no performance difference between doing one half of the work in one frame, or two halves of the work in two frames. It's not like the overhead of another loop has a serious impact here, especially with the PC-98 VRAM being said to have rather high latencies. 32 pixels over 2 frames is also harder to code, so ZUN must have done it on purpose. Guess he really wanted to go for that 📽 cinematic 30 FPS look 📽 here…

Removing the palette interpolation and transitioning from a black screen to CLEAR3.GRP makes it a lot clearer how the effect works.

Once all the metrics have been calculated, ZUN animates each value with a rather fancy left-to-right typing effect. As 16×16 images that use a single bright-red color, these numbers would be perfect candidates for gaiji… except that ZUN wanted to render them at the more natural Y positions of the labels inside CLEAR3.GRP that are far from aligned to the 8×16 text RAM grid. Not having been in the mood for hardcoding another set of monochrome sprites as C arrays that day, ZUN made the still reasonable choice of storing the image data for these numbers in the single-color .GRC form– yeah, no, of course he once again chose the .PTN hammer, and its 📝 16×16 "quarter" wrapper functions around nominal 32×32 sprites.

.PTN sprite for the TOTLE metric digits of 0, 1, 2, and 3 — The three 32×32 TOTLE metric digit sprites inside `NUMB.PTN`.

.PTN sprite for the TOTLE metric digits of 4, 5, 6, and 7 — The three 32×32 TOTLE metric digit sprites inside `NUMB.PTN`.

Why do I bring up such a detail? What's actually going on there is that ZUN loops through and blits each digit from 0 to 9, and then continues the loop with "digit" numbers from 10 to 19, stopping before the number whose ones digit equals the one that should stay on screen. No problem with that in theory, and the .PTN sprite selection is correct… but the .PTN quarter selection isn't, as ZUN wrote (digit % 4) instead of the correct ((digit % 10) % 4). Since .PTN quarters are indexed in a row-major way, the 10-19 part of the loop thus ends up blitting 2 → 3 → 0 → 1 → 6 → 7 → 4 → 5 → (nothing):

This footage was slowed down to show one sprite blitting operation per frame. The actual game waits a hardcoded 4 milliseconds between each sprite, so even theoretically, you would only see roughly every 4^th digit. And yes, we can also observe the empty quarter here, only blitted if one of the digits is a 9.

Seriously though? If the deadline is looming and you've got to rush some part of your game, a standalone screen that doesn't affect anything is the best place to pick. At 4 milliseconds per digit, the animation goes by so fast that this quirk might even add to its perceived fanciness. It's exactly the reason why I've always been rather careful with labeling such quirks as "bugs". And in the end, the code does perform one more blitting call after the loop to make sure that the correct digit remains on screen.

The remaining ¾ of the second push went towards transferring the final data definitions from ASM to C land. Most of the details there paint a rather depressing picture about ZUN's original code layout and the bloat that came with it, but it did end on a real highlight. There was some unused data between ZUN's non-master.lib VSync and text RAM code that I just moved away in September 2015 without taking a closer look at it. Those bytes kind of look like another hardcoded 1bpp image though… wait, what?!

An unused mouse cursor sprite found in all of TH01's binaries

Lovely! With no mouse-related code left in the game otherwise, this cursor sprite provides some great fuel for wild fan theories about TH01's development history:

Could ZUN have 📝 stolen the basic PC-98 VSync or text RAM function code from a source that also implemented mouse support?
Did he have a mouse-controlled level editor during development? It's highly likely that he had something, given all the 📝 bit twiddling seen in the STAGE?.DAT format.
Or was this game actually meant to have mouse-controllable portions at some point during development? Even if it would have just been the menus.

… Actually, you know what, with all shared data moved to C land, I might as well finish FUUIN.EXE right now. The last secret hidden in its main() function: Just like GAME.BAT supports launching the game in various debug modes from the DOS command line, FUUIN.EXE can directly launch one of the game's endings. As long as the MDRV2 driver is installed, you can enter fuuin t1 for the 魔界/Makai Good Ending, or fuuin t for 地獄/Jigoku Good Ending.
Unfortunately, the command-line parameter can only control the route. Choosing between a Good or Bad Ending is still done exclusively through TH01's resident structure, and the continues_per_scene array in particular. But if you pre-allocate that structure somehow and set one of the members to a nonzero value, it would work. Trainers, anyone?

Alright, gotta get back to the code if I want to have any chance of finishing this game before the 15^th… Next up: The final 17 functions in REIIDEN.EXE that tie everything together and add some more debug features on top.

📝 Posted:: 2022-06-17 17:03 UTC
🚚 Summary of:: P0198, P0199, P0200
⌨ Commits:: 48db0b7...440637e, 440637e...5af2048, 5af2048...67e46b5
💰 Funded by:: Ember2528, Lmocinemod, Yanga
🏷 Tags:

What's this? A simple, straightforward, easy-to-decompile TH01 boss with just a few minor quirks and only two rendering-related ZUN bugs? Yup, 2½ pushes, and Kikuri was done. Let's get right into the overview:

Just like 📝 Elis, Kikuri's fight consists of 5 phases, excluding the entrance animation. For some reason though, they are numbered from 2 to 6 this time, skipping phase 1? For consistency, I'll use the original phase numbers from the source code in this blog post.
The main phases (2, 5, and 6) also share Elis' HP boundaries of 10, 6, and 0, respectively, and are once again indicated by different colors in the HP bar. They immediately end upon reaching the given number of HP, making Kikuri immune to the 📝 heap corruption in test or debug mode that can happen with Elis and Konngara.
Phase 2 solely consists of the infamous big symmetric spiral pattern.
Phase 3 fades Kikuri's ball of light from its default bluish color to bronze over 100 frames. Collision detection is deactivated during this phase.
In Phase 4, Kikuri activates her two souls while shooting the spinning 8-pellet circles from the previously activated ball. The phase ends shortly after the souls fired their third spread pellet group.
- Note that this is a timed phase without an HP boundary, which makes it possible to reduce Kikuri's HP below the boundaries of the next phases, effectively skipping them. Take this video for example, where Kikuri has 6 HP by the end of Phase 4, and therefore directly starts Phase 6.
  (Obviously, Kikuri's HP can also be reduced to 0 or below, which will end the fight immediately after this phase.)
Phase 5 combines the teardrop/ripple "pattern" from the souls with the "two crossed eye laser" pattern, on independent cycles.
Finally, Kikuri cycles through her remaining 4 patterns in Phase 6, while the souls contribute single aimed pellets every 200 frames.
Interestingly, all HP-bounded phases come with an additional hidden timeout condition:
- Phase 2 automatically ends after 6 cycles of the spiral pattern, or 5,400 frames in total.
- Phase 5 ends after 1,600 frames, or the first frame of the 7^th cycle of the two crossed red lasers.
- If you manage to keep Kikuri alive for 29 of her Phase 6 patterns, her HP are automatically set to 1. The HP bar isn't redrawn when this happens, so there is no visual indication of this timeout condition even existing – apart from the next Orb hit ending the fight regardless of the displayed HP. Due to the deterministic order of patterns, this always happens on the 8^th cycle of the "symmetric gravity pellet lines from both souls" pattern, or 11,800 frames. If dodging and avoiding orb hits for 3½ minutes sounds tiring, you can always watch the byte at DS:0x1376 in your emulator's memory viewer. Once it's at 0x1E, you've reached this timeout.

So yeah, there's your new timeout challenge.

The few issues in this fight all relate to hitboxes, starting with the main one of Kikuri against the Orb. The coordinates in the code clearly describe a hitbox in the upper center of the disc, but then ZUN wrote a < sign instead of a > sign, resulting in an in-game hitbox that's not quite where it was intended to be…

TODO — Kikuri's actual hitbox. Since the Orb sprite doesn't change its shape, we can visualize the hitbox in a pixel-perfect way here. The Orb must be completely within the red area for a hit to be registered.

TH01 Kikuri's intended hitbox — Kikuri's actual hitbox. Since the Orb sprite doesn't change its shape, we can visualize the hitbox in a pixel-perfect way here. The Orb must be completely within the red area for a hit to be registered.

Much worse, however, are the teardrop ripples. It already starts with their rendering routine, which places the sprites from TAMAYEN.PTN at byte-aligned VRAM positions in the ultimate piece of if(…) {…} else if(…) {…} else if(…) {…} meme code. Rather than tracking the position of each of the five ripple sprites, ZUN suddenly went purely functional and manually hardcoded the exact rendering and collision detection calls for each frame of the animation, based on nothing but its total frame counter.
Each of the (up to) 5 columns is also unblitted and blitted individually before moving to the next column, starting at the center and then symmetrically moving out to the left and right edges. This wouldn't be a problem if ZUN's EGC-powered unblitting function didn't word-align its X coordinates to a 16×1 grid. If the ripple sprites happen to start at an odd VRAM byte position, their unblitting coordinates get rounded both down and up to the nearest 16 pixels, thus touching the adjacent 8 pixels of the previously blitted columns and leaving the well-known black vertical bars in their place.

OK, so where's the hitbox issue here? If you just look at the raw calculation, it's a slightly confusingly expressed, but perfectly logical 17 pixels. But this is where byte-aligned blitting has a direct effect on gameplay: These ripples can be spawned at any arbitrary, non-byte-aligned VRAM position, and collisions are calculated relative to this internal position. Therefore, the actual hitbox is shifted up to 7 pixels to the right, compared to where you would expect it from a ripple sprite's on-screen position:

Due to the deterministic nature of this part of the fight, it's always 5 pixels for this first set of ripples. These visualizations are obviously not pixel-perfect due to the different potential shapes of Reimu's sprite, so they instead relate to her 32×32 bounding box, which needs to be entirely inside the red area.

We've previously seen the same issue with the 📝 shot hitbox of Elis' bat form, where pixel-perfect collision detection against a byte-aligned sprite was merely a sidenote compared to the more serious X=Y coordinate bug. So why do I elevate it to bug status here? Because it directly affects dodging: Reimu's regular movement speed is 4 pixels per frame, and with the internal position of an on-screen ripple sprite varying by up to 7 pixels, any micrododging (or "grazing") attempt turns into a coin flip. It's sort of mitigated by the fact that Reimu is also only ever rendered at byte-aligned VRAM positions, but I wouldn't say that these two bugs cancel out each other.
Oh well, another set of rendering issues to be fixed in the hypothetical Anniversary Edition – obviously, the hitboxes should remain unchanged. Until then, you can always memorize the exact internal positions. The sequence of teardrop spawn points is completely deterministic and only controlled by the fixed per-difficulty spawn interval.

Aside from more minor coordinate inaccuracies, there's not much of interest in the rest of the pattern code. In another parallel to Elis though, the first soul pattern in phase 4 is aimed on every difficulty except Lunatic, where the pellets are once again statically fired downwards. This time, however, the pattern's difficulty is much more appropriately distributed across the four levels, with the simultaneous spinning circle pellets adding a constant aimed component to every difficulty level.

Kikuri's phase 4 patterns, on every difficulty.

That brings us to 5 fully decompiled PC-98 Touhou bosses, with 26 remaining… and another ½ of a push going to the cutscene code in FUUIN.EXE.
You wouldn't expect something as mundane as the boss slideshow code to contain anything interesting, but there is in fact a slight bit of speculation fuel there. The text typing functions take explicit string lengths, which precisely match the corresponding strings… for the most part. For the "Gatekeeper 'SinGyoku'" string though, ZUN passed 23 characters, not 22. Could that have been the "h" from the Hepburn romanization of 神玉?!
Also, come on, if this text is already blitted to VRAM for no reason, you could have gone for perfect centering at unaligned byte positions; the rendering function would have perfectly supported it. Instead, the X coordinates are still rounded up to the nearest byte.

The hardcoded ending cutscene functions should be even less interesting – don't they just show a bunch of images followed by frame delays? Until they don't, and we reach the 地獄/Jigoku Bad Ending with its special shake/"boom" effect, and this picture:

Which is rendered by the following code:

for(int i = 0; i <= boom_duration; i++) { // (yes, off-by-one)
	if((i & 3) == 0) {
		graph_scrollup(8);
	} else {
		graph_scrollup(0);
	}

	end_pic_show(1); // ← different picture is rendered
	frame_delay(2);  // ← blocks until 2 VSync interrupts have occurred

	if(i & 1) {
		end_pic_show(2); // ← picture above is rendered
	} else {
		end_pic_show(1);
	}
}

Notice something? You should never see this picture because it's immediately overwritten before the frame is supposed to end. And yet it's clearly flickering up for about one frame with common emulation settings as well as on my real PC-9821 Nw133, clocked at 133 MHz. master.lib's graph_scrollup() doesn't block until VSync either, and removing these calls doesn't change anything about the blitted images. end_pic_show() uses the EGC to blit the given 320×200 quarter of VRAM from page 1 to the visible page 0, so the bottleneck shouldn't be there either…

…or should it? After setting it up via a few I/O port writes, the common method of EGC-powered blitting works like this:

Read 16 bits from the source VRAM position on any single bitplane. This fills the EGC's 4 16-bit tile registers with the VRAM contents at that specific position on every bitplane. You do not care about the value the CPU returns from the read – in optimized code, you would make sure to just read into a register to avoid useless additional stores into local variables.
Write any 16 bits to the target VRAM position on any single bitplane. This copies the contents of the EGC's tile registers to that specific position on every bitplane.

To transfer pixels from one VRAM page to another, you insert an additional write to I/O port 0xA6 before 1) and 2) to set your source and destination page… and that's where we find the bottleneck. Taking a look at the i486 CPU and its cycle counts, a single one of these page switches costs 17 cycles – 1 for MOVing the page number into AL, and 16 for the OUT instruction itself. Therefore, the 8,000 page switches required for EGC-copying a 320×200-pixel image require 136,000 cycles in total.

And that's the optimal case of using only those two instructions. 📝 As I implied last time, TH01 uses a function call for VRAM page switches, complete with creating and destroying a useless stack frame and unnecessarily updating a global variable in main memory. I tried optimizing ZUN's code by throwing out unnecessary code and using 📝 pseudo-registers to generate probably optimal assembly code, and that did speed up the blitting to almost exactly 50% of the original version's run time. However, it did little about the flickering itself. Here's a comparison of the first loop with boom_duration = 16, recorded in DOSBox-X with cputype=auto and cycles=max, and with i overlaid using the text chip. Caution, flashing lights:

The original animation, completing in 50 frames instead of the expected 34, thanks to slow blitting. Combined with the lack of double-buffering, this results in noticeable tearing as the screen refreshes while blitting is still in progress. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic #1.)

This optimized version completes in the expected 34 frames. No tearing happens to be visible in this recording, but the ドカーン image is still visible on every second loop iteration. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic #1.)

I pushed the optimized code to the th01_end_pic_optimize branch, to also serve as an example of how to get close to optimal code out of Turbo C++ 4.0J without writing a single ASM instruction.
And if you really want to use the EGC for this, that's the best you can do. It really sucks that it merely expanded the GRCG's 4×8-bit tile register to 4×16 bits. With 32 bits, ≥386 CPUs could have taken advantage of their wider registers and instructions to double the blitting performance. Instead, we now know the reason why 📝 Promisence Soft's EGC-powered sprite driver that ZUN later stole for TH03 is called SPRITE16 and not SPRITE32. What a massive disappointment.

But what's perhaps a bigger surprise: Blitting planar images from main memory is much faster than EGC-powered inter-page VRAM copies, despite the required manual access to all 4 bitplanes. In fact, the blitting functions for the .CDG/.CD2 format, used from TH03 onwards, would later demonstrate the optimal method of using REP MOVSD for blitting every line in 32-pixel chunks. If that was also used for these ending images, the core blitting operation would have taken ((12 + (3 × (320 / 32))) × 200 × 4) = 33,600 cycles, with not much more overhead for the surrounding row and bitplane loops. Sure, this doesn't factor in the whole infamous issue of VRAM being slow on PC-98, but the aforementioned 136,000 cycles don't even include any actual blitting either. And as you move up to later PC-98 models with Pentium CPUs, the gap between OUT and REP MOVSD only becomes larger. (Note that the page I linked above has a typo in the cycle count of REP MOVSD on Pentium CPUs: According to the original Intel Architecture and Programming Manual, it's 13+𝑛, not 3+𝑛.)
This difference explains why later games rarely use EGC-"accelerated" inter-page VRAM copies, and keep all of their larger images in main memory. It especially explains why TH04 and TH05 can get away with naively redrawing boss backdrop images on every frame.

In the end, the whole fact that ZUN did not define how long this image should be visible is enough for me to increment the game's overall bug counter. Who would have thought that looking at endings of all things would teach us a PC-98 performance lesson… Sure, optimizing TH01 already seemed promising just by looking at its bloated code, but I had no idea that its performance issues extended so far past that level.

That only leaves the common beginning part of all endings and a short main() function before we're done with FUUIN.EXE, and 98 functions until all of TH01 is decompiled! Next up: SinGyoku, who not only is the quickest boss to defeat in-game, but also comes with the least amount of code. See you very soon!

⮜ Blog

⮜ List of tags

Showing all posts tagged th01- and cutscene-

Showing all posts tagged
and