⮜ Blog

⮜ List of tags

Showing all posts tagged
and

📝 Posted:
🚚 Summary of:
P0245
Commits:
97f0c3b...5876755
💰 Funded by:
Blue Bolt, Ember2528, [Anonymous], Yanga
🏷 Tags:

And then, the supposed boilerplate code revealed yet another confusing issue that quickly forced me back to serial work, leading to no parallel progress made with Shuusou Gyoku after all. 🥲 The list of functions I put together for the first ½ of this push seemed so boring at first, and I was so sure that there was almost nothing I could possibly talk about:

That's three instances of ZUN removing sprites way earlier than you'd want to, intentionally deciding against those sprites flying smoothly in and out of the playfield. Clearly, there has to be a system and a reason behind it.

Turns out that it can be almost completely blamed on master.lib. None of the super_*() sprite blitting functions can clip the rendered sprite to the edges of VRAM, and much less to the custom playfield rectangle we would actually want here. This is exactly the wrong choice to make for a game engine: Not only is the game developer now stuck with either rendering the sprite in full or not at all, but they're also left with the burden of manually calculating when not to display a sprite.
However, strictly limiting the top-left screen-space coordinate to (0, 0) and the bottom-right one to (640, 400) would actually stop rendering some of the sprites much earlier than the clipping conditions we encounter in these games. So what's going on there?

The answer is a combination of playfield borders, hardware scrolling, and master.lib needing to provide at least some help to support the latter. Hardware scrolling on PC-98 works by dividing VRAM into two vertical partitions along the Y-axis and telling the GDC to display one of them at the top of the screen and the other one below. The contents of VRAM remain unmodified throughout, which raises the interesting question of how to deal with sprites that reach the vertical edges of VRAM. If the top VRAM row that starts at offset 0x0000 ends up being displayed below the bottom row of VRAM that starts at offset 0x7CB0 for 399 of the 400 possible scrolling positions, wouldn't we then need to vertically wrap most of the rendered sprites?
For this reason, master.lib provides the super_roll_*() functions, which unconditionally perform exactly this vertical wrapping. But this creates a new problem: If these functions still can't clip, and don't even know which VRAM rows currently correspond to the top and bottom row of the screen (since master.lib's graph_scrollup() function doesn't retain this information), won't we also see sprites wrapping around the actual edges of the screen? That's something we certainly wouldn't want in a vertically scrolling game…
The answer is yes, and master.lib offers no solution for this issue. But this is where the playfield borders come in, and helpfully cover 16 pixels at the top and 16 pixels at the bottom of the screen. As a result, they can hide up to 32 rows of potentially wrapped sprite pixels below them:


The earliest possible frame that TH05 can start rendering the Stage 5 midboss on. Hiding the text layer reveals how master.lib did in fact "blindly" render the top part of her sprite to the bottom of the playfield. That's where her sprite starts before it is correctly wrapped around to the top of VRAM.
If we scrolled VRAM by another 200 pixels (and faked an equally shifted TRAM for demonstration purposes), we get an equally valid game scene that points out why a vertically scrolling PC-98 game must wrap all sprites at the vertical edges of VRAM to begin with.
Also, note how the HP bar has filled up quite a bit before the midboss can actually appear on screen.
VRAM contents of the first possible frame that TH05's Stage 5 midboss can appear on, at their original scrolling position. Also featuring the 64×64 bounding box of the midboss sprite.VRAM contents of the first possible frame that TH05's Stage 5 midboss can appear on, scrolled down by a further 200 pixels. Also featuring the 64×64 bounding box of the midboss sprite.

And that's how the lowest possible top Y coordinate for sprites blitted using the master.lib super_roll_*() functions during the scrolling portions of TH02, TH04, and TH05 is not 0, but -16. Any lower, and you would actually see some of the sprite's upper pixels at the bottom of the playfield, as there are no more opaque black text cells to cover them. Theoretically, you could lower this number for some animation frames that start with multiple rows of transparent pixels, but I thankfully haven't found any instance of ZUN using such a hack. So far, at least… :godzun:
Visualized like that, it all looks quite simple and logical, but for days, I did not realize that these sprites were rendered to a scrolling VRAM. This led to a much more complicated initial explanation involving the invisible extra space of VRAM between offsets 0x7D00 and 0x7FFF that effectively grant a hidden additional 9.6 lines below the playfield. Or even above, since PC-98 hardware ignores the highest bit of any offset into a VRAM bitplane segment (& 0x7FFF), which prevents blitting operations from accidentally reaching into a different bitplane. Together with the aforementioned rows of transparent pixels at the top of these midboss sprites, the math would have almost worked out exactly. :tannedcirno:

The need for manual clipping also applies to the X-axis. Due to the lack of scrolling in this dimension, the boundaries there are much more straightforward though. The minimum left coordinate of a sprite can't fall below 0 because any smaller coordinate would wrap around into the 📝 tile source area and overwrite some of the pixels there, which we obviously don't want to re-blit every frame. Similarly, the right coordinate must not extend into the HUD, which starts at 448 pixels.
The last part might be surprising if you aren't familiar with the PC-98 text chip. Contrary to the CGA and VGA text modes of IBM-compatibles, PC-98 text cells can only use a single color for either their foreground or background, with the other pixels being transparent and always revealing the pixels in VRAM below. If you look closely at the HUD in the images above, you can see how the background of cells with gaiji glyphs is slightly brighter (◼ #100) than the opaque black cells (◼ #000) surrounding them. This rather custom color clearly implies that those pixels must have been rendered by the graphics GDC. If any other sprite was rendered below the HUD, you would equally see it below the glyphs.

So in the end, I did find the clear and logical system I was looking for, and managed to reduce the new clipping conditions down to a set of basic rules for each edge. Unfortunately, we also need a second macro for each edge to differentiate between sprites that are smaller or larger than the playfield border, which is treated as either 32×32 (for super_roll_*()) or 32×16 (for non-"rolling" super_*() functions). Since smaller sprites can be fully contained within this border, the games can stop rendering them as soon as their bottom-right coordinate is no longer seen within the playfield, by comparing against the clipping boundaries with <= and >=. For example, a 16×16 sprite would be completely invisible once it reaches (16, 0), so it would still be rendered at (17, 1). A larger sprite during the scrolling part of a stage, like, say, the 64×64 midbosses, would still be rendered if their top-left coordinate was (0, -16), so ZUN used < and > comparisons to at least get an additional pixel before having to stop rendering such a sprite. Turbo C++ 4.0J sadly can't constant-fold away such a difference in comparison operators.

And for the most part, ZUN did follow this system consistently. Except for, of course, the typical mistakes you make when faced with such manual decisions, like how he treated TH04's Stage 4 midboss as a "small" sprite below 32×32 pixels (it's 64×64), losing that precious one extra pixel. Or how the entire rendering code for the 48×48 boss explosion sprite pretends that it's actually 64×64 pixels large, which causes even the initial transformation into screen space to be misaligned from the get-go. :zunpet: But these are additional bugs on top of the single one that led to all this research.
Because that's what this is, a bug. 🐞 Every resulting pixel boundary is a systematic result of master.lib's unfortunate lack of clipping. It's as much of a bug as TH01's byte-aligned rendering of entities whose internal position is not byte-aligned. In both cases, the entities are alive, simulated, and partake in collision detection, but their rendered appearance doesn't accurately reflect their internal position.
Initially, I classified 📝 the sudden pop-in of TH05's Stage 5 midboss as a quirk because we had no conclusive evidence that this wasn't intentional, but now we do. There have been multiple explanations for why ZUN put borders around the playfield, but master.lib's lack of sprite clipping might be the biggest reason.

And just like byte-aligned rendering, the clipping conditions can easily be removed when porting the game away from PC-98 hardware. That's also what uth05win chose to do: By using OpenGL and not having to rely on hardware scrolling, it can simply place every sprite as a textured quad at its exact position in screen space, and then draw the black playfield borders on top in the end to clip everything in a single draw call. This way, the Stage 5 midboss can smoothly fly into the playfield, just as defined by its movement code:

The entire smooth Stage 5 midboss entrance animation as shown in uth05win. If the simultaneous appearance of the Enemy!! label doesn't lend further proof to this having been ZUN's actual intention, I don't know what will.

Meanwhile, I designed the interface of the 📝 generic blitter used in the TH01 Anniversary Edition entirely around clipping the blitted sprite at any explicit combination of VRAM edges. This was nothing I tacked on in the end, but a core aspect that informed the architecture of the code from the very beginning. You really want to have one and only one place where sprite clipping is done right – and only once per sprite, regardless of how many bitplanes you want to write to.


Which brings us to the goal that the final ¼ of this push went toward. I thought I was going to start cleaning up the 📝 player movement and rendering code, but that turned out too complicated for that amount of time – especially if you want to start with just cleanup, preserving all original bugs for the time being.
Fixing and smoothening player and Orb movement would be the next big task in Anniversary Edition development, needing about 3 pushes. It would start with more performance research into runtime-shifting of larger sprites, followed by extending my generic blitter according to the results, writing new optimized loaders for the original image formats, and finally rewriting all rendering code accordingly. With that code in place, we can then start cleaning up and fixing the unique code for each boss, one by one.

Until that's funded, the code still contains a few smaller and easier pieces of code that are equally related to rendering bugs, but could be dealt with in a more incremental way. Line rendering is one of those, and first needs some refactoring of every call site, including 📝 the rotating squares around Mima and 📝 YuugenMagan's pentagram. So far, I managed to remove another 1,360 bytes from the binary within this final ¼ of a push, but there's still quite a bit to do in that regard.
This is the perfect kind of feature for smaller (micro-)transactions. Which means that we've now got meaningful TH01 code cleanup and Anniversary Edition subtasks at every price range, no matter whether you want to invest a lot or just a little into this goal.

If you can, because Ember2528 revealed the plan behind his Shuusou Gyoku contributions: A full-on Linux port of the game, which will be receiving all the funding it needs to happen. 🐧 Next up, therefore: Turning this into my main project within ReC98 for the next couple of months, and getting started by shipping the long-awaited first step towards that goal.
I've raised the cap to avoid the potential of rounding errors, which might prevent the last needed Shuusou Gyoku push from being correctly funded. I already had to pick the larger one of the two pending TH02 transactions for this push, because we would have mathematically ended up 1/25500 short of a full push with the smaller transaction. :onricdennat: And if I'm already at it, I might as well free up enough capacity to potentially ship the complete OpenGL backend in a single delivery, which is currently estimated to cost 7 pushes in total.

📝 Posted:
🚚 Summary of:
P0238, P0239
Commits:
(Website) 4698397...edf2926, c5e51e6...P0239
💰 Funded by:
Ember2528
🏷 Tags:

:stripe: Stripe is now properly integrated into this website as an alternative to PayPal! Now, you can also financially support the project if PayPal doesn't work for you, or if you prefer using a provider out of Stripe's greater variety. It's unfortunate that I had to ship this integration while the store is still sold out, but the Shuusou Gyoku OpenGL backend has turned out way too complicated to be finished next to these two pushes within a month. It will take quite a while until the store reopens and you all can start using Stripe, so I'll just link back to this blog post when it happens.

Integrating Stripe wasn't the simplest task in the world either. At first, the Checkout API seems pretty friendly to developers: The entire payment flow is handled on the backend, in the server language of your choice, and requires no frontend JavaScript except for the UI feedback code you choose to write. Your backend API endpoint initiates the Stripe Checkout session, answers with a redirect to Stripe, and Stripe then sends a redirect back to your server if the customer completed the payment. Superficially, this server-based approach seems much more GDPR-friendly than PayPal, because there are no remote scripts to obtain consent for. In reality though, Stripe shares much more potential personal data about your credit card or bank account with a merchant, compared to PayPal's almost bare minimum of necessary data. :thonk:
It's also rather annoying how the backend has to persist the order form information throughout the entire Checkout session, because it would otherwise be lost if the server restarts while a customer is still busy entering data into Stripe's Checkout form. Compare that to the PayPal JavaScript SDK, which only POSTs back to your server after the customer completed a payment. In Stripe's case, more JavaScript actually only makes the integration harder: If you trigger the initial payment HTTP request from JavaScript, you will have to improvise a bit to avoid the CORS error when redirecting away to a different domain.

But sure, it's all not too bad… for regular orders at least. With subscriptions, however, things get much worse. Unlike PayPal, Stripe kind of wants to stay out of the way of the payment process as much as possible, and just be a wrapper around its supported payment methods. So if customers aren't really meant to register with Stripe, how would they cancel their subscriptions? :thonk:
Answer: Through the… merchant? Which I quite dislike in principle, because why should you have to trust me to actually cancel your subscription after you requested it? It also means that I probably should add some sort of UI for self-canceling a Stripe subscription, ideally without adding full-blown user accounts. Not that this solves the underlying trust issue, but it's more convenient than contacting me via email or, worse, going through your bank somehow. Here is how my solution works:

I might have gone a bit overboard with the crypto there, but I liked the idea of not storing any of the Stripe session IDs in the server database. It's not like that makes the system more complex anyway, and it's nice to have a separate confirmation step before canceling a subscription.

But even that wasn't everything I had to keep in mind here. Once you switch from test to production mode for the final tests, you'll notice that certain SEPA-based payment providers take their sweet time to process and activate new subscriptions. The Checkout session object even informs you about that, by including a payment status field. Which initially seems just like another field that could indicate hacking attempts, but treating it as such and rejecting any unpaid session can also reject perfectly valid subscriptions. I don't want all this control… 🥲
Instead, all I can do in this case is to tell you about it. In my test, the Stripe dashboard said that it might take days or even weeks for the initial subscription transaction to be confirmed. In such a case, the respective fraction of the cap will unfortunately need to remain red for that entire time.

And that was 1½ pushes just to replicate the basic functionality of a simple PayPal integration with the simplest type of Stripe integration. On the architectural site, all the necessary refactoring work made me finally upgrade my frontend code to TypeScript at least, using the amazing esbuild to handle transpilation inside the server binary. Let's see how long it will now take for me to upgrade to SCSS…


With the new payment options, it makes sense to go for another slight price increase, from up to per push. The amount of taxes I have to pay on this income is slowly becoming significant, and the store has been selling out almost immediately for the last few months anyway. If demand remains at the current level or even increases, I plan to gradually go up to by the end of the year.
📝 As 📝 usual, I'm going to deliver existing orders in the backlog at the value they were originally purchased at. Due to the way the cap has to be calculated, these contributions now appear to have increased in value by a rather awkward 13.33%.


This left ½ of a push for some more work on the TH01 Anniversary Edition. Unfortunately, this was too little time for the grand issue of removing byte-aligned rendering of bigger sprites, which will need some additional blitting performance research. Instead, I went for a bunch of smaller bugfixes:

The final point, however, raised the question of what we're now going to do about 📝 a certain issue in the 地獄/Jigoku Bad Ending. ZUN's original expensive way of switching the accessed VRAM page was the main reason behind the lag frames on slower PC-98 systems, and search-replacing the respective function calls would immediately get us to the optimized version shown in that blog post. But is this something we actually want? If we wanted to retain the lag, we could surely preserve that function just for this one instance…
The discovery of this issue predates the clear distinction between bloat, quirks, and bugs, so it makes sense to first classify what this issue even is. The distinction comes all down to observability, which I defined as changes to rendered frames between explicitly defined frame boundaries. That alone would be enough to categorize any cause behind lag frames as bloat, but it can't hurt to be more explicit here.

Therefore, I now officially judge observability in terms of an infinitely fast PC-98 that can instantly render everything between two explicitly defined frames, and will never add additional lag frames. If we plan to port the games to faster architectures that aren't bottlenecked by disappointing blitter chips, this is the only reasonable assumption to make, in my opinion: The minimum system requirements in the games' README files are minimums, after all, not recommendations. Chasing the exact frame drop behavior that ZUN must have experienced during the time he developed these games can only be a guessing game at best, because how can we know which PC-98 model ZUN actually developed the games on? There might even be more than one model, especially when it comes to TH01 which had been in development for at least two years before ZUN first sold it. It's also not like any current PC-98 emulator even claims to emulate the specific timing of any existing model, and I sure hope that nobody expects me to import a bunch of bulky obsolete hardware just to count dropped frames.

That leaves the tearing, where it's much more obvious how it's a bug. On an infinitely fast PC-98, the ドカーン frame would never be visible, and thus falls into the same category as the 📝 two unused animations in the Sariel fight. With only a single unconditional 2-frame delay inside the animation loop, it becomes clear that ZUN intended both frames of the animation to be displayed for 2 frames each:

No tearing, and 34 frames in total for the first of the two instances of this animation.

:th01: TH01 Anniversary Edition, version P0239 2023-05-01-th01-anniv.zip

Next up: Taking the oldest still undelivered push and working towards TH04 position independence in preparation for multilingual translations. The Shuusou Gyoku OpenGL backend shouldn't take that much longer either, so I should have lots of stuff coming up in May afterward.

📝 Posted:
🚚 Summary of:
P0229, P0230, P0231, P0232, P0233, P0234
Commits:
6370f96...d535d87, d535d87...ca523b4, ca523b4...05a49b9, f7ef7f8...abeaf85, abeaf85...dbc5b51, dd2265c...12f29c6
💰 Funded by:
Ember2528, [Anonymous]
🏷 Tags:

128 commits! Who would have thought that the ideal first release of the TH01 Anniversary Edition would involve so much maintenance, and raise so many research questions? It's almost as if the real work only starts after the 100% finalization mark… Once again, I had to steal some funding from the reserved JIS trail word pushes to cover everything I liked to research, which means that the next towards the anything goal will repay this debt. Luckily, this doesn't affect any immediate plans, as I'll be spending March with tasks that are already fully funded.

So, how did this end up so massive? The list of things I originally set out to do was pretty short:

  1. Build entire game into single executable
  2. Fix rendering issues in the one or two most important parts of the game for a good initial impression

But even the first point already started with tons of little cleanup commits. A part of them can definitely be blamed on the rush to hit the 100% decompilation mark before the 25th anniversary last August. However, all the structural changes that I can't commit to master reveal how much of a mess the TH01 codebase actually is.
Merging the executables is mainly difficult because of all the inconsistencies between REIIDEN.EXE and FUUIN.EXE. The worst parts can be found in the REYHI*.DAT format code and the High Score menu, but the little things are just as annoying, like how the current score is an unsigned variable in REIIDEN.EXE, but a signed one in FUUIN.EXE. :zunpet: If it takes me this long and this many commits just to sort out all of these issues, it's no wonder that the only thing I've seen being done with this codebase since TH01's 100% decompilation was a single porting attempt that ended in a rather quick ragequit.
So why are we merging the executables in preparation for the Anniversary Edition, and not waiting with it until we start doing ports?

The game actually is so bloated that the combined binary ended up smaller than the original REIIDEN.EXE. If all you see are the file sizes of the original three executables, this might look like a pretty impressive feat. Like, how can we possibly get 407,812 bytes into less than 238,612 bytes, without using compression?
If you've ever looked at the linker map though, it's not at all surprising. Excluding the aforementioned inconsistencies that are hard to quantify, OP.EXE and FUUIN.EXE only feature 5,767 and 6,475 bytes of unique code and data, respectively. All other code in these binaries is already part of REIIDEN.EXE, with more than half of the size coming from the Borland C++ runtime. The single worst offender here is the C++ exception handler that Borland forces onto every non-.COM binary by default, which alone adds 20,512 bytes even if your binary doesn't use C++ exceptions.
On a more hilarious note, this single line is responsible for pulling another unnecessary 14,242 bytes into OP.EXE and FUUIN.EXE. This floating-point multiplication is completely unnecessary in this context because all possible parameters are integers, but it's enough for Turbo C++ and TLINK to pull in the entire x87 FPU emulation machinery. These two binaries don't even draw lines, but since this function is part of the general graphics code translation unit and contains other functions that these binaries do need, TLINK links in the entire thing. Maybe, multiple executables aren't the best choice either if you use a linker that can't do dead code elimination…

Since the 📝 Orb's physics do turn the entire precision of a double variable into gameplay effects, it's not feasible to ever get rid of all FPU code in TH01. The exception handler, however, can be removed, which easily brings the combined binary below the size of the original REIIDEN.EXE. Compiling all code with a single set of compiler optimization flags, including the more x86-friendly pascal calling convention, then gets us a few more KB on top. As does, of course, removing unused code: The only remaining purpose of features such as 📝 resident palettes is to potentially make porting more difficult for anyone who doesn't immediately realize that nothing in the game uses these functions.
Technically, all unused code would be bloat, but for now, I'm keeping the parts that may tell stories about the game's development history (such as unused effects or the 📝 mouse cursor), or that might help with debugging. Even with that in mind, I've only scratched the surface when it comes to bloat removal, and the binary is only going to get smaller from here. A lot smaller.

If only we now could start MDRV98 from this new combined binary, we wouldn't need a second batch file either…


Which brings us to the first big research question of this delivery. Using the C spawn() function works fine on this compiler, so spawn("MDRV98.COM") would be all we need to do, right? Except that the game crashes very soon after that subprocess returned. :thonk:
So it's not going to be that easy if the spawned process is a TSR. But why should this be a problem? Let's take a look at the DOS heap, and how DOS lays out processes in conventional memory if we launch the game regularly through GAME.BAT:

The rough layout of the DOS heap when launching TH01 from GAME.BAT.

The batch file starts MDRV98 first, which will therefore end up below the game in conventional memory. This is perfect for a TSR: The program can resize itself arbitrarily before returning to DOS, and the rest of memory will be left over for the game. If we assume such a layout, a DOS program can implement a custom memory allocator in a very simple way, as it only has to search for free memory in one direction – and this is exactly how Borland implemented the C heap for functions like malloc() and free(), and the C++ new and delete operators.
But if we spawn MDRV98 after starting TH01, well…

MDRV98 will spawn in the next free memory location, allocate itself, return to TH01… which suddenly finds its C heap blocked from growing. As a result, the next big allocation will immediately fail with a rather misleading "out of memory" error.

So, what can we do about this? Still in a bloat removal mindset, my gut reaction was to just throw out Borland's C heap implementation, and replace it with a very thin wrapper around the DOS heap as managed by INT 21h, AH=48h/49h/4Ah. Like, why did these DOS compilers even bother with a custom allocator in the first place if DOS already comes with a perfectly fine native one? Using the native allocator would completely erase the distinction between TSR memory and game memory, and inherently allow the game to allocate beyond MDRV98.
I did in fact implement this, and noticed even more benefits:

Ultimately though, the drawbacks became too significant. Most of them are related to the PC-98 Touhou games only ever creating a single DOS process, even though they contain multiple executables. Switching executables is done via exec(), which resizes a program's main allocation to match the new binary and then overwrites the old program image with the new one. If you've ever wondered why DOSBox-X only ever shows OP as the active process name in the title bar, you now know why. As far as DOS is concerned, it's still the same OP.EXE process rooted at the same segment, and exec() doesn't bother rewriting the name either. Most importantly though, this is how REIIDEN.EXE can launch into another REIIDEN.EXE process even if there are less than 238,612 bytes free when exec() is called, and without consuming more memory for every successive binary.
For now, ANNIV.EXE still re-exec()s itself at every point where the original game did, as ZUN's original code really depends on being reinitialized at boss and scene boundaries. The resulting accidental semi-hot reloading is also a useful property to retain during development.
So why is the DOS heap a bad idea for regular game allocation after all?

I could release this DOS heap wrapper in unused form for another push if anyone's interested, but for now, I'm pretty happy with not actually using it in the games. Instead, let's stay with the Borland C heap, and find a way to push MDRV98 to the very top of conventional RAM. Like this:

Which is much easier said than done. It would be nice if we could just use the last fit allocation strategy here, but .COM executables always receive all free memory by default anyway, which eliminates any difference between the strategies.
But we can still change memory itself. So let's temporarily claim all remaining free memory, minus the exact amount we need for MDRV98, for our process. Then, the only remaining free space to spawn MDRV98 is at the exact place where we want it to be:

Obviously, we release all the additional memory after spawning MDRV98.

Now we only need to know how much memory to not temporarily allocate. First, we need to replicate the assumption that MDRV98's -M7 command-line parameter corresponds to a resident size of 23,552 bytes. This is not as bad as it seems, because the -M parameter explicitly has a KiB unit, and we can nicely abstract it away for the API.
The (env.) block though? Its minimum size equals the combined length of all environment variables passed to the process, but its maximum size is… not limited at all?! As in, DOS implementations can add and have historically added more free space because some programs insisted on storing their own new environment variables in this exact segment. DOSBox and DOSBox-X follow this tradition by providing a configuration option for the additional amount of environment space, with the latter adding 1024 additional bytes by default, y'know, just in case someone wants to compile FreeDOS on a slow emulator. It's not even worth sending a bug report for this specific case, because it's only a symptom of the fact that unexpectedly large program environment blocks can and will happen, and are to be expected in DOS land.
So thanks to this cruel joke, it's technically impossible to achieve what we want to do there. Hooray! The only thing we can kind of do here is an educated guess: Sum up the length of all environment variables in our environment block, compare that length against the allocated size of the block, and assume that the MDRV98 process will get as much additional memory as our process got. 🤷

The remaining hurdles came courtesy of some Borland C runtime implementation details. You would think that the temporary reallocation could even be done in pure C using the sbrk(), coreleft(), and brk() functions, but all values passed to or returned from these functions are inaccurate because they don't factor in the aforementioned KiB padding to the underlying DOS memory block. So we have to directly use the DOS syscalls after all. Which at least means that learning about them wasn't completely useless…
The final issue is caused inside Borland's spawn() implementation. The environment block for the child process is built out of all the strings reachable from C's environ pointer, which is what that FreeDOS build process should have used. Coalescing them into a single buffer involves yet another C heap allocation… and since we didn't report our DOS memory block manipulation back to the C heap, the malloc() call might think it needs to request more memory from DOS. This resets the DOS memory block back to its intended level, undoing our manipulation right before the actual INT 21h, AH=4Bh EXEC syscall. Or in short:

Manipulate DOS heap ➜ spawn() call ➜ _LoadProg() ➜ allocate and prepare environment block ➜ _spawn() ➜ DOS EXEC syscall

The obvious solution: Replace _LoadProg(), implement the coalescing ourselves, and do it before the heap manipulation. Fortunately, Borland's internal low-level _spawn() function is not static, so we can call it ourselves whenever we want to:

Allocate and prepare environment block ➜ manipulate DOS heap ➜ _spawn() call ➜ EXEC syscall

So yes, launching MDRV98 from C can be done, but it involves advanced witchcraft and is completely ridiculous. :tannedcirno: Launching external sound drivers from a batch file is the right way of doing things.
Fortunately, you don't have to rely on this auto-launching feature. You can still launch DEBLOAT.EXE or ANNIV.EXE from a batch file that launched MDRV98.COM before, and the binaries will detect this case and skip the attempt of launching MDRV98 from C. It's unlikely that my heuristic will ever break, but I definitely recommend replicating GAME.BAT just to be completely sure – especially for user-friendly repacks that don't want to include the original game anyway.
This is also why ANNIV.EXE doesn't launch ZUNSOFT.COM: The "correct" and stable way to launch ANNIV.EXE still involves a batch file, and I would say that expecting people to remove ZUNSOFT.COM from that file is worse than not playing the animation. It's certainly a debate we can have, though.


This deep dive into memory allocation revealed another previously undocumented bug in the original game. The RLE decompression code for the 東方靈異.伝 packfile contains two heap overflows, which are actually triggered by SinGyoku's BOSS1_3.BOS and Konngara's BOSS8_1.BOS. They only do not immediately crash the game when loading these bosses thanks to two implementation details of Borland's C heap. :zunpet:
Obviously, this is a bug we should fix, but according to the definition of bugs, that fix would be exclusive to the anniversary branch. Isn't that too restrictive for something this critical? This code is guaranteed to blow up with a different heap implementation, if only in a Debug build. :thonk: And besides, nobody would notice a fix just by looking at the game's rendered output…

Looks like we have to introduce a fourth category of weird code, in addition to the previous bloat, bug, and quirk categories, for invisible internal issues like these. Let's call it landmine, and fix them on the debloated branch as well. Thanks to Clerish for the naming inspiration!
With this new category, the full definitions for all categories have become quite extensive. Thus, they now live in CONTRIBUTING.md inside the ReC98 repository.

With the new discoveries and the new landmine category, TH01 is now at 67 bugs and 20 landmines. And the solution for the landmine in question? Simplifying the 61 lines of the original code down to 16. And yes, I'm including comments in these numbers – if the interactions of the code are complex enough to require multi-paragraph comments, these are a necessary and valid part of the code.


While we're on the topic of weird code and its visible or invisible effects, there's one thing you might be concerned about. With all the rearchitecting and data shifting we're doing on the debloated branch, what will happen to the 📝 negative glitch stages? These are the result of a clearly observable bug that, by definition, must not be fixed on the debloated branch. But given that the observable layout of the glitch stages is defined by the memory surrounding the scene stage variable, won't the debloated branch inherently alter their appearance (= ⚠️ fanfiction ⚠️), or even remove them completely?

Well, yes, it will. But we can still preserve their layout by hardcoding the exact original data that the game would originally read, and even emulate the original segment relocations and other pieces of global data.
Doing this is feasible thanks to the fact that there are only 4 glitch stages. Unfortunately, the same can't be said for the timer values, which are determined by an array lookup with the un-modulo'd stage ID. If we wanted to preserve those as well, we'd have to bundle an exact copy of the original REIIDEN.EXE data segment to preserve the values of all 32,768 negative stages you could possibly enter, together with a map of all relocations in this segment. 😵 Which I've decided against for now, since this has been going on for far too long already. Let's first see if anyone ever actually complains about details like this…


Alright, time to start the anniversary branch by rendering everything at its correct internal unaligned X position? Eh… maybe not quite yet. If we just hacked all the necessary bit-shifting code into all the format-specific blitting functions, we'd still retain all this largely redundant, bad, and slow code, and would make no progress in terms of portability. It'd be much better to first write a single generic blitter that's decently optimized, but supports all kinds of sprites to make this optimization actually worth something.
So, next research question: How would such a blitter look like? After I learned during my 📝 first foray into cycle counting that port I/O is slow on 486 CPUs, it became clear that TH04's 📝 GRCG batching for pellets was one of the more useful optimizations that probably contributed a big deal towards achieving the high bullet counts of that game. This leads to two conclusions:

Maybe we should also start by not even doing these unaligned bit shifts ourselves, and instead expect the call site to 📝 always deliver a byte-aligned sprite that is correctly preshifted, if necessary? Some day, we definitely should measure how slow runtime shifting would really be…

What we should do, however, are some further general optimizations that I would have expected from master.lib: Unrolling the vertical loop, and baking a single function for every sprite width to eliminate the horizontal loop. We can then use the widest possible x86 MOV instruction for the lowest possible number of cycles per row – for example, we'd blit a 56-wide sprite with three MOVs (32-bit + 16-bit + 8-bit), and a 64-wide one with two 32-bit MOVs.
Or maybe not? There's a lot of blitting code in both master.lib and PC-98 Touhou that checks for empty bytes within sprites to skip needlessly writing them to VRAM:

uint8_t left_half = ((uint8_t *)(sprite))[0];
uint8_t right_half = ((uint8_t *)(sprite))[1];
if(right_half != 0x00) {
	pokeb(VRAM_SEGMENT, (vram_offset + 0), left_half);
}
if(right_half != 0x00) {
	pokeb(VRAM_SEGMENT, (vram_offset + 1), right_half);
}

Which goes against everything you seem to know about computers. We aren't running on an 8-bit CPU here, so wouldn't it be faster to always write both halves of a sprite in a single operation?

uint16_t both_halves = ((uint16_t *)(sprite))[0];
pokew(VRAM_SEGMENT, vram_offset, both_halves);

That's a single CPU instruction, compared to two instructions and two branches. The only possible explanation for this would be that VRAM writes are so slow on PC-98 that you'd want to avoid them at all costs, even if that means additional branching on the CPU to do so. Or maybe that was something you would want to do on certain models with slow VRAM, but not on others?

So I wrote a benchmark to answer all these questions, and to compare my new blitter against typical TH01 blitting code:

A not really representative run on DOSBox-X. Since the master.lib sprite functions are also unbatched, I expect them to not be much faster than the naive C implementation.

2023-03-05-blitperf.zip And here are the real-hardware results I've got from the PC-9800 Central Discord server:

PC-286LS PC-9801ES PC-9821Cb/Cx PC-9821Ap3 PC-9821An PC-9821Nw133 PC-9821Ra20
80286, 12 MHz i386SX, 16 MHz 486SX, 33 MHz 486DX4, 100 MHz Pentium, 90 MHz Pentium, 133 MHz Pentium Pro, 200 MHz
1987 1989 1994 1994 1994 1997 1996
Unchecked C GRCG 36,85 38,42 26,02 26,87 3,98 4,13 2,08 2,16 1,81 1,87 0,86 0,89 1,25 1,25
MOVS GRCG 15,22 16,87 9,33 10,19 1,22 1,37 0,44 0,44
MOV GRCG 15,42 17,08 9,65 10,53 1,15 1,3 0,44 0,44
4-plane 37,23 43,97 29,2 32,96 4,44 5,01 4,39 4,67 5,11 5,32 5,61 5,74 6,63 6,64
Checking first GRCG 17,49 19,15 10,84 11,72 1,27 1,44 1,04 1,07 0,54 0,54
4-plane 46,49 53,36 35,01 38,79 5,66 6,26 5,43 5,74 6,56 6,8 8,08 8,29 10,25 10,29
Checking second GRCG 16,47 18,12 10,77 11,65 1,25 1,39 1,02 0,51 0,51
4-plane 43,41 50,26 33,79 37,82 5,22 5,81 5,14 5,43 6,18 6,4 7,57 7,77 9,58 9,62
Checking both GRCG 16,14 18,03 10,84 11,71 1,33 1,49 1,01 0,49 0,49
4-plane 43,61 50,45 34,11 37,87 5,39 5,99 4,92 5,23 5,88 6,11 7,19 7,43 9,1 9,13
Amount of frames required to render 2000 16×8 pellet sprites on a variety of PC-98 models, using the new generic blitter. Both preshifted (first column) and runtime-shifted (second column) sprites were tested; empty columns correspond to times faster than a single frame. Thanks to cuba200611, Shoutmon, cybermind, and Digmac for running the tests!

The key takeaways:

Since this won't be the only piece of game-independent and explicitly PC-98-specific custom code involved in this delivery, it makes sense to start a dedicated PC-98 platform layer. This code will gradually eliminate the dependency on master.lib and replace it with better optimized and more readable C++ code. The blitting benchmark, for example, is already implemented completely without master.lib.
While this platform layer is mainly written to generate optimal code within Turbo C++ 4.0J, it can also serve as general PC-98 documentation for everyone who prefers code over machine-translating old Japanese books. Not to mention the immediacy of having all actual relevant information in one place, which might otherwise be pretty well hidden in these books, or some obscure old text file. For example, did you know that uploading gaiji via INT 18h might end up disabling the VSync interrupt trigger, deadlocking the process on the next frame delay loop? This nuisance is not replicated by any emulators, and it's quite frustrating to encounter it when trying to run your code on real hardware. master.lib works around it by simply hooking INT 18h and unconditionally reenabling the VSync interrupt trigger after the original handler returns, and so does our platform layer.


So, with the pellet draw calls batched and routed through the new renderer, we should have gained enough free CPU cycles to disable 📝 interlaced pellet rendering without any impact on frame rates?

Well, kinda. We do get 56.4 FPS, but only together with noticeable and reproducible tearing in the top part of the playfield, suggesting exactly why ZUN interlaced the rendering in the first place. 😕 So have we already reached the limit of single-buffered PC-98 games here, or can we still do something about it?
As it turns out, the main bottleneck actually lies in the pellet unblitting code. Every EGC-"accelerated" unblitting call in TH01 is as unbatched as the pellet blitting calls were, spending an additional 17 I/O port writes per call to completely set up and shut down the EGC, every time. And since this is TH01, the two-instruction operation of changing the active PC-98 VRAM page isn't inlined either, but instead done via a function call to a faraway segment. On the 486, that's:

This sums up to

And this calculation even ignores the lack of small micro-optimizations that could further optimize the blitting loop. Multiply that by the game's pellet cap of 100, and we get a 6-digit number of wasted CPU cycles. On paper, that's roughly 1/6 of the time we have for each of our target 56.423 FPS on the game's target 33 MHz systems. Might not sound all too critical, but the single-buffered nature of the game means that we're effectively racing the beam on every frame. In turn, we have to be even more serious about performance.

So, time to also add a batched EGC API to our PC-98 platform layer? Writing our own EGC code presents a nice opportunity to finally look deeper into all its registers and configuration options, and see what exactly we can do about ZUN's enforced 16-pixel alignment.
To nobody's surprise, this alignment is completely unnecessary, and only displays a lack of knowledge about the chip. While it is true that the EGC wants VRAM to be exclusively addressed in 16-bit chunks at 16-bit-aligned addresses, it specifically provides

And it gets even better: After ⌈bitlength ÷ 16⌉ write instructions, the EGC's internal shifter state automatically reinitializes itself in preparation for blitting another row of pixels with the same initially configured bit addresses and length. This is perfect for blitting rectangles, as two I/O port writes before the start of your blitting loop are enough to define your entire rectangle.

The manual nature of reading and writing in 16-pixel chunks does come with a slight pitfall though. If the source bit address is larger than the destination bit address, the first 16-bit read won't fill the EGC's internal shift register with all pixels that should appear in the first 16-pixel destination chunk. In this case, the EGC simply won't write anything and leave the first chunk unchanged. In a 📝 regular blitting loop, however, you expect that memory to be written and immediately move on to the next chunks within the row. As a result, the actual blitting process for such a rectangle will no longer be aligned to the configured address and bit length. The first row of the rectangle will appear 16 pixels to the right of the destination address, and the second one will start at bit offset 0 with pixels from the rightmost byte of the first line, which weren't blitted and remained in the tile register.
There is an easy solution though: Before the horizontal loop on each line of the rectangle, simply read one additional 16-pixel chunk from the source location to prefill the shift register. Thankfully, it's large enough to also fit the second read of the then full 16 pixels, without dropping any pixels along the way.

And that's how we get arbitrarily unaligned rectangle copies with the EGC! Except for a small register allocation trick to use two-register addressing, there's not much use in further optimizations, as the runtime of these inter-page blit operations is dominated by the VRAM page switches anyway.

Except that T98-Next seems to disagree about the register prefilling issue:

Glitched blitting results on T98-Next when trying EGC copies where the source bit address is larger than the destination bit address

Every other emulator agrees with real hardware in this regard, so we can safely assume this to be a bug in T98-Next. Just in case this old emulator with its last release from June 2010 still has any fans left nowadays… For now though, even they can still enjoy the TH01 Anniversary Edition: The only EGC copy algorithm that TH01 actually needs is the left one during the single-buffered tests, which even that emulator gets right.
That only leaves 📝 my old offer of documenting the EGC raster ops, and we've got the EGC figured out completely!


And that did in fact remove tearing from the pellet rendering function! For the first time, we can now fight Elis, Kikuri, Sariel, and Konngara with a doubled pellet frame rate:

Switchable videos like these can nicely provide evidence that these changes have no effect on gameplay, making it easy to see that the Orb still collides with all pellets on the same frames. Also, check out the difference in remaining conventional memory (coreleft)…

With only pellets and no other animation on screen, this exact pattern presents the optimal demonstration case for the new unblitter. But as you can already tell from the invincibility sprites, we'd also need to route every other kind of sprite through the same new code. This isn't all too trivial: Most sprites are still rendered at byte-aligned positions, and their blitting APIs hide that fact by taking a pixel position regardless. This is why we can't just replace ZUN's original 16-pixel-aligned EGC unblitting function with ours, and always have to replace both the blitter and the unblitter on a per-sprite basis.
To completely remove all flickering, we'd also like to get rid of all the sprite-specific unblit ➜ update ➜ render sequences, and instead gather all unblitting code to the beginning of the game loop, before any update and rendering calls. So yeah, it will take a long time to completely get rid of all flickering. Until we're there, I recommend any backer to tell me their favorite boss, so that I can focus on getting that one rendered without any flickering. Remember that here at ReC98, we can have a Touhou character popularity contest at any time during the year, whenever the store is open! :tannedcirno:

In the meantime, the consistent use of 8×8 rectangles during pellet unblitting does significantly reduce flickering across the entire game, and shrinks certain holes that pellets tend to rip into lazily reblitted sprites:

TH01 SinGyoku's crossing pellet pattern in the Anniversary Edition, demonstrating smaller unblitting artifactsThe same frame in the original game, featuring much more giant holes ripped into the sphere sprite
SinGyoku's "crossing pellets" pattern, shortly before completing the transformation back to the sphere.

To round out the first release, I added all the other bug fixes to achieve parity with my previously released patched REIIDEN.EXE builds:

So here it is, the first build of TH01's Anniversary Edition: 2023-03-05-th01-anniv.zip Edit (2023-03-12): If you're playing on Neko Project and seeing more flickering than in the original game, make sure you've checked the Screen → Disp vsync option.

Next up: The long overdue extended trip through the depths of TH02's low-level code. From what I've seen of it so far, the work on this project is finally going to become a bit more relaxing. Which is quite welcome after, what, 6 months of stressful research-heavy work?