Talk about a nerd snipe! I just wanted to take the first meaningful step towards getting PC-98 Touhou portable. But then, that step massively escalated and resulted in not only the single biggest subproject of 2025, but also in the most productive dev cycle this project has seen since the beginning of the crowdfunding era. 405 commits over 11 pushes, and touching on so many topics that writing a single blog post would have been way too much for even me to handle. So let's try something new and split this delivery into four "smaller" and thematically more focused posts that I'll release in quick succession:
Part 1 (this post) describes the various strategies of porting PC-98 Touhou to modern platforms, explains which one I'm going to take and why, and clears up common misconceptions surrounding performance and accuracy. This one is required reading for anyone (yes, anyone) who believes they want to see these games ported. Hence, it's also intended for people who aren't that familiar with ReC98 and its usual ideals, and tries to not go all too far into technical detail. (Hopefully.)
So, how do we get the PC-98 Touhou codebase into a portable state? That entirely depends on what kind of port we want in the first place, and how much of ZUN's code we are willing to change. Three particularly efficient options immediately come to mind:
On one end of the spectrum, we have a preconfigured PC-98 emulator with disabled configuration options and a stripped-down UI that tricks people into believing they're playing a port and prevents them from accidentally breaking the working configuration.
This might sound like a joke, but it's unironically the most efficient and pragmatic solution that will be good enough for the overwhelming majority of players. If you ask people what they expect from a port, they primarily name ease of use and not having to configure emulators. Both of these can be solved with a preconfigured emulator and thus don't justify the monumental engineering effort of the more complex porting methods described below. That effort also wouldn't be justified if people just wanted a port and had no standards regarding its technical implementation, besides maybe no input lag. Someone has to put in the effort to solve every little challenge on the way from PC-98 to modern systems, and if that effort is not appreciated…
By the way, I have no idea what people are talking about when they claim that PC-98 Touhou has input lag, because there sure is nothing like that in the code that would indicate anything above 1 frame / 17.7 ms for the in-game portions. Any investigation into these issues would therefore have to come from someone else, I'm afraid. Everything points to input lag being the result of misconfigured emulators.
This is not like Shuusou Gyoku, where a port to modern APIs made sense because almost every subsystem still performs suboptimally on modern Windows even after you set up DxWnd, a better MIDI synth, and whatever people are using to make modern gamepads work with ancient DirectInput these days. If you correctly set up a PC-98 emulator, the games do run at full speed, and are highly likely to continue running fine after emulator and operating system version updates.
Thus, can we conclude that wishing for ports is primarily a symptom of the Touhou community's past failure and negligence to spread preconfigured emulators to people? Because this surely shouldn't be a problem in this day and age anymore? While I did my part way back in 2013, it would take until spaztron64's 2021 package for the community at large to finally wake up and realize that this was a problem. Nowadays though, we have at least three decent packages made by separate people that have my personal seal of approval. And yes, this even includes the offering you can obtain at a certain mountaintop place of worship. That site used to be infamous for pushing out slop that violated their own mission statement and externalized costs to the tech support departments of their supply chain, but I'm glad to announce that they've leveled up and now provide a decent solution. And once they remove that archive inside their archive, it will be even better!
Still, if your emulator configuration guides are presented more prominently than your preconfigured emulator downloads, you're doing a disservice to the community. Make guides available, yes, but clearly label them as background information for people who already played the games and then got curious about this old Japanese computer architecture.
OK, but what if you do have standards and would appreciate a technically more solid port that removes layers and maybe even improves the games beyond the limits of the PC-98's architecture? If we take a single step towards native code and native performance, we end up with what people call a "static recompilation" these days. As I explained in the FAQ entry I wrote last year, this kind of port would still emulate the graphics, sound, input, and memory subsystems of a PC-98, but it would cut out CPU emulation.
For PC-98 Touhou, this is actually quite a huge deal: CPU speed is the single biggest point of contention when configuring PC-98 emulators for Touhou, and the vastly different x86 cores of each emulator result in vastly different performance characteristics once you start to benchmark them all more thoroughly. With no more CPU cycles to count, we'd also lose all the VRAM access latencies that emulators typically strive to replicate, and thus pretty much guarantee 0% slowdown in the resulting port. While the aforementioned kind of modded emulator could theoretically also remove cycle counting and VRAM latencies, it would still interpret x86 instructions and thus have a harder time actually reaching the native performance required for 0% slowdown.
This kind of port would also find immediate acceptance within the gameplay community. Since it would only take ZUN's original binaries as input and ignore our reconstructed source, we're guaranteed to retain the exact gameplay logic. The entire instruction translation process would be automated, leaving no room for modernizing the codebase by hand 📝 and accidentally breaking gameplay. We'd still have to defuse at least a few landmines to get the port running without issue, but those would be limited to things like filename casing, for example. Nothing even remotely close to gameplay code.
On the other end of the spectrum, we have something like uth05win: A fully native rewrite of the graphics code that takes every liberty and cuts every corner it needs to rework the game into something that naturally renders within a modern graphics API of our choice. Unlike uth05win, however, our ports will be based on complete decompilations and thus retain the original gameplay code instead of freely rewriting certain parts because they look strange. In turn, we would basically scrap all of ZUN's menu and cutscene code and write quirk-free and sane replacements. Part 4 will drive home just how much more relaxing this course of action would have been…
There's certainly an argument to be had that a modern port should reimagine the game to look and feel as modern as you can get within the original assets, and not stick to PC-98 limitations. After all, the unmodified PC-98 version is always there for you to play on your correctly configured emulator, right? In fact, if we ever wanted to port the games to weaker systems or consoles, this kind of port would be our only option.
But as you might have guessed, we're not going for either of these options:
The first option doesn't even need anything from ReC98. Even the sleekest imaginable release could be done by anyone who either knows about PC-98 emulation or keeps in contact with someone who does, and is comfortable messing around with emulator source code. In fact, I'm not even a particularly qualified person for this job; I frequently mess with emulator configurations for research reasons, and then forget the correct values for certain obscure settings.
This is such an obvious and efficient move that I seriously wonder why nobody has done it so far… but then again, I thought the same about every other idea I ended up doing myself in this space over the past 15 years. If that idea sounds great to you, feel free to go ahead – it represents the opposite of what this project is about, so the resulting fame is yours for the taking. If y'all see "ports" popping up from a place that isn't this project in the not-too-distant future, you can be pretty sure that their developers followed this strategy.
The second option would indeed be an interesting project in its own right, as I've stated in the FAQ entry. But if you remember 📝 the last time I thought about static recompilation, I was way more excited for recompiling the old compiler we use for the PC-98 code rather than the games themselves. Ironically, this is primarily because of how much a recompilation would complicate the new features we plan to add to the games. Since I can only develop new features on top of a previous reverse-engineering effort, they will necessarily remain tied to the PC-98-native version of the codebase at first. How would we port them, then?
Do I continue developing these features for the PC-98 and then simply recompile them along with the rest of the game? The issue with that approach is that most features won't have a version that could work with the original ZUN codebase that we'd prefer to recompile. For everyone's sanity, most features will only exist as part of a respective game's anniversary branch, which in turn is based on the rearchitected and de-landmined debloated branch. Recompiling these branches would undermine the entire selling point of delivering the pure, untainted ZUN code that would have probably convinced the gameplay community to invest in this strategy in the first place. It might be good enough for the rest of the community, but if I'm going to rearchitect the PC-98 codebase anyway, would there even be a point in developing the required recompilation techniques on the side? Would this give us ports faster than following a more classical approach?
Then again, I could still try slicing out the code for these features in a way that would allow them to be shared between the rearchitected PC-98 and recompiled ZUN codebases. But that's bound to create an unnatural and awkward mess that's probably even worse than the way I have to arrange ZUN's code on the unmodified master branch. I'd definitely charge extra for that.
Do I just copy-paste and maintain two versions of the feature code for both platforms, manually transferring all required reverse-engineering to the recompilation? That might feel very dull, but it's probably more efficient than any attempt at sharing that code.
Or do I just abandon the PC-98-native codebase? In favor of a pseudo-PC-98 codebase that still very much assumes PC-98 hardware but doesn't actually run on real or conventionally emulated PC-98 hardware…
The last point in particular demonstrates just how little of a help a recompilation would actually be. Since it would continue to emulate the PC-98's graphics system, I'd still have to write any new graphics code against the PC-98's planar and two-page VRAM. Automatically porting the games to a friendlier and more generic rendering paradigm is infeasible for even an advanced recompiler: Every part of the original game expects PC-98 hardware, and a generic rewrite requires engineering decisions at a much higher level than the individual x86 instructions a recompilation operates at.
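To illustrate what "planar" means for any code written against this kind of VRAM, here is a minimal sketch of how a row of pixels has to be pieced together from separate bitplanes. The function name is made up for this example, and the plane order (B, R, G, E) and the leftmost-pixel-in-the-highest-bit layout reflect my understanding of the hardware rather than anything taken from the games' code:

```python
PLANE_COUNT = 4  # assumed plane order: B, R, G, E


def planar_row_to_indices(planes):
    """Converts one row of planar pixel data into palette indices.

    `planes` holds 4 equally long byte strings, one per bitplane. Every byte
    covers 8 horizontally adjacent pixels with 1 bit per pixel.
    Returns one 4-bit palette index per pixel.
    """
    if len(planes) != PLANE_COUNT:
        raise ValueError("expected exactly 4 bitplanes")
    row_bytes = len(planes[0])
    if any(len(plane) != row_bytes for plane in planes):
        raise ValueError("all bitplanes must have the same length")

    indices = []
    for byte_x in range(row_bytes):
        # Assumption: the most significant bit is the leftmost pixel.
        for bit in range(7, -1, -1):
            index = 0
            for plane_number, plane in enumerate(planes):
                index |= ((plane[byte_x] >> bit) & 1) << plane_number
            indices.append(index)
    return indices


# One byte per plane = 8 pixels.
row = planar_row_to_indices([
    bytes([0b10000000]),  # B
    bytes([0b11000000]),  # R
    bytes([0b00000000]),  # G
    bytes([0b00000001]),  # E
])
```

A single pixel is spread across four different memory locations, which is exactly the kind of assumption that a per-instruction translation can't abstract away.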
And ultimately, it's these individual features that people should be (and mostly are) hyped for. Community-usable replays, translations, and TH03 netplay can all be implemented natively on PC-98. Sure, netplay would be easier to develop and easier to use within a TH03 recompilation since we can just use the native network stack of your host OS 📝 without any intermediaries. But developing both a recompiler and netplay would still take longer than 📝 following through with our current PC-98-native plan.
The third option is actually quite popular, or would at least be acceptable to the majority of the general fandom. This is what non-technical people have in mind anyway when they think about ports, even if they don't confuse ports with remakes.
To find out just how acceptable such a port would be, I picked screen fade effects as a representative detail for the corners that such a port would cut, and asked how people judge the natural alpha-blended implementation in uth05win against the palette-based method you'd use on a PC-98. Surprisingly, a whopping 79% of respondents don't have any problem with a port using whatever is most natural for the system it runs on. And that's 79% of my audience, which certainly is at least somewhat aware of PC-98 hardware details and the limitations that shaped these games into what they are. Of course, the 21% of die-hard PC-98 supremacists would then loudly complain that such a choice would make the port literally unplayable, but we could easily dismiss them by pointing to the poll where the community decided in favor of the smoother option. After all, ZUN's intention was to have a fade, and manipulation of a 12-bit color palette was simply the only tool he had on a PC-98.
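For readers who haven't seen the two fade methods side by side, here is a rough sketch of the difference. Both functions and their scaling rules are made up for illustration and are neither ZUN's nor uth05win's actual code; the only assumption taken from the hardware is that each palette entry has 4 bits per color channel:

```python
def fade_palette(palette, brightness):
    """Palette-based fade: scales every entry of the hardware palette.

    `palette` is a list of (r, g, b) tuples with 4-bit components (0-15).
    `brightness` ranges from 0 (black) to 16 (original colors).
    """
    if not 0 <= brightness <= 16:
        raise ValueError("brightness must be between 0 and 16")
    return [
        tuple((component * brightness) // 16 for component in color)
        for color in palette
    ]


def alpha_blend_to_black(rgb8, alpha):
    """Alpha-blended fade: scales an 8-bit color by an arbitrary factor.

    `alpha` ranges from 0.0 (black) to 1.0 (original color).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    return tuple(round(component * alpha) for component in rgb8)


half_faded = fade_palette([(15, 8, 1)], 8)
quarter_blend = alpha_blend_to_black((255, 200, 0), 0.25)
```

With only 16 levels per channel, dark palette entries snap to black long before the fade is over, while the alpha-blended version has 256 levels per channel to work with. That difference is what the poll was about.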
However, the gameplay community has much higher hopes for ReC98. Both they and I don't just want to supplement the original PC-98 versions with something that's playable on modern systems, but
> replace the need for the proprietary, PC-98-exclusive original releases and their emulation for even the most conservative fan
as I wrote back in 2014. Sure, the community can manage spreading pre-configured emulators for a few more years, but wouldn't it be great if they could stop doing that at some point in the far future?
So if all the "easy" solutions either don't have much of a purpose or disappoint in some way, we're only left with the hard one: A classic, manual port done primarily for the sake of solving an engineering challenge. But hey, this means that it'll also produce tons of blog posts for all of you to read, which apparently is at least as popular as actually playing the games.
Here's what we're going to do:
Rearchitect the game to end up with one shared codebase that compiles for both PC-98 and modern systems, avoiding the code duplication drawback of static recompilation approaches.
Accept nothing less than a pixel-perfect port. The PC-98 and modern versions should look identical on every frame. It is not ReC98's job to reimagine the games; as usual, I'm going to do the hard work, and it's up to other modders to throw it all out and simplify it later.
Perform all the automated gameplay validation we possibly can to earn the trust of the gameplay community, avoiding debacles like 📝 the 📝 two recent desyncs in my Shuusou Gyoku build. This forces us to have a lightweight method of recording replays on top of the unmodified master branch before we can start porting – a fact that Ember2528 already somewhat identified within his current roadmap of funding priorities for TH03.
Continue fixing landmines, bugs, and bloat. Many landmines must necessarily be fixed for a port to work at all, bugfixes are highly requested by most fans and backers, and bloat fixes ensure maintainability, moddability, and bring the PC-98 versions closer to the performance a modern port will naturally run at.
Sure, the main drawback here is the immense development effort required. But in exchange, the port retains readable and moddable code and continues to deliver the insights that this project has always stood for. Imagine stepping through gameplay code using a native C/C++ debugger at your native screen resolution!
But before we can get to how I'm going to do all that, there are two popular misconceptions I have to address.
> Note that none of the emulators have accurate slowdown; the slowdown will not match real hardware.
Objectively, this is a true statement. Neko Project's i386 core is the closest thing to cycle-accurate PC-98 emulation we have, as its per-instruction cycle counts match Intel's documentation. But even its performance characteristics are wildly inaccurate compared to a real PC-98 system with a 386, as we're going to see in the next blog post.
The problem I have with this sentence is that it's very misleading in this specific context. The mere mention of accurate slowdown in a beginner's guide on PC-98 emulation paints said slowdown as something desirable and worthy of preservation. It evokes stories of console speedrunners and emulator developers who deal with fixed, well-defined hardware where the concept of accurate slowdown makes sense. Stories that probably originated from a time before decompilations of classic games became commonplace, when it was hard to say whether a particular instance of slowdown was intended or not. And even with a decompilation, these things remain a matter of interpretation if you can't ask the original developer. Thus, it's completely understandable why observable behavior of real hardware remains the one benchmark of accuracy and quality that people can understand and rally around.
The PC-98, however, is very much not that kind of fixed system, but a computer architecture that spanned 18 years of hardware evolution, from 1982 to 2000. Even if we reduce this list of models to the ones that match ZUN's stated minimum system requirements, we're still looking at 7 years of hardware, running different microarchitectures at different clock speeds and with different resulting bottlenecks. If there's such a big variety of systems, which particular slowdown behavior should the ports even preserve?
The obvious answer is "the one from the exact system ZUN wrote these games on", but we don't know that system. 📝 Last year, I claimed that ZUN developed these games on a PC-9821Xa7, but I didn't add a citation back then and can't find one now. The closest piece of related known info is this note on the Amusement Makers page that hosts the official downloads for the trial versions, listing three PC-98 models that they confirmed to run the games without issues:
Note that our circle has confirmed the games to run correctly on, among others:
・ NEC PC-9821Xs i486DX2 66MHz
・ NEC PC-9821La13 Pentium Processor (P54C) 133MHz
・ EPSON PC-486MS, CPU swapped for an AMD 5x86-P133
These models are one whole CPU generation apart and their clock speed differs by 100%. Which one of these is supposed to have the accurate slowdown?
But even if we knew, it doesn't matter. The README is clear about ZUN's intentions:
If ZUN recommends a 486 or faster to avoid slowdown, this necessarily means that any unintentional slowdown is indeed unwanted.
Also, note how only TH02's README claims that the game was exclusively tested on a 66 MHz model, which is highly likely to be that PC-9821Xs listed on the Amusement Makers page. Did ZUN switch to a faster PC-98 model for the development of the last three games? That late into the architecture's lifespan? Or did he merely test the game on faster models while the main development still took place on his 66 MHz model?
Picking a CPU clock speed for emulators
Of course, this now creates a problem for everyone wanting to configure emulators for PC-98 Touhou. If the ideal Touhou machine is infinitely fast, we should always pick the fastest possible emulated CPU speed, right? Historically, this has been bad advice: Most emulators will then stick to exactly the number of cycles per emulated second you specified in the menu, slowing down the emulated system as a result. It's this kind of emulator behavior that gets players to manually look for "the sweet spot" – the maximum possible explicitly specified CPU clock speed that still manages to render without slowdown on their system. This is a tragedy for many reasons:
Regular players probably don't analyze performance with any kind of rigor. I certainly have never heard them say how they made sure to record a video at 56.423 FPS and then stepped through its individual frames to confirm the absence of lag.
Instead, they will probably present their clock speed configuration as a general recommendation to others, without realizing that the "sweet spot" they found is specific to their system. If others then try this clock speed on a slower CPU, they get slowdown instead, and thus gain an entirely wrong impression about how fast the game is supposed to run, backed by a presumptive expert on the topic.
Admittedly, this will become less likely as time marches on, CPUs get faster, and emulators keep optimizing their x86 cores.
But really, why are we expecting players to do this?!
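To make the "sweet spot" a bit more tangible, here is a deliberately simplified toy model of a cycle-limited emulator. It is not how any real emulator computes its timing, all names and parameters are hypothetical, and it assumes that the host can execute a constant number of emulated cycles per second:

```python
import math

VSYNC_HZ = 56.423  # refresh rate mentioned above


def observed_fps(configured_hz, host_hz, cycles_per_logical_frame):
    """Estimates the frame rate a player would see in real time.

    configured_hz: emulated clock speed picked in the emulator's menu
    host_hz: emulated cycles per second the host can actually execute
    cycles_per_logical_frame: emulated cycles the game needs for one frame
    """
    cycles_per_vsync = configured_hz / VSYNC_HZ
    # Too slow an emulated CPU: the game misses VSync deadlines.
    vsyncs_per_frame = max(
        1, math.ceil(cycles_per_logical_frame / cycles_per_vsync)
    )
    # Too fast an emulated CPU: the host can't keep up, and the whole
    # emulated system falls behind real time.
    emulation_speed = min(1.0, host_hz / configured_hz)
    return (VSYNC_HZ / vsyncs_per_frame) * emulation_speed


HOST_HZ = 200e6       # made-up host throughput
HEAVY_FRAME = 2e6     # made-up cost of a CPU-intensive frame

too_low = observed_fps(66e6, HOST_HZ, HEAVY_FRAME)
sweet_spot = observed_fps(200e6, HOST_HZ, HEAVY_FRAME)
too_high = observed_fps(400e6, HOST_HZ, HEAVY_FRAME)
```

In this model, both the too-low and the too-high setting halve the frame rate, and only the setting that matches the host's throughput reaches full speed. Since that throughput differs from system to system, so does the sweet spot.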
Ever since 2019, however, SimK has been developing an Async CPU mode for Neko Project 21/W, which finally got stabilized in ver0.86 rev.93, back in April of this year. Activate this mode with the Screen → CPU clock stabilizer and Screen → Dynamic CPU clock adjustment options, and then you should theoretically be able to finally stop worrying: Just specify the maximum possible clock speed in the usual configuration menu, and Neko Project will dynamically reduce the emulated clock speed to the fastest speed your system can handle.
Then, the games are supposed to run similarly to how a correctly configured Anex86 has been running them all along, but with an additional 21 years of emulation accuracy improvements.
Sadly, this mode still needs a bit of work. Excessively high clock speeds will result in wildly fluctuating frame rates and even BGM tempos during the first few seconds of a game session as Neko Project 21/W apparently takes a while to find the optimal clock speed. Even afterwards, emulation remains noticeably slower than Anex86:
This is Neko Project 21/W ver.0.86 rev.95 configured with a clock speed of 1 GHz, running on an Intel Core i5-8400T. The fluctuations are not nearly as intense during the rest of a game session, but remain noticeable throughout.
But what about DOSBox-X, the other good emulator recommended these days? This Async CPU mode is very similar to the cycles=max option that DOSBox-X has supported all along. If you try running my 📝 past and future blitting benchmarks using this option, you can observe how DOSBox-X also starts with a low cycle count and then gradually speeds up to accommodate the actual processing load.
In the much less synthetic test case of running PC-98 Touhou, however, DOSBox-X's cycle adjustment reveals itself as much more sophisticated than Neko Project 21/W's implementation. The showdetails=true option reveals that the cycle count does fluctuate quite heavily, which does translate into minor BGM dropouts particularly near the start of a session. But these dropouts are tiny in comparison to what you'd get on Neko Project 21/W, and the framerate remains stable throughout.
As for overall performance, DOSBox-X's simple interpreter core is not nearly as optimized as Neko Project 21/W's interpreter and peaks at roughly half of its speed. The dynamic_nodhfpu core, however, solidly beats Neko Project 21/W by the same 50%. And it's this added bit of performance that makes all the difference: It eradicates slowdown in most of the usual spots in PC-98 Touhou where emulators and even Anex86 typically struggle, and turns DOSBox-X into the first emulator to finally beat Anex86's performance on the same hardware in all the workloads that matter. The dynamic core still doesn't quite reach the speeds of the hypothetical infinitely fast PC-98 on my outdated system, but it remains the most reliable configuration option when it comes to delivering ZUN's intended vision. If we ignore the BGM dropouts.
Just make sure to explicitly select the dynamic_nodhfpu variant, not the regular dynamic core. The latter is infamous for recompilation errors in FPU code that break TH01 gameplay. While that specific issue is ostensibly fixed, I still managed to occasionally run into smaller FPU-related bugs in current DOSBox-X versions. Unfortunately, I didn't manage to capture them on video; I would have reopened the issue on the spot if I did.
Of course, any performance measurement of an emulator with dynamic cycle adjustment can only ever represent a snapshot of the ever-changing adjustment state, and should therefore be taken with a grain of salt. Hence, these screenshots are purely decorative; I just added them because I'm sure that someone would have asked for exact numbers otherwise. Also, the exact relations between emulators are highly dependent on the workload…
And yes, that's a new benchmark! More about this one 📝 in part 2.
(Still, it's remarkable how close Anex86 gets despite its interpreter core, and how it even beats DOSBox-X in MOVS performance. I looked at Anex86's disassembly for 10 minutes and saw big tables of tiny per-instruction functions with custom calling conventions that make remarkably efficient use of the few registers you get in x86. Also, negative offsets? They must have written this entire x86-on-x86 core in ASM.)
While this is great news for players, the whole situation remains very unsatisfying at a technical level. Even if you don't care about the remaining BGM dropouts, running these games at the highest possible emulated clock speed means that you constantly spend 100% of all CPU cores assigned to your emulator just to avoid slowdown and lag in a few particularly CPU-intensive sections. Power saving might be the single best practical argument in favor of a port.
Also, all this complexity involved in dynamic cycle adjustment raises one question you might have had all along. Why don't we just leave our emulated CPUs at 66 MHz? After all, ZUN said that 66 MHz is enough to eliminate all slowdown in at least TH02 and TH03, so how about just living with whatever slowdown we'd still experience in TH04 and TH05? This is certainly a healthier approach, much more appropriate for these silly little indie games that were never meant to be obsessed about at this level, and we get rid of those last few BGM dropouts in DOSBox-X!
Well, if that statement was ever correct to begin with, it would have only applied to real hardware and not to emulators. mu021 reported that the final phase of TH02's Mima fight slowed down even at 78 MHz in Neko Project, and part 4 will contain 📝 even more examples of how 66 MHz slows down several effects in menus and cutscenes, and thus paints a wrong picture of them. Hence, choosing 66 MHz for a preconfigured emulator package might have a particularly annoying side effect: If people get used to how slow these effects run on emulators, they might be rather irritated once the modern ports inevitably run them at their intended speed denoted in the code. I can already imagine them yelling "too fast!", "inaccurate!", and "literally unplayable!", oblivious to the fact that they had the wrong idea about these effects all along.
Or maybe it'll all be fine once part 4 has documented these issues in depth. I certainly wouldn't criticize a package for choosing 66 MHz. All choices are unsatisfying at some level…
If only we could optimize the games enough to remove any unwanted slowdown at 66 MHz. Then, people could freely choose one emulator over another for reasons unrelated to performance, because even cycle-limited emulators could then actually deliver on ZUN's statements in the README files…
And since we've defined debloating as an integral part of port development earlier, that's exactly what we're going to do.
But can we even do that within our high standards? Obviously, our ports should remain…
Frame-perfect
Since all five games are explicitly timed around VSync, it's immediately clear what we mean by this term:
Everything rendered to a single page of VRAM between two VSync wait loops defines one single logical frame.
If we are double-buffering correctly and the PC-98 system running the game is fast enough to finish rendering such a logical frame to VRAM between two VSync signals, everything is fine: The sequence of frames you can observe on your screen matches the logical sequence of internal frames, and we can easily record this sequence and compare the port against it.
But what about unintentional slowdown? In these cases, ZUN asks the system to do way more work than it can execute between two VSync signals. Notably, this also includes most loading times: Once we add disk access into the mix, we can't guarantee hitting any VSync deadlines anymore, and decompressing all these 640×400 images is quite expensive as well. Obviously, we don't want to abandon our goal of frame-perfection and the comparability of ports just because of this variability, so let's add another rule:
Individual defined frames may be shown on screen for any integer multiple of the frame time.
The reason for the integer restriction is obvious: If we start drawing to the screen in the middle of a frame, we get screen tearing and thus a non-perfect frame – not just because tearing looks bad, but also because the position of the tearing line always depends on the overall performance of the system you run the game on.
The combination of these two rules leads to an immediate consequence:
The games must only ever display complete logical frames.
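Taken together, these rules boil down to a simple comparison that could be run against two recordings. The following is only a sketch under the assumption that we can capture both the logical frames and what the screen showed during each refresh; the function names are hypothetical, and the strings stand in for whatever frame representation (such as a hash of VRAM) an actual validation tool would use:

```python
from itertools import groupby


def run_lengths(frames):
    """Collapses consecutive identical frames into (content, count) pairs."""
    return [(content, sum(1 for _ in group)) for content, group in groupby(frames)]


def is_frame_perfect(logical_frames, displayed_refreshes):
    """Checks a recording of the screen against the logical frame sequence.

    logical_frames: frame contents in the order the game produced them
    displayed_refreshes: what the screen showed during each refresh interval

    Every logical frame must show up in order and in its entirety, but may
    stay on screen for additional whole refresh intervals.
    """
    logical_runs = run_lengths(logical_frames)
    displayed_runs = run_lengths(displayed_refreshes)
    if len(logical_runs) != len(displayed_runs):
        return False
    return all(
        shown == expected and shown_count >= expected_count
        for (expected, expected_count), (shown, shown_count)
        in zip(logical_runs, displayed_runs)
    )


slowdown = is_frame_perfect(["A", "B", "C"], ["A", "B", "B", "B", "C"])
tearing = is_frame_perfect(["A", "B", "C"], ["A", "A/B", "B", "C"])
```

Slowdown merely repeats a frame and passes the check, while a torn frame shows up as content that never existed in the logical sequence and fails it.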
And now we have a problem. Our rules have just outlawed screen tearing, but nearly every menu and cutscene screen in ZUN's original code has some kind of screen tearing issue. 📝 The Music Room of TH02-TH04 represents probably the worst example as it suffers from screen tearing on every single frame:
Also, how would you possibly preserve these tearing lines once you've ported the game? After all, modern platforms not only imply much faster CPUs, but also completely different rendering methods, especially once we add scaling into the mix.
This can only mean one thing:
It is fundamentally impossible to port the unmodified codebase of PC-98 Touhou and remain frame-perfect to the original release.
You could maybe get there by throwing out the integer multiple rule and accepting torn frames as legitimate. But then you'd have to decide on a particular model whose slowdown behavior you'd want to replicate and lock down exactly – and as I've stated in the section above, that's quite a silly and impractical proposition.
Resolving screen tearing
So, how do we get back to a comparable sequence of well-defined frames? This can only work if we leave the confines of real hardware and instead reach for the infinitely fast PC-98 that ZUN wanted to have anyway. Such a system would never exhibit screen tearing because it would naturally complete all rendering within the vertical blanking interval preceding each displayed frame. Once our code ends a frame by entering a busy-waiting loop for the next VSync signal, the screen would then get to draw static and well-defined VRAM contents. This behavior is the whole reason why I get to classify screen tearing issues as landmines that must always be fixed, as opposed to bugs that a port could potentially retain.
If we actually had such an infinitely fast PC-98, we could just run ZUN's unmodified code on that system and be done now. But as we've seen above, not even DOSBox-X's dynamic core manages to run PC-98 Touhou at the infinitely fast level we'd need. Also, we wanted to get rid of relying on specific emulators and have already planned to optimize all this code anyway…
So let's defuse each screen tearing landmine one by one by rewriting its code to match the output of an infinitely fast PC-98. This is a lot more feasible than it sounds because these landmines aren't actually caused by a lack of CPU power. Every screen tearing issue comes down to ZUN misplacing certain screen-affecting operations within the hellscape of imperative hardware state mutations that is his menu and cutscene code. You can either hide the issue by throwing an infinite amount of processing power at the problem so that the order of mutations no longer observably matters, or you can just write good code.
In theory, we only have to follow a few rules:
All VRAM page flips and hardware palette changes must be moved to the vertical blanking interval.
Since TRAM is always single-buffered and ZUN rarely writes to the topmost rows, we can get by with merely moving TRAM writes close to the vertical blanking interval if we don't manage to hit the interval exactly.
On single-buffered screens, the same is true for VRAM. This category mainly includes menu screens whose upper VRAM rows thankfully remain static, so we also get some leeway here. Rewriting these screens to be double-buffered might sound better, but doing so at the high level where these landmines have to be fixed would only create more of a mess, 📝 for reasons I'll explain below.
In rare cases, ZUN placed expensive file load calls and draw calls on the same logical frame within a single-buffered screen. For an infinitely fast PC-98, this is no problem. But since all bets are off once disk access is involved, there is no way we can hide the draw calls and avoid the resulting screen tearing on real hardware and emulators while still sticking to ZUN's defined sequence of logical frames. Thus, we have to make an exception and insert an additional VSync delay loop after the load calls to separate loading and rendering, creating a new logical frame that did not exist in ZUN's original code.
This might sound very controversial. We've just come up with this mental model of an infinitely fast PC-98 to solve frame-perfection, only to now deviate from it again and snap back to reality? However:
As I'm going to describe in 📝 part 2, we're about to speed up loading and blitting by much more than this one added frame.
If we run this logical frame on the actual fastest real-hardware PC-98 system the community has to offer and even that system takes longer than 17.7 ms to render it, it's hard to argue against formalizing a delay you'd be getting on real hardware anyway.
The difficulty of actually pulling this off, however, can range anywhere from Easy to Lunatic, depending on the screen, because of course every one of them is different. Even after these 11 pushes, I'll be far from done. But in the end, we'll have perfect and easily verifiable frame parity between the PC-98 versions and the future ports, even though we had to bend the code a little. Or a lot. Oh well.
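To picture the first rule, here's a minimal sketch of a small buffer that sits between game code and the display hardware. Everything in it is hypothetical: `DisplayState`, `DeferredDisplay`, and `commit_at_vblank()` are illustrative names that exist in neither ReC98 nor master.lib, and the real hardware writes are replaced with plain struct fields. Game code only records which page and palette it wants, and the single function that is supposed to run inside the vertical blanking interval applies both at once.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the screen-affecting hardware state.
struct Rgb { uint8_t r, g, b; };

struct DisplayState {
    uint8_t shown_page = 0;           // VRAM page that is currently displayed
    std::array<Rgb, 16> palette = {}; // the single global 16-color palette
};

// Collects page flip and palette requests and applies them in one place.
class DeferredDisplay {
public:
    // Game code calls these at any point during a logical frame.
    void request_page(uint8_t page) {
        pending_page_ = page;
        has_page_ = true;
    }
    void request_palette(const std::array<Rgb, 16>& palette) {
        pending_palette_ = palette;
        has_palette_ = true;
    }

    // Assumed to be called once per frame, right after the VSync wait loop
    // has returned and before any new rendering starts.
    void commit_at_vblank(DisplayState& hw) {
        if (has_page_) {
            hw.shown_page = pending_page_;
            has_page_ = false;
        }
        if (has_palette_) {
            hw.palette = pending_palette_;
            has_palette_ = false;
        }
    }

private:
    uint8_t pending_page_ = 0;
    std::array<Rgb, 16> pending_palette_ = {};
    bool has_page_ = false;
    bool has_palette_ = false;
};
```

With this shape, the order in which a menu requests its flips and palette changes within a frame no longer matters, because nothing reaches the screen before the commit. Whether the actual fixes end up looking anything like this is a separate question.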
If you only opened this post for the required reading part, you can stop reading now. I've got a few more technical thoughts about a few implementation details of the future ports that tend to come up in discussions, but these aren't as essential as the high-level issues above.
So we've now decided on what to do in order to make the ports good, but what are the basic challenges we have to solve in order to port these games to modern systems in the first place? Let's start with a perhaps surprising list of non-issues that some people might perceive as challenges:
Sound. As people of culture, we can all agree that PCM recordings of sequenced sound are sacrilegious, so the ports will always use some kind of emulation here. Therefore, I'll simply ask sound people for the best YM2608 and PMD cores that won't get me canceled. If I still get canceled, we'll just resolve the disagreement with a violent flamew- I mean, a constructive discussion, or just offer multiple options if there are valid arguments for either choice – similar to how you can 📝 choose between real SC-88Pro or virtual Sound Canvas VA recordings for my Shuusou Gyoku build.
TH03's SPRITE16-powered in-game renderer. For a port, it does not matter at all how a sprite driver was originally implemented. ZUN already streamlined regular sprite blitting down to three common functions, which a port would simply need to implement differently. The game code still contains 21 additional calls to SPRITE16 functions for certain special effects, but none of these additional monochrome, masked, or overlapped blitting modes are unique to TH03.
In short: If the feature in question is consistently used through an API, it's not a challenge in itself. The hard parts are all the opposite cases – when ZUN suddenly starts writing to VRAM segments or I/O ports in the middle of gameplay code, like he does all over TH01. All of these instances need to be manually cleaned up and abstracted away. Conversely, this is also why 📝 TH02 remains by far the easiest individual game to port – it has the least amount of hand-written blitting code and mostly sticks to master.lib functions.
Instead, the biggest immediate challenge is something far more basic:
🎨 Palettized and planar graphics 🎨
After all, PC-98 Touhou doesn't just view the PC-98's graphics subsystem as an obstacle to overcome, but occasionally makes creative use of both palettes and individual bitplanes. How would we possibly cover these effects in a modern graphics API that will be far removed from these concepts? Three challenges immediately come to mind in that regard:
The whole concept of enforcing a single 16-color palette across the entire screen in a world where 32-bit RGBA is the only reliably available texture format. Shaders offer a simple solution: We simply wouldn't use traditional textures, and just write our own sampler that takes both the original palettized 16-color+alpha image and the global palette as input, and performs a lookup for each texel. But what are we supposed to do in SDL_Renderer's fixed-function pipeline? Use the CPU to update all loaded textures on every palette color change? Split each sprite into a separate texture for each color and consume 16× the amount of VRAM just so that we can use vertex colors for each individual color layer? Or break down every sprite into a point list to save the VRAM?
Any kind of sprite-shaped palette color bit flipping effect, such as 📝 the falling polygons in the Music Room. Effects like these could potentially be hardware-rendered even in a fixed-function pipeline if we split the background image into two and render the polygons using regular triangles with their UV coordinates matched to the pixel coordinates on every frame. But would all the involved interpolation reliably give us the original sharp edges without reaching for a shader to ensure that it does? In any case, this solution would need a completely different implementation for a modern port than it currently uses in ZUN's PC-98-native code, which gets by with less per-frame redraw than you'd think that this effect would need.
uth05win didn't even get to port the Music Room, which is probably not without reason.
TH01's square-shaped inverting effects used during bomb and entrance animations. Flipping a given bit of a pixel's palette index? Based on what's there before? No way around a shader for this one…
Note how the flipped cards rip holes into the square trails. I'm not even sure what the TH01 Anniversary Edition would change about the effect, or whether it even should change anything about it. Good luck porting this effect pixel-perfectly without pixel-level access.
However, writing all this custom graphics code for the modern port would run against my previously stated goal of sharing as much code as possible between PC-98 and modern platforms. While shaders are the conceptually simpler solution for all of these challenges, they aren't easy in practical terms, and I already 📝 decided against using them for Shuusou Gyoku for good reasons. Also, is all of this really worth the effort if these games demonstrably don't even need the performance of GPU rendering?
But that only leaves one conclusion:
The future ports of PC-98 Touhou to modern systems will software-render the graphics layer on the CPU.
I know, that sounds very shocking and probably disappointing at first. But at a closer look, it's really not all that bad. These games have been software-rendered all along by not only PC-98 emulators, but by real hardware at mid-90's CPU speeds. You might point to the GRCG and EGC chips as evidence for at least some capacity of hardware acceleration, but I see them more as workarounds for the unfortunate planar nature of VRAM on this Japanese business computer architecture. In the end, "software rendering" only means that the CPU receives access to every pixel in the framebuffer. Once all graphical functionality is neatly abstracted away and the game no longer directly accesses the four physical bitplanes, the ports can store sprites and the rendered graphics layer in the most performant way.
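To illustrate how little code an inverting effect needs once the CPU can touch every pixel, here's a minimal sketch. It assumes a hypothetical chunky framebuffer with one palette index per byte (`IndexedFrame`), which is not how the PC-98 or the current ReC98 code stores pixels, and it leaves the flipped bits as a `mask` parameter because the exact bit used by TH01's effect isn't covered in this post.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical chunky framebuffer: one byte per pixel, holding a 4-bit
// palette index in its lower nibble.
struct IndexedFrame {
    int w;
    int h;
    std::vector<uint8_t> px;

    IndexedFrame(int w_, int h_) :
        w(w_), h(h_), px(static_cast<std::size_t>(w_) * h_, 0) {
    }

    uint8_t& at(int x, int y) {
        return px[static_cast<std::size_t>(y) * w + x];
    }
};

// Flips the palette index bits in `mask` for every pixel inside the given
// rectangle, based on whatever index is already there. The rectangle is
// clipped to the frame.
void flip_index_bits(
    IndexedFrame& f, int left, int top, int rw, int rh, uint8_t mask
) {
    const int x0 = std::max(left, 0);
    const int y0 = std::max(top, 0);
    const int x1 = std::min(left + rw, f.w);
    const int y1 = std::min(top + rh, f.h);
    for (int y = y0; y < y1; y++) {
        for (int x = x0; x < x1; x++) {
            f.at(x, y) ^= (mask & 0x0F);
        }
    }
}
```

Applying the same rectangle twice restores the original pixels, which is the property that makes XOR-style effects like this one cheap to undo.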
Also, note how I only said "graphics layer". Besides 📝 the obvious candidate of framebuffer scaling, the ports will use the GPU for two more important aspects:
The PC-98's text layer. With 8 fixed colors and glyphs drawn from a more or less static font ROM/gaiji texture, there is no reason not to render this layer entirely on the GPU. Even color reversing is as simple as defining a custom blend mode that inverts the alpha channel, which SDL supports for all of its renderer backends.
Vertical scrolling. 📝 The original games also reach for a PC-98 hardware feature here, and this feature can be replicated within 3D APIs in exactly the same way by adjusting the UV coordinates of the VRAM texture. This insight reduces the software renderer's required per-frame redraw to exactly the same amount as the PC-98 version, and should defeat any remaining concerns you might have about software rendering.
The still image in that post from two years ago doesn't demonstrate the PC-98 way of VRAM scrolling all too well, so here's a longer video that scrolls an entire screen's worth of tiles:
In the game logic, all entity positions represent the scrolled on-screen view, while the sprites are offset by the Y coordinate of the green line (representing the top of the scrolled screen) before they are blitted. Also note how ZUN never redraws the area between the yellow line (representing the bottom of the playfield) and the green line as part of the scrolling process, since it's always covered by a 16-pixel row of black TRAM cells. Any redraws there are a result of regular tile invalidation caused by overlapping sprites, and remain isolated to the VRAM page that the game rendered to when the overlap happened.
The gameplay is taken from 📝 ZUN's hidden TH05 Extra Stage Clear replay.
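The coordinate math behind this kind of scrolling can be sketched in a few lines. This assumes that the scrolled area wraps around after 400 lines, matching the height of a PC-98 VRAM page. The function names are hypothetical, and the UV variant assumes a texture sampler with repeat wrapping on the V axis.

```cpp
// Assumption: the scrolled VRAM area wraps around after this many lines.
constexpr int VRAM_H = 400;

// Converts an on-screen Y coordinate, as used by the game logic, into the
// VRAM line a sprite has to be blitted to. `scroll_line` is the VRAM line
// that is currently shown at the top of the screen.
constexpr int screen_to_vram_y(int screen_y, int scroll_line) {
    return ((((screen_y + scroll_line) % VRAM_H) + VRAM_H) % VRAM_H);
}

// V coordinate at which the top of the visible screen starts when the whole
// VRAM page is uploaded as a single texture with repeat wrapping.
constexpr float scroll_v_offset(int scroll_line) {
    return (static_cast<float>(scroll_line) / VRAM_H);
}
```

The software renderer would only ever need the first function. The second one is all the GPU needs to know about scrolling, since the texture contents themselves stay where they are.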
As a result, the software renderer of our hand-crafted ports would still internally produce a graphics and text layer that persists across frames and receives minimal redraws, just like the PC-98 originals did. In fact, it would have to produce the exact same graphics layer if we wanted to port the non-Anniversary Edition, including the tile source area. There's no technical need to keep tiles on the graphics layer in a port, but certain intense shake effects temporarily reveal individual tiles below the HUD:
This definitely counts as a bug to be fixed in this game's Anniversary Edition, but how would we fix this one on PC-98 where we do need the tile area in VRAM? Moving the tiles to another place and patching the 📝 .MAP at runtime?
Applying the palette to produce the final rendered image then raises another set of exciting engineering questions. Would we actually use a palettized 4bpp buffer in memory, storing two pixels in a byte? Perhaps with an 8-bit palette that maps each possible pair of pixels to a pair of 32-bit RGBA values, halving the amount of per-frame palette lookups? Or would we always store an RGBA image and merely offer a palettized API around it? As far as I'm concerned, these challenges are way more exciting than the prospect of locking ourselves into some shader language.
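The pair lookup idea could look roughly like this. It's a minimal sketch under a few assumptions: the buffer is chunky rather than planar, the high nibble of each byte is the left pixel, and `RgbaPair`, `build_pair_lut()`, and `expand_4bpp()` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Rgba = uint32_t;

// Two output pixels for one packed input byte.
struct RgbaPair {
    Rgba left;
    Rgba right;
};

// Builds a 256-entry table that maps every possible byte of two packed 4-bit
// palette indices to the two matching RGBA values.
// Assumption: the high nibble holds the left pixel.
std::array<RgbaPair, 256> build_pair_lut(const std::array<Rgba, 16>& palette) {
    std::array<RgbaPair, 256> lut{};
    for (std::size_t b = 0; b < lut.size(); b++) {
        lut[b] = RgbaPair{ palette[b >> 4], palette[b & 0x0F] };
    }
    return lut;
}

// Expands a packed 4bpp buffer to RGBA with one table lookup per byte
// instead of one palette lookup per pixel.
std::vector<Rgba> expand_4bpp(
    const std::vector<uint8_t>& packed, const std::array<RgbaPair, 256>& lut
) {
    std::vector<Rgba> out;
    out.reserve(packed.size() * 2);
    for (const uint8_t b : packed) {
        out.push_back(lut[b].left);
        out.push_back(lut[b].right);
    }
    return out;
}
```

The table would have to be rebuilt on every palette change, but that's only 256 entries, compared to the 256,000 pixels of a 640×400 screen.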
📃 Page flipping 📃
But wait. If the port produces a persistent graphics layer, shouldn't it produce two, one for each VRAM page on the PC-98? From the point of view of a modern port, we really don't need to. We only ever upload one "VRAM page" to the GPU anyway, which is then scrolled and scaled onto one of the GPU's backbuffers inside the swapchain. Then, the game can immediately continue drawing onto the same software-rendered VRAM buffer in the next frame without affecting the GPU output.
Obviously, this rendering paradigm doesn't translate back to the PC-98. There, we must render each frame to either the invisible or the visible page. Also, minimal redraw is crucial because we can neither afford the memory nor the performance to regularly copy an entire 128 KB of pixel data from whatever place to VRAM. As a result, page flips are a common sight in even the highest levels of menu and cutscene code, adding yet another unsightly piece of state you have to keep track of while reviewing and modding the code. I've grown to hate them quite a lot over the past four months because of just how often they are associated with bad code: In most menu and cutscene screens, ZUN just uses the second VRAM page as pixel storage for inter-page copies using the EGC, 📝 whose slowness is a regular topic on this blog. Once you've replaced these copies with optimized blits from conventional RAM, you've not only removed all these page flips and clearly revealed these screens as the single-buffered affairs they've always been, but you've also accelerated them enough to remove any screen tearing issues they might have had at 66 MHz.
Unfortunately, things are not that easy everywhere:
Sometimes, menus and cutscenes do require involved page flipping tricks to cleanly switch between two screens without tearing.
But a few of them are genuinely double-buffered. Their minimal redraw code must indeed always keep two alternating states of VRAM in mind, which effectively leaks a hardware detail – the length of the PC-98's "swapchain" – into the highest levels of game code.
Can we rewrite all of these cases in a way that high-level game code no longer has to care about pages? Can we perhaps even banish page flipping to a new lower level of the architecture that all menus and cutscenes are built on top of, and thus unconditionally double-buffer every screen while still maintaining minimal redraw? Or is none of this worth it and we'll just live with two VRAM pages on all platforms? I'm honestly not sure. And that's just a small preview of the porting challenges that still await us and were far beyond the scope of even these 11 pushes…
As for the commits that are formally assigned to this blog post: It was all maintenance, build system setup, and some debloating work on TH01 around its packfile support that I thought would be necessary but thankfully didn't yet need after all. More about that in, you guessed it, 📝 part 4.
Alright! Improving performance, fixing screen tearing issues, establishing better cross-platform interfaces, and cleaning up ZUN's code to facilitate all of that… I've got a lot to do now. Next up: Getting closer to our performance goals by optimizing all PC-98-native code surrounding the .PI files used for backgrounds and cutscene pictures, since we later want to draw our TH03 netplay menus on top.
TH05's OP.EXE? It's not one of the 📝 main blockers for multilingual translation support, but fine, let's push it to 100% RE. This didn't go all too quickly after all, though – sure, we were only missing the High Score viewer, but that's technically a menu. By now, we all know the level of code quality we can reasonably expect from ZUN's menu code, especially if we simultaneously look at how it's implemented in TH04 as well. But how much could I possibly say about even a static screen?
Then again, with half of the funding for this push not being constrained to RE, OP.EXE wasn't the worst choice. In both TH04 and TH05, the High Score viewer's code is preceded by all the functions needed to handle the GENSOU.SCR scorefile format, which I already RE'd 📝 in late 2019. Back then, it turned out to be one of the most needlessly inconsistent pieces of code in all of PC-98 Touhou, with a slightly different implementation in each of the 6 binaries that was waiting for its equally messy decompilation ever since.
Most of these inconsistencies just add bloat, but TH05's different stage number defaults for the Extra Stage do have the tiniest visible impact on the game. Since 2019 was before we had our current system of classifying weird code, let's take a quick look at this again:
In the end, this is a landmine, albeit a slightly unusual one. OP.EXE always needs to load GENSOU.SCR to determine whether the Extra Stage is unlocked and can be selected in the main menu. If that file is corrupted or doesn't exist yet, OP.EXE will always recreate it. Therefore, MAINE.EXE's recreation code would only ever run if GENSOU.SCR got deleted or corrupted while playing the game. This can only happen through code that runs outside the game or as the result of failing hardware, and thus goes beyond our criteria for observability.
On to the actual High Score screen then! The OP.EXE code I decompiled here only covers the viewer; the actual score registration is part of MAINE.EXE and is a completely different beast that only shares a few code snippets at best. This means that I'll have to do this all over again at some point down the line, which will result in another few pushes that look very similar to this one. 🥲
By now, it's no surprise that even this static screen has more or less the same density of bugs, landmines, and bloat as ZUN's more dynamic and animated menus. This time however, the worst source of bloat lies more on the meta level: TH04's version explicitly spells out every single loading and rendering call for both of that game's playable characters, rather than covering them with loops like TH05 does for its four characters. As a result, the two games only share 3¼ out of the 7 functions in even this simple viewer screen. It definitely didn't have to be this way.
On the bright side, the code starts off with a feature that probably only scoreplayers and their followers have been consciously aware of: The High Score screens can display 9-digit scores without glitches, unlike the in-game HUD's infamous overflow that turns the 8th digit into a letter once the score exceeds 100 million points.
To understand why this is such a surprise, we have to look at how scores are tracked in-game where the glitch does happen. This brings us back to the binary-coded decimal format that the final three PC-98 Touhou games use for their scores, which we didn't have to deal with 📝 for almost three years. On paper, the fixed-size array of 8 digits used by the three games would leave no room for a 9th one, so why don't we get a counterstop at 99,999,999 points, similar to what happens in modern Touhou? Let's look at the concrete example of adding, say, 200,000 points to a score of 99,899,990 points, and step through the algorithm for the most significant four digits:
score: 09 09 08 09 09 09 09 00
BCD delta: + 00 00 02 00 00 00 00 00
= 09 09 08 09 09 09 09 00
+ 00 00 02 00 00 00 00 00
= 09 0A 00 09 09 09 09 00
+ 00 00 02 00 00 00 00 00
= 0A 00 00 09 09 09 09 00
+ 00 00 02 00 00 00 00 00
= 0A 00 00 09 09 09 09 00
It sure is neat how ZUN arranged the gaiji font in such a way that the HUD's rendering is an exact visual representation of the bytes in memory… at least for scores between 100,000,000 (A0000000) and 159,999,999 (F9999999) inclusive.
Formatted as big-endian for easier reading. Here's the relevant undecompilable ASM code, featuring the venerable AAA instruction.
In other words: The carry of each addition is regularly added to the next digit as if it were binary, and then the next iteration has to adjust that value as necessary and pass along any carry to the digit after that. But once we've reached the most significant digit, there is no way for its carry to go. So it just stays there, leaving the last digit with a value greater than 9 and effectively turning it from a BCD digit into a regular old 8-bit binary value. This leaves us with a maximum representable score of 2,559,999,999 points (FF 09 09 09 09 09 09 09) – and with the scores achieved by current TAS runs being far below that limit in both games, it's definitely not worth it to bother about rendering that 10th score digit anywhere.
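The observable behavior of this addition can be modeled in portable C++. This is a sketch and not a decompilation: ZUN's ASM adds the carry to the next digit in memory and adjusts it with AAA on the following iteration, while this version keeps the carry in a variable, which leads to the same digits for the example above. `BcdScore`, `bcd_add()`, and `bcd_to_value()` are hypothetical names, and the digits are assumed to be stored with the least significant one first.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// 8 digits, one per byte. Assumption: index 0 is the least significant digit.
using BcdScore = std::array<uint8_t, 8>;

// Adds `delta` to `score`. Every digit except the most significant one is
// decimal-adjusted. The most significant one has nowhere to pass its carry
// to and simply keeps counting upward as a binary value.
void bcd_add(BcdScore& score, const BcdScore& delta) {
    const std::size_t top = (score.size() - 1);
    uint8_t carry = 0;
    for (std::size_t i = 0; i < top; i++) {
        uint8_t sum = static_cast<uint8_t>(score[i] + delta[i] + carry);
        carry = ((sum >= 10) ? 1 : 0);
        if (carry) {
            sum = static_cast<uint8_t>(sum - 10);
        }
        score[i] = sum;
    }
    score[top] = static_cast<uint8_t>(score[top] + delta[top] + carry);
}

// Interprets the result as a number, treating the most significant digit as
// a binary value between 0 and 255.
uint32_t bcd_to_value(const BcdScore& score) {
    uint32_t value = 0;
    for (std::size_t i = score.size(); i-- > 0;) {
        value = ((value * 10) + score[i]);
    }
    return value;
}
```

Feeding the example from above into `bcd_add()` leaves 0Ah in the most significant digit, and `bcd_to_value()` turns that result into 100,099,990 points.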
In the High Score screens, ZUN also zero-padded each score to 8 digits, but only blitted the 9th digit into the padding between name and score if it's nonzero. From this code detail alone, we can tell that ZUN was fully aware of ≥100 million points being possible, but probably considered such high scores unlikely enough to not bother rearranging the in-game HUD to support 9 digits. After all, it only looks like there's plenty of unused space next to the HUD, but in reality, it's tightly surrounded by important VRAM regions on both sides: The 32 pixels to the left provide the much-needed sprite garbage area to support 📝 visually clipped sprites despite master.lib's lack of sprite clipping, and the 64 pixels to the right are home to the 📝 tile source area:
It sure wouldn't have been impossible. You could either sacrifice the two tiles that would cover the 9th digit in both the HiScore and Score row, or – even better – move these tiles under the existing padding space within the HUD. 📝 The tile sections of TH04 and TH05 already address their images using raw VRAM addresses, so this wouldn't have even required an additional tile index→VRAM address lookup table.
And sure enough, ZUN confirms this awareness in TH04's OMAKE.TXT:
However, the highest score that the High Score screens of both games can display without visual glitches is not 999,999,999, as you would expect from 9 digits, but rather…
959 million?
(Also, this 9th digit nicely highlights a slight asymmetry in TH04's screen, where Marisa gets 4 fewer pixels of padding between names and scores.)
What a weird limit. Regardless of whether GENSOU.SCR saves its scores in a sane unsigned 32-bit format or a silly 8-digit BCD one, this limit makes no sense in either representation. In fact, GENSOU.SCR goes even further than BCD values, and instead uses… the ID of the corresponding gaiji in the 📝 bold font?
How cute. No matter how you look at it, storing digits with an added offset of 160 makes no sense:
It's suboptimal for the High Score screens (which want to display scores with the digit sprites from SCNUM.BFT and thus have to subtract 160 from every digit),
it's suboptimal for the HiScore row in the in-game HUD (which also needs actual digits under the hood for easier comparison and replacement with the current Score, and rendering just adds 160 again), and
it doesn't even work as obfuscation (with an offset of 160 / 0xA0, you can always read the number by just looking at the lower 4 bits, and each character/rank section in GENSOU.SCR is encrypted with its own key anyway).
It does start to explain the 959 million limit, though. Since each digit in GENSOU.SCR takes up 1 byte as well, they are indeed limited to a maximum value of (255 - 160) = 95 before they wrap back to 0.
But wait. If the game simply subtracts 160 from the gaiji index to get the digit value, shouldn't this subtraction also wrap back around from 0 to 255 and recover higher values without issue? The answer is, 📝 again, C's integer promotion: Splitting the binary value into two digits involves a division by 10, the C standard mandates that a regular untyped 10 is always of type int, the uint8_t digit operand gets promoted to match, and the result is actually negative and thus doesn't even get recognized as a 9th digit because no negative value is ≥10.
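The difference between the two outcomes fits into a few lines. Note that this is a minimal sketch with hypothetical function names and not ZUN's actual expression. It only demonstrates how the same subtraction behaves with and without a cast back to 8 bits.

```cpp
#include <cstdint>

// Gaiji ID of the digit 0, as described above.
constexpr uint8_t GAIJI_DIGIT_0 = 160;

// Stores the combined value of the top two digits the way the scorefile
// does: offset by 160 and truncated to a single byte.
constexpr uint8_t to_stored(int top_value) {
    return static_cast<uint8_t>(top_value + GAIJI_DIGIT_0);
}

// Both operands are promoted to int before the subtraction, so a wrapped
// byte turns into a negative number that stays negative after the division.
constexpr int ninth_digit_promoted(uint8_t stored) {
    return ((stored - GAIJI_DIGIT_0) / 10);
}

// Casting the difference back to 8 bits would recover the original value.
constexpr int ninth_digit_wrapped(uint8_t stored) {
    return (static_cast<uint8_t>(stored - GAIJI_DIGIT_0) / 10);
}
```

For a stored byte of 3, which is what (160 + 99) wraps around to, the promoted version yields -15 while the wrapped version yields the expected 9.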
So what would happen if we were to enter a score that exceeds this limit? The registration screen in MAINE.EXE doesn't display the 9th digit and the 8th one wraps around. But it still sorts the score correctly, so at least the internal processing seems to work without any problem…
(160 + 99) = 259, which wraps around to 3, so this makes perfect sense. We'll figure out the exact logic behind the differently colored sprite once RE progress reaches this screen.
But once you try viewing this score, you're instead greeted with VRAM corruption resulting from master.lib's super_put() function not bounds-checking the negative sprite IDs passed by the viewer:
In a rare case for PC-98 Touhou, the High Score viewer also hides two interesting details regarding its BGM. Just like for the graphics, ZUN also coded a fade-in call for the music. In abbreviated ASM code:
mov ax, 0000h ; PMD AH=00H (start music playback)
int 60h
mov ax, 0280h ; PMD AH=02H (fade in/out)
int 60h
However, the AH=02H fade-in call has no effect because AH=00h resets the music volume and would need to be followed by a volume-lowering AH=19h call. But even if there was such a call, the fade-in would sound terrible. 80h corresponds to the fastest possible fade-in speed of -128, which is almost but not quite instant. As such, the fade-in would leave the initial note on each channel muted while the rest of the track fades in very abruptly, which clashes badly with the bass and chord notes you'd expect to hear in the name registration themes of the two games:
At least the first issue could have been avoided if PMD's AH=00h call took optional parameters that describe the initial playback state instead of relying on these mutating calls later on. After all, it might be entirely possible for a bunch of interrupts to fire between AH=00h and these further calls, and if those interrupts take a while, the FM chip might have already played a few samples at PMD's default volume. Sure, Real Mode doesn't stop you from wrapping this sequence in CLI and STI instructions to work around this issue, but why rely on even more CPU state mutation when there would have been plenty of free x86 registers for passing more initial state to AH=00h?
The second detail is the complete opposite: It's a fade-out when leaving the menu, it uses PMD's slowest fade speed, and it does work and sound good. However, the speed is so slow that you typically barely notice the feature before the main menu theme starts playing again. But ZUN hid a small easter egg in the code: After the title screen background faded back in, the games wait for all inputs to be released before moving back into the main menu and playing the title screen theme. By holding any key when leaving the High Score viewer, you can therefore listen to the fade-out for as long as you want.
Although when I said that it works, this does not include TH04. 📝 As 📝 usual, this game's menus do not address the PC-98's keyboard scancode quirk with regard to held keys, causing the loop to break even while the player is still holding a key. There are 21 not yet RE'd input polling calls in TH02 and TH04 that will most certainly reveal similar inconsistencies, are you excited yet?
But in TH05, holding a key indeed reveals the hidden content of a 37-second fade-out:
I'm holding Esc here, but this works with any key, even the ⬅️ left and ➡️ right arrow keys that don't quit out of the menu.
As you can already tell by the markers, the final bugs in TH05's (and only TH05's) OP.EXE are palette-related and revealed by switching between these two screens:
1) Why does the title screen initially use an ever so slightly darker palette than it does when returning from the menu?
2) What's with the sudden palette change between frames 1 and 2? Why are the colors suddenly much brighter?
1) is easily traced and attributed to an off-by-one error in the animation's palette fade code, but 2) is slightly more complex. This palette glitch only happens if the High Score viewer is the first palette-changing submenu you enter after the 📝 title animation. Just like 📝 TH03's character portraits, both TH04 and TH05 load the sprites for the High Score screen's digits (SCNUM.BFT) and rank indicator (HI_M.BFT) as soon as the title animation has finished. Since these are regular BFNT sprite sheets, ZUN loads them using master.lib's super_entry_bfnt(), and that's where the issue hides: master.lib's blocking palette fade functions operate on master.lib's main 8-bit palette, and super_entry_bfnt() overwrites this palette with the one in the BFNT header. Synchronizing the hardware palette with this newly loaded one would have immediately revealed this possibly unintended state mutation, but I get why master.lib might not have wanted to do that – after all, 📝 palette uploads aren't exactly cheap and would be very noticeable when loading multiple sprite sheets in a row.
In any case, this is no problem in TH04 as that game's HI_M.BFT and OP1.PI have identical palettes. But in TH05, HI_M.BFT has a significantly brighter palette:
[Palette comparison: colors 0 to 15 of OP1.PI and of HI01.PI / HI_M.BFT]
And that's 100% RE for TH05's OP.EXE! 🎉 TH04's counterpart is not far behind either now, and only misses its title screen animation to reach the same mark.
As for 100% finalization, there's still the not yet decompiled TH04/TH05 version of the ZUN Soft logo that separates both OP.EXE binaries from this goal. But as I've mentioned 📝 time and time again, the most fitting moment for decompiling that animation would be right before reaching 100% on the entirety of either game. Really – as long as we aren't there, your funding is better invested into literally anything else. The ZUN Soft logo does not interact with or block work on any other part of the game, and any potential modding should be easy enough on the ASM level.
But thankfully, nobody actually scrolls down to the Finalized section. So I can rest assured that no one will take that moment away from me!
Next up: I'd kinda like to stay with PC-98 Touhou for a little longer, but the current backlog is pulling into too many different directions and doesn't convincingly point toward one goal over any other. TH02 is close, but with an active subscription, it makes more sense to accumulate 3 pushes of funding and then go for that game's bullet system in January. This is why I'm OK with subscriptions exceeding the cap every once in a while, because they do allow me to plan ahead in the long term.
So, let's wait a few days for all of you to capture the open towards something more specific. But if the backlog stays as indecisive as it is now, I'll instead go for finishing the Shuusou Gyoku Linux port, hopefully in time for the holiday season.
As for prices, indeed seems to be the point where my supply meets the community's demand for this project and the store no longer sells out immediately. So for the time being, we're going to stay at that push price and I won't increase it any further upon hitting the cap.
Remember when ReC98 was about researching the PC-98 Touhou games? After over half a year, we're finally back with some actual RE and decompilation work. The 📝 build system improvement break was definitely worth it though, the new system is a pure joy to use and injected some newfound excitement into day-to-day development.
And what game would be better suited for this occasion than TH03, which currently has the highest number of individual backers interested in it? Funding the full decompilation of TH03's OP.EXE is the clearest signal you can send me that 📝 you want your future TH03 netplay to be as seamlessly integrated and user-friendly as possible. We're just two menu screens away from reaching that goal anyway, and the character selection screen fits nicely into a single push.
The code of a menu typically starts with loading all its graphics, and TH03's character selection already stands out in that regard due to the sheer amount of image data it involves. Each of the game's 9 selectable characters comes with
a 192×192-pixel portrait (??SL.CD2),
a 32×44-pixel pictogram describing her Extra Attack (in SLEX.CD2), and
a 128×16-pixel image of her name (in CHNAME.BFT). While this image just consists of regular boldfaced versions of font ROM glyphs that the game could just render procedurally, pre-rendering these names and keeping them around in memory does make sense for performance reasons, as we're soon going to see. What doesn't make sense, though, is the fact that this is a 16-color BFNT image instead of a monochrome one, wasting both memory and rendering time.
Luckily, ZUN was sane enough to draw each character's stats programmatically. If you've ever looked through this game's data, you might have wondered where the game stores the sprite for an individual stat star. There's SLWIN.CDG, but that file just contains a full stat window with five stars in all three rows. And sure enough, ZUN renders each character's stats not by blitting sprites, but by painting (5 - value) yellow rectangles over the existing stars in that image.
The only stat-related image you will find as part of the game files. The number of stat stars per character is hardcoded and not based on any other internal constant we know about.
Together with the EXTRA🎔 window and the question mark portrait for Story Mode, all of this sums up to 255,216 bytes of image data across 14 files. You could remove the unnecessary alpha plane from SLEX.CD2 (-1,584 bytes) or store CHNAME.BFT in a 1-bit format (-6,912 bytes), but using 3.3% less memory barely makes a difference in the grand scheme of things.
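The two savings figures can be re-derived from the image dimensions alone. The following is only a sketch: it assumes planar storage with 1 bit per pixel per plane and a 4-bit-per-pixel CHNAME.BFT, and all function names are made up for illustration.

```cpp
#include <cstdint>

// Bytes needed for a single bitplane (1 bit per pixel) of a w×h image.
constexpr std::uint32_t plane_bytes(std::uint32_t w, std::uint32_t h) {
    return (w * h) / 8;
}

constexpr std::uint32_t CHARACTERS = 9;
constexpr std::uint32_t TOTAL_BYTES = 255216; // total quoted in the text

// Dropping the alpha plane from all 9 Extra Attack pictograms (32×44).
constexpr std::uint32_t slex_savings() {
    return CHARACTERS * plane_bytes(32, 44);
}

// Storing all 9 name images (128×16) with 1 instead of 4 bits per pixel.
constexpr std::uint32_t chname_savings() {
    return CHARACTERS * plane_bytes(128, 16) * (4 - 1);
}

// Combined savings as a percentage of the total image data.
constexpr double savings_percent() {
    return (100.0 * (slex_savings() + chname_savings())) / TOTAL_BYTES;
}
```

Under these assumptions, the functions reproduce the 1,584 and 6,912 bytes from above, and the combined percentage lands at roughly 3.3%.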
From the code, we can assume that loading such an amount of data all at once would have led to a noticeable pause on the game's target PC-98 models. The obvious alternative would be to just start out with the initially visible images and lazy-load the data for other characters as the cursors move through the menu, but the resulting mini-latencies would have been bound to cause minor frame drops as well. Instead, ZUN opted for a rather creative solution: By segmenting the loading process into four parts and moving three of these parts ahead into the main menu, we instead get four smaller latencies in places where they don't stick out as much, if at all:
The loading process starts at the logo animation, with Ellen's, Kotohime's, and Kana's portraits getting loaded after the 東方夢時空 letters finished sliding in. Why ZUN chose to start with characters #3, #4, and #5 is anyone's guess.
Reimu's, Mima's, and Marisa's portraits as well as all 9 EXTRA🎔 attack pictograms are loaded at the end of the flash animation once the full title image is shown on screen and before the game is waiting for the player to press a key.
The stat and EXTRA🎔 windows are loaded at the end of the main menu's slide-in animation… together with the question mark portrait for Story Mode, even though the player might not actually want to play Story Mode.
Finally, the game loads Rikako's, Chiyuri's, and Yumemi's portraits after it cleared VRAM upon entering the Select screen, regardless of whether the latter two are even unlocked.
I don't like how ZUN implemented this split by using three separately named standalone functions with their own copy-pasted character loop, and the load calls for specific files could have also been arranged in a more optimal order. But otherwise, this has all the ingredients of good-code. As usual, though, ZUN then definitively ruins it all by counteracting the intended latency hiding with… deliberately added latency frames:
The entire initialization process of the character selection screen, including Step #4 of image loading, is enforced to take at least 30 frames, with the count starting before the switch to the Selection theme. Presumably, this is meant to give the player enough time to release the Z key that entered this menu, because holding it would immediately select Reimu (in Story mode) or the previously selected 1P character (in VS modes) on the very first frame. But this is a workaround at best – and a completely unnecessary one at that, given that regular navigation in this menu already needs to lock keys until they're released. In the end, you can still auto-select the default choice by just not releasing the Z key.
And if that wasn't enough, the 1P vs. 2P variant of the menu adds 16 more frames of startup delay on top.
Sure, maybe loading the fourth part's 69,120 bytes from a highly fragmented hard drive might have even taken longer than 30 frames on a period-correct PC-98, but the point still stands that these delays don't solve the problem they are supposed to solve.
But the unquestionable main attraction of this menu is its fancy background animation. Mathematically, it consists of Lissajous curves with a twist: Instead of calculating each point as
x = sin((fx·t)+ẟx)
y = sin((fy·t)+ẟy),
TH03 effectively calculates its points as
x = cos(fx·((t+ẟx) % 0x100))
y = sin(fy·((t+ẟy) % 0x100)),
due to t and ẟ being 📝 8-bit angles. Since the result of the addition remains 8-bit as well, it can and will regularly overflow before the frequency scaling factors fx and fy are applied, thus leading to sudden jumps between both ends of the 8-bit value range. The combination of this overflow and the gradual changes to fx and fy creates all these interesting splits along the 360° of the curve:
At a high level, there really is just one big curve and one small curve, plus an array of trailing curves that approximate motion blur by subtracting from ẟx and ẟy.
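To make the effect of the 8-bit overflow more tangible, here is a minimal sketch of a single point calculation. It is not the game's code: it uses floating-point frequency factors and the standard library's trigonometry instead of fixed-point factors and lookup tables, works on a unit circle, and every name in it is hypothetical.

```cpp
#include <cmath>
#include <cstdint>

struct Point {
    double x;
    double y;
};

// Adds an offset to an 8-bit angle. The cast truncates the sum back to 8
// bits, so the result wraps from 0xFF back to 0x00 *before* any frequency
// scaling is applied.
inline std::uint8_t angle_add(std::uint8_t t, std::uint8_t delta) {
    return static_cast<std::uint8_t>(t + delta);
}

// Converts an angle given in 1/256ths of a full circle to radians.
inline double to_radians(double angle_256) {
    return angle_256 * ((2.0 * 3.14159265358979323846) / 256.0);
}

// One point of the curve, in the range [-1, 1] on both axes.
inline Point curve_point(
    std::uint8_t t, std::uint8_t dx, std::uint8_t dy, double fx, double fy
) {
    return {
        std::cos(to_radians(fx * angle_add(t, dx))),
        std::sin(to_radians(fy * angle_add(t, dy))),
    };
}
```

With a non-integer factor such as fx = 1.5, the scaled angle jumps from 382.5 back to 0 between t = 0xFF and t = 0x00 instead of continuing smoothly, which is the kind of discontinuity that produces the splits.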
In a rather unusual display of mathematical purity, ZUN fully re-calculates all variables and every point on every frame from just the single byte of state that indicates the current time within the animation's 128-frame cycle. However, that beauty is quickly tarnished by the sheer cost of fully recalculating these curves every frame:
In total, the effect calculates, clips, and plots 16 curves: 2 main ones, with up to 7×2 = 14 darker trailing curves.
Each of these curves is made up of the 256 maximum possible points you can get with 8-bit angles, giving us 4,096 points in total.
Each of these points takes at least 333 cycles on a 486 if it passes all clipping checks, not including VRAM latencies or the performance impact of the 📝 GRCG's RMW mode.
Due to the larger curve's diameter of 440 pixels, a few of the points at its edges are needlessly calculated only to then be discarded by the clipping checks as they don't fit within the 400 VRAM rows. Still, >1.3 million cycles for a single frame remains a reasonable ballpark assumption.
This is decidedly more than the 1.17 million cycles we have between each VSync on the game's target 66 MHz CPUs. So it's not surprising that this effect is not rendered at 56.4 FPS, but instead drops the frame rate of the entire menu by targeting a hardcoded 1 frame per 3 VSync interrupts, or 18.8 FPS. Accordingly, I reduced the frame rate of the video above to represent the actual animation cycle as cleanly as possible.
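The numbers in this list are easy to double-check. The sketch below just restates them as constants taken from the text; the names are mine, and the per-point cycle count is the lower bound quoted above rather than a measurement.

```cpp
#include <cstdint>

constexpr std::uint32_t CURVES = 2 + (7 * 2);       // 2 main + 14 trailing
constexpr std::uint32_t POINTS_PER_CURVE = 256;     // one per 8-bit angle
constexpr std::uint32_t MIN_CYCLES_PER_POINT = 333; // 486 lower bound
constexpr double CPU_HZ = 66e6;
constexpr double VSYNC_HZ = 56.4;

constexpr std::uint32_t points_per_frame() {
    return CURVES * POINTS_PER_CURVE;
}

constexpr std::uint32_t min_cycles_per_frame() {
    return points_per_frame() * MIN_CYCLES_PER_POINT;
}

// CPU cycles available between two VSync interrupts.
constexpr double cycles_per_vsync() {
    return CPU_HZ / VSYNC_HZ;
}

// Frame rate when rendering one frame every `vsyncs_per_frame` interrupts.
constexpr double effective_fps(unsigned vsyncs_per_frame) {
    return VSYNC_HZ / vsyncs_per_frame;
}
```

4,096 points at 333 cycles each come out at 1,363,968 cycles, which exceeds the roughly 1.17 million cycles of a single VSync interval but fits comfortably into three of them.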
Apparently, ZUN also tested the game on the 33 MHz PC-98 model that he targeted with TH01, and realized that 4,096 points were way too much even at 18.8 FPS. So he also added a mechanism that decrements the number of trailing curves if the last frame took ≥5 VSync interrupts, down to a minimum of only a single extra curve. You can see this in action by underclocking the CPU in your Neko Project fork of choice.
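Expressed as code, such a mechanism could look like the following. This is a sketch based purely on the description above; the function and constant names are hypothetical, and the original implementation may be structured differently.

```cpp
constexpr unsigned SLOW_FRAME_VSYNCS = 5; // threshold named in the text
constexpr unsigned MIN_TRAILS = 1;        // "a single extra curve"

// Returns the number of trailing curves (per main curve) for the next
// frame, given how many VSync interrupts the last frame took to render.
constexpr unsigned next_trail_count(
    unsigned trails, unsigned vsyncs_last_frame
) {
    return ((vsyncs_last_frame >= SLOW_FRAME_VSYNCS) && (trails > MIN_TRAILS))
        ? (trails - 1)
        : trails;
}
```

Note that the count only ever goes down in this sketch: a single slow frame permanently costs one trailing curve.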
But were any of these measures really necessary? Couldn't ZUN just have allocated a 12 KiB ring buffer to keep the coordinates of previous curves, thus reducing per-frame calculations to just 512 points? Well, he could have, but we now can't use such a buffer to optimize the original animation. The 8-bit main angle offset/animation cycle variable advances by 0x02 every frame, but some of the trailing curves subtract odd numbers from this variable and thus fall between two frames of the main curves.
So let's shelve the idea of high-level algorithmic optimizations. In this particular case though, even micro-optimizations can have massive benefits. The sheer number of points magnifies the performance impact of every suboptimal code generation decision within the inner point loop:
Frequency scaling works by multiplying the 8-bit angles with a fixed-point Q8.8 factor. The result is then scaled back to regular integers via… two divisions by 256 rather than two bitshifts? That's another ≥46 cycles where ≥10 would have sufficed. Edit (2025-08-29): The initial version of this post miscounted the number of required cycles as ≥4, or 2× the cycle count of a single SAR instruction. That number didn't consider that the frequency scaling multiplication occasionally produces negative numbers, which 📝 must be conditionally rounded up when replacing signed divisions with arithmetic bitshifts to still produce the exact original animation. This conditional rounding adds ≥8 cycles in the more common positive case, and ≥6 in the rarer negative case.
The biggest gains, however, would come from inlining the two far calls to the 5-instruction function that calculates one dimension of a polar coordinate, saving another ≥100 cycles.
Multiplied by the number of points, even these low-hanging fruit already save a whopping ≥729,088 cycles per frame on an i486, without writing a single line of ASM! On Pentium CPUs such as the one in the PC-9821Xa7 that ZUN supposedly developed this game on, the savings are slightly smaller because far calls are much faster, but still come in at a hefty ≥466,944 cycles. Thus, this animation easily beats 📝 TH01's sprite blitting and unblitting code, which just barely hit the 6-digit mark of wasted cycles, and snatches the crown of being the single most unoptimized code in all of PC-98 Touhou.
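The conditional rounding mentioned in the division item is worth spelling out, since it's easy to get wrong. A minimal sketch with hypothetical function names, assuming that >> performs an arithmetic shift on negative values (guaranteed since C++20, and what practically every compiler did before):

```cpp
#include <cstdint>

// Signed division, which truncates toward zero.
constexpr std::int32_t div256(std::int32_t v) {
    return v / 256;
}

// Naive replacement. An arithmetic right shift rounds toward negative
// infinity, so this differs from div256() for most negative inputs.
constexpr std::int32_t sar8_naive(std::int32_t v) {
    return v >> 8;
}

// Exact replacement: bias negative inputs by (256 - 1) before shifting.
constexpr std::int32_t sar8_rounded(std::int32_t v) {
    return (v + ((v < 0) ? 255 : 0)) >> 8;
}
```

For example, div256(-1) is 0, while sar8_naive(-1) is -1. Only sar8_rounded() matches the division for every input and would therefore keep the animation identical.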
The incredible irony here is that TH03 is the point where ZUN 📝 really📝 started📝 going📝 overboard with useless ASM micro-optimizations, yet he didn't even begin to optimize the one thing that would have actually benefitted from it. Maybe he 📝 once again went for the 📽️ cinematic look 📽️ on purpose?
Unlike TH01's sprites though, all this wasted performance doesn't really matter much in the end. Sure, optimizing the animation would give us more trailing curves on slower PC-98 models, but any attempt to increase the frame rate by interpolating angles would send us straight into fanfiction territory. Due to the 0x02/2.8125° increment per cycle, tripling the frame rate of this animation would require a change to a very awkward log₂(384) ≈ 8.58-bit angle format, complete with a new 384-entry sine/cosine lookup table. And honestly, the effect does look quite impressive even at 18.8 FPS.
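In case the 384 looks arbitrary: at three times the frame rate, the per-frame step would shrink from 2 to 2/3 angle units, so the angle format has to become 3/2 times as fine for the step to remain an integer. A quick sketch of that arithmetic, with made-up helper names:

```cpp
#include <cmath>

// Degrees covered by `units` steps of an angle format that divides the
// full circle into `units_per_circle` steps.
constexpr double to_degrees(double units, double units_per_circle) {
    return (units * 360.0) / units_per_circle;
}

// Steps per circle needed so that one step equals a third of the original
// 0x02 increment in the 256-step format.
constexpr unsigned units_for_tripled_rate() {
    return (256 * 3) / 2;
}

// Number of bits that an angle format with `units` steps would need.
inline double bits_for(unsigned units) {
    return std::log2(static_cast<double>(units));
}
```

One step in the 384-unit format covers 0.9375°, exactly a third of the original 2.8125°.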
There are three more bugs and quirks in this animation that are unrelated to performance:
If you've tried counting the number of trailing dots in the video above, you might have noticed that the very first frame actually renders 8×2 trailing curves instead of 7×2, thus rendering an even higher 4,608 points. What's going on there is that ZUN actually requested 8 trailing curves, but then forgot to reset the VSync counter after the initial 30-frame delay. As a result, the game always thinks that the first frame of the menu took ≥30 VSync interrupts to render, thus causing the decrement mechanism to kick in and deterministically reduce the trailing curve count to 7.
This is a textbook example of my definition of a ZUN bug: The code unmistakably says 8, and we only don't get 8 because ZUN forgot to mutate a piece of global state.
The small trailing curves have a noticeable discontinuity where they suddenly get rotated by ±90° between the last and first frame of the animation cycle.
This quirk comes down to the small curve's ẟy angle offset being calculated as ((c/2)-i), with i being the number of the trailing curve. Halving the main cycle variable effectively restricts this smaller curve to only the first half of the sine oscillation, between [0x00, 0x80[. For the main curve, this is fine as i is always zero. But once the trailing curves leave us with a negative value after the subtraction, the resulting angle suddenly flips over into the second half of the sine oscillation that the regular curve never touches. And if you recall how a sine wave looks, the resulting visual rotation immediately makes sense:
Removing the division would be the most obvious fix, but that would double the speed of the sine oscillation and change the shape of the curve way beyond ZUN's intentions. The second-most obvious fix involves matching the trailing curves to the movement of the main one by restricting the subtraction to the first half of the oscillation, i.e., calculating ẟy as (((c/2)-i) % 0x80) instead. With c increasing by 0x02 on each frame of the animation, this fix would only affect the first 8 frames.
ZUN decided to plot the darker trailing curves on top of the lighter main ones. Maybe it should have been the other way round?
Now with the full 18 curves, a direction change of the smaller trailing curves at the end of the loop that only looks slightly odd, and a reversed and more natural plotting order.
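The second quirk and its proposed fix can be sketched in a few lines. The names are hypothetical; c is the main cycle variable and i the number of the trailing curve, as above. Masking the wrapped 8-bit value with 0x7F is equivalent to the % 0x80 from the description.

```cpp
#include <cstdint>

// ẟy of the small curve: ((c / 2) - i), truncated to 8 bits. Whenever i is
// larger than (c / 2), the result wraps around into [0x80, 0xFF].
constexpr std::uint8_t delta_y_original(std::uint8_t c, std::uint8_t i) {
    return static_cast<std::uint8_t>((c / 2) - i);
}

// The suggested fix: keep the angle within the first half of the
// oscillation, i.e., [0x00, 0x80[.
constexpr std::uint8_t delta_y_fixed(std::uint8_t c, std::uint8_t i) {
    return static_cast<std::uint8_t>(delta_y_original(c, i) & 0x7F);
}

// Counts the frames of the 128-frame cycle on which the fix changes the
// angle of at least one of the trailing curves 1..trails.
inline unsigned frames_changed_by_fix(unsigned trails) {
    unsigned changed = 0;
    for(unsigned frame = 0; frame < 128; frame++) {
        const auto c = static_cast<std::uint8_t>(frame * 2);
        bool differs = false;
        for(unsigned i = 1; i <= trails; i++) {
            const auto i8 = static_cast<std::uint8_t>(i);
            if(delta_y_original(c, i8) != delta_y_fixed(c, i8)) {
                differs = true;
            }
        }
        changed += (differs ? 1 : 0);
    }
    return changed;
}
```

With 8 trailing curves, only the first 8 frames of the cycle differ between both versions.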
Now that we fully understand how the curve animation works, there's one more issue left to investigate. Let's actually try holding the Z key to auto-select Reimu on the very first frame of the Story Mode Select screen:
The confirmation flash even happens before the menu's first page flip.
Stepping through the individual frames of the video above reveals quite a bit of tearing, particularly when VRAM is cleared in frame 1 and during the menu's first page flip in frame 49. This might remind you of 📝 the tearing issues in the Music Rooms – and indeed, this tearing is once again the expected result of ZUN landmines in the code, not an emulation bug. In fact, quite the contrary: Scanline-based rendering is a mark of quality in an emulator, as it always requires more coding effort and processing power than not doing it. Everyone's favorite two PC-98 emulators from 20 years ago might look nicer on a per-frame basis, but only because they effectively hide ZUN's frequent confusion around VRAM page flips.
To understand these tearing issues, we need to consider two more code details:
If a frame took longer than 3 VSync interrupts to render, ZUN flips the VRAM pages immediately without waiting for the next VSync interrupt.
The hardware palette fade-out is the last thing done at the end of the per-frame rendering loop, but before busy-waiting for the VSync interrupt.
The combination of 1) and the aforementioned 30-frame delay quirk explains Frame 49. There, the page flip happens within the second frame of the three-frame chunk while the electron beam is drawing row #156. DOSBox-X doesn't try to be cycle-accurate to specific CPUs, but 1 menu frame taking 1.39 real-time frames at 56.4 FPS is roughly in line with the cycle counting we did earlier.
Frame 97 is the much more intriguing one, though. While it's mildly amusing to see the palette actually go brighter for a single frame before it fades out, the interesting aspect here is that 2) practically guarantees its palette changes to happen mid-frame. And since the CRT's electron beam might be anywhere at that point… yup, that's how you'd get more than 16 colors out of the PC-98's 16-color graphics mode. 🎨
Let's exaggerate the brightness difference a bit in case the original difference doesn't come across too clearly on your display:
Probably not too much of a reason for demosceners to get excited; generic PC-98 code that doesn't try to target specific CPUs would still need a way of reliably timing such mid-frame palette changes. Bit 6 (0x40) of I/O port 0xA0 indicates HBlank, and the usual documentation suggests that you could just busy-wait for that bit to flip, but an HBlank interrupt would be much nicer.
This reproduces on both DOSBox-X and Neko Project 21/W, although the latter needs the Screen → Real palettes option enabled to actually emulate a CRT electron beam. Unfortunately, I couldn't confirm it on real hardware because my PC-9821Nw133's screen vinegar'd at the beginning of the year. But just as with the image loading times, TH03's remaining code sort of indicates that mid-frame palette changes were noticeable on real hardware, by means of this little flag I RE'd way back in March 2019. Sure, palette_show() takes >2,850 cycles on a 486 to downconvert master.lib's 8-bit palette to the GDC's 4-bit format and send it over, and that might add up with more than one palette-changing effect per frame. But tearing is a way more likely explanation for deferring all palette updates until after VSync and to the next frame.
And that completes another menu, placing us a very likely 2 pushes away from completing TH03's OP.EXE! Not many of those left now…
To balance out this heavy research into a comparatively small amount of code, I slotted in 2024's Part 2 of my usual bi-annual website improvements. This time, they went toward future-proofing the blog and making it a lot more navigable. You've probably already noticed the changes, but here's the full changelog:
The Progress blog link in the main navigation bar now points to a new list page with just the post headers and each post's table of contents, instead of directly overwhelming your browser with a view of every blog post ever on a single page.
If you've been reading this blog regularly, you've probably been starting to dread clicking this link just as much as I've been. 14 MB of initially loaded content isn't too bad for 136 posts with an increasing amount of media content, but laying out the now 2 MB of HTML sure takes a while, leaving you with a sluggish and unresponsive browser in the meantime. The old one-page view is still available at a dedicated URL in case you want to Ctrl-F over the entire history from time to time, but it's no longer the default.
The new 🔼 and 🔽 buttons now allow quick jumps between blog posts without going through the table of contents or the old one-page view. These work as expected on all views of the blog: On single-post pages, the buttons link to the adjacent single-post pages, whereas they jump up and down within the same page on the list of posts or the tag-filtered and one-page views.
The header section of each post now shows the individual goals of each push that the post documents, providing a sort of title. This is much more useful than wasting space with meaningless commit hashes; just like in the log, links to the commit diffs don't need to be longer than a GitHub icon.
The web feeds that 📝 handlerug implemented two years ago are now prominently displayed in the new blog navigation sub-header. Listing them using <link rel="alternate"> tags in the HTML <head> is usually enough for integrated feed reader extensions to automatically discover their presence, but it can't hurt to draw more attention to them. Especially now that Twitter has been locking out unregistered users for quite some time…
Speaking of microblogging platforms, I've now also followed a good chunk of the Touhou community to Bluesky! The algorithms there seem to treat my posts much more favorably than Twitter has been doing lately, despite me having less than 1/10 of mostly automatically migrated followers there. For now, I'm going to cross-post new stuff to both platforms, but I might eventually spend a push to migrate my entire tweet history over to a self-hosted PDS to own the primary source of this data.
Next up: Staying with main menus, but jumping forward to TH04 and TH05 and finalizing some code there. Should be a quick one.
P0207
TH01 decompilation (YuugenMagan, part 1/5: Preparation)
P0208
TH01 decompilation (YuugenMagan, part 2/5: Helper functions)
P0209
TH01 decompilation (YuugenMagan, part 3/5: Main function)
P0210
TH01 decompilation (YuugenMagan, part 4/5: Eye opening/closing + 邪 colors)
P0211
TH01 decompilation (YuugenMagan, part 5/5: Quirk research + Data finalization, part 1/2 + Common part of endings)
Whew, TH01's boss code just had to end with another beast of a boss, taking
way longer than it should have and leaving uncomfortably little time for the
rest of the game. Let's get right into the overview of YuugenMagan, the most
sequential and scripted battle in this game:
The fight consists of 14 phases, numbered (of course) from 0 to 13.
Unlike all other bosses, the "entrance phase" 0 is a proper gameplay-enabled
part of the fight itself, which is why I also count it here.
YuugenMagan starts with 16 HP, second only to Sariel's 18+6. The HP bar
visualizes the HP threshold for the end of phases 3 (white part) and 7
(red-white part), respectively.
All even-numbered phases change the color of the 邪 kanji in the stage
background, and don't check for collisions between the Orb and any eye.
Almost all of them consequently don't feature an attack, except for phase
0's 1-pixel lasers, spawning symmetrically from the left and right edges of
the playfield towards the center. Which means that yes, YuugenMagan is in
fact invincible during this first attack.
All other attacks are part of the odd-numbered phases:
Phase 1: Slow pellets from the lateral eyes. Ends
at 15 HP.
Phase 3: Missiles from the southern eyes, whose
angles first shift away from Reimu's tracked position and then towards
it. Ends at 12 HP.
Phase 5: Circular pellets sprayed from the lateral
eyes. Ends at 10 HP.
Phase 7: Another missile pattern, but this time
with both eyes shifting their missile angles by the same
(counter-)clockwise delta angles. Ends at 8 HP.
Phase 9: The 3-pixel 3-laser sequence from the
northern eye. Ends at 2 HP.
Phase 11: Spawns the pentagram with one corner out
of every eye, then gradually shrinks and moves it towards the center of
the playfield. Not really an "attack" (surprise) as the pentagram can't
reach the player during this phase, but collision detection is
technically already active here. Ends at 0 HP, marking the earliest
point where the fight itself can possibly end.
Phase 13: Runs through the parallel "pentagram
attack phases". The first five consist of the pentagram alternating its
spinning direction between clockwise and counterclockwise while firing
pellets from each of the five star corners. After that, the pentagram
slams itself into the player, before YuugenMagan loops back to phase
10 to spawn a new pentagram. On the next run through phase 13, the
pentagram grows larger and immediately slams itself into the player,
before starting a new pentagram attack phase cycle with another loop
back to phase 10.
Since the HP bar fills up in a phase with no collision detection,
YuugenMagan is immune to
📝 test/debug mode heap corruption. It's
generally impossible to get YuugenMagan's HP into negative numbers, with
collision detection being disabled every other phase, and all odd-numbered
phases ending immediately upon reaching their HP threshold.
All phases until the very last one have a timeout condition, independent
from YuugenMagan's current HP:
Phase 0: 331 frames
Phase 1: 1101 frames
Phases 2, 4, 6, 8, 10, and 12: 70 frames each
Phases 3 and 7: 5 iterations of the pattern, or
1845 frames each
Phase 5: 5 iterations of the pattern, or 2230
frames
Phase 9: The full duration of the sequence, or 491
frames
Phase 11: Until the pentagram reached its target
position, or 221 frames
This makes it possible to reach phase 13 without dealing a single point of
damage to YuugenMagan, after almost exactly 2½ minutes on any difficulty.
Your actual time will certainly be higher though, as you will have to
HARRY UP at least once during the attempt.
And let's be real, you're very likely to subsequently lose a
life.
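The 2½ minutes follow directly from adding up the timeouts listed above. A small sketch of that sum, assuming the frame counts refer to the PC-98's 56.4 FPS; the array and function names are made up:

```cpp
// Timeout of phases 0 to 12 in frames, as listed above.
constexpr unsigned PHASE_TIMEOUTS[13] = {
    331,  // Phase 0
    1101, // Phase 1
    70,   // Phase 2
    1845, // Phase 3
    70,   // Phase 4
    2230, // Phase 5
    70,   // Phase 6
    1845, // Phase 7
    70,   // Phase 8
    491,  // Phase 9
    70,   // Phase 10
    221,  // Phase 11
    70,   // Phase 12
};

// Frames until phase 13 starts if every earlier phase times out.
inline unsigned frames_until_phase_13() {
    unsigned sum = 0;
    for(const unsigned frames : PHASE_TIMEOUTS) {
        sum += frames;
    }
    return sum;
}

// The same duration in seconds. The default frame rate is an assumption.
inline double seconds_until_phase_13(double fps = 56.4) {
    return frames_until_phase_13() / fps;
}
```

This adds up to 8,484 frames, or roughly 150.4 seconds at the assumed frame rate.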
At a pixel-perfect 81×61 pixels, the Orb hitboxes are laid out rather
generously this time, reaching quite a bit outside the 64×48 eye sprites:
And that's about the only positive thing I can say about a position
calculation in this fight. Phase 0 already starts with the lasers being off
by 1 pixel from the center of the iris. Sure, 28 may be a nicer number to
add than 29, but the result won't be byte-aligned either way? This is
followed by the eastern laser's hitbox somehow being 24 pixels larger than
the others, stretching a rather unexpected 70 pixels compared to the 46 of
every other laser.
On a more hilarious note, the eye closing keyframe contains the following
(pseudo-)code, comprising the only real accidentally "unused" danmaku
subpattern in TH01:
// Did you mean ">= RANK_HARD"?
if(rank == RANK_HARD) {
    eye_north.fire_aimed_wide_5_spread();
    eye_southeast.fire_aimed_wide_5_spread();
    eye_southwest.fire_aimed_wide_5_spread();
    // Because this condition can never be true otherwise.
    // As a result, no pellets will be spawned on Lunatic mode.
    // (There is another Lunatic-exclusive subpattern later, though.)
    if(rank == RANK_LUNATIC) {
        eye_west.fire_aimed_wide_5_spread();
        eye_east.fire_aimed_wide_5_spread();
    }
}
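For anyone who wants to play around with this condition, here is a compilable reduction of the pseudocode that merely counts spreads instead of firing them. The enum values other than RANK_HARD and RANK_LUNATIC, their order, and both function names are assumptions made for the sake of the example.

```cpp
enum Rank { RANK_EASY, RANK_NORMAL, RANK_HARD, RANK_LUNATIC };

// Number of 5-spreads fired with the conditions as written.
inline int spreads_as_written(Rank rank) {
    int spreads = 0;
    if(rank == RANK_HARD) {
        spreads += 3; // north, southeast, southwest
        if(rank == RANK_LUNATIC) { // never true inside the outer branch
            spreads += 2; // west, east
        }
    }
    return spreads;
}

// Number of 5-spreads with the outer condition changed to >=.
inline int spreads_with_ge(Rank rank) {
    int spreads = 0;
    if(rank >= RANK_HARD) {
        spreads += 3;
        if(rank == RANK_LUNATIC) {
            spreads += 2;
        }
    }
    return spreads;
}
```

As written, Hard gets 3 spreads and Lunatic gets none; with >=, Lunatic would get all 5.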
Featuring the weirdly extended hitbox for the eastern laser, as well as
an initial Reimu position that points out the disparity between
byte-aligned rendering and the internal coordinates one final time.
After a few utility functions that look more like a quickly abandoned
refactoring attempt, we quickly get to the main attraction: YuugenMagan
combines the entire boss script and most of the pattern code into a single
2,634-instruction function, totaling 9,677 bytes inside
REIIDEN.EXE. For comparison, ReC98's version of this code
consists of at least 49 functions, excluding those I had to add to work
around ZUN's little inconsistencies, or the ones I added for stylistic
reasons.
In fact, this function is so large that Turbo C++ 4.0J refuses to generate
assembly output for it via the -S command-line option, aborting
with a Compiler table limit exceeded in function error.
Contrary to what the Borland C++ 4.0 User Guide suggests, this
instance of the error is not at all related to the number of function bodies
or any metric of algorithmic complexity, but is simply a result of the
compiler's internal text representation for a single function overflowing a
64 KiB memory segment. Merely shortening the names of enough identifiers
within the function can help to get that representation down below 64 KiB.
If you encounter this error during regular software development, you might
interpret it as the compiler's roundabout way of telling you that it inlined
way more function calls than you probably wanted to have inlined. Because
you definitely won't explicitly spell out such a long function
in newly-written code, right?
At least it wasn't the worst copy-pasting job in this
game; that trophy still goes to 📝 Elis. And
while the tracking code for adjusting an eye's sprite according to the
player's relative position is one of the main causes behind all the bloat,
it's also 100% consistent, and might have been an inlined class method in
ZUN's original code as well.
The clear highlight in this fight though? Almost no coordinate is
precisely calculated where you'd expect it to be. In particular, all
bullet spawn positions completely ignore the direction the eyes are
facing:
Combining the bottom of the pupil with the exact horizontal
center of the sprite as a whole might sound like a good idea, but looks
especially wrong if the eye is facing right.
Here it's the other way round: OK for a right-facing eye, really
wrong for a left-facing one.
Dude, the eye is even supposed to track the laser in this
one!
Hint: That's not the center of the playfield. At least the
pellets spawned from the corners are sort of correct, but with the corner
calculations precomputed, you could only get them wrong on
purpose.
Due to their effect on gameplay, these inaccuracies can't even be called
"bugs", and made me devise a new "quirk" category instead. More on that in
the TH01 100% blog post, though.
While we did see an accidentally unused bullet pattern earlier, I can
now say with certainty that there are no truly unused danmaku
patterns in TH01, i.e., pattern code that exists but is never called.
However, the code for YuugenMagan's phase 5 reveals another small piece of
danmaku design intention that never shows up within the parameters of
the original game.
By default, pellets are clipped when they fly past the top of the playfield,
which we can clearly observe for the first few pellets of this pattern.
Interestingly though, the second subpattern actually configures its pellets
to fall straight down from the top of the playfield instead. You never see
this happening in-game because ZUN limited that subpattern to a downwards
angle range of 0x73 or 162°, resulting in none of its pellets
ever getting close to the top of the playfield. If we extend that range to a
full 360° though, we can see how ZUN might have originally planned the
pattern to end:
YuugenMagan's phase 5 patterns on every difficulty, with the
second subpattern extended to reveal the different pellet behavior that
remained in the final game code. In the original game, the eyes would stop
spawning bullets on the marked frame.
If we also disregard everything else about YuugenMagan that fits the
upcoming definition of quirk, we're left with 6 "fixable" bugs, all
of which are a symptom of general blitting and unblitting laziness. Funnily
enough, they can all be demonstrated within a short 9-second part of the
fight, from the end of phase 9 up until the pentagram starts spinning in
phase 13:
General flickering whenever any sprite overlaps an eye. This is caused
by only reblitting each eye every 3 frames, and is an issue all throughout
the fight. You might have already spotted it in the videos above.
Each of the two lasers is unblitted and blitted individually instead of
each operation being done for both lasers together. Remember how
📝 ZUN unblits 32 horizontal pixels for every row of a line regardless of its width?
That's why the top part of the left, right-moving laser is never visible,
because it's blitted before the other laser is unblitted.
ZUN forgot to unblit the lasers when phase 9 ends. This footage was
recorded by pressing ↵ Return in test mode (game t or
game d), and it's probably impossible to achieve this during
actual gameplay without TAS techniques. You would have to deal the required
6 points of damage within 491 frames, with the eye being invincible during
240 of them. Simply shooting up an Orb with a horizontal velocity of 0 would
also only work a single time, as boss entities always repel the Orb with a
horizontal velocity of ±4.
The shrinking pentagram is unblitted after the eyes were blitted,
adding another guaranteed frame of flicker on top of the ones in 1). Like in
2), the blockiness of the holes is another result of unblitting 32 pixels
per row at a time.
Another missing unblitting call in a phase transition, as the pentagram
switches from its not quite correctly interpolated shrunk form to a regular
star polygon with a radius of 64 pixels. Indirectly caused by the massively
bloated coordinate calculation for the shrink animation being done
separately for the unblitting and blitting calls. Instead of, y'know, just
doing it once and storing the result in variables that can later be
reused.
The pentagram is not reblitted at all during the first 100 frames of
phase 13. During that rather long time, it's easily possible to remove
it from VRAM completely by covering its area with player shots. Or HARRY UP pellets.
Definitely an appropriate end for this game's entity blitting code.
I'm really looking forward to writing a
proper sprite system for the Anniversary Edition…
And just in case you were wondering about the hitboxes of these pentagrams
as they slam themselves into Reimu:
62 pixels on the X axis, centered around each corner point of the star, 16
pixels below, and extending infinitely far up. The latter part becomes
especially devious because the game always collision-detects
all 5 corners, regardless of whether they've already clipped through
the bottom of the playfield. The simultaneously occurring shape distortions
are simply a result of the line drawing function's rather poor
re-interpolation of any line that runs past the 640×400 VRAM boundaries;
📝 I described that in detail back when I debugged the shootout laser crash.
Ironically, using fixed-size hitboxes for a variable-sized pentagram means
that the larger one is easier to dodge.
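As a rough illustration, a per-corner hit test with these dimensions might look as follows. Which point of the player is actually tested and whether the edges are inclusive are guesses on my part, and all names are hypothetical.

```cpp
// Screen-space position; y grows downward.
struct Pos {
    int x;
    int y;
};

constexpr int HITBOX_W = 62;     // centered around the corner
constexpr int HITBOX_BELOW = 16; // extent below the corner

// There is deliberately no upper bound on the Y axis: the hitbox extends
// infinitely far up from (corner.y + HITBOX_BELOW).
constexpr bool corner_hits(Pos corner, Pos player) {
    return (
        (player.x >= (corner.x - (HITBOX_W / 2))) &&
        (player.x <= (corner.x + (HITBOX_W / 2))) &&
        (player.y <= (corner.y + HITBOX_BELOW))
    );
}
```

Because of the missing upper bound, a corner at y = 500, far below the 400-row screen, would still hit a player at y = 380 if the X coordinates line up.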
The final puzzle in TH01's boss code comes
📝 once again in the form of weird hardware
palette changes. The 邪 kanji on the background
image goes through various colors throughout the fight, which ZUN
implemented by gradually incrementing and decrementing either a single one
or none of the color's three 4-bit components at the beginning of each
even-numbered phase. The resulting color sequence, however, doesn't
quite seem to follow these simple rules:
Phase 0: #DD5邪
Phase 2: #0DF邪
Phase 4: #F0F邪
Phase 6: #00F邪, but at the
end of the phase?!
Phase 8: #0FF邪, at the start
of the phase, #0F5邪, at the end!?
Phase 10: #FF5邪, at the start of
the phase, #F05邪, at the end
Second repetition of phase 12: #005邪
shortly after the start of the phase?!
Adding some debug output sheds light on what's going on there:
Since each iteration of phase 12 adds 63 to the red component, integer
overflow will cause the color to infinitely alternate between dark-blue
and red colors on every 2.03 iterations of the pentagram phase loop. The
65th iteration will therefore be the first one with a dark-blue color
for a third iteration in a row – just in case you manage to stall the
fight for that long.
Yup, ZUN had so much trust in the color clamping done by his hardware
palette functions that he did not clamp the increment operation on the
stage_palette itself. Therefore, the 邪
colors and even the timing of their changes from Phase 6 onwards are
"defined" by wildly incrementing color components beyond their intended
domain, so much that even the underlying signed 8-bit integer ends up
overflowing. Given that the decrement operation on the
stage_palette is clamped though, this might be another
one of those accidents that ZUN deliberately left in the game,
📝 similar to the conclusion I reached with infinite bumper loops.
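The overflow can be reproduced in a few lines. This is a minimal sketch and not the game's code: `hw_clamp()`, `add_unclamped()`, and `shown_after()` are hypothetical names, the clamp to 0…15 stands in for what the hardware palette functions are described to do, and the starting value of 0 is picked purely for illustration.

```cpp
#include <cstdint>

// Stand-in for the clamping done by the hardware palette functions:
// anything outside the 4-bit range is forced to 0x0 or 0xF.
inline int hw_clamp(int8_t component)
{
    return ((component < 0) ? 0 : (component > 15) ? 15 : component);
}

// Unclamped addition on a signed 8-bit component. Going through uint8_t
// makes the wraparound explicit.
inline int8_t add_unclamped(int8_t component, int delta)
{
    return static_cast<int8_t>(static_cast<uint8_t>(component + delta));
}

// Component value that ends up on screen after the given number of
// iterations, each of which adds 63 without clamping.
inline int shown_after(int8_t start, int iterations)
{
    int8_t c = start;
    for(int i = 0; i < iterations; i++) {
        c = add_unclamped(c, 63);
    }
    return hw_clamp(c);
}
```

Starting from 0, the stored values run 63, 126, -67, -4, 59, …, which the clamp turns into F, F, 0, 0, F, … on screen. A full wraparound takes 256 / 63 ≈ 4.06 iterations, so the visible component flips about every 2.03 iterations.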
But guess what, that's also the last time we're going to encounter this type
of palette component domain quirk! Later games use master.lib's 8-bit
palette system, which keeps the comfort of using a single byte per
component, but shifts the actual hardware color into the top 4 bits, leaving
the bottom 4 bits for added precision during fades.
OK, but now we're done with TH01's bosses! 🎉 That was the
8th PC-98 Touhou boss in total, leaving 23 to go.
With all the necessary research into these quirks going well into a fifth
push, I spent the remaining time in that one with transferring most of the
data between YuugenMagan and the upcoming rest of REIIDEN.EXE
into C land. This included the one piece of technical debt in TH01 we've
been carrying around since March 2015, as well as the final piece of the
ending sequence in FUUIN.EXE. Decompiling that executable's
main() function in a meaningful way requires pretty much all
remaining data from REIIDEN.EXE to also be moved into C land,
just in case you were wondering why we're stuck at 99.46% there.
On a more disappointing note, the static initialization code for the
📝 5 boss entity slots ultimately revealed why
YuugenMagan's code is as bloated and redundant as it is: The 5 slots really
are 5 distinct variables rather than a single 5-element array. That's why
ZUN explicitly spells out all 5 eyes every time, because the array he could
have just looped over simply didn't exist. 😕 And while these slot variables
are stored in a contiguous area of memory that I could just have
taken the address of and then indexed it as if it were an array, I
didn't want to annoy future port authors with what would technically be
out-of-bounds array accesses for purely stylistic reasons. At least it
wasn't that big of a deal to rewrite all boss code to use these distinct
variables, although I certainly had to get a bit creative with Elis.
Next up: Finding out how many points we got in total, and hoping that ZUN
didn't hide more unexpected complexities in the remaining 45 functions of
this game. If you have to spare, there are two ways
in which that amount of money would help right now:
I'm expecting another subscription transaction
from Yanga before the 15th, which would leave to
round out one final TH01 RE push. With that, there'd be a total of 5 left in
the backlog, which should be enough to get the rest of this game done.
I really need to address the performance and usability issues
with all the small videos in this blog. Just look at the video immediately
above, where I disabled the controls because they would cover the debug text
at the bottom… Edit (2022-10-31):… which no longer is an
issue with our 📝 custom video player.
I already reserved this month's anonymous contribution for this work, so it would take another to be turned into a full push.
P0205
TH01 decompilation (Mima, part 1/2: Patterns 1-4)
P0206
TH01 decompilation (Mima, part 2/2: Patterns 5-8 + main function) + Research (TH01's unexpected palette changes)
💰 Funded by:
[Anonymous], Yanga
Oh look, it's another rather short and straightforward boss with a rather
small number of bugs and quirks. Yup, contrary to the character's
popularity, Mima's premiere is really not all that special in terms of code,
and continues the trend established with
📝 Kikuri and
📝 SinGyoku. I've already covered
📝 the initial sprite-related bugs last November,
so this post focuses on the main code of the fight itself. The overview:
The TH01 Mima fight consists of 3 phases, with phases 1 and 3 each
corresponding to one half of the 12-HP bar.
📝 Just like with SinGyoku, the distinction
between the red-white and red parts is purely visual once again, and doesn't
reflect anything about the boss script. As usual, all of the phases have to
be completed in order.
Phases 1 and 3 cycle through 4 danmaku patterns each, for a total of 8.
The cycles always start on a fixed pattern.
3 of the patterns in each phase feature rotating white squares, thus
introducing a new sprite in need of being unblitted.
Phase 1 additionally features the "hop pattern" as the last one in its
cycle. This is the only pattern where Mima leaves the seal in the center of
the playfield to hop from one edge of the playfield towards the other, while
also moving slightly higher up on the Y axis, and staying on the final
position for the next pattern cycle. For the first time, Mima selects a
random starting edge, which is then alternated on successive cycles.
Since the square entities are local to the respective pattern function,
Phase 1 can only end once the current pattern is done, even if Mima's HP are
already below 6. This makes Mima susceptible to the
📝 test/debug mode HP bar heap corruption bug.
Phase 2 simply consists of a spread-in teleport back to Mima's initial
position in the center of the playfield. This would only have been strictly
necessary if phase 1 ended on the hop pattern, but is done regardless of the
previous pattern, and does provide a nice visual separation between the two
main phases.
That's it – nothing special in Phase 3.
And there aren't even any weird hitboxes this time. What is maybe
special about Mima, however, is how there's something to cover about all of
her patterns. Since this is TH01, it won't surprise anyone that the
rotating square patterns are one giant copy-pasta of unblitting, updating,
and rendering code. At least ZUN placed the core polar→Cartesian
transformation in a separate function for creating regular polygons
with an arbitrary number of sides, which might hint toward some more varied
shapes having been planned at one point?
5 of the 6 patterns even follow the exact same steps during square update
frames:
Calculate square corner coordinates
Unblit the square
Update the square angle and radius
Use the square corner coordinates for spawning pellets or missiles
Recalculate square corner coordinates
Render the square
Notice something? Bullets are spawned before the corner coordinates
are updated. That's why their initial positions seem to be a bit off – they
are spawned exactly in the corners of the square, it's just that it's
the square from 8 frames ago.
Mima's first pattern on Normal difficulty.
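The six steps above can be sketched as follows, to show where the stale coordinates come from. Everything here is an assumption made for illustration: `regular_polygon()`, `Square`, `UpdateResult`, and `square_update_frame()` are hypothetical names, and floating-point `std::cos()`/`std::sin()` stand in for whatever trigonometry the original code uses.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Point {
    double x;
    double y;
};

struct Square {
    Point center;
    double radius;
    double angle; // in radians
};

struct UpdateResult {
    std::vector<Point> spawn_points; // where the bullets appear
    std::vector<Point> rendered;     // where the square is drawn
};

// Polar→Cartesian transformation for a regular polygon with an arbitrary
// number of sides.
inline std::vector<Point> regular_polygon(
    Point center, double radius, double angle, int sides
)
{
    assert(sides >= 3);
    const double TAU = 6.283185307179586;
    std::vector<Point> corners;
    for(int i = 0; i < sides; i++) {
        const double a = (angle + ((TAU * i) / sides));
        corners.push_back(Point{
            (center.x + (radius * std::cos(a))),
            (center.y + (radius * std::sin(a)))
        });
    }
    return corners;
}

inline UpdateResult square_update_frame(
    Square& sq, double angle_step, double radius_step
)
{
    // 1) Calculate corners, 2) unblit them (omitted here)
    const std::vector<Point> old_corners = regular_polygon(
        sq.center, sq.radius, sq.angle, 4
    );
    // 3) Update angle and radius
    sq.angle += angle_step;
    sq.radius += radius_step;
    UpdateResult ret;
    // 4) Bullets still spawn at the corners from before the update
    ret.spawn_points = old_corners;
    // 5) Recalculate, 6) render (omitted here)
    ret.rendered = regular_polygon(sq.center, sq.radius, sq.angle, 4);
    return ret;
}
```

Swapping steps 4 and 5 would be enough to make the bullets spawn on the corners of the square that is actually visible.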
Once ZUN reached the final laser pattern though, he must have noticed that
there's something wrong there… or maybe he just wanted to fire those
lasers independently from the square unblit/update/render timer for a
change. Spending an additional 16 bytes of the data segment for conveniently
remembering the square corner coordinates across frames was definitely a
decent investment.
When Mima isn't shooting bullets from the corners of a square or hopping
across the playfield, she's raising flame pillars from the bottom of the playfield within very specifically calculated
random ranges… which are then rendered at byte-aligned VRAM positions, while
collision detection still uses their actual pixel position. Since I don't
want to sound like a broken record all too much, I'll just direct you to
📝 Kikuri, where we've seen the exact same issue with the teardrop ripple sprites.
The conclusions are identical as well.
Mima's flame pillar pattern. This video was recorded on a particularly
unlucky seed that resulted in great disparities between a pillar's
internal X coordinate and its byte-aligned on-screen appearance, leading
to lots of right-shifted hitboxes.
Also note how the change from the meteor animation to the three-arm 🚫
casting sprite doesn't unblit the meteor, and leaves that job to
any sprite that happens to fly over those pixels.
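The mismatch between byte-aligned rendering and pixel-precise collision detection boils down to a bit of arithmetic. A minimal sketch, assuming 8 horizontal pixels per VRAM byte and rounding down to the byte boundary; both function names are hypothetical:

```cpp
// X position at which a byte-aligned blit ends up on screen: the lowest
// 3 bits of the pixel coordinate are dropped.
inline int vram_x_byte_aligned(int x)
{
    return (x & ~7);
}

// How far the hitbox, which keeps using the real pixel position, sits to
// the right of the rendered sprite. Ranges from 0 to 7 pixels.
inline int hitbox_offset_from_sprite(int x)
{
    return (x - vram_x_byte_aligned(x));
}
```

An internal X coordinate of 327 would be rendered at 320, leaving the hitbox 7 pixels to the right of what the player sees.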
However, I'd say that the saddest part about this pattern is how choppy it
is, with the circle/pillar entities updating and rendering at a meager 7
FPS. Why go that low on purpose when you can just make the game render ✨
smoothly ✨ instead?
So smooth it's almost uncanny.
The reason quickly becomes obvious: With TH01's lack of optimization, going
for the full 56.4 FPS would have significantly slowed down the game on its
intended 33 MHz CPUs, requiring more than cheap surface-level ASM
optimization for a stable frame rate. That might very well have been ZUN's
reason for only ever rendering one circle per frame to VRAM, and designing
the pattern with these time offsets in mind. It's always been typical for
PC-98 developers to target the lowest-spec models that could possibly still
run a game, and implementing dynamic frame rates into such an engine-less
game is nothing I would wish on anybody. And it's not like TH01 is
particularly unique in its choppiness anyway; low frame rates are actually a
rather typical part of the PC-98 game aesthetic.
The final piece of weirdness in this fight can be found in phase 1's hop
pattern, and specifically its palette manipulation. Just from looking at the
pattern code itself, each of the 4 hops is supposed to darken the hardware
palette by subtracting #444 from every color. At the last hop,
every color should have therefore been reduced to a pitch-black
#000, leaving the player completely blind to the movement of
the chasing pellets for 30 frames and making the pattern quite ghostly
indeed. However, that's not what we see in the actual game:
Nothing in the pattern's code would cause the hardware palette to get
brighter before the end of the pattern, and yet…
The expected version doesn't look all too unfair, even on Lunatic…
well, at least at the default rank pellet speed shown in this
video. At maximum pellet speed, it is in fact rather brutal.
Looking at the frame counter, it appears that something outside the
pattern resets the palette every 40 frames. The only known constant with a
value of 40 would be the invincibility frames after hitting a boss with the
Orb, but we're not hitting Mima here…
But as it turns out, that's exactly where the palette reset comes from: The
hop animation darkens the hardware palette directly, while the
📝 infamous 12-parameter boss collision handler function
unconditionally resets the hardware palette to the "default boss palette"
every 40 frames, regardless of whether the boss was hit or not. I'd classify
this as a bug: That function has no business doing periodic hardware palette
resets outside the invincibility flash effect, and it completely defies
common sense that it does.
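Here's a small model of how these two pieces of code interact. It's a sketch under stated assumptions, not the decompiled function: `RGB4`, `Palette`, `palette_darken()`, and `boss_hit_handler_sketch()` are hypothetical names, and the 40-frame period is modeled as a plain modulo on a frame counter.

```cpp
#include <array>

struct RGB4 {
    int r;
    int g;
    int b;
};
using Palette = std::array<RGB4, 16>;

inline int darkened(int component, int amount)
{
    return ((component > amount) ? (component - amount) : 0);
}

// What each hop does: subtract the same amount from every component of
// every color in the hardware palette.
inline void palette_darken(Palette& hw, int amount)
{
    for(RGB4& c : hw) {
        c.r = darkened(c.r, amount);
        c.g = darkened(c.g, amount);
        c.b = darkened(c.b, amount);
    }
}

// The problematic part: the reset ignores whether the boss was hit at all.
inline void boss_hit_handler_sketch(
    Palette& hw, const Palette& boss_default, unsigned long frame, bool boss_was_hit
)
{
    (void)boss_was_hit; // deliberately unused, that's the bug being modeled
    if((frame % 40) == 0) {
        hw = boss_default;
    }
}
```

Four calls to `palette_darken(hw, 4)` leave every color at #000, and the next handler call on a multiple of 40 frames brings back the default palette, which is the brightening seen in the actual game.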
That explains one unexpected palette change, but could this function
possibly also explain the other infamous one, namely, the temporary green
discoloration in the Konngara fight? That glitch comes down to how the game
actually uses two global "default" palettes: a default boss
palette for undoing the invincibility flash effect, and a default
stage palette for returning the colors back to normal at the end of
the bomb animation or when leaving the Pause menu. And sure enough, the
stage palette is the one with the green color, while the boss
palette contains the intended colors used throughout the fight. Sending the
latter palette to the graphics chip every 40 frames is what corrects
the discoloration, which would otherwise be permanent.
The green color comes from BOSS7_D1.GRP, the scrolling
background of the entrance animation. That's what turns this into a clear
bug: The stage palette is only set a single time in the entire fight,
at the beginning of the entrance animation, to the palette of this image.
Apart from consistency reasons, it doesn't even make sense to set the stage
palette there, as you can't enter the Pause menu or bomb during a blocking
animation function.
And just 3 lines of code later, ZUN loads BOSS8_A1.GRP, the
main background image of the fight. Moving the stage palette assignment
there would have easily prevented the discoloration.
But yeah, as you can tell, palette manipulation is complete jank in this
game. Why differentiate between a stage and a boss palette to begin with?
The blocking Pause menu function could have easily copied the original
palette to a local variable before darkening it, and then restored it after
closing the menu. It's not so easy for bombs as the intended palette could
change between the start and end of the animation, but the code could have
still been simplified a lot if there was just one global "default palette"
variable instead of two. Heck, even the other bosses who manipulate their
palettes correctly only do so because they manually synchronize the two
after every change. The proper defense against bugs that result from wild
mutation of global state is to get rid of global state, and not to put up
safety nets hidden in the middle of existing effect code.
The easiest way of reproducing the green discoloration bug in
the TH01 Konngara fight, timed to show the maximum amount of time the
discoloration can possibly last.
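The Pause menu idea from above, copying the palette into a local variable instead of relying on a global default, could look roughly like this. `pause_menu_sketch()` and its callback parameter are hypothetical, and halving every component is just a placeholder for whatever darkening the real menu applies.

```cpp
#include <array>

struct RGB4 {
    int r;
    int g;
    int b;
};
using Palette = std::array<RGB4, 16>;

// Blocking menu function that cleans up after itself. `menu_loop` stands in
// for the input loop that runs while the menu is open.
template <class MenuLoop> void pause_menu_sketch(Palette& hw, MenuLoop menu_loop)
{
    const Palette original = hw; // local copy, no global state involved

    for(RGB4& c : hw) {
        c.r /= 2;
        c.g /= 2;
        c.b /= 2;
    }
    menu_loop(hw);

    hw = original; // restore exactly what was there before
}
```

Since the function blocks until the menu is closed, nothing else can change the palette in between, so the local copy is always the right one to restore.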
In any case, that's Mima done! 7th PC-98 Touhou boss fully
decompiled, 24 bosses remaining, and 59 functions left in all of TH01.
In other thrilling news, my call for secondary funding priorities in new
TH01 contributions has given us three different priorities so far. This
raises an interesting question though: Which of these contributions should I
now put towards TH01 immediately, and which ones should I leave in the
backlog for the time being? Since I've never liked deciding on priorities,
let's turn this into a popularity contest instead: The contributions with
the least popular secondary priorities will go towards TH01 first, giving
the most popular priorities a higher chance to still be left over after TH01
is done. As of this delivery, we'd have the following popularity order:
TH05 (1.67 pushes), from T0182
Seihou (1 push), from T0184
TH03 (0.67 pushes), from T0146
Which means that T0146 will be consumed for TH01 next, followed by T0184 and
then T0182. I only assign transactions immediately before a delivery though,
so you all still have the chance to change up these priorities before the
next one.
Next up: The final boss of TH01 decompilation, YuugenMagan… if the current
or newly incoming TH01 funds happen to be enough to cover the entire fight.
If they don't turn out to be, I will have to pass the time with some Seihou
work instead, missing the TH01 anniversary deadline as a result. Edit (2022-07-18): Thanks to Yanga for
securing the funding for YuugenMagan after all! That fight will feature
slightly more than half of all remaining code in TH01's
REIIDEN.EXE and the single biggest function in all of PC-98
Touhou, let's go!
P0165
TH01 decompilation (Missiles, part 1/2 + large boss sprites, part 1/3)
P0166
TH01 decompilation (Large boss sprites, part 2/3)
P0167
TH01 decompilation (Large boss sprites, part 3/3 + Stage initialization + Defeat animation + Route selection)
💰 Funded by:
Ember2528
OK, TH01 missile bullets. Can we maybe have a well-behaved entity type,
without any weirdness? Just once?
Ehh, kinda. Apart from another 150 bytes wasted on unused structure members,
this code is indeed more on the low end in terms of overall jank. It does
become very obvious why dodging these missiles in the YuugenMagan, Mima, and
Elis fights feels so awful though: An unfair 46×46 pixel hitbox around
Reimu's center pixel, combined with the comeback of
📝 interlaced rendering, this time in every
stage. ZUN probably did this because missiles are the only 16×16 sprite in
TH01 that is blitted to unaligned X positions, which effectively ends up
touching a 32×16 area of VRAM per sprite.
But even if we assume VRAM writes to be the bottleneck here, it would
have been totally possible to render every missile in every frame at roughly
the same amount of CPU time that the original game uses for interlaced
rendering:
Note that all missile sprites only use two colors, white and green.
Instead of naively going with the usual four bitplanes, extract the
pixels drawn in each of the two used colors into their own bitplanes.
master.lib calls this the "tiny format".
Use the GRCG to draw these two bitplanes in the intended white and green
colors, halving the amount of VRAM writes compared to the original
function.
(Not using the .PTN format would have also avoided the inconsistency of
storing the missile sprites in boss-specific sprite slots.)
That's an optimization that would have significantly benefitted the game, in
contrast to all of the fake ones
introduced in later games. Then again, this optimization is
actually something that the later games do, and it might have in fact been
necessary to achieve their higher bullet counts without significant
slowdown.
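The conversion step of that optimization, turning one row of a regular 4-plane sprite into one bitmask per used color, can be sketched like this. `Row4` and `color_mask()` are hypothetical names, and the plane order and color indices are assumptions for illustration; this is neither master.lib's nor the later games' actual code.

```cpp
#include <cassert>
#include <cstdint>

// One 16-pixel row of a 4-plane sprite. Bit 15 is the leftmost pixel, and
// the pixel's color index is formed by its bits across the 4 planes.
struct Row4 {
    uint16_t plane[4];
};

// Returns a mask with a 1 bit for every pixel that has exactly `color`.
inline uint16_t color_mask(const Row4& row, unsigned color)
{
    assert(color < 16);
    uint16_t mask = 0xFFFF;
    for(int p = 0; p < 4; p++) {
        const bool want_set = ((color >> p) & 1);
        mask &= (want_set ? row.plane[p] : static_cast<uint16_t>(~row.plane[p]));
    }
    return mask;
}
```

Blitting then means setting the GRCG to one color, writing that color's mask, and repeating for the second color: two mask writes per VRAM word instead of one write for each of the four planes.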
After some effectively unused Mima sprite effect code that is so broken that
it's impossible to make sense out of it, we get to the final feature I
wanted to cover for all bosses in parallel before returning to Sariel: The
separate sprite background storage for moving or animated boss sprites in
the Mima, Elis, and Sariel fights. But, uh… why is this necessary to begin
with? Doesn't TH01 already reserve the other VRAM page for backgrounds?
Well, these sprites are quite big, and ZUN didn't want to blit them from
main memory on every frame. After all, TH01 and TH02 had a minimum required
clock speed of 33 MHz, half of the speed required for the later three games.
So, he simply blitted these boss sprites to both VRAM pages, leading
the usual unblitting calls to only remove the other sprites on top of the
boss. However, these bosses themselves want to move across the screen…
and this makes it necessary to save the stage background behind them
in some other way.
Enter .PTN, and its functions to capture a 16×16 or 32×32 square from VRAM
into a sprite slot. No problem with that approach in theory, as the size of
all these bigger sprites is a multiple of 32×32; splitting a larger sprite
into these smaller 32×32 chunks makes the code look just a little bit clumsy
(and, of course, slower).
But somewhere during the development of Mima's fight, ZUN apparently forgot
that those sprite backgrounds existed. And once Mima's 🚫 casting sprite is
blitted on top of her regular sprite, using just regular sprite
transparency, she ends up with her infamous third arm:
Ironically, there's an unused code path in Mima's unblit function where ZUN
assumes a height of 48 pixels for Mima's animation sprites rather than the
actual 64. This leads to even clumsier .PTN function calls for the bottom
128×16 pixels… Failing to unblit the bottom 16 pixels would have also
yielded that third arm, although it wouldn't have looked as natural. Still
wouldn't say that it was intentional; maybe this casting sprite was just
added pretty late in the game's development?
So, mission accomplished, Sariel unblocked… at 2¼ pushes. That's quite some time left for some smaller stage initialization
code, which bundles a bunch of random function calls in places where they
logically really don't belong. The stage opening animation then adds a bunch
of VRAM inter-page copies that are not only redundant but can't even be
understood without knowing the hidden internal state of the last VRAM page
accessed by previous ZUN code…
In better news though: Turbo C++ 4.0 really doesn't seem to have any
complexity limit on inlining arithmetic expressions, as long as they only
operate on compile-time constants. That's how we get macro-free,
compile-time Shift-JIS to JIS X 0208 conversion of the individual code
points in the 東方★靈異伝 string, in a compiler from 1994. As long as you
don't store any intermediate results in variables, that is…
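For comparison, here's how the same conversion reads in modern C++, where `constexpr` guarantees compile-time evaluation. This is a sketch of the well-known Shift-JIS → JIS X 0208 arithmetic for valid double-byte code points, not the Turbo C++ 4.0 code from the repository, and `sjis_to_jis()` is a hypothetical name.

```cpp
#include <cstdint>

// Converts a valid double-byte Shift-JIS code point to JIS X 0208.
// No validation is done on the input.
constexpr uint16_t sjis_to_jis(uint16_t sjis)
{
    const unsigned lead_raw = (sjis >> 8);
    const unsigned trail_raw = (sjis & 0xFF);

    // Collapse the two lead byte ranges (0x81-0x9F, 0xE0-0xEF) into one
    // contiguous index.
    const unsigned lead = (lead_raw - ((lead_raw >= 0xE0) ? 0xC1 : 0x81));

    // Trail bytes skip 0x7F.
    const unsigned trail = (trail_raw - ((trail_raw >= 0x80) ? 1 : 0));

    // Every lead byte covers two JIS rows. The second one starts at 0x9E.
    const unsigned second_row = ((trail >= 0x9E) ? 1 : 0);
    const unsigned j1 = ((lead * 2) + 0x21 + second_row);
    const unsigned j2 = (trail - (second_row ? 0x7D : 0x1F));
    return static_cast<uint16_t>((j1 << 8) | j2);
}

// Evaluated entirely by the compiler:
static_assert(sjis_to_jis(0x938C) == 0x456C, "東");
static_assert(sjis_to_jis(0x82A0) == 0x2422, "あ");
```

The single-expression style is not needed here because C++14 allows local variables inside `constexpr` functions, which is exactly the kind of intermediate result that the 1994 compiler can't fold.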
But wait, there's more! With still ¼ of a push left, I also went for the
boss defeat animation, which includes the route selection after the SinGyoku
fight.
As in all other instances, the 2× scaled font is accomplished by first
rendering the text at regular 1× resolution to the other, invisible VRAM
page, and then scaled from there to the visible one. However, the route
selection is unique in that its scaled text is both drawn transparently on
top of the stage background (not onto a black one), and can also change
colors depending on the selection. It would have been no problem to unblit
and reblit the text by rendering the 1× version to a position on the
invisible VRAM page that isn't covered by the 2× version on the visible one,
but ZUN (needlessly) clears the invisible page before rendering any text.
Instead, he assigned a separate VRAM color for both
the 魔界 and 地獄 options, and only changed the palette value for
these colors to white or gray, depending on the correct selection. This is
another one of the
📝 rare cases where TH01 demonstrates good use of PC-98 hardware,
as the 魔界へ and 地獄へ strings don't need to be reblitted during the selection process, only the Orb "cursor" does.
Then, why does this still not count as good-code? When
changing palette colors, you kinda need to be aware of everything
else that can possibly be on screen, which colors are used there, and which
aren't and can therefore be used for such an effect without affecting other
sprites. In this case, well… hover over the image below, and notice how
Reimu's hair and the bomb sprites in the HUD light up when Makai is
selected:
This push did end on a high note though, with the generic, non-SinGyoku
version of the defeat animation being an easily parametrizable copy. And
that's how you decompile another 2.58% of TH01 in just slightly over three
pushes.
Now, we're not only ready to decompile Sariel, but also Kikuri, Elis, and
SinGyoku without needing any more detours into non-boss code. Thanks to the
current TH01 funding subscriptions, I can plan to cover most, if not all, of
Sariel in a single push series, but the currently 3 pending pushes probably
won't suffice for Sariel's 8.10% of all remaining code in TH01. We've got
quite a lot of not specifically TH01-related funds in the backlog to pass
the time though.
Due to recent developments, it actually makes quite a lot of sense to take a
break from TH01: spaztron64 has
managed what every Touhou download site so far has failed to do: Bundling
all 5 games onto a single .HDI together with pre-configured PC-98
emulators and a nice boot menu, and hosting the resulting package on a
proper website. While this first release is already quite good (and much
better than my attempt from 2014), there is still a bit of room for
improvement to be gained from specific ReC98 research. Next up,
therefore:
Researching how TH04 and TH05 use EMS memory, together with the cause
behind TH04's crash in Stage 5 when playing as Reimu without an EMS driver
loaded, and
reverse-engineering TH03's score data file format
(YUME.NEM), which hopefully also comes with a way of building a
file that unlocks all characters without any high scores.
Last part of TH01's main graphics function segment, and we've got even
more code that alternates between being boring and being slightly weird.
But at least, "boring" also meant "consistent" for once. And
so progress continued to be as fast as expected from the last TH01 pushes,
yielding 3.3% in TH01 RE%, and 1% in overall RE%, within a single day.
There even was enough time to decompile another full code segment, which
bundles all the hardware initialization and cleanup calls into single
functions to be run when starting and exiting the game. Which might be
interesting for at least one person, I guess.
But seriously, trying to access page 2 on a system with only page 0 and 1?
Had to get out my real PC-98 to double-check that I wasn't missing
anything here, since every emulator only looks at the bottom bit of the
page number. But real hardware seems to do the same, and there really is
nothing special to it semantically, being equivalent to page 0. 🤷
Next up in TH01, we'll have some file format code!
So, the thing that made me so excited about TH01 were all those bulky C
reimplementations of master.lib functions. Identical copies in all three
executables, trivial to figure out and decompile, removing tons of
instructions, and providing a foundation for large parts of the game
later. The first set of functions near the end of that shared code segment
deals with color palette handling, and master.lib's resident palette
structure in particular. (No relation to the game's
resident structure.) Which directly starts us out with pretty much
all the decompilation difficulties imaginable:
iteration over internal DOS structures via segment pointers – Turbo
C++ doesn't support a lot of arithmetic on those, requiring tons of casts
to make it work
calls to a far function near the beginning of a segment
from a function near the end of a segment – these are undecompilable until
we've decompiled both functions (and thus, the majority of the segment),
and need to be spelled out in ASM for the time being. And if the caller
then stores some of the involved variables in registers, there's no
way around the ugliest of workarounds, spelling out opcode bytes…
surprising color format inconsistencies – apparently, GRB (rather than
RGB) is some sort of wider standard in PC-98 inter-process communication,
because it matches the order of the hardware's palette register ports
(0AAh = green, 0ACh = red, 0AEh = blue)? Yet the
game's actual palette still uses RGB…
And as it turns out, the game doesn't even use the resident palette
feature. Which adds yet another set of functions to the, uh, learning
experience that ZUN must have chosen this game to be. I wouldn't be
surprised if we manage to uncover actual scrapped beta game content later
on, among all the unused code that's bound to still be in there.
At least decompilation should get easier for the next few TH01 pushes now…
right?