- ๐ Posted:
- ๐ Summary of:
- P0280
- โจ Commits:
20bac82...87eed57
- ๐ฐ Funded by:
- Blue Bolt, JonathKane, [Anonymous]
- ๐ท Tags:
TH03 gameplay! ๐ It's been over two years. People have been investing some decent money with the intention of eventually getting netplay, so let's cover some more foundations around player movementโฆ and quickly notice that there's almost no overlap between gameplay RE and netplay preparations? That makes for a fitting opportunity to think about what TH03 netplay would look like:
-
Data exchange protocol: The unofficial TH19 battle tool considered a few possibilities and ultimately chose WebRTC Data Channels. The argument goes like this:
- You'd want UDP rather than TCP for both its low latency and its NAT hole-punching ability
- However, raw UDP does not guarantee that the packets arrive in order, or that they even arrive at all
- WebRTC implements these reliability guarantees on top of UDP in a modern package, providing the best of both worlds
- NAT traversal via public or self-hosted STUN/TURN servers is built into the connection establishment protocol and APIs, so you don't even have to understand the underlying issue
I'm not too deep into networking to argue here, and it clearly works for Ju.N.Owen. If we do explore other options, it would mainly be because I can't easily get something as modern as WebRTC to natively run on Windows 9x or DOS, if we decide to go for that route.
-
Matchmaking: I like Ju.N.Owen's initial way of copy-pasting signaling codes into chat clients to establish a peer-to-peer connection without a dedicated matchmaking server. progre eventually implemented rooms on the AWS cloud, but signaling codes are still used for spectating and the Pure P2P mode. We'll probably copy the same evolution, with a slight preference for Pure P2P โ if only because you would have to check a GDPR consent box before I can put the combination of your room name and IP address into a database. Server costs shouldn't be an issue at the scale I expect this to have.
-
Rollback: In emulators, rollback netcode can be and has been implemented by keeping savestates of the last few frames together with the local player's inputs and then replaying the emulation with updated inputs of the remote player if a prediction turned out to be incorrect. This technique is a great fit for TH03 for two reasons:
- All game state is contained within a relatively small bit of memory. The only heap allocations done in
MAIN.EXE
are the ๐ .MRS images for gauge attack portraits and bomb backgrounds, and the enemy scripts and formations, both of which remain constant throughout a round. All other state is statically allocated, which can reduce per-frame snapshots from the naive 640 KiB of conventional DOS memory to just the 37 KiB ofMAIN.EXE
's data segment. And that's the upper bound โ this number is only going to go down as we move towards 100% PI, figure out how TH03 uses all its static data, and get to consolidate all mutated data into an even smaller block of memory. - For input prediction, we could even let the game's existing AI play the remote player until the actual inputs come in, guaranteeing perfect play until the remote inputs prove otherwise. Then againโฆ probably only while the remote player is not moving, because the chance for a human to replicate the AI's infamous erratic dodging is fairly low.
The only issue with rollback in specifically a PC-98 emulator is its implications for performance. Rendering is way more computationally expensive on PC-98 than it is on consoles with hardware sprites, involving lots of memory writes to the disjointed 4 bitplane segments that make up the 128 KB framebuffer, and equally as many reads and bitshift operations on sprite data. TH03 lessens the impact somewhat thanks to most of its rendering being EGC-accelerated and thus running inside the emulator as optimized native code, but we'd still be emulating all the x86 code surrounding the EGC accesses โ from the emulator's point of view, it looks no different than game logic. Let's take my aging i5 system for example:
- With the Screen โ No wait option, Neko Project 21/W can emulate TH03 gameplay at 260 FPS, or 4.6ร its regular speed.
- This leaves room for each frame to contain 3.6 frames of rollback in addition to the frame that's supposed to be displayed,
- which results in a maximum safe network latency of โ63 ms, or a ping of โ126 ms. According to this site, that's enough for a smooth connection from Germany to any other place in Europe and even out to the US Midwest. At this ping, my system could still run the game without slowdown even if every single frame required a rollback, which is highly unlikely.
- Any higher ping, however, could occasionally lead to a rollback queue that's too large for my system to process within a single frame at the intended 56.4 FPS rate. As a result, me playing anyone in the western US is highly likely to involve at least occasional slowdowns. Delaying inputs on purpose is the usual workaround, but isn't Touhou that kind of game series where people use vpatch to get rid of even the default input delay in the Windows games?
So we'd ideally want to put TH03 into an update-only mode that skips all rendering calls during re-simulation of rolled-back frames. Ironically, this means that netplay-focused RE would actually focus on the game's rendering code and ensure that it doesn't mutate any statically allocated data, allowing it to be freely skipped without affecting the game. Imagine palette-based flashing animations that are implemented by gradually mutating statically allocated values โ these would cause wrong colors for the rest of the game if the animation doesn't run on every frame.
- All game state is contained within a relatively small bit of memory. The only heap allocations done in
Implementing all of this into TH03 can be done in one, a few, or all of the following 6 ways, depending on what the backers prefer. Sorted from the most generic to the most specialized solution (and, coincidentally, from least to most total effort required):
Generic PC-98 netcode for one or more emulators
This is the most basic and puristic variant that implements generic netplay for PC-98 games in general by effectively providing remote control of the emulated keyboard and joypad. The emulator will be unaware of the game, and the game will be unaware of being netplayed, which makes this solution particularly interesting for the non-Touhou PC-98 scene, or competitive players who absolutely insist on using ZUN's original binaries and won't trust any of my modded game builds.
Applied to TH03, this means that players would select the regular hot-seat 1P vs 2P mode and then initiate a match through a new menu in the emulator UI. The same UI must then provide an option to manually remap incoming key and button presses to the 2P controls (newly introducing remapping to the emulator if necessary), as well as blocking any non-2P keys. The host then sends an initial savestate to the guest to ensure an identical starting state, and starts synchronizing and rolling back inputs at VSync boundaries.This generic nature means that we don't get to include any of the TH03-specific rollback optimizations mentioned above, leading to the highest CPU and memory requirements out of all the variants. It sure is the easiest to implement though, as we get to freely use modern C++ WebRTC libraries that are designed to work with the network stack of the underlying OS.
I can try to build this netcode as a generic library that can work with any PC-98 emulator, but it would ultimately be up to the respective upstream developers to integrate it into official releases. Therefore, expect this variant to require separate funding and custom builds for each individual emulator codebase that we'd like to support.Emulator-level netcode with optional game integration
Takes the generic netcode developed in 1) and adds the possibility for the game to control it via a special interrupt API. This enables several improvements:
- Online matches could be initiated through new options in TH03's main menu rather than the emulator's UI.
- The game could communicate the memory region that should be backed up every frame, cutting down memory usage as described above.
- The exchanged input data could use the game's internal format instead of keyboard or joypad inputs. This removes the need for key remapping at the emulator level and naturally prevents the inherent issue of remote control where players could mess with each other's controls.
- The game could be aware of the rollbacks, allowing it to jump over its rendering code while processing the queue of remote inputs and thus gain some performance as explained above.
- The game could add synchronization points that block gameplay until both players have reached them, preventing the rollback queue from growing infinitely. This solves the issue of 1) not having any inherent way of working around desyncs and the resulting growth of the rollback queue. As an example, if one of the two emulators in 1) took, say, 2 seconds longer to load the game due to a random CPU spike caused by some bloatware on their system, the two players would be out of sync by 2 seconds for the rest of the session, forcing the faster system to render 113 frames every time an input prediction turned out to be incorrect.
Good places for synchronization points include the beginning of each round, the WARNING!! You are forced to evade / Your life is in peril popups that pause the game for a few frames anyway, and whenever the game is paused via the ESC key. - During such pauses, the game could then also block the resuming ESC key of the player who didn't pause the game.
Edit (2024-04-30): Emulated serial port communicating over named pipes with a standalone netplay tool
This approach would take the netcode developed in 2) out of the emulator and into a separate application running on the (modern) host OS, just like Ju.N.Owen or Adonis. The previous interrupt API would then be turned into binary protocol communicated over the PC-98's serial port, while the rollback snapshots would be stored inside the emulated PC-98 in EMS or XMS/Protected Mode memory. Netplay data would then move through these stages:
Sending serial port data over named pipes is only a semi-common feature in PC-98 emulators, and would currently restrict netplay to Neko Project 21/W and NP2kai on Windows. This is a pretty clean and generally useful feature to have in an emulator though, and emulator maintainers will be much more likely to include this than the custom netplay code I proposed in 1) and 2). DOSBox-X has an open issue that we could help implement, and the NP2kai Linux port would probably also appreciate a
mkfifo(3)
implementation.
This could even work with emulators that only implement PC-98 serial ports in terms of, well, native Windows serial ports. This group currently includes Neko Project II fmgen, SL9821, T98-Next, and rare bundles of Anex86 that replace MIDI support with COM port emulation. These would require separately installed and configured virtual serial port software in place of the named pipe connection, as well as support for actual serial ports in the netplay tool itself. In fact, this is the only way that die-hard Anex86 and T98-Next fans could enjoy any kind of netplay on these two ancient emulators.If it works though, it's the optimal solution for the emulated use case if we don't want to fork the emulator. From the point of view of the PC-98, the serial port is the cheapest way to send a couple of bytes to some external thing, and named pipes are one of many native ways for two Windows/Linux applications to efficiently communicate.
It could even work for real hardware, but would require the PC-98 to be linked to the separately running modern system via a null modem cable.
The only slight drawback of this approach is the expected high DOS memory requirement for rollback. Unless we find a way to really compress game state snapshots to just a few KB, this approach will require a more modern DOS setup with EMS/XMS support instead of the pre-installed MS-DOS 3.30C on a certain widely circulated .HDI copy. But apart from that, all you'd need to do is run the separate netplay tool, pick the same pipe name in both the tool and the emulator, and you're good to go.Native PC-98 Windows 9x netcode (only for real PC-98 hardware equipped with an Ethernet card)
Equivalent in features to 2), but pulls the netcode into the PC-98 system itself. The tool developed in 3) would then as a separate 32-bit or 16-bit Windows application that somehow communicates with the game running in a DOS window. The handful of real-hardware owners who have actually equipped their PC-98 with a network card such as the LGY-98 would then no longer require the modern PC from 3) as a bridge in the middle.
This specific card also happens to be low-level-emulated by the 21/W fork of Neko Project. However, it makes little sense to use this technique in an emulator when compared to 3), as NP21/W requires a separately installed and configured TAP driver to actually be able to access your native Windows Internet connection. While the setup is well-documented and I did manage to get a working Internet connection inside an emulated Windows 95, it's definitely not foolproof. Not to mention DOSBox-X, which currently emulates the apparently hardware-compatible NE2000 card, but disables its emulation in PC-98 mode, most likely because its I/O ports clash with the typical peripherals of a PC-98 system.And that's not the end of the drawbacks:
- Netplay would depend on the PC-98 versions of Windows 9x and its full network stack, nothing of which is required for the game itself.
- Porting libdatachannel (and especially the required transport encryption) to Windows 95 will probably involve a bit of effort as well.
- As would actually finding a way to access V86 mode memory from a 32-bit or 16-bit Windows process, particularly due to how isolated DOS processes are from the rest of the system and even each other. A quick investigation revealed three potential approaches:
- A 32-bit process could read the memory out of the address space of the console host process (
WINOA32.MOD
). There seems to be no way of locating the specific base address of a DOS process, but you could always do a brute-force search through the memory map. - If started before Windows, TSRs will share their resident memory with both DOS and Win16 processes. The segment pointer would then be retrieved through a typical interrupt API.
- Writing a VxD driver ๐ฉ
- A 32-bit process could read the memory out of the address space of the console host process (
- Correctly setting up TH03 to run within Windows 95 to begin with can be rather tricky. The GDC clock speed check needs to be either patched out or overridden using mode-setting tools, Windows needs to be blocked from accessing the FM chip, and even then,
MAIN.EXE
might still immediately crash during the first frame and leave all of VRAM corrupted: - A matchmaking server would be much more of a requirement than in any of the emulator variants. Players are unlikely to run their favorite chat client on the same PC-98 system, and the signaling codes are way too unwieldy to type them in manually. (Then again, IRC is always an option, and the people who would fund this variant are probably the exact same people who are already running IRC clients on their PC-98.)
Native PC-98 DOS netcode (only for real PC-98 hardware equipped with an Ethernet card)
Conceptually the same as 4), but going yet another level deeper, replacing the Windows 9x network stack with a DOS-based one. This might look even more intimidating and error-prone, but after I got
ping
and even Telnet working, I was pleasantly surprised at how much simpler it is when compared to the Windows variant. The whole stack consists of just one LGY-98 hardware information tool, a LGY-98 packet driver TSR, and a TSR that implements TCP/IP/UDP/DNS/ICMP and is configured with a plaintext file. I don't have any deep experience with these protocols, so I was quite surprised that you can implement all of them in a single 40 KiB binary. Installed as TSRs, the entire stack takes up an acceptable 82 KiB of conventional memory, leaving more than enough space for the game itself. And since both of the TSRs are open-source, we can even legally bundle them with the future modified game binaries.
The matchmaking issue from the Windows 9x approach remains though, along with the following issues:- Porting libdatachannel and the required transport encryption to the TEEN stack seems even more time-consuming than a Windows 95 port.
- The TEEN stack has no UI for specifying the system's or gateway's IP addresses outside of its plaintext configuration file. This provides a nice opportunity for adding a new Internet settings menu with great error feedback to the game itself. Great for UX, but it's another thing I'd have to write.
- The LGY-98 is not the only network card for the PC-98. Others might have more complicated DOS drivers that might not work as seamlessly with the TEEN stack, or have no preserved DOS drivers at all. Heck, the most time-consuming part of the DOS setup was finding the correct download link for the LGY-98 packet driver, as the one link that appears in a lot of places only throws an access denied error these days. Edit (2024-04-30): spaztron64 is now hosting both the LGY-98 packet driver and the entire TEEN bundle on his homepage.
If you're interested in funding this variant and are using a non-LGY-98 card on real hardware, make sure you get general Internet working on DOS first.
Porting the game first
As always, this is the premium option. If the entire game already runs as a standalone executable on a modern system, we can just put all the netcode into the same binary and have the most seamless integration possible.
That leaves us with these prerequisites:
- 1), by definition, needs nothing from ReC98, and I could theoretically start implementing it right now. If you're interested in funding it, just tell me via the usual Twitter or Discord channels.
- 2) through 5) require at least 100% RE of TH03's
OP.EXE
to facilitate the new menu code. Reverse-engineering all rendering-related code inMAIN.EXE
would be nice for performance, but we don't strictly need all of it before we start. Re-simulated frames can just skip over the few pieces of rendering code we do know, and we can gradually increase the skipped area of code in future pushes.
100% PI won't be a requirement either, as I expect theMAIN.EXE
part of the interfacing netcode layer to be thin enough that it can easily fit within the original game's code layout. - 6), obviously, requires all of TH03 to be RE'd, decompiled, cleaned up, and ported to modern systems. Currently, TH03 appears to be the second-easiest game to port behind TH02:
- Although TH03 already has more needlessly micro-optimized ASM code than TH02 and there's even more to come, it still appears to have way less than TH04 or TH05.
- Its game logic and rendering code seem to be somewhat neatly separated from each other, unlike TH01 which deeply intertwines them.
- Its graphics seem free of obvious bugs, unlike โ again โ the flicker-fest that is TH01.
Once we've reached any of these prerequisites, I'll set up a separate campaign funding method that runs parallel to the cap. As netplay is one of those big features where incremental progress makes little sense and we can expect wide community support for the idea, I'll go for a more classic crowdfunding model with a fixed goal for the minimum feature set and stretch goals for optional quality-of-life features. Since I've still got two other big projects waiting to be finished, I'd like to at least complete the Shuusou Gyoku Linux port before I start working on TH03 netplay, even if we manage to hit any of the funding goals before that.
For the first time in a long while, the actual content of this push can be listed fairly quickly. I've now RE'd:
- conversions from playfield-relative coordinates to screen coordinates and back (a first in PC-98 Touhou; even TH02 uses screen space for every coordinate I've seen so far),
- the low-level code that moves the player entity across the screen,
- a copy of the per-round frame counter that, for some reason, resets to 0 at the start of the Win/Lose animation, resetting a bunch of animations with it,
- a global hitbox with one variable that sometimes stores the center of an entity, and sometimes its top-left corner,
- and the 48ร48 hit circles from
EN2.PI
.
It's also the third TH03 gameplay push in a row that features inappropriate ASM code in places that really, really didn't need any. As usual, the code is worse than what Turbo C++ 4.0J would generate for idiomatic C code, and the surrounding code remains full of untapped and quick optimization opportunities anyway. This time, the biggest joke is the sprite offset calculation in the hit circle rendering code:
But while we've all come to expect the usual share of ZUN bloat by now, this is also the first push without either a ZUN bug or a landmine since I started using these terms! ๐ It does contain a single ZUN quirk though, which can also be found in the hit circles. This animation comes in two types with different caps: 12 animation slots across both playfields for the enemy circles shown in alternating bright/dark yellow colors, whereas the white animation for the player characters has a cap ofโฆ 1? P2 takes precedence over P1 because its update code always runs last, which explains what happens when both players get hit within the 16 frames of the animation:
SPRITE16 uses the PC-98's EGC to draw these single-color sprites. If the EGC is already set up, it can be set into a GRCG-equivalent RMW mode using the pattern/read plane register (0x4A2
) and foreground color register (0x4A6
), together with setting the mode register (0x4A4
) to 0x0CAC
. Unlike the typical blitting operations that involve its 16-dot pattern register, the EGC even supports 8- or 32-bit writes in this mode, just like the GRCG. ๐ As expected for EGC features beyond the most ordinary ones though, T98-Next simply sets every written pixel to black on a 32-bit write. Comparing the actual performance of such writes to the GRCG would be ๐ yet another interesting question to benchmark.
Next up: I think it's time for ReC98's build system to reach its final form.
For almost 5 years, I've been using an unreleased sane build system on a parallel private branch that was just missing some final polish and bugfixes. Meanwhile, the public repo is still using the project's initial Makefile that, ๐ as typical for Makefiles, is so unreliable that BUILD16B.BAT
force-rebuilds everything by default anyway. While my build system has scaled decently over the years, something even better happened in the meantime: MS-DOS Player, a DOS emulator exclusively meant for seamless integration of CLI programs into the Windows console, has been forked and enhanced enough to finally run Turbo C++ 4.0J at an acceptable speed. So let's remove DOSBox from the equation, merge the 32-bit and 16-bit build steps into a single 32-bit one, set all of this up in a user-friendly way, and maybe squeeze even more performance out of MS-DOS Player specifically for this use case.