- 📝 Posted:
- 🚚 Summary of:
- P0266, P0267, P0268, P0269, P0270, P0271, P0272, P0273, P0274, P0275, P0276, P0277
- ⌨ Commits:
- (mly) eb2b0c8...b356884, (mly) b356884...1c70db0, (mly) 1c70db0...db0c195, (BGM packs) 2f9bce5...45087c2, (Seihou) P0256...a9ca081, (Seihou) a9ca081...8db918f, (Seihou) 8db918f...3de48ab, (Seihou) 3de48ab...9467705, (Seihou) 9467705...241a6c9, (Seihou) 241a6c9...P0275, (Seihou) dbc369f...883ac40, (Seihou) 883ac40...6ac72f3
- 💰 Funded by:
- Ember2528, [Anonymous]
- 🏷 Tags:
📝 Over two years since the previous largest delivery, we've now got a new record in every regard: 12 pushes across 5 repos, 215 commits, and a blog post with over 14,000 words and 48 pieces of media. 😱 Who would have thought that the superficially simple task of putting SC-88Pro recordings into Shuusou Gyoku would actually mainly focus on deep research into the underlying MIDI files? I don't typically cover much music-related content because it's a non-issue as far as PC-98 Touhou code is concerned, so it's quite fitting how extensive this one turned out. So here we go, the result of virtually unlimited funding and patience:
- The SC-88Pro recording controversy
- Undefined SysEx behavior
- Resolving the controversy, and making a choice (contains personal opinion)
- A Unix-style command-line MIDI filter (in Rust BTW)
- Visualizing MIDI files (for science, and not for playing them on a keyboard)
- Shuusou Gyoku's individual loop quirks 🎺
- Rewriting pbg's MIDI code
- Putting together the BGM packs
- Outgrowing miniaudio (and raging about single-file C libraries for a while)
- Remaining implementation details
- Pricing changes (and no, not everything's getting more expensive)
So where's the controversy? Romantique Tp obviously made the best and most careful real-hardware SC-88Pro recordings of all of ZUN's old MIDIs, including the original (OST) and arranged (AST) soundtrack of Shuusou Gyoku, right? Surely all I have to do now is to cut them into seamless loops to save a bit of disk space, and then put them into the game? Let's start at the end of the track list with the name registration theme, since it's light on instruments and has an obvious loop point that will be easy to spot in the waveform. But, um… wait a moment, that very first drum note comes a bit late, doesn't it?
That's… not quite the accuracy and perfection I was expecting. But I think I know what we're seeing and hearing there. Let's look at the first few MIDI events on the drum channel:
Yup. That's the sound of a vintage hardware synth being slow and taking a two-digit number of milliseconds to process a barrage of simultaneous Program Change messages, playing a MIDI file that doesn't take this reality into account and expects program changes to happen instantly.
I can only speak from my own experience of writing MIDIs for hardware synths here, but having the first note displaced by 50 ms is very much not the way a composer would have intended the music to be heard if the note is clearly notated to occur on the beat. If you had told me about such an issue when playing one of my MIDIs on a certain synth, I would have thanked you for the bug report! And I would have promptly released a fixed version of the MIDI with the Program Change events moved back by a beat or two. In the case of Shuusou Gyoku's MIDIs, this wouldn't even have added any additional delay in-game, as all of these files already start with at least one beat of leading silence to make room for setting Roland-specific synth parameters.
OK, but that's just a single isolated bass drum hit. If we wanted to, we could even fix this issue ourselves by splicing the same note from around the loop end point. Maybe this is just an isolated case and the rest of Romantique Tp's recordings are fine? Well…
This one is even worse. Here, the delay is so long relative to the tempo of the piece that the intended five drum hits pretty much turn into four.
This type of issue doesn't even have to be isolated to the very beginning of a piece. A few of the tracks in both the OST and AST start with an anacrusis on just one or two channels and leave the Program Change event barrage at the beginning of the first full measure. In 幻想科学 ~ Doll's Phantom for example, this creates a flam-like glitch where the bass on channel 2 is pretty much on time, but the crash hit on channel 10 only follows 50 ms later, after the SC-88Pro took its sweet time to process all the Program Change events on the channels between:
Let's listen to that at half speed:
Sure, all of this is barely noticeable in casual listening, but very noticeable if you're the one who now has to cut these recordings into seamless loops. And these are just the most obvious timing issues that can be easily pinpointed and documented – the actual worst aspects are all the minor tempo and timing fluctuations throughout most of the pieces. With recordings that deviate ever so slightly from the tempo defined in the MIDI files, you can no longer rely on mathematically exact sample positions when cutting loops. Even if those positions do work out from time to time, there'd pretty much always be a discontinuity in the waveform at both ends of the loop, manifesting as a clearly audible click. In the end, the only way of finding good loop points in existing recordings involves straining your ears and listening very, very closely to avoid any audible glitches. 😩
But if you've taken a look at the second tabs in the clips above, you will have noticed that we don't necessarily have to be stuck with recordings from real hardware. In late 2015, Roland released Sound Canvas VA, a VST plugin that emulates the classic core of Roland's old Sound Canvas lineup, including the SC-88Pro. As long as we run such a software synthesizer through a quality VST host, a purely software-based solution should be way superior for recording looped BGM:
- By moving from real-time recording to an offline rendering paradigm, we get perfectly accurate note timing, as it no longer matters how long the synth takes to produce each output sample.
- We stay entirely in the digital realm instead of going from digital (SC-88Pro) to analog (RCA cable) to digital (line-in recording) again, removing any chance for noise or distortion to ruin audio quality.
- We get to directly render at 44,100 Hz instead of being limited to the 32,000 Hz signal coming out of the SC-88Pro's DAC. This can be easily noticed in the half-speed video above, whose SCVA version retains significantly more sibilant high-frequency content compared to the more muffled sound of Romantique Tp's recording.
- Good recordings from real hardware involve careful observation of the output volume. You ideally want to set the synth's master volume to a level that's simultaneously quiet enough to avoid clipping and loud enough to ensure the highest possible signal-to-noise ratio, which requires lots of trial-and-error attempts for each individual track. With a softsynth, this becomes a non-issue: 📝 As we've seen during the last look at Shuusou Gyoku's sound, rendering to 32-bit floating-point .WAV files would preserve all waveform information above 0 dBFS. So we could simply render everything at Sound Canvas VA's default maximum volume and later algorithmically scale the volume of the rendered files to fit within 0 dBFS, without having to touch SCVA's sluggish interface. (A code sketch of this scaling step follows after this list.)
- Doing that also makes it feasible to preserve loudness differences between the pieces of a soundtrack instead of eradicating them by normalizing the volume of each individual track to the digital maximum.
- Finally, it's much more time-efficient. We simply hit foobar2000's Convert button and get all MIDIs rendered within a few seconds each, instead of having to wait the entire length of a piece.
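As a side note, that scale-into-0-dBFS step is simple enough to sketch out. Here's what it could look like in Rust (the language of this delivery's tooling), assuming the 32-bit float samples have already been decoded into memory, e.g. via the hound crate; the function name is mine:

```rust
/// Scales a 32-bit float waveform so that its peak sits exactly at 0 dBFS.
/// Input samples above ±1.0 are preserved information rather than clipping,
/// which is exactly why we render to 32-bit float in the first place.
fn normalize_to_0dbfs(samples: &mut [f32]) {
    let peak = samples.iter().fold(0.0_f32, |max, &s| max.max(s.abs()));
    if peak > 0.0 {
        let gain = 1.0 / peak;
        for sample in samples.iter_mut() {
            *sample *= gain;
        }
    }
}
```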
Any drawbacks? For our use case, all of them are found in the abysmal software quality of everything around the synth engine. As is typical for the VST industry, Sound Canvas VA is excessively DRM'd – it takes multiple seconds to start up, and even then only allows a single process to run at any given time, immediately quitting every process beyond the first one with a misleading Parameter File1 Read Error message box. I totally believe anyone who claims that this makes SCVA more annoying than real hardware when composing new music. Retro gamers also dislike how Roland themselves no longer sell the 32-bit builds they used to offer for the first few versions. These old versions are now exclusively available through resellers, or on the seven seas.
But as far as the SC-88Pro emulation is concerned, there don't seem to be any technical reasons against it. There is a long thread over at VOGONS discussing all sorts of issues, but you have to dig quite deep to find any clear descriptions of bugs in SCVA's synth engine. Everything I found either only applies to the SC-55 emulation and not the SC-88Pro, was fixed by Roland in the meantime, or turned out to be a fixable bug in a MIDI file.
Nevertheless, Romantique Tp has a very negative opinion about SCVA, getting quite angry and defensive in this instance where someone favorably compared SCVA to their recordings.
Edit (2024-03-10): These days, Romantique Tp has a much more favorable opinion on SCVA as well.
8 years after their release, however, the community unanimously accepts the Romantique Tp recordings as the intended way to listen to ZUN's old MIDIs, so choosing Sound Canvas VA for our Shuusou Gyoku builds might be a bad idea purely for PR reasons. At best, people would slightly wonder why I intentionally went with the opposite of the accepted reference recordings, but at worst, this entire project could face a violent backlash…
But wait, we've already heard one obvious difference between the real SC-88Pro and Sound Canvas VA. Let's listen to the very first clip again:
Ha! You can clearly hear a panning echo in the real-hardware recording that is missing from the Sound Canvas VA rendering. That's an obvious case of a core system effect not being reproduced correctly. If even that's undeniably broken, who knows which other subtle bugs SCVA suffers from? Case closed, Romantique Tp was right all along, SCVA is trash, real hardware reigns supreme… right?
Actually, let's look closer into this one. Panning delay effects like this are typically reverb-related, but General MIDI only specifies a single controller (CC 91) to set the per-channel reverb level from 0 to 127. Any specific characteristics of the reverb therefore have to be configured using vendor-specific system-exclusive messages, or SysEx for short.
So it's down to one of the four SysEx messages at the beginning of the MIDI file:
Since these byte strings represent Roland-specific instructions, we can't learn anything from a raw MIDI event dump alone here. No problem though, let's just load these files into some old MIDI sequencer that targeted Roland synths, open its MIDI event list, and then they will be automatically decoded into a human-readable representation…
…or at least that's what I expected. In Yamaha land, XGworks has done that for Yamaha's own XG SysEx messages ever since 1997:
But for Roland synths, there's… nothing similar? Seriously? 😶 Roland fanboys, how do you even live?! I mean, they are quick to recommend the typical bloated and sluggish big-name DAWs that take up multiple gigabytes of disk space, but none of the ones I tried seemed to have this feature. They can't possibly have been flinging around raw byte strings for the past 33 years?!
But once you look more into today's MIDI community, it becomes clear that this is exactly what they've been doing. Why else would so many people use the word "complicated" to describe Roland SysEx, or call it an "old school/cryptic communication protocol in hexadecimal format"? The latter is particularly hilarious because if you removed the word "cryptic", this might as well describe all of MIDI, not just SysEx. Everything about this is a tooling issue, and Yamaha showed how easily it could have been solved. Instead, we get Sound Canvas experts, who should know more about the ecosystem than I do, making the incredible mental leap from "my DAW doesn't decode or easily generate SysEx" to "SysEx is antiquated" to "please just lift up these settings to the VST level and into my proprietary DAW's proprietary project format, that would be so much better"…
Thankfully that's not entirely true. After some more digging and configuration, I found a somewhat workable solution involving a comparatively modern sequencer called Domino:
- Download either Domino's original Japanese version or the partial English translation. The .zip file on the release page contains a full standalone build.
- Open the File → Preferences menu and associate your MIDI output device with a module map. This makes sense for SysEx encoding/generation since it can limit the options in the UI to what's actually available on your target hardware, but is also required for selecting the respective SysEx map into Domino's SysEx decoder. There is no technical reason for this, because SC-88Pro SysEx messages can be uniquely identified by the three vendor, device, and model ID bytes that every message starts with – but that would be too easy and user-friendly. The perception of SysEx being a black art must be upheld at all costs.
- Load a MIDI file and let Domino "analyze" it:
- Strangely enough, this will take quite a while – on my system, this analysis step runs at a speed of roughly 4.25 KB/s of MIDI data. Yes, kilobytes.
- Unfortunately, "control change macro restoration" also seems to mean that you don't get to see any raw bytes when selecting the respective MIDI track in the UI, but at least we get what we were looking for:
Alright, that's something we can work with. The GS Reset message is something that every Roland GS MIDI should start with, but it's immediately followed by a message that Domino failed to decode? The two subsequent reverb parameters make sense, but panning delays typically have more parameters than just a reverb level and time.
That unknown SysEx message shares much of the same bytes with the decoded ones though. So let's do what we maybe should have done all along, return to caveman, and check the SC-88Pro manual:
And that's where we find what this particular issue boils down to. The missing SysEx message is clearly intended to be a Reverb Macro command, whose value can range from 0 to 7 inclusive on the SC-88Pro, but ZUN tries to specify Reverb Macro #14h, or 20 in decimal. The SC-88Pro manual does not specify what happens if a SysEx message wants to write an invalid value to a valid address, which means that we've firmly entered the territory of undefined behavior.
Edit (2024-03-10): Romantique Tp confirmed that the real SC-88Pro clamps these Reverb Macro IDs to the supported range of 0-7. Therefore, the appropriate course of action for guaranteeing the same sound on other Roland synths would be to fix the MIDI file and specify Reverb Macro #7 instead. But since this behavior remains technically undefined, we can still argue about ZUN's intention behind specifying the Reverb Macro like this:
- Clearly, ZUN did want to specify a valid Reverb Macro, but made a typo when manually entering the SysEx byte string, as he was forced to do thanks to terrible tooling. He clearly liked the resulting sound though, so the track should still be preserved with the panning reverb intact.
- Clearly, the typical behavior for MIDI synths is to ignore invalid and unsupported SysEx messages, because validating user input is an important characteristic of quality software. This is what SCVA does, and what we hear in its rendering is the default hall reverb with ZUN's level and time adjustments. Therefore, SCVA is right, and the fact that we get a panning delay on the real SC-88Pro is a bug in real hardware.
- Clearly, ZUN did not care enough about the reverb to specify a valid Reverb Macro. Whether we get the default reverb or a panning delay is an irrelevant performance detail, and does intentionally not matter when it comes to the intended sound of this track – especially since these four SysEx messages are the full extent of Roland GS-specific sound design in this piece, and the rest of it only uses standard MIDI features.
In fact, 32 out of the 39 MIDIs across both of Shuusou Gyoku's soundtracks use this invalid Reverb Macro. The only ones that don't are
- both versions of Gates' theme (天空アーミー), which use the equally invalid Reverb Macro #11,
- both versions of Milia's theme (プリムローズシヴァ), which use Reverb Macro #0 (Room 1),
- and, again, the three arranged MIDIs that ZUN released last (シルクロードアリス, 魔女達の舞踏会, and 二色蓮花蝶 ~ Ancients), which feature a more detailed effect setup with custom chorus and EQ settings. In the case of Reimu's theme, these settings are even commented within the MIDI file.
And that's where this quest seemed to end, until Romantique Tp themselves came in and suggested that I take a closer look at the GS Advanced Editor, or GSAE for short.
I was aware of this tool, but hadn't initially considered it because it's always described as just a SysEx generator/encoder. In fact, the very existence of such a tool made no sense to me at first, and seemed to prove my point that the usability of GS SysEx was wholly inferior to what I was used to in Yamaha land. Like, why not build at least a tiny and stripped-down MIDI sequencer around this functionality that would allow you to insert SC-88Pro-specific messages at any point within a sequence, and not just the beginning? I can see the need for such a tool in today's world of closed-source DAWs where hardware MIDI modules are niche and retro and are only kept alive by a small community of enthusiasts. But why would its developers guarantee that MIDI composers would have to hop between programs even back in 1997? I can only imagine that they saw how every just slightly advanced MIDI sequencer or DAW back then already used its own project format instead of raw Standard MIDI Files, and assumed that composers would therefore be program-hopping anyway?
However, GSAE does support the import of settings from a MIDI file and features a SysEx history window that decodes every newly processed Roland SysEx byte string, which is all I was looking for. So let's throw in that same MIDI and see how GSAE decodes the F0 41 10 42 12 40 01 30 14 7B F7 message at the top. Now that's some wild numbers. An equally invalid Reverb Character, and Reverb Level and Time values that even exceed their defined range of 0-127? Could it be that GSAE emulates the real-hardware response to invalid Reverb Macros here, and gives us the exact reverb setting we can hear in Romantique Tp's recording? This could even be the reason why GSAE is still used and recommended within today's Roland MIDI sequencing scene, and hasn't been supplanted by some more modern open-source tool written by the community.
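Decoding these messages generically isn't hard, by the way, which makes the sorry tooling situation all the more annoying. Here's a Rust sketch of the non-model-specific part, covering just single-packet DT1 (Data Set 1) messages and Roland's documented checksum; mapping the address to a parameter name and its valid range would additionally require the address table from the SC-88Pro manual, and all names here are mine:

```rust
/// Decodes a single-packet Roland GS "Data Set 1" (DT1) SysEx message into
/// its address and data bytes, after verifying the checksum.
fn decode_gs_dt1(msg: &[u8]) -> Option<(u32, &[u8])> {
    // Layout: F0 41 <device> 42 12 <address: 3 bytes> <data…> <checksum> F7
    if msg.len() < 11 || msg[0] != 0xF0 || msg[msg.len() - 1] != 0xF7 {
        return None;
    }
    if msg[1] != 0x41 /* Roland */ || msg[3] != 0x42 /* GS */ || msg[4] != 0x12 /* DT1 */ {
        return None;
    }
    let (body, tail) = msg[5..].split_at(msg.len() - 7); // address + data
    // Roland's checksum: address, data, and checksum bytes must sum to a
    // multiple of 128.
    let sum = body.iter().map(|&b| u32::from(b)).sum::<u32>() + u32::from(tail[0]);
    if sum % 128 != 0 {
        return None;
    }
    let address = (u32::from(body[0]) << 16) | (u32::from(body[1]) << 8) | u32::from(body[2]);
    Some((address, &body[3..]))
}

fn main() {
    // ZUN's Reverb Macro message from above: address 40 01 30, data 14h.
    let zun = [0xF0, 0x41, 0x10, 0x42, 0x12, 0x40, 0x01, 0x30, 0x14, 0x7B, 0xF7];
    assert_eq!(decode_gs_dt1(&zun), Some((0x40_01_30, &[0x14_u8][..])));
}
```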
In any case, these values have to come from somewhere, so let's reverse-engineer GSAE and figure out the logic behind them. Shoutout to IDR for being a great help with its automatic generation of IDC debug symbols for the Delphi standard library, and even including a few names of application-level widget class methods by reading Delphi-specific type information from the binary. This little sub-project also made me come around to appreciating Ghidra, whose decompiler and data type manager helped a lot and allowed me to find the relevant code section within just a few hours.
A~nd it turns out that the values all come from out-of-bounds accesses into arrays on the stack. If we combine 25, 235, and 132 back into a 32-bit value, we get 0x19EB84, which is the virtual address of the relevant function's stack frame base pointer.
But it gets even more hilarious: If you enable debug text output via Option → Other Options → SMF → Insert text events to setup measures and export these imported settings back into a MIDI file, GSAE not only retains these invalid Reverb Macro IDs, but stringifies them via a simple lookup into a hardcoded string pointer array, again without any bounds checks. The effects of this are roughly what you would expect:
- Reverb Macro IDs between 8 and 27 simply insert wrong strings from adjacent string pointer arrays
- Reverb Macro 28 crashes GSAE
- Reverb Macro 64 causes GSAE to vomit 65,512 bytes of garbage into the MIDI file
In the end, we have Domino not decoding the Reverb Macro message, and GSAE, the premier SysEx tool for Roland synths, responding to it in even more undefined and clearly bugged ways than real hardware apparently does. That's two programs confirming that whatever ZUN intended was never supposed to work reliably. And while we still don't know exactly what these reverb parameters are supposed to be, these observations solve the mystery as far as I'm concerned, and solidify my personal opinion on the matter.
So what do we do now, and which version do we go with? Optimally, I'd offer both versions and turn this controversy into a personal choice so that everybody wins… and Ember2528 agreed and generously provided all the funding to make it happen. 💸
If you haven't picked your favorite yet, here are some final arguments:
The Romantique Tp recordings certainly have something going for them with their provenance of coming from real hardware, and the care that Romantique Tp put into manually recording every single track, warts and all. I wholeheartedly agree that preserving the raw sound of playing the MIDI files into the hardware without thinking about bugs or quirks is an important angle to take when it comes to preservation. It's good that these recordings exist – after all, you wouldn't know which musical elements you'd possibly be missing in an emulation if you have nothing to compare it to. Even the muffled sound in the half-speed clip above can be an argument in their favor, as the SC-88Pro's DAC operates at 32 kHz and you wouldn't expect any meaningful frequency content between 16,000 and 22,050 Hz to begin with. Any frequency content in that range that does remain in Romantique Tp's recording is simply 📝 rolled-off imaging noise added during the ADC's resampling process.
All this is why they are a definite improvement over kaorin's 2007 recordings of only the AST, which used to be the previous reference recordings within the community. Those had all of the same timing issues and more, in addition to being so excessively volume-boosted that 0.15% of the samples across the entire soundtrack ended up clipped. That's 6.25 seconds out of 68:39m being lost to pure digital noise.
Most importantly though: ZUN himself said that only the real SC-88Pro will play back these files as he intended them to sound. This quote is likely where the tagline of Romantique Tp's entire recording project came from in the first place:
> 全てのデエタはSC-88ProもしくはSC-8850(ロオランド社)にて最適に聴けるように調整してあります
> それ以外の音源でも、作者の意図した音ではない場合があります。
>
> (All data has been adjusted to sound best on an SC-88Pro or SC-8850 (Roland). On any other sound source, it may not be the sound that the composer intended.)
>
> — ZUN on 東方幻想的音楽, his old MIDI page
However. ZUN is not exactly known for accurately and carefully preserving the legacy of his series, or really doing anything beyond parading his old games as unobtainable showpieces at conventions. With all the issues we've seen, preferring real hardware is ultimately just that: an angle, and a preference. This is why I disagree with the heavy and uncritical advertising that is mainly responsible for elevating the Romantique Tp recordings to their current reference status within the community, especially if at least half of the alleged superiority of real hardware is founded on undefined behavior that can easily be fixed in the MIDI files themselves if people only bothered to look.
Here's where I stand: MIDI files are digital sheet music first and foremost, not an inferior version of tracker modules where the samples are sold separately. As such, the specific synth a MIDI file was written for is merely a secondary property of the composition – and even more so if the MIDI file contains little to nothing in terms of sound design and mostly restricts itself to the basic feature set of General MIDI. In turn, synth quirks and bugs are not a defined part of the composition either, unless they are clearly annotated and documented in the file itself. And most importantly: If the MIDI file specifies a certain timing and a recording fails to reproduce that timing, then that recording is not an accurate representation of the MIDI file.
In that regard, Sound Canvas VA is not only the closest alternative to the real thing, as a few people in the MIDI and retrogaming scene do have to admit, but superior to the real thing. I'll gladly take clarity and perfect timing accuracy in exchange for minor differences in effects, especially if the MIDI file does not explicitly and correctly define said effects to begin with. If I want a panning delay as part of the reverb, I add the respective and correct SysEx message to define one – and if I don't, I do not care about the reverb. You might still get a panning delay on a certain synth, and you might even prefer how it sounds, but it's ultimately a rendering artifact and not a consciously intended part of the composition. In that way, it's similar to the individual flavor a musician adds to a performance of a piece of classical music.
And as far as the differences in frequency response and resonant filters are concerned: In Yamaha land, these are exactly the main distinguishing factors between vintage WF-192XG sound cards (resembling the real SC-88Pro in these characteristics) and the S-YXG50 softsynth (resembling SCVA). Once I found out about that softsynth and how much clearer it sounded in comparison, I sold that old PCI sound card soon after.
In the interest of preservation though, there's still one more unexplored solution that could be the ideal middle ground between the two approaches:
- Play the MIDIs through a real-hardware SC-88Pro again
- Capture the actually observed system-exclusive settings that fall within the synth's supported and documented ranges
- Insert them back into the MIDI file, creating a new bugfixed version
- Re-record that bugfixed version through Sound Canvas VA
Edit (2024-03-10): And since Romantique Tp has confirmed what exactly happens on real hardware, I'm going to do exactly that. These bugfixed Sound Canvas VA renderings will be a free bonus of the single next Shuusou Gyoku push, and will add another angle to the preservation of these soundtracks. In the meantime though, the Sound Canvas VA packs will sound like they do in the preview videos above.
Or, you know… Maybe none of this actually matters. Here's beatMARIO streaming some Shuusou Gyoku gameplay using what looks like a real-hardware SC-8850, which plays these MIDIs with occasionally noticeably different instrument patches and no panning delay in the name registration theme, and he still enjoyed every second of it. Imagine undefined SysEx behavior not even being consistent within the same family of Roland synths… nah, I'm done arguing, let's get back to the actual work and cut some loops.
Just to be clear: I'm not suggesting that Romantique Tp should have been the one to cut their recordings into loops, or even just the one who defined where the loop points are supposed to be. On the surface, this seems to be a non-issue, and you'd just pick a point wherever each track appears to loop, right? But with 39 MIDIs to cut and all the financial support from Ember2528, it made sense to also solve this problem more thoroughly, and algorithmically detect provably correct loop points for all of these files. Who knows, maybe we even find some surprises that make it all worth it?
This is the algorithm I came up with:
- At a basic level, we loop over the list of MIDI events and return the earliest and longest subrange that is immediately followed by an identical copy. (A code sketch of this scan follows after this list.)
- MIDI players, however, need loop point definitions that use MIDI pulse units rather than event list indices. This is especially necessary for multi-track/SMF Type 1 sequences, which would otherwise require one loop start/end index pair per track, and then it still wouldn't work because some of the tracks might not even have an event at the loop start/end point. This requires the detection algorithm and the player to agree on how to map event indices to time points and back, and simply going for the first event of each pulse (i.e., any event with a nonzero delta time) makes the most sense here. In turn, we can skip any potential start or end events that have a delta time of 0, speeding up the algorithm significantly for typical compositions with a high degree of polyphony.
- Naively considering just the raw MIDI events works for MIDI playback. But as soon as we want to cut a recording based on the detected loop points, we need to account for the fact that MIDI playback is inherently stateful. Each of the 16 channels at the protocol level features at least the 128 continuous controllers (CCs) with a 7-bit state, the 14-bit pitch bend controller, and the 7-bit instrument program value, in addition to the global tempo of the piece. As a result, two ranges of events might look identical, but can still sound differently if the events before the first range changed one piece of state which is then only touched again near the end of that range. This requires us to track the full MIDI state at both the start and end of a loop, and reject any potential loop that differs in these states:
```
Delta   Pulse   Beat     Event
+0      1200     2:240   Controller { CC 7, value 50 }
+240    1440     3:000   NoteOn { Key 60 }
+240    1680     3:240   NoteOn { Key 65 }
+240    1920     4:000   NoteOn { Key 67 }
+240    2160     4:240   NoteOn { Key 70 }
+240    2400     5:000   Controller { CC 7, value 100 }
+0      2400     5:000   NoteOn { Key 67 }
+240    2640     5:240   NoteOn { Key 65 }
+240    2880     6:000   NoteOn { Key 60 }
+240    3120     6:240   NoteOn { Key 65 }
+240    3360     7:000   NoteOn { Key 67 }
+240    3600     7:240   NoteOn { Key 70 }
+240    3840     8:000   Controller { CC 7, value 100 }
+0      3840     8:000   NoteOn { Key 67 }
+240    4080     8:240   NoteOn { Key 65 }
+240    4320     9:000   NoteOn { Key 60 }
+240    4560     9:240   NoteOn { Key 65 }
+240    4800    10:000   NoteOn { Key 67 }
+240    5040    10:240   NoteOn { Key 70 }
+240    5280    11:000   Controller { CC 7, value 100 }
+0      5280    11:000   NoteOn { Key 67 }
+240    5520    11:240   NoteOn { Key 65 }
```
In this example, a naive event-level scan would detect a loop between beats 3 and 6, as the same events are immediately repeated between beats 6 and 9. However, the piece starts with the first four notes at a channel volume of 50, which is only set to its later value of 100 on beat 5. Therefore, the actual loop ranges from beat 5 to 8. In turn, the piece needed to be at least 11 beats long to include the full second copy of the looped events and prove the loop as such.
- This check can be a bit too strict in some cases, though. A channel might start with one of its CCs at a specific value but then change the same CC to a different value at a later point before playing the first note. In such a case, the detected loop would be delayed to the second CC change even though the initial CC value has no impact on the sound. By filtering these redundant CC changes, we get to move the loop start point of a few tracks (original 夢機械 ~ Innocent Power and arranged 魔法少女十字軍) back by a few seconds, to the position you'd expect.
- Finally, we reject any overlong loops that themselves fully consist of multiple successive copies of the first N events.
Shuusou Gyoku's original MIDI files hide the original game's lack of MIDI looping by simply duplicating the looping sections enough times so that a typical player won't notice. The algorithm we have so far, however, would return a much longer loop if a MIDI file contains more than three successive copies of a looping section. The original version of ハーセルヴズ in particular repeats its 8 looping bars a total of 15 times before the MIDI ends, and this condition is necessary to detect the actual 8-bar loop instead of a 56-bar one.
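Here's a Rust sketch of the core of the first and third points: the basic event-level scan, together with the shape of the state that has to match at both loop points. All names are illustrative rather than mly's actual API, and the pulse mapping, redundant-CC filtering, and overlong-loop rejection from the other points are left out:

```rust
/// The per-channel state that must be equal at the start and the end of a
/// candidate loop. Field widths follow the MIDI protocol.
#[derive(Clone, PartialEq)]
struct ChannelState {
    controllers: [u8; 128], // 7-bit continuous controller (CC) values
    pitch_bend: u16,        // 14-bit
    program: u8,            // 7-bit instrument program value
}

#[derive(Clone, PartialEq)]
struct MidiState {
    channels: [ChannelState; 16],
    tempo_us_per_quarter: u32, // the global tempo of the piece
}

/// Returns (start, length) of the earliest occurrence of the longest event
/// subrange that is immediately followed by an identical copy of itself and
/// whose accumulated MIDI state matches at both ends. Brute force; the real
/// implementation parallelizes the scan (see further below).
fn find_loop<Event: PartialEq>(
    events: &[Event],
    state_before: impl Fn(usize) -> MidiState, // state before a given event
) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None;
    for start in 0..events.len() {
        // A full second copy must fit before the end of the sequence.
        let max_len = (events.len() - start) / 2;
        // Check the longest possible length first; nothing shorter at the
        // same start point can beat it.
        for len in (1..=max_len).rev() {
            if events[start..(start + len)] != events[(start + len)..(start + 2 * len)] {
                continue;
            }
            // Reject candidate loops that differ in state at both ends.
            if state_before(start) != state_before(start + len) {
                continue;
            }
            if best.map_or(true, |(_, best_len)| len > best_len) {
                // Only strictly longer loops win, keeping the earliest one.
                best = Some((start, len));
            }
            break;
        }
    }
    best
}
```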
Of course, this algorithm isn't perfect and won't work for every MIDI file out there. It doesn't consider things like differently ordered events within the same MIDI pulse, (non-)registered parameter numbers, or the effect that SysEx messages can have on the state of individual channels. The latter would require the general SysEx decoding logic that I would have liked to have for the research above… actually, let's add an issue and add the project to the order form. I'd really like to see a comprehensive open-source cross-vendor SysEx decoder library in my lifetime.
As for the implementation, I was happy to write some Rust again for a change, as it's a great fit for these standalone greenfield command-line tools that don't have to directly interact with the legacy C++ code bases that this project usually deals with. It's even better if the foundational functionality is not just available in a crate, but in four, with the community already having gone through multiple iterations to arrive at a tried and tested winner. Who knows, maybe I even get to rewrite this website in it one day? Just for the sheer meme value of doing so, of course.
I also enjoyed this a lot from a technical point of view:
- You might think that Rust's typical safety guarantees don't matter for the problem at hand. But then you accidentally write `-=` instead of `+=` for a `u32` that starts out at 0, and Rust immediately panics instead of silently underflowing to `u32::MAX`. This must have saved me at least 5 minutes of debugging the resulting logic error.
- As it turns out, my loop detection algorithm is embarrassingly parallel. You might initially think about it in a sequential way because we always want the earliest occurrence of the longest repeating section of MIDI events, which means that each new loop candidate further into the track has to be longer than the previous one. But since we always iterate over the entire MIDI, it makes perfect sense to divide and conquer the problem. Let's split the list of possible loop end points into equal chunks, scan them all in parallel for the earliest and longest loop within that chunk, and then pick the earliest and longest loop among those intermediate results as the final one. In Rust, you don't even have to think much about the chunks, as all of that can be easily done by replacing the iteration with Rayon's parallel fold and adding a `reduce()` with the same condition for the final step, as sketched below. This sped up the algorithm by exactly the number of cores in my system.
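In case you're curious, here's a sketch of that structure, with `candidate_at` standing in for the expensive per-end-point scan; as before, these aren't mly's actual names:

```rust
use rayon::prelude::*;

/// A candidate loop as (start_event, length_in_events).
type Candidate = (usize, usize);

/// Picks the better of two optional candidates:
/// the longest one wins, and the earliest start point breaks ties.
fn better(a: Option<Candidate>, b: Option<Candidate>) -> Option<Candidate> {
    use std::cmp::Reverse;
    match (a, b) {
        (Some((sa, la)), Some((sb, lb))) => {
            if (lb, Reverse(sb)) > (la, Reverse(sa)) { b } else { a }
        }
        _ => a.or(b),
    }
}

/// Scans all possible loop end points in parallel.
fn best_loop(
    end_points: &[usize],
    candidate_at: impl Fn(usize) -> Option<Candidate> + Sync,
) -> Option<Candidate> {
    end_points
        .par_iter()
        // Rayon transparently splits `end_points` into per-thread chunks;
        // `fold` reduces each chunk to its best candidate…
        .fold(|| None, |best, &end| better(best, candidate_at(end)))
        // …and `reduce` merges the per-chunk winners, using the same
        // condition for the final step.
        .reduce(|| None, better)
}
```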
This algorithm works well for the long MIDI files of Shuusou Gyoku's OST that all contain multiple duplicates of their loop section, but it quickly reaches its limit with the AST. Following the classic two-loop + fade-out format, that soundtrack was meant to be played back in generic MIDI players, and not to actually be put back into the game in looped form. Since the loop algorithm did, in fact, find inconsistencies even in the OST, two copies of the apparent loop are sometimes not enough to prove cases where the actual loop ends much later than you think it does. In a few cases, it would be enough to simply remove all volume change events from the fade-out to prove the actual loop, but in others, the algorithm would need MIDI event data far past the end of the fade-out.
However, just giving up and not looping any of these tracks would be equally unfortunate. So how about shifting the question, from "what's the best loop in this MIDI file" to "what's the best loop if the MIDI didn't fade out and instead repeated its apparent second loop a third time"? As long as the detected loop in such a pre-processed file ends before the repeated range, it's still a valid loop in terms of the unmodified original.
Ideally, we want to do this pre-processing programmatically with the same Rust library instead of manually editing the MIDI. Many sequencers (and especially XGworks) apply significant changes to a MIDI file's internal structure when saving its internal representation back to a MIDI file, which might even mess with our loop algorithm. So it would be very nice to have a more trustworthy tool that applies only the edit we actually want, and perfectly retains the rest of the MIDI.
And that's how this sub-project turned into a small suite of command-line MIDI operations in the classic Unix filter/pipeline style: Each command reads a MIDI file from `stdin`, transforms it, and outputs text or the resulting MIDI file on `stdout`. This way, we gain maximum transparency and reproducibility, as I can document the unique pre-processing steps for each AST track by simply providing the command lines. And sure, we're re-encoding and re-decoding the full MIDI sequence at every step along such a pipeline, but computers are fast, Rust and the midly library in particular are ⚡ blazingly fast ⚡, and the usability benefits of this pipeline model far outweigh any theoretical performance drops.
Here's the full list of commands that made it into the resulting `mly` tool:
- cut: Extremely basic removal of MIDI events within a certain range.
- dump: Dumps all MIDI events into a textual table. All event lists in this blog post are based on this output.
- duration: Shows the duration of a MIDI file in pulses, beats, seconds, and PCM samples.
- filter-note: Removes all Note On events within a certain range, retaining all other events. This allows us to generate separate intro and loop MIDIs, whose renderings we can then splice back into a single loopable waveform with no discontinuities, which is not guaranteed when rendering a single MIDI file. This provides the last missing piece needed for rendering perfect, sample-accurate loops through Sound Canvas VA.
- loop-find: The loop detection algorithm described above.
- loop-unfold: Duplicates MIDI events from a given point to the end of the track. A budget solution for the problem of creating synthetic loops – arbitrary copying of arbitrary subranges to arbitrary destinations would have been undeniably nicer, but also much more complex, and I didn't need that full flexibility for the task at hand.
- smf0: Flattens multi-track/SMF Type 1 MIDI sequences into single-track/SMF Type 0 ones. Having this conversion as a distinct operation in our toolset allows other operations to exclusively support SMF Type 0 if a Type 1 implementation would either take significant additional effort or just duplicate the Type 0 flattening algorithm. This group of operations includes loop-find, cut, and even the real-time output for duration, because tempo events can theoretically occur on any track.
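As an aside, the core idea behind such a flattening operation is pleasantly simple: convert each track's delta times into absolute pulses, merge all tracks sorted by pulse, and re-derive the delta times. A sketch with an illustrative `Event` type (not midly's), which ignores the per-track end-of-track meta events that would also need merging:

```rust
struct Event {
    delta: u32,       // delta time in pulses
    message: Vec<u8>, // raw MIDI message bytes
}

fn flatten_to_smf0(tracks: Vec<Vec<Event>>) -> Vec<Event> {
    // 1. Tag every event with its absolute pulse and its track index.
    let mut absolute: Vec<(u64, usize, Vec<u8>)> = Vec::new();
    for (track_index, track) in tracks.into_iter().enumerate() {
        let mut pulse: u64 = 0;
        for event in track {
            pulse += u64::from(event.delta);
            absolute.push((pulse, track_index, event.message));
        }
    }
    // 2. Sort by (pulse, track). The sort is stable, so simultaneous events
    //    keep their original order within each track.
    absolute.sort_by_key(|&(pulse, track_index, _)| (pulse, track_index));
    // 3. Re-derive the delta times relative to the previous event.
    let mut previous: u64 = 0;
    absolute
        .into_iter()
        .map(|(pulse, _, message)| {
            let delta = (pulse - previous) as u32;
            previous = pulse;
            Event { delta, message }
        })
        .collect()
}
```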
This feature set should strike a good balance between not spending too much of the Shuusou Gyoku budget on tangential problems, but still offering a decent solution for the problem at hand. As a counterexample, the obvious killer feature – deserializing a `dump` back into a Standard MIDI File – would have gone way past the budget. While there are crates that free you from the need to write manual parsing code for basic data structures, they would instead require a lot of attribute boilerplate – and if the library that provided the structures doesn't already come with these attributes, you now have to duplicate all the structures, and convert back and forth between the original structures and your copies. Not to mention that we'd still have to write code for the high-level structure of the `dump` output…
If we put it all together, this is what we can do:
```
$ <ssg_02.mid mly loop-find

Best loop in note space: 4 events (between event #[117, 121[ and [121, 125[)
First note: event 71 / pulse 960 / beat 2:000 / 0:00:800m
Loop start: event 117 / pulse 1680 / beat 3:240 / 0:01:400m
Loop end:   event 121 / pulse 1920 / beat 4:000 / 0:01:600m

$ <ssg_02.mid mly cut 466: | mly loop-unfold 240: | mly -r 44100 loop-find

Track #0: Removing events #[16439, 19881[
Track #0: Repeating events #[8344, 16439[ at the end of the sequence

Best loop in note space: 8095 events (between event #[5625, 13720[ and [13720, 21815[)
First note: event 71 / pulse 960 / beat 2:000 / 0:00:800m
Loop start: event 5625 / pulse 75361 / beat 157:001 / 1:03:531m
Loop end:   event 13720 / pulse 183841 / beat 383:001 / 2:34:726m

Best loop in recording space: 8095 events (between event #[5709, 13804[ and [13804, 21899[)
First note: event 71 / pulse 960 / beat 2:000 / 0:00:800m / sample 35280.00
Loop start: event 5709 / pulse 77280 / beat 161:000 / 1:05:163m / sample 2873667.66
Loop end:   event 13804 / pulse 185760 / beat 387:000 / 2:36:358m / sample 6895375.27
```
Translation:
- The best loop found in the raw MIDI file spans 4 events and 200 milliseconds. Clearly, this is not the loop we're looking for.
- Let's cut off all events from the start of the fade-out to the end, do a loop-unfold copy of all events from the position during the apparent second loop that corresponds to where the fade-out started, and try looking for a loop in that modified MIDI.
- The resulting loop is 1:31m long, which is exactly what we were hoping to find.
- The note space loop represents the earliest possible event range with equivalent per-channel controller and pitch bend state at both ends. This loop is only appropriate for MIDI players, as its bounds can fall into the middle of notes that are played with a different channel state at the start and end of the loop. This is why it doesn't show any sample positions.
- The recording space loop ensures that this doesn't happen. It's also always placed on a Note On event with non-zero velocity, which eases the splicing of separate `filter-note` recordings. This way, it's enough to remove leading silence from the loop part and mix it exactly at the indicated sample position.
- The detected loop is also nowhere close to the cut point at beat 466, matching our condition for validity. All events within the loop came from ZUN's original composition, and the cut/loop-unfold combo merely provided the remaining 63% of events necessary to prove this loop as such.
So, where are these loop quirks that justify why some of these audio files are longer than you'd think they should be? Just listing them as text wouldn't really communicate just how minor these are. It would be much nicer to visualize them in a way that highlights the exact inconsistencies within a fixed range of MIDI measures. Screenshots of MIDI sequencer or DAW windows won't capture these aspects all too well because these programs are geared toward fine-grained editing of single tracks, not visualization of details across all channels.
Typical MIDI visualizers, however, are on the complete opposite end of the spectrum. In recent years, "MIDI visualization" has become synonymous with the typical Synthesia style of YouTube videos: a big keyboard at the bottom, note bars flying in from the top, and optional fancy effects once those notes hit the top of the keyboard. The Black MIDI community has been churning out tons of identical-looking MIDI visualizers that mainly seem to differ in the programming language they're written in, and in how well they can cope with the blackest of black MIDIs.
Thankfully, most of these visualizers are open-source and have small and manageable codebases. The project with the most GitHub stars and the most generic name seemed to be the best starting point for hacking in the missing features, despite using GLSL shaders which I had no prior experience with. It was long overdue that I did something with GLSL though – it added a nice educational aspect to these hacks, and it still was easier than deciphering whatever the fastest and hyper-optimized Rust visualizer is doing.
Still, this visualizer needed a total of 18 small features and bugfixes to be actually usable for demonstrating Shuusou Gyoku's loop quirks. As such, these hacks turned into yet another tangential sub-project that could have easily consumed another two pushes if I cleaned up the code and published the result. But that would have really gone way past the budget for something that people might not even care about. So here's what we're going to do:
- I've added this MIDI visualizer as a new goal to the order form. This goal is eligible for microtransactions, so you don't have to fund a full push to see the first changes committed and released.
- The upstream project seems to have been abandoned recently, which is the perfect excuse for not even trying to merge in my sweeping changes with a series of pull requests. The code sure needs a lot of cleanup and deduplication, and especially a more build system-friendly way of embedding its shader source code.
- Every backer who supports this goal with at least 0.1 pushes or microtransactions will get a Windows binary with my current hacked-in changes as a preview, immediately after the purchase. Shoutout to the MIT license for letting me do this 😛
As usual, once the code is done, the final cleaned-up version will be available for free for everyone, in both source code and binary release form.
Alright then! Here's how to read the visualizations:
- The transparency of each note represents its velocity multiplied by the channel volume and expression. To spot volume inconsistencies, you'd compare the opacity of equivalent notes in the two ranges. (A one-line code version of this formula follows after this list.)
- The X-axis of these visualizations uses linear/real time, so the width of each measure represents the exact time it takes to be played relative to the other measures in the visualized range. To spot tempo inconsistencies, you'd compare the distance between the bar lines.
- Notes that are duplicated on two or more channels may be colored differently in the loop start and end views. These are rendering order inconsistencies and don't communicate anything about the MIDI.
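For reference, the opacity formula from the first point is nothing more than a product of the three normalized 7-bit values. In code (the function name is mine):

```rust
/// Opacity of a rendered note: its velocity, scaled by the channel's
/// current volume (CC 7) and expression (CC 11). All inputs are 7-bit.
fn note_opacity(velocity: u8, volume: u8, expression: u8) -> f32 {
    (f32::from(velocity) * f32::from(volume) * f32::from(expression))
        / (127.0 * 127.0 * 127.0)
}
```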
- Stage 1 theme (フォルスストロベリー), original and arranged version: The string and harmonica channels are slightly louder on the apparent first loop than on the others.
  - Apparent loop: 0:01m – 1:31m
  - Actual loop: 1:04m – 2:34m
- Mei and Mai's theme (ディザストラスジェミニ), arranged version: The one and only quirk that's caused by different notes – the first loop has an E♭ on the slap bass channel in measure 32, but the second loop has a G♭ in the corresponding measure 72.
  - Apparent loop: 0:01m – 1:02m
  - Actual loop: 0:50m – 1:51m
- Stage 3 theme (華の幻想 紅夢の宙), original and arranged version: The trumpet channel starts out panned to the center of the stereo field (64), before being left-panned by 25% (48) at 1:04m, where it stays for the rest of the track.
  - Apparent loop: 0:01m – 1:29m
  - Actual loop: 1:04m – 2:32m
  - I didn't come up with a good way of visualizing panning in a 2D plane, so you have to trust your ears with this one.
- Marie's theme (機械サーカス ~ Reverie), arranged version: Every apparent loop modulates up by a semitone 16 measures before it ends, and remains in that new key at the start of the next loop, so the piece technically doesn't loop at all. The original stays in G♯m throughout.
- Stage 5 theme (カナベラルの夢幻少女), original version: The ritardando near the supposed end of the first loop drops from 145 BPM to 118 BPM, but only to 129 BPM in all further loops.
  - Apparent loop: 0:01m – 1:39m
  - Actual loop: 1:33m – 3:11m
  - The loop start and end points are in the respective next measure past this range.
- Stage 6 theme (アンティークテラー), arranged version: The string channel starts out with the maximum expression of 127, but then only goes up to 120 after some fading notes later in the piece, where it stays for the beginning of the second loop.
  - Apparent loop: 0:01m – 1:53m
  - Actual loop: 0:13m – 2:05m
  - Same here.
- VIVIT-captured-'s first theme (夢機械 ~ Innocent Power), arranged version: Has a unique ending section that starts in Gm and then modulates through Em and Fm before it fades out on F♯m.
- VIVIT-captured-'s second theme (幻想科学 ~ Doll's Phantom), original and arranged version: Another fade-related 127 vs. 120 expression inconsistency, this time on the orange square channel.
  - Apparent loop: 0:01m – 1:32m
  - Actual loop: 1:03m – 2:34m
- VIVIT-captured-'s third theme (少女神性 ~ Pandora's Box), original and arranged version: Another tempo inconsistency: a slightly differently shaped ritardando before the bell tree hit in the supposed first loop.
  - Apparent loop: 0:40m – 1:57m (original), 0:41m – 1:59m (arranged)
  - Actual loop: 1:17m – 2:34m (original), 1:18m – 2:37m (arranged)
- Marisa's theme (魔女達の舞踏会), arranged version: Has a unique 8-bar ending section that is first played in Cm and then loops in C♯m while fading out.
- Ending theme (ハーセルヴズ), arranged version: Probably the best-known one out of these, and I'm talking of course about the beautiful ending section. I'm making the executive decision to not loop this track in-game, and letting it fade to silence instead.
Before we package up these looped soundtracks, let's take a quick look at how they would be shown off in the Music Room. The Seihou Music Rooms carry over the per-channel keyboards from TH05, add the current per-channel volume, expression, and pan pot values, and top it off with a fake spectrum analyzer. All of these visualizations rely on MIDI data, and the Music Room would feel very dull and boring without them. Just look at Kioh Gyoku, whose Music Room basically turns into a still image in WAVE mode.
Retaining these visualizations even when playing waveform BGM was very important for me, and not just because it would make for a unique high-quality feature that would break new ground. It can also double as proof that the waveform versions are, in fact, in perfect sync with both the MIDIs they are based on, and, by extension, the respective stage scripts.
However, this would require the game to process the MIDIs and update the internal visualization state without simultaneously playing them back through the WinMM / MME / `midiOut*()` API. And just like graphics and text rendering, Shuusou Gyoku's original code came with zero architectural separation between platform-independent processing logic and platform-specific playback…
So I accidentally rewrote almost the entire MIDI code to achieve said separation. This also provided a great occasion to modernize this code and add some much-needed robustness for potential MIDI mods, while retaining the original code's approach of iterating over raw SMF byte streams. It might all have been very excessive for a delivery that was supposed to be just about waveform BGM support, but on the plus side, MIDI output is now portable to any other system's MIDI API as well.
Surprisingly though, it was Shuusou Gyoku's original MIDI timing that quickly turned out to be rather inaccurate, and not the waveforms. The exact numbers vary depending on the piece, but the game played back every MIDI about 1% slower than notated, adding about 2 or 3 seconds to their total playback time after 5 minutes. Tempo changes in particular were the biggest causes of desynchronizations with the waveforms…
To understand how this can happen to begin with, we have to look closer at how you're supposed to use the `midiOut*()` API. This API is as low-level as it gets, only covering the transmission of a single MIDI message to the selected output device right now. There is no concept of note timing at this low level, so it's completely up to the program to parse delta times and tempo change events out of the MIDI file and correctly time the calls to this API for each MIDI message. With all the code that runs between the API and the actual renderer of the synth for every single message, the resulting timing can only ever be an approximation of the MIDI file. This doesn't really matter for the timescales and polyphony levels of typical music because, again, computers are fast, but such an API is fundamentally unsuitable for accurately playing back even just a moderately complex million-note Black MIDI.
Shuusou Gyoku handles this required manual timing in the simplest possible way: It runs a MIDI processing function (`Mid_Proc()` in the code) at an interval of 10 ms, which processes and instantly sends out all MIDI events that have occurred at any point within the last 10 ms, maintaining merely their order. This explains not only why the original game incremented its `MIDI TIMER` by multiples of 10, but also the infamous missing drums when playing the soundtrack through the Microsoft GS Wavetable Synth:
- ZUN reduced all drum notes to the minimum possible length allowed by the 480 PPQN pulse resolution of these MIDI files.
- In regular music notation, this corresponds to 1/1920th notes.
- While the exact real-time length in purely mathematical terms depends on the tempo of a piece, it only has to be ≥13 BPM for a 1/1920th note to be shorter than 10 ms.
- Therefore, the higher the BPM, the higher the chance that both a drum note's Note On and Note Off messages are sent within the same call to `Mid_Proc()`, with the respective two `midiOut*()` API calls only being at best a two-digit number of microseconds apart.
- So it only makes sense why cheap MIDI synths that don't even respond to reverb or release time messages completely drop any note with such a short length. After all, at a sampling rate of 44,100 Hz, a note would have to be at least 22.7 µs long to be represented by even a single PCM sample.
- This also extends to the visualizations above, and was the reason why I chose to render all drum notes as fixed-size diamonds. Otherwise, they would barely be visible.
But while sending MIDI events in such quantized chunks might not be perfect, it can't be the cause behind multi-second playback slowdowns. Instead, this issue has to boil down to the way Shuusou Gyoku times each individual message, and specifically how it converts between MIDI pulse units and real-time (milli)seconds. pbg's original MIDI code chose to do this in an equally confusing and inaccurate way: it kept two counters that tracked the current MIDI pulse before and after the latest tempo change, used the value of the latter counter to decide which events to process, and only added the pulse equivalent of 10 ms to this counter at the end of `Mid_Proc()` in the then current tempo. The commit message for my rewritten algorithm details the problems with this approach using nice ASCII art in case you're interested, but in short, the main problem lies in how the single final addition can only consider a single tempo change within each call to `Mid_Proc()`. If a MIDI file contains tempo ramps with less than 10 ms between each different tempo, the original game would only use the last of these tempo values as the basis for converting the entire 10 ms back into MIDI pulses. Not to mention that maybe MIDI pulses aren't the best unit in a game that still 📝 treats the FPU as lava and doesn't use any fixed-point means of increasing the resolution of the 10 ms→pulse division either…
On the contrary, it's much more accurate to immediately convert every encountered MIDI delta time to a real-time quantity and use that unit for event timing, especially if we want to restrict ourselves to integer math. Signed 64-bit integers are enough to fit the product of the slowest possible MIDI tempo ((2²⁴ − 1) µs per quarter note) and the highest possible MIDI delta time (2²⁸ − 1) at nanosecond precision (×10³), with one bit to spare. Then, we arrive at a much simpler timing algorithm:
- Each simultaneously playing track gets a next event timer, starting out at 0
- When looking at the next event, add the converted nanosecond value of its delta time to this timer
- Subtract the equivalent of 10 ms from each track's timer at the beginning of the processing function
- As long as the timer is ≤0, process and send the next message
The additive nature of this timer not only naturally allows more than one event to happen within a single `Mid_Proc()` call, but also averages out any minor timing inconsistencies across the length of a track.
This new algorithm did improve the overall timing accuracy, but only barely, shaving off just ≈100 ms of the total duration. Turns out that the main source behind the slowness was hiding somewhere else entirely, in the single line that deserializes tempo values from MIDI's big-endian representation into the native integer format:
```diff
  assert(length_of_tempo_message == 3);
  uint32_t tempo = 0;
  for(int i = 0; i < length_of_tempo_message; i++) {
- 	tempo += ((tempo << 8) + (*track_data++));
+ 	tempo  = ((tempo << 8) + (*track_data++));
  }
```
Yup – the original code performed two additions per byte, which incorrectly added the interim value at every byte to the final result, and yielded a tempo that is ≈0.8% / ≈1 BPM slower than notated in the MIDI file, matching the number we were looking for. That's why the `|`/`OR` operator is the safer one to use in such a bit-twiddling context…
But now I'm curious. This is such a tiny bug that is bound to remain unnoticed until someone compares the game's MIDI output to another renderer. It must have certainly made it into other games whose MIDI code is based on Shuusou Gyoku's, or that pbg was involved with. And sure enough, not only did this bug survive Kioh Gyoku's OOP refactoring, but it even traveled into Windows Touhou, where it remained in every single game that supported MIDI playback. Now we know for a fact that pbg's Program Support role in the TH06 credits involved sharing ready-made, finished code with ZUN:
Amusingly, ZUN's compiler even started optimizing the combination of left shift and addition into a multiplication by 257 for TH09, which sort of highlights this bug if you're used to reading x86 ASM.
That leaves support for MIDI loop points as the only missing feature for syncing MIDI data with a looping waveform track. While it didn't require all that much code, pbg's original zero-copy approach of iterating over raw MIDI data definitely injected a lot of complexity into the required branches. Multi-track/SMF Type 1 files require quite a bit of extra thought to correctly calculate delta times across loop boundaries that reach past the end of the respective track, while still allowing the real-time delta values to be resynchronized at tempo changes within the loop – and yes, 3 of ZUN's 19 arranged MIDI files actually do use more than one track, so this wasn't just about maximizing MIDI compatibility for mods. I stuck to the original approach mostly as a challenge and to prove that it's possible without first parsing the entire MIDI sequence into a friendlier internal representation, but I absolutely do not recommend this to anyone else.
After hardcoding the loop points detected by mly into the binary, we only need to call `Mid_Proc()` once per frame in the Music Room and pass the frame delta time instead of the 10 ms constant. And then, we get this:
The `MIDI TIMER` now shows off the arguably more interesting current MIDI pulse value rather than just formatting the `PASSED TIME` in milliseconds. Ironically, displaying this value in a constantly counting way takes more effort now – the new nanosecond-based timing code doesn't use any measure of total MIDI pulses anymore, and they don't naturally fall out of the algorithm either. Instead, the code remembers the total pulse value of the last event it processed and adds the real-time duration that has passed since, similar to the original timing algorithm. This naturally causes the timer to jump from the loop end pulse to the loop start pulse, proving that `Mid_Proc()` is in fact looping the sequence.
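Stripped down to a sketch with hypothetical names, this display-only counter just inverts the delta→nanosecond conversion for the elapsed wall-clock time:

```cpp
#include <cstdint>

// [last_event_pulse] = cumulative MIDI pulse count of the last processed
// event, [ns_since] = real time elapsed since that event was sent.
int64_t DisplayPulse(
	int64_t last_event_pulse, int64_t ns_since, uint32_t tempo_us, uint16_t ppqn
)
{
	// The inverse of the delta→nanosecond conversion used for event timing
	return (last_event_pulse + ((ns_since * ppqn) / (int64_t{ tempo_us } * 1000)));
}
```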
Alright, now we know what to package:
- We're going to have 8 BGM packs for each permutation of soundtrack (OST / AST), sound source (Romantique Tp / Sound Canvas VA), and codec (FLAC / Vorbis), making up 1.15 GiB of music data in total.
- When looking at the package names, you will notice that I don't particularly highlight the FLAC versions as "lossless". And for good reason – the Romantique Tp recordings had dithering and noise shaping applied to them, and the Sound Canvas VA versions will necessarily have to be volume-normalized and quantized to 16-bit during the conversion to FLAC. If we wanted a BGM pack with the actual raw Sound Canvas VA output, we'd have to implement WavPack support, which is the only lossless codec that supports 32-bit float – and even that codec could only compress these files down to 14 MiB per minute of music, or 508 MB for the entire original soundtrack. That's 1.4× the size of an equivalent `thbgm.dat`!
- The whole packaging process will be complex enough to warrant a build system. I'd also like to generate an extensive README file for each package, not least to describe the Sound Canvas VA rendering and loop-cutting process in complete detail.
- The AST packs need to bundle the MIDI files from ZUN's site for Music Room visualization. We might as well add a 9th MIDI-only AST pack then, as it will naturally fall out of the packaging pipeline anyway. Some people sure love their MIDI synths, after all.
- The OST packs can fall back on the original game's MIDI files from `MUSIC.DAT` for their Music Room visualization, so there's no need to bundle those and infringe copyright. Ironically, the game will still require a `MUSIC.DAT` even if you use a BGM pack, if only for the one number in that file that says that Shuusou Gyoku's soundtrack consists of 20 tracks in total.
- ZUN didn't arrange タイトルドメイド, so we need to copy the OST version recorded with the respective sound source into the AST pack.
Unfortunately, we still haven't reached the end of the complications and weird issues that haunt Shuusou Gyoku's music:
1) The original game reads the in-game track title directly out of the first Sequence Name event of the playing MIDI file. The waveform equivalent would be the Vorbis comment `TITLE` tag, which therefore should exactly match the original track's title, down to the exact placement of whitespace. As usual, if I emphasize minor things like this, it's not without reason: 幻想科学 ~ Doll's Phantom inconsistently uses halfwidth spaces at both sides of the ~, and wouldn't fit into the Music Room's limited space otherwise.
2) However, the AST MIDI files jam a bunch of other metadata into their Sequence Names, roughly following the format

```
【 $title 】 from 秋霜玉 for sc88Pro comp.ZUN
```

The track titles should definitely not appear in this format in-game, but how do we get rid of it without hardcoding either the names or the magic to parse them out of this format?
3) The absolute state of GS SysEx tooling rears its ugly head one final time in three of the AST MIDIs, which for some reason are missing the Roland vendor prefix byte in all of their SysEx messages and are therefore undeniably bugged. There even seemed to be another SysEx-related bug which Romantique Tp explained away, but not this one:
As stored in the files:

```
ssg_04.mid
0:000  SysEx(   10 42 12 40 00 7F 00 41 F7)
0:240  SysEx(   10 42 12 40 01 30 14 7B F7)
0:360  SysEx(   10 42 12 40 01 33 14 78 F7)
0:420  SysEx(   10 42 12 40 01 34 50 3B F7)

ssg_05.mid
0:000  SysEx(   10 42 12 40 00 7F 00 41 F7)
0:240  SysEx(   10 42 12 40 01 30 14 7B F7)
0:360  SysEx(   10 42 12 40 01 33 00 0C F7)
0:420  SysEx(   10 42 12 40 01 34 14 77 F7)

ssg_10.mid
0:000  SysEx(   10 42 12 40 00 7F 00 41 F7)
0:240  SysEx(   10 42 12 40 01 30 14 7B F7)
0:360  SysEx(   10 42 12 40 01 33 00 0C F7)
0:420  SysEx(   10 42 12 40 01 34 60 2B F7)
```

And the same messages with the Roland vendor prefix byte (41) restored, turning them into the valid GS messages they were meant to be:

```
ssg_04.mid
0:000  SysEx(41 10 42 12 40 00 7F 00 41 F7)  GS Reset
0:240  SysEx(41 10 42 12 40 01 30 14 7B F7)  Reverb Macro #20
0:360  SysEx(41 10 42 12 40 01 33 14 78 F7)  Reverb Level 20
0:420  SysEx(41 10 42 12 40 01 34 50 3B F7)  Reverb Time 80

ssg_05.mid
0:000  SysEx(41 10 42 12 40 00 7F 00 41 F7)  GS Reset
0:240  SysEx(41 10 42 12 40 01 30 14 7B F7)  Reverb Macro #20
0:360  SysEx(41 10 42 12 40 01 33 00 0C F7)  Reverb Level 0
0:420  SysEx(41 10 42 12 40 01 34 14 77 F7)  Reverb Time 20

ssg_10.mid
0:000  SysEx(41 10 42 12 40 00 7F 00 41 F7)  GS Reset
0:240  SysEx(41 10 42 12 40 01 30 14 7B F7)  Reverb Macro #20
0:360  SysEx(41 10 42 12 40 01 33 00 0C F7)  Reverb Level 0
0:420  SysEx(41 10 42 12 40 01 34 60 2B F7)  Reverb Time 96
```
The irony of using invalid Reverb Macros within already invalid SysEx messages is not lost on me. This is something we should fix even before running these files through Sound Canvas VA in order to render these with the reverb settings that ZUN clearly (and, for once, unironically) intended.
4) For perfect preservation of the original BGM/gameplay synchronicity, it makes sense for the waveform versions to retain the leading 1 or 2 beats of silence that the original MIDI files use for their SysEx setup. While some of the AST tracks use a slightly different tempo compared to their OST counterparts, they would still be largely in sync as ZUN didn't rearrange the layout of their setup area… except for, once again, the three tracks used in the Extra Stage. Marisa's and Reimu's boss themes aren't too bad with their 4 beats of setup, but シルクロードアリス takes the cake with a whopping 12 beats of leading silence. That's 5 seconds from the start of the Extra Stage to the first note you'd hear. 🐌
2) and 4) could theoretically be worked around in Shuusou Gyoku's MIDI code, but there's no way around editing the MIDI files themselves as far as 3) is concerned. Thus, it makes sense to apply all of the workarounds to the AST MIDIs as part of the BGM build process – parsing the titles out of the 【brackets】, inserting the Roland vendor prefix byte where necessary, and compressing the setup bars in the Extra Stage themes to match their OST counterparts. Adding any hidden magic to the MIDI code would only have needlessly increased complexity and/or annoyed some modder in the future who would then have to work around it.
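Neither of the MIDI-level fixes needs more than a few generic lines. A minimal sketch of the two interesting ones, with hypothetical names, assuming UTF-8 titles and SysEx payloads as they are stored after the event's length field:

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// 3): The bugged messages start with the device ID (10) and model ID (42)
// right away; valid Roland messages start with the vendor ID (41). The Roland
// checksum only covers the address and data bytes, so it stays valid after
// the insertion – as seen in the comparison above.
void FixRolandVendorPrefix(std::vector<uint8_t>& sysex)
{
	if((sysex.size() >= 2) && (sysex[0] == 0x10) && (sysex[1] == 0x42)) {
		sysex.insert(sysex.begin(), 0x41);
	}
}

// 2): Parsing "$title" out of "【 $title 】 from 秋霜玉 for sc88Pro comp.ZUN"
// without hardcoding anything but the brackets. Operates on UTF-8 bytes;
// 【 and 】 are 3 bytes each.
std::string_view TitleFromSequenceName(std::string_view name)
{
	constexpr std::string_view OPEN = "【";
	constexpr std::string_view CLOSE = "】";
	const auto open = name.find(OPEN);
	const auto close = name.find(CLOSE);
	if((open == name.npos) || (close == name.npos) ||
	   (close < (open + OPEN.size()))) {
		return name; // OST-style Sequence Name, used as-is
	}
	auto title = name.substr((open + OPEN.size()), (close - (open + OPEN.size())));
	while(title.starts_with(' ')) { title.remove_prefix(1); }
	while(title.ends_with(' ')) { title.remove_suffix(1); }
	return title;
}
```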
Ideally, these edits would involve taking the `mly dump` output, performing the necessary replacements at a plaintext level, and rebuilding the result back into a MIDI file, bu~t we're unfortunately missing the latter feature. Luckily, someone else had the same idea 13 years ago and wrote a tool in C that does exactly what we need. Getting it to compile in 2024 only required fixing a typical C thing… why are students and boomers defending this antique of a language again? 🙄
The single most glaring issue, however, is the drastic difference in volume between the individual tracks in both soundtracks. While Romantique Tp had to normalize each track to the maximum possible volume individually as a consequence of the recording process, the Sound Canvas VA renderings reveal just how inconsistent the volume levels of these MIDI files really are:
So how do we interpret this? Is this a bug, because no one in their right mind would want their music to clip on purpose, and that in turn means that everything about these volume levels is arbitrary and unintentional? Or is this a quirk, and ZUN deliberately chose these volume levels for compositional reasons? It certainly would make sense for the name registration theme.
Once again, the AST version of シルクロードアリス is the worst offender in this regard as well, but it might also provide some evidence for the quirk interpretation. The fact that almost all of its MIDI channels blast away at full volume might have been an accident that could have gone unnoticed if the volume knob of ZUN's SC-88Pro was turned rather low during the time he arranged this piece, but the excessive left-panning must have been deliberate. Even Romantique Tp agrees:
"Western-style piece", but it's not.
And that's with the volume already normalized. Because this one channel of this one track is almost twice as loud as anything else in the AST, we would consequently have to bring down the volume of every other arranged track and the right channel of the same track by almost 50% if we wanted to maintain the volume differences between the individual tracks of the AST. In the process, we would lose almost one entire bit of dynamic range. At this rate, you might even consider remixing and remastering the entire thing, but that would involve so many creative decisions that it would definitely fall into fanfiction territory…
However, normalizing each track to a peak level of 0 dBFS makes much more sense for in-game playback if you consider how loud Shuusou Gyoku's sound effects are. Once again, the best solution would involve offering both versions, but should we really add two more SCVA BGM packs just to cover volume differences?
ReplayGain solves this exact problem for regular music listening in a non-destructive way by writing the per-track and per-album gain levels into an audio file's metadata. Since we need metadata support for titles anyway, we can do something similar, albeit not exactly the same for two reasons:
- ReplayGain is specified to target an average volume of −17 dBFS, whereas we'd like to target a peak volume of 0 dBFS in order to always use the entire available digital scale. We've got some loud sound effects to compete with, after all.
- ReplayGain expresses its gain values in dB, which is cumbersome to work with. In the realm of PCM, volume changes don't need to involve more than a simple multiplication, so let's go with a simple scalar `GAIN FACTOR`.
And so, we hard-apply the album-level gain during the conversion from 32-bit float to FLAC to preserve the volume differences between the tracks, calculate the track-level `GAIN FACTOR` based on the resulting peak levels, add a volume normalization toggle to the `Sound / Config` menu, enable it by default, and thus make everyone happy. ✅
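A sketch of both halves of this mechanism, with hypothetical names and assuming float samples:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Build time: derive the factor from a track's peak level.
float GainFactorFromPeak(const float* samples, size_t count)
{
	float peak = 0.0f;
	for(size_t i = 0; i < count; i++) {
		peak = (std::max)(peak, std::fabs(samples[i]));
	}
	return ((peak > 0.0f) ? (1.0f / peak) : 1.0f);
}

// Playback: if normalization is enabled, a single multiplication per sample
// scales the track's loudest point to exactly 0 dBFS.
void ApplyGainFactor(float* samples, size_t count, float gain_factor)
{
	for(size_t i = 0; i < count; i++) {
		samples[i] *= gain_factor;
	}
}
```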
The final interesting tidbit in building these packages can be found in the way the Sound Canvas VA recordings are looped. When manually cutting loops, you always have to consider that the intro might end with unique notes that aren't present at the end of the loop, which will still be fading out at the calculated loop start point. This necessitates shifting the loop start point by a few bars until these notes are no longer audible – or you could simply ignore the issue because ZUN's compositions are so frantic that no one would ever notice.
With the separate intro and loop files generated by mly, on the other hand, the reverb/release trails are immediately visible and, after trimming trailing silence, exactly define the number of samples that the calculated loop start point needs to be shifted by. The `.loop` file then always remains exactly as long, in samples, as the loop duration reported by mly. If a piece happens to have a constant tempo whose beat duration corresponds to an integer number of samples, we get some very satisfying, round loop durations out of this process. ☺️
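For example – with made-up numbers that don't correspond to any of ZUN's actual tempo choices – a constant 140 BPM at 44,100 Hz divides evenly into samples:

```cpp
// 60 s × 44,100 samples/s = 2,646,000 samples per minute, evenly divisible by 140.
constexpr int SAMPLE_RATE = 44100;
constexpr int BPM = 140;
static_assert(((SAMPLE_RATE * 60) % BPM) == 0);
constexpr int SAMPLES_PER_BEAT = ((SAMPLE_RATE * 60) / BPM); // = 18,900
constexpr int LOOP_OF_64_BEATS = (SAMPLES_PER_BEAT * 64);    // = 1,209,600
```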
So let's play it all back in-game… and immediately run into two unexpected miniaudio limitations, what the…?!
- miniaudio uses a fixed linear function for its fade-out envelope, and doesn't offer anything else? We might not even want a logarithmic one this time because symmetry with MIDI's simple quadratic curve would be neat, but we sure don't want a linear function – those stay near the original volume for too long, and then turn quiet way too quickly.
- There is no way to access FLAC metadata from miniaudio's public API, even though the library bundles the author's own FLAC library which has this feature?
📝 Back when I evaluated miniaudio, I alluded that I consider single-file C libraries to be massively overrated, and this is exactly why: Once they grow as massive as miniaudio (how ironic), they can quickly lead to their authors treating their dependencies as implementation details and melting down the interfaces that would naturally arise. In a regular library, dr_flac would be a separate, proper dependency, and the API would have a way to initialize a stream from an externally loaded `drflac` object. But since the C community collectively pretends that multi-file libraries are a burden on other developers, miniaudio ended up with dr_flac copy-pasted into its giant single file, with a silly `ma_` namespacing prefix added to all its functions. And why? Did we have to move so far in the other direction just because CMake doesn't support globbing? That's a symptom of CMake not actually solving any problem, not a valid architectural decision that libraries should bend around. 🙄
So unless we fork and hack around in miniaudio, there's now no way around depending on a second, regular copy of dr_flac. Which has now led to the same project organization bloat that single-file libraries originally set out to prevent…
Sigh. At this rate, it makes more sense to just copy-paste and adapt the old BGM streaming code I wrote for thcrap in late 2018, which used dr_flac directly, and extend it with metadata support. With the streaming code moved out of the platform layer and into game logic, it also makes much more sense to implement the squared fade-out curve at that same level instead of copy-pasting and adjusting an unhealthy amount of miniaudio's verbose C code.
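The quadratic curve itself is a one-liner at that level; a sketch with hypothetical names:

```cpp
// Quadratic fade-out: drops away from full volume more quickly than a linear
// ramp, then approaches silence more gradually.
float FadeVolume(float remaining_s, float duration_s)
{
	const float linear = (remaining_s / duration_s); // 1.0 at start, 0.0 at the end
	return (linear * linear);
}
```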
While I'm doing the same for the old Vorbis streaming code, it would also make sense to rewrite that one to use stb_vorbis instead of the old libogg+libvorbis reference libraries. There's no need to add two more dependencies if miniaudio already comes with `stb_vorbis.c`, and that library is widely acclaimed. So, integration should be a breeze, right?
Well, surprise, rarely have I seen a C library so actively hostile toward being integrated. Both of its API variants are completely unreasonable:
- The pulldata API pulls Vorbis data as needed from either a memory buffer containing the entire Vorbis file, or a C `FILE*` handle. Effectively, this either forces you to give up disk streaming completely, or forces your program into C's terrible I/O API with all its buffering slowness and Unicode issues on Windows. The documentation even goes on to suggest just modifying the code if you need anything else, which might be acceptable in the strange world of game development this library originates from, but it sure isn't in the kind of open-source development I do.
- The pushdata API expects the caller to gradually feed chunks of Vorbis data. How large do these chunks have to be? Nobody knows – and, even worse, the API doesn't retain any of the data already pushed in. If the buffer you passed is too small, which you don't get to know in advance, you have to pass the same data plus more in the next call. I get that you might want an API like this to avoid dynamic memory allocations, but not only does this API perform plenty of allocations itself, it actively forces its caller to `realloc()` over and over again. 🙄 The lack of seeking support reveals that this API is geared toward live-streamed audio, and it might very well be acceptable in such a case, but it's nothing we could use for BGM.
What happened to the tried-and-true idea of providing a structure with read, tell, and seek callbacks, and then providing an optional variant for C `FILE*` handles if you absolutely must? Sure, the whole point of Vorbis is to be small, and nobody these days would care about spending a few MB on keeping an entire Vorbis file in memory, but come on. If pulldata had made the deliberate and opinionated choice to only support buffers of complete Vorbis streams and argued in the name of simplicity that hand-coded disk streaming isn't worth it in this day and age, I might have even been convinced. And this is from the guy who popularized the concept of single-file C libraries in the first place?
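For reference, this is the kind of structure I mean – a hypothetical minimal version of the concept; libvorbisfile's ov_callbacks is the same idea, with read/seek/close/tell members:

```cpp
#include <cstddef>
#include <cstdint>

struct STREAM_CALLBACKS {
	void* user; // opaque pointer, passed back into every callback
	size_t (*read)(void* user, void* buf, size_t size);
	bool (*seek)(void* user, int64_t offset, int whence); // fseek()-style
	int64_t (*tell)(void* user);
};
```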
Oh well, tupblocks go brrr. libvorbis definitely shows its age with all the old command-line tools in its `lib/` directory that were never moved out of there and that we now have to remove from our glob. But even that just adds a single line to the Tupfile, and then we get to enjoy its much friendlier API. That sure beats the almost 800 lines of code that miniaudio had to write to integrate stb_vorbis… which I can't even link because the file is too big for GitHub. 🤷
At this point, it would have even made sense to upgrade from a 24-year-old lossy codec to an 11-year-old lossy codec and use Opus instead, since the enforced 48,000 Hz sampling rate is a non-issue when you control the entire audio pipeline. But let's keep compatibility with existing thcrap mods for now.
The last time I added dependencies, 📝 I wondered whether just downloading and extracting official Windows binary builds might be superior to pasting batch script duct tape over the usability issues of Git submodules. However, I still wanted to first try out Git's sparse checkout feature in an attempt to remove all the unneeded bloat… and as it turned out, this might just be the idealistic and perfect nirvana of vendoring libraries in C++ projects. I particularly like how the limitations of its default mode (always checking out all files within each directory level that shows up in a filter) can be turned into a guideline about how to structure a repository: All non-essential stuff that consumers of your code might not need – tests, high-level documentation, or optional features – should go into a subdirectory where it can be easily filtered.
And that's how the size of our `libs/` directory went down from 82.7 MiB in the P0256 build to 30.4 MiB in the P0275 build, despite adding 4 more libraries in the latter. Now if only this didn't require even more duct tape to actually set up shallow clones correctly…
In the end, the Windows build uses only a single one of the miniaudio features that DirectSound doesn't have: the ability to use the more modern WASAPI instead of DirectSound. We're still going to use miniaudio for the Linux port, but as far as Windows is concerned, it would be quite nice to backport BGM streaming to the game's original DirectSound backend. The P0275 build is pushing 1 MiB of binary size for a game that originally came in a 220 KiB binary, so this would remove a noticeable amount of bloat from `GIAN07.EXE`, but it would also allow waveform BGM to work in the Windows 98-compatible i586 build. If that sounds cool to you, this is the issue you want to fund.
That only left some logic and UI busywork to put it all together, which means that we've almost reached the end of things to talk about! Here's what it all looks like:
- BGM pack selection is done in-game through a new submenu. The `<Download>` option will open the BGM pack release page in the system's preferred browser. This window presented a great occasion for already implementing the generic boilerplate for vertically scrolling windows with an unlimited number of items. That will come in quite handy once we introduce better replay support… 👀
- Even with per-track BGM volume normalization, Shuusou Gyoku's sound effects are still a bit too loud in comparison, especially when mixed on top of that excessively and unfixably left-panned AST version of the Extra Stage theme. Adding separate volume controls for BGM and sound effects really was the only sustainable solution here, and conveniently checks an important quality-of-life box the original game lacked. So important that it was the very first issue I added to the GitHub tracker of my fork:
- I really wanted to have Japanese help text in these menus, as it makes them look just so much more consistent and polished. Many thanks to Elfin, who responded to my bounty offer, and will most likely also provide localizations for future features.
- In-game music titles are now consistently right-aligned. Leading whitespace in 4 of the original MIDI Sequence Names suggests that pbg might have intended these titles to be centered within the 216 maximum pixels that the original code designated for music titles, but none of those 4 had the correct amount of spaces that would have been required for exact centering:

```
#04 |・・・幻想帝都・・・・
#07 |・・天空アーミー・・・
#11 |・魔法少女十字軍・・・
#16 |・・シルクロードアリス
```
Right-aligning the titles in-game also allows us to strip this leading whitespace from the `TITLE` tags in the BGM packs… at the cost of significantly changing the animation. 🤔 Maybe all this whitespace had the explicit purpose of making the animation look the way it did originally? But hard-padding the title tags in the BGM packs would be so dumb… 😩 Let's keep it like this for now and fix the animation later.
- At startup, the game now shows a new screen if any of the game's .DAT files are missing, displaying their expected absolute path. This is bound to be very important on Linux because each distribution might have its own idea of where these files are supposed to be stored. But even on Windows, this allows `GIAN07.EXE` to at least run and show something if one or more of these files are not present, instead of crashing at the first attempt of loading anything from them. The ¥ instead of \ is, 📝 once again, a font issue. Good luck finding a font not named MS Gothic that looks good when rendered in this game…
- On a more unfortunate note, I dropped the i586 build from this release. Visual Studio 2022's CRT implements the new filesystem and threading code using Win32 API functions that are only available on Vista or later and are not covered by the one ready-made KernelEx package I was able to find, so I couldn't easily test such a build on Windows 98 anymore. Resurrecting the i586 build would therefore involve additional platform abstraction layers that we wouldn't need otherwise. Writing them wouldn't be too expensive, but it only makes sense if there's actual demand. Backporting waveform BGM to DirectSound to restore feature parity would also be a good idea here, as it would avoid the need to litter the current code with `#ifdef`s at any place that references anything related to BGM packs.
After half a year of being bought out way past the cap, I've finally got some small room left for new orders again. If it weren't for this blog post and the required research and web development work, this delivery would have probably come out in early January, taking half the time it ended up taking. So I really have to start factoring the blog posts into the push prices in a better and fairer way.
Meanwhile, the hate toward my day job only keeps growing, but there's little point in looking for a new one as long as ReC98 remains this motivating and complex. It leaves pretty much no cognitive room for any similarly demanding job. Thus, I want 2024 to be the year where ReC98 either becomes profitable enough to be my only full-time job, or where we conclusively find out that it can't, I go look for a better day job, and ReC98 shifts to a slower pace. Here's the plan:
- From now on, I will immediately increase the push price whenever we reach 100% of the cap, either directly through new orders or indirectly through existing subscriptions. The price increase will be relative to how long it took to reach that point since the last re-opening.
- If the store continues selling out, I will aim for per push by the end of the year.
- In exchange, microtransactions (i.e., deliveries containing just code and no blog posts) will now be half the price of regular pushes for the same amount of delivered code. Or in other words: If you want to fund a goal that's eligible for microtransactions, you can now decide whether your fixed amount of money goes to 2× coding work and 0× blogging, or 1× coding work and 1× blogging.
- I'll permanently increase the default level of the cap from 8 to 10 pushes. The past 12 months were full of mod releases that raised the bar, and 2024 shows no signs of stopping that trend.
- If we ever reach per push, I plan to hire people for some of the contribution-ideas or anything else that might improve this project. (Well-produced YouTube videos about the findings of this project might be a nice idea!) At that point, I will have reached my goal of living decently off this project alone, and it's time for others to make money in this space as well.
With the new price of per push, this means that there's now a small window in which you can get a full push worth of functionality for , until the current cap is filled up again.
Next up: Probably TH02's endings to relax a bit. Maybe we're also getting some new Touhou-related contributions?