ReC98

📝 Posted:: 2022-10-31 23:58 UTC
🚚 Summary of:: P0218, P0219, P0220, P0221, P0222
⌨ Commits:: (Website) 21f0a4d...8ebf201, (Website) 8ebf201...52375e2, (Website) 52375e2...ba6359b, (Website) ba6359b...94e48e9, (Website) 94e48e9...358e16f
💰 Funded by:: [Anonymous], Yanga, Ember2528
🏷 Tags:

Yes, I'm still alive. This delivery was just plagued by all of the worst luck: Data loss, physical hard drive failure, exploding phone batteries, minor illness… and after taking 4 weeks to recover from all of that, I had to face this beast of a task. 😵

Turns out that neither part of improving video performance and usability on this blog was particularly easy. Decently encoding the videos into all web-supported formats required unexpected trade-offs even for the low-res, low-color material we are working with, and writing custom video player controls added HTML <video>'s resistance to precise timing on top of the inherent complexity of frontend web development. Why did this need to be 800 lines of commented JavaScript and 200 lines of commented CSS, and almost consume more than 5 pushes?! Apparently, the latest price increase also seems to have raised the minimum level of acceptable polish in my work, since that's more than the maximum of 3.67 pushes it should have taken. To fund the rest, I stole some of the reserved JIS trail word rendering research pushes, which means that the next funding towards anything will go back towards that goal.

The codec situation is especially sad because it seems like so much of a solved problem. ZMBV, the lossless capture codec introduced by DOSBox, is both very well suited for retro game footage and remarkably simple too: DOSBox-X's implementation of both an encoder and decoder comes in at under 650 lines of C++, excluding the Deflate implementation. Heck, the AVI container around the codec is more complicated to write than the compressed video data itself, and AVI is already the easiest choice you have for a widely supported video container format.
Currently, this blog contains 9:02 minutes of video across 86 files, with a total frame count of 24,515. In case this post attracts a general video encoding audience that isn't familiar with what I'm encoding here: The maximum resolution is 640×400, and most of the video uses 16 colors, with some parts occasionally using more. With ZMBV, the lossless source files take up 43.8 MiB, and that's even with AVI's infamously bad overhead. While you can always spend more time on any compression task and precisely tune your algorithm to match your source data even better, 43.8 MiB looks like a more than reasonable amount for this type of content.

Especially compared with what I actually have to ship here, because sadly, ZMBV is not supported by browsers. 😔 Writing a WebAssembly player for ZMBV would have certainly been interesting, but it already took 5 pushes to get to what we have now. So, let's instead shell out to ffmpeg and build a pipeline to convert ZMBV to the ill-suited codecs supported by web browsers, replacing the previously committed VP9 and VP8 files. From that point, we can then look into AV1, the latest and greatest web-supported video codec, to save some additional bandwidth.

But first, we've got to gather all the ZMBV source files. While I was working on the 📝 2022-07-10 blog post, I noticed some weirdly washed-out colors in the converted videos, leading to the shocking realization that my previous, historically grown conversion script didn't actually encode in a lossless way. 😢 By extension, this meant that every video before that post could have had minor discolorations as well.
For the majority of videos, I still had the original ZMBV capture files straight out of DOSBox-X, and reproducing the final videos wasn't too big of a deal. For the few cases where I didn't, I went the extra mile, took the VP9 files, and manually fixed up all the minor color errors based on reference videos from the same gameplay stage. There might be a huge ffmpeg command line with a complicated filter graph to do the job, but for such a small 4-digit number of frames, it is much more straightforward to just dump each frame as an image and perform the color replacement with ImageMagick's -opaque and -fill options.
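The per-frame color fix could be scripted roughly like this. This is only a sketch of the general approach, not the script that was actually used: the `magick` binary name (ImageMagick 7; older installs call it `convert`), the file names, and the color pairs are all assumptions. It only builds the command line, so nothing is executed until you pass the result to `subprocess.run()`.

```python
def recolor_command(src, dst, replacements, magick="magick"):
    """Builds an ImageMagick command line that swaps exact colors.

    `replacements` is a list of (wrong_color, correct_color) pairs.
    The file names and colors are hypothetical placeholders.
    """
    cmd = [magick, src]
    for wrong, correct in replacements:
        # -fill selects the replacement color; -opaque then recolors
        # every pixel that matches `wrong`.
        cmd += ["-fill", correct, "-opaque", wrong]
    cmd.append(dst)
    return cmd


if __name__ == "__main__":
    # One call per dumped frame, e.g. inside a loop over frame_*.png.
    print(recolor_command(
        "frame_00455.png", "fixed_00455.png", [("#fe0000", "#ff0000")]
    ))
```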

So, time to encode our new definitive collection of source files into AV1, and what the hell, how slow is this codec? With ffmpeg's libaom-av1, fully encoding all 86 videos takes almost 9 hours on my mid-range development system, regardless of the quality selected.
But sure, the encoded videos are managed by a cache, and this obviously only needs to be done once. If the results are amazing, they might even justify these glacial encoding speeds. Unfortunately, they don't: In its lossless -crf 0 mode, AV1 performs even worse than VP9, taking up 222 MiB rather than 182 MiB. It might not sound bad now, but as we're later going to find out, we want to have a lot of keyframes in these videos, which will blow up video sizes even further.

So, time to go lossy and maybe take a deep dive into AV1 tuning? Turns out that it only gets worse from there:

- The alternative libsvtav1 encoder is fast and creates small files… but even on the highest-quality settings, -crf 0 and -qp 0, the video quality resembled the terrible x264 YUV420P format that Twitter enforces on uploaded videos.
- I don't remember the librav1e results, but they sure weren't convincing either.
- libaom-av1's -usage realtime option is a complete joke. 771 MiB for all videos, and it doesn't even compress in real time on my system, more like 2.5× real-time. For comparison, a certain stone-age technology by the name of "animated GIF" would take 54.3 MiB, encode in sub-realtime (0.47×), and the only tuning you need is an easily googled palette generation and usage filter. Why can't I just use those in a <video> tag?! These results have clearly proven the top-voted "just use modern video codecs" Stack Overflow answers wrong.
- What you're actually supposed to do is to drop -cpu-used to maybe 2 or 3, and then selectively add back prediction filters that suit your type of content. In our case, these are
  - -enable-palette
  - -enable-rect-partitions and friends
  - -enable-intrabc (source)

  and maybe others, depending on how much time you want to waste.

Because that's what all this tuning ended up being: a complete waste of time. No matter which tuning options I tried, all they did was cut down encoding time in exchange for slightly larger files on average. If there is a magic tuning option that would suddenly cause AV1 to maybe even beat ZMBV, I haven't found it. Heck, at particularly low settings, -enable-intrabc even caused blocky glitches with certain pellet patterns that looked like the internal frame block hashes were colliding all over the place. Unfortunately, I didn't save the video where it happened.
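For reference, a command line combining the options from the list above might be assembled like this. Treat it as a sketch under assumptions: the option names are the ones mentioned in the text, but whether your ffmpeg build exposes all of them depends on its version, and `-b:v 0` plus the file names are my additions rather than anything confirmed by the measurements here.

```python
def av1_command(src, dst, crf=1, cpu_used=2):
    """Builds a hypothetical libaom-av1 command line with screen-content tools."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libaom-av1",
        # Constant-quality mode; -b:v 0 is an assumption that removes any
        # bitrate cap on builds that still apply one.
        "-crf", str(crf), "-b:v", "0",
        # Faster presets disable coding tools, so re-enable the relevant ones.
        "-cpu-used", str(cpu_used),
        "-enable-palette", "1",
        "-enable-rect-partitions", "1",
        "-enable-intrabc", "1",
        dst,
    ]


if __name__ == "__main__":
    print(" ".join(av1_command("capture.avi", "capture-av1.webm")))
```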

So yeah, if you've already invested the computation time and encoded your content by just specifying a -crf value and keeping the remaining settings at their time-consuming defaults, any further tuning will make no difference. Which is… an interesting choice from a usability perspective. I would have expected the exact opposite: default to a reasonably fast and efficient profile, and leave the vast selection of tuning options for those people to explore who do want to wait 5× as long for their encoder for that additional 5% of compression efficiency. On the other hand, that surely is one way to get people to extensively study your glorious engineering efforts, I guess? You know what would maybe even motivate people to intrinsically do that? Good documentation, with examples of the intent behind every option and its optimal use case. Nobody needs long help strings that just spell out all of the abbreviations that occur in the name of the option…
But hey, that at least means there's no reason to use anything but ZMBV for storing and archiving the lossless source files. Best compression efficiency, encodes in real-time, and the files are much easier to edit.

OK, end of rant. To understand why anyone could be hyped about AV1 to begin with, we just have to compare it to VP9, not to ZMBV. In that light, AV1 is pretty impressive even at -crf 1, compressing all 86 videos to 68.9 MiB, and even preserving 22.3% of frames completely losslessly. The remaining frames exhibit the exact kind of quality loss you'd want for retro game footage: Minor discoloration in individual pixels, so minuscule that subtracting the encoded image from the source yields an almost completely black image. Even after highlighting the errors by normalizing such a difference image, they are barely visible even if you know where to look. If "compressed PNG size of the normalized difference between ZMBV and AV1 -crf 1" is a useful metric, this would be its median frame among the 77.7% of non-lossless frames:

The lossless source image, and the same image encoded in AV1: That's frame 455 (0-based) of 📝 YuugenMagan's reconstructed Phase 5 pattern on Easy mode. The AV1 version does in fact expand the original image's 16 distinct colors to 38.

For comparison, here's the 13th worst one. The codec only resorts to color bleeding with particularly heavy effects, exactly where it doesn't matter:

Whether you can actually spot the difference is pretty much down to the glass between the physical pixels and your eyes. In any case, it's very hard, even if you know where to look. As far as I'm concerned, I can confidently call this "visually lossless", and it's definitely good enough for regular watching and even single-frame stepping on this blog.
Since the appeal of the original lossless files is undeniable though, I also made those more easily available. You can directly download the one for the currently active video with the ⍗ button in the new video player – or directly get all of them from the Git repository if you don't like clicking.
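The per-frame comparison behind these numbers (subtract, normalize, count colors) can be sketched in a few lines. This is an illustration rather than the tooling actually used: frames are assumed to be plain lists of rows of `(r, g, b)` tuples, and the normalization is a simple linear stretch to the brightest channel value, which may differ from what an image tool does.

```python
def difference(src, enc):
    """Per-channel absolute difference between two equally sized frames."""
    return [
        [tuple(abs(a - b) for a, b in zip(ps, pe)) for ps, pe in zip(rs, re)]
        for rs, re in zip(src, enc)
    ]


def normalize(diff):
    """Stretches a difference image so that its largest value becomes 255."""
    peak = max((c for row in diff for px in row for c in px), default=0)
    if peak == 0:
        return diff  # Completely black: the frame was encoded losslessly.
    return [[tuple(c * 255 // peak for c in px) for px in row] for row in diff]


def is_lossless(src, enc):
    return src == enc


def distinct_colors(frame):
    return len({px for row in frame for px in row})


if __name__ == "__main__":
    src = [[(0, 0, 0), (255, 255, 255)]]
    enc = [[(0, 0, 0), (253, 255, 254)]]
    print(normalize(difference(src, enc)), distinct_colors(enc))
```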

Unfortunately, even that only made up for half of the complexity in this pipeline. As impressive as the AV1 -crf 1 result may be, it does in fact come with the drawback of also being impressively heavy to decode within today's browsers. Seeking is dog slow, with even the latencies for single-frame stepping being way beyond what I'd consider tolerable. To compensate, we have to invest another 78 MiB into turning every 10th frame into a keyframe until single-stepping through an entire video becomes as fast as it could be on my system.
But fine, 146 MiB, that's still less than the 178 MiB that the old committed VP9 files used to take up. However, we still want to support VP9 for older browsers, older hardware, and people who use Safari. And it's this codec where keyframes are so bad that there is no clear best solution, only compromises. The main issue: The lower you turn VP9's -crf value, the slower the seeking performance with the same number of keyframes. Conversely, this means that raising quality also requires more keyframes for the same seeking performance – and at these file sizes, you really don't want to raise either. We're talking 1.2 GiB for all 86 videos at -crf 10 and -g 5, and even on that configuration, seeking takes 1.3× as long as it would in the optimal case.

Thankfully, a full VP9 encode of all 86 videos only takes some 30 minutes as opposed to 9 hours. At that speed, it made sense to try a larger number of encoding settings during the ongoing development of the player. Here's a table with all the trials I've kept:

Codec	`-crf`	`-g`	Other parameters	Total size	Seek time
VP9	32	20	-vf format=yuv420p	111 MiB	32 s
VP8	10	30	-qmin 10 -qmax 10 -b:v 1G	120 MiB	32 s
VP8	7	30	-qmin 7 -qmax 7 -b:v 1G	140 MiB	32 s
AV1	1	10		146 MiB	32 s
VP8	10	20	-qmin 10 -qmax 10 -b:v 1G	147 MiB	32 s
VP8	6	30	-qmin 6 -qmax 6 -b:v 1G	149 MiB	32 s
VP8	15	10	-qmin 15 -qmax 15 -b:v 1G	177 MiB	32 s
VP8	10	10	-qmin 10 -qmax 10 -b:v 1G	225 MiB	32 s
VP9	32	10	-vf format=yuv422p	329 MiB	32 s
VP8	0-4	10	-qmin 0 -qmax 4 -b:v 1G	376 MiB	32 s
VP8	5	30	-qmin 5 -qmax 5 -b:v 1G	169 MiB	33 s
VP9	63	40		47 MiB	34 s
VP9	32	20	-vf format=yuv422p	146 MiB	34 s
VP8	4	30	-qmin 0 -qmax 4 -b:v 1G	192 MiB	34 s
VP8	4	40	-qmin 4 -qmax 4 -b:v 1G	168 MiB	35 s
VP9	25	20	-vf format=yuv422p	173 MiB	36 s
VP9	15	15	-vf format=yuv422p	252 MiB	36 s
VP9	32	25	-vf format=yuv422p	118 MiB	37 s
VP9	20	20	-vf format=yuv422p	190 MiB	37 s
VP9	19	21	-vf format=yuv422p	187 MiB	38 s
VP9	32	10		553 MiB	38 s
VP9	32	10	-tune-content screen	553 MiB	38 s
VP9	32	10	-tile-columns 6 -tile-rows 2	553 MiB	38 s
VP9	15	20	-vf format=yuv422p	207 MiB	39 s
VP9	10	5		1210 MiB	43 s
VP9	32	20		264 MiB	45 s
VP9	32	20	-vf format=yuv444p	215 MiB	46 s
VP9	32	20	-vf format=gbrp10le	272 MiB	49 s
VP9	63			24 MiB	67 s
VP8	0-4		-qmin 0 -qmax 4 -b:v 1G	119 MiB	76 s
VP9	32			107 MiB	170 s

The bold rows correspond to the final encoding choices that are live right now. The seeking time was measured by holding → Right on the 📝 cheeto dodge strategy video.

Yup, the compromise ended up including a chroma subsampling conversion to YUV422P. That's the one thing you don't want to do for retro pixel graphics, as it's the exact cause behind washed-out colors and red fringing around edges:

The same image encoded in VP9, exhibiting a severe case of chroma subsampling — The worst example of chroma subsampling in a VP9-encoded file according to the above metric, from frame 130 (0-based) of 📝 Sariel's restored leaf "spark" animation, featuring smeared-out contours and even an all-around darker image, blowing up the image to a whopping 3653 colors. It's certainly an aesthetic.

But there simply was no satisfying solution around the ~200 MiB mark with RGB colors, and even this compromise is still a disappointment in both size and seeking speed. Let's hope that Safari users do get AV1 support soon… Heck, even VP8, with its exclusive support for YUV420P, performs much better here, with the impact of -crf on seeking speed being much less pronounced. Encoding VP8 also just takes 3 minutes for all 86 videos, so I could have experimented much more. Too bad that it only matters for really ancient systems…
Two final takeaways about VP9:

- -tune-content screen and the tile options make no difference at all.
- All results used two-pass encoding. VP9 is the only codec where two passes made a noticeable difference, cutting down the final encoded size from 224 MiB to 207 MiB. For AV1, compression even seems to be slightly worse with two passes, yielding 154,201,892 bytes rather than the 153,643,316 bytes we get with a single pass. But that's a difference of 0.36%, and hardly significant.
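A two-pass VP9 encode with a fixed keyframe interval and pixel format could be driven like this. It's a sketch, not the actual pipeline code: the default values just mirror one row of the table above, while `-b:v 0`, the log file prefix, and the file names are assumptions on my part.

```python
import os


def vp9_two_pass(src, dst, crf=15, gop=20, pix_fmt="yuv422p", log="vp9_pass"):
    """Returns the (first pass, second pass) ffmpeg command lines."""
    common = [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libvpx-vp9",
        "-crf", str(crf), "-b:v", "0",  # -b:v 0 is assumed, not from the table
        "-g", str(gop),                 # maximum distance between keyframes
        "-vf", f"format={pix_fmt}",     # chroma subsampling conversion
        "-passlogfile", log,
    ]
    # The first pass only gathers statistics, so its video output is discarded.
    first = common + ["-pass", "1", "-an", "-f", "null", os.devnull]
    second = common + ["-pass", "2", dst]
    return first, second


if __name__ == "__main__":
    for cmd in vp9_two_pass("capture.avi", "capture-vp9.webm"):
        print(" ".join(cmd))
```

Both command lines have to be run in order, since the second pass reads the statistics file written by the first.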

Alright, now we're done with codecs and get to finish the work on the pipeline with perhaps its biggest advantage. With an ffmpeg conversion infrastructure in place, we can also easily output a video's first frame as a poster image to be passed into the <video> tag. If this image is kept at the exact resolution of the video, the browser doesn't need to wait for an indeterminate amount of "video metadata" to be loaded, and can reserve the necessary space in the page layout much faster and without any of these dreaded loading spinners. For the big /blog page, this cuts down the minimum amount of required resources from 69.5 MB to 3.6 MB, finally making it usable again without waiting an eternity for the page to fully load. It's become pretty bad, so I really had to prioritize this task before adding any more blog posts on top.
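As a rough illustration of the poster idea, the following sketch extracts the first frame and emits a matching tag. The file names, the MIME types, and the `preload="none"` attribute are assumptions for the example; the real site generates its markup differently.

```python
from html import escape


def poster_command(src, poster):
    """ffmpeg command line that writes only the first video frame."""
    return ["ffmpeg", "-i", src, "-frames:v", "1", poster]


def video_tag(sources, poster, width, height):
    """Renders a <video> tag whose layout size is known before any video loads.

    `sources` is a list of (url, mime_type) pairs in order of preference.
    """
    lines = [
        f'<video poster="{escape(poster)}" width="{width}" '
        f'height="{height}" preload="none">'
    ]
    for url, mime in sources:
        lines.append(f'  <source src="{escape(url)}" type="{escape(mime)}">')
    lines.append("</video>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(" ".join(poster_command("capture.avi", "capture.png")))
    print(video_tag(
        [("capture-av1.webm", "video/webm"), ("capture-vp9.webm", "video/webm")],
        "capture.png", 640, 400,
    ))
```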

That leaves the player itself, which is basically a sum of lots of little implementation challenges. Single-frame stepping and seeking to discrete frames is the biggest one of them, as it's technically not possible within the <video> tag, which only returns the current time as a continuous value in seconds. It only sort of works for us because the backend can pass the necessary FPS and frame count values to the frontend. These allow us to place a discrete grid of frame "frets" at regular intervals, and thus establish a consistent mapping from frames to seconds and back. The only drawback here is a noticeably weird jump back by one frame when pausing a video within the second half of a frame, caused by snapping the continuous time in seconds back onto the frame grid in order to maintain a consistent frame counter. But the whole feature of frame-based seeking more than makes up for that.
The new scrubbable timeline might be even nicer to use with a mouse or a finger than just letting a video play regularly. With all the tuning work I put into keyframes, seeking is buttery smooth, and much better than the built-in <video> UI of either Chrome or Firefox. Unfortunately, it still costs a whole lot of CPU, but I'd say it's worth it. 🥲
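The frame grid described above boils down to a pair of conversion functions. The sketch below is written in Python for illustration only (the actual player is JavaScript), and the function names and the epsilon that guards against floating-point rounding are my own assumptions.

```python
import math

EPSILON = 1e-6  # Tolerance against rounding errors in time * fps


def frame_to_time(frame, fps):
    """Start time of a frame in seconds, i.e. its "fret" on the timeline."""
    return frame / fps


def time_to_frame(seconds, fps, frame_count):
    """Maps a continuous time onto the frame that contains it."""
    frame = math.floor(seconds * fps + EPSILON)
    return max(0, min(frame_count - 1, frame))


def snap_time(seconds, fps, frame_count):
    """Snaps a continuous time back to the start of its frame."""
    return frame_to_time(time_to_frame(seconds, fps, frame_count), fps)


def step(seconds, fps, frame_count, delta):
    """Time to seek to when stepping `delta` frames from the current time."""
    frame = time_to_frame(seconds, fps, frame_count) + delta
    return frame_to_time(max(0, min(frame_count - 1, frame)), fps)


if __name__ == "__main__":
    # Pausing at 0.37 s in a 10 FPS video snaps back to the start of frame 3.
    print(time_to_frame(0.37, 10, 100), snap_time(0.37, 10, 100))
```

Flooring is what produces the jump on pause: any time within a frame, including its second half, snaps back to that frame's start.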

Finally, the new player also has a few features that might not be immediately obvious:

- Keybindings for almost everything you might want them for, indicated by hovering on top of each button. The tab switchers additionally support the ↑ Up and ↓ Down keys to cycle through all tabs, or the number keys to jump to a specific tab. Couldn't find a way to indicate these mappings in the UI yet.
- Per-video captions now reserve the maximum height of any caption in the layout. This prevents layout reflows when switching through such videos, which previously caused quite annoying lag on the big /blog page.
- Useful fullscreen modes on both desktop and mobile, including all markers and the video caption. Firefox made this harder than it needed to be, and if it weren't for display: contents, the implementation would have been even worse. In the end though, we didn't even need any video pixel sizes from the backend – just as it should be…
- … and supporting Firefox was definitely worth it, as it's the only browser to support nearest-neighbor interpolation on videos.
- As some of the Unicode codepoints on the buttons aren't covered by the default fonts of some operating systems, I've taken them from the Catrinity font, licensed under the SIL Open Font License. With all the edits I did on this font, that license definitely was necessary. I hope I applied it correctly though; it's not straightforward at all how to properly license a Modified Version of an original font with a Reserved Font Name.

And with that, development hell is over, and I finally get to return to the core business! Just more than one month late. Next up: Shipping the oldest still pending order, covering the TH04/TH05 ending script format. Meanwhile, the Seihou community also wants to keep investing in Shuusou Gyoku, so we're also going to see more of that on the side.
