Articles | x264 & How I Learned to Love The Goat (2018)

Information in this article has been amended/developed in a new article, the 2018 Video Processing article. Please refer to that article for the latest information. This article is preserved for educational purposes, but some information is out of date or incorrect.

In this article I will break down several observations, experiments, and technical pukestains that came as a result of preliminary processing for several LPs during Winter 2017 and early 2018.

Note that although I use megui, you can reference this article for ffmpeg as well. However, some differences in options, including their limits and names, have been observed between the two applications (potentially due to different forks in the x264 libraries they use, though I'm really not certain).

Section 1 - Colorspace & Ico

Colorspace Premise

In this section of the article I will break down the colorspace and conversion fiasco that occurred during December 2017, which was a result of discovering color inconsistencies in console recordings via the Blackmagic Intensity Pro between Quarter 4 of 2016 and Quarter 4 of 2017. The goal of documenting these observations is to promote awareness of the dangers of colorspace conversions to motion picture media and the functionality of related hardware and software in a format sensible to someone without formal technical education (such as myself) or experience in producing video content.

Colors in an image are typically expressed in RGB - Red, Green, Blue, with values from 0 (Black) to 255 (White). This colorspace is known as RGB Full. However, many more colorspaces exist, such as RGB Limited, though most of them are relevant only for specific processes, such as commercial-grade printing. sRGB is one of the most common colorspaces seen in media, but when you enter the realm of Television media, such as the Playstation 3 in this case, you'll start encountering more options that can cause a lot of problems for you if you aren't aware of what they are doing. Such was the case when a drunken Swede verified my first two Ico segments, and pointed out there was color crushing in the Grays and Blacks.

If you look up Color Spaces in wikipedia or similar resources you'll likely stumble across a load of 3d models and tables as well as a variety of unique terminology. Colorspaces exist for all kinds of specific functions, such as printers. The subject can be rather daunting to get into at the academic level, but at the internet superweeb level only one basic concept needs to be understood: colorspaces are a form of compression and you want to avoid the results of ineffective compression. Particularly, you want to avoid something that is known as "crushing" - a process that results in "flattened" or inaccurate colors. This process can occur as a result of any number of the steps that the source media took to reach your display, including your media player, the encode process, and the source media that the content developer recorded. Even GPU control panels can enable settings for certain video players that crush colors by default! Awareness of the problem is the first step to ensuring we are viewing imagery as close to the source as possible. Suffice to say, converting between colorspaces is extremely undesirable in very much the same way converting from being woke to liberalism is.

Color Crushing

In years prior, I was informed by Nefarius of an option for AVISynth that forced a colorspace called Rec709 when encoding a video. This colorspace fixed some major color inconsistencies that were appearing in my earlier videos, most noticeably with Greens such as the water in Civilization 5 or grass in Dragon's Dogma. Without manually stating this command, MeGUI will default the encoded video to a colorspace intended for Televisions for some braindead reason, not only warping colors but causing them to become "crushed" - i.e., you are losing shades of color, and thus details. I was able to identify these inconsistencies easily enough by comparing my Fraps source to the final encodes.

Fraps uses the yuvj420 (weed) colorspace, so going from an RGB (Game) source to Fraps (weed) is always going to result in "compression" and crushing. The end result can still be reasonably high quality as long as we avoid a second compression/colorspace conversion (aka AVISynth's default setting). Forcing AVISynth to Rec709 resulted in a massive spike in quality and an insignificant size increase from the new color shades the videos now possessed. From what I've read, 601 (Default) and 709 aren't necessarily extremely different in colors, but the process of converting between them is what becomes extremely destructive, making awareness of this process all the more important. 709 is allegedly a default for many consumer applications when dealing with video larger than 480p, so I'm not sure why it isn't the default here.
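To put a rough number on why a matrix mismatch warps greens in particular, here is a minimal sketch in Python. The luma weights are the published Rec.601 and Rec.709 values; the function names and the "grass" pixel are made up for illustration, and this is not a claim about what AviSynth or MeGUI do internally or about which stage of the pipeline was mismatched.

```python
# Luma weights (Kr, Kb) from the Rec.601 and Rec.709 standards.
REC601 = (0.299, 0.114)
REC709 = (0.2126, 0.0722)

def rgb_to_ycbcr(rgb, matrix):
    """Full-range RGB (0-255) -> Y, Cb, Cr as floats (chroma centred on 0)."""
    kr, kb = matrix
    kg = 1.0 - kr - kb
    r, g, b = rgb
    y = kr * r + kg * g + kb * b
    cb = (b - y) / (2.0 * (1.0 - kb))
    cr = (r - y) / (2.0 * (1.0 - kr))
    return y, cb, cr

def ycbcr_to_rgb(ycc, matrix):
    """Inverse of rgb_to_ycbcr for the given matrix, rounded and clipped to 0-255."""
    kr, kb = matrix
    kg = 1.0 - kr - kb
    y, cb, cr = ycc
    r = y + 2.0 * (1.0 - kr) * cr
    b = y + 2.0 * (1.0 - kb) * cb
    g = (y - kr * r - kb * b) / kg
    return tuple(min(255, max(0, round(v))) for v in (r, g, b))

grass = (60, 160, 70)  # hypothetical "grass green" pixel, not taken from any game

# Encode with 709, then decode with the matching and the mismatched matrix.
same = ycbcr_to_rgb(rgb_to_ycbcr(grass, REC709), REC709)
mixed = ycbcr_to_rgb(rgb_to_ycbcr(grass, REC709), REC601)

print("original :", grass)
print("709->709 :", same)
print("709->601 :", mixed)
```

With matching matrices the pixel survives the round trip. With the mismatch the green channel drifts by well over ten levels, which is at least consistent with greens being the most visible casualty.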

Fraps has the option to record in "Full RGB", but tests with this yielded that roughly half the videos produced were corrupt. Since Fraps is at best alpha software and has never received meaningful updates, and there are no performance-conscious alternatives, the assumption is made that the quality loss from source to Fraps is unavoidable. Thankfully weedspace conversion isn't too bad.

J420

J420 is exactly like I420, but with a full range ("digital", 0-255) luma (Y) component instead of limited range ("analog", 16-240). - Source: VideoLAN Wiki

However, colors can be "crushed" before they even reach the point where I am recording them with Fraps.

The Ico Incident

Crushed black frame from PS3 Fraps Source vs Capcom Logo from encoded DMC3.

When Ico's crushed colors were brought to my attention I referenced older productions, such as DMC and Demon's Souls, which I did not recall having crushed colors. It was easy to determine if they had crushed colors or not, because loading screens and logos typically had 0,0,0 (Black) backgrounds. In a crushed (RGB Limited and equivalent in this specific context) colorspace, that 0,0,0 will become 16,16,16 or something similar, which results in a "Gray" background. Incidentally, the Ico recordings had a 16,16,16 background where there should have been 0,0,0. This in itself is innocuous, except for causing extremely severe encoding artifacts in very specific places, until one realizes what the crushing is actually doing to the rest of the video.
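The 0,0,0 to 16,16,16 shift falls straight out of the usual full-to-limited range formula. The sketch below assumes the standard 8-bit 16-235 limited range; the function name is made up, and it says nothing about where in the capture chain the squeeze actually happened.

```python
def full_to_limited(v):
    """Squeeze a full-range 8-bit value (0-255) into limited range (16-235)."""
    return round(16 + v * 219 / 255)

# Black and white after the squeeze.
print(full_to_limited(0), full_to_limited(255))   # 16 235

# Count how many distinct shades survive out of the original 256.
levels = {full_to_limited(v) for v in range(256)}
print(len(levels))                                # 220
```

So pure black turns into a dark gray, and 256 input shades get packed into 220 output shades, meaning some neighbouring shades merge into one. That merge is the "crush".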

When considering our intent with these recordings, which is to encode them into a fairly heavily compressed format (CRF 28 in original Ico configurations), this problem's magnitude is compounded by the fact that x264 historically struggles with dark grays. The aforementioned logo artifacting resulted in hysterical bands of color radiating outwards from the logo images as well as flickering, and during certain locations and cinematics the game's dark gray shades turned into a vomity mess unless I pushed the encode quality all the way to Bluray levels, which more than doubled the resulting file sizes. Not only was I losing quality to the colorspace conversion, I was also handing the encode process a needlessly difficult source and aggravating the weakest aspects of x264's compression.

Component ptbi "Full RGB" vs ptbi "Limited RGB", or "How I Learned To Love The Dick". Even in bright locations the crushing effect obliterates color depth.

Curiously, switching from RGB Full to RGB Limited in ptbi (the application I use to record the capture card stream) resulted in a seemingly correct colorspace. Originally, I had assumed that the console was handing me a crushed source, but that didn't make sense given that Component cables supported RGB and my previous videos had no issues. At first I assumed ptbi's colorspace labels were flipped, and Limited was Full, but HKS checked the source and informed me that this was not the case. More research was necessary to figure out what was going on.

To confirm that a colorspace crush was in fact occurring, HKS had me load up a gradient in the Playstation browser and take comparison screenshots.

Source vs ptbi's "RGB Full" vs ptbi's "RGB Limited"

This test yielded bizarre results. Were the problem only a result of ptbi mislabelling Full and Limited, then the Limited colorspace would be 1:1 with the source. HKS confirmed this issue did not exist on his own system with the same software, but one critical difference existed between his setup and mine - he used HDMI with a splitter to shred Sony's pointless HDCP encryption, and I used Component.

Component has other issues as well that were brought to attention by these tests. You may have noticed in the first comparison frames that the 16,16,16 black wasn't clean - in fact, the image was heavily distorted, something you'll also notice if you zoom into the black gradients in the above images. This is analog noise, something that also heavily fucks with my video compression since it introduces a swath of unnecessary information into frames.

Further research into ptbi's source code yielded why "Limited" was producing a closer, but not quite accurate, colorspace - ptbi actually attempts to upscale the image colors and bring them closer to RGB Full, a shader that is not active within the RGB Full configuration. What we're seeing in the RGB Full configuration is what the capture card is actually handing us (presumably after YUV/sRGB conversion), and the Limited configuration is ptbi's upscale shader at work. It seems reasonable to assume this shader was actually intended for Limited RGB signals to begin with, and the fact it works as well as it does for a YUV signal is simply luck.
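A range-expansion step of the kind described here can be sketched in a few lines. This is a guess at the arithmetic based on the standard 16-235 limited range, not ptbi's actual shader code, and the function name is hypothetical.

```python
def limited_to_full(v):
    """Stretch a limited-range value (16-235) out to full range (0-255), clipping overshoot."""
    stretched = round((v - 16) * 255 / 219)
    return min(255, max(0, stretched))

# The endpoints land where they should.
print(limited_to_full(16), limited_to_full(235))   # 0 255

# But only 220 input shades exist, so some output shades can never appear.
expanded = sorted({limited_to_full(v) for v in range(16, 236)})
missing = [v for v in range(256) if v not in expanded]
print(len(expanded), len(missing))                 # 220 36
```

Even in the best case, where the input really is a clean limited-range signal, the stretch restores black and white but cannot bring back the shades that were merged earlier. That may be part of why the expanded picture looks closer to the source without ever being 1:1.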

This sent me on a long google adventure, trying to figure out precisely what was going on. Ultimately, I discovered the source of the problem - the capture card itself. The Blackmagic Intensity Pro only supports YUV for component connections. This is not the same colorspace that Fraps uses, it's a limited, CRT-era configuration that greatly damages anything that wasn't specifically built for it. This entirely arbitrary limitation is what is resulting in the crushing, and the other videos being "not broken" in the past is only a result of ptbi's shader, which was doing its thing entirely without me even knowing.

Welp.

The only solution to the problem is to avoid the gimped Component pipeline and use a stripped HDMI signal.

Section 2 - The Dastardly Devil of Compression & Divinity

During preliminary tests for Skyrim and Divinity: Original Sin 2, I came to a difficult crossroads where I had to determine how I would present runs like these titles - those that compressed very poorly due to extremely high motion, high noise, and high contrast. In the past, with Hunted, I had opted to (inaccurately) drop the resolution for the video, which seemed like it would also be a necessary step for Divinity and Skyrim in present day.

My first encode for a D:OS2 segment put the final video around 10gb for ~2:45 hours, a size-to-length ratio that smashed existing record-holding LP sizes, like Dark Souls 3 (10gb for 4 hours). Skyrim was also handing me close to 500mb for only a few minutes of high-motion content. The kicker is that both of these examples were at 28 CRF - extremely low quality.
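For a rough sense of scale, those size-to-length figures can be turned into average bitrates. The sketch assumes 1 GB means 10**9 bytes and takes the rounded sizes and durations quoted above at face value, so treat the output as ballpark only.

```python
def average_mbps(size_gb, hours, minutes=0):
    """Average bitrate in megabits per second, treating 1 GB as 10**9 bytes."""
    seconds = hours * 3600 + minutes * 60
    return size_gb * 1e9 * 8 / seconds / 1e6

divinity = average_mbps(10, 2, 45)   # ~10gb over ~2:45
dark_souls_3 = average_mbps(10, 4)   # ~10gb over 4 hours
ratio = divinity / dark_souls_3

print(f"D:OS2 segment: {divinity:.2f} Mbps")
print(f"Dark Souls 3 : {dark_souls_3:.2f} Mbps")
print(f"ratio        : {ratio:.2f}x")
```

That works out to roughly 8 Mbps against roughly 5.5 Mbps, or about 45% more data per second at the same CRF.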

There were several challenges to overcome surrounding the weeks' worth of time lost to figuring out the ideal solution to the troublesome filesizes, and in this portion of the document I will lay out my observations and experiments.

Compression versus Motion Estimation versus Brain Butter

As detailed in this article, Motion Estimation is a critical element to squeezing every little bit of pudding out of the roadkill when we're dealing with a constant quality as low as 28. However, in specific cases it simply isn't enough. All Motion Estimation does is distribute data more intelligently; if our motion picture requires an amount of data it simply doesn't have to distribute then we'll simply end up rubbing our dick against the cheesegrater from the wrong direction. The natural solution is to, of course, lower the CRF and raise the amount of data we can hand the encoder. However, even dropping to 26 CRF causes the file sizes to balloon wildly out of control. In a small scale media project it would perhaps be more efficient to use bitrate Zones, 2pass encoding, or something that gives us a bit more control in when and what data gets distributed.

Still, I did try to push the Motion Estimations to their maximum settings. This had basically no return. MultiHex and subpixel refinement 9 are about as good as you're going to objectively get without dropping as much as 50% encode speed per increase which, as far as our tests were yielding, was placebo returns at best. I then turned my attention to the other, less explored options of x264, those whose documentation is less than stellar.

B-Frames, P-Frames, Tricking The Brain

Video compression efficiency relies rather heavily on smoke and mirrors. If you've ever paused one of my videos, you'll probably notice that much of it looks rather pukey. Certainly a lot more so than in motion. That's because the individual frames are, especially in the more aggressive settings, pretty pukey. There's a lot more going on than just "motion makes things smoother", though. x264 is actually dividing its encode operations into many different types of frames and attempting to manage the way it compresses videos through these frames. The most well-known kind of frame is a "Keyframe" (which possesses a needlessly difficult to remember name in x264), which makes Seeking easier. I've also read that frames cannot look past keyframes for estimation, so they tend to not be very common - usually placed at an interval that is a multiple of the FPS, such as x10.

However, Megui presents you with options related to several less-obvious frame types that, superficially, seem like they might help me out with compressing my troublesome footage. Unfortunately, megui's popups don't tell you much of the story necessary to understand these settings, and even after combing articles and wikis all I've really been able to conclude is that I am not alone in my confusion.

From what I can tell, one of the most important settings that simultaneously dictates a huge chunk of how the encode process works but also doesn't give you any real room to work is the "Number of B-Frames" setting. In Constant Quality this is regarded to be basically 2 or 3. Anything more causes noticeable quality impacts. In other quality settings you can change this, but the effects of doing so are not well-documented (if at all).

Sets the maximum number of concurrent B-frames that x264 can use. B-frames are similar to P-frames, except they can use motion prediction from future frames as well. - Source: WikiBooks

A B-frame can be seen almost as "padding" - a frame that contains only changes between frames. What exactly a P-frame is, is not explained anywhere immediately visible. One can make fairly concise conclusions on the differences, though - P-frames are larger and contain more data, while B-frames are compression-conscious filler. x264, through the "Adaptive B-Frames" setting, attempts to distribute the frames in a manner to preserve quality in demanding situations. Even so, nowhere is it recommended to use above 3. Setting this value to 2 caused a noticeable, but not colossal, size difference. Thus, although this setting seems like it could allow us to more aggressively pull quality out of our not as demanding scenes, in practice it is never noted to be effective for doing so.

The "Number of Extra I-Frames" is specifically related to the codec looking for scene changes to place something akin to keyframes. The name of this value is misleading, as it's actually a sensitivity value that seeks to combat the whitewashing of color/light bleeding between compressed frames when a large shift in visuals suddenly occurs. The "Number of Reference Frames" did not incite any notable change in quality in my tests, but dramatically impacted encode speed.
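For anyone working from the command line instead of megui, the settings discussed in this section correspond roughly to x264 options like the ones below. The option names are real x264 flags, but the mapping from megui's labels to them is an assumption, the 30 FPS default and the file names are hypothetical, and the helper only builds the argument list without running anything.

```python
def x264_args(source, output, fps=30, crf=28, bframes=3):
    """Build an x264 command line as a list of strings (nothing is executed)."""
    return [
        "x264",
        "--crf", str(crf),           # constant quality; lower means bigger files
        "--me", "umh",               # "MultiHex" motion estimation
        "--subme", "9",              # subpixel refinement level
        "--bframes", str(bframes),   # maximum consecutive B-frames
        "--b-adapt", "2",            # adaptive B-frame placement (assumed mapping)
        "--keyint", str(fps * 10),   # max keyframe interval: 10 seconds of frames
        "--scenecut", "40",          # scene-change sensitivity (x264's default value)
        "--output", output,
        source,
    ]

# Hypothetical file names, purely for illustration.
print(" ".join(x264_args("segment.avs", "segment.mkv")))
```

Building the list in one place makes it easier to change a single value, such as CRF or B-frame count, and re-run an otherwise identical test encode.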

In conclusion, although spending days reading and testing the functionality of the options offered a lot of insight into how the codec works, it yielded nothing valuable for my goals. The research does, however, cement that my existing settings - and x264's presets - are pretty much the best you can get out of this menu. What the research does help us understand is how the increase of bitrate actually affects the overall distribution of new information. It isn't a flat increase - rather, you can think of it as a curve. Darks, low-motion and otherwise low-demanding content won't ramp up in filesize nearly as much as the high-contrast, high-motion content I was testing. Therefore, my tests were biasing the results towards extreme cases: a run containing a mix of high and low demanding footage wouldn't all escalate by a 50% increase in filesize. However, I had at the very least two Divinity segments that consisted almost entirely of this troublesome material, and couldn't guess what the Skyrim run would actually look like as a body in terms of compression challenges. I had to treat all content as troublesome, and therefore assume I needed a solution that fit the runs as a whole. Ergo, I couldn't simply increase quality and just hope other segments wouldn't be so large. I needed to bite the head of the penis to get a taste of the sack afterwards.

Rescaling is Rubbish

Tests with sprites and previous experiences had already prepared my anus for the fiasco that would come as a result of experimenting with optimal downscaling. However, I found that Divinity's sharp image contrasts got completely obliterated by Lanczos, which almost behaved like an Nvidia(TM) The Way It's Meant To be Played(TM) Blur Filter(TM) (aka FXAA(TM)) and "flattened" the image noticeably. Amusingly, performing the downscaling in Vegas resulted in an unwatchable mass of disgusting aliasing.

The different rescaling options are detailed in the Avisynth wiki. By detailed I mean you'll get an extremely vague and largely useless idea of what they do and have to test them on your own anyways. Which is precisely what I did.

Simply rescaling the image isn't enough for anything that isn't sprite footage - pixelresize, or pointresize, works fine for that. They'll introduce a great deal of aliasing and cancer in anything else. Bicubic, Bisexual and other basic settings don't produce good results either. Of course, downscaling an image is very different from upscaling it, so I turned to the tried and true filter I use for D&D images - Lanczos.
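The aliasing-versus-mush trade-off shows up even in a toy one-dimensional case. The sketch below is plain Python on a single row of pixels, not AviSynth's actual resizers, and both function names are made up for illustration.

```python
def point_downscale(row, factor):
    """Keep every Nth pixel and throw the rest away (what a point resize does)."""
    return row[::factor]

def box_downscale(row, factor):
    """Average each group of N neighbouring pixels into one."""
    return [sum(row[i:i + factor]) / factor for i in range(0, len(row), factor)]

stripes = [0, 255] * 8   # one-pixel black/white stripes, 16 pixels wide

print(point_downscale(stripes, 2))   # all black: the white stripes vanish entirely
print(box_downscale(stripes, 2))     # all mid-gray: the stripes blur into mush
```

Point sampling keeps hard edges, which is why it suits sprites, but it discards whatever falls between the samples. Averaging keeps the overall brightness and loses the detail. The fancier filters are different compromises between those two failure modes.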

Superficially, Lanczos is the best resize filter. Blackman isn't too bad, either, and Spline appears like it would be comparatively as good (especially since it claims to deal with certain artifacts that can come as a result of Lanczos). However, research into these settings exposed me to a new set of configurations and behavior I hadn't been aware of until this point.

Tapping The Booty

parameter taps

int taps = 3

Basically, taps affects sharpness. Default 3, range 1-100. Equal to the number of filter lobes (ignoring mirroring around the origin).

Note: the input argument named taps should really be called "lobes". - Source: Avisynth Wiki

The first sentence makes perfect sense, but the rest of the description quickly becomes useless to the layman. As pretty much anyone who has ever looked into Skyrim modding or ENB knows, simply "sharpening" something is like devouring a turd, throwing it up, then blasting it with piss to polish it and calling it a "high-definition" result. Knowing how tapping the booty is actually sharpening the image would be nice, but unfortunately without being a nerd my understanding of the concept was quickly exceeded when we started talking about ear lobes in a discussion about booty.
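For what it's worth, the "lobes" are just the humps of the curve the filter uses to weigh neighbouring pixels. A minimal sketch of the textbook Lanczos kernel is below; it is not AviSynth's implementation, the helper names are made up, and it ignores the kernel widening that resizers typically apply when downscaling.

```python
import math

def lanczos(x, taps=3):
    """Lanczos kernel: sinc(x) * sinc(x / taps) inside |x| < taps, zero outside."""
    if x == 0:
        return 1.0
    if abs(x) >= taps:
        return 0.0
    px = math.pi * x
    return taps * math.sin(px) * math.sin(px / taps) / (px * px)

def weights(taps, phase=0.5):
    """Normalised weights for an output pixel sitting `phase` between two source pixels."""
    offsets = [i - phase for i in range(-taps + 1, taps + 1)]
    raw = [lanczos(o, taps) for o in offsets]
    total = sum(raw)
    return [w / total for w in raw]

for taps in (3, 6):
    w = weights(taps)
    # Print the tap count, how many source pixels contribute, and their weights.
    print(taps, len(w), [round(v, 3) for v in w])
```

More taps means more neighbouring pixels contribute to each output pixel, and some of them contribute with negative weight. Those negative humps are what push edges harder and read as "sharper", and they are also where ringing artifacts come from when pushed too far.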

As the following descriptions only talk about lobes and samples, something I've literally never even heard of with regard to encoding, I took it upon myself to make a billion tests once more. Surely enough, telling Lanczos to tap the booty resulted in an objectively superior image. The problem was, however, that I really couldn't tell a major difference between Spline64 and Lanczos at 4 taps. The description of taps insinuated I needed 6 taps of the booty for a video the size of 800p from 1200p, which for all I know is a woefully erroneous assumption, but nonetheless I committed to the 6 taps and re-encoded everything for the 5th time.

Compared to the existing Lanczos resizes, the problematic Segment 2 skyrocketed from 5.5gb to 8.1gb. It makes sense - the additional sharpness meant more information was being pulled out of the goat hole to display that new information, so the overall file would be larger. I hadn't expected such an enormous jump, however - this put us within pissing distance of the fullsize encode. The difference was the full size encode was 28 CRF and my 800p Lanczos taps=6 encode was CRF 25. I effectively traded resolution for clarity in movement - 28 CRF simply couldn't deal with the grass and trees while moving, but the 800p video was fine. Furthermore, these settings smashed the sizes on darker settings, dropping the first segment to 1.3gb (with corrected framerate) from 1.4gb (fullsize). Given the historic nature of higher CRF on 720p causing major spikes in size, I was pretty content with those results, even if they weren't ideal.

Lanczos without manually defined settings versus taps=6

Notably, the fire and bushes at the top right receive significant changes and more definition in taps=6, but changes to compression are visible throughout the image.

Still frames made judging the differences in the resizes extraordinarily difficult. I needed large samples to really pit these things against each other, because the true damage Lanczos was doing in default settings was exposed best by motion within the forest scenes. When the image appears flat it looks like there is an FXAA filter being applied, which softened the entire image rather considerably and brought me right back to disgusting games such as Conan. The critical thing when judging these visuals is, again, the knowledge that video encoding is all about perceived quality at these high compression settings, which can make things problematic to judge at a glance. Furthermore, I have to test using both of my monitors - one has a slower response and has more difficulty with dark colors than the other, and videos that appear fine in that monitor may have subtle issues that become far more apparent when viewed on the newer display.

These results exposed me to a new set of choir boy revelations - the main reason I was gaining size back when downscaling wasn't simply the smaller size of the image, but also that the algorithm I had been using to do so - Lanczos - softens the image and reduces fidelity by a great deal without the appropriate configurations given to it. Simply throwing CRF at it like I had been doing wouldn't fix the problem. I had to re-think how I was approaching the problem altogether.

By this point an entire month had passed since I first started critically assessing my videos for the upcoming Skyrim and in-progress Ico projects. Literally the only reason why it was taking so long was the day-long encode times each segment demanded. Since the rescaling couldn't be easily assessed on small-scale footage, I had to bite the meaty bullet and plunge into full-scale segment tests to really assess if the results I was getting were good or not. Suffice to say, I do this without question, no matter how bitter throwing away all that time is.

Why?

Regardless of the perceived quality of the game I am recording, it is my responsibility to provide a video as close to the source as possible within reason. I cannot be biased against encode quality for any given game, even one as bad as Conan, for the simple fact that if I am to present a piece of media I am therefore responsible for presenting it in the highest quality I can achieve. The content I am portraying in the project is irrelevant, quality of that presentation is my only consideration when putting together the actual videos. Too many people settle for less and that's why we have Youtube - trash for people without standards. If I can't produce something that meets my own specifications for quality then why the hell would I release it?