Good Code Design From Linux/Kernel

Learn how parts of the Linux and FFmpeg C codebases are organized to be extensible, almost as if they were designed with "polymorphism" in mind. Specifically, we're going to briefly explore how the Linux concept of everything is a file works at the source code level, as well as how FFmpeg can add support for new formats and codecs quickly and easily.


Good software design – Introduction

To write useful and maintainable software over the long term, we tend to look for patterns and group them into abstractions, and it seems that's the case for the developers behind Linux and FFmpeg too.

Software design

When we’re creating software, we’re building data structures and defining their behaviors and dependencies. The way we create and link them can be seen as the design/architecture of the software.

Let's say we're building a media framework that encodes/decodes video and audio. The codecs AV1, H264, HEVC, and AAC all perform some common operations, and if we can provide a generic abstraction that holds these common operations and data, we can use this concept instead of relying on the concrete details of what a specific codec does.

Over the years, many developers have noticed that good software design pays off as software grows in complexity.

This is one of the ideas behind good software design: to rely on components that are weakly linked and that have clear boundaries around what each one should do.

Ruby

Maybe it’s easier to see all these concepts in practice. Let’s code a quick pseudo media stream framework that provides encoding and decoding for several codecs.


class AV1
  def encode(bytes)
  end

  def decode(bytes)
  end
end

class H264
  def encode(bytes)
  end

  def decode(bytes)
  end
end
# …

SUPPORTED_CODECS = [AV1.new, H264.new, HEVC.new]

class MediaFramework
  def encode(type, bytes)
    codec = SUPPORTED_CODECS.find { |c| c.class.name.downcase == type }
    codec.encode(bytes)
  end
end


This Ruby pseudo-code tries to recreate what we discussed above: there is an implicit concept here of what operations a codec must have; in this case, the operations are encode and decode. Since Ruby is a dynamically typed language, any class can provide these two operations and act as a codec for us.

Developers sometimes use the words contract, API, interface, behavior, and operations as synonyms.

This design might be considered good because if we want to add a new codec we just need to provide an implementation and add it to the list (the list could even be built dynamically). The idea is that this code seems easy to extend and maintain because it keeps the links between components weak (low coupling) and each component does only what it should do (high cohesion).

The Rails framework even enforces a way to organize the code: it adopts the model-view-controller (MVC) architecture.

Golang

When we go (no pun intended) to a statically typed language like Go, we need to be more formal and describe the required types, but it's still doable.


package codec

type Codec interface {
	Encode(data []int) ([]int, error)
	Decode(data []int) ([]int, error)
}

type H264 struct{}

func (H264) Encode(data []int) ([]int, error) {
	// … lots of code
	return data, nil
}

func (H264) Decode(data []int) ([]int, error) {
	return data, nil
}

type AV1 struct{}

func (AV1) Encode(data []int) ([]int, error) { return data, nil }
func (AV1) Decode(data []int) ([]int, error) { return data, nil }

var supportedCodecs = []Codec{H264{}, AV1{}}

func Encode(codec string, data []int) {
	// here we can choose and use, e.g.
	// supportedCodecs[0].Encode(data)
}


The interface type in Go is much more powerful than Java's similar construct because its definition is totally disconnected from the implementation and vice versa. We could even make each codec an io.ReadWriter and use it all around.

C

In the C language we can still create the same behavior, but it's a little bit different.


/* the abstract operations (the "interface") any codec must provide */
struct Codec
{
	int *(*encode)(int *bytes);
	int *(*decode)(int *bytes);
};

int *h264_encode(int *bytes)
{
	/* ... real encoding code ... */
	return bytes;
}

int *h264_decode(int *bytes)
{
	/* ... real decoding code ... */
	return bytes;
}

/* av1_encode and av1_decode would be defined the same way */
int *av1_encode(int *bytes) { return bytes; }
int *av1_decode(int *bytes) { return bytes; }

/* the concrete codecs fill the generic struct with their functions */
struct Codec av1 =
{
	.encode = av1_encode,
	.decode = av1_decode
};

struct Codec h264 =
{
	.encode = h264_encode,
	.decode = h264_decode
};

int main(void)
{
	int bytes[] = {1, 2, 3};
	h264.encode(bytes);
	return 0;
}


Code inspired by https://www.bottomupcs.com/abstration.xhtml

We first define the abstract operations (function pointers, in this case) in a generic struct and then we fill it with the concrete code, like the real av1 encoder and decoder functions.

Many other languages have somewhat similar mechanisms to dispatch methods or functions as if they were part of an agreed protocol, so the system integration code can deal only with these high-level abstractions.
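For instance, building on the struct Codec sketch above, a hypothetical piece of integration code (the transcode function below is just an illustration, not from any real codebase) could be written against the abstraction alone:

/* hypothetical integration code: it depends only on the abstract
 * struct Codec, never on a concrete codec such as h264 or av1 */
int *transcode(struct Codec *decoder, struct Codec *encoder, int *bytes)
{
	int *raw = decoder->decode(bytes);
	return encoder->encode(raw);
}

/* usage: transcode(&h264, &av1, bytes);
 * swapping codecs requires no change to transcode() at all */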

Linux Kernel – Everything is a file

Have you ever heard the expression everything is a file in Linux? The idea is to have a common interface for all kinds of resources in Linux: for instance, Linux handles network sockets, special files (like /proc/cpuinfo) and even USB devices as files.

This is a powerful idea that makes it easy to write or use programs for Linux, since we can rely on a set of well-known operations from this abstraction called a file. Let's see this in action:


# the first case is the easiest, we're just reading a plain text file
$ cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
# now here, we think we're reading a file but we are not! (technically yes.. anyway)
$ cat /proc/meminfo
MemTotal: 2046844 kB
MemFree: 546984 kB
MemAvailable: 1535688 kB
Buffers: 162676 kB
Cached: 892000 kB
# and finally we open a file (using fd=3) for read/write
# the "file" being a socket, we then send a request to this file >&3
# and we read from this same "file"
$ exec 3<> /dev/tcp/www.google.com/80
$ printf 'HEAD / HTTP/1.1\nHost: www.google.com\nConnection: close\n\n' >&3
$ cat <&3
HTTP/1.1 200 OK
Date: Wed, 21 Aug 2019 12:48:40 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2019-08-21-12; expires=Fri, 20-Sep-2019 12:48:40 GMT; path=/; domain=.google.com
Set-Cookie: NID=188=K69nLKjqge87Ymv4h-gAW_lRfLCo7-KrTf01ULtY278lUUcaNxlEqXExDtVB104pdA8CLUZI8LMvJv26P_D8RMF3qCDzLTpjji96B9v_miGlZOIBro6pDreHP0yW7dz-9myBfOgdQjroAc0wWvOAkBu-zgFW_Of9VpK3IfIaBok; expires=Thu, 20-Feb-2020 12:48:40 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding
Connection: close


This is only possible because the concept of a file (data structure and operations) was designed to be one of the main ways to communicate among sub-systems. Here's a glimpse of the file_operations API.


struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	// ...
};

The struct file_operations defines what one should expect from the concept of what a file can do.


const struct file_operations ext4_dir_operations = {
	.llseek = ext4_dir_llseek,
	.read   = generic_read_dir,
	//..
};


Here we can see the directory implementation of these operations for the ext4 file system.


static const struct file_operations proc_cpuinfo_operations = {
	.open    = cpuinfo_open,
	.read    = seq_read,
	.llseek  = seq_lseek,
	.release = seq_release,
};

Even the cpuinfo proc file is implemented over this abstraction. When you operate on files under Linux you're actually dealing with the VFS layer, which delegates each operation to the proper file implementation.

[Diagram: the VFS delegating file operations to the proper implementation]

Source: https://ops.tips/blog/what-is-that-proc-thing/
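The following is not the actual kernel code, just a simplified sketch of that delegation idea, assuming (as in the kernel) that each struct file carries an f_op pointer to its file_operations:

/* simplified sketch, not real kernel code: the caller only knows the
 * generic read operation; the function pointer stored in the file at
 * open() time routes the call to ext4, procfs, or any other implementation */
ssize_t vfs_style_read(struct file *file, char __user *buf,
                       size_t count, loff_t *pos)
{
	/* file->f_op might point to ext4_dir_operations,
	 * proc_cpuinfo_operations, etc. */
	return file->f_op->read(file, buf, count, pos);
}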

FFmpeg – Formats

Here's an overview of the FFmpeg flow/architecture showing that the internal components are linked mostly to abstract concepts like AVCodec, not directly to their implementations (H264, AV1, etc.).

[Diagram: FFmpeg architecture view from the transmuxing flow]

 

For the input files, FFmpeg defines a struct called AVInputFormat that is implemented by any format (video container) that wants to be usable as an input. The MKV demuxer fills this structure with its implementation, and so does the MP4 (MOV) one.

 


typedef struct AVInputFormat {
    const char *name;
    const char *long_name;
    const char *extensions;
    const char *mime_type;
    ff_const59 struct AVInputFormat *next;
    int raw_codec_id;
    int priv_data_size;
    int (*read_probe)(const AVProbeData *);
    int (*read_header)(struct AVFormatContext *);
    // ...
} AVInputFormat;

// matroska
AVInputFormat ff_matroska_demuxer = {
    .name           = "matroska,webm",
    .long_name      = NULL_IF_CONFIG_SMALL("Matroska / WebM"),
    .extensions     = "mkv,mk3d,mka,mks",
    .priv_data_size = sizeof(MatroskaDemuxContext),
    .read_probe     = matroska_probe,
    .read_header    = matroska_read_header,
    .read_packet    = matroska_read_packet,
    .read_close     = matroska_read_close,
    .read_seek      = matroska_read_seek,
    .mime_type      = "audio/webm,audio/x-matroska,video/webm,video/x-matroska"
};

// mov (mp4)
AVInputFormat ff_mov_demuxer = {
    .name           = "mov,mp4,m4a,3gp,3g2,mj2",
    .long_name      = NULL_IF_CONFIG_SMALL("QuickTime / MOV"),
    .priv_class     = &mov_class,
    .priv_data_size = sizeof(MOVContext),
    .extensions     = "mov,mp4,m4a,3gp,3g2,mj2",
    .read_probe     = mov_probe,
    .read_header    = mov_read_header,
    .read_packet    = mov_read_packet,
    .read_close     = mov_read_close,
    .read_seek      = mov_read_seek,
    .flags          = AVFMT_NO_BYTE_SEEK | AVFMT_SEEK_TO_PTS,
};


This design allows new codecs, formats, and protocols to be integrated and released more easily. dav1d (an open-source AV1 decoder) was integrated into FFmpeg in May of this year, and you can follow along the commit diff to see how easy it was. In the end, it needs to register itself as an available codec and provide the expected operations.


+AVCodec ff_libdav1d_decoder = {
+    .name           = "libdav1d",
+    .long_name      = NULL_IF_CONFIG_SMALL("dav1d AV1 decoder by VideoLAN"),
+    .type           = AVMEDIA_TYPE_VIDEO,
+    .id             = AV_CODEC_ID_AV1,
+    .priv_data_size = sizeof(Libdav1dContext),
+    .init           = libdav1d_init,
+    .close          = libdav1d_close,
+    .flush          = libdav1d_flush,
+    .receive_frame  = libdav1d_receive_frame,
+    .capabilities   = AV_CODEC_CAP_DELAY | AV_CODEC_CAP_AUTO_THREADS,
+    .caps_internal  = FF_CODEC_CAP_INIT_THREADSAFE | FF_CODEC_CAP_INIT_CLEANUP |
+                      FF_CODEC_CAP_SETS_PKT_DTS,
+    .priv_class     = &libdav1d_class,
+    .wrapper_name   = "libdav1d",
+};


No matter the language we use, we can (or at least try to) build software with low coupling and high cohesion in mind; these two basic properties allow you to build software that is easier to maintain and extend.

Use URL.createObjectURL to make your videos start faster

[Animated gif: faster playback start-up]

During our last hackathon, we wanted to make our playback start faster. Before our player shows anything to the final user, we issue around 5 to 6 requests (counting some manifests), and the goal was to cut as many of them as we could.


The first step was very easy: we just moved the logic from the client side to the server side and then injected the prepared player on the page.

Pseudo Ruby server side code:

some_api = get("http://some.api/v/#{@id}/playlist")
other_api = get("http://other.api/v/#{some_api.id}/playlist")
# ...
@final_uri = "#{protocol}://#{domain}/#{path}/#{manifest}"

Pseudo JS client side code:

new Our.Player({source: {{ @final_uri }} });


Okay, that's nice, but can we go further? Yes: how about embedding our manifests into the page?! It turns out we can do that with the power of URL.createObjectURL, an API that gives us a URL for an in-memory JS blob/object/file.

// URL.createObjectURL is pretty trivial
// to use and powerful as well
var blob = new Blob(["#M3U8...."], {type: "application/x-mpegurl"});
var url = URL.createObjectURL(blob);

Pseudo Ruby server side code:

some_api = get("http://some.api/v/#{@id}/playlist")
other_api = get("http://other.api/v/#{some_api.id}/playlist")
# ...
@final_uri = "#{protocol}://#{domain}/#{path}/#{manifest}"
@main_manifest = get(@final_uri)
@sub_manifests = @main_manifest
                 .split_by_uri
                 .map {|uri| get(uri)}

Pseudo JS client side code:

  var mime = "application/x-mpegurl";
  var manifest = {{ @main_manifest }};
  var subManifests = {{ @sub_manifests }};
  var subManifestsBlobURL = subManifests
                            .map(content => objectURLFor(content, mime));
  var finalMainManifest = manifest
                          .splitByLine()
                          .map(line => line.replace(id, subManifestsBlobURL[id]))
                          .joinWithLines();

  function objectURLFor(content, mime) {
    var blob = new Blob([content], {type: mime});
    return URL.createObjectURL(blob);
  }

  new Our.Player({
    src: objectURLFor(finalMainManifest, mime)
  });


We thought we were done, but then we came up with the idea of doing the same thing for the first video segment. The page now weighs more, but the player starts playing almost instantaneously.

// for regular text manifest we can use regular Blob objects
// but for binary data we can rely on Uint8Array
var segment = new Uint8Array({{ segments.first }});

By the way, our player is based on Clappr, and this particular test was done with the hls.js playback, which uses the fetch API to get the video segments; fetching this created URL works just fine.

The animated gif you see at the start of the post was recorded without the segment-on-the-page optimization. And we simply ignored the possible side effects on the player's ABR algorithm (which could think it has high bandwidth due to the fast manifest fetch).

Finally, we could make it even faster by using MPEG-DASH and its template timeline format, shorter segment sizes, and an ABR algorithm tuned to be faster initially.

How to measure video quality perception

Update 3 (05/16/2020): Wrote an updated guide to use VMAF through FFmpeg.

Update 2 (01/06/2016): Fixed reference video bitrate unit from Kbps to KBps

Update 1 (10/16/2016): Anne Aaron presented the VMAF at the Demuxed 2016.

When working with videos, you should focus your efforts on the best streaming quality, lower bandwidth usage, and low latency in order to deliver the best experience to your users.

This is not an easy task. You often need to test different bitrates, encoder parameters, fine-tune your CDN and even try new codecs. You usually run a process of testing combinations of configurations and codecs and check the final renditions with your naked eyes. This process doesn't scale; can't we just trust computers to do the checking?

bit rate (bitrate): a measure often used in digital video; it is usually assumed to be the rate of bits per second, and it is one of the many terms used in video streaming.

[Image: same resolution, different bitrates.]

codec: an electronic circuit or software that compresses or decompresses digital content (e.g. H264 (AVC), VP9, AAC (HE-AAC), AV1, etc.).

We were about to start a new hack day session here at Globo.com, and since some of us had learned how to measure the noise introduced when encoding and compressing images, we thought we could apply those methods to measure video quality.

We started by using the PSNR (peak signal-to-noise ratio) algorithm, which can be defined in terms of the mean squared error (MSE), expressed on a decibel scale.

PSNR: is an engineering term for the ratio between the maximum possible power of a signal and the power of corrupting noise.

First, you calculate the MSE, which is the average of the squared errors, and then you convert it to decibels.


MSE  = (1 / (m * n)) * ∑∑ (n1[i][j] - n2[i][j])^2
       where n1 is the original image, n2 the comparable image, and m x n is the image size

PSNR = 10 * log10( MAX^2 / MSE )
       where MAX is the maximum possible pixel value of the image


For 3D signals (color images), your MSE needs to sum the MSE of each plane (i.e. RGB, YUV, etc.) and then divide by 3 (or, equivalently, use 3 * MAX^2 in the PSNR formula).
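To make the formulas concrete, here is a minimal C sketch (a hypothetical helper, not the code we actually used) that computes the MSE and PSNR between two 8-bit grayscale frames; the color case just repeats this per plane and averages:

#include <math.h>
#include <stddef.h>

/* hypothetical helper: PSNR between two 8-bit grayscale frames (MAX = 255) */
double psnr(const unsigned char *n1, const unsigned char *n2,
            size_t width, size_t height)
{
	double mse = 0.0;

	for (size_t i = 0; i < width * height; i++) {
		double diff = (double)n1[i] - (double)n2[i];
		mse += diff * diff;
	}
	mse /= (double)(width * height);

	if (mse == 0.0)
		return INFINITY; /* identical frames */

	return 10.0 * log10((255.0 * 255.0) / mse);
}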

To validate our idea, we downloaded videos (720p, h264) with a bitrate of 3400 kbps from distinct groups like News, Soap Opera, and Sports. We called this group of videos the pivots or reference videos. After that, we generated some transrated versions of them with lower bitrates. We created 700 kbps, 900 kbps, 1300 kbps, 1900 kbps, and 2800 kbps renditions for each reference video.

Heads up! Typically the pivot video (most commonly referred to as the reference video) uses truly lossless compression; the bitrate for a 720p YUV420p raw video would be 1280 x 720 x 1.5 (given the YUV420 format) x 24 fps / 1000 = 33177.6 KBps (about 265 Mbps), far more than what we used as reference (3400 kbps).

We extracted 25 images from each video and calculated the PSNR, comparing each pivot image with the modified ones. Finally, we calculated the mean. Just to help you understand the numbers below, a higher PSNR means that the image is more similar to the pivot.

             700 kbps   900 kbps   1300 kbps   1900 kbps   2800 kbps
Soap Opera   35.0124    36.5159    38.6041     40.3441     41.9447
News         28.6414    30.0076    32.6577     35.1601     37.0301
Sports       32.5675    34.5158    37.2104     39.4079     41.4540

(The 3400 kbps rendition is the reference itself, so it has no PSNR value.)
[Image: a visual sample.]

We defined a PSNR of 38 (from our observations) as the goal, but then we noticed that the News group didn't reach it. When we plotted the News data on a graph, we could see what happened.

The issue with the videos from the News group is that they're a combination of different sources: external traffic cameras with poor resolution, talking heads shot with a studio camera with good resolution and quality, some scenes with computer graphics (like the weather report), and so on. We suspected that the News average was affected by those outliers, but this kind of video is part of our reality.

[Graph: the different video sources are visible as clusters. (PSNR per frame)]

We needed a better way to measure quality perception, so we searched for alternatives and found one of Netflix's posts: Toward a Practical Perceptual Video Quality Metric (VMAF). There we learned that PSNR does not consistently reflect human perception and that Netflix was creating ways to approach this with the VMAF model.

They created a dataset with several videos, including videos that are not part of the Netflix library, and had real people grade them. They called this score DMOS. Now they could compare how each algorithm scores against DMOS.

[Graph: FastSSIM, PSNRHVS, PSNR and SSIM (y) vs DMOS (x)]

They realized that none of them was perfect, even though each has its strengths in certain situations. So they adopted a machine-learning model (a Support Vector Machine (SVM) regressor) to design a metric that seeks to reflect human perception of video quality.

The Netflix approach is much wider than using PSNR alone. They take into account more features like motion, different resolutions, and screens, and they even allow you to train the model with your own video dataset.

“We developed Video Multimethod Assessment Fusion, or VMAF, that predicts subjective quality by combining multiple elementary quality metrics. The basic rationale is that each elementary metric may have its own strengths and weaknesses with respect to the source content characteristics, type of artifacts, and degree of distortion. By ‘fusing’ elementary metrics into a final metric using a machine-learning algorithm – in our case, a Support Vector Machine (SVM) regressor”

Netflix about VMAF

The best news (pun intended) is that VMAF is FOSS, released by Netflix, and you can use it now. The following commands can be executed in the terminal. Basically, with Docker installed, they build VMAF, download a video, transcode it (using an FFmpeg Docker image) to generate a comparable video, and finally check the VMAF score.


# clone the project (later they'll push a docker image to dockerhub)
git clone --depth 1 https://github.com/Netflix/vmaf.git vmaf
cd vmaf
# build the image
docker build -t vmaf .
# get the pivot video (reference video)
wget http://www.sample-videos.com/video/mp4/360/big_buck_bunny_360p_5mb.mp4
# generate a new transcoded video (vp9, vcodec:500kbps)
docker run --rm -v $(pwd):/files jrottenberg/ffmpeg -i /files/big_buck_bunny_360p_5mb.mp4 -c:v libvpx-vp9 -b:v 500K -c:a libvorbis /files/big_buck_bunny_360p.webm
# extract the yuv (yuv420p) color space from them
docker run --rm -v $(pwd):/files jrottenberg/ffmpeg -i /files/big_buck_bunny_360p_5mb.mp4 -c:v rawvideo -pix_fmt yuv420p /files/360p_mpeg4-v_1000.yuv
docker run --rm -v $(pwd):/files jrottenberg/ffmpeg -i /files/big_buck_bunny_360p.webm -c:v rawvideo -pix_fmt yuv420p /files/360p_vp9_700.yuv
# check the VMAF score
docker run --rm -v $(pwd):/files vmaf run_vmaf yuv420p 640 368 /files/360p_mpeg4-v_1000.yuv /files/360p_vp9_700.yuv --out-fmt json
# and you can even check the VMAF score using an existing trained model
docker run --rm -v $(pwd):/files vmaf run_vmaf yuv420p 640 368 /files/360p_mpeg4-v_1000.yuv /files/360p_vp9_700.yuv --out-fmt json --model /files/resource/model/nflxall_vmafv4.pkl


You saved around 1.89 MB (37%) and still got a VMAF score of 94.


{
  "aggregate": {
    "VMAF_feature_adm2_score": 0.9865012294519826,
    "VMAF_feature_motion_score": 2.6486005151515153,
    "VMAF_feature_vif_scale0_score": 0.85336751265595612,
    "VMAF_feature_vif_scale1_score": 0.97274233143291644,
    "VMAF_feature_vif_scale2_score": 0.98624814558455487,
    "VMAF_feature_vif_scale3_score": 0.99218556024841664,
    "VMAF_score": 94.143067486687571,
    "method": "mean"
  }
}

Using a fused solution like VMAF or VQM-VFD proved to be better than using a single metric. There are still issues to be solved, but I think it's reasonable to use such algorithms plus A/B tests, given how impractical it is to hire people to check video impairments.

A/B tests: for instance, you could offer the newest changes to X% of your user base for Y days and see how much they reject them.