Image and video sequence coding is motivated by the same needs as other information sources, such as audio or generic data. However, its three-dimensional nature, spanning both the spatial and the temporal domains, substantially differentiates it from those sources and requires compression techniques and strategies of greater complexity and computational cost.

One of the main factors that characterise video coding is its compression ratio, which depends on the initial information volume of the video source and on the capacity of the transmission channel or the storage space available. The amount of information in a video sequence is determined by the spatial resolution of the image, that is, the number of pixels in its horizontal and vertical dimensions, and by its temporal resolution, defined by its frame rate or temporal refresh rate. The evolution of video formats over the years has been characterised by a sharp increase in both resolutions, from the so-called QCIF format, with 176×144 pixels and temporal rates of 15 frames per second, to the emerging Ultra High Definition or UHDTV formats, known as 4K and 8K, which reach resolutions of up to 7680×4320 pixels with high temporal rates or HFR (High Frame Rates) of up to 120 frames/s.

The following figure shows the main video formats commonly used in professional environments, evidencing how their data volume ranges from the roughly 165 Mbit/s corresponding to standard definition TV formats to the almost 80 Gbit/s corresponding to the 8K format. Storing or transmitting such volumes uncompressed is technically and economically impractical nowadays.
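As an illustration, this data volume follows directly from the format parameters. The sketch below (in Python, assuming 4:2:2 chroma subsampling, i.e. two samples per pixel on average, and the bit depths indicated; these assumptions are illustrative, not taken from the figure) reproduces figures of this order of magnitude:

    # Rough uncompressed data-rate estimate (a sketch, assuming 4:2:2 sampling,
    # i.e. 2 samples per pixel on average).
    def raw_rate_mbps(width, height, fps, bits_per_sample, samples_per_pixel=2.0):
        return width * height * fps * bits_per_sample * samples_per_pixel / 1e6

    print(raw_rate_mbps(720, 576, 25, 8))            # SD, 8 bit: ~166 Mbit/s
    print(raw_rate_mbps(7680, 4320, 120, 10) / 1e3)  # 8K, 120 fps, 10 bit: ~80 Gbit/s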

Likewise, transmission channels have substantially increased their capacity: terrestrial and satellite broadcasting networks can reach bandwidths between 20 Mbps and 60 Mbps using modulation technologies such as DVB-T/T2 and DVB-S/S2. Mobile networks, in turn, have gone from a few hundred kbit/s to speeds above 10 Mbps, thanks to technologies such as 3G and 4G/LTE. But without a doubt, fixed broadband has experienced the greatest growth, with the evolution of technologies such as ADSL2+ and, above all, fibre optic networks that can exceed bandwidths of 100 Mbps.

These simultaneous advances in both video formats and network capacities have created a constant need to apply high video compression ratios, over 10:1 in most of the proposed scenarios, well above those obtained with statistical (lossless) compression techniques, which are typically close to 2:1. Therefore, most video compression standards are forced to apply lossy coding techniques, perceptually masking the losses so that they are not subjectively perceived by the user.

Although the first image compression techniques, such as differential coding or DPCM, have been known since the 1950s, it was in the nineties, with the emergence of the ITU H.26x family of standards and the ISO standards known as MPEG, that these advances materialised in consumer devices, thanks to the availability first of hardware, and later of software, with enough computational capacity.

Even though more than two decades have passed since the emergence of the first real-time video encoders, all the video compression standards used in professional and domestic environments, such as MPEG-2, H.264 (MPEG-4 AVC) and the emerging HEVC (H.265), have maintained the same hybrid coding scheme, which exploits both the high spatial redundancy between a pixel and its neighbours within an image and the high temporal redundancy between consecutive frames.

Spatial encoding is one of the first techniques applied to image compression, and has its origin in the 1980s with the demonstration of the high energy compaction provided by the discrete cosine transform (DCT) over regular blocks of 8×8 pixels. This transformation makes quantisation more efficient in the frequency domain than in the spatial domain, and favours the subsequent entropy encoding of the transform coefficients. The scheme that chains a transform coding stage with an entropy coding stage is known as intra-frame coding, and its images are named “I” frames. Although it does not exploit the strong temporal correlation present in a video sequence, it offers certain advantages, such as low delay and low complexity, and has been used in different standards for the provision of personal video communication services.
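As a minimal sketch of this idea (in Python with NumPy/SciPy; the gradient block and the quantisation step are arbitrary choices for illustration, not values from any standard), an 8×8 block can be transformed, quantised and reconstructed:

    import numpy as np
    from scipy.fftpack import dct, idct

    def dct2(block):
        # Separable 2-D DCT-II with orthonormal scaling
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    def idct2(coeffs):
        return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

    # A smooth 8x8 block (a gentle gradient): its energy compacts into very few coefficients
    block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 4 + 100
    coeffs = dct2(block)

    q = 16.0                                    # illustrative quantisation step
    quantised = np.round(coeffs / q)
    print(np.count_nonzero(quantised))          # only a handful of non-zero coefficients
    reconstructed = idct2(quantised * q)
    print(np.abs(block - reconstructed).max())  # small reconstruction error

Entropy coding of the mostly-zero quantised coefficients is what finally yields the bit savings.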

But without a doubt, it was the introduction of temporal prediction techniques between the images of a video sequence, based on motion estimation and compensation or ME-MC (Motion Estimation – Motion Compensation), that made high compression ratios attainable. One of the first standards to adopt this technique was MPEG-1, with an exclusively unidirectional ME-MC scheme based on prediction between two consecutive images, whose predicted images are named “P” frames; it was used mostly in the optical media video market, with bit rates around 1.5 Mbps.
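A minimal sketch of block-matching motion estimation (exhaustive search over a small window using the sum of absolute differences as matching cost; the block size, the search range and the function name are illustrative assumptions):

    import numpy as np

    def motion_estimate(ref, cur, bx, by, block=16, search=8):
        # Exhaustive block matching: find the displacement in the reference frame
        # that minimises the SAD against the current block (unoptimised sketch).
        target = cur[by:by + block, bx:bx + block].astype(np.int32)
        best_sad, best_mv = None, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = by + dy, bx + dx
                if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                    continue
                sad = np.abs(ref[y:y + block, x:x + block].astype(np.int32) - target).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_mv = sad, (dx, dy)
        return best_mv  # motion vector; the matched reference block is the predictor

Only the motion vector and the (much smaller) prediction residual need to be encoded, which is where the large gain over intra-frame coding comes from.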

In 1994, the approval of the standard known as MPEG-2 brought a major leap in encoder efficiency, thanks to the incorporation of bidirectional temporal prediction into the same ME-MC scheme, allowing prediction from frames both before and after the current one; these images are named “B” frames. The use of a coding structure combining the three types of images, “I”, “P” and “B”, made high compression gains possible. MPEG-2 has been used to encode standard definition (SD) formats at bit rates under 5 Mbps, and even high definition (HD) formats under 20 Mbps, and has seen massive use in TV broadcasting services.
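The principle behind “B” frames can be sketched as follows (a simplified illustration: real encoders weight and interpolate the references, whereas here a plain average of float-valued blocks is used and the names are assumptions):

    import numpy as np

    def bidirectional_predict(past_block, future_block, current_block):
        # Three candidate predictors for a "B"-frame block (float arrays):
        # forward, backward and their average; keep the smallest-SAD residual.
        candidates = {
            'forward': past_block,
            'backward': future_block,
            'bidirectional': (past_block + future_block) / 2.0,
        }
        return min(candidates.items(),
                   key=lambda kv: np.abs(current_block - kv[1]).sum())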

A decade after the emergence of MPEG-2, the optimisation of the existing coding tools, together with the introduction of new ones such as intra-frame prediction and a deblocking filter inside the encoding loop, converged on a new compression standard named H.264 or MPEG-4 AVC, as successful as its predecessor MPEG-2, since it halves the bit rate at the same perceptual quality. MPEG-4 AVC has been the driving force behind streaming services and OTT models on fixed and mobile networks, by facilitating the distribution of HD content with bandwidth requirements under 6 Mbps and of SD content at around 1 Mbps.
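A sketch of the intra-frame prediction idea (only simplified vertical, horizontal and DC modes are shown; the mode names and the function signature are illustrative, not taken from the standard):

    import numpy as np

    def intra_predict(top_row, left_col, size, mode):
        # Predict a size x size block from its already-decoded neighbours.
        if mode == 'vertical':      # copy the row above into every row
            return np.tile(top_row[:size], (size, 1))
        if mode == 'horizontal':    # copy the left column into every column
            return np.tile(left_col[:size].reshape(-1, 1), (1, size))
        # DC mode: a flat block at the mean of the neighbouring pixels
        return np.full((size, size), (top_row[:size].mean() + left_col[:size].mean()) / 2.0)

The encoder signals only the chosen mode and the residual, which is then transform-coded as before.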

More recently, in 2013, a new evolution of MPEG-4 AVC became a reality with the approval of the coding standard known as HEVC (H.265), which again improves the compression efficiency of its predecessor by more than 50%, although it puts more emphasis on improving the perceptual quality of the images than on reducing the objective coding losses. Once again, HEVC maintains the hybrid spatial-temporal coding scheme, with improvements in the precision of the ME-MC, the discrete cosine transform and the intra-frame prediction; its main novelties are a new, more flexible image partitioning, which allows block sizes to be adapted to the content of the image, and a second post-filter in the spatial domain named SAO (Sample Adaptive Offset), which attenuates the artefacts that can appear when applying high compression rates.
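The flexible partitioning can be pictured as a recursive quadtree split, in which a large block is divided into four quadrants only where its content is not homogeneous enough, so that block sizes adapt to the image. The variance criterion below is an arbitrary illustration, not the rate-distortion decision actually used by encoders:

    import numpy as np

    def quadtree_split(block, x=0, y=0, min_size=8, threshold=100.0):
        # Recursively split a square block (side a power of two) into quadrants
        # while its variance is high, yielding (x, y, size) leaves.
        size = block.shape[0]
        if size <= min_size or np.var(block) < threshold:
            return [(x, y, size)]
        h = size // 2
        return (quadtree_split(block[:h, :h], x, y, min_size, threshold) +
                quadtree_split(block[:h, h:], x + h, y, min_size, threshold) +
                quadtree_split(block[h:, :h], x, y + h, min_size, threshold) +
                quadtree_split(block[h:, h:], x + h, y + h, min_size, threshold))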

One of the main advantages of HEVC is its capacity to efficiently encode high resolution formats such as UHDTV, with bit depths of 10 bits per sample or more, thanks to its new block structure and its ability to use large transforms of up to 32×32 pixels. 4K services can be encoded at bit rates between 10 Mbps and 15 Mbps, which implies compression ratios close to 500:1.
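That order of magnitude can be checked with a quick calculation (assuming 3840×2160 pixels, 60 frames/s, 10-bit samples and 4:2:0 subsampling, i.e. 1.5 samples per pixel; these parameters are assumptions for illustration):

    raw = 3840 * 2160 * 60 * 10 * 1.5 / 1e6   # ~7,465 Mbit/s uncompressed
    print(raw / 15)                            # ~498:1 when encoded at 15 Mbps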

Nowadays, the industry is focused on the efficient coding of the new features introduced by these UHDTV formats, such as the extended dynamic range or HDR (High Dynamic Range) and the wider colorimetric representation of the image or WCG (Wide Color Gamut).