Color Motion Video Coded by Perceptual Components

Published in: SID Annual Meeting Proceedings, 1992


Abstract

We describe an implementation of an architecture for coding and compression of color motion video that is based upon the partition of the visual signal by the early human visual system. The architecture consists of a pyramid in space and time, with separate bands for static and moving picture elements. Experiments on a highly dynamic 256x256 color image sequence suggest acceptable quality at 1 bit/pixel.


Introduction

A wide range of emerging applications, including digital movies, HDTV, and transmission and visualization of scientific imagery (Jaworski, 1990), will require efficient methods of coding color motion video. For much of this imagery, the ultimate consumer is the human eye, and image codes should be designed to match the visual capacities of the human observer. Elsewhere we have proposed a general Perceptual Components Architecture (PCA) for digital video based upon these ideas (Watson, 1990a,b, 1991). In this report we briefly describe an implementation of PCA coding of color motion video, and provide some preliminary results on the effectiveness of the scheme.


Perceptual Components Architecture

The Perceptual Components Architecture is based on our current understanding of how imagery is decomposed in early human vision, and consists of transformations and partitions of the color, spatial, and temporal dimensions. Beginning with a digital image sequence whose pixels are indexed by row, column, frame, and color (R, G, B), the color dimension is first transformed from RGB into an opponent color space, nominally white/black (WB), red/green (RG), and blue/yellow (BY). The spatial dimension is partitioned into a number of bands of spatial frequency and orientation. Finally, the temporal dimension is partitioned into four bands: low, left, right, and high. The spatiotemporal bands are grouped in such a way that the signal is partitioned into components moving in particular directions. In the three-dimensional spatiotemporal frequency domain, these moving components correspond to paired regions on either side of the origin. In general form, the PCA is a type of analysis/synthesis filter bank (Woods, 1991). The signal is first decomposed by a bank of analysis filters, downsampled appropriately, quantized, upsampled, and reconstructed by means of a synthesis filter. In the present case, analysis and synthesis filters are equivalent.
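The analysis/synthesis pipeline can be sketched in miniature with the simplest two-band bank (a Haar pair, purely an illustration; the actual implementation uses the cortex and temporal filters described below): analyze into two downsampled bands, then resynthesize exactly.

```python
import math

def haar_analysis(x):
    """Split a signal of even length into low and high bands,
    each downsampled by 2 (Haar analysis filters)."""
    r = math.sqrt(0.5)
    low  = [r * (x[2*n] + x[2*n + 1]) for n in range(len(x) // 2)]
    high = [r * (x[2*n] - x[2*n + 1]) for n in range(len(x) // 2)]
    return low, high

def haar_synthesis(low, high):
    """Upsample and recombine the two bands; with Haar filters
    the reconstruction is exact."""
    r = math.sqrt(0.5)
    x = []
    for l, h in zip(low, high):
        x.append(r * (l + h))
        x.append(r * (l - h))
    return x

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
low, high = haar_analysis(x)
y = haar_synthesis(low, high)
assert all(abs(a - b) < 1e-12 for a, b in zip(x, y))
```

In the PCA, quantization is inserted between analysis and synthesis; the reconstruction error is then governed by the bin widths discussed under Quantization below.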

Color Transform
We transform from RGB monitor primaries to a perceptually based opponent color space using the following matrix:

This matrix is based on the so-called "cardinal directions" of color space (Derrington, Krauskopf, & Lennie, 1984; Krauskopf, Williams, & Heeley, 1982; Mulligan & Ahumada, 1992). We have not at this time properly tuned this matrix to the chromaticity coordinates of our display monitor.
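The application of such a transform can be sketched as below. The 3x3 matrix used here is a generic illustrative opponent transform (luminance sum, red-green difference, blue-yellow difference), not the paper's calibrated cardinal-directions matrix.

```python
# Illustrative opponent-color transform. The matrix M is a generic
# placeholder (WB = luminance sum, RG and BY = chromatic differences),
# NOT the calibrated matrix used in the paper.
M = [
    [ 1/3,  1/3,  1/3],   # WB: white/black (achromatic)
    [ 1.0, -1.0,  0.0],   # RG: red/green opponent
    [-0.5, -0.5,  1.0],   # BY: blue/yellow opponent
]

def rgb_to_opponent(r, g, b):
    """Apply the 3x3 opponent transform to one RGB pixel."""
    return tuple(row[0]*r + row[1]*g + row[2]*b for row in M)

# A gray pixel produces no chromatic (RG, BY) signal.
wb, rg, by = rgb_to_opponent(1.0, 1.0, 1.0)
```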

Spatial Partition
The spatial partition is implemented by means of the Cortex Transform (Watson, 1987a,b). This is an invertible pyramid transform that divides the frequency space using concentric rings and radial wedges (Fig. 1). The rings are a constant logarithmic distance apart. In the present work, we used radial bandwidths of one octave and orientation bandwidths of 45 degrees.



Fig. 1. Partition of the spatial frequency domain. Spatial frequency and orientation are indicated by (u,v) and θ.

The borders of each region are "softened" by convolution with a Gaussian whose width is proportional to the frequency of the filter. Each sector of the resulting partition is equivalent to all others after scaling and rotation, so that this is an example of a "wavelet" transform (Grossman & Morlet, 1984). These filters are self-inverting, in the sense that sending an image through the set of filters twice reproduces the image. This means that the sum of the squares of the filters equals 1. Because the input sequence is real, its Fourier transform has conjugate symmetry, and we consequently only need to encode half the frequency plane. We do this by using only four of the possible eight sectors within each ring, covering the upper half of the spatial frequency plane. This spatial partition is applied separately to each of the three color channels (WB, RG, and BY).
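The conjugate-symmetry property that allows half the frequency plane to be discarded can be verified directly; the check below uses a naive 1D DFT in plain Python (the 2D case is analogous).

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
X = dft(x)

# For real input, X[N-k] is the complex conjugate of X[k], so the
# samples k = 0..N/2 determine the whole spectrum; this is why only
# four of the eight orientation sectors need to be coded.
N = len(x)
for k in range(1, N):
    assert abs(X[N - k] - X[k].conjugate()) < 1e-9
```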

Temporal Partition
The temporal partition is accomplished by means of four filters. Each filter consists of two opposed cumulative Gaussians, as defined by the following Mathematica (Wolfram, 1991) function:

filter[length_, scale_, corner_] :=
  Module[{tmp, x},
    tmp = Table[Sqrt[0.5 * Erfc[(x - corner*length/2 - 1)/scale]],
      {x, length/2 + 1}];
    Join[tmp, Reverse[Take[tmp, {2, length/2}]]]]

where Erfc is the complementary error function (cumulative Gaussian), length is the number of frames in a coded segment, scale is a scale factor, and corner defines the 50% cutoff of each flank, expressed in terms of the Nyquist frequency. We used length = 8, scale = 0.5, corner = 0.25. The four filters are copies of this prototype, shifted in frequency by increments of length/4, as pictured in Fig. 2. As with the spatial filters, the temporal filters are self-inverting.



Fig. 2. Temporal filters.
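A direct Python rendering of the Mathematica function above (an illustration; math.erfc plays the role of Erfc), together with a numerical check of the self-inverting property of the shifted filter bank:

```python
import math

def temporal_filter(length, scale, corner):
    """One flank built from a cumulative Gaussian (erfc), mirrored to
    give an even, real frequency response of `length` samples;
    a port of the Mathematica `filter` function in the text."""
    tmp = [math.sqrt(0.5 * math.erfc((x - corner * length / 2 - 1) / scale))
           for x in range(1, length // 2 + 2)]
    return tmp + tmp[1:length // 2][::-1]

length, scale, corner = 8, 0.5, 0.25
proto = temporal_filter(length, scale, corner)

# The four temporal filters are circular shifts of the prototype
# by increments of length/4.
bank = [[proto[(n - k * length // 4) % length] for n in range(length)]
        for k in range(4)]

# Self-inverting: the squared responses sum to ~1 at each frequency
# (exact up to the small overlap of the Gaussian tails).
for n in range(length):
    total = sum(f[n] ** 2 for f in bank)
    assert abs(total - 1.0) < 0.005
```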

Motion Components
As noted above, we exploit the conjugate symmetry of the real input sequence by using only four of the eight possible orientation filters at each spatial frequency. When the four temporal filters are applied separably to one of these four spatial filters, four spatiotemporal filters result, two of which are selective for motion components. For example, considering the spatial filter encompassing 0-45 degrees of orientation, the four resulting spatiotemporal filters are: stationary low temporal frequency, rightward motion, stationary high temporal frequency, and leftward motion. The result of applying one of these spatiotemporal filters to the image sequence twice, once in the analysis stage and once in the synthesis stage, is a complex image sequence. Retaining the real part yields the appropriate motion component. Although there have been a few recent experiments with three-dimensional subband coding of video (Karlsson & Vetterli, 1988; Kovacevic, 1991), this is to our knowledge the first use of explicit motion components.
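That a moving component occupies paired regions on either side of the spatiotemporal frequency origin can be seen in a toy example: the spectrum of a drifting sinusoid (computed below with a naive 2D DFT in plain Python) has all its energy in two opposite quadrants.

```python
import cmath, math

N = 8
# x[t][n]: a sinusoid drifting rightward by one pixel per frame.
x = [[math.cos(2 * math.pi * (n - t) / N) for n in range(N)]
     for t in range(N)]

def dft2(x):
    """Naive 2D DFT of an N x N real array, scaled by 1/N."""
    N = len(x)
    return [[sum(x[t][n] * cmath.exp(-2j * cmath.pi * (kt * t + kn * n) / N)
                 for t in range(N) for n in range(N)) / N
             for kn in range(N)]
            for kt in range(N)]

X = dft2(x)

# Energy sits only at (kt, kn) = (N-1, 1) and (1, N-1): a conjugate
# pair on opposite sides of the origin, the signature of motion.
peaks = {(kt, kn) for kt in range(N) for kn in range(N)
         if abs(X[kt][kn]) > 1.0}
assert peaks == {(N - 1, 1), (1, N - 1)}
```

A filter covering only one member of such a pair passes a complex sequence whose real part is the corresponding moving component, as described above.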

Sampling
After spatial and temporal filtering, each band is subsampled in space and time. The spatial subsampling is via a sampling matrix kS, where k = image width / filter width (Watson, 1987b). For an image of width 256, the highest spatial frequency filter has a width of 256 (in the frequency domain). Temporal subsampling was by a factor of 2. The resulting collection of complex samples is overcomplete by a factor of 8/3.
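One way to account for the 8/3 figure, consistent with the Conclusions (which attribute the excess to temporal downsampling by only 2, and anticipate 4/3 with downsampling by 4), is as the product of a spatial and a temporal factor:

```python
from fractions import Fraction

# Spatial pyramid (our reading): successive octave levels contribute
# the geometric series 1 + 1/4 + 1/16 + ... = 4/3.
spatial = 1 / (1 - Fraction(1, 4))
assert spatial == Fraction(4, 3)

# Temporal: four filters, each downsampled in time by only 2.
temporal = Fraction(4, 2)

# Total overcompleteness, as reported in the text.
assert spatial * temporal == Fraction(8, 3)

# Downsampling by 4 in time would make the temporal factor 1,
# leaving the 4/3 anticipated in the Conclusions.
assert spatial * Fraction(4, 4) == Fraction(4, 3)
```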

Quantization
Each band was quantized uniformly with a particular divisor (bin width). Various schemes were tried, but in general divisors increased with spatial frequency, with temporal frequency, and with color (BY > RG > WB). First-order entropy was computed for each band and accumulated.
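The two operations above, uniform quantization and first-order entropy, can be sketched as follows (an illustration on arbitrary sample values, not the paper's data):

```python
import math
from collections import Counter

def quantize(samples, width):
    """Uniform quantization: map each sample to a bin index."""
    return [round(s / width) for s in samples]

def first_order_entropy(indices):
    """First-order entropy of the bin indices, in bits/sample."""
    counts = Counter(indices)
    n = len(indices)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

samples = [0.1, 0.5, -0.3, 2.2, 0.05, -0.02, 1.9, 0.4]
coarse = first_order_entropy(quantize(samples, 1.0))
fine = first_order_entropy(quantize(samples, 0.1))

# Wider bins collapse more samples together, lowering the entropy
# (and hence the bit rate) at the cost of reconstruction error.
assert coarse <= fine
```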


Test Material

To test the implementation, we have worked with a short (8 frame) segment from an MPEG test sequence (football). The original material was cropped from 256 by 192 to 242 by 192 to remove black borders, and expanded to 256 by 256 by bicubic interpolation. The sequence contains saturated colors, high contrast luminance and color borders, and a very large amount of motion, including panning and object motion. A single frame is shown in Fig. 3.


Fig. 3. A single frame from the test sequence. The actual test sequence was in color.

For subjective evaluation, reconstructed sequences were compared to the original at a viewing distance of approximately three picture heights.


Results

Application of the method to the test sequence suggests acceptable quality at rates around one bit/pixel. Optimal compression requires appropriate setting of the quantization factor for each band, which is a difficult multidimensional problem. Our search strategy was first to establish the range of sample values for each component (combination of spatial frequency, orientation, temporal filter, and color channel). The initial quantization bin width (w0) for each component was set to the range for that component divided by 256, to produce 256 possible bins. We then compressed all four orientations of one component, using bin widths of 2^k w0, k = 0, ..., 8. We then examined each reconstructed sequence to determine the bin width yielding a perceptually lossless result. This was repeated for all spatial frequencies, temporal frequencies, and colors. For the WB channel, k was typically around 4 or 5, and nearly independent of spatial frequency. For the color channels, k = 8 for the two highest spatial frequencies, and around 6 for the lower resolutions.
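The grid of candidate bin widths described above can be generated as follows (the perceptual-losslessness judgment itself was made by visual inspection and is not captured in code):

```python
def candidate_bin_widths(sample_range, n_bins=256, k_max=8):
    """Initial width w0 = range / n_bins (256 possible bins), then
    the ladder 2^k * w0 for k = 0..k_max, as in the search above."""
    w0 = sample_range / n_bins
    return [2**k * w0 for k in range(k_max + 1)]

# Hypothetical component whose samples span a range of 51.2.
widths = candidate_bin_widths(51.2)
assert len(widths) == 9
assert abs(widths[0] - 0.2) < 1e-12   # w0 = 51.2 / 256
assert abs(widths[8] - 51.2) < 1e-9   # 2^8 * w0 spans the full range
```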


Conclusions and Discussion

We have implemented a prototype Perceptual Components Architecture for digital color image sequence coding. Preliminary results on a brightly colored, rapidly moving, 256 by 256 test sequence suggest acceptable quality at around 1 bit/pixel. Higher resolution sequences will generally require lower bit rates (in bits/pixel) for an equivalent viewing distance (in picture heights), since the added resolution will be at relatively less visible high spatial frequencies. The current scheme is overcomplete by a factor of 8/3, largely because it downsamples in time by only a factor of two, in spite of the use of four temporal filters. We are currently examining the use of four-channel perfect reconstruction filter banks, with downsampling by 4 in time (Vaidyanathan, 1990), to reduce the redundancy to a factor of 4/3.


Acknowledgments

We thank Eero Simoncelli for useful discussions. This work was supported by NASA RTOP 506-71-51.


References

Derrington, A. M., Krauskopf, J., & Lennie, P. (1984). Chromatic mechanisms in lateral geniculate nucleus of macaque. Journal of Physiology (London) 357, 241-265.

Grossman, A., & Morlet, J. (1984). Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM Journal on Mathematical Analysis 15, 723-736.

Jaworski, A. (1990). Earth Observing System (EOS) Data and Information System (DIS) software interface standards. Pasadena, CA: American Institute of Aeronautics and Astronautics.

Karlsson, G., & Vetterli, M. (1988). Three dimensional subband coding of video. In Proceedings of IEEE ICASSP (pp. 1100-1103). New York.

Kovacevic, J. (1991). Filter banks and wavelets: Extensions and applications. Columbia University, Center for Telecommunications Research, Technical Report CU/CTR/TR 257-91-38.

Krauskopf, J., Williams, D. R., & Heeley, D. W. (1982). Cardinal directions of color space. Vision Research 22, 1123-1131.

Mulligan, J. B., & Ahumada, A. J., Jr. (1992). Principled methods for color dithering based on models of the human visual system. Society for Information Display Digest of Technical Papers 23.

Vaidyanathan, P. P. (1990). Multirate digital filters, filter banks, polyphase networks, and applications: a tutorial. Proceedings of the IEEE 78(1), 56-93.

Watson, A. B. (1987a). The cortex transform: Rapid computation of simulated neural images. Computer Vision, Graphics, and Image Processing 39(3), 311- 327.

Watson, A. B. (1987b). Efficiency of an image code based on human vision. Journal of the Optical Society of America A 4(12), 2401-2417.

Watson, A. B. (1990a). Digital visual communications using a perceptual components architecture. Pasadena, CA: American Institute of Aeronautics and Astronautics.

Watson, A. B. (1990b). Perceptual-components architecture for digital video. Journal of the Optical Society of America A 7(10), 1943-1954.

Watson, A. B. (1991). Multidimensional pyramids in vision and video. In A. Gorea (Ed.), Representations of vision: trends and tacit assumptions in vision research (pp. 17-26). Cambridge: Cambridge University Press.

Wolfram, S. (1991). Mathematica: A system for doing mathematics by computer (2nd ed.). New York: Addison-Wesley.

Woods, J. W. (1991). Subband image coding. Norwell, MA: Kluwer Academic Publishers.