<chapter>
<title>
Reduction of Dataset Size
</title>

<para>
There are two methods for reducing dataset size: packing
and compression. By packing we mean altering the data
in a way that reduces its precision. By compression we
mean techniques that store the data more efficiently
and result in no precision loss. Compression only
works in certain circumstances, e.g., when a variable
contains a significant amount of missing or repeated
data values. In this case it is possible to make use of
standard utilities, e.g., UNIX <computeroutput>compress</computeroutput>
or GNU <computeroutput>gzip</computeroutput>, to
compress the entire file after it has been written. In
this section we offer an alternative compression method
that is applied on a variable-by-variable basis. This
has the advantage that only one variable need be
uncompressed at a given time. The disadvantage is that
generic utilities that don't recognize the CF conventions
will not be able to operate on compressed variables.
</para>

<section id="packed-data">
<title>Packed Data</title>
<para>
At the current time the netCDF interface does
not provide for packing data. However a simple
packing may be achieved through the use of the
optional NUG-defined attributes
<varname>scale_factor</varname>
and
<varname>add_offset</varname>.
After the data values of a variable
have been read, they are to be multiplied by the
<varname>scale_factor</varname>, and have
<varname>add_offset</varname>
added to
them. If both attributes are present, the data
are scaled before the offset is added. When
scaled data are written, the application should
first subtract the offset and then divide by the
scale factor. The units of a variable should be
representative of the unpacked data.
</para>
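<para>
The unpack and pack arithmetic described above can be sketched in Python
with numpy. This is a minimal sketch under illustrative values: the packed
array and the attribute values here are invented for the example, not taken
from any particular file.
</para>

```python
import numpy as np

# Hypothetical packed values, stored as short integers in the file.
packed = np.array([-1200, 0, 850, 3000], dtype=np.int16)
scale_factor = 0.01   # illustrative value of the scale_factor attribute
add_offset = 25.0     # illustrative value of the add_offset attribute

# Reading: multiply by scale_factor, then add add_offset.
unpacked = packed * scale_factor + add_offset

# Writing: subtract the offset first, then divide by the scale factor.
repacked = np.round((unpacked - add_offset) / scale_factor).astype(np.int16)
assert np.array_equal(repacked, packed)
```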
<para>
This standard is more restrictive than the NUG
with respect to the use of the
<varname>scale_factor</varname> and
<varname>add_offset</varname>
attributes; ambiguities and precision
problems related to data type conversions
are resolved by these restrictions. If the
<varname>scale_factor</varname>
and
<varname>add_offset</varname>
attributes are of
the same data type as the associated variable,
the unpacked data is assumed to be of the
same data type as the packed data. However,
if the
<varname>scale_factor</varname>
and
<varname>add_offset</varname>
attributes
are of a different data type from the variable
(containing the packed data) then the unpacked
data should match the type of these attributes,
which must both be of type <varname>float</varname> or both be of
type <varname>double</varname>. An additional restriction in this
case is that the variable containing the packed
data must be of type <varname>byte</varname>, <varname>short</varname> or <varname>int</varname>. It is
not advised to unpack an <varname>int</varname> into a <varname>float</varname> as
there is a potential precision loss.
</para>
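<para>
The type rules above can be expressed as a small helper function. This is a
sketch only: the function name and its numpy-dtype interface are invented
for illustration, and real code would obtain the types through the netCDF
API rather than pass them in directly.
</para>

```python
import numpy as np

def unpacked_dtype(var_dtype, scale_dtype, offset_dtype):
    """Determine the unpacked data type under the restrictions sketched above."""
    if scale_dtype != offset_dtype:
        raise ValueError("scale_factor and add_offset must have the same type")
    if scale_dtype == var_dtype:
        # Attributes match the variable: unpacked data keeps the packed type.
        return var_dtype
    if scale_dtype not in (np.dtype('float32'), np.dtype('float64')):
        raise ValueError("attributes must both be float or both be double")
    if var_dtype not in (np.dtype('int8'), np.dtype('int16'), np.dtype('int32')):
        raise ValueError("packed variable must be byte, short or int")
    return scale_dtype

# Short packed data with float attributes unpacks to float.
print(unpacked_dtype(np.dtype('int16'), np.dtype('float32'), np.dtype('float32')))
```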
<para>
When data to be packed contains missing values
the attributes that indicate missing values
(<varname>_FillValue</varname>, <varname>valid_min</varname>, <varname>valid_max</varname>, <varname>valid_range</varname>)
must be of the same data type as the packed
data. See <xref linkend="missing-data"/> for a discussion of how
applications should treat variables that have
attributes indicating both missing values and
transformations defined by a scale and/or offset.
</para>
</section>


<section id="compression-by-gathering">
<title>Compression by Gathering</title>
<para>
To save space in the netCDF file, it may be
desirable to eliminate points from data arrays
that are invariably missing. Such a compression
can operate over one or more adjacent axes, and
is accomplished with reference to a list of the
points to be stored. The list is constructed by
considering a mask array that only includes the
axes to be compressed, and then mapping this array
onto one dimension without reordering. The list is
the set of indices in this one-dimensional mask
of the required points. In the compressed array,
the axes to be compressed are all replaced by a
single axis, whose dimension is the number of
wanted points. The wanted points appear along
this dimension in the same order they appear in
the uncompressed array, with the unwanted points
skipped over. Compression and uncompression are
executed by looping over the list.
</para>
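<para>
The gather operation described above can be sketched with numpy. This is a
minimal sketch under an invented 2x3 mask and field; a real application would
read the mask and data from the file, and would typically loop over the list
as described rather than rely on numpy fancy indexing.
</para>

```python
import numpy as np

# A small land/sea-style mask over the axes to be compressed (True = wanted).
mask = np.array([[True, False, True],
                 [False, True, True]])

# The list: indices of wanted points in the mask mapped onto one
# dimension without reordering.
point_list = np.flatnonzero(mask)

field = np.arange(6.0).reshape(2, 3)   # uncompressed data on the same grid

# Compression: gather the wanted points along a single new axis,
# preserving their order in the uncompressed array.
compressed = field.reshape(-1)[point_list]

# Uncompression: scatter the stored points back, leaving the
# unwanted points as fill values.
restored = np.full(field.size, np.nan)
restored[point_list] = compressed
restored = restored.reshape(field.shape)
```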
<para>
The list is stored as the coordinate variable
for the compressed axis of the data array. Thus,
the list variable and its dimension have the same
name. The list variable has a string attribute
<varname>compress</varname>, <emphasis>containing a blank-separated list
of the dimensions which were affected by the
compression in the order of the CDL declaration
of the uncompressed array</emphasis>. The presence of
this attribute identifies the list variable
as such. The list, the original dimensions
and coordinate variables (including boundary
variables), and the compressed variables with
all the attributes of the uncompressed variables
are written to the netCDF file. The uncompressed
variables can be reconstituted exactly as they
were using this information.
</para>
<para>
<example>
<title>Horizontal compression of a three-dimensional array</title>
<para>
We eliminate sea points at all depths in a
longitude-latitude-depth array of soil temperatures.
In this case, only the longitude and latitude
axes would be affected by the compression.
We construct a list
<varname>landpoint(landpoint)</varname> containing
the indices of land points.
</para>
<para>
<programlisting>
dimensions:
  lat=73;
  lon=96;
  landpoint=2381;
  depth=4;
variables:
  int landpoint(landpoint);
    landpoint:compress="lat lon";
  float landsoilt(depth,landpoint);
    landsoilt:long_name="soil temperature";
    landsoilt:units="K";
  float depth(depth);
  float lat(lat);
  float lon(lon);
data:
  landpoint=363, 364, 365, ...;
</programlisting>
</para>
<para>
Since
<computeroutput>landpoint(0)=363</computeroutput>,
for instance, we know that
<computeroutput>landsoilt(*,0)</computeroutput>
maps on to point 363 of the original data with dimensions
<computeroutput>(lat,lon)</computeroutput>.
This corresponds to indices
<computeroutput>(3,75)</computeroutput>,
i.e.,
<computeroutput>363 = 3*96 + 75</computeroutput>.
</para>
</example>
</para>
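<para>
The index arithmetic in this example can be checked directly with Python's
built-in <computeroutput>divmod</computeroutput>; this sketch just restates
the mapping for the dimensions declared above.
</para>

```python
# With dimensions (lat, lon) = (73, 96), flattening without reordering
# makes index = lat_index * 96 + lon_index.
lat_index, lon_index = divmod(363, 96)   # (3, 75)

# The inverse mapping recovers the one-dimensional index.
assert lat_index * 96 + lon_index == 363
```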
<para>
<example>
<title>Compression of a three-dimensional field</title>
<para>
We compress a longitude-latitude-depth field of ocean salinity by eliminating points below the sea-floor. In this case, all three dimensions are affected by the compression, since there are successively fewer active ocean points at increasing depths.
</para>
<para>
<programlisting>
variables:
  float salinity(time,oceanpoint);
  int oceanpoint(oceanpoint);
    oceanpoint:compress="depth lat lon";
  float depth(depth);
  float lat(lat);
  float lon(lon);
  double time(time);
</programlisting>
</para>
<para>
This information implies that
the salinity field should be
uncompressed to an array with
dimensions
<computeroutput>(depth,lat,lon)</computeroutput>.
</para>
</example>
</para>
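<para>
Recovering three-dimensional indices from a stored
<varname>oceanpoint</varname> value can be sketched with numpy's
<computeroutput>unravel_index</computeroutput>. The dimension sizes and the
sample index below are illustrative, since the CDL fragment above does not
declare them; the order of the shape tuple follows the
<varname>compress</varname> attribute, i.e. the CDL declaration order of the
uncompressed array.
</para>

```python
import numpy as np

# Hypothetical sizes for the compressed dimensions, in "depth lat lon" order.
shape = (4, 73, 96)

# An oceanpoint value indexes the (depth, lat, lon) array flattened
# without reordering, so unravel_index recovers the three indices.
d, j, i = np.unravel_index(20000, shape)   # (2, 62, 32)

# The inverse: flatten the three indices back to the stored value.
flat = np.ravel_multi_index((d, j, i), shape)
assert flat == 20000
```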
</section>
</chapter>