root/cf-conventions/trunk/docbooksrc/reduction-of-dataset-size.xml

Revision 1, 7.5 kB (checked in by halliday1, 3 years ago)

Initial import of CF Conventions document.

<chapter>
    <title>
        Reduction of Dataset Size
    </title>

    <para>
        There are two methods for reducing dataset size: packing
        and compression. By packing we mean altering the data in
        a way that reduces its precision. By compression we mean
        techniques that store the data more efficiently and
        result in no loss of precision. Compression works only
        in certain circumstances, e.g., when a variable contains
        a significant number of missing or repeated data values.
        In such cases it is possible to use standard utilities,
        e.g., UNIX <computeroutput>compress</computeroutput> or
        GNU <computeroutput>gzip</computeroutput>, to compress
        the entire file after it has been written. In this
        section we offer an alternative compression method that
        is applied on a variable-by-variable basis. This has the
        advantage that only one variable need be uncompressed at
        a given time. The disadvantage is that generic utilities
        that do not recognize the CF conventions will not be able
        to operate on compressed variables.
    </para>

    <section id="packed-data">
        <title>Packed Data</title>
        <para>
            At the current time the netCDF interface does not
            provide for packing data. However, a simple packing
            may be achieved through the use of the optional
            NUG-defined attributes
            <varname>scale_factor</varname> and
            <varname>add_offset</varname>. After the data values
            of a variable have been read, they are to be
            multiplied by the <varname>scale_factor</varname>
            and have <varname>add_offset</varname> added to
            them. If both attributes are present, the data are
            scaled before the offset is added. When scaled data
            are written, the application should first subtract
            the offset and then divide by the scale factor. The
            units of a variable should be representative of the
            unpacked data.
        </para>
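        <para>
            As a minimal sketch (the attribute values below are
            hypothetical examples, not taken from this document),
            the unpack and pack arithmetic just described is:
        </para>

```python
# Unpack/pack arithmetic for scale_factor / add_offset packing.
# The attribute values here are hypothetical examples.
scale_factor = 0.01
add_offset = 273.15

def unpack(packed):
    # After reading: multiply by scale_factor, then add add_offset.
    return packed * scale_factor + add_offset

def pack(unpacked):
    # Before writing: subtract the offset, then divide by the scale factor.
    return round((unpacked - add_offset) / scale_factor)
```

        <para>
            Round-tripping a value through unpack and pack
            returns the original packed integer, up to the
            precision deliberately discarded by packing.
        </para>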
        <para>
            This standard is more restrictive than the NUG with
            respect to the use of the
            <varname>scale_factor</varname> and
            <varname>add_offset</varname> attributes; these
            restrictions resolve ambiguities and precision
            problems related to data type conversions. If the
            <varname>scale_factor</varname> and
            <varname>add_offset</varname> attributes are of the
            same data type as the associated variable, the
            unpacked data is assumed to be of the same data type
            as the packed data. However, if the
            <varname>scale_factor</varname> and
            <varname>add_offset</varname> attributes are of a
            different data type from the variable (containing
            the packed data), then the unpacked data should
            match the type of these attributes, which must both
            be of type <varname>float</varname> or both be of
            type <varname>double</varname>. An additional
            restriction in this case is that the variable
            containing the packed data must be of type
            <varname>byte</varname>, <varname>short</varname>,
            or <varname>int</varname>. Unpacking an
            <varname>int</varname> into a
            <varname>float</varname> is not advised, as there is
            a potential loss of precision.
        </para>
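        <para>
            One common recipe for choosing these attributes,
            given here as an assumption rather than anything
            prescribed by this document, spreads the range of
            the unpacked data over a signed n-bit integer type:
        </para>

```python
# One common (assumed, not prescribed here) way to choose packing
# attributes so that a float range fills a signed n-bit integer type.
def packing_params(dmin, dmax, nbits=16):
    # 2**nbits - 1 distinct packed values are available.
    scale_factor = (dmax - dmin) / (2**nbits - 1)
    # Centre the range so the packed values span the signed type exactly.
    add_offset = dmin + 2**(nbits - 1) * scale_factor
    return scale_factor, add_offset
```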
        <para>
            When data to be packed contain missing values, the
            attributes that indicate missing values
            (<varname>_FillValue</varname>,
            <varname>valid_min</varname>,
            <varname>valid_max</varname>,
            <varname>valid_range</varname>) must be of the same
            data type as the packed data. See
            <xref linkend="missing-data"/> for a discussion of
            how applications should treat variables that have
            attributes indicating both missing values and
            transformations defined by a scale and/or offset.
        </para>
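        <para>
            Because the missing-value attributes have the type
            of the packed data, fill values are detected before
            the scale and offset are applied. A sketch, with
            hypothetical values:
        </para>

```python
# _FillValue has the type of the packed data, so fill values are
# detected before the scale/offset transformation is applied.
# All values here are hypothetical examples.
FILL = -32768                        # _FillValue of the packed variable
scale_factor, add_offset = 0.01, 273.15

def unpack_with_fill(values):
    # Fill values pass through untransformed (represented as None here).
    return [None if v == FILL else v * scale_factor + add_offset
            for v in values]
```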
    </section>


    <section id="compression-by-gathering">
        <title>Compression by Gathering</title>
        <para>
            To save space in the netCDF file, it may be
            desirable to eliminate points from data arrays that
            are invariably missing. Such a compression can
            operate over one or more adjacent axes, and is
            accomplished with reference to a list of the points
            to be stored. The list is constructed by considering
            a mask array that includes only the axes to be
            compressed, and then mapping this array onto one
            dimension without reordering. The list is the set of
            indices in this one-dimensional mask of the required
            points. In the compressed array, the axes to be
            compressed are all replaced by a single axis, whose
            dimension is the number of wanted points. The wanted
            points appear along this dimension in the same order
            as they appear in the uncompressed array, with the
            unwanted points skipped over. Compression and
            uncompression are executed by looping over the list.
        </para>
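        <para>
            The gathering just described can be sketched in a
            few lines (the 2-D mask here is a hypothetical
            example; true marks the wanted points):
        </para>

```python
# Compression by gathering over a hypothetical 2-D mask.
mask = [[False, True, False],
        [True,  True, False]]

# Map the mask onto one dimension without reordering ...
flat_mask = [m for row in mask for m in row]
# ... and take the indices of the wanted points as the list.
point_list = [i for i, m in enumerate(flat_mask) if m]

def compress(flat_data):
    # Wanted points keep their original order; unwanted points are skipped.
    return [flat_data[i] for i in point_list]

def uncompress(gathered, fill=None):
    # Scatter the stored points back; unwanted points become fill.
    out = [fill] * len(flat_mask)
    for k, i in enumerate(point_list):
        out[i] = gathered[k]
    return out
```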
        <para>
            The list is stored as the coordinate variable for
            the compressed axis of the data array; thus, the
            list variable and its dimension have the same name.
            The list variable has a string attribute
            <varname>compress</varname>, <emphasis>containing a
            blank-separated list of the dimensions which were
            affected by the compression, in the order of the CDL
            declaration of the uncompressed
            array</emphasis>. The presence of this attribute
            identifies the list variable as such. The list, the
            original dimensions and coordinate variables
            (including boundary variables), and the compressed
            variables with all the attributes of the
            uncompressed variables are written to the netCDF
            file. With this information the uncompressed
            variables can be reconstituted exactly as they were.
        </para>
        <para>
            <example>
                <title>Horizontal compression of a three-dimensional array</title>
                <para>
                    We eliminate sea points at all depths in a
                    longitude-latitude-depth array of soil
                    temperatures. In this case, only the
                    longitude and latitude axes are affected by
                    the compression. We construct a list
                    <varname>landpoint(landpoint)</varname>
                    containing the indices of land points.
                </para>
                <para>
                    <programlisting>
dimensions:
  lat=73;
  lon=96;
  landpoint=2381;
  depth=4;
variables:
  int landpoint(landpoint);
    landpoint:compress="lat lon";
  float landsoilt(depth,landpoint);
    landsoilt:long_name="soil temperature";
    landsoilt:units="K";
  float depth(depth);
  float lat(lat);
  float lon(lon);
data:
  landpoint=363, 364, 365, ...;
                    </programlisting>
                </para>
                <para>
                    Since
                    <computeroutput>landpoint(0)=363</computeroutput>,
                    for instance, we know that
                    <computeroutput>landsoilt(*,0)</computeroutput>
                    maps onto point 363 of the original data
                    with dimensions
                    <computeroutput>(lat,lon)</computeroutput>.
                    This corresponds to indices
                    <computeroutput>(3,75)</computeroutput>,
                    i.e.,
                    <computeroutput>363 = 3*96 + 75</computeroutput>.
                </para>
            </example>
        </para>
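        <para>
            The index arithmetic at the end of this example can
            be checked directly; with
            <computeroutput>lon</computeroutput> of length 96,
            the one-dimensional index decomposes as follows:
        </para>

```python
# Recovering (lat, lon) indices from a 1-D landpoint index,
# with lon of length 96 as in the CDL above.
lon_len = 96

def to_lat_lon(idx):
    # idx = lat * lon_len + lon
    return divmod(idx, lon_len)
```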
        <para>
            <example>
                <title>Compression of a three-dimensional field</title>
                <para>
                    We compress a longitude-latitude-depth field
                    of ocean salinity by eliminating points
                    below the sea-floor. In this case, all three
                    dimensions are affected by the compression,
                    since there are successively fewer active
                    ocean points at increasing depths.
                </para>
                <para>
                    <programlisting>
variables:
  float salinity(time,oceanpoint);
  int oceanpoint(oceanpoint);
    oceanpoint:compress="depth lat lon";
  float depth(depth);
  float lat(lat);
  float lon(lon);
  double time(time);
                    </programlisting>
                </para>
                <para>
                    This information implies that the salinity
                    field should be uncompressed to an array
                    with dimensions
                    <computeroutput>(depth,lat,lon)</computeroutput>.
                </para>
            </example>
        </para>
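        <para>
            A sketch of that index mapping, with hypothetical
            dimension lengths and the dimension order taken from
            the <varname>compress</varname> attribute:
        </para>

```python
# Mapping a 1-D oceanpoint index back to (depth, lat, lon), in the
# order given by the compress attribute. The dimension lengths are
# hypothetical examples.
lat_len, lon_len = 73, 96

def to_depth_lat_lon(idx):
    rest, lon = divmod(idx, lon_len)
    depth, lat = divmod(rest, lat_len)
    return depth, lat, lon
```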
    </section>
</chapter>