<chapter>
<title>
Reduction of Dataset Size
</title>

<para>
There are two methods for reducing dataset size: packing
and compression. By packing we mean altering the data
in a way that reduces its precision. By compression we
mean techniques that store the data more efficiently
and result in no precision loss. Compression only
works in certain circumstances, e.g., when a variable
contains a significant amount of missing or repeated
data values. In this case it is possible to make use of
standard utilities, e.g., UNIX <computeroutput>compress</computeroutput>
or GNU <computeroutput>gzip</computeroutput>, to
compress the entire file after it has been written. In
this section we offer an alternative compression method
that is applied on a variable-by-variable basis. This
has the advantage that only one variable need be
uncompressed at a given time. The disadvantage is that
generic utilities that don't recognize the CF conventions
will not be able to operate on compressed variables.
</para>

<section id="packed-data">
<title>Packed Data</title>
<para>
At the current time the netCDF interface does
not provide for packing data. However a simple
packing may be achieved through the use of the
optional NUG-defined attributes
<varname>scale_factor</varname>
and
<varname>add_offset</varname>.
After the data values of a variable
have been read, they are to be multiplied by the
<varname>scale_factor</varname>, and have
<varname>add_offset</varname>
added to
them. If both attributes are present, the data
are scaled before the offset is added. When
scaled data are written, the application should
first subtract the offset and then divide by the
scale factor. The units of a variable should be
representative of the unpacked data.
</para>
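<para>
The unpack and pack arithmetic described above can be sketched in Python
with numpy. This is a minimal sketch under illustrative values: the packed
array and the attribute values here are invented for the example, not taken
from any particular file.
</para>

```python
import numpy as np

# Hypothetical packed values, stored as short integers in the file.
packed = np.array([-1200, 0, 850, 3000], dtype=np.int16)
scale_factor = 0.01   # illustrative value of the scale_factor attribute
add_offset = 25.0     # illustrative value of the add_offset attribute

# Reading: multiply by scale_factor, then add add_offset.
unpacked = packed * scale_factor + add_offset

# Writing: subtract the offset first, then divide by the scale factor.
repacked = np.round((unpacked - add_offset) / scale_factor).astype(np.int16)
assert np.array_equal(repacked, packed)
```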
<para>
This standard is more restrictive than the NUG
with respect to the use of the
<varname>scale_factor</varname> and
<varname>add_offset</varname>
attributes; ambiguities and precision
problems related to data type conversions
are resolved by these restrictions. If the
<varname>scale_factor</varname>
and
<varname>add_offset</varname>
attributes are of
the same data type as the associated variable,
the unpacked data is assumed to be of the
same data type as the packed data. However,
if the
<varname>scale_factor</varname>
and
<varname>add_offset</varname>
attributes
are of a different data type from the variable
(containing the packed data) then the unpacked
data should match the type of these attributes,
which must both be of type <varname>float</varname> or both be of
type <varname>double</varname>. An additional restriction in this
case is that the variable containing the packed
data must be of type <varname>byte</varname>, <varname>short</varname> or <varname>int</varname>. It is
not advised to unpack an <varname>int</varname> into a <varname>float</varname> as
there is a potential precision loss.
</para>
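<para>
The type rules above can be expressed as a small helper function. This is a
sketch only: the function name and its numpy-dtype interface are invented
for illustration, and real code would obtain the types through the netCDF
API rather than pass them in directly.
</para>

```python
import numpy as np

def unpacked_dtype(var_dtype, scale_dtype, offset_dtype):
    """Determine the unpacked data type under the restrictions sketched above."""
    if scale_dtype != offset_dtype:
        raise ValueError("scale_factor and add_offset must have the same type")
    if scale_dtype == var_dtype:
        # Attributes match the variable: unpacked data keeps the packed type.
        return var_dtype
    if scale_dtype not in (np.dtype('float32'), np.dtype('float64')):
        raise ValueError("attributes must both be float or both be double")
    if var_dtype not in (np.dtype('int8'), np.dtype('int16'), np.dtype('int32')):
        raise ValueError("packed variable must be byte, short or int")
    return scale_dtype

# Short packed data with float attributes unpacks to float.
print(unpacked_dtype(np.dtype('int16'), np.dtype('float32'), np.dtype('float32')))
```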
<para>
When data to be packed contains missing values
the attributes that indicate missing values
(<varname>_FillValue</varname>, <varname>valid_min</varname>, <varname>valid_max</varname>, <varname>valid_range</varname>)
must be of the same data type as the packed
data. See <xref linkend="missing-data"/> for a discussion of how
applications should treat variables that have
attributes indicating both missing values and
transformations defined by a scale and/or offset.
</para>
</section>


<section id="compression-by-gathering">
<title>Compression by Gathering</title>
<para>
To save space in the netCDF file, it may be
desirable to eliminate points from data arrays
that are invariably missing. Such a compression
can operate over one or more adjacent axes, and
is accomplished with reference to a list of the
points to be stored. The list is constructed by
considering a mask array that only includes the
axes to be compressed, and then mapping this array
onto one dimension without reordering. The list is
the set of indices in this one-dimensional mask
of the required points. In the compressed array,
the axes to be compressed are all replaced by a
single axis, whose dimension is the number of
wanted points. The wanted points appear along
this dimension in the same order they appear in
the uncompressed array, with the unwanted points
skipped over. Compression and uncompression are
executed by looping over the list.
</para>
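<para>
The gather operation described above can be sketched with numpy. This is a
minimal sketch under an invented 2x3 mask and field; a real application would
read the mask and data from the file, and would typically loop over the list
as described rather than rely on numpy fancy indexing.
</para>

```python
import numpy as np

# A small land/sea-style mask over the axes to be compressed (True = wanted).
mask = np.array([[True, False, True],
                 [False, True, True]])

# The list: indices of wanted points in the mask mapped onto one
# dimension without reordering.
point_list = np.flatnonzero(mask)

field = np.arange(6.0).reshape(2, 3)   # uncompressed data on the same grid

# Compression: gather the wanted points along a single new axis,
# preserving their order in the uncompressed array.
compressed = field.reshape(-1)[point_list]

# Uncompression: scatter the stored points back, leaving the
# unwanted points as fill values.
restored = np.full(field.size, np.nan)
restored[point_list] = compressed
restored = restored.reshape(field.shape)
```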
<para>
The list is stored as the coordinate variable
for the compressed axis of the data array. Thus,
the list variable and its dimension have the same
name. The list variable has a string attribute
<varname>compress</varname>, <emphasis>containing a blank-separated list
of the dimensions which were affected by the
compression in the order of the CDL declaration
of the uncompressed array</emphasis>. The presence of
this attribute identifies the list variable
as such. The list, the original dimensions
and coordinate variables (including boundary
variables), and the compressed variables with
all the attributes of the uncompressed variables
are written to the netCDF file. The uncompressed
variables can be reconstituted exactly as they
were using this information.
</para>
<para>
<example>
<title>Horizontal compression of a three-dimensional array</title>
<para>
We eliminate sea points at all depths in a
longitude-latitude-depth array of soil temperatures.
In this case, only the longitude and latitude
axes would be affected by the compression.
We construct a list
<varname>landpoint(landpoint)</varname> containing
the indices of land points.
</para>
<para>
<programlisting>
dimensions:
  lat=73;
  lon=96;
  landpoint=2381;
  depth=4;
variables:
  int landpoint(landpoint);
    landpoint:compress="lat lon";
  float landsoilt(depth,landpoint);
    landsoilt:long_name="soil temperature";
    landsoilt:units="K";
  float depth(depth);
  float lat(lat);
  float lon(lon);
data:
  landpoint=363, 364, 365, ...;
</programlisting>
</para>
<para>
Since
<computeroutput>landpoint(0)=363</computeroutput>,
for instance, we know that
<computeroutput>landsoilt(*,0)</computeroutput>
maps on to point 363 of the original data with dimensions
<computeroutput>(lat,lon)</computeroutput>.
This corresponds to indices
<computeroutput>(3,75)</computeroutput>,
i.e.,
<computeroutput>363 = 3*96 + 75</computeroutput>.
</para>
</example>
</para>
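<para>
The index arithmetic in this example can be checked directly with Python's
built-in <computeroutput>divmod</computeroutput>; this sketch just restates
the mapping for the dimensions declared above.
</para>

```python
# With dimensions (lat, lon) = (73, 96), flattening without reordering
# makes index = lat_index * 96 + lon_index.
lat_index, lon_index = divmod(363, 96)   # (3, 75)

# The inverse mapping recovers the one-dimensional index.
assert lat_index * 96 + lon_index == 363
```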
<para>
<example>
<title>Compression of a three-dimensional field</title>
<para>
We compress a longitude-latitude-depth field of ocean salinity by eliminating points below the sea-floor. In this case, all three dimensions are affected by the compression, since there are successively fewer active ocean points at increasing depths.
</para>
<para>
<programlisting>
variables:
  float salinity(time,oceanpoint);
  int oceanpoint(oceanpoint);
    oceanpoint:compress="depth lat lon";
  float depth(depth);
  float lat(lat);
  float lon(lon);
  double time(time);
</programlisting>
</para>
<para>
This information implies that
the salinity field should be
uncompressed to an array with
dimensions
<computeroutput>(depth,lat,lon)</computeroutput>.
</para>
</example>
</para>
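<para>
Recovering three-dimensional indices from a stored
<varname>oceanpoint</varname> value can be sketched with numpy's
<computeroutput>unravel_index</computeroutput>. The dimension sizes and the
sample index below are illustrative, since the CDL fragment above does not
declare them; the order of the shape tuple follows the
<varname>compress</varname> attribute, i.e. the CDL declaration order of the
uncompressed array.
</para>

```python
import numpy as np

# Hypothetical sizes for the compressed dimensions, in "depth lat lon" order.
shape = (4, 73, 96)

# An oceanpoint value indexes the (depth, lat, lon) array flattened
# without reordering, so unravel_index recovers the three indices.
d, j, i = np.unravel_index(20000, shape)   # (2, 62, 32)

# The inverse: flatten the three indices back to the stored value.
flat = np.ravel_multi_index((d, j, i), shape)
assert flat == 20000
```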
</section>
</chapter>