<chapter>
  <title>Installation</title>

  <para>
    This chapter describes how to install Cobalt. Once these steps are
    completed, Cobalt will be fully functional on the system.
  </para>

  <section>
    <title>Prerequisites</title>

    <para>
      Cobalt has three prerequisites. Each is described below, along
      with its function and, where applicable, a download location.
    </para>

    <variablelist>
      <varlistentry>
        <term>Python</term>
        <listitem>
          <para>
            Cobalt is written in Python and requires version 2.3 or
            later.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>DB2-python</term>
        <listitem>
          <para>
            This library connects to DB2 databases from Python. It is
            required only for Cobalt on BG/L systems. It is available
            at ftp://ftp.mcs.anl.gov/pub/cobalt.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>PyOpenSSL</term>
        <listitem>
          <para>
            PyOpenSSL provides Python bindings for OpenSSL. It is
            required to support HTTPS on the server side, so it is
            needed only on hosts where components execute.
          </para>
        </listitem>
      </varlistentry>
    </variablelist>
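
    <para>
      As a quick check, the commands below print the installed Python
      version and try to import the PyOpenSSL bindings. This is a
      sketch that assumes the interpreter is on the path as
      <filename>python</filename>. The first command should report
      version 2.3 or later, and the second prints nothing when the
      bindings are present.
    </para>

    <programlisting>
# python -V
# python -c "import OpenSSL"
    </programlisting>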
  </section>

  <section>
    <title>Software Installation</title>

    <para>
      Install Python, DB2-python, and PyOpenSSL on the server side,
      then install the Cobalt server package. On SLES9, the PyOpenSSL
      and Cobalt packages can be installed by running:
    </para>

    <programlisting>
# rpm -ihv \
    ftp://ftp.mcs.anl.gov/pub/cobalt/rpms/sles9-ppc64/PyOpenSSL-0.6-1.ppc64.rpm
# rpm -ihv \
    ftp://ftp.mcs.anl.gov/pub/cobalt/rpms/sles9-ppc64/cobalt-0.97-1.ppc64.rpm
    </programlisting>

    <para>
      On both the client and server sides, install the client package:
    </para>

    <programlisting>
# rpm -ihv \
    ftp://ftp.mcs.anl.gov/pub/cobalt/rpms/sles9-ppc64/cobalt-clients-0.97-1.ppc64.rpm
    </programlisting>
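
    <para>
      To confirm that the packages are installed, the RPM database can
      be queried. The package names below are inferred from the RPM
      file names above and are an assumption; adjust them if your
      packages are named differently.
    </para>

    <programlisting>
# rpm -q PyOpenSSL cobalt cobalt-clients
    </programlisting>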

  </section>

  <section>
    <title>Configuring the Cobalt Component Infrastructure</title>

    <para>
      Cobalt uses HTTPS to secure communication between components and
      their clients. Each machine where components run must have its
      own SSL key, which can be generated by running:
    </para>

    <programlisting>
# openssl req -x509 -nodes -days 1000 -newkey rsa:1024 \
    -out /etc/cobalt.key -keyout /etc/cobalt.key
    </programlisting>
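
    <para>
      To inspect the result, the generated certificate can be printed
      with the openssl x509 subcommand. This sketch assumes the
      certificate was written to
      <filename>/etc/cobalt.key</filename>, as in the command above; it
      shows the certificate subject and validity dates without
      printing the encoded certificate itself.
    </para>

    <programlisting>
# openssl x509 -in /etc/cobalt.key -noout -subject -dates
    </programlisting>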

    <para>
      Components can be located using static records in
      <filename>/etc/cobalt.conf</filename>, or dynamically through the
      service location component. The service location component
      (slp.py) is bootstrapped much like DNS: if a direct reference to
      a component is not included in
      <filename>/etc/cobalt.conf</filename>, it is looked up in the
      component listed as "service-location".
    </para>

    <para>
      Copy the sample cobalt.conf file to
      <filename>/etc/cobalt.conf</filename>, and change the hostname in
      the service location component line to the host where Cobalt
      components will run. Choose a secret password, and place it in
      the password field of the communication section. Once this is
      done, the Cobalt component infrastructure is fully configured.
    </para>
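
    <para>
      As an illustration only, the edited parts of the file might look
      like the fragment below. The section name [components], the URL
      form, the port number, the host name sn1.example.com, and the
      password value are all assumptions made for this sketch; the
      sample cobalt.conf shipped with Cobalt is authoritative for the
      actual layout.
    </para>

    <programlisting>
[components]
service-location=https://sn1.example.com:8256

[communication]
password=choose-a-secret
    </programlisting>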
  </section>

  <section>
    <title>Cobalt Component Startup</title>

    <para>
      Cobalt includes five components for resource management. Each
      provides a specific type of functionality.
    </para>

    <variablelist>
      <varlistentry>
        <term>Service Location Protocol</term>
        <listitem>
          <para>
            The service location component tracks the locations of
            active components in the system. It uses a heartbeat
            mechanism to detect component failure or exit. It can be
            queried with the slpstat command.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Process Manager</term>
        <listitem>
          <para>
            The process manager starts, manages, signals, and cleans
            up parallel processes. On BG/L, its functionality is
            implemented using the built-in process management system
            provided by IBM. The program is
            <filename>/usr/sbin/bgpm.py</filename>. Bgpm requires
            several configuration parameters to be set in
            <filename>/etc/cobalt.conf</filename>. These parameters
            control environment setup for the jobs it executes;
            incorrect values can cause process execution to fail on
            nodes. Bgpm is started by the cobalt init.d script.
          </para>
          <para>
            Configuration file options are documented in the bgpm(8)
            man page.
          </para>
          <para>
            On clusters, the process manager uses MPD to start
            processes. The component is
            <filename>/usr/sbin/mpdpm.py</filename> and is started by
            the sss-pm init script. Mpdpm does not currently take any
            configuration file parameters.
          </para>

          <para>
            On Blue Gene/L, the mpirun command must work for users on
            the host running bgpm.py, which is usually the service
            node. In most cases, rsh/ssh must be reconfigured to allow
            users to ssh from the service node to itself. (This allows
            the mpirun frontend to contact the mpirun backend.)
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Queue Manager</term>
        <listitem>
          <para>
            The queue manager aggregates the actions that make up a
            job. For example, it uses the process manager interfaces
            to run user jobs, as well as prologue and epilogue
            scripts. It also handles job stdio on systems without a
            global shared filesystem.
          </para>
          <para>
            Cqm is the Cobalt implementation of the queue manager. It
            is common to both BG/L and clusters, though it must be
            configured slightly differently for each. It reads a
            number of parameters from
            <filename>/etc/cobalt.conf</filename> that control the
            behavior of jobs and determine which external systems are
            used. The queue manager currently supports file staging
            (for machines without global shared filesystems) and has
            basic support for allocation management. This daemon is
            started by the cobalt init.d script.
          </para>
          <para>
            All configuration file options are documented in the
            cqm(8) man page.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Scheduler</term>
        <listitem>
          <para>
            The scheduler controls resource allocation for job
            execution. It tells the queue manager when and where to
            run jobs. Due to differences in scheduling requirements,
            Blue Gene/L systems and clusters require different
            schedulers.
          </para>
          <para>
            Bgsched is the scheduler for Blue Gene/L systems. It
            internally tracks partition state and performs DB2
            queries to ensure coherent partition usage in case of
            problems. Bgsched currently accepts only configuration
            options that control database connection parameters.
            These options are documented in the bgsched(8) man page.
            Bgsched is started by the cobalt init.d script.
          </para>
          <para>
            Describe the cluster scheduler here.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Allocation Manager</term>
        <listitem>
          <para>
            The allocation manager tracks users, their project
            memberships, and time allocations. It is used by the
            scheduler to control resource allocation. The same
            allocation manager is used on cluster systems and Blue
            Gene/L systems. It currently has no configuration file
            options and is not yet started by the cobalt init.d
            script.
          </para>
        </listitem>
      </varlistentry>
    </variablelist>
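
    <para>
      The components described above as started by the cobalt init.d
      script could be brought up as sketched below. The path
      <filename>/etc/init.d/cobalt</filename> and the start argument
      follow the usual init script convention and are assumptions
      here; check the installed script for the options it accepts.
    </para>

    <programlisting>
# /etc/init.d/cobalt start
    </programlisting>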

    <para>
      Once each of these components is started, an entry for it will
      appear in the service location component. The entries can be
      displayed by running <filename>/usr/sbin/slpstat.py</filename>.
    </para>

    <para>
      Each component can also be queried with a component-specific
      tool. For example, the queue manager can be queried with the
      <filename>cqstat</filename> command. See the clients directory
      for other commands that can connect to Cobalt components.
    </para>
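
    <para>
      For example, after startup the registered components and the
      job queue might be checked as follows. Both commands are named
      in the text above; running them without arguments is an
      assumption of this sketch, and their output is not reproduced
      here.
    </para>

    <programlisting>
# /usr/sbin/slpstat.py
# cqstat
    </programlisting>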
  </section>

  <section>
    <title>Basic Component Testing</title>
    <para>
      Need to rewrite.
    </para>
  </section>
</chapter>