<!-- root/trunk/doc/install.xml
     Revision 529 (checked in by voran, 2 years ago):
     more updates to the manual for 0.97 stuff; moved stuff around in
     the bookinfo section, added myself to the author list
-->
<chapter>
  <title>Installation</title>

  <para>
    This chapter describes how to install Cobalt. Once these steps are
    complete, Cobalt will be fully functional on the system.
  </para>

  <section>
    <title>Prerequisites</title>

    <para>
      Cobalt has three prerequisites. Each is described below, along
      with its function and a download location.
    </para>

    <variablelist>
      <varlistentry>
        <term>Python</term>
        <listitem>
          <para>
            Cobalt is written in Python and requires version 2.3 or
            greater.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>DB2-python</term>
        <listitem>
          <para>
            This library provides connectivity to DB2 databases from
            Python. It is only required for Cobalt on BG/L systems,
            and is available at ftp://ftp.mcs.anl.gov/pub/cobalt.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>PyOpenSSL</term>
        <listitem>
          <para>
            PyOpenSSL provides Python bindings for OpenSSL. It is
            required to support HTTPS on the server side, and is only
            needed on hosts where components execute.
          </para>
        </listitem>
      </varlistentry>
    </variablelist>
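
    <para>
      As a quick sanity check before installing (this helper is not
      part of Cobalt, just an illustration), the Python version
      requirement can be verified from Python itself:
    </para>

    <programlisting>
import sys

def meets_requirement(version_info, minimum=(2, 3)):
    # Compare the (major, minor) prefix against the required minimum.
    return tuple(version_info[:2]) >= minimum

if not meets_requirement(sys.version_info):
    sys.exit("Cobalt requires Python 2.3 or greater")
    </programlisting>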
  </section>

  <section>
    <title>Software Installation</title>

    <para>
      Install Python, DB2-python, and PyOpenSSL on the server side. On
      SLES9, this can be accomplished by running:
    </para>

    <programlisting>
# rpm -ihv \
ftp://ftp.mcs.anl.gov/pub/cobalt/rpms/sles9-ppc64/PyOpenSSL-0.6-1.ppc64.rpm
# rpm -ihv \
ftp://ftp.mcs.anl.gov/pub/cobalt/rpms/sles9-ppc64/cobalt-0.97-1.ppc64.rpm
    </programlisting>

    <para>
      On both the client and server sides, install the Cobalt clients:
    </para>

    <programlisting>
# rpm -ihv \
ftp://ftp.mcs.anl.gov/pub/cobalt/rpms/sles9-ppc64/cobalt-clients-0.97-1.ppc64.rpm
    </programlisting>

  </section>

  <section>
    <title>Configuring the Cobalt Component Infrastructure</title>

    <para>
      Cobalt uses HTTPS for data security between components and their
      clients. Each machine where components run must have its own
      SSL key, which can be generated by running:
    </para>

    <programlisting>
# openssl req -x509 -nodes -days 1000 -newkey rsa:1024 \
-out /etc/cobalt.key -keyout /etc/cobalt.key
    </programlisting>

    <para>
      Components can be located using static records in
      <filename>/etc/cobalt.conf</filename>, or through dynamic
      service location. The service location component (slp.py) is
      bootstrapped similarly to DNS: if a direct reference to a
      component isn't included in
      <filename>/etc/cobalt.conf</filename>, the component is looked
      up through the component listed as "service-location".
    </para>
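
    <para>
      The lookup order described above can be sketched as follows. This
      is purely illustrative; the function names and the dictionary
      standing in for the static records are hypothetical, not the real
      Cobalt library interface.
    </para>

    <programlisting>
def query_service_location(locator_address, component):
    # Placeholder for the RPC that asks the service location
    # component where another component lives.
    return "resolved via %s" % locator_address

def locate(component, static_records):
    # A direct reference in /etc/cobalt.conf wins; otherwise fall
    # back to the component listed as "service-location".
    if component in static_records:
        return static_records[component]
    locator = static_records.get("service-location")
    if locator is None:
        raise KeyError("no static record and no service-location entry")
    return query_service_location(locator, component)
    </programlisting>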

    <para>
      Copy the sample cobalt.conf file into place, and change the
      hostname in the service location component line to the host
      where the Cobalt components will run. Choose a secret password,
      and place it in the password field of the communication
      section. Once this is done, the Cobalt component infrastructure
      is completely configured.
    </para>
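
    <para>
      Assuming the sample file is INI-style (as the section/field
      wording above suggests), the password could also be set
      programmatically. This snippet is an illustration, not a Cobalt
      tool:
    </para>

    <programlisting>
try:
    from configparser import ConfigParser    # Python 3
except ImportError:
    from ConfigParser import ConfigParser    # Python 2

def set_password(conf_path, secret):
    # Write the shared secret into the password field of the
    # communication section of cobalt.conf.
    parser = ConfigParser()
    parser.read(conf_path)
    if not parser.has_section("communication"):
        parser.add_section("communication")
    parser.set("communication", "password", secret)
    with open(conf_path, "w") as conf:
        parser.write(conf)
    </programlisting>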
  </section>

  <section>
    <title>Cobalt Component Startup</title>

    <para>
      Cobalt includes the following components for resource
      management. Each of these components provides a specific type of
      functionality.
    </para>

    <variablelist>
      <varlistentry>
        <term>Service Location Protocol</term>
        <listitem>
          <para>
            The service location component tracks the locations of
            the active components in the system. It uses a heartbeat
            mechanism to detect component failure or exit, and can be
            queried with the slpstat command.
          </para>
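          <para>
            The heartbeat bookkeeping might look roughly like the
            sketch below; the class, timeout value, and method names
            are illustrative, not slp.py's actual implementation.
          </para>

          <programlisting>
import time

class Registry(object):
    """Toy sketch of heartbeat-based liveness tracking."""

    def __init__(self, timeout=90.0):
        self.timeout = timeout
        self.last_seen = {}   # component name -> last heartbeat time

    def heartbeat(self, name, now=None):
        self.last_seen[name] = time.time() if now is None else now

    def active(self, now=None):
        # A component is considered failed or exited once its
        # heartbeat goes stale.
        now = time.time() if now is None else now
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen &lt;= self.timeout)
          </programlisting>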
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Process Manager</term>
        <listitem>
          <para>
            The process manager starts, manages, signals, and cleans
            up parallel processes. On BG/L, its functionality is
            implemented using the built-in process management system
            provided by IBM. The program is
            <filename>/usr/sbin/bgpm.py</filename>. Bgpm requires
            several configuration parameters to be set in
            <filename>/etc/cobalt.conf</filename>. These parameters
            control environment setup for executed jobs; incorrect
            parameters can cause process execution to fail on nodes.
            This process is started by the cobalt init.d script.
          </para>
          <para>
            Configuration file options are documented in the bgpm(8)
            man page.
          </para>
          <para>
            On clusters, the process manager uses MPD to start
            processes. The component is called
            <filename>/usr/sbin/mpdpm.py</filename> and is started
            by the sss-pm init script. Mpdpm doesn't currently take
            any configuration file parameters.
          </para>

          <para>
            On Blue Gene/L, the mpirun command must work for users
            on the host running bgpm.py, usually the service node. In
            most cases, rsh/ssh must be reconfigured to allow users
            to ssh from the service node back to the service node
            itself; this allows the mpirun frontend to properly
            contact the mpirun backend.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Queue Manager</term>
        <listitem>
          <para>
            The queue manager handles all aspects of action
            aggregation related to jobs. For example, it uses the
            process manager interfaces to run user jobs, as well as
            prologue and epilogue scripts. It also manages job stdio
            on systems without a global shared filesystem.
          </para>
          <para>
            Cqm is the Cobalt implementation of the queue manager. It
            is common to both BG/L and clusters, though it must be
            configured slightly differently for each. It uses a number
            of parameters in <filename>/etc/cobalt.conf</filename>
            that control the behavior of jobs and which external
            systems are used. The queue manager currently has support
            for file staging (for machines without global shared
            filesystems), and basic support for allocation
            management. This daemon is started by the cobalt init.d
            script.
          </para>
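          <para>
            The ordering the queue manager enforces around prologue
            and epilogue scripts can be sketched as below; the
            callables stand in for the real process manager
            interfaces, and the function is purely illustrative.
          </para>

          <programlisting>
def run_job(prologue, job, epilogue):
    # The prologue runs first; the epilogue runs even if the job
    # itself fails, so cleanup always happens.
    prologue()
    try:
        job()
        return True
    except Exception:
        return False
    finally:
        epilogue()
          </programlisting>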
          <para>
            All configuration file options are documented in the
            cqm(8) man page.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Scheduler</term>
        <listitem>
          <para>
            The scheduler controls resource allocation for job
            execution. It tells the queue manager when and where to
            run jobs. Due to differences in scheduling requirements,
            Blue Gene/L systems and clusters require different
            schedulers.
          </para>
          <para>
            Bgsched is the scheduler for Blue Gene/L systems. It
            internally tracks partition state and performs DB2
            queries to ensure coherent partition usage in case of
            problems. Bgsched currently only accepts configuration
            options that control database connection parameters. These
            options are documented in the bgsched(8) man page. It is
            started by the cobalt init.d script.
          </para>
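          <para>
            The core placement decision can be illustrated with a
            best-fit sketch: choose the smallest free partition large
            enough for the job. This is not bgsched's actual
            algorithm, just an illustration of scheduling driven by
            tracked partition state.
          </para>

          <programlisting>
def pick_partition(partitions, nodes_needed):
    # partitions: list of (name, size, is_free) tuples, a stand-in
    # for the partition state the scheduler tracks internally.
    candidates = [(size, name) for name, size, free in partitions
                  if free and size >= nodes_needed]
    if not candidates:
        return None          # job must wait for a partition to free up
    return min(candidates)[1]
          </programlisting>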
          <para>
            Describe the cluster scheduler here.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Allocation Manager</term>
        <listitem>
          <para>
            The allocation manager tracks users, their project
            memberships, and time allocations. It is used by the
            scheduler to control resource allocation. A common
            allocation manager is used on both cluster and Blue
            Gene/L systems. It currently has no configuration file
            options, and isn't yet started by the cobalt init.d
            script.
          </para>
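          <para>
            The bookkeeping involved can be sketched as a toy model;
            the class and its fields are illustrative, not the real
            component's data structures.
          </para>

          <programlisting>
class Allocations(object):
    """Toy model of user/project/time tracking."""

    def __init__(self):
        self.members = {}   # project -> set of user names
        self.balance = {}   # project -> remaining node-hours

    def add_project(self, project, users, node_hours):
        self.members[project] = set(users)
        self.balance[project] = node_hours

    def charge(self, project, user, node_hours):
        # Refuse the charge if the user is not a project member or
        # the allocation would go negative.
        if user not in self.members.get(project, ()):
            return False
        if self.balance[project] &lt; node_hours:
            return False
        self.balance[project] -= node_hours
        return True
          </programlisting>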
        </listitem>
      </varlistentry>
    </variablelist>

    <para>
      Once each of these components is started, an entry will appear
      in the service location component. The entries can be displayed
      with a call to <filename>/usr/sbin/slpstat.py</filename>.
    </para>

    <para>
      Each component can also be queried with a component-specific
      tool. For example, the queue manager can be queried with the
      <filename>cqstat</filename> command. See the clients directory
      for other commands that can connect to Cobalt components.
    </para>
  </section>

  <section>
    <title>Basic Component Testing</title>
    <para>
      Need to rewrite.
    </para>
  </section>
</chapter>