Memo: about problems with the speed of SOLVE
=============================================
23-MAY-97
26-MAY-97 18:48:22

Leonid Petrov
pet@leo.gsfc.nasa.gov

Direct measurements of CPU time showed that during a multisession run the fast
version of SOLVE (B1B3D) is faster than the old version by a factor of 5-20 in
the forward solution. It is faster by a factor of 10-15 in the backward run if
covariance matrices for segmented parameters are not estimated, or by 2-3 if
they are. But measurements of elapsed (wall-clock) time showed that the B1B3D
implementation in SOLVE has an advantage of only a factor of 2.0-2.5 for short
runs (5-20 sessions), 1.8-2.2 for moderate runs (100-200 sessions) and only
1.2-1.5 for full runs (2 000 000 observations, 2700 sessions).

Two questions emerge.
1) What is the reason?
2) What should be done?

I. What is the reason?
~~~~~~~~~~~~~~~~~~~~~~

The departure of elapsed time from CPU time may be caused by a variety of
factors: 1) input/output; 2) loading executables; 3) overheads of using
dynamic memory; 4) waiting for unavailable resources, etc. Direct measurements
showed that in the case of SOLVE the main factor is input/output.

The structure of SOLVE may be presented by the following scheme:

   BATCH
   ~~~~~|
        |-- (prces)  cycle on sessions
        |                |             bf|-- GTSUP
        |-- (arcset) ----|             bf|-- TRANS   3Mb(r)*
        |                |              f|-- PROC    3Mb(r),    1Mb(w)
        |-- GLOBL ------f|              f|-- ARCPE   2-20Mb(r), 1-19Mb(w), 1Mb(w)*
        |   ~~~~~        |               |-- NORML   20Mb(r),   20Mb(w)
        |                |              b|-- BACK    19Mb(r),   1Mb(r)*,   2Mb(w)
        |               b|              b|-- CRES    2Mb(r)
        |               b|              b|-- ADJST   2Mb(r)
        |
        |-- (saves) --- f--- COPYQ   1-19Mb(r), 1-19Mb(w)

The amounts of input/output are shown per ONE SESSION; "f" marks programs used
in the forward run, "b" in the backward run. The asterisk designates data
transferred via the network.

I measured elapsed and CPU time on AQUILA in different places of SOLVE for a
full run of 2710 sessions (ITRF/ICRF solution, 2153 global parameters, 20 min
atmosphere, 60 min clock, gradients, etc., including 16 user-partials).

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
              CPU time (sec)  elapsed time (sec)  number of calls
 1) BATCH1         341.35         25670.91             5421
 2) GLOBL           72.99           464.72             5421
 3) PROC          7259.91#         7534.81             2710
 4) POST_PRO      2569.07          3547.97             2710
 5) PRE_ARCPE     1098.90          1153.02             2710
 6) ARCPE        10741.76#        21264.28             2710
 7) BATCH2          63.00          2333.32             5420
 8) NORML          726.13#          726.13                1
 9) PRE_BACK      6539.81          6677.20             2710
10) BACK          3215.08#         6313.47             2710
11) CRES          3074.75#         3265.81             2710
==============================================================
Subtotal:        35702.75         78951.64
Unmeasured:                        8200.00
Total:                            87150.00     (24.5 hours)
                (25400.00#)                    ( 7   hours)

 # marks necessary expenses. All the others are overheads.

We see that the share of overheads exceeds 70%: input/output takes 3 times
more time than the computations themselves. Where is the main source of
overheads?

1) Subroutine "saves" of BATCH writes a copy of the CGM to disk after
   processing each session in the forward run. By the end of the run the CGM
   reaches a size of 18.7 Mb. Disk speed varies from computer to computer: for
   BOOTES it is about 10 Mb/sec, for LEO about 4 Mb/sec, for AQUILA 1-4 Mb/sec.
   Since the CPU time for processing one session in the forward run is on
   average about 7 sec, the time spent saving the CGM is comparable with the
   time of computation. BOOTES runs SOLVE faster simply because its disks are
   faster.

2) ARCPE reads and writes the CGM at each step. This overhead stems from the
   so-called train structure of SOLVE: a chain (or train) of different
   executables. I believe it was an original oversight in the design of SOLVE.
   Since data migrate from one executable to another, they are written to and
   read from disk ONLY in order to transfer them. For the same reason BACK
   rereads the CGM each time.

3) ARCPE writes and BACK reads arc-files. Since the arc-files are of
   considerable size (1.7 Gb for the solution mentioned above), they are put on
   remote disks and transferred via the network. The speed of the network
   varies from 10 to 800 Kb/sec; I think the average value is about 150 Kb/sec
   if nobody does ugly things, e.g. putting work-files on the disk of another
   computer. Thus, transferring an arc-file to and fro via the network takes
   about the same time as the computation.

4) Superfiles are located on different computers and are usually read via the
   network. This is a substantial source of overheads. The total amount of all
   superfiles is 11.2 Gb.

5) The executables themselves are of considerable size. For processing one
   session in the forward run we need to load 5 Mb, and for processing one
   session in the backward run 4.4 Mb. The total amount of input/output for a
   full solution is about 200 Gb! An estimate of the time needed to perform
   that much input/output is in satisfactory agreement with the actually
   measured time.

Other sources of overheads:

6) Partial derivatives with respect to troposphere gradients in PROC and CRES
   (partl) are calculated every time. Tests show that PROC and CRES work 1.5
   times slower when we adjust troposphere gradients. I think the time has come
   to calculate them in CALC and keep them in a new LCODE.

7) The CGM is completely reordered at each step in ARCPE when new parameters
   appear. This is a slow procedure, and it is not necessary to do it each
   time: it is sufficient to do it once, when the last arc has been processed,
   before NORML.
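A rough numerical check of point 5 above (a sketch in Python, used here only
for the arithmetic; the only values not taken from the figures quoted in this
section are the assumed effective transfer rates used for comparison):

  total_io_bytes = 200e9        # ~200 Gb of input/output per full run (point 5)
  elapsed_total  = 87150.0      # seconds for the full run (table above)
  cpu_necessary  = 25400.0      # seconds marked with '#' in the table
  overhead       = elapsed_total - cpu_necessary     # ~61750 s, mostly I/O

  # Average throughput implied by moving ~200 Gb within the overhead time:
  print("implied rate: %.1f Mb/sec" % (total_io_bytes / overhead / 1e6))

  # Time to move 200 Gb at assumed effective rates (between the quoted
  # network speed of ~0.15 Mb/sec and local-disk speeds of 1-10 Mb/sec):
  for rate in (1.0, 4.0, 10.0):                       # Mb/sec, assumed
      hours = total_io_bytes / (rate * 1e6) / 3600.0
      print("at %4.1f Mb/sec: %5.1f hours" % (rate, hours))

The implied average rate, about 3 Mb/sec, lies within the quoted range of disk
speeds, which is consistent with the "satisfactory agreement" noted in point 5.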
II. What should be done?
~~~~~~~~~~~~~~~~~~~~~~~~

At present SOLVE spends 70-75% of its time on input/output (and this share will
grow if we increase the number of parameters), and it is impossible to reduce
it without changing the structure of SOLVE (what could be done without such
changes I have done already). I propose:

1) To unite BATCH with PROC, ARCPE, NORML, BACK, CRES and ADJST in one
   executable. I united ARCPE with XDDER, ADDER and COPYQ three months ago; the
   logic of improving SOLVE forces the next step. This allows us to keep the
   CGM in memory from the very beginning to the end: we will be freed from
   writing and reading the CGM in the forward and backward runs, and from
   writing and reading large files whose only purpose is to pass information
   from PROC to ARCPE and from BACK to CRES to ADJST.

2) To change the saving/restoring algorithm and save the CGM not after
   processing each session but after processing every k sessions (k to be
   specified by the user); a sketch of this strategy is given at the end of
   this section.

3) To modify the ARCPE algorithm and reorder the CGM only after processing the
   last session.

4) To investigate the feasibility of writing the OBS-file to a "disk in
   memory". If this UNIX feature works well, it may save some time.

5) To move the calculation of partial derivatives with respect to troposphere
   gradients into CALC 8.3.

I expect that these steps will allow us to reduce overheads to the level of
20-25%. The amount of I/O will be reduced from 200 Gb to 20-30 Gb per run. It
is possible to reduce the computational time of PROC and ARCPE further, but
that is another song.

Points 1-3 may be done in the following sequence:

1) Transfer the algorithm for saving the CGM matrix into ARCPE.
2) Add new keywords in BATCH and change the strategy for saving the CGM (every
   k arcs).
3) Change the reordering strategy in ARCPE.
4) Finally, the most revolutionary change: merge BATCH+PROC+ARCPE and
   BATCH+BACK+CRES+ADJST.

This work may be done within 3-4 weeks.
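As promised in point 2 above, here is a minimal sketch of the proposed saving
strategy. It is written in Python purely for illustration (SOLVE itself is a
Fortran system); process_session and save_cgm are hypothetical stand-ins for
the real per-arc processing and CGM-saving code.

  def forward_run(sessions, k, process_session, save_cgm):
      # Process the arcs in order; write the CGM to disk only after every
      # k-th session and after the last one, so at most k-1 sessions of work
      # are lost if the run has to be restarted from the latest checkpoint.
      n = len(sessions)
      for i, session in enumerate(sessions, start=1):
          process_session(session)       # accumulate normal equations in memory
          if i % k == 0 or i == n:
              save_cgm(i)                # the only CGM disk traffic

  # With k = 1 this reproduces the current behaviour (one write of up to
  # ~19 Mb per session); with k = 20 the CGM traffic drops twentyfold at the
  # price of re-processing at most 19 sessions after a failure.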
III. What is the fastest way to use SOLVE now?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since SOLVE will remain overloaded by I/O for a while, we should realize it and
keep it in mind.

Timing showed that when we put arc-files on local disks, the elapsed time is
reduced by 40%. If we put the CGM on a local disk (there is an environment
variable CGM_DIR in the fast version of SOLVE!), the time of the BACK run is
reduced by about 40%. If we do not need the exact value of "chi**2/number of
degrees of freedom", we should set the keyword FAST_COV to LOC; this trick also
substantially reduces the time of the backward run. Work- and spool-files
should, of course, be on local disks. The executables, at least BATCH, GTSUP,
TRANS, PROC, ARCPE, NORML, BACK, CRES, ADJST and COPYQ, should also be on a
local disk.
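Since the gain from local disks depends entirely on their real throughput, it
may be worth measuring it before choosing directories. Below is a throwaway
helper (my sketch in Python, not part of SOLVE) that times the writing of a
CGM-sized file into a candidate directory; the example paths are hypothetical.

  import os, time, tempfile

  def write_rate_mb_per_sec(directory, size_mb=19):
      # Write a size_mb test file into 'directory' and return the rate in Mb/sec.
      chunk = b"\0" * (1024 * 1024)
      fd, path = tempfile.mkstemp(dir=directory)
      try:
          t0 = time.time()
          with os.fdopen(fd, "wb") as f:
              for _ in range(size_mb):
                  f.write(chunk)
              f.flush()
              os.fsync(f.fileno())        # make sure the data really reach the disk
          return size_mb / (time.time() - t0)
      finally:
          os.remove(path)

  # Example: compare a local scratch area with an NFS-mounted one
  # (both paths are hypothetical):
  # for d in ("/scratch", "/net/far_host/data"):
  #     print(d, "%.1f Mb/sec" % write_rate_mb_per_sec(d))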