Memo: about problems with the speed of SOLVE
=============================================
23-MAY-97
26-MAY-97 18:48:22

Leonid Petrov
pet@leo.gsfc.nasa.gov

Direct measurements of CPU time showed that during a multisession run the fast
version of SOLVE (B1B3D) is faster than the old version by a factor of 5-20 in
the forward solution. It is faster by a factor of 10-15 in the backward run if
covariance matrices for segmented parameters are not estimated, or by 2-3 if
they are. But measurements of elapsed (wall-clock) time showed that the B1B3D
implementation in SOLVE has an advantage of only a factor of 2.0-2.5 for short
runs (5-20 sessions), 1.8-2.2 for moderate runs (100-200 sessions) and only
1.2-1.5 for full runs (2 000 000 observations, 2700 sessions).

Two questions emerge.
1) What is the reason?
2) What should be done?

I. What is the reason?
~~~~~~~~~~~~~~~~~~~~~~

The departure of elapsed time from CPU time may be caused by a variety of
factors: 1) input/output; 2) loading executables; 3) overheads of using
dynamic memory; 4) waiting for unavailable resources, etc. Direct measurements
showed that in the case of SOLVE the main factor is input/output.

The structure of SOLVE may be presented by the following scheme:

   BATCH
   ~~~~~|
        |-- (prces)  cycle on sessions
        |                |             bf|-- GTSUP
        |-- (arcset) ----|             bf|-- TRANS   3Mb(r)*
        |                |              f|-- PROC    3Mb(r),    1Mb(w)
        |-- GLOBL ------f|              f|-- ARCPE   2-20Mb(r), 1-19Mb(w), 1Mb(w)*
        |   ~~~~~        |               |-- NORML   20Mb(r),   20Mb(w)
        |                |              b|-- BACK    19Mb(r),   1Mb(r)*,   2Mb(w)
        |               b|              b|-- CRES    2Mb(r)
        |               b|              b|-- ADJST   2Mb(r)
        |
        |-- (saves) --- f--- COPYQ   1-19Mb(r), 1-19Mb(w)

The amounts of input/output are shown per ONE SESSION; "f" marks programs used
in the forward run, "b" in the backward run. The asterisk designates data
transferred via the network.

I measured elapsed and CPU time on AQUILA in different places of SOLVE for a
full run of 2710 sessions (ITRF/ICRF solution, 2153 global parameters, 20 min
atmosphere, 60 min clock, gradients, etc., including 16 user-partials).

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
              CPU time (sec)  elapsed time (sec)  number of calls
 1) BATCH1         341.35         25670.91             5421
 2) GLOBL           72.99           464.72             5421
 3) PROC          7259.91#         7534.81             2710
 4) POST_PRO      2569.07          3547.97             2710
 5) PRE_ARCPE     1098.90          1153.02             2710
 6) ARCPE        10741.76#        21264.28             2710
 7) BATCH2          63.00          2333.32             5420
 8) NORML          726.13#          726.13                1
 9) PRE_BACK      6539.81          6677.20             2710
10) BACK          3215.08#         6313.47             2710
11) CRES          3074.75#         3265.81             2710
==============================================================
Subtotal:        35702.75         78951.64
Unmeasured:                        8200.00
Total:                            87150.00     (24.5 hours)
                (25400.00#)                    ( 7   hours)

 # marks necessary expenses. All the others are overheads.

We see that the share of overheads exceeds 70%: input/output takes 3 times
more time than the computations themselves. Where is the main source of
overheads?

1) Subroutine "saves" of BATCH writes a copy of the CGM to disk after
   processing each session in the forward run. By the end of the run the CGM
   reaches a size of 18.7 Mb. Disk speed varies from computer to computer: for
   BOOTES it is about 10 Mb/sec, for LEO about 4 Mb/sec, for AQUILA 1-4 Mb/sec.
   Since the CPU time for processing one session in the forward run is on
   average about 7 sec, the time spent saving the CGM is comparable with the
   time of computation. BOOTES runs SOLVE faster simply because its disks are
   faster.

2) ARCPE reads and writes the CGM at each step. This overhead stems from the
   so-called train structure of SOLVE: a chain (or train) of different
   executables. I believe it was an original oversight in the design of SOLVE.
   Since data migrate from one executable to another, they are written to and
   read from disk ONLY in order to transfer them. For the same reason BACK
   rereads the CGM each time.

3) ARCPE writes and BACK reads arc-files. Since the arc-files are of
   considerable size (1.7 Gb for the solution mentioned above), they are put on
   remote disks and transferred via the network. The speed of the network
   varies from 10 to 800 Kb/sec; I think the average value is about 150 Kb/sec
   if nobody does ugly things, e.g. putting work-files on the disk of another
   computer. Thus, transferring an arc-file to and fro via the network takes
   about the same time as the computation.

4) Superfiles are located on different computers and are usually read via the
   network. This is a substantial source of overheads. The total amount of all
   superfiles is 11.2 Gb.

5) The executables themselves are of considerable size. For processing one
   session in the forward run we need to load 5 Mb, and for processing one
   session in the backward run 4.4 Mb. The total amount of input/output for a
   full solution is about 200 Gb! An estimate of the time needed to perform
   that much input/output is in satisfactory agreement with the actually
   measured time.

Other sources of overheads:

6) Partial derivatives with respect to troposphere gradients in PROC and CRES
   (partl) are calculated every time. Tests show that PROC and CRES work 1.5
   times slower when we adjust troposphere gradients. I think the time has come
   to calculate them in CALC and keep them in a new LCODE.

7) The CGM is completely reordered at each step in ARCPE when new parameters
   appear. This is a slow procedure, and it is not necessary to do it each
   time: it is sufficient to do it once, when the last arc has been processed,
   before NORML.
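A rough numerical check of point 5 above (a sketch in Python, used here only
for the arithmetic; the only values not taken from the figures quoted in this
section are the assumed effective transfer rates used for comparison):

  total_io_bytes = 200e9        # ~200 Gb of input/output per full run (point 5)
  elapsed_total  = 87150.0      # seconds for the full run (table above)
  cpu_necessary  = 25400.0      # seconds marked with '#' in the table
  overhead       = elapsed_total - cpu_necessary     # ~61750 s, mostly I/O

  # Average throughput implied by moving ~200 Gb within the overhead time:
  print("implied rate: %.1f Mb/sec" % (total_io_bytes / overhead / 1e6))

  # Time to move 200 Gb at assumed effective rates (between the quoted
  # network speed of ~0.15 Mb/sec and local-disk speeds of 1-10 Mb/sec):
  for rate in (1.0, 4.0, 10.0):                       # Mb/sec, assumed
      hours = total_io_bytes / (rate * 1e6) / 3600.0
      print("at %4.1f Mb/sec: %5.1f hours" % (rate, hours))

The implied average rate, about 3 Mb/sec, lies within the quoted range of disk
speeds, which is consistent with the "satisfactory agreement" noted in point 5.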
II. What should be done?
~~~~~~~~~~~~~~~~~~~~~~~~

At present SOLVE spends 70-75% of its time on input/output (and this share will
grow if we increase the number of parameters), and it is impossible to reduce
it without changing the structure of SOLVE (what could be done without such
changes I have done already). I propose:

1) To unite BATCH with PROC, ARCPE, NORML, BACK, CRES and ADJST in one
   executable. I united ARCPE with XDDER, ADDER and COPYQ three months ago; the
   logic of improving SOLVE forces the next step. This allows us to keep the
   CGM in memory from the very beginning to the end: we will be freed from
   writing and reading the CGM in the forward and backward runs, and from
   writing and reading large files whose only purpose is to pass information
   from PROC to ARCPE and from BACK to CRES to ADJST.

2) To change the saving/restoring algorithm and save the CGM not after
   processing each session but after processing every k sessions (k to be
   specified by the user); a sketch of this strategy is given at the end of
   this section.

3) To modify the ARCPE algorithm and reorder the CGM only after processing the
   last session.

4) To investigate the feasibility of writing the OBS-file to a "disk in
   memory". If this UNIX feature works well, it may save some time.

5) To move the calculation of partial derivatives with respect to troposphere
   gradients into CALC 8.3.

I expect that these steps will allow us to reduce overheads to the level of
20-25%. The amount of I/O will be reduced from 200 Gb to 20-30 Gb per run. It
is possible to reduce the computational time of PROC and ARCPE further, but
that is another song.

Points 1-3 may be done in the following sequence:

1) Transfer the algorithm for saving the CGM matrix into ARCPE.
2) Add new keywords in BATCH and change the strategy for saving the CGM (every
   k arcs).
3) Change the reordering strategy in ARCPE.
4) Finally, the most revolutionary change: merge BATCH+PROC+ARCPE and
   BATCH+BACK+CRES+ADJST.

This work may be done within 3-4 weeks.
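As promised in point 2 above, here is a minimal sketch of the proposed saving
strategy. It is written in Python purely for illustration (SOLVE itself is a
Fortran system); process_session and save_cgm are hypothetical stand-ins for
the real per-arc processing and CGM-saving code.

  def forward_run(sessions, k, process_session, save_cgm):
      # Process the arcs in order; write the CGM to disk only after every
      # k-th session and after the last one, so at most k-1 sessions of work
      # are lost if the run has to be restarted from the latest checkpoint.
      n = len(sessions)
      for i, session in enumerate(sessions, start=1):
          process_session(session)       # accumulate normal equations in memory
          if i % k == 0 or i == n:
              save_cgm(i)                # the only CGM disk traffic

  # With k = 1 this reproduces the current behaviour (one write of up to
  # ~19 Mb per session); with k = 20 the CGM traffic drops twentyfold at the
  # price of re-processing at most 19 sessions after a failure.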
III. What is the fastest way to use SOLVE now?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since SOLVE will remain overloaded by I/O for a while, we should realize it and
keep it in mind.

Timing showed that when we put arc-files on local disks, the elapsed time is
reduced by 40%. If we put the CGM on a local disk (there is an environment
variable CGM_DIR in the fast version of SOLVE!), the time of the BACK run is
reduced by about 40%. If we do not need the exact value of "chi**2/number of
degrees of freedom", we should set the keyword FAST_COV to LOC; this trick also
substantially reduces the time of the backward run. Work- and spool-files
should, of course, be on local disks. The executables, at least BATCH, GTSUP,
TRANS, PROC, ARCPE, NORML, BACK, CRES, ADJST and COPYQ, should also be on a
local disk.
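Since the gain from local disks depends entirely on their real throughput, it
may be worth measuring it before choosing directories. Below is a throwaway
helper (my sketch in Python, not part of SOLVE) that times the writing of a
CGM-sized file into a candidate directory; the example paths are hypothetical.

  import os, time, tempfile

  def write_rate_mb_per_sec(directory, size_mb=19):
      # Write a size_mb test file into 'directory' and return the rate in Mb/sec.
      chunk = b"\0" * (1024 * 1024)
      fd, path = tempfile.mkstemp(dir=directory)
      try:
          t0 = time.time()
          with os.fdopen(fd, "wb") as f:
              for _ in range(size_mb):
                  f.write(chunk)
              f.flush()
              os.fsync(f.fileno())        # make sure the data really reach the disk
          return size_mb / (time.time() - t0)
      finally:
          os.remove(path)

  # Example: compare a local scratch area with an NFS-mounted one
  # (both paths are hypothetical):
  # for d in ("/scratch", "/net/far_host/data"):
  #     print(d, "%.1f Mb/sec" % write_rate_mb_per_sec(d))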