FY 2001 User Survey Results: Hardware Resources

Legend:

Satisfaction	Average Score
Very Satisfied	6.5 - 7
Mostly Satisfied	5.5 - 6.4
Somewhat Satisfied	4.5 - 5.4

Significance of Change: significant increase, significant decrease, not significant

Satisfaction - Compute Platforms:

Topic	No. of Responses	Average Score	Std. Dev.	Change from 2000
PVP Uptime	64	6.45	0.83	0.04
T3E Overall	92	6.23	0.96	0.22
T3E Uptime	81	6.22	1.11	0.13
PVP Overall	69	6.14	1.06	0.28
PVP Disk Configuration and I/O Performance	52	6.00	1.03	0.23
PVP Ability to Run Interactively	60	5.98	1.05	-0.13
SP Overall	84	5.82	1.39	-0.06
SP Disk Configuration and I/O Performance	54	5.67	1.35	0.47
T3E Ability to Run Interactively	74	5.64	1.30	-0.07
T3E Disk Configuration and I/O Performance	63	5.60	1.25	0.25
SP Uptime	77	5.53	1.71	-0.99
PVP Queue Structure	54	5.41	1.38	0.38
T3E Queue Structure	75	5.36	1.31	0.09
SP Queue Structure	68	5.19	1.41	-0.03
T3E Batch Wait Time	80	4.97	1.48	0.64
SP Batch Wait Time	76	4.92	1.65	0.38
SP Ability to Run Interactively	68	4.71	1.85	-0.80
PVP Batch Wait Time	59	4.56	1.65	0.30
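
The score bands in the legend and the "Change from 2000" column can be read together with a small helper. The following is only a sketch, not part of the survey methodology: the function names are hypothetical, and it assumes that a score falling in a gap between bands (for example 6.45) belongs to the lower band and that the change column is simply the 2001 average minus the 2000 average.

```python
# Lower bounds of the satisfaction bands, copied from the legend.
# Assumption: scores in the gaps between bands (e.g. 6.45) fall into the lower band.
BANDS = [
    (6.5, "Very Satisfied"),
    (5.5, "Mostly Satisfied"),
    (4.5, "Somewhat Satisfied"),
]


def satisfaction_label(score):
    """Return the legend label for an average score, or None below 4.5."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return None  # the legend defines no label for scores under 4.5


def score_in_2000(score_2001, change):
    """Implied 2000 average, assuming change = 2001 score - 2000 score."""
    return round(score_2001 - change, 2)


if __name__ == "__main__":
    # SP Uptime row from the table above: 5.53 in 2001, change of -0.99.
    print(satisfaction_label(5.53), score_in_2000(5.53, -0.99))
```

Under these assumptions, the SP Uptime row would correspond to a drop from the "Very Satisfied" band in 2000 to "Mostly Satisfied" in 2001.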

Max Processors Used and Max Code Can Effectively Use:

Processor Type	No. of Responses	Average No. of Processors	Std. Dev.	Change from 2000
Max SP Processors Used	72	202	326	+61
Max SP Processors Can Use	56	751	994	+160
Max T3E Processors Used	73	133	151	-13
Max T3E Processors Can Use	54	356	529	+56
Max PVP Processors Used	46	10	21	+0.8
Max PVP Processors Can Use	36	30	93	+20

Satisfaction - HPSS:

Topic	No. of Responses	Average Score	Std. Dev.	Change from 2000
Reliability	83	6.63	0.62	0.24
HPSS Overall	101	6.50	0.74	0.24
Performance	89	6.36	0.92	0.16
Uptime	88	6.33	0.91	0.02
User Interface	92	6.02	1.22	-0.12

Satisfaction - Servers:

Topic	No. of Responses	Average Score	Std. Dev.	Change from 2000
Newton	15	5.47	1.19	-0.08
Escher	13	5.08	1.04	-0.17

Summary of Hardware Comments

Comments on NERSC's IBM SP: 43 responses

14	provide more interactive services
13	good machine / useful features
7	stability problems
6	provide longer queues
5	hard to use/software problems
4	don't like charging structure
3	improve turnaround time
3	slow communications
1	more inodes
1	remove small jobs

Comments on NERSC's Cray T3E: 26 responses

11	good machine / useful features
4	provide longer queues
3	improve turnaround time / obsolete
2	hard to use/software problems
2	more inodes / more disk
2	stability problems
1	provide better interactive services
1	remove small jobs

Comments on NERSC's Cray PVP Cluster: 18 responses

6	too slow / obsolete
5	good machine / useful features
3	provide longer queues / fewer checkpoints
2	improve service for big memory jobs
2	improve turnaround time
2	more inodes / more disk

Comments on NERSC's HPSS Storage System: 29 responses

10	good system
4	hard to use / software problems
3	authentication / password issues
3	need expanded functionality
3	don't like the down times
1	don't like the SRU accounting
1	performance improvements

Comments about NERSC's auxiliary servers: 5 responses

Comments on NERSC's IBM SP: 43 responses

Good machine:

Seaborg has been a terrific resource for my group. We couldn't get 90% of the work done without it.

[...] Individual processor performance is quite good.

Generally it works quite well and I prefer it to the T3E (batch turnaround seems faster)

Great machine! Perfect for my work that requires computing power

The best machine ever, no comparison to anything else that I can access.

This is a great machine! I think it's really powerful ...

Great Improvement over previous machine, now that increased memory per node is available.

Have only tested small problems on seaborg; know from ASCI blue pacific that this architecture is useful for domain decomposed PDEs.

Like it a lot save for a few modifications that would be self serving. I realize that what I would like probably wouldn't be so great for the rest of the community.

The number of nodes and processors available makes this facility ideal for our purposes. [...]

The machine is great; the queues are fine; it is very speedy and responsive. [...]

The machine is excellent. The turn around time is sometimes longer than I would like. I hope that the computing power available at NERSC will continue to grow rapidly.

Fun, [...]

Provide more interactive services:

MORE/BETTER SUPPORT FOR INTERACTIVE JOBS!!! Debugging parallel programs on this machine is a total pain because you're lucky if you can get 2 interactive nodes, and even then you have to wait forever to get them. [...]

Running parallel debugging (Totalview) and profiling (Vampyr) tools interactively has not been working well for me. Totalview usually complains about not getting enough resources even for a small (4) number of processors.

I was trying to debug a program on the SP. It was an utterly frustrating and tedious process. Either seaborg was down completely, or the loadleveler would not respond, or the loadleveler would not let me run my demanding one-node-one-CPU-five-seconds-of-CPU-time job, or the debugger would not start because its license server was down. It took me days to complete a trivial debugging job which would have been a matter of hours on a workstation.

My postdoc has been getting our application up and running and I have not yet had a chance to do large runs myself. For learning the system, it is always easier to try things interactively. Certainly this was the case on T3E. It is so easy to get batch scripts wrong.

Debugging environment is atrocious.

Would like to see more resource allocation for interactive use.

When I submit an interactive job with poe and there are no processors available, it bounces back, and I have to submit it again. I find it more convenient to have the job put in a "dormant" status, and start once it gets enough processors available. This is what happens in the cray machine with mpprun. [...]

The interactive runtime limits are too short. Debugging large runs is painful. [...]

Running interactively is almost impossible since there is no separate interactive queue. Waiting around for an interactive processor to come free --- or getting lucky --- is not a good use of time. Using the debug queue is the only alternative.

[...] Running interactive is hard sometimes. There should be more processors allocated to running interactive files. [...]

The 20 minute interactive single processor limit will kill my ability to do in-situ data manipulation and analysis effectively.

Currently, running small jobs interactively to debug can be nearly impossible because no nodes are available. Thus, there is no way to debug a code using totalview. [...]

The fact that there is no interactive queue is sometimes a problem, especially for totalview.

My only difficulty in running here was to try to get a serial run as a benchmark for code performance

Stability problems:

NERSC seems like they take a longer time to bring up new hardware than other centers I use and often it starts out flakier than it should be. [...]

Improved stability. [...]

[...] It is just down a lot of time (or was in recent past). Is there a way to cut down on the down time?

[...] Seaborg is down a lot. Gseaborg was never down that much. Other than that, I have no complaints

IBM switch doesn't really seem to be reliable

I am VERY discouraged with the amount of down time lately on SEABORG. This is a very nice configuration which does NOT live up to its potential because of the hardware problems. I hope it improves. [...]

[...] uptime also needs work [...]

Provide longer queues:

The 8 hour real time limit makes my project totally impossible. It is the worst regulation imaginable.

Maximum of 8 hrs per job could be increased.

[...] Also, there are no queues with long CPU time (more than 8 hours). Sometimes, they are useful.

a longer running queue (say, 24 hours maximum) with limited number of nodes on seaborg may be useful.

a maximum cpu time greater than 8 hrs. would sometimes be helpful.

The Clock limit on SP of 8 hrs is set assuming, i guess, that all the users can run parallel code across a large # of processors. There might be cases like mine, where i have a serial code which when compiled with SMP options can run on 4-8 proc in the same node. But it takes lots of hours (~180 clock hrs), a lot more than the limit set for SP users. [...]

Hard to use / software problems:

[...] Improved documentation on queue status.

[...] The xlf compiler isn't very good, IMHO.

[...] Also, the ksh shell doesn't seem to work properly.

[...] Also, it will be of great help for people like me, who are not used to parallelizing the code, to have a consulting person who can work one on one with us to give us suggestions on how to go about doing the parallelization for my problem effectively. I know it may be too much to ask for, but i'm sure a lot more people will come forward to use these resources in a more efficient way.

I only just started using seaborg. I can only log in from mcurie -- not from my home machine. [...]

Don't like charging structure:

[...] I also strongly disagree with the charge factor

[...] The charging of 2.5X for processor hours, however causes us to use up all of our time very quickly. We could go through 100,000 hours easily within weeks and we limit our jobs because of this. The wait in the queue is satisfactory EXCEPT at the end of a fiscal year as everyone used "premium" to get their jobs going. We had to use "premium" because "regular" left us with a six day wait time (compared with a 24 hour wait time a month earlier).

I don't like the new way the hours are charged. I don't always use 16, 32, etc processors. [...]

[...] In addition, although I get charged 2.5 times as much to run on the SP, for jobs with internode communication it really only gets slightly better performance than the T3E. What a waste.

Improve turnaround time:

[...] For a new machine (Seaborg) the queues seemed to become slow very quickly.

[...] The turn around time is sometimes longer than I would like. [...]

[...] Also, it is again, like all other NERSC machines, oversubscribed, so at first it is convenient to use, but then it becomes so slow that turn around times go out the roof!

Slow communications:

Fun, even with nasty latencies.

Limited by communications both on and off node. Not only does it need higher bandwidth and lower latency and/or truly asynchronous communications, but it also needs the ability to transfer data directly between the L2 caches of different CPU's (on- and off-node). [...]

[...] for jobs with internode communication it really only gets slightly better performance than the T3E.

More inodes:

[...] abolish inode quotas!!!!

Remove small jobs:

It seems like a lot of people are not using it effectively. That is, it's a world-class machine being used for a lot of smallish jobs that would be more effective elsewhere. NERSC in general needs a mid-class machine to take the load off of the high-end machines. [...]

Just starting / don't use:

I would like to try it out

still coming up on learning curve

Just started to utilize IBM SP3 at NERSC, so I am unable to make specific comments.

I haven't extensively used the SP since it moved into phase II although I anticipate this increasing the number of processors that some of my codes can use.

Group members will answer detailed questions. My answers to top two questions reflect the complaint level I hear.

Comments on NERSC's Cray T3E: 26 responses

Good machine / useful features:

sad if has to go

Good communication bandwidth.

One can get 4 hours a day in on the 256 pe queue, which is really good for production. The smaller queues are not very effective, in that a 16 beowulf system running 24 hours a day on 1 gigahz processors gives me a factor of three improvement in speed over the 64 pe queue running 4 hours a day and 1.5 improvement over the 128 pe queue. As this beowulf has 512 mb per pe, the 64 pe queue can still do a problem twice as large and the 128 queue one four times as large.

Fun. [...]

Have made good use of this resource for published data on parallel scaling of domain decomposed PDE solvers in the past (mostly through junior collaborators, not personally)

Its great, and is worthy rival to the SP considering the charge factor and good inter-processor comms.

The Cray compiler and Cray totalview debugger are fantastic for development. The SP3 totalview is not even in the same league. I will be extremely sad to see this machine go. The throughput is currently much faster on this machine as well.

Getting old, but still good. [...]

It's been great so far, but I must admit I'm on the steep side of the learning curve. It's easy to use and my jobs seem to go quickly. Note that I'm constantly reevaluating how effective my code is, so this answer may change in two days or two months or never.

A very nice machine. Too bad it's obsolete, and I wish they built a successor.

I am happy with the T3E.

Provide longer queues:

Is the 4-hour CPU limit extendable?

The time in the queue is far too short for today's applications - this is I know similar to other facilities. It means jobs must often be stopped and started and while most information can be stored in is somewhat frustrating that a complete calculation must take many many submits.

The small amount of time allotted for the largest jobs limits what I can get done on the T3E.

4 hours is far from enough

Improve turnaround time / obsolete:

Beginning to show its age. Batch turnaround is often slow

that thing should go into a museum.

queue much too long

Hard to use / better software / better docs:

debugging support is lousy - aren't there some decent debuggers for C code out there that run on Unicos? [...]

Several I/O and particularly default variable issues (I*4 vs I*8) that hindered the porting of my code, and in fact never got completely ported.

More inodes / more disk:

[...] abolish inode quotas!!!!

Disk space and inode availability is becoming a significant headache here, almost preventing useful work.

Stability problems:

[...] uptime also needs work

[...] Down too often.

Provide better interactive services:

Interactive response time is terrible.

Remove small jobs:

[...] The smaller queues are not very effective, in that a 16 beowulf system running 24 hours a day on 1 gigahz processors gives me a factor of three improvement in speed over the 64 pe queue running 4 hours a day and 1.5 improvement over the 128 pe queue. [...]

Just starting / don't use much:

only used for testing

still not into production running yet

I am not using the t3e very much anymore

Other:

i used the t3e for benchmarking some numerical application. The batch queue structure seems to make this benchmarking difficult. Benchmarking may not be a major issue for applications that are run in production mode.

Fun. I wish there were more thorough low-level docs available, but that's difficult.

Comments on NERSC's Cray PVP Cluster: 18 responses

Too slow / obsolete:

The J90/SV1 cluster has never provided the performance of the C90. It was obsolete before it was purchased. A replacement needs to be purchased. I have projects running on this system that should have been finished 3 years ago. [...]

I find interactivity (e.g. compiling) on Killeen is 3-5X slower than on the MPP platforms. I generally now avoid running on Killeen if possible

It seems that it just isn't all that fast compared to my desktop Linux box; the IMSL & NCAR libraries are the main thing.

Faster (clock time) to run PIC code on desktop machine. No support for Python problems.

Not using; no real reason to use since desktop machines are now powerful and cheap. Having it go away at the end of this FY will not be a big loss.

What is Killeen? Whether it is a T3E or a PVP Cluster, it seems to do what I want, but I wish it were 10 times faster.

Good machine / useful features:

They are essential to run some of the invaluable legacy codes I need. native double precision is a big help. some i/o cray features are also essential, along with some libraries

[...] the IMSL & NCAR libraries are the main thing.

Great cluster! Always seems to be space and runs efficiently.

Mr. David Turner and others in the USERS GROUP are most helpful and sympathetic to the needs of PVP users and I personally wish to thank them for their excellent support and cooperation which has made NERSC the most user-friendly supercomputing facility for superior scientific research in areas related to the mission of the DOE, USA.

I am pretty uninformed. the PVP Cluster is like a black box into which I drop problems. I am quite satisfied with the way it provides results. I actually am quite satisfied with the batch job wait time. Except for the time before the end of the fiscal year. With 400-500 jobs in the queue the wait can be pretty awesome.

Provide longer queues / fewer checkpoints:

Better real time limit.

Too many system checkpoints, which results in Gaussian failures.

Improve service for big memory jobs:

My annual mantra BIG MEMORY jobs. Now, more than ever, these are the codes the PVP cluster should be targeted for.

I run mostly highly vectorized large memory production runs (230 or 450MW). According to the CPU time limit, the full time evolution for one model requires a sequence ~10-20 single jobs, each depending on the results of the previous one. However, the batch job wait time for large memory jobs is highly unpredictable. If there is not accidentally a job of the same size that quits in the right moment, the job appears to be held in most cases for more than a week, while later submitted smaller jobs continuously refill the machines. cqstatl -f gives detailed information about submitted jobs. Sometimes, it does not list long-pending jobs anymore that are listed in cqstatl -a.

Improve turnaround time:

The batch queue waiting time is intolerably long - I only use this machine as a last resort, and it's a pleasant surprise when anything finishes.

I actually am quite satisfied with the batch job wait time. Except for the time before the end of the fiscal year. With 400-500 jobs in the queue the wait can be pretty awesome.

More inodes / more disk:

[...] File system inode quotas are ridiculous. More disk space is also needed.

Disk space and inode availability is becoming a significant headache here, almost preventing useful work.

Don't use much:

only used for testing

Comments on NERSC's HPSS Storage System: 29 responses

Good system:

extremely good system. Much faster than SDSC's version (I don't know why).

things were great until recent problems using hsi from seaborg. I like the unix like interface on hsi. have not used pftp

Great. I hope hsi gets fixed soon

This is the main system that I access so that I can retrieve the NCEP Reanalysis II weather products. I then run some scripts and programs on killeen to cut the files down to the variables I need. Finally I ftp the data to our Sun. Overall I'm happy with the response!

Apart from occasional times when it goes down, it is usually excellent.

The system is excellent

easy and reliable.

It is a superb storage system, and is managed by exceptionally qualified professionals. Congratulations!

PCMDI is a large user of the HPSS to distribute climate data to a wider community. Performance has been excellent. I am especially pleased with HSI.

Great connection - super fast AND the ftp back 'home' speed is speed racer.

Hard to use / software problems:

HSI is somewhat awkward, but does the job.

We have found that LINUX ftp does not generate the information that HPSS needs to properly archive the data... but only after we stored ~5 TBytes of data. We have trouble accessing our data. [...]

I am storing and retrieving larger and larger files as MPP hardware evolves and this is not becoming easier.

I could use a tutorial about which user interface to use in various circumstances.

Authentication / password issues:

It's annoying that this uses a different password to seaborg, mcurie ......

I don't understand why I have to use a separate login password to get to hsi. Other computer centers I work at don't seem to require this.

I'd be more satisfied if I reliably remembered my password, or if it were automounted.

Need expanded functionality:

I would like to see hsi available on linux arch.

Need high performance interface from outside NERSC and LBNL, ie ESnet sites! Need support for Globus Grid tools and authentication!

[...] These days ftp on linux is a security risk so more and more systems do not run the server... In a year or two, we need to convert to a secure file transfer system.

Don't like the down times:

I don't like the Tuesday 10-12 a.m. downtimes. It's generally just when I've gotten settled in and started to work for the day. An hour at lunchtime would be better.

Having the weekly downtime in the middle of a work day, although understandable from a staffing perspective, can be annoying. If most of NERSC's users are based in the US perhaps having it in the late afternoon on the West Coast would affect fewer users.

It always seems like I need to access data on tuesdays when the storage system is down. Is there any way that the weekly maintenance on HPSS could be moved to the evening?

Don't like the SRU accounting:

The SRU system, which includes transfer charges, appears redundant for projects with small IO requirements [say 100GB]

Performance improvements:

maybe should be faster

Don't use / don't need:

I don't know what this is.

This is not a major concern for us.

We are not production users, so we do not have huge data sets.

We don't need it

I have tended to not use HPSS with hsi, pftp or ftp. Not because of any problem with these interfaces. I just have not informed myself or felt the need for them.

Comments about NERSC's auxiliary servers: 5 responses

Slowness of network connections still limits the usefulness of these servers (especially Escher) to remote users. Those who can still rely on local workstations for most of this type of work.

i use escher mainly for access to software like IDL

the application "mediaconvert" takes over 5 minutes to load and it is painfully slow to use. Could this be remedied?

The IDL program that I use to view results from simulations run on the IBM SP works poorly at NERSC due to the older version available there (5.3 vs the current 5.4). Keeping software current would be helpful.
