- Good machine:
-
Seaborg has been a terrific resource for my group. We couldn't get 90% of the
work done without it.
[...] Individual processor performance is quite good.
Generally it works quite well and I prefer it to the T3E (batch turnaround
seems faster)
Great machine! Perfect for my work that requires computing power
The best machine ever, no comparison to anything else that I can access.
This is a great machine! I think it's really powerful ...
Great Improvement over previous machine, now that increased memory per node is
available.
Have only tested small problems on seaborg; know from ASCI Blue Pacific that
this architecture is useful for domain decomposed PDEs.
Like it a lot save for a few modifications that would be self serving. I
realize that what I would like probably wouldn't be
so great for the rest of the community.
The number of nodes and processors available makes this facility ideal for our
purposes. [...]
The machine is great; the queues are fine; it is very speedy and responsive.
[...]
The machine is excellent. The turn around time is sometimes longer than I would
like. I hope that the computing power
available at NERSC will continue to grow rapidly.
Fun, [...]
- Provide more interactive services:
-
MORE/BETTER SUPPORT FOR INTERACTIVE JOBS!!! Debugging parallel programs on this
machine is a total pain
because you're lucky if you can get 2 interactive nodes, and even then you
have to wait forever to get them. [...]
Running parallel debugging (Totalview) and profiling (Vampyr) tools
interactively has not been working well for me.
Totalview usually complains about
not getting enough resources even for a small (4) number of processors.
I was trying to debug a program on the SP. It was an utterly frustrating and
tedious process. Either seaborg was down completely, or the loadleveler would
not respond, or the loadleveler would not let me run my demanding
one-node-one-CPU-five-seconds-of-CPU-time job, or the debugger would not start
because its license server was down. It took me days to complete a trivial
debugging job which would have been a matter of hours on a workstation.
My postdoc has been getting our application up and running and I have not yet
had a chance to do large runs myself.
For learning the system, it is always easier to try things interactively.
Certainly this was the case on T3E. It is so easy
to get batch scripts wrong.
Debugging environment is atrocious.
Would like to see more resource allocation for interactive use.
When I submit an interactive job with poe and there are no processors
available, it bounces back, and I have to
submit it again. I find it more convenient to have the job put in a "dormant"
status, and start once it gets enough
processors available. This is what happens in the cray machine with mpprun.
[...]
The interactive runtime limits are too short. Debugging large runs is painful.
[...]
Running interactively is almost impossible since there is no separate
interactive queue. Waiting around for an
interactive processor to come free --- or getting lucky --- is not a good use
of time. Using the debug queue is the
only alternative.
[...] Running interactive is
hard sometimes. There should be more processors allocated to running
interactive files. [...]
The 20 minute interactive single processor limit will kill my ability to do
in-situ data manipulation and analysis effectively.
Currently, running small jobs interactively to debug can be nearly impossible
because no nodes are available. Thus,
there is no way to debug a code using totalview. [...]
The fact that there is no interactive queue is sometimes a problem, especially
for totalview.
My only difficulty in running here was to try to get a serial run as a
benchmark for code performance
- Stability problems:
-
NERSC seems to take longer to bring up new hardware than other
centers I use, and often it starts out
flakier than it should be. [...]
Improved stability. [...]
[...] It is just down a lot of the time (or was in the
recent past). Is there a way to cut down on the down time?
[...] Seaborg is down a lot.
Gseaborg was never down that much. Other than that, I have no complaints
IBM switch doesn't really seem to be reliable
I am VERY discouraged with the amount of down time lately on SEABORG. This is a
very nice configuration which
does NOT live up to its potential because of the hardware problems. I hope it
improves. [...]
[...] uptime also needs work [...]
- Provide longer queues:
-
The 8 hour real time limit makes my project totally impossible. It is the worst
regulation imaginable.
Maximum of 8 hrs per job could be increased.
[...] Also, there are no queues with long
CPU time (more than 8 hours). Sometimes, they are useful.
a longer running queue (say, 24 hours maximum) with limited number of nodes on
seaborg may be useful.
a maximum cpu time greater than 8 hrs. would sometimes be helpful.
The clock limit on the SP of 8 hrs is set assuming, I guess, that all the users can
run parallel code across a large # of
processors. There might be cases like mine, where I have a serial code which,
when compiled with SMP options, can
run on 4-8 processors in the same node. But it takes lots of hours (~180 clock hrs),
a lot more than the limit set for SP users. [...]
- Hard to use / software problems:
-
[...] Improved documentation on queue status.
[...] The xlf compiler isn't very good, IMHO.
[...] Also, the ksh shell doesn't seem to work properly.
[...] Also, it will be of great help for people like me, who are not used to
parallelizing the code, to have a consulting person
who can work one on one with us to give us suggestions on how to go about doing
the parallelization for my problem
effectively. I know it may be too much to ask for, but I'm sure a lot more
people will come forward to use these
resources in a more efficient way.
I only just started using seaborg. I can only log in from mcurie -- not from my
home machine. [...]
- Don't like charging structure:
-
[...] I also strongly disagree with the charge factor
[...] The charging of 2.5X for
processor hours, however causes us to use up all of our time very quickly. We
could go through 100,000 hours easily
within weeks and we limit our jobs because of this. The wait in the queue is
satisfactory EXCEPT at the end of a fiscal
year as everyone used "premium" to get their jobs going. We had to use
"premium" because "regular" left us with
a six day wait time (compared with a 24 hour wait time a month earlier).
I don't like the new way the hours are charged. I don't always use 16, 32,
etc processors. [...]
[...] In addition, although I get charged 2.5 times as much to run on the
SP, for jobs with internode communication it really only gets slightly better
performance than the T3E. What a waste.
- Improve turnaround time:
-
[...] For a new machine (Seaborg) the queues seemed to become slow very
quickly.
[...] The turn around time is sometimes longer than I would like. [...]
[...] Also, it is again, like all other
NERSC machines, oversubscribed, so at first it is convenient to use, but then
it becomes so slow that turn around times go out the roof!
- Slow communications:
-
Fun, even with nasty latencies.
Limited by communications both on and off node. Not only does it need higher
bandwidth and lower latency and/or
truly asynchronous communications, but it also needs the ability to transfer
data directly between the L2 caches of
different CPUs (on- and off-node). [...]
[...] for jobs with internode communication it really only gets slightly better
performance than the T3E.
- More inodes:
-
[...] abolish inode quotas!!!!
- Remove small jobs:
-
It seems like a lot of people are not using it effectively. That is, it's a
world-class machine being used for a lot of
smallish jobs that would be more effective elsewhere. NERSC in general needs a
mid-class machine to take the load
off of the high-end machines. [...]
- Just starting / don't use:
-
I would like to try it out
still coming up on learning curve
Just started to utilize IBM SP3 at NERSC, so I am unable to make specific
comments.
I haven't extensively used the SP since it moved into phase II although I
anticipate this increasing the number of
processors that some of my codes can use.
Group members will answer detailed questions. My answers to top two questions
reflect the complaint level I hear.
- Good machine / useful features:
-
sad if has to go
Good communication bandwidth.
One can get 4 hours a day in on the 256 pe queue, which is really good for
production. The smaller queues are not very
effective, in that a 16 beowulf system running 24 hours a day on 1 GHz
processors gives me a factor of three improvement
in speed over the 64 pe queue running 4 hours a day and 1.5 improvement over the
128 pe queue. As this beowulf has 512
MB per pe, the 64 pe queue can still do a problem twice as large and the 128
pe queue one four times as large.
Fun. [...]
Have made good use of this resource for published data on parallel scaling of
domain decomposed PDE solvers in the past
(mostly through junior collaborators, not personally)
It's great, and is a worthy rival to the SP considering the charge factor and good
inter-processor comms.
The Cray compiler and Cray totalview debugger are fantastic for development.
The SP3 totalview is not even in the same
league. I will be extremely sad to see this machine go. The throughput is
currently much faster on this machine as well.
Getting old, but still good. [...]
It's been great so far, but I must admit I'm on the steep side of the
learning curve. It's easy to use and my jobs seem to go
quickly. Note that I'm constantly reevaluating how effective my code is, so
this answer may change in two days or two
months or never.
A very nice machine. Too bad it's obsolete, and I wish they built a
successor.
I am happy with the T3E.
- Provide longer queues:
-
Is the 4-hour CPU limit extendable?
The time in the queue is far too short for today's applications - this is, I know,
similar to other facilities. It means jobs must
often be stopped and started, and while most information can be stored, it is
somewhat frustrating that a complete calculation
must take many many submits.
The small amount of time allotted for the largest jobs limits what I can get
done on the T3E.
4 hours is far from enough
- Improve turnaround time / obsolete:
-
Beginning to show its age. Batch turnaround is often slow
that thing should go into a museum.
queue much too long
- Hard to use / better software / better docs:
-
debugging support is lousy - aren't there some decent debuggers for C code out
there that run on Unicos? [...]
Several I/O and particularly default variable issues (I*4 vs I*8) hindered
the porting of my code, which in fact never got
completely ported.
- More inodes / more disk:
-
[...] abolish inode quotas!!!!
Disk space and inode availability is becoming a significant headache here,
almost preventing useful work.
- Stability problems:
-
[...] uptime also needs work
[...] Down too often.
- Provide better interactive services:
-
Interactive response time is terrible.
- Remove small jobs:
-
[...] The smaller queues are not very
effective, in that a 16 beowulf system running 24 hours a day on 1 GHz
processors gives me a factor of three improvement
in speed over the 64 pe queue running 4 hours a day and 1.5 improvement over the
128 pe queue. [...]
- Just starting / don't use much:
-
only used for testing
still not into production running yet
I am not using the t3e very much anymore
- Other:
-
I used the T3E for benchmarking some numerical application. The batch queue
structure seems to make this benchmarking
difficult. Benchmarking may not be a major issue for applications that are run
in production mode.
Fun. I wish there were more thorough low-level docs available, but that's
difficult.
- Too slow / obsolete:
-
The J90/SV1 cluster has never provided the performance of the C90. It was
obsolete before it was purchased. A
replacement needs to be purchased. I have projects running on this system that
should have been finished 3 years ago. [...]
I find interactivity (e.g. compiling) on Killeen is 3-5X slower than on the MPP
platforms. I generally now avoid running on
Killeen if possible
It seems that it just isn't all that fast compared to my desktop Linux box;
the IMSL & NCAR libraries are the main thing.
Faster (clock time) to run PIC code on desktop machine. No support for Python
problems.
Not using; no real reason to use since desktop machines are now powerful and
cheap. Having it go away at the end of this FY will not be a big loss.
What is Killeen? Whether it is a T3E or a PVP Cluster, it seems to do what I
want, but I wish it were 10 times faster.
- Good machine / useful features:
-
They are essential to run some of the invaluable legacy codes I need. Native
double precision is a big help. Some Cray I/O
features are also essential, along with some libraries.
[...] the IMSL & NCAR libraries are the main thing.
Great cluster! Always seems to be space and runs efficiently.
Mr. David Turner and others in the USERS GROUP are most helpful and sympathetic
to the needs of PVP users and I
personally wish to thank them for their excellent support and cooperation which
has made NERSC the most user-friendly
supercomputing facility for superior scientific research in areas related to
the mission of the DOE, USA.
I am pretty uninformed. the PVP Cluster is like a black box into which I drop
problems. I am quite satisfied with the way it
provides results. I actually am quite satisfied with the batch job wait time.
Except for the time before the end of the fiscal year.
With 400-500 jobs in the queue the wait can be pretty awesome.
- Provide longer queues / fewer checkpoints:
-
The time in the queue is far too short for today's applications - this is, I know,
similar to other facilities. It means jobs must
often be stopped and started, and while most information can be stored, it is
somewhat frustrating that a complete calculation
must take many many submits.
Better real time limit.
Too many system checkpoints, which results in Gaussian failures.
- Improve service for big memory jobs:
-
My annual mantra BIG MEMORY jobs. Now, more than ever, these are the codes the
PVP cluster should be targeted for.
I run mostly highly vectorized large memory production runs (230 or 450MW).
According to the CPU time limit, the full time
evolution for one model requires a sequence ~10-20 single jobs, each depending
on the results of the previous one.
However, the batch job wait time for large memory jobs is highly unpredictable.
If there is not accidentally a job of the same
size that quits in the right moment, the job appears to be held in most cases
for more than a week, while later submitted
smaller jobs continuously refill the machines. cqstatl -f gives detailed
information about submitted jobs. Sometimes, it does
not list long-pending jobs anymore that are listed in cqstatl -a.
- Improve turnaround time:
-
The batch queue waiting time is intolerably long - I only use this machine as a
last resort, and it's a pleasant surprise when
anything finishes.
I actually am quite satisfied with the batch job wait time.
Except for the time before the end of the fiscal year.
With 400-500 jobs in the queue the wait can be pretty awesome.
- More inodes / more disk:
-
[...] File system inode quotas are ridiculous. More disk space is also needed.
Disk space and inode availability is becoming a significant headache here,
almost preventing useful work.
- Don't use much:
-
only used for testing
- Good system:
-
extremely good system. Much faster than SDSC's version (I don't know why).
Things were great until recent problems using hsi from seaborg. I like the
Unix-like interface on hsi. Have not used pftp.
Great. I hope hsi gets fixed soon
This is the main system that I access so that I can retrieve the NCEP
Reanalysis II weather products. I then run
some scripts and programs on killeen to cut the files down to the variables I
need. Finally I ftp the data to our Sun.
Overall I'm happy with the response!
Apart from occasional times when it goes down, it is usually excellent.
The system is excellent
easy and reliable.
It is a superb storage system, and is managed by exceptionally qualified
professionals. Congratulations!
PCMDI is a large user of the HPSS to distribute climate data to a wider
community. Performance has been excellent. I
am especially pleased with HSI.
Great connection - super fast AND the ftp back 'home' speed is speed racer.
- Hard to use / software problems:
-
HSI is somewhat awkward, but does the job.
We have found that LINUX ftp does not generate the information that HPSS needs
to properly archive the data... but
only after we stored ~5 TBytes of data. We have trouble accessing our data.
[...]
I am storing and retrieving larger and larger files as MPP hardware evolves and
this is not becoming easier.
I could use a tutorial about which user interface to use in various
circumstances.
- Authentication / password issues:
-
It's annoying that this uses a different password to seaborg, mcurie ......
I don't understand why I have to use a separate login password to get to hsi.
Other computer centers I work at don't seem to require this.
I'd be more satisfied if I reliably remembered my password, or if it were
automounted.
- Need expanded functionality:
-
I would like to see hsi available on linux arch.
Need high performance interface from outside NERSC and LBNL, ie ESnet sites!
Need support for Globus Grid tools and authentication!
[...] These days ftp on linux is a security risk
so more and more systems do not run the server... In a year or two, we need to
convert to a secure file transfer system.
- Don't like the down times:
-
I don't like the Tuesday 10-12 a.m. downtimes. It's generally just when I've
gotten settled in and started to work for
the day. An hour at lunchtime would be better.
Having the weekly downtime in the middle of a work day, although understandable
from a staffing perspective, can be
annoying. If most of NERSC's users are based in the US perhaps having it in
the late afternoon on the West Coast
would affect fewer users.
It always seems like I need to access data on Tuesdays when the storage system
is down. Is there any way that the
weekly maintenance on HPSS could be moved to the evening?
- Don't like the SRU accounting:
-
The SRU system, which includes transfer charges, appears redundant for projects
with small IO requirements [say
100GB]
- Performance improvements:
-
maybe should be faster
- Don't use / don't need:
-
I don't know what this is.
This is not a major concern for us.
We are not production users, so we do not have huge data sets.
We don't need it
I have tended to not use HPSS with hsi, pftp or ftp. Not because of any problem
with these interfaces. I just have not
informed myself or felt the need for them.