- Good machine:
-
Excellent platform for efficient parallel computing. Among the best managed
supercomputers, if not the best, on which we have pursued our work!
Excellent support. We've gotten some custom mods to the system for our use, which
have been very helpful.
Consultants are always available and helpful. Excellent collaboration.
A truly great machine. Extremely well run. ... Worldwide
the BEST machine my group uses.
It is a very good machine. But too many people are using it.
Very good machine as set up; my research relies heavily on it.
This has been a very productive machine for us and the support of our efforts
has been excellent.
Always has been up when I have wanted to test my code. I like that I can get
jobs in for debugging purposes easily.
I've been incredibly happy with the SP. Batch queue turnaround times are very
quick, and one can usually get away with the low queue on weekends. We've
investigated efficiency quite extensively and found that we can run on 2-4 nodes
fairly effectively, and have run on up to 16 nodes (16 processors per node, in
all cases). We are rewriting our code to effectively use 64+ processors and then
we will see if we are able to get our jobs through in a timely manner. So far,
using one node, we have been happy.
I am very happy with using the IBM SP
... The system had fantastic uptime and I got a lot of jobs done. The
system was very stable and had high production quality, unlike some other
systems, in particular clusters.
The maximum number of processors I used on seaborg is fairly low, since I did
mostly parameter studies of
smaller runs and other members of the group did large simulations. The code has
been running on more
than 1500 processors already.
I think the machine is great. I plan to do more performance analysis in the
future.
I think seaborg provides a very friendly user interface.
Great machine - many processors and sufficient memory for each.
Everything is good ...
The best experience I have had with any system! ...
It works very well. ...
Great machine. ...
The system is excellent. ...
Perfect for my needs (testing scalability of numerical algorithms).
A very efficient system with well-provided service.
- Queue issues:
-
... Also, it would be great to have a slightly faster queue time for
regular_long jobs (current wait is about 3 days).
... though I did put in a request under the Big Splash initiative to get
another regular_long
job to go through (2 at a time) and it hasn't been carried out yet.
My code parallelizes well for many processors, but I have only used up to 80
processors in order to decrease the waiting time.
... (1) one really long queue would be handy (*) ...
A 48-hr queue would be desirable for some large jobs.
Job running is great, but the walltime hard limit is too "hard". I do not know if
there are techniques to flush the memory data to disk when jobs are killed. That's
very important for my time-consuming project....
The 8 hour limit on the regular queue is too short.
Queue waits for jobs larger than 32 procs have been horrible (up to 7 days for a
128 processor job). ...
The queues have gotten really bad, especially if you have something on low
priority. ...
Allocation of processors in chunks smaller than 16 would be useful. More and
longer regular_long time should be allocated.
I was very impressed by the short times I spent on the queue, but the short maximum
run-time limits really limit the applicability of the SP for my systems of interest.
Checkpoint restart is needed if the system goes down. Longer time limits are needed.
I have been trying to use Gaussian 98, which typically runs for 4 days, so the 24 hr
limit is not enough.
My jobs run most efficiently on 32 processors (or 16) over several days rather
than short periods of time on a large number of processors. When the job is
interrupted data is lost, so when I restart I lose information. It would be most
efficient if I could access the nodes for extended periods with a low number of CPUs.
It would be nice to have a queue for serial jobs, where they could share a node
without being charged for
the entire node.
... It is a handicap to be charged for all 16 processors even if you use only 1.
- Needs more resources / too slow:
-
The individual processors are really slow - slow compared to every other chip I
use, Athlon, P4, etc. This
machine is a real dog.
The system is excellent. However, I wish that NERSC had a much larger and more
powerful computer. The problems I would most like to attack require two to three
orders of magnitude more computing power than is currently available at NERSC.
(In responding to the last question, I indicated the maximum number of processors
my code can use per job on Seaborg. On a more powerful machine it could
effectively use thousands of processors.)
... Individual processor speed is relatively slow compared with other parallel
systems I use (mostly PC based
LINUX clusters with 2.2 GHz procs.) However, the stability of the system is
better than most computers that I have used.
Processor by processor, it is much slower than Intel P4.
... But the processors are getting slow -- my code runs
30% faster on a 1.4 GHz Athlon.
CPU performance is fine. Communication performance restricts the number of
nodes our codes can use
effectively. At the number of processors we use for production, the codes
typically spend 70% of their time
communicating and only 30% of their time calculating.
I have a code which uses a lot of FFTs and thus has a decent amount of global
communication. The only
problem that I have with the IBM SP is that communication between the nodes is too
slow. Thus, it takes
the same amount of time to run my code on the Cray T3E as it does to run on the IBM SP.
Try to get more memory
... For future improvements, please increase I/O speed, it's a
limiter in many of my jobs (or increase memory to 16GB/CPU, which is obviously
too expensive). ...
- Provide more interactive services:
-
Interactive jobs are effectively impossible to run, even small serial jobs will
not launch during the day.
Debugging code is currently VERY FRUSTRATING because of LACK OF INTERACTIVE
ACCESS. You can't run Totalview unless the job clears the LoadLeveler, which has
become dramatically more difficult in the last couple of months.
I find it very difficult to run interactively and debug. There seems to be only
a few hours per day when I can
get any interactive work done.
I wish it was easier to get a node when running interactively. I realize that
most of the nodes are running
batch jobs, but it might make sense to allocate more nodes for interactive use.
... (4) sometimes the interactive jobs are rejected... why? can
some rationale be given to the user other than the blanket error message? (*)
...
A few more processors for interactive use would be helpful.
Need to run (debug) interactive jobs on > 16 processors. ...
- Hard to use / would like additional features:
-
I just don't like IBM.
Need a mechanism to inquire about the remaining wall clock time for a job. When jobs
are terminated by the system for exceeding the wall clock time, a signal should be
sent to the job with some time remaining so it can prepare.
... I would like the default shell to be C shell.
I need to understand POE better in order to optimize my code better. ...
... Nice machine overall,
but I miss some of the software that was on the T3E (Cray Totalview and the
superior f90 compiler).
Great machine. It would be nice to have (not in order of importance, * means more
important than the rest) -
(1) one really long queue would be handy (*) (2) a program that tries to estimate
when a queued job will run (very hard, I realize, but still useful) (3) when
submitting a job, llsubmit tells you how many units it will use (max) so you know
what you're getting into (4) sometimes the interactive jobs are rejected... why? can
some rationale be given to the user other than the blanket error message? (*) (5)
PLEASE REFORM THE ACCOUNTING SYSTEM OF MPP HOURS WITH ALL THE CRAZY FACTORS
(2.5 is it or not?) (**)
- Stability issues:
-
We have had some problems with jobs with high I/O demands crashing; I don't
know if the stability of
these operations could be improved or not. ...
When can we get the new version of the operating system installed in order to run
MPI 32 without hanging the code on a large number of processors?
- Disk issues:
-
... The big drawback is that there is no backup for the user home directory.
The number of inodes per user was too small. ...
- Need cluster computing at NERSC:
-
My main complaint is that this 2000+ processor supercomputer is being used in a
broad user-based time-share environment as 10-20 clusters. The fractional use with
> 512 (or even > 256) processors is too small. We are paying dearly for unused
connectivity and the system is not being used properly. For the same hardware price,
NERSC could be a "cluster center" offering 3-5x more FLOPs per year. The users need
a computer center with more capacity (not capability). If we had a "cluster center",
a machine like seaborg could be freed up for its best use ... codes that need and
can use > 1024 (or 2048) processors. The expansion ratio (turn-around time / actual
run time) has much improved this year (generally below 2, and it used to be often
> 10); but next year seaborg is going to be overloaded again.
- Other:
-
Max processors depends on the job.
[comment about the survey question]
I would welcome examples on how to improve application performance on the SP with
increasing numbers of processors.
- Just starting / don't use:
-
Usage limited so far, not much to say.
Have not used it yet.
We are still working on putting MPI into our model. Once we have that completed we
can use 32, maybe 64, processors to do our model runs.
- Good system:
-
runs great. ...
Generally excellent support and service. What issues do arise are dealt with
promptly and professionally. Keep up the good work.
beautiful! but sometimes misused/stalled by infinitely stupid users.
- Queue and priority issues:
-
*** Obscure priority setting within STAR - Not clear why individual people get
higher priority. - Not clear what is going on in starofl. Embedding always gets top
priority over any kind of analysis done by an individual user. + Intervention of LBL
staff in setting embedding/simulation priority should be stopped. ...
NERSC response:
Rules for calculating dynamic priorities are explained in the STAR tutorial
(05/03/02). They are based on the user's group share, the number of jobs the user
currently has in execution, and the total time used by the running jobs. Shares for
all the STAR users (with the exception of starofl and kaneta) are equal (see
bhpart -r). starofl is designated to run production that is used by the whole
experiment, and kaneta does DST filtering. The justification is that no analysis
could be completed without embedding, and many users run on pico-DSTs produced by
Masashi. STAR users should direct their comments regarding share settings within
STAR to its computing leader (Jerome Laurent - jeromel@bnl.gov). NERSC staff does
not set policies on the subdivision of shares within the STAR group.
The short queue is 1 hour, the medium queue is 24 hours, and the long queue is 120
hours. There's only a factor of five difference between the long and medium queues,
while there's a factor of 24 between the short and medium queues. An intermediate
queue of 2 or 3 hours would be useful, as it is short enough to be completed in a
day but can encompass jobs that take a bit more than an hour.
NERSC response:
To answer that question we have to ask another one: how and under what circumstances
would users benefit from the introduction of this additional queue? The guess is that
the user hopes for a shorter waiting time if his or her job were submitted to such a
queue.
The PDSF cluster works under fair share settings on the cluster level. This model
allows groups of varying size and "wealth" to share the facility while minimizing
the amount of unhappiness among the users. In this model each user has a dynamic
priority assigned based on the user's group share, the subdivision of shares within
a group (decided by the groups), the number of jobs the user currently has executing,
and the total time used by the user's running jobs. Jobs go into execution based on
that dynamic priority, and only if two users have identical dynamic priority is the
queue priority taken into account. So the queue priority is of secondary importance
in this model unless a pool of nodes is dedicated to run a given queue.
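To make the fair share model above more concrete, here is a minimal sketch in Python
of how such a dynamic priority could be computed. The function name, weights, and
exact formula are illustrative assumptions for this write-up, not the actual LSF
implementation used on PDSF:

    # Illustrative only: the real LSF fair share formula and its weights are
    # not reproduced here; the coefficients below are assumptions.
    def dynamic_priority(group_share, user_share_in_group,
                         running_jobs, run_time_used_hours,
                         w_jobs=1.0, w_time=0.1):
        # Priority starts from the group's share and the user's slice of it,
        # and decreases as the user accumulates running jobs and run time.
        base = group_share * user_share_in_group
        penalty = 1.0 + w_jobs * running_jobs + w_time * run_time_used_hours
        return base / penalty

    # Jobs are dispatched in order of decreasing dynamic priority; the queue
    # priority only breaks ties between users with equal dynamic priority.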
We use the queue length to manage the frequency with which job slots open. The short
queue runs exclusively on 30 CPU's (as well as on all the other CPU's, shared with
medium and low). This means that on average a slot for a short job opens every 2
minutes. These settings provide for a reasonable debugging frequency at the expense
of those 30 nodes being idle when there are no short jobs.
We created the medium queue based on an analysis of average job length and in order
to provide a reasonable waiting time for the "High Bandwidth" nodes, which only run
the short and the medium queues. We have 84 such CPU's. So on average (if those nodes
were running only medium and no short jobs) a slot opens every 15 minutes. In
practice a fair number of those nodes run short jobs too, so the frequency is even
better. But then again, in the absence of short and medium jobs, those nodes sit idle
even if we have a long "long" queue.
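As a rough back-of-the-envelope illustration of the slot frequencies quoted above,
here is a sketch that assumes every job runs to its full queue limit, so it gives an
upper bound on the average interval between slot openings:

    # Idealized estimate of the average time between job-slot openings on a
    # pool of CPUs that runs only jobs of a given maximum length.
    def avg_slot_interval_minutes(queue_limit_hours, dedicated_cpus):
        return queue_limit_hours * 60.0 / dedicated_cpus

    print(avg_slot_interval_minutes(1, 30))   # short queue (1 hr, 30 CPUs): 2.0 min
    print(avg_slot_interval_minutes(24, 84))  # medium (24 hr, 84 CPUs): ~17 min upper
                                              # bound; shorter in practice, since not
                                              # every job runs to the full limit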
Introducing one more queue would have a real effect only if we allocated a group of
nodes that would run that semi-medium or short queue exclusively. That would only
further increase resource fragmentation and encourage users to game the system by
subdividing jobs, which only increases the LSF overhead and wastes resources.
We closely monitor the cluster load and job length, and if a need shows up we will
adjust queue lengths and node assignments, but we do not plan on adding more queues.
- Disk issues:
-
... Not clear how the disk space is managed ***
NERSC response:
Disk vaults are assigned to experiments based on their financial contribution.
STAR subdivided their disk space between the physics working groups and Doug Olson
(DLOlson@lbl.gov) should be contacted for more details. A list of disk vaults
with their current assignments is available at:
http://pdsf.nersc.gov/hardware/machines/pdsfdv.html.
PDSF staff (by request from the experiments) does not interfere with how the disk
space is subdivided between the users but if experiments wish we can run a
cleanup script (details set by the requester). Currently this is
in place on pdsfdv15.
Data vaults are not practical to use => for I/O requirements they are just a patch,
not a true solution.
NERSC response:
Indeed, disk vaults have poor performance while serving multiple clients. Very
recently we converted them from software to hardware RAID, which improved their
bandwidth. We also brought in an alternative solution for testing, the so-called
"beta" system. PDSF users loved the performance (instead of a couple of tens, it
could serve a couple of hundred clients without loss in performance), but such
systems are much more expensive (currently by a factor of 4 at least), and in the
end the experiments decided to go with the cheaper hardware RAID solution. We are
watching the market all the time and bring in various solutions for testing (like
"beta" and "Blue Arc"), and if anything the experiments can afford comes by, we
will purchase it.
*** High bandwidth node usage limited to very specific tasks.
NERSC response:
High Bandwidth nodes (compute nodes with a large local disk - 280GB in addition to
the 10GB /scratch) are all in the medium queue, and access is governed by the same
set of rules as for any other compute node in the medium queue. The only restriction
is the allocation of the 280GB disk, where only the starofl account has write
privileges. That is necessitated by the type of analysis STAR does; the experiment
financed the purchase of this disk space. If you are from STAR and do not agree with
this policy, please contact Doug Olson (DLOlson@lbl.gov).
Need more disk space on AFS. AFS should be more reliable. I find AFS very reliable
from my laptop; it is much less so from PDSF.
NERSC response:
PDSF AFS problems result from afs cache corruptions. It is much easier to preserve
cache integrity if there is only one user (as on a laptop) than tens of users (as on
PDSF). Heavy system use exposes obscure bugs not touched upon during single user
access. To improve the afs performance on PDSF we are moving away from the knfs
gateway model for the interactive nodes. The afs software for Linux has matured
enough that we just recently (the week of 10/14/02) installed local afs clients on
the individual pdsfint nodes. This should greatly reduce afs problems and boost its
performance.
- Would like new functionality:
-
... however, using the new Intel Fortran compiler might be useful, since it
increases speed by a factor of 2 (or more), at least on Intel CPUs.
NERSC response:
The two largest PDSF user groups require the pgf compiler, so we can look at an
Intel Fortran license as an addition and not a replacement. Additionally, Intel
Fortran does not work with the Totalview debugger, currently the only decent option
for debugging jobs that are a mixture of Fortran and C++ on Linux. Also, Intel
licenses are pricey, but we will check what kind of user base there is for this
compiler and see whether this is something we can afford.
- Needs more resources:
-
Buy more hardware! When STAR is running a large number of jobs (which is almost
all the time!) it's a pain for other users.
NERSC response:
PDSF is a form of cooperative; NERSC helps to finance it (~15%). All the groups get
access that is proportional to their financial contribution. These "shares" can be
checked by issuing a bhpart command on any of the interactive nodes. STAR is the
most significant contributor, thus it gets a high share. However, the system is not
quite full all the time - please check our record for the past year at:
http://www-pdsf.nersc.gov/stats/showgraph.shtml?merged-grpadmin.gif .
We did purchase 60 compute nodes (120 CPUs) recently and we are introducing them
into production right now (a step up on the magenta line in
http://www-pdsf.nersc.gov/stats/showgraph.shtml?lsfstats.gif ).
It also helps to look at this issue in a different way. In times of high demand
everybody is getting what they paid for (their share), and when STAR is not running,
other groups can use the resource "for free".
- Don't use:
-
What is it?
NERSC response:
PDSF is a networked distributed computing environment used to meet the detector
simulation and data analysis requirements of large scale High Energy Physics (HEP)
and Nuclear Science (NS) investigations. For updated information about the facility,
check out the PDSF Home Page.
Comments on NERSC's HPSS Storage System: 31 responses
- Good system:
-
Generous space, quick network.
I really like hsi. ...
Gets the job done.
Flawless. Extremely useful.
Excellent performance for the most part, keep it up!
Not very useful to me at present, but it works just fine.
NERSC has one of the more stable HPSS systems, compared to LANL/ORNL/NCAR
I really love being able to use hsi; it is extremely user friendly.
This is a big improvement over the old system. I'm impressed with its speed.
Really nice and fast. ...
HPSS is terrific
Easy to use and very efficient. We went over our allotted HPSS allocation, but
thank you for being flexible
about this. We plan to use this storage for all our simulations for years to come.
Everything is great, ...
It works well.
After the PVP cluster disappears the HPSS will still be my primary storage and
archive resource.
very useful system with high reliability
- Don't like the down times / downs need to be handled more gracefully:
-
... Downtime is at an inconvenient time of day, at least in EST.
The weekly downtime on Tuesdays from 9-noon PST is an annoyance as it occurs during
the workday for
users throughout the US. It would seem to make more sense to schedule it for a time
that takes advantage
of time differences --- e.g., afternoon on the west coast --- to minimize the
disruptions to users.
... except that it goes down for 3 hours right in the middle of my Eastern Time day
every Tuesday. How annoying.
It is unfortunate HPSS requires weekly maintenance while other systems are up. This
comment is not
specific to NERSC.
I don't like the downtimes during working hours.
The storage system does not always respond. This is fatal when the data for batch
jobs is too large to fit on
the local disk. I had several of my batch jobs hang while trying to copy the data
from the HPSS system to
temporary disk space.
- Performance improvements needed:
-
cput for recursively moving directories needs to be improved both in speed and in
reliability for large
directories.
scp can be very slow for large files
Commands that do not require file access or transfer are pretty slow, e.g. listing
or moving files to a different directory.
A large file (~10 GB) is hard to get from the local desktop.
- Would like new functionality:
-
It would be very helpful if HPSS worked in the background, buffered by a huge hard
disk.
Would like to try SRB as a combined interface to HPSS and a metadata catalog.
... What would be nice is a command that updates entire directory structures by
comparing file times and writes the newer ones to disk (like a synchronization).
Currently, I use the cput command but it doesn't quite do this. Having such a
command would be a great help (maybe it can already be done with a fancy option
which I don't know about).
It would be nice to navigate in the HPSS file system directly from PDSF via NFS (of
course not to copy files
but to look at the directory structure). This is done at CERN.
- Hard to use:
-
too cumbersome to use effectively
- Authentication issues:
-
The new authentication system for HPSS seems to be incompatible with some Windows
secure shell software. Since the change was made I have not been able to connect
using my laptop. I am still trying to get this fixed with the help of the support
people here at LLNL, but no good news so far.
- Other:
-
When questions arise and NERSC is contacted for guidance the consultants always
come across as
condescending. Is this intentional and for what purpose?
- Don't use / don't need:
-
We are not using this at this time.
Comments about NERSC's auxiliary servers: 3 responses
-
Is there a quick way on Escher to convert a PowerPoint picture into a ps file
without using "xv"?
I have never used these servers, would I need a separate account on these?
Never used.