Subject: ADA Comments
From: Christopher Lester
Date: Thu, 5 Feb 2004 17:15:54 +0000 (GMT)
To: dladams@bnl.gov

Hello David,

I'm a physicist working mainly in the area of SUSY searches at ATLAS. Karl "Ganga" Harrison suggested I contact you to give you some potential "atlas user" comments about the kinds of things we users require. There was a suggestion of presenting something to you at the meeting in early March, but I don't want to make too many trips to CERN, and suspect it would be easier to mention things in a mail like this.

Basically, let me begin by saying I've read your January 21st ADA talk at Karl's suggestion, and can make some user comments about what I would want to help distribute MY atlas analyses. I make no claim to be "a typical atlas user" (indeed I may be very atypical), nor do I claim that I am nevertheless the type of user that ADA needs to cater for. But still ... the kind of thing that I, and the people in my group who do similar work, often need is fairly basic ... the ability to launch batch jobs. We need:

1) Occasional access to large bursts of CPU for a relatively short time (say 200 CPUs * 2 days), occurring at a frequency of once every few months - i.e. as often as we have useful ideas.

2) Minimal software infrastructure (i.e. we don't usually assume that the far end will have any user-space software beyond libstdc++.so, and sometimes not even that). We would never assume the presence of the whole of the atlas offline software.

3) A common (not necessarily real-time) filesystem which all jobs write into without doing anything special on their part. In other words, we (being greedy and lazy) strongly dislike having to use one set of scripts to launch our jobs, another to check whether they have finished yet, and a third to grab their output (whether from a central server or from the individual worker nodes) and return it to a central base.
We (the greedy users) want something that looks to a great extent very much like the lsf batch system at CERN or the csf batch system at RAL or any batch system you find anywhere, in which all jobs can write to a directory that is visible from the place of submission.

4) We want to (say) be able to create a binary executable. We don't mind having to form some kind of job description file which says something like "I need N CPUs with redhat 8.x or 9.x and 512 MB RAM for at least 2 hours apiece", or whatever is required for this executable to work. But once we've made this job description, we want -- to all intents and purposes -- to be able to ssh-or-cd to some filesystem where we can copy this executable, check that it works by maybe running it for 10 seconds interactively, and then get something to submit it to loads of machines, all of which are under the impression that they're just writing their output to some directory (or directories) on the common filesystem that (for example) we just put the executable on.

5) We don't need the filesystem to be real-time (at worst, the jobs could have their output copied back only at the end of the job), although the more frequent the updates the better, so that progress may easily be monitored. We prefer "cd", "ls" and "tail -f" to specialist job monitoring tools.

6) We don't need the filesystem to be writable or visible in all directions. It doesn't matter very much if job 6 can't see the output of job 9, or if I on the "job submission node" am unable to write to the directory that job 11 is writing to, but I DO DEFINITELY want to be able to see the output that the jobs themselves are producing -- if only slowly or with a large latency.

7) Just because a common filesystem like the above is desired, we do not need ALL files to be accessible that way.
Expediency may well motivate the existence of (e.g.) "/scratch"-type local filesystems not exported to central nodes at all, BUT IF THEY EXIST, then for ease of testing and development, similar scratch spaces should be identically mounted on the front-end nodes so that there are "no surprises" when the job runs on the worker. The environment on the nodes should be minimally different from the environments in which testing and development are done.

----------

Obviously you can see that I am not proposing anything more than what is largely or entirely available at any cpu farm you care to mention. But what I hope might come out of GRID activities and the like (and maybe ADA would/should/might play a part) is a widening of the resources that I have access to, so that I may either run my jobs for a little bit longer, or wait a little less long before my jobs run. The kind of thing that would interest me, to help me distribute my ATLAS analyses, is a tool that looks basically like a standard job queueing system, but one with larger resources than I can presently get my hands on.

You can see that my emphasis is mainly on ease of use for myself, which (to my mind) means making the system look as much like my desktop machine, where I do most of my development, as is reasonably possible. This is a feature shared by the CERN and RAL batch systems which I like.

----------

Well, anyway, there it is. I hope it doesn't sound like a rant (it isn't), and I hope you don't take it as a criticism of ADA or anything like that. The primary intention is just to give you an idea of what SOME atlas users are looking for. I'm sure you will find plenty of others who can't live without their fix of IPatRec, XKalMan, cmt, meta-data catalogues, and logical file names, but I just wanted to give you some feedback from one of the people who almost never needs these high-powered things. If you have any questions, I will do my best to answer them on the time scale of a week.
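P.S. In case it helps, the job-description idea in point 4 could be sketched roughly as below. The syntax is entirely hypothetical (not a real ADA, lsf, or csf format); it just records the sort of requirements a job of mine would state:

```
# Hypothetical job description -- illustrative sketch only, not a real format.
executable  = ./my_susy_scan       # binary already copied onto the common filesystem
cpus        = 200                  # occasional large burst (cf. point 1)
os          = redhat 8.x or 9.x
memory      = 512 MB               # per job
min_runtime = 2 hours              # per CPU, apiece
output_dir  = ./results/           # a directory on the common filesystem (cf. point 3)
```

Having written something like that once, I'd want to test the executable interactively for 10 seconds, submit, and then just "tail -f results/*" from the submission node.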
Yours,

Chris

------------------------------------------------------
Christopher Lester                Tel: +44 1223 337231
Cavendish Lab., Madingley Road, Cambridge, CB3 0HE, UK