Subject: ADA Comments
From: Christopher Lester
Date: Thu, 5 Feb 2004 17:15:54 +0000 (GMT)
To: dladams@bnl.gov

Hello David,

I'm a physicist working mainly in the area of SUSY searches at ATLAS. Karl "Ganga" Harrison suggested I contact you to give you some potential "atlas user" comments about the kinds of things we users require. There was a suggestion of presenting something to you at the meeting in early March, but I don't want to make too many trips to CERN, and suspect it would be easier to mention things in a mail like this.

Basically, let me begin by saying I've read your January 21st ADA talk at Karl's suggestion, and can make some user comments about what I would want to help distribute MY atlas analyses. I make no claim to be "a typical atlas user" (indeed I may be very atypical), nor do I claim that I am nevertheless the type of user that ADA needs to cater for. But still ... the kind of thing that I, and the people in my group who do similar work, often need is fairly basic ... the ability to launch batch jobs. We need:

1) Occasional access to large bursts of CPU for a relatively short time (say 200 CPUs * 2 days), occurring at a frequency of once every few months - i.e. as often as we have useful ideas.

2) Minimal software infrastructure (i.e. we don't usually assume that the far end will have any user-space software beyond libstdc++.so, and sometimes not even that). We would never assume the presence of the whole of the atlas offline software.

3) A common (not necessarily real-time) filesystem which all jobs write into without doing anything special on their part. In other words, we (being greedy and lazy) strongly dislike having to use one set of scripts to launch our jobs, another to check whether they have finished yet, and a third to grab their output (whether from a central server or from the individual worker nodes) and return it to a central base.
We (the greedy users) want something that looks to a great extent very much like the lsf batch system at CERN or the csf batch system at RAL or any batch system you find anywhere, in which all jobs can write to a directory that is visible from the place of submission.

4) We want to (say) be able to create a binary executable. We don't mind having to form some kind of job description file which says something like "I need N CPUs with redhat 8.x or 9.x and 512 MB RAM for at least 2 hours apiece", or whatever is required for this executable to work. But once we've made this job description, we want -- to all intents and purposes -- to be able to ssh-or-cd to some filesystem where we can copy this executable, check that it works by maybe running it for 10 seconds interactively, and then get something to submit it to loads of machines, all of which are under the impression that they're just writing their output to some directory (or directories) on the common filesystem that (for example) we just put the executable on.

5) We don't need the filesystem to be real-time (at worst, the jobs could have their output copied back only at the end of the job), although the more frequent the updates the better, so that progress may easily be monitored. We prefer "cd", "ls" and "tail -f" to specialist job monitoring tools.

6) We don't need the filesystem to be writable or visible in all directions. It doesn't matter very much if job 6 can't see the output of job 9, or if I on the "job submission node" am unable to write to the directory that job 11 is writing to, but I DO DEFINITELY want to be able to see the output that the jobs themselves are producing -- if only slowly or with a large latency.

7) Just because a common filesystem like the above is desired, we do not need ALL files to be accessible that way.
Expediency may well motivate the existence of (e.g.) "/scratch"-type local filesystems not exported to central nodes at all, BUT IF THEY EXIST, then for ease of testing and development, similar scratch spaces should be identically mounted on the front-end nodes so that there are "no surprises" when the job runs on the worker. The environment on the nodes should be minimally different from the environments in which testing and development are done.

----------

Obviously you can see that I am not proposing anything more than what is largely or entirely available at any cpu farm you care to mention. But what I hope might come out of GRID activities and the like (and maybe ADA would/should/might play a part) is a widening of the resources that I have access to, so that I may either run my jobs for a little bit longer, or wait a little less long before my jobs run. The kind of thing that would interest me, to help me distribute my ATLAS analyses, is a tool that looks basically like a standard job queueing system, but one with larger resources than I can presently get my hands on.

You can see that my emphasis is mainly on ease of use for myself, which (to my mind) means making the system look as much like my desktop machine, where I do most of my development, as is reasonably possible. This is a feature shared by the CERN and RAL batch systems which I like.

----------

Well, anyway, there it is. I hope it doesn't sound like a rant (it isn't), and I hope you don't take it as a criticism of ADA or anything like that. The primary intention is just to give you an idea of what SOME atlas users are looking for. I'm sure you will find plenty of others who can't live without their fix of IPatRec, XKalMan, cmt, meta-data catalogues, and logical file names, but I just wanted to give you some feedback from one of the people who almost never needs these high-powered things. If you have any questions, I will do my best to answer them on the time scale of a week.
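P.S. In case it helps, the job-description idea in point 4 could be sketched roughly as below. The syntax is entirely hypothetical (not a real ADA, lsf, or csf format); it just records the sort of requirements a job of mine would state:

```
# Hypothetical job description -- illustrative sketch only, not a real format.
executable  = ./my_susy_scan       # binary already copied onto the common filesystem
cpus        = 200                  # occasional large burst (cf. point 1)
os          = redhat 8.x or 9.x
memory      = 512 MB               # per job
min_runtime = 2 hours              # per CPU, apiece
output_dir  = ./results/           # a directory on the common filesystem (cf. point 3)
```

Having written something like that once, I'd want to test the executable interactively for 10 seconds, submit, and then just "tail -f results/*" from the submission node.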
Yours,

Chris

------------------------------------------------------
Christopher Lester                Tel: +44 1223 337231
Cavendish Lab., Madingley Road, Cambridge, CB3 0HE, UK