Scheduler FAQ

Questions

  1. How do I submit a job?
  2. What is the quickest way to learn how to use the scheduler?
  3. I submitted a job, and it created lots of files!
  4. It's not working! My jobs! What do I do!!! ARGH!
  5. How do I re-submit a job that died?
  6. How do I pass MY job name?
  7. How do I control which library version I use in the scheduler?
  8. How do I use the file catalog query? What is the syntax to get these or those files?
  9. How do I use macros like doEvents.C or bfc.C with the scheduler?
  10. In which directory do I run?
  11. How do I tell the scheduler which queue to use?
  12. I am having weird problems (e.g. no output from the scheduler, no scripts created, no jobs submitted)
  13. How do I know the options of a particular keyword in the catalog query? How do I know which "filetype" values are available?
  14. How do I specify parameters while submitting a job?
  15. I am getting all sorts of illegal character errors. How can I use '&', '<' and '>'?
  16. How can I learn a little bit about XML?
  17. I have heard that I can use different policies... Where are they listed?
  18. How do I give my output file the same base name as my input file?
  19. How do I get more information about crashes from my log file?
  20. How can I resubmit jobs?
  21. How do I kill a request or a job?
  22. How can I find out if any of my job submissions failed?
  23. Why is the number of files selected smaller than the number of files returned by the file catalog?
  24. How can I switch from rootd to xrootd (eXtended rootd)?
  25. Why do I get (/bin/cp: no match) at the end of my job and no output files back?
  26. Is it guaranteed that one gets each file at most once (no duplicates)?
  27. How do I delete the large number of files the scheduler produced when rm fails?
  28. What does it mean when the scheduler tells me: The configuration requested a limitation of the number of INPUTFILE(n) environment variables at 200, the rest will be commented out?
  29. When using the STAR file catalog, when should the keyword "storage=xxx" be used?
  30. I'm running over the same files over and over again and the catalog is slow or down. Can I submit my jobs without re-querying the catalog?
  31. How do I better manage the thousands of files the scheduler writes?

1. How do I submit a job?

First you have to prepare your XML job description. When you have prepared your files, you can type:

star-submit jobDescription.xml

where jobDescription.xml is the name of your file.
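If you have never written a job description, here is a minimal sketch (the library version, catalog conditions, user name and paths are placeholders you must adapt; see the manual and the "cut and paste" examples for the full set of options):

<?xml version="1.0" encoding="utf-8" ?>
<job maxFilesPerProcess="10">
  <command>
    starver SL05f
    root4star -b -q doEvents.C\(9999,\"@$FILELIST\"\)
  </command>
  <stdout URL="file:/star/u/username/logs/$JOBID.log" />
  <input URL="catalog:star.bnl.gov?production=P05ic,filetype=daq_reco_MuDst,storage!=hpss" nFiles="100" />
  <output fromScratch="*.root" toURL="file:/star/u/username/output/" />
</job>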

2. What is the quickest way to learn how to use the scheduler?

You can use one of the "cut and paste" examples. Of course, you still have to change the command line and the input files. I am sorry I couldn't prepare the exact file for you... :-)

3. I submitted a job, and it created lots of files!

Yes, this is normal. For every process a script and a list of files are created. It's something that will be fixed in the final version. You can delete them easily, since they all start with script* and fileList. Remember, though, to delete them _AFTER_ the job has finished.

4. It's not working! My jobs! What do I do!!! ARGH!

Well, you shouldn't panic like that! You can send an e-mail to the scheduling mailing list, and somebody will help you.

5. How do I re-submit a job that died?

In the comment at the beginning of each script there is the bsub command used to submit the job. You can copy and paste it into the command line and execute it. Be sure you are in the same directory as the script.
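For example, a quick way to find that line (the script name here is hypothetical; use your own script's name):

grep bsub scriptMyJob_0.csh

Then paste the printed bsub command back into the shell, from that same directory.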

6. How do I pass MY job name?

You can use the name attribute of the job element, like this:

<job ... name="My name" ... >

7. How do I control which library version I use in the scheduler?

In the command section you can put more than one command. You can actually put a csh script. So you can write:

<command>
starver SL02i
root4star mymacro.C
</command>

8. How do I use the file catalog query? What is the syntax to get these or those files?

The file catalog is actually a separate tool from the scheduler. When you write a query, the get_file_list.pl command is used, so the full documentation for the query syntax is available in the file catalog manual. You will be mainly interested in the -cond parameter, which is the one you specify in the scheduler.
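For example, you can test a query at the command line before putting it in your job description (the conditions here are purely illustrative):

get_file_list.pl -keys 'path,filename' -cond 'production=P05ic,filetype=daq_reco_MuDst,storage!=hpss' -limit 10

The same comma-separated conditions then go after the question mark of the input URL in your job description:

<input URL="catalog:star.bnl.gov?production=P05ic,filetype=daq_reco_MuDst,storage!=hpss" nFiles="100" />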

9. How do I use macros like doEvents.C or bfc.C with the scheduler?

If you are asking this question, it's because you have been trying to submit something like:

<command>root4star -b -q doEvents.C\(9999,\"$FILELIST\"\)</command>

This won't work, because doEvents interprets $FILELIST as a single input file and not as a file list. But if you put @ before the filename, doEvents (and bfc.C, ...) will interpret it correctly. So you should have something like:

<command>root4star -b -q doEvents.C\(9999,\"@$FILELIST\"\)</command>

10. In which directory do I run?

Before version 1.8.6, the job starts in the default location for the particular batch system.

If you are using LSF, jobs will execute in the directory from which you submitted them, which is the same directory where the scripts are created, and also the same directory you should be in when resubmitting jobs.

In version 1.8.6 and above, the default directory in which the job starts is defined by the environment variable $SCRATCH. This will most likely be a directory local to the worker node, and the base directory path will be different at every site. The directory and all its contents are deleted as soon as the job finishes. For this reason, do not ever attempt to change this directory; all files that need to be saved must be copied back to some other directory.
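In other words, in 1.8.6 and above your command block already runs inside $SCRATCH, and anything you want to keep must be declared in an output element, along the lines of this sketch (the destination path is a placeholder):

<command>root4star -b -q mymacro.C</command>
<output fromScratch="*.root" toURL="file:/star/u/username/myresults/" />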

11. How do I tell the scheduler which queue to use?

You don't, but you can tell the scheduler how long your job is, so that it can choose the correct queue for you. Remember: the scheduler has to work at different sites, where queue names and lengths will be different. The scheduler knows that.

You can look at this example.
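As a sketch, assuming the filesPerHour job attribute described in the manual: you state how many input files one process works through per hour, and the scheduler derives the expected run time and picks a matching queue:

<job filesPerHour="20" maxFilesPerProcess="100" ... >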

12. I am having weird problems (e.g. no output from the scheduler, no scripts created, no jobs submitted)

Check your .cshrc file. First of all, as Jerome says:

***********************************************
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
***********************************************

Furthermore, you may want to have a look at this page.

13. How do I know the options of a particular keyword in the catalog query? How do I know which "filetype" values are available?

You can use get_file_list.pl to find out which values a particular keyword can have. For example:

[rcas6023] ~/> get_file_list.pl -keys filetype
daq_reco_dst
daq_reco_emcEvent
daq_reco_event
daq_reco_hist
daq_reco_MuDst
daq_reco_runco
daq_reco_tags
dummy_root
...

You can do the same for any keyword.

14. How do I specify parameters while submitting a job?

You might want to have a look at the star-submit-template command. Here is an example.
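Roughly, the idea is that the template file contains XML entities which you fill in on the command line at submission time (the entity names and values below are placeholders):

star-submit-template -template myTemplate.xml -entities library=SL05f,outdir=/star/u/username/output

where myTemplate.xml might contain:

<command>
starver &library;
root4star -b -q mymacro.C
</command>
<output fromScratch="*.root" toURL="file:&outdir;/" />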

15. I am getting all sorts of illegal character errors. How can I use '&', '<' and '>'?

This is an XML problem: '<', '&' and '>' are reserved in XML, so you can't use them directly, but you can use escape sequences in their place. Use the following table for the conversion:

< &lt;
> &gt;
& &amp;

So, for example, if you need to put the following in a query:

runnumber<>2280003&&2280004

you have to write it as:

runnumber&lt;&gt;2280003&amp;&amp;2280004

Yes, it doesn't look so nice... :-( There is unfortunately not much I can do...

16. How can I learn a little bit about XML?

I suggest you have a look at this site: it has a lot of tutorials about XML, HTML and other web technologies. It's at an entry level and it has a lot of examples.

17. I have heard that I can use different policies... Where are they listed?

You can find a list of the installed policies here.

18. How do I give my output file the same base name as my input file?

Use $FILEBASENAME as the stdout file name. This only works in version 1.6.2 and up, and only if there is just one output file or one file per process.

Consult the manual if you need an example.
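As a minimal sketch (the path is a placeholder), the stdout file then takes the base name of the process's single input file:

<stdout URL="file:/star/u/username/logs/$FILEBASENAME.log" />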

19. How do I get more information about crashes from my log file?

Emphasis is always placed on generating detailed and meaningful user feedback at the prompt when errors occur. However, more information can be found in the user's log file. If the scheduler did not crash altogether, it appends to the user's log file, where the most detailed record of its internal workings can be found.

Every user who has ever used the scheduler, even just once, has a log file. It holds information about what the scheduler did with your job. To find out more about reading your log file, click here.

20. How can I resubmit jobs?

Note: This only applies to version 1.7.5 and above, as resubmission was first built in at this version. If you submitted via an older scheduler version, the resubmit syntax will not work and you will not have a session file.

Note: When a job is resubmitted, the file query is not redone; the scheduler uses the exact same files as the original job. Be careful when resubmitting old jobs, as the path to the files may have changed.

When a request is submitted, the scheduler produces a file named [jobID].session.xml. This is a semi-human-readable file that contains everything you need to regenerate the .csh and .list files and to resubmit all or part of your job.

If you wish to resubmit the whole job, the command is (replace [jobID] with your job Id):  star-submit -r all [jobID].session.xml

Example: star-submit -r all 08077748F46A7168F5AF92EC3A6E560A.session.xml

If you wish to resubmit a particular job number, the command is (where n is the job number): star-submit -r n [jobID].session.xml

To resubmit all of the failed jobs, the command is: star-submit -f all [jobID].session.xml

Type star-submit -h for more options and help. There are many more options available; this is only a very short overview of the resubmission options.

21. How do I kill a request or a job?

Note: This only applies to version 1.7.5 and above, as this capability was first built in at this version. If you submitted via an older scheduler version, this syntax will not work and you will not have a session file. It is also recommended that you read question 20 (How can I resubmit jobs?), as there is more information about the session file there.

The command to kill all the jobs in a submission is (replace [jobID] with your job Id):  star-submit -k all [jobID].session.xml

If you wish to kill only part of the jobs, replace the word all with a single job number (for job 08077748F46A7168F5AF92EC3A6E560A_4 the number is 4). A comma-delimited list may also be used (example: star-submit -k 5,6,10,22 [jobID].session.xml), or a range (example: star-submit -k 4-23 [jobID].session.xml).

22. How can I find out if any of my job submissions failed?

Note: This only refers to version 1.7.5 and above.

This information is stored in the [jobID].report file in a nice neat table (I recommend you turn off word wrap to view this file). We are trying to put more and more information for users in this file with every new version. The file also stores information about queue selection, so it will probably answer questions such as "Why did my job have to go into the long queue as opposed to the short queue?".

23. Why is the number of files selected smaller than the number of files returned by the file catalog?

This is because some files may not be accessible, such as files on HPSS when not using xrootd. Duplicate files are also dropped, so two or more copies of the same file returned at more than one location are not counted twice.

24. How can I switch from rootd to xrootd (eXtended rootd)?

Switching between these two systems is done by specifying the requested system in the fileListSyntax attribute of the job element. See the job element section and example.


Note: In the STAR framework (root4star), xrootd syntax is supported for library versions SL05f and above. Please be aware of this restriction, as SUMS will generate and submit the jobs, but they will not succeed at runtime.
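For example, switching is just a matter of the attribute value (other attributes omitted):

<job fileListSyntax="xrootd" ... >

and to go back to rootd you would use fileListSyntax="rootd".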

25. Why do I get (/bin/cp: no match) at the end of my job and no output files back?

In a non-grid context, when you ask for data to be moved back from $SCRATCH using a tag like this one:
<output fromScratch="*.root" toURL="file:/star/u/lbhajdu/temp/" />

it is translated into a cp command like the one below:

/bin/cp -r $SCRATCH/*.root /star/u/lbhajdu/temp/

If the cp command returns "/bin/cp: no match", it means the wildcard did not match anything, because your macro generated no files for it to copy. You can verify this by adding an "ls $SCRATCH/*" to the command block of your job, right after your macro finishes, to list the files it has produced in $SCRATCH.
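A command block instrumented that way might look like this:

<command>
root4star -b -q mymacro.C\(\"$FILELIST\"\)
ls -l $SCRATCH
</command>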

Examine your macro carefully to see what it's doing. It could be writing your files to a different directory, not writing any files at all, or crashing before it gets a chance to write anything.

26. Is it guaranteed that one gets each file at most once (no duplicates)?

You are guaranteed that duplicate files are dropped as long as you get them from the catalog; duplicates are not dropped from a plain file list. The dropping of duplicates is based on the fdid of the file. The .dataset file holds the fdid of all the files, so you can verify that it did in fact work.

The scheduler's output to the screen tells you how many files were dropped because they were duplicates. The scheduler selects between duplicates to pick the one that can be accessed the fastest, based on the storage type (there is a ranking). The user can override this with the preferStorage keyword (see the user manual). Xrootd may retrieve the file from a storage type other than the one stated.
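As a sketch, the preference would be stated together with the other catalog conditions, roughly like this (check the user manual for the exact placement of preferStorage and the values accepted at your site):

<input URL="catalog:star.bnl.gov?production=P05ic,filetype=daq_reco_MuDst,storage!=hpss,preferStorage=NFS" nFiles="100" />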
 

27. How do I delete the large number of files the scheduler produced when rm fails?

Use this command:
delete '*.*'
The single quotes are needed to prevent shell expansion.
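If the delete command is not available at your site: plain rm usually fails here with "Argument list too long", which find and xargs avoid (the name patterns follow the script* and fileList names from question 3; adjust them to what your scheduler version actually writes):

find . -maxdepth 1 -name 'script*' | xargs rm -f
find . -maxdepth 1 -name 'fileList*' | xargs rm -f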


28. What does it mean when the scheduler tells me: The configuration requested a limitation of the number of INPUTFILE(n) environment variables at 200, the rest will be commented out?

There are two ways people input files into their macro:

1) The file list (.list file) created with every job.
2) The environment variables defined in the csh script.

If you're using the file list, this message has no effect on you and you should ignore it. If you are using the environment variables, keep the number of input files per process at or below the limit by using the maxFilesPerProcess attribute, as in the sketch below. Only so many characters fit into the environment of a csh script, so the scheduler caps the number of INPUTFILE(n) variables (typically at 200) to avoid hitting that limit.
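For example, to keep each process at or below the 200-file cap, set the attribute on the job element:

<job maxFilesPerProcess="200">

If you do read the environment variables, the generated csh defines them one by one, along these lines (a sketch, not verbatim scheduler output; paths are placeholders):

setenv INPUTFILECOUNT 3
setenv INPUTFILE0 /path/to/file0.MuDst.root
setenv INPUTFILE1 /path/to/file1.MuDst.root
setenv INPUTFILE2 /path/to/file2.MuDst.root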

29. When using the STAR file catalog, when should the keyword "storage=xxx" be used?

The "paths" and "rootd" file list syntaxes should use "storage!=hpss", because HPSS files are not mounted, so they will be dropped by the scheduler, and even if they were not, they would not be accessible.

If you are using "fileListSyntax=xrootd" (currently only available at RCF), xrootd will determine how to give you access to the file, so the storage keyword is less critical here. Using "storage!=local" would pick files from HPSS and from NFS if they exist; if files are available on NFS, you will get much faster access than from HPSS. To prevent excessive dropping of duplicates (load on the scheduler and file catalog), you can use "storage=HPSS", because only one copy of each file exists on HPSS.
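Putting the two cases side by side (conditions are illustrative):

<job fileListSyntax="rootd" ... >
  <input URL="catalog:star.bnl.gov?production=P05ic,filetype=daq_reco_MuDst,storage!=hpss" nFiles="100" />

<job fileListSyntax="xrootd" ... >
  <input URL="catalog:star.bnl.gov?production=P05ic,filetype=daq_reco_MuDst,storage=HPSS" nFiles="100" />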

30. I'm running over the same files over and over again and the catalog is slow or down. Can I submit my jobs without re-querying the catalog?

If you have recently submitted jobs with the scheduler, you still have your .session.xml and .dataset files, and you don't need to change anything in your XML request (input, output, command, ...), then you can recompile your macro and resubmit the jobs without doing another catalog query.

To resubmit in this mode, cd back to the folder you submitted the jobs from and use:

star-submit -r all 2596CF4AB570DE769AC325EB21864616.session.xml

Of course, replace the .session.xml file above with your own .session.xml file, generated for you by the scheduler right after it submitted all of your jobs. The .csh files will be overwritten or regenerated from the information in the .session.xml, and the .list files will be rewritten with the information from the .dataset file, so any changes you make to these files by hand will be lost.


31. How do I better manage the thousands of files the scheduler writes?

In scheduler version 1.10.0c and above there is an option to have the scheduler write its files somewhere other than your current working directory. A basic example of this capability is to create a subdirectory of your current working directory and have the scheduler put everything there. To do this, first create the directory (mkdir ./myFiles), then add these tags inside the body of your XML file's job tag (that is, at the same level as the command tag):

<Generator><Location>./myFiles</Location></Generator>

These tags have far more fine-grained options, fully documented in section 5.1 of the manual.

