Before you can analyze data, you have to tell SAM which data files you would like to use by passing the name of a dataset definition to SAM. You can use a dataset definition that someone else has already made. This is recommended, if you can find one that suits your purpose. You can also create your own, either using command line tools which are described on this webpage or using the web interface, the Dataset Definition Editor.
A SAM dataset definition has already been created for all the datasets defined in DFC (the old Data File Catalog) and all primary CDF datasets that are created by production. For example, here is how to check if the bhmu0f dataset definition exists.
    source ~cdfsoft/cdf2.cshrc
    setup sam
    sam describe dataset definition --defname="bhmu0f"

Or go to the Database Browser Dataset Definition Report, type in the dataset definition name "bhmu0f", and click submit. Here is the result of querying the database for all *0f datasets: list of CDF *0f datasets. The browser takes "*" as a wildcard. You can select based on many other fields, such as user name.
    ssh -X fcdflnx7
    source ~cdfsoft/cdf2.cshrc
    setup sam

The SAM commands in this example were run on fcdflnx7, which is a central Linux machine. If the necessary components are installed and configured properly, it should also be possible to run these commands on users' desktop machines.
    sam list files --dim="CDF.DATASET bhmu0d and RUN_NUMBER <= 186598"

The command "sam list files" is not required to create a dataset definition, but we recommend running it first to test the dimension requirements. These requirements will be used to select files when taking a snapshot of the dataset definition. In the context of SAM, dimensions are parameters associated with files that are used to select files from the database.

The command sam list files simply prints a list of files to standard output, with some summary statistics at the end. It prints all the files specified by the criteria following the --dim argument, where --dim is short for dimension. These particular criteria select all files in CDF primary dataset bhmu0d that contain at least one event from a run numbered less than or equal to 186598. CDF.DATASET refers to the CDF datasets defined outside of the context of SAM.

You can use the logical operators "and", "or", and "minus" in dimension statements. You can also use parentheses, "(" and ")". You will need them to remove ambiguities in complicated statements like "CDF.DATASET aaaaaa and (CDF.DATASET bbbbbb or CDF.DATASET cccccc)". Most people will not need more than one or two dimensions, but there are many possible dimensions (Database Browser Report - list of dimensions).
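One way to build intuition for how "and", "or", and "minus" combine, and why parentheses matter, is to treat each dimension as a set of file names. The Python sketch below does only that: the sets dim_a, dim_b, and dim_c and the file names in them are made up for illustration, and nothing here calls SAM or reflects how SAM evaluates dimensions internally.

```python
# Hypothetical sets of file names, one per dimension (not real SAM data).
dim_a = {"f1", "f2", "f3"}
dim_b = {"f2", "f4"}
dim_c = {"f3", "f5"}

# Assumed mapping: "and" -> intersection, "or" -> union, "minus" -> difference.
with_parens = dim_a & (dim_b | dim_c)      # A and (B or C)
other_grouping = (dim_a & dim_b) | dim_c   # (A and B) or C
a_minus_b = dim_a - dim_b                  # A minus B

print(sorted(with_parens))
print(sorted(other_grouping))
print(sorted(a_minus_b))
```

The two groupings select different files, which is why an explicit pair of parentheses is safer than relying on any default precedence.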
If you select files based on run number, you must be aware that SAM selects only files, not events. A file can contain events from multiple runs, both before and after any cutoff. If any event passes the selection, SAM delivers the entire file with all of its events. You must use another method to select events in your analysis (see DHInput documentation) and be careful not to process the same events twice in different datasets.
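The difference between file-level and event-level selection can be sketched in a few lines of Python. The Event and DataFile classes, the file names, and the event numbers below are hypothetical stand-ins, not SAM or DHInput types; the point is only that a file straddling the cutoff is delivered whole.

```python
from dataclasses import dataclass

@dataclass
class Event:
    run: int
    number: int

@dataclass
class DataFile:
    name: str
    events: list

def select_files(files, max_run):
    """File-level selection: keep a file if ANY of its events passes the cut."""
    return [f for f in files if any(e.run <= max_run for e in f.events)]

def select_events(files, max_run):
    """Event-level selection that the analysis itself still has to apply."""
    return [e for f in files for e in f.events if e.run <= max_run]

files = [
    DataFile("file_a", [Event(186597, 1), Event(186598, 2)]),
    DataFile("file_b", [Event(186598, 3), Event(186599, 4)]),  # straddles the cutoff
    DataFile("file_c", [Event(186600, 5)]),
]

delivered = select_files(files, 186598)
kept = select_events(delivered, 186598)

print([f.name for f in delivered])                        # files delivered
print(sum(len(f.events) for f in delivered), len(kept))   # events delivered vs. kept
```

Here file_b is delivered even though one of its events lies beyond the cutoff, so the analysis must drop that event itself.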
Now let's create the dataset definition.
    sam create dataset definition \
        --defname="high_pt_muons_1" \
        --group=test \
        --defdesc="High pt muons before the August 2004 shutdown" \
        --dim="CDF.DATASET bhmu0d and RUN_NUMBER <= 186598"
    >> DatasetDefinition saved with definitionId = 11067
    sam take snapshot --defname="high_pt_muons_1" --group=test
    >> Snapshot has been taken, snapshotId = 22604
The command sam create dataset definition actually creates the dataset definition. Then the command sam take snapshot creates a snapshot of the dataset definition. The lines beginning with ">>" show text returned to standard output after these commands execute.
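If you script these steps, you may want to capture the IDs from the returned text. The following Python sketch assumes the output lines look exactly like the two shown in the example; parse_id is a hypothetical helper, and the real output format may differ between SAM versions.

```python
import re

def parse_id(output_line, label):
    """Return the integer following '<label> = ' in a line, or None if absent.

    Assumes the output format shown in the example; this is not a SAM API.
    """
    match = re.search(rf"{re.escape(label)}\s*=\s*(\d+)", output_line)
    return int(match.group(1)) if match else None

definition_id = parse_id("DatasetDefinition saved with definitionId = 11067", "definitionId")
snapshot_id = parse_id("Snapshot has been taken, snapshotId = 22604", "snapshotId")

print(definition_id, snapshot_id)
```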
A dataset definition is not a list of files. A dataset definition contains a set of criteria used to select files from the database. If the SAM database changes over time, the list of files passing the criteria can change. Making a snapshot creates a list of files by applying the criteria to the database as it is when the snapshot is created. The list of files associated with a snapshot does not change over time.
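The distinction can be illustrated with a small Python model in which a definition is a function (the criteria) and a snapshot is the frozen result of applying it once. The catalog records, field names, and the matches function are hypothetical; SAM stores and evaluates all of this in its database.

```python
def matches(record):
    """Hypothetical criteria standing in for a dimension statement."""
    return record["dataset"] == "bhmu0d" and record["min_run"] <= 186598

def evaluate(criteria, catalog):
    """Apply the criteria to the catalog as it is right now."""
    return tuple(r["name"] for r in catalog if criteria(r))

catalog = [
    {"name": "file_1", "dataset": "bhmu0d", "min_run": 186000},
    {"name": "file_2", "dataset": "bhmu0f", "min_run": 186100},
]

snapshot = evaluate(matches, catalog)   # frozen list of files

# The catalog grows after the snapshot was taken.
catalog.append({"name": "file_3", "dataset": "bhmu0d", "min_run": 186500})

current = evaluate(matches, catalog)    # same criteria, different answer

print(snapshot)
print(current)
```

The criteria did not change, but re-evaluating them picks up the new file, while the snapshot still holds the original list.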
In addition to creating a list of files, making a snapshot has an important side effect. The dimension requirements associated with a dataset definition can be modified by anyone before the first snapshot is made from it. The first time a snapshot is taken, the dimension requirements are frozen and can no longer be changed. If you want to create a dataset definition that will not be modified or deleted, you should take a snapshot of it soon after creating it.
We recommend the following naming convention for dataset definitions. For dataset definitions defined by physics groups, choose a short descriptive name such as "high_pt_muons", then append an underscore and a number that is incremented for each new version of the dataset definition: high_pt_muons_2, high_pt_muons_3, and so on. We do not recommend long names that include all the metadata; the metadata is in the database and can easily be looked up. We recommend that users prepend their username and an underscore to the names of all dataset definitions intended for private use or tests. All dataset definitions must have unique names.
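The naming convention is easy to apply mechanically. The Python helpers below are a hypothetical sketch of it, not part of SAM: next_version bumps a trailing "_N" suffix, and private_name adds a username prefix. The username "jdoe" and the name "test_sample" are made up.

```python
import re

def next_version(defname):
    """Return the name with its trailing '_<number>' incremented by one."""
    match = re.fullmatch(r"(.+)_(\d+)", defname)
    if match is None:
        raise ValueError(f"{defname!r} has no trailing version number")
    base, version = match.group(1), int(match.group(2))
    return f"{base}_{version + 1}"

def private_name(username, defname):
    """Prefix a definition name with 'username_' for private use or tests."""
    return f"{username}_{defname}"

print(next_version("high_pt_muons_1"))
print(private_name("jdoe", "test_sample"))
```

Neither helper checks uniqueness; that still has to be verified against the database.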
    sam create dataset definition \
        --defname="high_pt_muons_2" \
        --group=test \
        --dim="dataset_def_name high_pt_muons_1 and snapshot_version 1" \
        --defdesc="From snapshot 29 June 2005, High pt muons before the August 2004 shutdown"
    sam take snapshot --defname="high_pt_muons_2" --group=test

Some comments related to this
    sam list files --dim="__SET__ high_pt_muons_2"

If you do not choose correctly between "dataset_def_name" and "__SET__", your commands will appear to work, but in many cases they will give unexpected results that you do not want.
Sometimes you want to divide a large dataset into smaller pieces for convenience, not to make some physics selection. The following commands show how the dataset definition defined above was divided into 3 pieces.
    sam create dataset definition \
        --defname="high_pt_muons_2_a1" \
        --group=test \
        --dim="dataset_def_name high_pt_muons_2 and snapshot_version 1 and snapshot_file_number 1-400" \
        --defdesc="Files 1 to 400 of high_pt_muons_2"
    sam take snapshot --defname="high_pt_muons_2_a1" --group=test

    sam create dataset definition \
        --defname="high_pt_muons_2_a2" \
        --group=test \
        --dim="dataset_def_name high_pt_muons_2 and snapshot_version 1 and snapshot_file_number 401-800" \
        --defdesc="Files 401 to 800 of high_pt_muons_2"
    sam take snapshot --defname="high_pt_muons_2_a2" --group=test

    sam create dataset definition \
        --defname="high_pt_muons_2_a3" \
        --group=test \
        --dim="dataset_def_name high_pt_muons_2 and snapshot_version 1 and snapshot_file_number 801-1135" \
        --defdesc="Files 801 to 1135 of high_pt_muons_2"
    sam take snapshot --defname="high_pt_muons_2_a3" --group=test
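The file-number ranges used above follow a simple pattern, so they can be generated rather than typed by hand. The Python sketch below only builds the dimension strings, in the same form as the commands above; it does not run sam. The function names are hypothetical.

```python
def file_number_ranges(total_files, chunk_size):
    """Yield inclusive, 1-based (first, last) file-number ranges."""
    for first in range(1, total_files + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_files)

def chunk_dimensions(defname, snapshot_version, total_files, chunk_size):
    """Build one dimension string per chunk of a snapshot."""
    return [
        f"dataset_def_name {defname} and snapshot_version {snapshot_version} "
        f"and snapshot_file_number {first}-{last}"
        for first, last in file_number_ranges(total_files, chunk_size)
    ]

dims = chunk_dimensions("high_pt_muons_2", 1, 1135, 400)
for dim in dims:
    print(dim)
```

With 1135 files and a chunk size of 400, this reproduces the three ranges 1-400, 401-800, and 801-1135.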
    sam take snapshot --defname="xpmm0f" --group=test

Look up the snapshot version using the database browser (Snapshot Report) and the snapshot ID returned by the command above. Let's assume the snapshot version is 20. To print a list of new files:
    sam list files \
        --dim="(dataset_def_name xpmm0f and snapshot_version=20) minus \
        (dataset_def_name xpmm0f and snapshot_version=1)"

Then, if you want to analyze only this new data, create a new dataset definition and use it to run an analysis job.
    sam create dataset definition \
        --defname="newDatasetDefinitionName" \
        --group=test \
        --defdesc="description" \
        --dim="(dataset_def_name xpmm0f and snapshot_version=20) minus \
        (dataset_def_name xpmm0f and snapshot_version=1)"
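Conceptually, "minus" between two snapshots of the same dataset definition is a set difference: the files in the later snapshot that were not in the earlier one. A minimal Python sketch, using made-up file names in place of real snapshot contents:

```python
# Hypothetical contents of two snapshots of the same dataset definition.
snapshot_v1 = ["file_001", "file_002", "file_003"]
snapshot_v20 = ["file_001", "file_002", "file_003", "file_004", "file_005"]

def minus(left, right):
    """Files in left that are not in right, keeping the order of left."""
    excluded = set(right)
    return [name for name in left if name not in excluded]

new_files = minus(snapshot_v20, snapshot_v1)
print(new_files)
```

The order of the operands matters: swapping them would list files that were removed, not files that were added.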
    sam create dataset definition \
        --defname="newDatasetDefinitionName" \
        --group=test \
        --defdesc="description" \
        --dim="__SET__ bhmu0d or __SET__ bhmu0f"

If you use "__SET__ datasetDefinition" in a dimension argument, it is replaced by the dimension statement of that dataset definition. The command above therefore creates a new dataset definition that selects files passing the criteria for bhmu0d or the criteria for bhmu0f.
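The substitution can be pictured as a textual expansion. The Python sketch below is an illustration only: the stored dimension statements in the definitions dictionary are assumed for the example, the wrapping parentheses are an assumption added here to preserve grouping, and SAM performs the real replacement itself.

```python
import re

# Assumed stored dimension statements, keyed by dataset definition name.
definitions = {
    "bhmu0d": "CDF.DATASET bhmu0d",
    "bhmu0f": "CDF.DATASET bhmu0f",
}

def expand_sets(dim, definitions):
    """Replace each '__SET__ name' with the stored statement in parentheses."""
    def substitute(match):
        return f"({definitions[match.group(1)]})"
    return re.sub(r"__SET__\s+(\w+)", substitute, dim)

expanded = expand_sets("__SET__ bhmu0d or __SET__ bhmu0f", definitions)
print(expanded)
```

Because the criteria are copied, not a file list, the new definition follows the database as it changes.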
Please, do not try to use this to create a dataset definition for each file in a dataset with multiple files. When you run a job with these dataset definitions you will create a project per dataset definition and overload the SAM servers. Do not do this.
You are encouraged to create a SINGLE dataset definition with one, two, or more files and run it as a test or to recover failed CAF segments. Here is an example of how to create a dataset definition for a single data file for test purposes.
    sam create dataset definition \
        --defname="newDatasetDefinitionName" \
        --group=test \
        --defdesc="description" \
        --dim="FILE_NAME bd02d8e6.05bdhmu0"
    sam take snapshot --defname="newDatasetDefinitionName" --group=test

Here is an example of how to extend this to two files. The same pattern can be extended to a larger number of files.
    sam create dataset definition \
        --defname="newDatasetDefinitionName" \
        --group=test \
        --defdesc="description" \
        --dim="FILE_NAME bd02d8e6.05bdhmu0 or FILE_NAME bd021cb9.0001hmu0"
    sam take snapshot --defname="newDatasetDefinitionName" --group=test
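For more than a couple of files, the dimension string can be assembled from a list. This Python sketch builds only the string, joining FILE_NAME terms with "or" as in the example above; file_name_dimension is a hypothetical helper and does not call sam.

```python
def file_name_dimension(file_names):
    """Join file names into a single 'FILE_NAME a or FILE_NAME b ...' statement."""
    if not file_names:
        raise ValueError("at least one file name is required")
    return " or ".join(f"FILE_NAME {name}" for name in file_names)

dim = file_name_dimension(["bd02d8e6.05bdhmu0", "bd021cb9.0001hmu0"])
print(dim)
```

All the files go into one dimension statement and therefore one dataset definition, which keeps this in line with the warning against creating one definition per file.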