How To Create SAM Dataset Definitions and Snapshots

Before you can analyze data, you have to tell SAM which data files you would like to use by passing the name of a dataset definition to SAM. You can use a dataset definition that someone else has already made. This is recommended, if you can find one that suits your purpose. You can also create your own, either using command line tools which are described on this webpage or using the web interface, the Dataset Definition Editor.

A SAM dataset definition has already been created for all the datasets defined in DFC (the old Data File Catalog) and all primary CDF datasets that are created by production. For example, here is how to check if the bhmu0f dataset definition exists.

    source ~cdfsoft/cdf2.cshrc
    setup sam
    sam describe dataset definition --defname="bhmu0f"

Or go to the DataBase Browser Dataset Definition Report, type in the dataset definition name "bhmu0f", and click submit. Here is the result of querying the database for all *0f datasets: list of CDF *0f datasets. The browser takes "*" as a wildcard. You can select based on many other fields, such as user name.

General Notes About SAM

Creating a Dataset Definition

Making a Dataset Definition with a Frozen File List

Dividing a Dataset Definition Into Conveniently Sized Pieces

How to See New Data Added to a Dataset

How to Combine Datasets

How to Create a Dataset Definition with One or Two Files

General Notes About SAM

SAM can be controlled by commands given directly to the UNIX shell.

There is a comprehensive list of SAM commands with descriptions, but most of them are useful only to experts or administrators, or are relevant only to D0. Below we give examples and describe the SAM commands most important for CDF users.
To get help for a specific command, give the command with no arguments; it will print usage and other information.
Some SAM commands accept "%" as a wildcard.
SAM commands are a little slow. Some queries that take 2 minutes in SAM will run in a few seconds on the database browser.

Creating a Dataset Definition

Here is a step by step example of how a dataset definition was created for high PT muons.

        ssh -X fcdflnx7
        source ~cdfsoft/cdf2.cshrc
        setup sam

The SAM commands in this example were run on fcdflnx7, which is a central LINUX machine. If the necessary components are configured and installed properly, it should also be possible to run these commands on users' desktop machines.


        sam list files --dim="CDF.DATASET bhmu0d and RUN_NUMBER <= 186598"

The command "sam list files" is not required to create a dataset definition, but we recommend you run it first to test the dimension requirements. These requirements will be used to select files when taking a snapshot of the dataset definition. In the context of SAM, dimensions are parameters associated with files that are used to select files from the database. The command sam list files simply prints a list of files to standard output with some summary statistics at the end. It prints all the files that satisfy the criteria following the --dim argument, where --dim is short for dimension. These particular criteria select all files in CDF primary dataset bhmu0d that contain at least one event from a run numbered 186598 or lower. CDF.DATASET refers to the CDF datasets defined outside of the context of SAM.

You can use the logical operators "and", "or", and "minus" in dimension statements. You can also use parentheses, "(" and ")". You will need them to remove ambiguities in complicated statements like "CDF.DATASET aaaaaa and (CDF.DATASET bbbbbb or CDF.DATASET cccccc)". Most people will not need more than one or two dimensions, but there are many possible dimensions (Database Browser Report - list of dimensions).
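The grouping rule for complicated statements can be sketched with a small sh/bash helper. This is only an illustration: paren is a hypothetical function, not a SAM command, and the script prints the sam command for inspection instead of running it. It will not work as written in csh or tcsh.

```shell
# Hypothetical helper (not part of SAM): wrap a dimension statement in
# parentheses so it can be combined with other statements unambiguously.
paren() { printf '(%s)' "$*"; }

# Group the "or" first, then combine it with the "and" term.
either=$(paren "CDF.DATASET bbbbbb or CDF.DATASET cccccc")
dim="CDF.DATASET aaaaaa and $either"

# Print the command rather than running it, so it can be checked first.
echo "sam list files --dim=\"$dim\""
```

Building the string in a variable makes it easy to test it with sam list files and then reuse exactly the same text in sam create dataset definition.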

If you select files based on run number, be aware that SAM selects only files, not events. A file can contain events from multiple runs, both before and after any cutoff. If any event in a file passes the selection, SAM delivers the entire file with all of its events. You must use another method to select events in your analysis (see DHInput documentation) and be careful not to process the same events twice in different datasets.

Now let's create the dataset definition.

        sam create dataset definition \
        --defname="high_pt_muons_1" \
        --group=test \
        --defdesc="High pt muons before the August 2004 shutdown" \
        --dim="CDF.DATASET bhmu0d and RUN_NUMBER <= 186598"

        >> DatasetDefinition saved with definitionId = 11067

        sam take snapshot --defname="high_pt_muons_1" --group=test

        >> Snapshot has been taken, snapshotId = 22604

The command sam create dataset definition actually creates the dataset definition. Then the command sam take snapshot creates a snapshot of the dataset definition. The lines beginning with ">>" show text returned to standard output after these commands execute.

The --defname argument is used to assign a new name to the dataset definition when it is created and to refer to the dataset definition when making a snapshot.

The --group argument should always be set to "test" by CDF users. It serves no purpose within CDF. Just set it to "test" and forget about it.

The --defdesc argument allows one to add a text description of the dataset definition. People can find and read this description, but other than that it serves no functional purpose.

The --dim argument has the same meaning as in the sam list files command. It is the dimension requirements (file selection criteria) assigned to the dataset definition.

A dataset definition is not a list of files. A dataset definition contains a set of criteria used to select files from the database. If the SAM database changes over time, the list of files passing the criteria could change. Making a snapshot creates a list of files from the criteria using the database at the time the snapshot is created. The list of files associated with a snapshot will not change in time.

In addition to creating a list of files, making a snapshot has an important side effect. The dimension requirements associated with a dataset definition can be modified by anyone before the first snapshot is made from it. The first time a snapshot is taken, the dimension requirements are frozen and can no longer be changed. If you want to create a dataset definition that will not be modified or deleted, you should take a snapshot of it soon after creating it.
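One way to follow this advice is to generate the two commands together. The sh/bash sketch below assumes a hypothetical helper name, print_create_and_snapshot; it only prints the commands shown earlier in this section and does not run them. It also assumes the description and dimension text contain no double quotes.

```shell
# Hypothetical helper (not part of SAM): print the command that creates a
# dataset definition, followed by the command that takes its first snapshot.
# Arguments: definition name, description, dimension statement.
print_create_and_snapshot() {
    defname=$1
    defdesc=$2
    dim=$3
    printf 'sam create dataset definition --defname="%s" --group=test --defdesc="%s" --dim="%s"\n' \
        "$defname" "$defdesc" "$dim"
    printf 'sam take snapshot --defname="%s" --group=test\n' "$defname"
}

# Reproduce the high PT muon example from this section.
print_create_and_snapshot "high_pt_muons_1" \
    "High pt muons before the August 2004 shutdown" \
    "CDF.DATASET bhmu0d and RUN_NUMBER <= 186598"
```

After checking the printed lines, they can be pasted into the shell where SAM has been set up.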

We recommend the following naming convention for dataset definitions. For dataset definitions defined by physics groups, choose a short descriptive base name like "high_pt_muons". Then form each name by appending an underscore and a single number that is incremented for each new version of the dataset definition: high_pt_muons_1, high_pt_muons_2, high_pt_muons_3, etc. We do not recommend creating really long names that include all the metadata. The metadata is in the database and can easily be looked up. For dataset definitions intended for private use or tests, we recommend that users prepend their username followed by an underscore. All dataset definitions must have unique names.
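The naming convention can be sketched in sh/bash as follows. Both functions are hypothetical helpers written for this page, not SAM commands, and the username jdoe is a made-up example. The version number is assumed to be a plain decimal number without leading zeros.

```shell
# Hypothetical helper: given "name_N", print "name_(N+1)".
next_version() {
    case $1 in
        *_*) ;;
        *) echo "next_version: '$1' has no _N suffix" >&2; return 1 ;;
    esac
    base=${1%_*}     # text before the last underscore
    num=${1##*_}     # text after the last underscore
    case $num in
        ''|*[!0-9]*) echo "next_version: '$1' has no _N suffix" >&2; return 1 ;;
    esac
    echo "${base}_$((num + 1))"
}

# Hypothetical helper: prefix a definition name with a username for
# private or test definitions.
private_name() { echo "${1}_$2"; }

next_version high_pt_muons_1
private_name jdoe high_pt_muons_1
```

These helpers only build names. Since all names must be unique, it is still worth checking with sam describe dataset definition that the new name is not already taken.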

Making a Dataset Definition with a Frozen File List

If you followed the discussions in the last section, you should understand that using the high_pt_muons_1 dataset definition in your analysis job will not always give the same list of files. For example, if new files are added to the SAM database that pass the file selection criteria they will also be processed in your analysis job. Maybe that is what you want, but if you prefer stability then the following steps show how to create a stable dataset definition.

        sam create dataset definition \
        --defname="high_pt_muons_2" \
        --group=test \
        --dim="dataset_def_name high_pt_muons_1 and snapshot_version 1" \
        --defdesc="From snapshot 29 June 2005, High pt muons before the August 2004 shutdown"

        sam take snapshot --defname="high_pt_muons_2" --group=test

Some comments related to this procedure:

You cannot directly pass a snapshot to your analysis job, which is why the method above is useful.

This procedure freezes which files are associated with the dataset definition, but it does not freeze the order in which files will be processed by an analysis job.

Note that snapshot version and snapshot ID are not the same thing. Here is how to find the snapshot version using the database browser. You can start at this web page: Snapshot Report. In the previous step, the command sam take snapshot returned a snapshot ID. Type in the snapshot ID in the corresponding field and then click "Submit Request". Find the snapshot version in the resulting report.

In a dimension argument, you can reference another dataset definition alone, or another dataset definition together with a corresponding snapshot version. There is an important but subtle technical detail here that you need to be very careful about: the two cases require different dimensions. Use the dimension "dataset_def_name" when also specifying a snapshot version. When not specifying a snapshot version, always use the dimension "__SET__". For example, to print the list of files selected by the dataset definition we just created, without referencing a snapshot version, use:

    sam list files --dim="__SET__ high_pt_muons_2"
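The choice between the two dimensions can be captured in a short sh/bash helper so that it is made the same way every time. The function name ref_dim is hypothetical and is not part of SAM; it only prints a dimension statement.

```shell
# Hypothetical helper (not part of SAM): print the dimension statement that
# references another dataset definition.
#   ref_dim NAME           -> no snapshot version, use __SET__
#   ref_dim NAME VERSION   -> specific snapshot version, use dataset_def_name
ref_dim() {
    if [ $# -ge 2 ]; then
        echo "dataset_def_name $1 and snapshot_version $2"
    else
        echo "__SET__ $1"
    fi
}

ref_dim high_pt_muons_2
ref_dim high_pt_muons_1 1
```

The output can be passed to --dim, for example sam list files --dim="$(ref_dim high_pt_muons_2)".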

Dividing a Dataset Definition Into Conveniently Sized Pieces

Sometimes you want to divide a large dataset into smaller pieces for convenience, not to make some physics selection. The following commands show how the dataset definition defined above was divided into 3 pieces.

        sam create dataset definition \
        --defname="high_pt_muons_2_a1" \
        --group=test \
        --dim="dataset_def_name high_pt_muons_2 and snapshot_version 1 and snapshot_file_number 1-400" \
        --defdesc="Files 1 to 400 of high_pt_muons_2"

        sam take snapshot --defname="high_pt_muons_2_a1" --group=test

        sam create dataset definition \
        --defname="high_pt_muons_2_a2" \
        --group=test \
        --dim="dataset_def_name high_pt_muons_2 and snapshot_version 1 and snapshot_file_number 401-800" \
        --defdesc="Files 401 to 800 of high_pt_muons_2"

        sam take snapshot --defname="high_pt_muons_2_a2" --group=test

        sam create dataset definition \
        --defname="high_pt_muons_2_a3" \
        --group=test \
        --dim="dataset_def_name high_pt_muons_2 and snapshot_version 1 and snapshot_file_number 801-1135" \
        --defdesc="Files 801 to 1135 of high_pt_muons_2"

        sam take snapshot --defname="high_pt_muons_2_a3" --group=test
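The file number ranges used above can be computed instead of typed by hand. The sh/bash sketch below uses a hypothetical function, print_ranges, which takes the total number of files in the snapshot and the desired piece size (1135 and 400 in this example) and prints one snapshot_file_number term per piece.

```shell
# Hypothetical helper (not part of SAM): split files 1..TOTAL into
# consecutive pieces of at most SIZE files and print one
# "snapshot_file_number FIRST-LAST" term per piece.
print_ranges() {
    total=$1
    size=$2
    if [ "$size" -le 0 ]; then
        echo "print_ranges: piece size must be positive" >&2
        return 1
    fi
    first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + size - 1))
        # The final piece may be shorter than the others.
        if [ "$last" -gt "$total" ]; then last=$total; fi
        echo "snapshot_file_number $first-$last"
        first=$((last + 1))
    done
}

print_ranges 1135 400
```

Each printed term can be appended with "and" to the dataset_def_name and snapshot_version terms, as in the three commands above.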

How to See New Data Added to a Dataset

Assume that you have previously analyzed the xpmm0f dataset using snapshot version 1 of the xpmm0f dataset definition. You want to know which new files have been added since you previously analyzed the dataset. First take a new snapshot.

    sam take snapshot --defname="xpmm0f" --group=test

Look up the snapshot version using the database browser (Snapshot Report) and the snapshot ID returned by the command above. Let's assume the snapshot version is 20. To print a list of new files:

    sam list files \
        --dim="(dataset_def_name xpmm0f and snapshot_version=20) minus \
               (dataset_def_name xpmm0f and snapshot_version=1)"

Then if you want to analyze only this new data, create a new dataset definition and use it to run an analysis job.

    sam create dataset definition \
        --defname="newDatasetDefinitionName" \
        --group=test \
        --defdesc="description" \
        --dim="(dataset_def_name xpmm0f and snapshot_version=20) minus \
               (dataset_def_name xpmm0f and snapshot_version=1)"
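Because the same "minus" statement is used for both listing and defining, it can help to build it once. In the sh/bash sketch below, new_files_dim is a hypothetical helper, not a SAM command. The dimension syntax is copied from the commands above, written on a single line.

```shell
# Hypothetical helper (not part of SAM): print the dimension statement that
# selects files present in snapshot version NEW but not in version OLD.
# Arguments: definition name, old snapshot version, new snapshot version.
new_files_dim() {
    printf '(dataset_def_name %s and snapshot_version=%s) minus (dataset_def_name %s and snapshot_version=%s)\n' \
        "$1" "$3" "$1" "$2"
}

dim=$(new_files_dim xpmm0f 1 20)

# Print the listing command for inspection rather than running it.
echo "sam list files --dim=\"$dim\""
```

The same $dim value can then be given to sam create dataset definition, so the listed files and the defined files come from identical criteria.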

How to Combine Datasets

It is easy to combine two datasets. Simply create a new dataset definition and use "or" in the dimension requirements.

    sam create dataset definition \
        --defname="newDatasetDefinitionName" \
        --group=test \
        --defdesc="description" \
        --dim="__SET__ bhmu0d or __SET__ bhmu0f"

If you use "__SET__ datasetDefinition" in a dimension argument, it gets replaced by the dimension statement of the datasetDefinition. So the command above will create a new dataset definition whose file selection criteria are that files pass the criteria for bhmu0d or the criteria for bhmu0f.
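The same pattern extends to more than two datasets. As a sketch, the hypothetical sh/bash function union_dim below joins any number of dataset definition names with "or"; it is not part of SAM and only prints the dimension statement. The third name, bhmu0h, is a made-up example.

```shell
# Hypothetical helper (not part of SAM): print
# "__SET__ A or __SET__ B or ..." for the given dataset definition names.
union_dim() {
    out=""
    for name in "$@"; do
        if [ -n "$out" ]; then out="$out or "; fi
        out="${out}__SET__ $name"
    done
    echo "$out"
}

union_dim bhmu0d bhmu0f
union_dim bhmu0d bhmu0f bhmu0h
```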

How to Create a Dataset Definition with One or Two Files

Please, do not try to use this to create a dataset definition for each file in a dataset with multiple files. When you run a job with these dataset definitions you will create a project per dataset definition and overload the SAM servers. Do not do this.

You are encouraged to create a SINGLE dataset definition with one, two, or more files and run it as a test or to recover failed CAF segments. Here is an example of how to create a dataset definition for a single data file for test purposes.

    sam create dataset definition \
        --defname="newDatasetDefinitionName" \
        --group=test \
        --defdesc="description" \
        --dim="FILE_NAME bd02d8e6.05bdhmu0"

    sam take snapshot --defname="newDatasetDefinitionName" --group=test

Here is an example of how to extend this to two files. The same pattern can be extended to a larger number of files.

    sam create dataset definition \
        --defname="newDatasetDefinitionName" \
        --group=test \
        --defdesc="description" \
        --dim="FILE_NAME bd02d8e6.05bdhmu0 or FILE_NAME bd021cb9.0001hmu0"

    sam take snapshot --defname="newDatasetDefinitionName" --group=test
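For a longer list of files, for example when recovering failed CAF segments, the dimension statement can be built from a list of file names. The sh/bash sketch below uses a hypothetical function, files_dim, which reads one file name per line from standard input and prints a single statement for a SINGLE dataset definition. It is not part of SAM.

```shell
# Hypothetical helper (not part of SAM): read file names from standard
# input, one per line, and print "FILE_NAME a or FILE_NAME b or ...".
files_dim() {
    out=""
    # The "|| [ -n ... ]" part keeps a last line that has no newline.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue     # skip blank lines
        if [ -n "$out" ]; then out="$out or "; fi
        out="${out}FILE_NAME $name"
    done
    echo "$out"
}

printf '%s\n' bd02d8e6.05bdhmu0 bd021cb9.0001hmu0 | files_dim
```

All the files go into one dimension statement and therefore one dataset definition, which keeps the number of SAM projects to one, in line with the warning at the top of this section.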
