How To Create SAM Dataset Definitions and Snapshots

Before you can analyze data, you have to tell SAM which data files you would like to use by passing it the name of a dataset definition. If you can find a dataset definition made by someone else that suits your purpose, we recommend using it. You can also create your own, either with the command line tools described on this webpage or with the web interface, the Dataset Definition Editor.

A SAM dataset definition has already been created for all the datasets defined in DFC (the old Data File Catalog) and all primary CDF datasets that are created by production. For example, here is how to check if the bhmu0f dataset definition exists.

    source ~cdfsoft/cdf2.cshrc
    setup sam
    sam describe dataset definition --defname="bhmu0f"

Or go to the Database Browser Dataset Definition Report, type in the dataset definition name "bhmu0f", and click submit. The list of CDF *0f datasets shows the result of querying the database for all *0f datasets. The browser takes "*" as a wildcard. You can select based on many other fields, such as user name.

Contents

  • General Notes About SAM
  • Creating a Dataset Definition
  • Making a Dataset Definition with a Frozen File List
  • Dividing a Dataset Definition Into Conveniently Sized Pieces
  • How to See New Data Added to a Dataset
  • How to Combine Datasets
  • How to Create a Dataset Definition with One or Two Files

    General Notes About SAM

    SAM can be controlled by commands given directly to the UNIX shell.

    Creating a Dataset Definition

    Here is a step-by-step example of how a dataset definition was created for high PT muons.
            ssh -X fcdflnx7
            source ~cdfsoft/cdf2.cshrc
            setup sam
    
    The SAM commands in this example were run on fcdflnx7, a central Linux machine. If the necessary components are installed and configured properly, it should also be possible to run these commands on users' desktop machines.
    
            sam list files --dim="CDF.DATASET bhmu0d and RUN_NUMBER <= 186598"
    
    The command "sam list files" is not required to create a dataset definition, but we recommend running it first to test the dimension requirements. These requirements will be used to select files when a snapshot of the dataset definition is taken. In the context of SAM, dimensions are parameters associated with files that are used to select files from the database. The command prints every file selected by the criteria following the --dim argument (--dim is short for dimension) to standard output, with some summary statistics at the end. These particular criteria select all files in the CDF primary dataset bhmu0d that contain at least one event from a run less than or equal to 186598. CDF.DATASET refers to the CDF datasets defined outside of the context of SAM.

    You can use the logical operators "and", "or", and "minus" in dimension statements. You can also use parentheses, "(" and ")". You will need them to remove ambiguities in complicated statements like "CDF.DATASET aaaaaa and (CDF.DATASET bbbbbb or CDF.DATASET cccccc)". Most people will not need more than one or two dimensions, but many others are available (see the Database Browser Report list of dimensions).
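
    One way to reason about compound dimension statements is to treat each simple dimension as the set of files it matches, so that "and" acts like an intersection, "or" like a union, and "minus" like a difference. The following is a toy Python model of that idea, not SAM code; the set names and file names are invented for illustration.

```python
# Toy model only: each simple dimension is represented by the set of file
# names it would match. All names below are invented for illustration.
dim_a = {"f1", "f2", "f3"}
dim_b = {"f2", "f4"}
dim_c = {"f3", "f5"}

# "dim_a and (dim_b or dim_c)": the parentheses group the "or" first.
grouped = dim_a & (dim_b | dim_c)

# "(dim_a and dim_b) or dim_c": a different grouping, a different result.
regrouped = (dim_a & dim_b) | dim_c

# "dim_a minus dim_b": files matched by the first but not by the second.
difference = dim_a - dim_b

print(sorted(grouped))
print(sorted(regrouped))
print(sorted(difference))
```

    The two groupings select different files, which is why parentheses matter in longer statements.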

    If you select files based on run number, be aware that SAM selects only files, not events. A file can contain events from multiple runs, both before and after any cutoff. If any event in a file passes the selection, SAM delivers the entire file with all of its events. You must use another method to select events in your analysis (see the DHInput documentation), and you must be careful not to process the same events twice in different datasets.
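
    The difference between file-level and event-level selection can be sketched with a small Python model. This is an illustration only, under the assumption that a file can be represented by the run numbers of its events; the file names, run numbers, and function names are hypothetical and are not part of SAM or DHInput.

```python
# Toy model only: a "file" is a name plus the run number of each event in it.
# File names and run numbers are invented for illustration.
files = {
    "file_A": [186590, 186595],          # entirely at or below the cutoff
    "file_B": [186597, 186598, 186600],  # straddles the cutoff
    "file_C": [186601, 186605],          # entirely above the cutoff
}
RUN_CUTOFF = 186598


def select_files(files, cutoff):
    """File-level selection: keep a file if ANY of its events passes."""
    return [name for name, runs in files.items()
            if any(run <= cutoff for run in runs)]


def select_events(files, names, cutoff):
    """Event-level selection, applied by the analysis to the delivered files."""
    return [(name, run) for name in names
            for run in files[name] if run <= cutoff]


delivered = select_files(files, RUN_CUTOFF)
delivered_events = sum(len(files[name]) for name in delivered)
kept = select_events(files, delivered, RUN_CUTOFF)

print(delivered)         # files handed to the job
print(delivered_events)  # events in those files, including run 186600
print(len(kept))         # events that actually satisfy the run cutoff
```

    In this model file_B is delivered in full, so the event from run 186600 reaches the job unless the analysis applies its own run cut.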

    Now let's create the dataset definition.

            sam create dataset definition \
            --defname="high_pt_muons_1" \
            --group=test \
            --defdesc="High pt muons before the August 2004 shutdown" \
            --dim="CDF.DATASET bhmu0d and RUN_NUMBER <= 186598"
    
            >> DatasetDefinition saved with definitionId = 11067
    
            sam take snapshot --defname="high_pt_muons_1" --group=test
    
            >> Snapshot has been taken, snapshotId = 22604
    

    The command sam create dataset definition actually creates the dataset definition. Then the command sam take snapshot creates a snapshot of the dataset definition. The lines beginning with ">>" show text returned to standard output after these commands execute.

    A dataset definition is not a list of files. A dataset definition contains a set of criteria used to select files from the database. If the SAM database changes over time, the list of files passing the criteria can change. Making a snapshot creates a list of files from the criteria using the database as it is when the snapshot is created. The list of files associated with a snapshot will not change over time.
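
    The distinction between a definition (criteria) and a snapshot (a frozen file list) can be illustrated with a short Python sketch. It is a toy model under stated assumptions, not how SAM is implemented: the catalog, its fields, the file names, and the helper functions are all hypothetical.

```python
# Toy model only; the catalog layout, file names, and helpers are hypothetical.
catalog = [
    {"name": "file_1", "dataset": "bhmu0d", "min_run": 180000},
    {"name": "file_2", "dataset": "bhmu0d", "min_run": 190000},
]


def definition(entry):
    """Stand-in for the criteria: bhmu0d files with an event from run <= 186598."""
    return entry["dataset"] == "bhmu0d" and entry["min_run"] <= 186598


def list_files(criteria, catalog):
    """Evaluate the criteria against the catalog as it is right now."""
    return [entry["name"] for entry in catalog if criteria(entry)]


def take_snapshot(criteria, catalog):
    """Freeze the current result as an immutable tuple."""
    return tuple(list_files(criteria, catalog))


snapshot = take_snapshot(definition, catalog)

# Later, a new file that passes the criteria is added to the catalog.
catalog.append({"name": "file_3", "dataset": "bhmu0d", "min_run": 185000})

current = list_files(definition, catalog)

print(snapshot)  # unchanged by the later addition
print(current)   # re-evaluating the criteria picks up the new file
```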

    In addition to creating a list of files, making a snapshot has an important side effect. The dimension requirements associated with a dataset definition can be modified by anyone before the first snapshot is made from it. The first time a snapshot is taken, the dimension requirements are frozen and can no longer be changed. If you want to create a dataset definition that will not be modified or deleted, you should take a snapshot of it soon after creating it.

    We recommend the following naming convention for dataset definitions. Dataset definitions defined by physics groups should get a short descriptive base name like "high_pt_muons". Form the full name by appending an underscore and a single number that is incremented for each new version of the dataset definition: high_pt_muons_1, high_pt_muons_2, high_pt_muons_3, etc. We do not recommend long names that include all the metadata; the metadata is in the database and can easily be looked up. We recommend that users prepend their username and an underscore to the names of all dataset definitions intended for private use or tests. All dataset definitions must have unique names.
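
    The naming convention can be expressed as two small Python helpers. This is only a sketch of the convention described here; the function names and the username "jdoe" are hypothetical.

```python
def versioned_name(base, version):
    """Physics-group style name: base name, underscore, version number."""
    return f"{base}_{version}"


def private_name(username, name):
    """Private or test definition: username, underscore, then the name."""
    return f"{username}_{name}"


group_name = versioned_name("high_pt_muons", 2)
test_name = private_name("jdoe", versioned_name("high_pt_muons", 1))

print(group_name)
print(test_name)
```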

    Making a Dataset Definition with a Frozen File List

    As discussed in the last section, using the high_pt_muons_1 dataset definition in your analysis job will not always give the same list of files. For example, if new files that pass the file selection criteria are added to the SAM database, they will also be processed in your analysis job. Maybe that is what you want, but if you prefer stability, the following steps show how to create a stable dataset definition based on a snapshot.
            sam create dataset definition \
            --defname="high_pt_muons_2" \
            --group=test \
            --dim="dataset_def_name high_pt_muons_1 and snapshot_version 1" \
            --defdesc="From snapshot 29 June 2005, High pt muons before the August 2004 shutdown"
    
            sam take snapshot --defname="high_pt_muons_2" --group=test
    
    
    Some comments related to this

    Dividing a Dataset Definition Into Conveniently Sized Pieces

    Sometimes you want to divide a large dataset into smaller pieces for convenience, not to make some physics selection. The following commands show how the dataset definition defined above was divided into 3 pieces.

            sam create dataset definition \
            --defname="high_pt_muons_2_a1" \
            --group=test \
            --dim="dataset_def_name high_pt_muons_2 and snapshot_version 1 and snapshot_file_number 1-400" \
            --defdesc="Files 1 to 400 of high_pt_muons_2"
    
            sam take snapshot --defname="high_pt_muons_2_a1" --group=test
    
            sam create dataset definition \
            --defname="high_pt_muons_2_a2" \
            --group=test \
            --dim="dataset_def_name high_pt_muons_2 and snapshot_version 1 and snapshot_file_number 401-800" \
            --defdesc="Files 401 to 800 of high_pt_muons_2"
    
            sam take snapshot --defname="high_pt_muons_2_a2" --group=test
    
            sam create dataset definition \
            --defname="high_pt_muons_2_a3" \
            --group=test \
            --dim="dataset_def_name high_pt_muons_2 and snapshot_version 1 and snapshot_file_number 801-1135" \
            --defdesc="Files 801 to 1135 of high_pt_muons_2"
    
            sam take snapshot --defname="high_pt_muons_2_a3" --group=test
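
    The file number ranges used in the three pieces can be computed rather than typed by hand. The Python sketch below only builds the dimension strings in the form shown in this section; it does not call SAM, and the function names are hypothetical.

```python
def file_number_ranges(total_files, chunk_size):
    """Return inclusive, 1-based (first, last) pairs covering 1..total_files."""
    return [(first, min(first + chunk_size - 1, total_files))
            for first in range(1, total_files + 1, chunk_size)]


def piece_dimensions(defname, snapshot_version, total_files, chunk_size):
    """Build one dimension string per piece, in the form used in this section."""
    return [
        f"dataset_def_name {defname} and snapshot_version {snapshot_version} "
        f"and snapshot_file_number {first}-{last}"
        for first, last in file_number_ranges(total_files, chunk_size)
    ]


ranges = file_number_ranges(1135, 400)
dims = piece_dimensions("high_pt_muons_2", 1, 1135, 400)

for dim in dims:
    print(dim)
```

    Each printed string can be pasted into the --dim argument of a separate sam create dataset definition command.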
    

    How to See New Data Added to a Dataset

    Assume that you have previously analyzed the xpmm0f dataset using snapshot version 1 of the xpmm0f dataset definition. You want to know which new files have been added since you previously analyzed the dataset. First take a new snapshot.
        sam take snapshot --defname="xpmm0f" --group=test
    
    Look up the snapshot version using the database browser (Snapshot Report) and the snapshot ID returned by the command above. Let's assume the snapshot version is 20. To print a list of new files:
        sam list files \
            --dim="(dataset_def_name xpmm0f and snapshot_version=20) minus \
                   (dataset_def_name xpmm0f and snapshot_version=1)"
    
    Then if you want to analyze only this new data, create a new dataset definition and use it to run an analysis job.
        sam create dataset definition \
            --defname="newDatasetDefinitionName" \
            --group=test \
            --defdesc="description" \
            --dim="(dataset_def_name xpmm0f and snapshot_version=20) minus \
                   (dataset_def_name xpmm0f and snapshot_version=1)"
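
    The effect of "minus" between two snapshots can be pictured as removing the old file list from the new one. The Python sketch below is a toy model with invented file names and a hypothetical helper; it is not SAM code.

```python
def new_files(newer_snapshot, older_snapshot):
    """Files in the newer snapshot that were not in the older one, in order."""
    already_seen = set(older_snapshot)
    return [name for name in newer_snapshot if name not in already_seen]


# Invented file lists standing in for snapshot versions 1 and 20.
snapshot_v1 = ["file_a", "file_b"]
snapshot_v20 = ["file_a", "file_b", "file_c", "file_d"]

added = new_files(snapshot_v20, snapshot_v1)
print(added)
```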
    

    How to Combine Datasets

    It is easy to combine two datasets. Simply create a new dataset definition and use "or" in the dimension requirements.
        sam create dataset definition \
            --defname="newDatasetDefinitionName" \
            --group=test \
            --defdesc="description" \
            --dim="__SET__ bhmu0d or __SET__ bhmu0f"
    
    If you use "__SET__ datasetDefinition" in a dimension argument, it is replaced by the dimension statement of datasetDefinition. So the command above creates a new dataset definition that selects files passing either the criteria for bhmu0d or the criteria for bhmu0f.
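
    The substitution can be pictured as a text replacement. The Python sketch below is a toy model, not SAM's implementation: the stored dimension statements for bhmu0d and bhmu0f are assumed, and wrapping each substituted statement in parentheses is also an assumption made here to keep the grouping unambiguous.

```python
import re

# Hypothetical stored dimension statements, keyed by dataset definition name.
definitions = {
    "bhmu0d": "CDF.DATASET bhmu0d",
    "bhmu0f": "CDF.DATASET bhmu0f",
}


def expand_set(dim, definitions):
    """Replace each '__SET__ name' with '(stored dimension statement)'.

    Simplification: a name is taken to be the run of non-space characters
    after __SET__, so names directly followed by ')' are not handled.
    """
    def substitute(match):
        return "(" + definitions[match.group(1)] + ")"

    return re.sub(r"__SET__\s+(\S+)", substitute, dim)


expanded = expand_set("__SET__ bhmu0d or __SET__ bhmu0f", definitions)
print(expanded)
```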

    How to Create a Dataset Definition with One or Two Files

    Please, do not try to use this to create a dataset definition for each file in a dataset with multiple files. When you run a job with these dataset definitions you will create a project per dataset definition and overload the SAM servers. Do not do this.

    You are encouraged to create a SINGLE dataset definition with one, two, or more files and run it as a test or to recover failed CAF segments. Here is an example of how to create a dataset definition for a single data file for test purposes.

        sam create dataset definition \
            --defname="newDatasetDefinitionName" \
            --group=test \
            --defdesc="description" \
            --dim="FILE_NAME bd02d8e6.05bdhmu0"

        sam take snapshot --defname="newDatasetDefinitionName" --group=test
    
    Here is an example of how to extend this to two files. The same pattern, with additional "or FILE_NAME" terms, extends to a larger number of files.
        sam create dataset definition \
            --defname="newDatasetDefinitionName" \
            --group=test \
            --defdesc="description" \
            --dim="FILE_NAME bd02d8e6.05bdhmu0 or FILE_NAME bd021cb9.0001hmu0"

        sam take snapshot --defname="newDatasetDefinitionName" --group=test
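
    For a longer list of files, for example when recovering failed segments, the dimension string for a SINGLE dataset definition can be assembled from the file names. The Python sketch below only builds the string in the form shown in this section; the function name is hypothetical and nothing here calls SAM.

```python
def file_name_dimension(file_names):
    """Join file names into one 'FILE_NAME a or FILE_NAME b ...' statement."""
    if not file_names:
        raise ValueError("at least one file name is required")
    return " or ".join(f"FILE_NAME {name}" for name in file_names)


# The two file names used in the example of this section.
dim = file_name_dimension(["bd02d8e6.05bdhmu0", "bd021cb9.0001hmu0"])
print(dim)
```

    The result goes into the --dim argument of one sam create dataset definition command, so the job still runs as a single project.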