Dataset Building and Validation for USCMS

Introduction

There are many steps involved in building, publishing and validating datasets for the USCMS User Analysis Facility. Some of the steps currently required are due to limitations, or bugs, in existing infrastructure and are expected to become obsolete in the future. However, until those changes make it into production releases and the older versions have long since stopped being used, it will be necessary to continue executing many of the temporary steps. This document describes the steps necessary, given the current circumstances, to build, publish and validate datasets. Note: These instructions are for building in the production area only. Experts wishing to work in the development build area should contact Gerald Guglielmo (gug@fnal.gov) for modified instructions and extra steps. If you are not intimately familiar with the production build process, please do not attempt to run in the development area.

Procedures

Step 0: General Setup for All Steps

Open a session on the US CMS User Analysis cluster (a.k.a. bigmac). Change directories to the base production area for dataset building and validation, /storage/data3/pubWork/prod/DatasetBuild, and source the setup.csh script. The script will create a working area, subdirectories for some of the processing steps, and a few links, define several environment variables, and change directories to your work area. One of the environment variables that will be set up is DATASETBUILD_BASE, which can be used for referencing directories from that point on.
  1. Setup Environment:
    Any environment setup should be done inside the scripts, thus making this step irrelevant. However, in case there is a need for some initialization outside of the scripts, this step serves as a placeholder. Log in on the UAF, change directories to the top level build area, and source the setup script; you will end up in your own working directory.
    cd /storage/data3/pubWork/prod/DatasetBuild
    source setup.csh
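    As an optional sanity check (a minimal sketch; exactly which subdirectories setup.csh creates is not listed here), confirm that the build base variable is defined and that you ended up in your own work area:
    echo ${DATASETBUILD_BASE}
    pwd
    ls ${DATASETBUILD_BASE}/${USER}/work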
              

Step 1: Building the Metadata files

Presumably one will receive datasets from different sources; exactly how this works still needs to be discussed with Anzar. Scripts to automate as much of this as possible will be provided, but the process will still require careful attention and may depend on more manual activities than is desirable. If the datasets come from the production team, then one may already have a POOL XML file catalog to start with and regenerating it will not be necessary. However, if the catalog is not available then it will need to be generated in the early stages of the process.
  1. Identify Dataset Name and All Owner Names:
    The name of the dataset and all owners that will be grouped together need to be identified. The owner for the last production stage will serve as the primary owner and will be the one referenced in the publishing and validation stages. For example, for a DST dataset we might also want to attach the SimHits and Digis, so the DST owner becomes the primary and we also need the SimHits and Digis owners for the dataset. For the moment we assume an expert has provided us with a list of owners needed for each dataset and will proceed from there.
  2. Select and Reserve a Dataset:
    In the future there should be a tool for handling this task. For the moment, there is only a text file to edit by hand: /afs/fnal.gov/files/expwww/uscms/html/SoftwareComputing/UserComputing/dataset_service/todo.txt. Add your username to the beginning of the line for the dataset you will build. A username at the start of the line, or "done", indicates another user has reserved this dataset.
  3. Attach the Runs:
    There may need to be some preliminary steps here; it is not yet clear exactly what the scripts look like and what input they need. It is likely the process will be to run the attachment process more than once, and it is possible that the scripts involved will differ between SimHits/Digis and DSTs.
    1. Run the Attach Runs script for Hits Dataset:
      Change directories to your attachArea directory. Invoke the following steps to attach runs and look over the standard output for signs of a problem. It is recommended that the command below have its standard output redirected to a file for easier parsing later (a redirection example is shown after the DST sub-step below).
      cd attachArea
      ${DATASETBUILD_BASE}/bin/BuildDataset.sh <DatasetName> <HitOwnerName> 
                      
    2. Run the Attach Runs script for Digis Dataset:
      Change directories to your attachArea directory. Invoke the following steps to attach runs and look over the standard output for signs of a problem. It is recommended that the command below have its standard output redirected to a file for easier parsing later.
      cd attachArea
      ${DATASETBUILD_BASE}/bin/BuildDataset.sh <DatasetName> <DigiOwnerName>
                      
    3. Run the Attach Runs script for DSTs Dataset:
      Change directories to your attachArea directory. Invoke the following steps to attach runs and look over the standard output for signs of a problem. It is recommended that the command below have its standard output redirected to a file for easier parsing later.
      cd attachArea
      ${DATASETBUILD_BASE}/bin/BuildDataset.sh <DatasetName>  <DstOwnerName>
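      As a sketch of the recommended redirection (shown for the DST attach; the same pattern applies to the Hits and Digis commands above, and the log file name is only illustrative):
      ${DATASETBUILD_BASE}/bin/BuildDataset.sh <DatasetName> <DstOwnerName> \
       >& attach_<DatasetName>_<DstOwnerName>.log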
                      
  4. Build Temporary Tarfile:
    The META files and the POOL catalog for the dataset are not built in the production directory tree. This allows one to rebuild an existing dataset without interfering with the existing dataset in the production area. The penalty for this flexibility is that the files need to be moved to the production area. The production top level area is currently /storage/data4/METADATA and the META files reside in a subdirectory path of <OwnerName>/<DatasetName>. Transferring the files to the production area involves building a tar file, untarring it in the production top level, and then running a script to fix the paths listed in the POOL XML file catalog.
    1. Change Directory to the Build Metadata Top Level:
      Like the production area, the build Metadata area has a top level directory. Below this top level is a set of subdirectories representing owner names, and each owner name subdirectory contains one or more subdirectories representing dataset names. Change directory to the build top level for Metadata.
      cd ${DATASETBUILD_BASE}/${USER}/commonOutDir/METADATA_ROOT
                     
    2. Generate the Tar File:
      There is a script that will build the tar file in the current directory. Invoking the script with the primary owner name and the dataset name will generate a tar file of the form META.<OwnerName>.<DatasetName>.<yyyymmdd>.tar.
      ${DATASETBUILD_BASE}/bin/genTar.sh <OwnerName> <DatasetName>
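      For illustration only (the owner and dataset names here are borrowed from the detach example later in this document, and the date stamp is hypothetical), an invocation might look like:
      ${DATASETBUILD_BASE}/bin/genTar.sh jm_2x1033PU761_TkMu_2_g133_OSC jm03b_gj_2040
      which would produce META.jm_2x1033PU761_TkMu_2_g133_OSC.jm03b_gj_2040.20050125.tar in the current directory.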
                      
  5. Deploy the Metadata Files to Production Area:
    The production Metadata area has a top level directory of /storage/data4/METADATA. Below this top level is a set of subdirectories representing owner names, and each owner name subdirectory contains one or more subdirectories representing dataset names.
    1. Change Directory to the Production Metadata Top Level:
      The production Metadata area has a top level directory of /storage/data4/METADATA. Change directory to the production top level for Metadata.
      cd /storage/data4/METADATA
                     
    2. Untar the Metadata:
      In the production Metadata top level directory, untar the tar file generated earlier. Recall the path where the tar file was built, which should be the build Metadata top level directory, and use that path when referencing the tar file.
       ${DATASETBUILD_BASE}/bin/untar.sh  \
       ${commonOutDir}/METADATA_ROOT/META.<OwnerName>.<DatasetName>.<yyyymmdd>.tar
                     
    3. Fix the Paths to the Metadata Files in the POOL XML File Catalog:
      The POOL XML file catalog will have paths for all the META files that point to the build directory tree instead of the production tree. To correct this, a script exists that will change instances of the first path provided into the second path provided. Care should be taken to omit the trailing '/' character from both paths when invoking the script. The script needs to be run from the production Metadata top level directory.
      ${DATASETBUILD_BASE}/bin/FixMetaPFNS.sh <OwnerName> <DatasetName> \
       ${DATASETBUILD_BASE}/${USER}/commonOutDir/METADATA_ROOT /storage/data4/METADATA 
                      
      -or recommended-
      ${DATASETBUILD_BASE}/bin/FixMetaPFNS.sh <OwnerName> <DatasetName> \
       ${DATASETBUILD_BASE}/${USER}/commonOutDir/METADATA_ROOT $PWD
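      As an optional check after running the script (this assumes the POOL XML file catalog sits under the <OwnerName>/<DatasetName> subdirectory as an .xml file, which is not spelled out above), verify that no build-area paths remain:
      grep -c "${DATASETBUILD_BASE}/${USER}/commonOutDir" \
       <OwnerName>/<DatasetName>/*.xml
      A count of 0 for the catalog file indicates the paths were rewritten.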
                      
    4. At this point the dataset build and transfer into production is considered complete.

Step 2: Initial Publishing of Dataset to the Dataset Service

The validation steps require that the dataset is known to the Dataset Service. Therefore it is necessary to initially publish the dataset to the Dataset Service before the validation process can begin.
  1. Change Directory to the UAF Farms Web Area:
    The Dataset Service web pages are in the UAF Farms web area located at /afs/fnal.gov/files/expwww/uscms/html/SoftwareComputing/UserComputing. Change directory to that area in preparation for rebuilding the Dataset Service web pages.
    cd /afs/fnal.gov/files/expwww/uscms/html/SoftwareComputing/UserComputing
              
  2. Add the Dataset to the Dataset Service:
    There are now two ways to add a new dataset, but currently only the entire rebuild is working, so use that one. The first is to rebuild all the datasets in the service, and the second is to build only the new one. In the former case only the production top level Metadata area is specified to the script, while for the latter the owner name and dataset name are also specified. The second method is faster and will scale a bit better, as it only needs to regenerate the html for all the datasets, whereas the first method also copies over files for all the datasets in addition to the html generation.
    ${DATASETBUILD_BASE}/bin/getOrcarcFrag.sh /storage/data4/METADATA
              
    -or (recommended once the single-dataset method is working)-
    ${DATASETBUILD_BASE}/bin/getOrcarcFrag.sh /storage/data4/METADATA <OwnerName> <DatasetName>
              
  3. The Dataset Service should now be aware of the new dataset.

Step 3: Run the Fix for the Attach Runs Bug

There is a bug in some versions of the AttachRun code that creates inconsistencies in the SysCollection metadata file, which then causes DetachRun to misbehave. While the AttachRun code has been fixed, there are still datasets that have been built by the bad version, and this could continue for some time into the future; therefore this issue will not go away completely anytime soon. Fortunately there is an application that can repair the Metadata files; it is safe to run on datasets that do not have the problem, and it runs fairly quickly. At the moment the easiest way to access the application that fixes the Metadata for this problem is by using the Cobra version in /storage/data3/pubWork/prod/DatasetBuild/Releases/COBRA_8_1_0, which can be accomplished by using the script described below and specifying 8_1_0 as the <CobraVersion>. Because this procedure is safe to run on datasets that do not have the problem, as well as the ones that do, it should be run on all datasets to be safe and consistent.
  1. Run the Fix Application
    The fix application in this section will produce an output file called fix_<OwnerName>_<DatasetName>_COBRA_<CobraVersion>.txt in the subdirectory called attachFixArea of your basic work area (${DATASETBUILD_BASE}/${USER}/work), regardless of your current working directory when you launch the script. This file will indicate which runs had the fix applied and which ones were detected as not needing to be fixed.
    cd ${DATASETBUILD_BASE}/${USER}/work/attachFixArea
    ${DATASETBUILD_BASE}/bin/runFix.sh <OwnerName> <DatasetName> <CobraVersion> 
              
    Note that for the moment you should use 8_1_0 for the <CobraVersion> in this step (see the example below).
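    For illustration (the owner and dataset names are borrowed from the detach example later in this document), a complete invocation would look like:
    cd ${DATASETBUILD_BASE}/${USER}/work/attachFixArea
    ${DATASETBUILD_BASE}/bin/runFix.sh jm_2x1033PU761_TkMu_2_g133_OSC jm03b_gj_2040 8_1_0
    and the resulting log would be fix_jm_2x1033PU761_TkMu_2_g133_OSC_jm03b_gj_2040_COBRA_8_1_0.txt.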

Step 4: Primary Validation of the Dataset

It is possible that the attaching of the runs did not go smoothly, or that other structural problems exist, so there may be problems running over the new dataset. Therefore a minimal validation of the dataset needs to be performed. If all goes well there will be a list of bad runs, hopefully empty, generated by the validation process. There will also be information on the number of runs in the dataset and a listing of the runs. This information will be placed in one of two html files, one from each validation application, and these files need to be copied to the production Metadata directory tree. There can be problems that cause a bad run to be listed as run number 0. In these cases one must read the log files and figure out what the real run number is by hand (contact an expert for help).
  1. Check for Bad Runs and Make a List:
    The initial validation step gets a listing of runs for all owners, compares the listings for missing runs in earlier stages, and runs collection validation to look for problems. There are several files needed to run this step. The jdf file allows one to submit the validation to the fbsng system on the User Analysis Facility. Note that for the moment you should use 8_1_0 for the <CobraVersion> in this step.
    1. Run the List Generator
      The validation application in this section will produce several output files. The important ones for the general user are: RunsAndEvents_<OwnerName>_<DatasetName>.html and rp_<OwnerName>_<DatasetName>_COBRA_<CobraVersion>.txt in the subdirectory called listValArea of your basic work area (${DATASETBUILD_BASE}/${USER}/work), regardless of your current working directory when you launch the script. The former has important information needed by the Dataset Service and should contain a listing of any bad runs. The latter is the log file and is useful for investigating problems.
      cd ${DATASETBUILD_BASE}/${USER}/work/listValArea
      ${DATASETBUILD_BASE}/bin/runPrefix.sh <OwnerName> <DatasetName> <CobraVersion> 
                      
      -or-
      setup fbsng
      setup kerberos
      fbs submit ${DATASETBUILD_BASE}/jdf/runPrefix.jdf <OwnerName> <DatasetName> \
       <CobraVersion>
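      Once the job completes, a quick scan of the log can catch obvious failures before moving on (the search strings below are only a guess at what problems look like; read the log directly if anything is unclear):
      grep -i -E "error|exception" \
       ${DATASETBUILD_BASE}/${USER}/work/listValArea/rp_<OwnerName>_<DatasetName>_COBRA_<CobraVersion>.txt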
                      
    2. Copy HTML file to Production Metadata Tree
      This validation step will produce an html file of the form RunsAndEvents_<OwnerName>_<DatasetName>.html, which will have a listing of all runs in the primary dataset and of any runs that are considered bad and will need to be detached. Make a note of any runs that need to be detached for later, and then copy the html file to the production Metadata directory tree under the owner and dataset subdirectory tree. These files will be located in ${DATASETBUILD_BASE}/${USER}/work/listValArea.
      cp RunsAndEvents_<OwnerName>_<DatasetName>.html \
       /storage/data4/METADATA/<OwnerName>/<DatasetName>
                      
    3. Repeat the Step for Adding a Dataset to the Dataset Service
      See the section above on how to add a dataset to the Dataset Service; that step should be repeated for the dataset. This will update the information about the status of the dataset in the Dataset Service.
  2. Check that Event Headers can be Read:
    This validation step uses batchcobra to read event headers for runs in the primary owner for the dataset. There are several files needed to run this validation step. The jdf file allows one to submit the validation to the fbsng system on the User Analysis Facility. Note that for the moment you should use 8_1_0 for the <CobraVersion> in this step.
    1. Run the Batchcobra Validation
      The validation application in this section will produce several output files. The important ones for the general user are: BatchCOBRA_<OwnerName>_<DatasetName>.html and bcobra_<OwnerName>_<DatasetName>_ORCA_<CobraVersion>.txt in the subdirectory called headValArea of your basic work area (${DATASETBUILD_BASE}/${USER}/work), regardless of your current working directory when you launch the script. The former has important information needed by the Dataset Service and should contain a listing of any bad runs. The latter is the log file and is useful for investigating problems.
      cd ${DATASETBUILD_BASE}/${USER}/work/headValArea
      ${DATASETBUILD_BASE}/bin/runBatchCobra.sh <OwnerName> <DatasetName> <CobraVersion> 
                      
      -or-
      setup fbsng
      setup kerberos
      fbs submit ${DATASETBUILD_BASE}/jdf/runBatchCobra.jdf <OwnerName> <DatasetName> \
       <CobraVersion>
                      
    2. Copy HTML file to Production Metadata Tree
      This validation step will produce an html file of the form BatchCOBRA_<OwnerName>_<DatasetName>.html, which will have a listing of all runs in the primary dataset and of any runs that are considered bad and will need to be detached. Make a note of any problems for later, and then copy the html file to the production Metadata directory tree under the owner and dataset subdirectory tree. The files will be located in ${DATASETBUILD_BASE}/${USER}/work/headValArea.
      cp BatchCOBRA_<OwnerName>_<DatasetName>.html \
       /storage/data4/METADATA/<OwnerName>/<DatasetName>
                      
    3. Repeat the Step for Adding a Dataset to the Dataset Service
      See the section above on how to add a dataset to the Dataset Service. That step should be repeated for the dataset. This will update the information about the status of the dataset in the Dataset Service.

Step 5: Detach the Bad Runs

If a dataset was found to have a non-empty list of bad runs, then that list can now be used to detach the bad runs from the Metadata files. If the list was empty, then this step can be skipped. The only file needed to run this step is called runDetach.sh. The application will attempt to detach the one specified run from the specified <RunOwner> associated with the primary owner and dataset combination. The application runs fairly quickly and should be run interactively if possible, so there is as of yet no jdf file to allow one to submit the work to the fbsng system on the User Analysis Facility. Currently the COBRA version used is specified via the <CobraVersion> argument; this is expected to change in the future when a version becomes available that has all of the needed applications and fixes included.
  1. Run the Detach Application
    The detach application in this section will produce an output file called detach_<OwnerName>_<DatasetName>_<RunNumber>_COBRA_<CobraVersion>.txt in the subdirectory called detachArea of your basic work area (${DATASETBUILD_BASE}/${USER}/work), regardless of your current working directory when you launch the script. This file will indicate the status of the detach run process. Note that the script takes the primary owner name as the first argument and the owner from which the run should be detached as the third argument.
    cd ${DATASETBUILD_BASE}/${USER}/work/detachArea
    ${DATASETBUILD_BASE}/bin/runDetach.sh <OwnerName> <DatasetName> \
     <RunOwnerName> <RunNumber> <CobraVersion> 
             
    Note that for the moment you should use 8_1_0 for the <CobraVersion> in this step. If there are multiple runs to detach from a run owner, then a loop can be set up in a bash script to help automate the process, as shown in the example below:
             
     #!/bin/bash
     # Detach a list of runs from one run owner for a given primary owner/dataset.
     callit() {
         owner=${1}     # primary owner name
         dataset=${2}   # dataset name
         runOwner=${3}  # owner from which the runs are detached
         runs=${4}      # space-separated list of run numbers
         for run in ${runs}
         do
             # 8_1_0 is the CobraVersion to use for now (see the note above)
             ${DATASETBUILD_BASE}/bin/runDetach.sh ${owner} ${dataset} ${runOwner} ${run} 8_1_0
         done
     }
     # Example values; replace with the owner, dataset, run owner and bad runs for your dataset.
     owner="jm_2x1033PU761_TkMu_2_g133_OSC"
     dataset="jm03b_gj_2040"
     runOwner="jm_2x1033PU761_TkMu_2_g133_OSC"
     runs="125300004 125300005 125300007"
     callit ${owner} ${dataset} ${runOwner} "${runs}"
     exit
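    To use the sketch above, save it to a file (for example detachRuns.sh, a hypothetical name), adjust the owner, dataset, runOwner and runs values for your case, and run it with bash; the DATASETBUILD_BASE environment variable from your session will be inherited by the script:
     bash detachRuns.sh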
              
  2. Re-Run the Validation Step
    At this point the dataset is in theory validated. However, something could have gone wrong with the detach process or something could have been missed. The validation step also creates information that is used by the Dataset Service, like a list of runs and bad runs, and that information needs to be updated. Therefore it is necessary at this point to return to the Primary Validation stage of the process and repeat all of the tasks for the dataset.

Step 6: Fix the Collections with FixColls

This step is still waiting on a version of FixColls that works well with dCache and large numbers of files.

Step 7: Create New Tar Files and Copy into DCache

Now that the dataset has been validated, we would like to save a copy of the Metadata directory tree in case there is an unrecoverable data loss on the disk array. Again there are scripts that can be used to help perform this task. This is essentially a two step process.
  1. Creating the Validated Metadata Tar File:
    These instructions follow essentially the same pattern as the original tar file build in preparation for deploying to the production area. The difference is that the tar file is now generated from the production area instead of in preparation for transferring to the production area.
    1. Change Directory to the Production Metadata Top Level:
      These instructions are the same as above. The production Metadata area has a top level directory of /storage/data4/METADATA. Change directory to the production top level for Metadata.
                   
      cd /storage/data4/METADATA
                      
    2. Generate the Tar File:
      This recipe is the same as the one from the earlier tar file build stage. There is a script that will build the tar file in the current directory. Invoking the script with the primary owner name and the dataset name will generate a tar file of the form META.<OwnerName>.<DatasetName>.<yyyymmdd>.tar.
      ${DATASETBUILD_BASE}/bin/genTar.sh <OwnerName> <DatasetName>
                      
  2. Copy the Tar File into DCache:
    There is a script that can be used to copy the tar file to the dCache area. The script will handle the necessary environment setup and should be invoked from the same directory where the tar file was created. The only input required is the name of the tar file generated in the previous step. The script will create any necessary subdirectories in pnfs for storing the tar file.
    ${DATASETBUILD_BASE}/bin/cpMETA2pnfs.sh META.<OwnerName>.<DatasetName>.<yyyymmdd>.tar 
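    Before removing the local copy in the next step, it is worth confirming that the tar file actually arrived in dCache; the destination below is a placeholder for whatever pnfs path cpMETA2pnfs.sh reports when it runs:
    ls -l <pnfs path reported by cpMETA2pnfs.sh>/META.<OwnerName>.<DatasetName>.<yyyymmdd>.tar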
               
  3. Delete the Local Tar File:
    Finally one should remove the tar file that was generated once it has been safely copied to dCache.
    rm META.<OwnerName>.<DatasetName>.<yyyymmdd>.tar
              
  4. Mark as Done:
    In the future there should be a tool for handling this task. For the moment, there is only a text file to edit by hand: /afs/fnal.gov/files/expwww/uscms/html/SoftwareComputing/UserComputing/dataset_service/todo.txt. Change your username at the beginning of the line for the dataset to "done".

Gerald Guglielmo
Last modified: Tue Jan 25 14:38:49 CST 2005