Mrenna Matched Datasets

Introduction (can be skipped):

The datasets are built from samples of MadGraph events for W+0, 1, 2, 3, and 4 jets and Gamma*/Z+0, 1, 2, and 3 jets.  These exclusive samples (i.e. describing a fixed number of partons) are combined into inclusive samples (i.e. describing any number of partons/hadrons) using Pythia.  The method for doing this is not the same as the method of Catani et al. used for e+ e- collisions (known as CKKW), but is "equivalent."  The final product is provided to you, the user, as fully hadronized events in the MCFIO format, residing on the Patriot (Physics Analysis Tools Required to Investigate Our Theories) enstore, and described with metadata in SAM.

What is the CKKW method (can be skipped)?

In short, the CKKW "idea" is to map a W+n parton event onto a parton shower history.  All we have is the kinematics and quantum numbers of the on-shell partons, so we have to "invent" an algorithm to determine the mothers of all splittings that occur in the shower.  In contrast, a real parton shower starts from, say, W production and has a mechanism for adding on additional splittings to get a W+n parton event; we are trying to guess backwards.  The reason is that we want to reweight the matrix element prediction so that it reproduces the parton shower description when the partons are soft and collinear.  However, this reweighting should be done in a way that remains valid when the partons are not soft and collinear, so that we still reproduce the matrix element prediction.

CKKW suggests using Kt-clustering to determine the parton shower history.  This means looping over the "on-shell" partons and finding the minimum value of Kt^2 = Pt^2 for each parton, or min(Pt_i, Pt_j)^2 dR^2 for each pair of partons.  After identifying the minimum, you remove the parton or pair of partons associated with that value and replace the pair with its mother.  You continue until no partons are left.  Once you have a history, you reweight the matrix element prediction with Sudakov form factors on all internal and external lines and use an appropriate value of alpha_s for each parton shower splitting.  For CKKW, the Sudakov used is a next-to-leading-log, analytic form, and alpha_s is evaluated at Kt.
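To make the clustering step concrete, here is a minimal Python sketch of reconstructing a shower history by repeated Kt-clustering.  It assumes partons are given as simple (pt, eta, phi) records and that you supply a function for the mother kinematics; the function names and the toy merge below are illustrative, not part of the actual Patriot or Pythia code.

import math

def kt_measures(partons):
    """Kt measures as described above: pt^2 for a single parton (beam distance)
    and min(pt_i, pt_j)^2 * dR^2 for each pair of partons."""
    single = [(p["pt"] ** 2, (i,)) for i, p in enumerate(partons)]
    pairs = []
    for i in range(len(partons)):
        for j in range(i + 1, len(partons)):
            deta = partons[i]["eta"] - partons[j]["eta"]
            dphi = abs(partons[i]["phi"] - partons[j]["phi"])
            dphi = min(dphi, 2 * math.pi - dphi)
            kt2 = min(partons[i]["pt"], partons[j]["pt"]) ** 2 * (deta ** 2 + dphi ** 2)
            pairs.append((kt2, (i, j)))
    return single + pairs

def cluster_history(partons, merge):
    """Greedily undo the softest splitting until nothing is left, recording the
    (Kt^2, partons) of each step -- this is the guessed shower history."""
    partons = list(partons)
    history = []
    while partons:
        kt2, idx = min(kt_measures(partons), key=lambda m: m[0])
        history.append((kt2, [partons[i] for i in idx]))
        survivors = [p for i, p in enumerate(partons) if i not in idx]
        if len(idx) == 2:                       # a pair: replace it with its mother
            survivors.append(merge(partons[idx[0]], partons[idx[1]]))
        partons = survivors                     # a single parton: clustered with the beam
    return history

# Toy usage: merge mothers by adding pt's and averaging angles (illustrative only).
toy_merge = lambda a, b: {"pt": a["pt"] + b["pt"],
                          "eta": 0.5 * (a["eta"] + b["eta"]),
                          "phi": 0.5 * (a["phi"] + b["phi"])}
partons = [{"pt": 40.0, "eta": 0.1, "phi": 0.2},
           {"pt": 25.0, "eta": 0.3, "phi": 0.5},
           {"pt": 12.0, "eta": -1.0, "phi": 2.8}]
for kt2, who in cluster_history(partons, toy_merge):
    print(f"cluster at kt = {kt2 ** 0.5:.1f} GeV, {len(who)} parton(s)")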

Now, the remaining step is to take samples of different "n" for W+n jets and add them together to give an inclusive prediction.  This is only sensible if the samples do not "overcount" one another.  Each of the W+n jet samples is generated with a cutoff:  CKKW uses a minimum value of Kt as the cutoff.  Therefore, the W+1 parton sample has exactly one parton with Kt>Kt_cutoff, the W+2 parton sample has two partons with Kt>Kt_cutoff, the W+3 parton sample has three, etc.  An "ordinary" parton shower is then used to add on additional partons with Kt<Kt_cutoff.  This is accomplished by adding a veto inside the parton shower algorithm to disallow branchings with Kt>Kt_cutoff (see the sketch below).  At the end of the day, the W+0 parton sample is converted into a semi-inclusive sample that has 0 partons with Kt>Kt_cutoff, but any number with Kt<Kt_cutoff.  The W+1 parton sample has one and only one parton with Kt>Kt_cutoff, and any number with Kt<Kt_cutoff.  Now all the samples can be added together.  The "ordinary" parton shower used here is based on the Pythia routines.
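A toy Python sketch of the veto may help: an ordinary shower proposes branchings at decreasing scales, and any proposal with Kt above the cutoff is simply skipped, so the exclusive matrix-element sample only acquires extra partons below Kt_cutoff.  The propose_emission callback and the toy proposal are illustrative stand-ins for the real Pythia machinery, not its actual interface.

import random

def vetoed_shower(propose_emission, start_scale, kt_cutoff, cutoff_scale=1.0,
                  rng=random.Random(1)):
    """Keep asking the shower for branchings; veto (skip) any with kT >= kt_cutoff."""
    accepted = []
    scale = start_scale
    while scale > cutoff_scale:
        proposal = propose_emission(scale, rng)
        if proposal is None:
            break                     # shower ran out of phase space
        scale, kt = proposal
        if kt >= kt_cutoff:
            continue                  # the veto: disallow branchings above the cutoff
        accepted.append(kt)
    return accepted

def toy_proposal(scale, rng):
    """Toy emission model (illustrative only): step the scale down, draw a kT below it."""
    new_scale = scale * rng.uniform(0.3, 0.9)
    return new_scale, new_scale * rng.random()

print(vetoed_shower(toy_proposal, start_scale=80.0, kt_cutoff=10.0))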

How are the matched samples different from CKKW (can be skipped, read above for background)?

  1. An analytic Sudakov is not used for reweighting.  Instead, the parton shower itself is used to calculate the Sudakov.  The key here is that the Sudakov gives the probability of no emission between a given upper and a given lower scale.  If you add a shower to a set of partons starting at the upper scale and any generated emission is harder than the specified lower scale, you know that the probability for this to occur is 1 - Sudakov.  By going through the various stages of the parton shower history (described in the section above) and adding on a "fake" or pseudo parton shower, you can calculate all of the Sudakov weights (see the sketch after this list).  The advantage is that the same Sudakov is used for the reweighting as is used for the subsequent parton shower added on to generate the inclusive sample.

  2. Clustering is done in relative Pt instead of Kt.  Relative Pt is what is used in both Pythia and Herwig as the argument of alpha_s inside the Sudakov.  Again, the aim here is to tailor the matching to the actual event generator.

  3. Color and flavor information is used in the construction of a parton shower history:  this sometimes limits which partons can be mothers.  This method should be more accurate away from the soft and collinear limit.

  4. The method can be applied to any parton shower.

  5. In a paper with Peter Richardson, I investigated several methods for treating the highest-multiplicity matrix element sample.  Since we didn't start with a partonic sample of W+(infinity partons) as part of the matching process, we have to add some phase space to get jets with Kt>Kt_cutoff and multiplicity greater than the highest multiplicity at the matrix element level.  In other words, starting with W+4 partons with Kt>Kt_cutoff, we should allow for the possibility that there is a 5th jet with Kt>Kt_cutoff.  Clearly, this has to be generated through the parton shower, which previously vetoed jets with Kt>Kt_cutoff.  Here, we simply relax this last requirement for the highest-multiplicity parton sample.
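The following Python sketch illustrates item 1: estimating a Sudakov form factor as a no-emission probability by running trial (pseudo) showers between an upper and a lower scale.  The toy emission model is invented purely for illustration; the real calculation lives inside the matching code.

import random

def sudakov_from_trial_showers(upper_kt, lower_kt, hardest_emission,
                               n_trials=10000, rng=random.Random(2)):
    """Estimate the Sudakov (no-emission probability between upper_kt and lower_kt)
    by counting how often a trial shower started at upper_kt produces nothing
    harder than lower_kt.  `hardest_emission` is an illustrative stand-in for the
    shower: it returns the hardest kT of a trial shower, or None for no emission."""
    no_emission = 0
    for _ in range(n_trials):
        hardest = hardest_emission(upper_kt, rng)
        if hardest is None or hardest < lower_kt:
            no_emission += 1
    return no_emission / n_trials        # the emission probability is 1 - this

def toy_hardest_emission(upper_kt, rng):
    # toy stand-in: sometimes no emission at all, otherwise a kT below the start scale
    return None if rng.random() < 0.2 else upper_kt * rng.random()

print(sudakov_from_trial_showers(40.0, 10.0, toy_hardest_emission))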

Where are the samples in Enstore space (can be skipped if using SAM)?

  • The samples reside in /pnfs/patriot

  • This is mounted on d0mino01-03 as /pnfs/sam/patriot (this was unexpected and may change)

  • The W+jets samples reside in the subdirectory W_jets/MEPS/0 and /1

  • The Gamma*/Z+jets samples reside in Z_jets/MEPS/1 and /2

  • Patriot is part of stken, so you should do "setup encp -q stken" to access the files directly.

Where are the samples in SAM space?

  • To find a listing of files, you can use the SAM Catalog Web Query Interface and search on "data file name" with the pattern "%Z%run%io" or "%W%run%io"

  • You can find the metadata at the command-line by typing "sam get metadata --file=<file>", where <file> was listed in the previous step.  There is a lot of information here.

Example Metadata Listing:  "sam get metadata --file=1.Z0run1.1.io"
PhysicsGenericFile({
                   'fileName' : '1.Z0run1.1.io',
                     'fileId' : 5318428L,
                   'fileType' : 'physicsGeneric',
                 'fileFormat' : 'unknown',
                   'fileSize' : SamSize('606.00MB'),
                        'crc' : CRC('3945349037L', 'adler 32 crc type'),
          'fileContentStatus' : 'good',
                 'eventCount' : 25000L,
                   'dataTier' : 'generated',
                 'firstEvent' : 1L,
                  'lastEvent' : 25000L,
                  'startTime' : SamTime('SamTime("NULL")','InvalidTimeFormat'),
                    'endTime' : SamTime('SamTime("NULL")','InvalidTimeFormat'),
                      'group' : 'dzero',
                     'params' : Params({
                            'dataset' : CaseInsensitiveDictionary({
                                   'dataid' : 48590,
                                 'facility' : 'fixed-target-farm',
                                'groupname' : 'cd',
                                'last_user' : 'mrenna',
                               'num_events' : 25000,
                                   'origin' : 'fnal',
                                'picobarns' : 262.5,
                                 'pnfs_dir' : 'Z_jets/MEPS/1/',
                               'producedby' : 'mrenna',
                              'producedfor' : 'patriot',
                                  'project' : 'patriot',
                                  'weblink' : 'cepa.fnal.gov/patriot',      }),
                           'me_alpha' : CaseInsensitiveDictionary({
                                'qcd_order' : 0,
                                'qcd_power' : 1,
                                'qed_order' : 1,
                                'qed_power' : 2,            }),
                            'me_cuts' : CaseInsensitiveDictionary({
                                'dcut_mode' : 2,
                                  'djj_cut' : 10.0,
                              'll_mass_cut' : 20.0,            }),
                      'me_generation' : CaseInsensitiveDictionary({
                                 'comments' : 'none',
                                 'gen_name' : 'MadEvent',
                                  'partons' : 'pp_epem',
                                'runnumber' : 1,
                                  'version' : '1',            }),
                             'me_pce' : CaseInsensitiveDictionary({
                                 'collider' : 'p-pbar',
                                   'energy' : 1960,
                          'physics_process' : 'Z_jets',            }),
                          'me_scales' : CaseInsensitiveDictionary({
                               'fact_scale' : 'mZ',
                             'renorm_scale' : 'kT',            }),
                      'me_structures' : CaseInsensitiveDictionary({
                             'event_weight' : 0,
                                      'pdf' : 'cteq6l',            }),
                           'ps_alpha' : CaseInsensitiveDictionary({
                             'matched_show' : 1,
                                'qcd_order' : 1,
                            'qcd_radiation' : 1,
                                'qed_order' : 1,
                            'qed_radiation' : 1,            }),
                      'ps_generation' : CaseInsensitiveDictionary({
                           'default_values' : 0,
                                 'gen_name' : 'pythia',
                                  'version' : '6226',            }),
                          'ps_scales' : CaseInsensitiveDictionary({
                                'fsr_scale' : 'kT',
                                'isr_scale' : 'mZ',            }),
                      'ps_structures' : CaseInsensitiveDictionary({
                                      'pdf' : 'CTEQ5L',
                         'underlying_event' : 'RFTuneA',            }),        }),    })


What are the datasets?

There are 4 datasets of interest (for today).

  1. Gamma*/Z -> (e+e-) + jets, with corrections to the 3 hardest jets

  2. Gamma*/Z -> (mu+mu-) + jets, with corrections to the 3 hardest jets

  3. W+ -> (e+ nu_e) + jets, with corrections to the 4 hardest jets.

  4. W+ -> (mu+ nu_mu) + jets, with corrections to the 4 hardest jets.

Why are there only 3 hard jets for Gamma*/Z?  This was a practical issue resulting from the fact that I allowed Gamma* with a mass cut M_ll > 20 GeV.  This could be improved in the future, particularly if one is only interested in M_ll ~ M_Z.

What are the individual files within the datasets?

Let's concentrate first on dataset 1, noted above.

  1. The files are denoted 1.Z"L"run"N"."M".io

  2. The ".io" suffix denotes that they are in the MCFIO format.

  3. The "1" prefix denotes the dataset.

  4. "M" simply counts how many files are there.  For a given "N", all but the highest "M" have 25k events.

  5. You (now) don't need to look at all "M" files:  the events are already randomized.

  6. "N" denotes the partonic level cuts applied to the sample.  Different cuts were applied to give a range of predictions, so that one can judge the method.   The range of "N" is a bit convoluted.  Cuts of 10, 15, and 20 GeV were applied, but what is the kinematic variable to cut on?  I originally tried 3 different clustering choices times 3 cuts, leading to 9 sets of datafiles.  I finally converged on one method, selecting "N"s of 1, 4, and 7.  However, events with no extra partons at the matrix-element level (Z+0 partons), don't depend upon the clustering scheme, so there were only 3 types of files denoted by 1,2, and 3.  The set corresponding to a Pt cut of 10 GeV is denoted by %run1%, that with 15 GeV contains %run2% and %run4%, and that with 20 GeV contains %run3% and %run7%.  Sorry for the confusion.

  7. "L" denotes the number of partons at the matrix element level, i.e. those with Kt>Kt_cutoff(=10, 15, and 20 GeV) [Note, whenever I say Kt, I really mean the relative Pt].  "L" is critical, because it determines the cross section for the number of events you use, or the integrated luminosity it corresponds to (since N = Lint * Sigma).

How should one weight events?

The table below gives the crucial information.  Consider the case where you take 50k events from 1.Z0run1.1.io and 1.Z0run1.2.io.  By looking at the table, you will see that these correspond to X=262.6 pb, or a weight of X/50k per event.  You then add 75k events from 1.Z1run1.1.io, 1.Z1run1.2.io, and 1.Z1run1.3.io, corresponding to Y=94.0 pb; each of those events has the weight Y/75k.  If you now fill a histogram from both sets of events using these weights, the total area of the histogram should be X+Y=356.6 pb, and it should have 125k entries (see the sketch below).  Continue with increasing multiplicity (Z2runX, Z3runX) until you have used up all the cross section.  If you do not want to deal with weighted events, you can choose the event sample that corresponds to the smallest integrated luminosity, use ALL of those events, and then use N = sigma * Lint to determine how many events of each other type to use.  This is described in more detail below the cross section table.
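Here is a small Python sketch of this bookkeeping; the numbers are taken from the worked example and the cross section table below.

# Numbers from the worked example above (dataset 1, run 1).
samples = [
    {"label": "Z+0p", "xsec_pb": 262.6, "n_events": 50_000},
    {"label": "Z+1p", "xsec_pb": 94.0,  "n_events": 75_000},
]

total_area = 0.0
for s in samples:
    weight = s["xsec_pb"] / s["n_events"]   # pb per event, the same for every event
    total_area += weight * s["n_events"]    # contribution to the histogram integral
    print(f'{s["label"]}: weight = {weight:.3e} pb/event')

print(f"total histogram area = {total_area:.1f} pb")   # 262.6 + 94.0 = 356.6 pb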

Cross Section Tables:

All cross sections are in pb; "L" is the number of matrix-element partons above the matching cut (see above).

Dataset                    runN     L=0      L=1      L=2      L=3      L=4    Total
-------------------------------------------------------------------------------------
1 (Pt>10 GeV) e+e-         1        262.6     94.0    32.3    11.18    n.a.    400.0
1 (Pt>15 GeV)              2+4      300.1     59.6    15.8     3.57    n.a.    379.1
1 (Pt>20 GeV)              3+7      325.5     40.4     9.0     1.45    n.a.    376.4

2 (Pt>10 GeV) mu+mu-       1        262.6     94.0    32.3    11.16    n.a.    399.9
2 (Pt>15 GeV)              2+4      300.1     59.6    15.8     3.56    n.a.    379.1
2 (Pt>20 GeV)              3+7      325.5     40.3     9.0     1.44    n.a.    376.4

0 (Pt>10 GeV) e+ nu_e      1        667.2    320.2   113.2    29.7     10.5   1140.7
0 (Pt>15 GeV)              2+4      801.0    218.0    55.5    10.6      2.25  1087.4
0 (Pt>20 GeV)              3+7      876.4    153.4    31.0     4.6      0.67  1066.0

1 (Pt>10 GeV) mu+ nu_mu    1        667.2    320.2   113.2    29.7     10.5   1140.7
1 (Pt>15 GeV)              2+4      801.0    218.0    55.5    10.6      2.25  1087.4
1 (Pt>20 GeV)              3+7      876.4    153.4    31.0     4.6      0.67  1066.0



Dealing with only Uniform Weight events:

If you do not want to deal with weighted events, you can choose the event sample that corresponds to the smallest integrated luminosity, use ALL those events, and then use N = sigma * Lint to determine how many of each other type to use.   Consider again the Z dataset "1" with Pt>10 GeV, i.e. 1.ZLrun1.M.io. 

  1. Each file but the last in each category contains 25k events, so we need to know how many events of each type there are in total.  If you type "ls /pnfs/sam/patriot/Z_jets/1", you will find that the highest M values are 1.Z0run1.10.io, 1.Z1run1.13.io, 1.Z2run1.5.io and 1.Z3run1.6.io, corresponding roughly to 250k, 325k, 125k, and 150k events.

  2. To get a precise count of events, type "sam get metadata --file=1.Z0run1.10.io", etc.  This reveals that each of those categories corresponds to 227327, 307927, 117279, and 147404 events.

  3. Calculate the effective integrated luminosity for each sample:  Lint = N/sigma.  This yields 866 pb-1, 3276 pb-1, 3631 pb-1, and 13208 pb-1.  It is no surprise that the Z+0 parton sample -- arguably the least interesting -- sets the limit here.

  4. Pick a number of events from each sample corresponding to min(Lint) = 866 pb-1.  Thus, we use all of the Z+0 parton events, .264 of the Z+1 parton events (81404), .239 of the Z+2 parton events (27972), and .066 of the Z+3 parton events (9664); see the sketch below.  If you really want to use the events this way, I should generate more Z+0 parton ones.
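The same arithmetic in a short Python sketch, using the cross sections and event counts quoted above:

# Dataset 1, run 1 (Pt > 10 GeV): cross sections (pb) and total event counts.
xsec_pb  = {"Z+0p": 262.6,  "Z+1p": 94.0,   "Z+2p": 32.3,   "Z+3p": 11.18}
n_events = {"Z+0p": 227327, "Z+1p": 307927, "Z+2p": 117279, "Z+3p": 147404}

lint = {k: n_events[k] / xsec_pb[k] for k in xsec_pb}   # Lint = N/sigma, in pb^-1
lint_min = min(lint.values())                           # the limiting (Z+0p) sample

for k in sorted(xsec_pb):
    n_use = int(round(lint_min * xsec_pb[k]))           # N = sigma * Lint
    print(f"{k}: Lint = {lint[k]:8.0f} pb-1, use {n_use} events "
          f"({n_use / n_events[k]:.3f} of the sample)")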

Another view on weighted events:

It is useful to review what it means to "unweight" events.  Unweighting refers to giving events a UNIFORM weight (i.e. the same constant weight) rather than UNIT (=1) weight.  Given a list of weights, the first step is to identify the largest weight.  The weight of each event is then compared to this maximum weight times a (pseudo-)random number: if the individual weight is larger than this product, the event is selected; otherwise it is discarded.  All events that pass this algorithm have the SAME UNIFORM weight, which equals the largest weight.  A good Monte Carlo method will strive to give you almost uniform weights from the start, so that most of the events are selected in the end.  An obvious problem occurs when the maximum weight is anomalously large: typically, this single event corresponds to a very small fraction of the total cross section and can be discarded in favor of the next-largest weight, which is ideally closer to the mean weight.  This procedure can be repeated until a tolerable fraction of the large-weight events has been discarded (1% or so), yielding more unweighted events.

If we understand this unweighting procedure, then we can see how to mix the different samples.  Here the problem is simplified because each sample already has a uniform weight; the only problem is that the uniform weights are not all the same.  For clarity, let us focus on the example above, where the different samples have events with uniform weights of 263 pb/227327 events, 94 pb/307927 events, 32 pb/117279 events, and 11 pb/147404 events, i.e. weights per event of 0.0011569 pb/ev, 0.000305 pb/ev, 0.00027285 pb/ev, and 7.462e-05 pb/ev, respectively.  If we put all these events into the same pot and unweight them, we find that all of the Z+0p events pass, .264 of the Z+1p events pass, .239 of the Z+2p events pass, and .066 of the Z+3p events pass.  This is the same conclusion we came to previously (see the sketch below).
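For completeness, a minimal Python sketch of the accept-reject unweighting described above; the "events" are just placeholder labels, and the weight list simply repeats the per-event weights quoted in the example.

import random

def unweight(events, weights, rng=random.Random(3)):
    """Accept-reject unweighting: keep an event if its weight exceeds
    (max weight) * (random number); survivors all share the same uniform
    weight, equal to the maximum weight."""
    w_max = max(weights)
    kept = [ev for ev, w in zip(events, weights) if w > w_max * rng.random()]
    return kept, w_max

# Toy usage with the per-event weights quoted above (pb/event).
weights = [0.0011569] * 5 + [0.000305] * 5 + [0.00027285] * 5 + [7.462e-05] * 5
events  = [f"event {i}" for i in range(len(weights))]
kept, uniform_weight = unweight(events, weights)
print(len(kept), "events kept, each with weight", uniform_weight, "pb")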

Physics Tidbits:

Here is some important physics information that relates to the validity of the datasets:

  1. Photon radiation is included.

  2. RF TuneA is used for the Underlying Event.

  3. All heavy flavor comes from parton showering.  Special sets with Wb-bbar are under development.

  4. There is no FSR for the semi-hard QCD interactions that constitute the Underlying Event model in Pythia.  As a result, one may observe a spike at Delta_phi(j1,j2) = pi for the W/Z+0 parton samples.