xAOD Skimming

Significant performance gains can be obtained by removing information that is no longer needed from the input files. One quite generic way to do so is to remove events which, for example, fail preselection cuts and therefore never enter any meaningful (phase space) region of the analysis (they never make it into any of your histograms). Additional gains can be obtained by merging several small files into one larger file, since opening and closing files comes with a significant overhead. CAFCore therefore provides a simple way to perform this "skimming" (= removal of events) and to merge input files at the same time.

Usage

When using analysis scripts based on the ones provided in CAFExample, the skimming (+ merging) feature needs two main options in the master configuration file of the analyze step (in older setups: 'runAnalysis'):

xAODdumping.cuts: CutA,CutB,CutC*
xAODdumping.outputDir: /eos/user/j/jdoe/skimmedSamples

The first option specifies which cuts should be used for the skimming. If an event passes any of these cuts (one or more) it will be copied to the output file (the skimmed sample). That is, a logical 'or' is used with respect to passing these Cuts. The Cut names can also include wildcards, as used in the last one in the above example.
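The logical 'or' over the listed Cuts can be illustrated with a minimal Python sketch. It is not the CAF implementation: the function name passes_skimming is hypothetical, and shell-style matching via fnmatch is only an assumption about how the wildcard behaves.

```python
from fnmatch import fnmatchcase


def passes_skimming(passed_cuts, skim_patterns):
    """Return True if at least one passed cut matches at least one pattern.

    passed_cuts:   names of the cuts the event has passed
    skim_patterns: entries of xAODdumping.cuts, possibly with wildcards
    Assumption: wildcards behave like shell-style patterns.
    """
    return any(
        fnmatchcase(cut, pattern)
        for cut in passed_cuts
        for pattern in skim_patterns
    )


patterns = ["CutA", "CutB", "CutC*"]

# Kept: one of the passed cuts is matched by the wildcard entry "CutC*".
print(passes_skimming({"CutPreselection", "CutC_highMass"}, patterns))

# Dropped: none of the passed cuts matches any entry.
print(passes_skimming({"CutPreselection"}, patterns))
```

An event is dropped only if it matches none of the entries, so adding further Cuts to the list can only increase the size of the skimmed sample.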

The second option specifies the path to which the skimmed xAODs are written. When running on a batch system, the code is designed to

  • only merge multiple input files into one output file if they belong to the same DSID. This assumes that a given DSID either has only one input file or, if it consists of multiple input files, that the TQSample instances corresponding to these files are sub-TQSamples of one TQSample for the DSID. These assumptions are typically fulfilled unless the SampleFolder structure is severely post-processed by custom implementations/patches.
  • ensure that there are no file name collisions between different jobs. This is done via the '--jobID' option of the analyze.py script of CAFExample. When using the provided submission script 'submit.py', this is taken care of automatically. The jobID passed in this way is stored on the output SampleFolder of the respective job and is also included in the file name of the skimmed xAOD.

If you only want to skim and/or merge MC samples, you are (almost) good to go with only the two options mentioned above (you should still check the caveats section below!). It can, however, be beneficial to spend a little more thought on the desired result: how high is the acceptance of the skimming cut for each process, and how much should each sample be merged? Merging too aggressively might lead to losing time when using the skimmed samples, since you will typically be limited by the slowest job. It can therefore be beneficial to perform only a moderate merging.

When using submit.py, it is recommended to create a jobs file which specifies one job per DSID. The degree of merging can then be controlled through the maxSampleCount and maxSampleSize parameters of submitAnalysis. As hinted above, each job writes to a disjoint (set of) file(s); if the jobs file is composed as described, the number of sub-jobs for each DSID therefore corresponds to the number of skimmed output files. Fine control can be achieved using "modifier lines" in the jobs file (see here).
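How the two parameters translate into a number of output files can be sketched as follows. This is an illustration only, not the splitting logic of submit.py: it assumes that maxSampleCount caps the number of input files per sub-job and that maxSampleSize caps their summed size, and the function name split_into_subjobs is hypothetical.

```python
def split_into_subjobs(file_sizes, max_sample_count, max_sample_size):
    """Greedily group the input files of one DSID into sub-jobs.

    file_sizes: list of (file_name, size) pairs; sizes use the same
                unit as max_sample_size.
    Each returned group stands for one sub-job, i.e. one skimmed file.
    A single file larger than max_sample_size still gets its own group.
    """
    subjobs, current, current_size = [], [], 0.0
    for name, size in file_sizes:
        too_many = len(current) >= max_sample_count
        too_big = bool(current) and current_size + size > max_sample_size
        if too_many or too_big:
            # Close the current sub-job and start a new one.
            subjobs.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size
    if current:
        subjobs.append(current)
    return subjobs


files = [("a", 400), ("b", 400), ("c", 400), ("d", 100)]

# Moderate merging: two sub-jobs, hence two skimmed output files.
print(split_into_subjobs(files, max_sample_count=3, max_sample_size=1000))

# No merging at all: one output file per input file.
print(split_into_subjobs(files, max_sample_count=1, max_sample_size=1000))
```

Under these assumptions, loosening either limit yields fewer but larger skimmed files, which is the trade-off against the slowest-job limitation described above.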

If your data samples are not grouped in a TQSample (for example, under "data/em", where "em" is a TQSampleFolder, i.e., not a TQSample), you can instruct the framework to treat a TQSampleFolder as if it were a TQSample in this context by placing the tag ".xAODskimming.treatAsSample" on the TQSampleFolder. For the example above, you could include a TQFolder patch file with the content

@data/? { <.xAODskimming.treatAsSample="group.phys-higgs"> }

If this tag is found on a TQSampleFolder, the TQSamples directly inside of it are treated as if they resided inside a TQSample, i.e., their events passing the skimming cut(s) are merged (written to the same output file). The string passed to this option is used as a seed when constructing the output file name. It needs to be set to something that appears in the file names of all input files that should be merged (in this example, all data samples). If no such seed is given, the TQxAODSkimmingAlgorithm tries to construct the output file name based on (in this order of priority): DSID (from tag 'DSID' on the TQSample instance) > name (from tag 'name' on the TQSample instance) > name (TQSample::GetName()).
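The fallback order for the file name seed can be written down as a short Python sketch. It only mirrors the priority list stated above; the function name output_name_seed and the plain dictionary standing in for the tags of a TQSample are hypothetical, and how the seed is turned into the final file name is not shown.

```python
def output_name_seed(explicit_seed, tags, object_name):
    """Pick the seed for the output file name by descending priority.

    explicit_seed: value of .xAODskimming.treatAsSample, or None
    tags:          dict standing in for the tags of the TQSample
    object_name:   stand-in for the result of TQSample::GetName()
    """
    candidates = [explicit_seed, tags.get("DSID"), tags.get("name"), object_name]
    for candidate in candidates:
        if candidate:  # skip None and empty strings
            return str(candidate)
    raise ValueError("no usable seed for the output file name")


# An explicit seed wins over everything else.
print(output_name_seed("group.phys-higgs", {"DSID": 345324}, "sampleA"))

# Without a seed, the DSID tag is used if present.
print(output_name_seed(None, {"DSID": 345324, "name": "ggF"}, "sampleA"))

# Without seed and tags, the object name is the last resort.
print(output_name_seed(None, {}, "sampleA"))
```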

Important Notes and Caveats

While the xAOD skimming in the CAF requires only very little effort, a few things need to be taken into account due to how this feature is implemented:

  • Always make sure to use this feature only with the MultiChannelAnalysisSampleVisitor (MCASV), not with the single channel equivalent! Reason: same as for systematic variations (see below)
  • If you use a sample multiple times in your analysis (example: the fake estimate in the ggF+VBF HWW analysis, split into contributions from fake electrons and fake muons), please disable this split when running the skimming. Otherwise, when reading the samples back, the (total) sumOfWeights of the multiply used samples will be wrong! The reason is that each copy leads to its own output file being written, but each skimmed file contains the full CutBookkeepers and therefore the full sumOfWeights. Since CAF has no way of recognizing this during sample initialization (when reading the skimmed files back), the sumOfWeights summed over all files belonging to the same DSID is counted multiple times.
  • Skimming samples with systematic variations: when skimming samples such as systematics (P)xAODs, one has to ensure that all systematic variations which can cause event migration are processed together. Typically, these are systematics affecting the four-momentum of some objects (electrons, muons, jets, MET, ...). The reason is that an event needs to be kept if it (potentially) enters the final selection in any way. Otherwise one would potentially discard events which match a different channel than the one currently processed, or events which do not pass the selection in the nominal case but do contribute for one of their systematic (four-vector) variations.
  • In order to save computing resources already during the skimming, one should think about what actually needs to be processed to perform the skimming. For example, one should omit Cuts which do not lead to the Cuts used for the skimming (example for HWW: if you decide to skim at some common preselection, you likely do not need to include the ggF (or VBF) specific cut definition files). Another major efficiency boost can be obtained by activating only the systematic variations needed, i.e., those affecting the kinematics of an event (see previous point). Systematics which only modify some weights (e.g. scale factor variations, variations of fake factors) typically have no effect on whether an event passes a certain cut or not.
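Two of the caveats above, the multiply counted sumOfWeights and the need to process kinematic variations together, can be made concrete with a small Python sketch. It is a toy model under simple assumptions, not CAF code: the dictionaries stand in for skimmed files and per-variation cut decisions, and all names and numbers are hypothetical.

```python
def total_sum_of_weights(skimmed_files):
    """Naively sum the bookkept sumOfWeights over all skimmed files of a DSID."""
    return sum(f["sumOfWeights"] for f in skimmed_files)


def keep_event(passes_by_variation):
    """Keep the event if it passes the skimming cut in at least one variation."""
    return any(passes_by_variation.values())


# One original sample with sumOfWeights 1000.0, used twice in the analysis
# (e.g. once per fake flavour). Each copy writes its own skimmed file, and
# each file carries the FULL bookkeeping of the original sample.
skimmed = [
    {"copy": "fakeElectrons", "sumOfWeights": 1000.0},
    {"copy": "fakeMuons", "sumOfWeights": 1000.0},
]
print(total_sum_of_weights(skimmed))  # twice the true value of 1000.0

# An event failing the nominal selection but passing for one four-vector
# variation must still be kept.
print(keep_event({"nominal": False, "jetEnergyScaleUp": True}))

# Skimming with the nominal selection alone would have discarded it.
print(keep_event({"nominal": False}))
```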


Technical Details

The technical implementation uses the TQEventFlaggingAnalysisJob. It is attached to all Cuts specified and is executed whenever one of these Cuts is passed. All it then does is add a 'flag' (decoration) to the EventInfo object. This flag is what the TQxAODSkimmingAlgorithm looks for. If it exists (and its value is 'true'), TEvent::copy is called, copying the entire event (including decorations applied by user code!) to the output (the 'flag' triggering the write-out is removed beforehand).
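The interplay of the two components can be mimicked in a few lines of Python. This is a toy model of the control flow only, not the C++ implementation: events are plain dictionaries, the functions flag_event and skim are stand-ins for the flagging job and the skimming algorithm, and the decoration name in FLAG is hypothetical.

```python
import copy

FLAG = "passesSkimming"  # hypothetical name of the decoration


def flag_event(event, passed_cuts, skim_cuts):
    """Stand-in for the flagging job: decorate the event if a skimming cut passed."""
    if any(cut in skim_cuts for cut in passed_cuts):
        event["decorations"][FLAG] = True


def skim(events):
    """Stand-in for the skimming algorithm: copy flagged events to the output.

    The flag is removed before the copy, so it does not end up in the output,
    while all other (user) decorations are kept.
    """
    output = []
    for event in events:
        if event["decorations"].pop(FLAG, False) is True:
            output.append(copy.deepcopy(event))
    return output


events = [
    {"id": 1, "decorations": {"userVar": 3.2}},
    {"id": 2, "decorations": {}},
]
flag_event(events[0], {"CutA"}, {"CutA", "CutB"})      # passes a skimming cut
flag_event(events[1], {"CutOther"}, {"CutA", "CutB"})  # passes none of them

skimmed = skim(events)
print(skimmed)  # only event 1, with userVar but without the flag
```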