Analysis Checklist
A typical analysis requires several "ingredients", which the CAF therefore anticipates.
These include:
- Data and Monte-Carlo (MC) samples
- Cross-section values for MC samples
- Criteria for event selection: 'cuts'
- Histogram definitions, usually with certain (event) selection criteria applied
With the exception of the first item, these ingredients are usually read from dedicated text files when performing an analysis in the CAF: cross section files, cut definition files, and histogram definition files. In the CAF, a few additional (technical) definitions are used:
- A selection of which MC samples should be used when the full set of available MC samples contains, e.g., samples for the same physics process simulated using different MC generators: whitelist files.
- A mapping of technical sample names (DSIDs) to more human-readable strings and grouping of samples by (similar) physical processes: mapping files.
The CAF Workflow
This section gives a brief overview of the usual (technical) steps leading from the ingredients listed in the 'Analysis Checklist' to graphics and tables. Please note that most changes to an analysis only require going through a subset of these steps again. For example, when adding new histograms, only the "analyze" and subsequent "visualize" steps need to be repeated.
Hint: For those who used or are still using setups based on the initial HWWAnalysisCode, please consider the following 'translations':
- makeSampleFile = 'prepare' + 'initialize'
- runAnalysis = 'analyze'
- readAnalysis = 'visualize'
The steps were slightly reorganized to reduce turn-around time and renamed in the hope of providing more descriptive names for the respective scripts/steps.
Preparing the SampleFolder (prepare.py)
Main Ingredients: cross section file, whitelist file, mapping file
In this simple step, a scaffold SampleFolder structure is created, containing only some basic information:
- Luminosity to normalize MC samples to
- Cross sections (as well as k-factors, and filter efficiencies if applicable) corresponding to each DSID, obtained from a cross section file
- Which samples (MC processes/DSIDs) are to be included, obtained from a whitelist file
- Mapping of samples/DSIDs to their designated paths in the SampleFolder structure, obtained from a mapping file
- Listing of analysis channels, e.g., depending on the final state signature: two electrons 'ee', two muons 'mm', one electron + one muon 'em'/'me'
The SampleFolder created from this information is then written to a ROOT file and picked up again by the next step. It is important to note that at this point none of the input files (data or Monte Carlo) have been discovered and linked to the samples in the SampleFolder - meaning that this step can succeed regardless of whether the input files are actually accessible (or, indeed, even exist).
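The quantities collected here (luminosity, cross sections, k-factors, filter efficiencies) are later combined with the sumOfWeights extracted during initialization into the per-event MC normalization. A minimal sketch of this common convention - the function name and all values below are hypothetical, not part of the CAF API:

```python
def mc_normalization(lumi_pb, xsec_pb, k_factor, filter_eff, sum_of_weights):
    """Scale factor applied to each MC event weight so that the sample
    integrates to the expected yield: L * sigma * k * eps / sum(w)."""
    return lumi_pb * xsec_pb * k_factor * filter_eff / sum_of_weights

# hypothetical example values, not taken from any real cross section file
scale = mc_normalization(lumi_pb=139000.0, xsec_pb=1.9, k_factor=1.1,
                         filter_eff=0.5, sum_of_weights=1.0e7)
```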
Initializing the SampleFolder (initialize.py)
Main Ingredients: data and MC input files
Hint: In earlier setups this step was combined with the previous one and known as 'makeSampleFile'. It was split in order to allow for easier parallelization of the extraction of meta-information from MC samples on batch systems.
This step continues where the "prepare" step left off - its main purpose is to discover and include the data and MC input files.
For data, the following action is taken:
- Discovery and creation of links (TQSamples) to data samples (in a more abstract sense: samples for which no meta-information like the sumOfWeights needs to be extracted). The file paths can either be passed via a text file or, if they are on a mounted file system (roughly: where 'ls' works), via the CAF's automatic discovery capabilities by providing the directory in which the files are located and a name pattern matching the files to be included.
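The automatic discovery can be pictured as a simple directory listing with a shell-style name pattern. A rough, self-contained sketch - the helper function and file names are made up for illustration, this is not the CAF's actual implementation:

```python
import glob
import os
import tempfile

def discover_files(directory, pattern):
    """Rough analogue of the automatic discovery: return all files in
    `directory` whose names match the shell-style `pattern`."""
    return sorted(glob.glob(os.path.join(directory, pattern)))

# demo on a throwaway directory with hypothetical file names
workdir = tempfile.mkdtemp()
for name in ("data15.root", "data16.root", "mc16.361106.root"):
    open(os.path.join(workdir, name), "w").close()

data_files = discover_files(workdir, "data*.root")
```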
For MC samples, two additional aspects need to be considered:
- Match the available input files (MC samples) to their corresponding DSID, i.e., the right location in the SampleFolder. To this end, pre- and suffix-patterns can be provided. A file is then matched to a DSID if its file name (possibly including its path) matches '*<prefix>*<DSID>*<suffix>'.
- Extract the (generated) sumOfWeights for each DSID to compute the normalization to be applied to the sample in later steps.
If multiple files match for one DSID, the CAF uses all of these and automatically accounts for the combined sumOfWeights.
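The matching rule and the combination of the sumOfWeights can be illustrated with a short sketch - the file names, weight values, and helper function are hypothetical:

```python
import fnmatch

def files_for_dsid(filenames, dsid, prefix="", suffix=".root"):
    """Match files to a DSID as described above: the file name
    (possibly including its path) must match '*<prefix>*<DSID>*<suffix>'."""
    pattern = "*{}*{}*{}".format(prefix, dsid, suffix)
    return [f for f in filenames if fnmatch.fnmatch(f, pattern)]

files = ["mc/mc16.361106.Zee.part1.root",
         "mc/mc16.361106.Zee.part2.root",
         "mc/mc16.410470.ttbar.root"]
matched = files_for_dsid(files, "361106", prefix="mc16")

# combined sumOfWeights over all matched files (values are made up)
sum_w = {"mc/mc16.361106.Zee.part1.root": 5.0e6,
         "mc/mc16.361106.Zee.part2.root": 3.0e6}
total_sum_of_weights = sum(sum_w[f] for f in matched)
```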
The extraction of the sumOfWeights from each file can take a considerable amount of time. This will only increase as more data and correspondingly larger MC samples become available and the information is spread over more files. The initialization factorizes over DSIDs, that is, each DSID can be initialized independently of the others. Hence, this step can be submitted, e.g., to a computing cluster ('batch system') where each job initializes only the Samples of one or a few DSIDs. The following sections assume that, if this parallelization is used, the individual outputs of the different jobs have been merged back into a single SampleFolder.
Hint: If the merging at this point is particularly inconvenient for your case (or takes considerable amounts of time), please do not hesitate to ask an expert for some help on how to avoid merging at this point.
Analyzing the SampleFolder (analyze.py)
Main Ingredients: MC and data samples, cut definition file, histogram definition file(, nTuple definition file, event-list definition file, ...)
During this step, the individual events contained in the MC and data samples are analyzed, that is the so called 'event loop' (not to be confused with another framework using this broadly used term as its name) is executed. In the CAF, the operations performed within the event loop are defined through Cuts, AnalysisJobs and Algorithms (see Concepts section). Again, this step makes use of the SampleFolder as produced by the previous step ('initialize').
The most frequently performed tasks in this step are the creation of cutflows and histograms. Both histograms and cutflow entries (Counters) are stored directly in the SampleFolder (this also implies that no graphics are produced, yet).
Hint: In the CAF, by default, histograms and counters are stored separately for each individual Sample. This is often useful for debugging an analysis setup, but can require more computing resources. Therefore in a full production, one often 'pools' these histograms together on-the-fly during runtime and at a more general location inside the SampleFolder. The hierarchical structure of a SampleFolder makes this fairly intuitive. For example, one might have a subSampleFolder for Z->ll samples which is further split into SampleFolders for different decay modes of the Z boson (two electrons, muons, or taus). Pooling histograms at the combined Z->ll level results in lower resource usage but also less granular information compared to pooling at the SampleFolder level of the different decay modes.
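The effect of pooling at different SampleFolder levels can be illustrated with a toy sketch that uses simple bin-content counters in place of real histograms - the paths and values are made up:

```python
from collections import Counter

# hypothetical per-sample histograms, represented as bin -> content maps
per_sample = {
    "/bkg/Zll/Zee/sample1": Counter({0: 4, 1: 6}),
    "/bkg/Zll/Zee/sample2": Counter({0: 1, 1: 2}),
    "/bkg/Zll/Zmm/sample3": Counter({0: 3, 1: 5}),
}

def pool(histograms, level):
    """Sum histograms whose paths share the first `level` components,
    mimicking on-the-fly pooling at a more general SampleFolder level."""
    pooled = {}
    for path, hist in histograms.items():
        key = "/".join(path.split("/")[:level + 1])
        pooled.setdefault(key, Counter())
        pooled[key].update(hist)
    return pooled

# pooling at the combined Z->ll level vs. the per-decay-mode level
zll_level = pool(per_sample, 2)   # one pooled histogram under /bkg/Zll
mode_level = pool(per_sample, 3)  # one each under /bkg/Zll/Zee and /bkg/Zll/Zmm
```

Pooling higher in the hierarchy means fewer stored histograms (lower resource usage) but, as noted above, also less granular information.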
Amongst the steps described on this page, this one creates the largest computational load. Again, the turn-around time can be reduced to more reasonable durations through the use of a computing cluster and a subsequent merging operation.
Hint: While the splitting into individual jobs can, in principle, happen at the level of individual input files (MC or data), the best performance is often found with a slightly coarser splitting. Many samples are small enough that the job startup time dominates over the event loop itself. A quite successful approach is to split jobs such that each one processes a similar cumulative size of input files.
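One way to realize such a splitting is a simple greedy grouping by cumulative file size - a sketch, not the CAF's actual job-splitting logic (names and sizes are made up):

```python
def split_jobs(file_sizes, target_size):
    """Greedy grouping: fill each job with files until the cumulative
    input size would exceed `target_size`, then start a new job."""
    jobs, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > target_size:
            jobs.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        jobs.append(current)
    return jobs

# hypothetical input files with sizes in GB
sizes = [("a.root", 7), ("b.root", 1), ("c.root", 1),
         ("d.root", 4), ("e.root", 5)]
jobs = split_jobs(sizes, target_size=8)
```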
Visualizing the Analysis Results (visualize.py)
Main Ingredients: list of processes, list of cuts(, style file)
To reduce the need to rerun the event loop, the visualization (that is, the creation of graphics, tables, ...) is realized as its own step. This makes it possible to quickly change visualization settings such as y-axis ranges, colors, and labels. Again, the SampleFolder produced in the previous step is used as a starting point.
The most important configurations for this step define which cut entries should be shown in cutflow tables and which parts of the SampleFolder should be considered one 'process' for histograms and cutflows. A 'process' in this sense corresponds to one entry in the legend of the histograms/graphics produced in this step (or one column in the cutflow tables). Optionally, one can use 'style files' (written in TQFolder syntax) to deposit various style settings directly on the SampleFolder. When a histogram is retrieved from the SampleFolder, the code will automatically apply these styles to the (individual) histogram (provided there is no ambiguity in the style options).
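The notion of a 'process' as a set of SampleFolder paths contributing to one legend entry (or one cutflow-table column) can be sketched as follows - the labels, paths, and yields are hypothetical:

```python
# hypothetical process list: legend label -> contributing SampleFolder paths
processes = [
    ("Data",  ["/data"]),
    ("Z->ll", ["/bkg/Zll"]),
    ("Top",   ["/bkg/ttbar", "/bkg/singletop"]),
]

# made-up event yields per SampleFolder path at some cut stage
yields = {"/data": 1052.0, "/bkg/Zll": 840.0,
          "/bkg/ttbar": 150.0, "/bkg/singletop": 40.0}

def cutflow_column(processes, yields):
    """One cutflow-table entry per process: sum the yields of all
    SampleFolder paths contributing to that legend entry."""
    return {label: sum(yields[p] for p in paths) for label, paths in processes}

column = cutflow_column(processes, yields)
```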