Central Concepts of the CAF

The CAF aims to provide powerful tools and mechanisms for high-level HEP analyses while requiring only minimal user-side code. This section introduces some of the key concepts employed to achieve this.

In the following sections, capitalized or camel-case nouns refer to a concept as it is understood in the CAF, e.g., 'SomeConcept', which is usually implemented in a class called 'TQSomeConcept'.

SampleFolders

Key components of the CAF are the so-called SampleFolders. They are largely inspired by the regular folders found on almost any computer system. On the level of the actual file system, however, SampleFolders are contained inside ROOT files or live in the computer's memory (RAM). This section gives an overview of how one can interact with these structures, which essentially represent the outline of an analysis in the CAF.

Visualizing SampleFolders

The relevant classes TQFolder, TQSampleFolder and TQSample are derived from ROOT's TFolder. A structure composed of such folders can therefore be visualized in a TBrowser:

Visualization of a SampleFolder structure using ROOT's TBrowser

A much more powerful (and more verbose if needed) way to investigate these folder structures is using the TQFolder::print(const TString& options="") method, for example inside an interactive ROOT shell:

Using TQFolder::print to visualize a SampleFolder structure. Here, 'samples' is the pointer to the base folder ('root node'). The argument string consists of the path inside the folder structure to be printed ('sig/em/mh125/ggf') and, separated from it by a colon, the printing options: 'r1d' causes subfolders to be printed recursively up to one level deep and the details column to be shown. Another commonly used option character is 't', which also prints so-called tags.

Hint: For this functionality to be available, the QFramework library needs to be loaded. If the analysis environment is correctly set up, one can use the 'tqroot' wrapper script to do this automatically: simply type 'tqroot' on your shell instead of 'root'. If the SampleFolder in question is contained in a ROOT file it can be automatically loaded via
tqroot -sfr myRootFileWithSampleFolders.root

Extracting Histograms and Counters

In many situations, the names of histograms and counters (=cut-flow entries) stored in a SampleFolder are already known. If, however, this is not the case, one can easily obtain lists of these, e.g., in a tqroot shell:

/* get and print a list of all known histogram identifiers */
TList* l1 = samples->getListOfHistogramNames();
l1->Print();
/* get and print a list of all known counter identifiers */
TList* l2 = samples->getListOfCounterNames();
l2->Print();

Hint: Histogram identifiers are typically composed of the name of the selection stage ('cut') at which they are produced and a name for the distribution contained in the histogram. They therefore typically have the form 'cutName/distributionName'. Counters simply represent yields (of a certain process/sample) after some cut and, hence, they are only identified by the name of the cut.

Once the identifier of a histogram or counter has been found, the corresponding object can be retrieved through

TH1* hist = samples->getHistogram("bkg/em/Zjets","myCut/myDistribution");
TQCounter* count = samples->getCounter("bkg/em/Zjets","myCut");

Hint: Histograms obtained in this way are technically returned as a TH1* (a pointer to a TH1). In ROOT, TH1 is the base class of all histogram classes with one, two, or three dimensions. If the histogram was originally, e.g., a two-dimensional histogram, the TH1 pointer can be cast to a TH2* pointer if needed. The dynamic_cast used here yields a null pointer if the object is not actually a TH2, so the result should be checked before use:
TH2* hist2D = dynamic_cast<TH2*>(hist);

The histograms obtained in this way are plain ROOT histograms and so one can do anything with them that is also possible in a plain ROOT environment, e.g.

hist->Draw();
// or, in case of a 2D histogram:
hist->Draw("colz");

TQCounters are CAF-specific objects which conveniently combine a few numbers: the sum of event weights added to the counter, the sum of squared weights (whose square root is the statistical uncertainty of the sum of event weights), and the raw (unweighted) number of events contributing to the counter. Methods to extract these numbers can be found in the class documentation of TQCounter.
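To make the bookkeeping concrete, the following is a minimal standalone sketch of the three numbers such a counter keeps track of. It is not the TQCounter interface: the type ToyCounter and its members are hypothetical names chosen for illustration, and the only assumption is the usual convention that the statistical uncertainty is the square root of the sum of squared weights.

```cpp
#include <cmath>

// Toy stand-in for a counter (hypothetical; NOT the real TQCounter interface).
struct ToyCounter {
    double sumW  = 0.0;  // sum of event weights
    double sumW2 = 0.0;  // sum of squared event weights
    long   nRaw  = 0;    // raw (unweighted) number of events

    // Register one event with the given weight.
    void add(double weight) {
        sumW  += weight;
        sumW2 += weight * weight;
        ++nRaw;
    }

    // Statistical uncertainty of sumW: square root of the sum of squared weights.
    double error() const { return std::sqrt(sumW2); }
};
```

For three events with weights 0.5, 1.5 and 2.0, this toy counter holds a sum of weights of 4.0, a sum of squared weights of 6.5 and a raw count of 3.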

The counters and histograms obtained in this way are usually not stored as such in the SampleFolder. Instead, SampleFolders automatically sum all individual contributions for the given path. That is, histograms / counters stored in subfolders of a given path are internally combined into a new histogram / counter object, which is then returned.
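The summation over subfolders can be pictured with a small standalone sketch. It only illustrates the idea of recursively combining contributions and makes no claim about how the CAF implements it internally; ToyFolder and sumYield are hypothetical names, and a plain number per cut stands in for a full counter or histogram.

```cpp
#include <map>
#include <memory>
#include <string>

// Toy folder: holds per-cut yields and named subfolders (hypothetical type).
struct ToyFolder {
    std::map<std::string, double> yields;  // cut name -> yield stored in this folder
    std::map<std::string, std::unique_ptr<ToyFolder>> children;

    // Return the subfolder with the given name, creating it if necessary.
    ToyFolder& child(const std::string& name) {
        std::unique_ptr<ToyFolder>& slot = children[name];
        if (!slot) slot.reset(new ToyFolder());
        return *slot;
    }
};

// Sum the yield for `cut` over this folder and all folders below it.
double sumYield(const ToyFolder& folder, const std::string& cut) {
    double total = 0.0;
    auto it = folder.yields.find(cut);
    if (it != folder.yields.end()) total += it->second;
    for (const auto& entry : folder.children) {
        total += sumYield(*entry.second, cut);
    }
    return total;
}
```

Requesting the yield on a parent folder whose three subfolders hold 10, 12.5 and 3 for the same cut returns 25.5, while requesting it on a single subfolder returns only that subfolder's value.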

As an example, consider a SampleFolder with folders for Z→ee, Z→µµ and Z→ττ contributions. Additionally, this split is present in two locations, with paths 'bkg/em/Zjets/[ee+mm+tt]' and 'bkg/me/Zjets/[ee+mm+tt]'.

Structure of a (dummy) example SampleFolder.

In this example, several possibilities exist:

//obtain a counter with only contributions from Z->ee samples in the 'em' branch:
samples->getCounter("bkg/em/Zjets/ee","myCut");
//obtain a counter with the combined contributions from all Z->ll samples in the 'em' branch:
samples->getCounter("bkg/em/Zjets","myCut");
//obtain a counter with the combined contributions from all Z->ll samples and sum over both 'em' and 'me' variants:
samples->getCounter("bkg/[em+me]/Zjets","myCut");
//obtain a counter with the difference between the 'em' and 'me' branches, combined over all Z->ll samples:
samples->getCounter("bkg/[em-me]/Zjets","myCut");
//same as before but the contributions from the 'em' branch scaled by a factor of 2:
samples->getCounter("bkg/[2*em-me]/Zjets","myCut");

The last three examples here show an additional powerful feature of the CAF which is often referred to as path arithmetic.
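One way to think about path arithmetic is as an expansion of the bracketed expression into a list of plain paths, each with a numeric coefficient. The sketch below illustrates this reading for a deliberately restricted toy grammar; it is not the CAF parser. The function expandPath and the type WeightedPath are hypothetical, and the assumed grammar allows at most one bracket group, no nesting, and folder names that contain neither '+' nor '-'.

```cpp
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// One term of an expanded path: a numeric coefficient and a plain path.
typedef std::pair<double, std::string> WeightedPath;

// Expand a single "[...]" group such as "bkg/[2*em-me]/Zjets" into weighted
// plain paths. Toy grammar (an assumption): terms joined by '+' or '-',
// each optionally prefixed by "<number>*".
std::vector<WeightedPath> expandPath(const std::string& path) {
    std::vector<WeightedPath> result;
    const std::size_t open = path.find('[');
    const std::size_t close = path.find(']');
    if (open == std::string::npos || close == std::string::npos || close < open) {
        result.push_back(WeightedPath(1.0, path));  // nothing to expand
        return result;
    }
    const std::string prefix = path.substr(0, open);
    const std::string suffix = path.substr(close + 1);
    const std::string expr = path.substr(open + 1, close - open - 1);

    std::size_t pos = 0;
    while (pos < expr.size()) {
        // Optional leading sign of the term.
        double sign = 1.0;
        if (expr[pos] == '+' || expr[pos] == '-') {
            if (expr[pos] == '-') sign = -1.0;
            ++pos;
        }
        // The term runs up to the next sign or the end of the expression.
        std::size_t end = expr.find_first_of("+-", pos);
        if (end == std::string::npos) end = expr.size();
        std::string term = expr.substr(pos, end - pos);
        // Optional "<number>*" scale factor.
        double factor = 1.0;
        const std::size_t star = term.find('*');
        if (star != std::string::npos) {
            factor = std::atof(term.substr(0, star).c_str());
            term = term.substr(star + 1);
        }
        result.push_back(WeightedPath(sign * factor, prefix + term + suffix));
        pos = end;
    }
    return result;
}
```

With this toy expansion, "bkg/[2*em-me]/Zjets" becomes the two terms (2, "bkg/em/Zjets") and (-1, "bkg/me/Zjets"); the requested counter would then be the correspondingly weighted sum of the counters at those plain paths.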

Manipulating SampleFolders

SampleFolders can be loaded from and written to ROOT files using

//loading from a file:
TQSampleFolder* sf = TQSampleFolder::loadSampleFolder("myFile.root:folderName");
//writing to a file (see https://atlas-caf.web.cern.ch/TQFolder.html#TQFolder:writeToFile for details):
sf->writeToFile("myOtherFile.root", /*filename*/
                true,               /*overwrite file if already existing*/
                2                   /*split at this depth*/
                );

The splitting depth causes the SampleFolder structure to be automatically split into multiple objects at the specified depth. Upon (re)loading, the structure is automatically re-assembled when needed. This trick is needed for larger SampleFolder structures to circumvent some of ROOT's limitations regarding the streaming of large objects.

Individual subfolders can be retrieved using

TQSampleFolder* sub = sf->getSampleFolder("path/to/subfolder");
//alternative: create missing SampleFolders if they don't exist yet:
TQSampleFolder* otherSub = sf->getSampleFolder("path/to/otherFolder+");

In the second example the trailing '+' tells the code to create all parts of the specified path if they don't yet exist.
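The behaviour of the trailing '+' can be mimicked in a few lines of standalone code. This is a sketch of the described semantics only, not the CAF implementation; ToyNode and getNode are hypothetical names, and the toy returns a null pointer when a folder is missing and creation was not requested.

```cpp
#include <map>
#include <memory>
#include <sstream>
#include <string>

// Toy folder node (hypothetical type, not TQFolder).
struct ToyNode {
    std::map<std::string, std::unique_ptr<ToyNode>> children;
};

// Resolve a '/'-separated path below `root`. A trailing '+' requests that
// missing folders be created; without it, a missing folder yields nullptr.
ToyNode* getNode(ToyNode& root, std::string path) {
    bool create = false;
    if (!path.empty() && path[path.size() - 1] == '+') {
        create = true;
        path.erase(path.size() - 1);
    }
    ToyNode* current = &root;
    std::istringstream parts(path);
    std::string name;
    while (std::getline(parts, name, '/')) {
        if (name.empty()) continue;  // tolerate leading or doubled slashes
        auto it = current->children.find(name);
        if (it == current->children.end()) {
            if (!create) return nullptr;
            it = current->children.emplace(
                name, std::unique_ptr<ToyNode>(new ToyNode())).first;
        }
        current = it->second.get();
    }
    return current;
}
```

In this toy, asking for "path/to/sub" on an empty tree returns a null pointer and leaves the tree untouched, whereas "path/to/sub+" creates all three levels and returns the innermost one.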

Hint: For most of this section, the term 'SampleFolder' can be read as the more general 'Folder'. With the exception of methods whose names indicate that they explicitly refer to SampleFolders, a plain TQFolder object can be used in place of a TQSampleFolder (or TQSample). Usually, for the SampleFolder-specific methods discussed here, there is also a variant for plain TQFolders, e.g., getSampleFolder and getFolder. The difference between these two examples is simply the returned type: TQSampleFolder* or TQFolder*. This also means that getFolder works on/for SampleFolders as well, but if one wants to use SampleFolder-specific methods on the TQFolder* returned by getFolder, one first needs to cast it to a TQSampleFolder* pointer. When in doubt, check the class documentation of these classes or start experimenting!

The difference between TQFolders, TQSampleFolders and TQSamples is their degree of 'physics awareness': while TQFolders are fairly agnostic in this sense, TQSampleFolders are aware of different contributions (cf. summation of histograms and counters), and a TQSample represents a particular physics sample (one or more nTuples or xAODs).

Customizing Analysis Behavior for Different Samples and SampleFolders

TQFolders, TQSampleFolders and TQSamples are, like many other classes in the CAF, derived from the TQTaggable class. All such objects can have simple meta information attached to them in key:value fashion. Each tag is identified by its name (the key, a string) and maps it to a value that is another string, a (double-precision) floating-point number, an integer, or a boolean. These pieces of information can later be picked up by other pieces of code, which then adjust their behavior based on the values of the relevant tags. To see which tags are applied to a SampleFolder, one can simply use

samples->print("path/to/folder/in/question:t"); //or ':dt' to print tags and 'details' column
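The idea of typed key:value tags that other code reads with a fallback can be sketched independently of the CAF as follows. All names here (ToyTag, ToyTaggable and its methods) are hypothetical and are not meant to reproduce the TQTaggable interface; consult the class documentation for the real methods.

```cpp
#include <map>
#include <string>

// Toy tag value: one of string, double, integer or boolean (hypothetical type).
struct ToyTag {
    enum Kind { String, Double, Integer, Bool };
    Kind kind;
    std::string s;
    double d;
    int i;
    bool b;
};

// Toy taggable object; the real TQTaggable interface may differ.
class ToyTaggable {
public:
    void setString(const std::string& key, const std::string& v) {
        ToyTag t = make(ToyTag::String); t.s = v; tags_[key] = t;
    }
    void setDouble(const std::string& key, double v) {
        ToyTag t = make(ToyTag::Double); t.d = v; tags_[key] = t;
    }
    void setInteger(const std::string& key, int v) {
        ToyTag t = make(ToyTag::Integer); t.i = v; tags_[key] = t;
    }
    void setBool(const std::string& key, bool v) {
        ToyTag t = make(ToyTag::Bool); t.b = v; tags_[key] = t;
    }

    // Return the tag as a double if it exists and is numeric, else `fallback`.
    double getDouble(const std::string& key, double fallback) const {
        auto it = tags_.find(key);
        if (it == tags_.end()) return fallback;
        if (it->second.kind == ToyTag::Double) return it->second.d;
        if (it->second.kind == ToyTag::Integer) return it->second.i;
        return fallback;
    }
    // Return the tag if it exists and is a boolean, else `fallback`.
    bool getBool(const std::string& key, bool fallback) const {
        auto it = tags_.find(key);
        if (it == tags_.end() || it->second.kind != ToyTag::Bool) return fallback;
        return it->second.b;
    }
    // Return the tag if it exists and is a string, else `fallback`.
    std::string getString(const std::string& key, const std::string& fallback) const {
        auto it = tags_.find(key);
        if (it == tags_.end() || it->second.kind != ToyTag::String) return fallback;
        return it->second.s;
    }

private:
    static ToyTag make(ToyTag::Kind kind) {
        ToyTag t; t.kind = kind; t.d = 0.0; t.i = 0; t.b = false; return t;
    }
    std::map<std::string, ToyTag> tags_;
};
```

Code consuming such tags would typically ask for a value together with a default, e.g. reading a scale factor that falls back to 1.0 when the tag is absent, so that untagged folders keep the default behavior.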

More details on the usage of tags can be found in a dedicated part of the tutorial pages. (TODO: write that part and link it here; if this note is still present at the time you're trying to learn about the CAF, feel free to notify the developers.)

Visitor Pattern

In the CAF, several so-called visitors exist that operate on a SampleFolder structure to perform different actions based on the analysis structure encoded within. Instead of adding more and more functionality to the SampleFolders themselves, this approach leads to an easily expandable set of classes where each class fulfills a certain purpose. In more practical terms, this pattern typically boils down to

Do some work on a particular SampleFolder or Sample
Recursively iterate over all subfolders

Examples of the work being performed include

Initialization of Samples: checking for input files corresponding to the particular Sample and extracting some bookkeeping information from them, such as the (initial) sum of weights of the sample (or partial sample if it is split into multiple files): TQSampleInitializer (see also its base class TQSampleInitializerBase)
Analysis of the events contained in a particular sample by looping over them (executing the so-called 'event loop'): TQAnalysisSampleVisitor and TQMultiChannelAnalysisSampleVisitor (often abbreviated as MCASV)
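The two steps named above (do some work on a node, then recurse into its subfolders) can be sketched in standalone form as follows. This is a generic illustration of the pattern, not the interface of the CAF visitor classes; ToySampleNode, ToyVisitor and NameCollector are hypothetical names.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy folder node with a name and owned subfolders (hypothetical type).
struct ToySampleNode {
    std::string name;
    std::vector<std::unique_ptr<ToySampleNode>> children;

    explicit ToySampleNode(const std::string& n) : name(n) {}

    // Append a new subfolder and return a reference to it.
    ToySampleNode& add(const std::string& childName) {
        children.push_back(
            std::unique_ptr<ToySampleNode>(new ToySampleNode(childName)));
        return *children.back();
    }
};

// Base class: the traversal lives here, the actual work in derived classes.
class ToyVisitor {
public:
    virtual ~ToyVisitor() {}

    // Step 1: do some work on this node. Step 2: recurse into all subfolders.
    void visit(ToySampleNode& node) {
        process(node);
        for (auto& child : node.children) visit(*child);
    }

protected:
    virtual void process(ToySampleNode& node) = 0;
};

// Example visitor: records the names of all nodes in visiting order.
class NameCollector : public ToyVisitor {
public:
    std::vector<std::string> names;

protected:
    void process(ToySampleNode& node) override { names.push_back(node.name); }
};
```

New functionality is then added by writing another class derived from the visitor base, leaving the folder type itself unchanged, which is the extensibility argument made in this section.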

Event Selection: Cuts

A vital aspect of essentially any analysis is the event selection using selection criteria which are often simply called cuts. Due to their central role in analyses, cuts have their own dedicated representation in the CAF. The TQCut class implements this concept by forming hierarchical structures very similar to (Sample)Folders. In fact, cut structures are usually created as a tree of TQFolders which can then be automatically converted into a corresponding structure of TQCuts.

This implies that (analogous to multiple subfolders inside one higher-level folder) multiple cuts can be applied at the same level, i.e., cuts do not need to be orthogonal (even though this might be an analysis choice). A particular event can pass none, one, or several of the cuts which are applied after the same base cut. The structure being hierarchical means that for a cut to be passed, not just that cut's criterion needs to be fulfilled (evaluate to 'true' or simply non-zero) but also the criteria of all preceding cuts (= logical AND).

A schematic representation of a tree-like cut structure with several branches. Each box represents one selection criterion. Only events that pass a certain cut are propagated to its descendant cuts. The similarity between the "SC Leptons" and "OC Leptons" branches happens to be the case in this example but is in no way enforced.
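The hierarchical AND and the possibility of non-orthogonal sibling cuts can be illustrated with a small standalone sketch. It does not use TQCut; ToyEvent, ToyCut, Criterion and collectPassed are hypothetical names, and the event properties are made up for the example.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Toy event with two made-up properties (for illustration only).
struct ToyEvent {
    int nLeptons;
    double met;
};

typedef std::function<bool(const ToyEvent&)> Criterion;

// Toy cut: a name, a selection criterion and any number of descendant cuts.
struct ToyCut {
    std::string name;
    Criterion criterion;
    std::vector<std::unique_ptr<ToyCut>> children;

    ToyCut(const std::string& n, Criterion c) : name(n), criterion(c) {}

    // Append a descendant cut and return a reference to it.
    ToyCut& addCut(const std::string& n, Criterion c) {
        children.push_back(std::unique_ptr<ToyCut>(new ToyCut(n, c)));
        return *children.back();
    }
};

// Record the names of all cuts the event passes. Descendants are only
// evaluated if this cut's own criterion holds (logical AND along a branch).
void collectPassed(const ToyCut& cut, const ToyEvent& event,
                   std::vector<std::string>& passed) {
    if (!cut.criterion(event)) return;
    passed.push_back(cut.name);
    for (const auto& child : cut.children) collectPassed(*child, event, passed);
}
```

If a base cut has the two descendants "twoLep" (exactly two leptons) and "anyMet" (met above 20), an event with two leptons and met of 60 passes both siblings, while an event with one lepton and the same met passes "anyMet" only and never reaches cuts placed below "twoLep".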

Working with Events: AnalysisJobs and Algorithms

Apart from merely selecting events, one typically wants to perform additional operations based on the events passing a set of cuts. Examples of such operations are (weighted) counting of events, filling event properties into histograms, or writing information about selected events into a (mini-)nTuple. In the CAF this is done by so-called AnalysisJobs. Each instance of a TQAnalysisJob is attached to an instance of a TQCut and performs its operation for each event that the cut accepts.

In this way, a cutflow - that is, the evolution of event yields when successively applying more and more selection criteria - can be produced by simply attaching ('booking') a TQCutflowAnalysisJob to each Cut:

TQCutflowAnalysisJob* cf = new TQCutflowAnalysisJob();
myFirstCut->addAnalysisJob(cf,"*"); //adds a copy of the AnalysisJob 'cf' to this cut and all descending cuts

Hint: Since cutflows are fairly lightweight this is usually already implemented in your CAF-based analysis code and one can simply expect all cutflow information to be available once the analysis code is run.
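The interplay of cuts and attached jobs can be pictured with the following standalone sketch. It is a toy model of the description above and not the TQCut/TQAnalysisJob interface: ToyEvt, ToyCutflowJob and ToySelection are hypothetical names, and the boolean 'recursive' flag only loosely mirrors the "*" argument of the snippet above.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Toy event carrying one property and an event weight (hypothetical type).
struct ToyEvt {
    double pt;
    double weight;
};

// Toy cutflow job: accumulates the sum of weights of accepted events.
struct ToyCutflowJob {
    double sumW = 0.0;
    void execute(const ToyEvt& e) { sumW += e.weight; }
};

// Toy cut that owns descendant cuts and copies of the jobs booked on it.
struct ToySelection {
    std::string name;
    std::function<bool(const ToyEvt&)> criterion;
    std::vector<std::unique_ptr<ToySelection>> children;
    std::vector<ToyCutflowJob> jobs;

    ToySelection(const std::string& n, std::function<bool(const ToyEvt&)> c)
        : name(n), criterion(c) {}

    ToySelection& addCut(const std::string& n,
                         std::function<bool(const ToyEvt&)> c) {
        children.push_back(std::unique_ptr<ToySelection>(new ToySelection(n, c)));
        return *children.back();
    }

    // Attach a copy of `job` here and, if `recursive`, to all descendants.
    void book(const ToyCutflowJob& job, bool recursive) {
        jobs.push_back(job);
        if (recursive) {
            for (auto& child : children) child->book(job, true);
        }
    }

    // Run the jobs for an accepted event, then pass it on to the descendants.
    void analyse(const ToyEvt& e) {
        if (!criterion(e)) return;
        for (auto& job : jobs) job.execute(e);
        for (auto& child : children) child->analyse(e);
    }
};
```

Because every cut holds its own copy of the job, the yields after each selection stage are accumulated independently, which is exactly the information a cutflow table needs.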

AnalysisJobs creating histograms or (mini-)nTuples, for example, can be added in the same way. Since these usually need some more configuration (e.g., defining the histogram binning and ranges, which variables to fill, ...), these AnalysisJobs allow their configuration and their assignment to specific cuts to be specified in dedicated configuration files. Details on these (and other) configuration files are (TODO: to be) provided in a later part of the tutorial as well as in the histogram and nTuple definition READMEs.

Hint: If your analysis repository was created by forking from the CAFExample repository, copies of these and other README files should also be contained directly in your own analysis setup.

In some cases the combination of Cuts and attached AnalysisJobs is not particularly suitable. For example, one might want to perform some preprocessing of an event like an overlap removal at the level of physics objects before any cuts on the event level are evaluated. Code performing such pre- and/or post-processing can be included into an analysis in the CAF via Algorithms (please note: despite the name these have no (direct) technical relation to (so called) 'algorithms' in other ATLAS frameworks).