- Difficulty level: intermediate
- Time needed to learn: 20 minutes or less
- Key points:
- Option group_by creates groups (subsets) of input targets
- Groups are persistent and can be passed from step to step
- Option
By default, all input targets are processed at once by the step. If you need to process input files one by one or in pairs, you can define substeps, which apply the step to subgroups of input targets, each represented by the variable _input.
In the trivial case when all input targets are processed together, _input is the same as step_input.
Using option group_by, you can group the input targets in a number of ways, the simplest being group_by=1:
As you can see, the step process is now executed twice. Whereas the step_input is the same for both substeps, _input is a.txt for the first substep, and b.txt for the second substep. Here we used an internal variable _index to show the index of the substep.
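The substep behavior described above can be mimicked in plain Python. This is only a sketch of the semantics, not SoS code: the helper name substeps is hypothetical, and targets are modeled as plain strings rather than sos_targets.

```python
def substeps(step_input, n=1):
    """Yield (_index, _input) pairs, mimicking group_by=n on a list of targets.

    Hypothetical helper for illustration; SoS does this grouping internally.
    """
    for index, start in enumerate(range(0, len(step_input), n)):
        yield index, step_input[start:start + n]


step_input = ["a.txt", "b.txt"]
for _index, _input in substeps(step_input, n=1):
    # step_input stays the same for every substep; only _input changes
    print(f"substep {_index}: _input={_input}, step_input={step_input}")
```

With n=1 the loop body runs once per file, which corresponds to the two substeps described in the text.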
SoS allows you to group input in a number of ways:
| option | group by |
|---|---|
| all | all in a single group, the default |
| single | individual target |
| pairs | match first half of files with the second half, take one from each half each time |
| combinations | all unordered combinations of 2-sets |
| pairwise | all adjacent 2-sets |
| label | by labels of input |
| pairsource | pair input files by their sources and take one from each source each time |
| N = 1, 2, ... | chunks of size N |
| pairsN, N=2, 3, ... | match first half of files with the second half, take N from each half each time |
| pairlabelN, N=2, 3, ... | pair input files by their labels and take N from each label (if equal size) each time |
| pairwiseN, N=2, 3, ... | all adjacent 2-sets, but each set has N items |
| combinationsN, N=2, 3, ... | all unordered combinations of N items |
| function (e.g. lambda x: ...) | a function that returns groups of inputs |
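To make the order-based options more concrete, the following plain-Python sketch reproduces the groupings that the table describes for pairs, pairwise, and combinations. The function names are hypothetical and the code follows the table's wording only; it is not how SoS implements these options.

```python
from itertools import combinations


def group_pairs(files):
    """'pairs': match the first half with the second half, one from each."""
    if len(files) % 2:
        raise ValueError("this sketch assumes an even number of files")
    half = len(files) // 2
    return [[a, b] for a, b in zip(files[:half], files[half:])]


def group_pairwise(files):
    """'pairwise': all adjacent 2-sets."""
    return [[a, b] for a, b in zip(files, files[1:])]


def group_combinations(files, n=2):
    """'combinations' / 'combinationsN': unordered combinations of n items."""
    return [list(c) for c in combinations(files, n)]


files = ["a1", "a2", "a3", "a4"]  # hypothetical input targets
print(group_pairs(files))
print(group_pairwise(files))
print(group_combinations(files))
```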
You can group input targets in many different combinations based on their order in the input list. For example, with the following sos script, the inputs are grouped pairwise:
To demonstrate more acceptable values, the following example uses the sos_run action to execute a step with different grouping methods.
We did not include options pairsN and pairwiseN in the example because we need more input files to see what is going on. As you can see from the following example, the suffix N first collects input targets into small groups of size N, and pairs or pairwise is then applied to those groups.
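The "chunk first, then pair" reading of pairsN and pairwiseN can be sketched in plain Python as follows. The helper names and the eight file names are hypothetical, and the code follows the description in this section rather than SoS internals.

```python
def chunks(files, n):
    """Split files into consecutive blocks of size n."""
    return [files[i:i + n] for i in range(0, len(files), n)]


def group_pairs_n(files, n):
    """'pairsN': form blocks of n, then match first-half blocks with second-half blocks."""
    blocks = chunks(files, n)
    half = len(blocks) // 2
    return [first + second for first, second in zip(blocks[:half], blocks[half:])]


def group_pairwise_n(files, n):
    """'pairwiseN': form blocks of n, then join each block with the next one."""
    blocks = chunks(files, n)
    return [left + right for left, right in zip(blocks, blocks[1:])]


files = [f"a{i}" for i in range(1, 9)]  # a1 ... a8, hypothetical targets
print(group_pairs_n(files, 2))
print(group_pairwise_n(files, 2))
```

The sketch assumes the number of files is a multiple of N (and of 2N for pairsN); it does not handle leftover files.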
As we recall from the labels attribute of sos_targets, input targets can carry the label of the present step (if specified directly) or of the previously executed steps that produced them. Option group_by allows you to group input by sources (by='label'), or to pair sources (by='pairlabel' and by='pairlabelN').
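As a rough illustration of grouping by label, the sketch below models each target as a (name, label) tuple and collects targets that share a label into one group. The tuple representation, the function name, and the file names are assumptions made for this example; real sos_targets objects carry labels differently.

```python
def group_by_label(targets):
    """Collect targets sharing a label into one group, keeping first-seen order."""
    groups = {}
    for name, label in targets:
        groups.setdefault(label, []).append(name)
    return list(groups.values())


# hypothetical labeled targets
targets = [
    ("a1.txt", "data"),
    ("a2.txt", "data"),
    ("ref.txt", "reference"),
]
print(group_by_label(targets))
```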
Labeled input is useful when you have input data of different kinds. For example
Here we would like to apply group_by=1 only to _input["data"], so we pair _input["data"] with _input["reference"] and group them together with pairlabel.
As a more complete example,
The options pairsource and pairsource2 need some explanation here because our groups do not have the same size. These options:
- Determine the number of groups m from N and the longest source.
- Either group or repeat items in sources to create m groups.
For example, with pairsource2, we are creating two groups because the largest source has 4 targets (m=4/2=2). Then a1 is repeated in both groups, b1 and b2 are placed in one group each, and c1, c2 and c3, c4 form the two halves of the third source.
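The two steps above can be sketched in plain Python. This is one plausible reading of the rule, written so that it reproduces the a1 / b1, b2 / c1..c4 example; the function name and the handling of sizes that do not divide evenly (an error here) are assumptions, not documented SoS behavior.

```python
def pair_sources(sources, n=1):
    """Pair targets from several sources into m groups.

    sources: dict mapping a source name to its ordered list of targets.
    n: number of targets taken from the longest source per group.
    """
    longest = max((len(items) for items in sources.values()), default=0)
    if longest == 0:
        return []
    if longest % n:
        raise ValueError("longest source is not a multiple of n")
    m = longest // n  # step 1: number of groups
    groups = [[] for _ in range(m)]
    for items in sources.values():
        k = len(items)
        if k and k % m == 0:
            # step 2a: split the source evenly across the m groups
            size = k // m
            for i in range(m):
                groups[i].extend(items[i * size:(i + 1) * size])
        elif k and m % k == 0:
            # step 2b: repeat each item so that it covers m // k groups
            repeat = m // k
            for i in range(m):
                groups[i].append(items[i // repeat])
        else:
            raise ValueError("source size incompatible with the number of groups")
    return groups


sources = {"a": ["a1"], "b": ["b1", "b2"], "c": ["c1", "c2", "c3", "c4"]}
print(pair_sources(sources, n=2))
```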
Finally, if none of the predefined grouping mechanisms works, it can be easier to specify a function that takes step_input and returns a list of sos_targets, each of which becomes the _input of one substep.
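As an illustration of what such a function might look like, the sketch below groups files by a shared sample prefix. It operates on plain strings instead of sos_targets, and the function name and file-naming scheme are hypothetical; adapting it to a real step would require returning whatever group type group_by expects.

```python
from collections import defaultdict


def group_by_sample(step_input):
    """Group file names by the text before the first underscore."""
    groups = defaultdict(list)
    for name in step_input:
        sample = name.split("_", 1)[0]  # e.g. "s1" from "s1_R1.fq"
        groups[sample].append(name)
    # dicts preserve insertion order, so groups follow first appearance
    return list(groups.values())


step_input = ["s1_R1.fq", "s2_R1.fq", "s1_R2.fq", "s2_R2.fq"]  # hypothetical files
print(group_by_sample(step_input))
```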
Pairing input from multiple sources is complicated when we apply group_by to a list of targets with different sources. It is actually a lot easier to apply group_by to the sources separately. Fortunately, function output_from accepts group_by, so you can regroup the targets before merging them with other sources.
In the following example, step_10 has 2 output files and step_20 has 4. By applying group_by=1 to output_from('step_10') and group_by=2 to output_from('step_20'), we create two sos_targets, each with two subgroups. The two sos_targets are then joined to create a single _input for each substep.
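The joining of separately grouped sources can be pictured with the plain-Python sketch below. The file names and helper names are hypothetical, and lists of strings stand in for sos_targets.

```python
def chunk(files, n):
    """Group files into consecutive chunks of size n (like group_by=n)."""
    return [files[i:i + n] for i in range(0, len(files), n)]


def join_groups(*grouped):
    """Join the i-th group of every source into a single _input."""
    if len({len(groups) for groups in grouped}) > 1:
        raise ValueError("this sketch assumes every source has the same number of groups")
    return [[t for group in parts for t in group] for parts in zip(*grouped)]


step_10 = ["a1.txt", "a2.txt"]                      # 2 outputs, grouped by 1
step_20 = ["b1.txt", "b2.txt", "b3.txt", "b4.txt"]  # 4 outputs, grouped by 2
for index, _input in enumerate(join_groups(chunk(step_10, 1), chunk(step_20, 2))):
    print(index, _input)
```

Each substep receives one file from step_10 and two from step_20.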
As explained in named input, keyword arguments override the labels of targets, so you can assign names to groups with keyword arguments:
Things can become tricky if you specify both "regular" input and grouped targets from output_from. In this case, the regular input is treated as a sos_targets with a single group and is merged into every group of the other sos_targets.
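A minimal sketch of this merging rule, again with plain lists standing in for sos_targets: the single group of regular input is added to every group of the grouped source. The function name, file names, and the position of the regular input within each group are assumptions for illustration.

```python
def merge_with_regular(regular, grouped):
    """Add the single group of regular input to every group of a grouped source."""
    return [list(regular) + list(group) for group in grouped]


regular = ["ref.txt"]                 # hypothetical regular input, one group
grouped = [["a1.txt"], ["a2.txt"]]    # hypothetical output_from(...) with group_by=1
print(merge_with_regular(regular, grouped))
```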
However, if option group_by is specified outside of output_from, it groups all targets regardless of their original grouping. In the following example, output from step_10 will be grouped by 2.
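The effect of an outer group_by can be pictured as flattening any existing groups and chunking the result again. The sketch below uses plain lists and a hypothetical helper name; it illustrates the rule stated above and is not SoS code.

```python
def regroup(grouped, n):
    """Discard existing groups and chunk all targets into groups of size n."""
    flat = [target for group in grouped for target in group]
    return [flat[i:i + n] for i in range(0, len(flat), n)]


# hypothetical step_10 output that was previously grouped one by one
step_10_groups = [["a1.txt"], ["a2.txt"], ["a3.txt"], ["a4.txt"]]
print(regroup(step_10_groups, 2))
```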