- Difficulty level: intermediate
- Time needed to learn: 20 minutes or less
- Key points:
  - Option `group_by` creates groups (subsets) of input targets
  - Groups are persistent and can be passed from step to step
By default, all input targets are processed at once by the step. If you need to process input files one by one or in pairs, you can define substeps, which apply the step to subgroups of input targets, each represented by the variable `_input`.
In the trivial case when all input targets are processed together, `_input` is the same as `step_input`.
Using option `group_by`, you can group the input targets in a number of ways, the easiest being grouping by `1`:
As you can see, the step process is now executed twice. Whereas `step_input` is the same for both substeps, `_input` is `a.txt` for the first substep and `b.txt` for the second substep. Here we used an internal variable `_index` to show the index of the substep.
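The relationship between `step_input`, `_input`, and `_index` can also be illustrated outside SoS. The following is a minimal plain-Python sketch, not SoS code: the helper `chunk` and the loop are hypothetical stand-ins for what grouping by `1` does to the input list.

```python
def chunk(targets, n):
    """Split a list of targets into consecutive groups of size n."""
    return [targets[i:i + n] for i in range(0, len(targets), n)]


# Stand-in for the full input of the step.
step_input = ["a.txt", "b.txt"]

# Each group plays the role of `_input` in one substep,
# and the loop counter plays the role of `_index`.
substeps = []
for _index, _input in enumerate(chunk(step_input, 1)):
    substeps.append((_index, _input))
    print(f"substep {_index}: _input={_input}, step_input={step_input}")
```

The body of the loop runs once per group, which mirrors how the step process is executed once per substep.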
SoS allows you to group input in a number of ways:
option | group by |
---|---|
`all` | all in a single group, the default |
`single` | individual target |
`pairs` | match first half of files with the second half, take one from each half each time |
`combinations` | all unordered combinations of 2-sets |
`pairwise` | all adjacent 2-sets |
`label` | by labels of input |
`pairlabel` | pair input files by their labels and take one from each label each time |
`N` = `1`, `2`, ... | chunks of size `N` |
`pairsN`, `N` = `2`, `3`, ... | match first half of files with the second half, take `N` from each half each time |
`pairlabelN`, `N` = `2`, `3`, ... | pair input files by their labels and take `N` from each label (if equal size) each time |
`pairwiseN`, `N` = `2`, `3`, ... | all adjacent 2-sets, but each set has `N` items |
`combinationsN`, `N` = `2`, `3`, ... | all unordered combinations of `N` items |
function (e.g. `lambda x: ...`) | a function that returns groups of inputs |
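
To make the order-based options in the table more concrete, here is a plain-Python sketch that emulates several of them on an ordinary list. The function names (`group_all`, `group_pairs`, and so on) are hypothetical and are not part of SoS; they only restate the table's descriptions in code.

```python
from itertools import combinations


def group_all(targets):
    """`all`: everything in a single group."""
    return [list(targets)]


def group_single(targets):
    """`single`: one target per group."""
    return [[t] for t in targets]


def group_pairs(targets):
    """`pairs`: match the first half with the second half (assumes an even count)."""
    half = len(targets) // 2
    return [[a, b] for a, b in zip(targets[:half], targets[half:])]


def group_pairwise(targets):
    """`pairwise`: all adjacent 2-sets."""
    return [[a, b] for a, b in zip(targets, targets[1:])]


def group_combinations(targets, n=2):
    """`combinations` / `combinationsN`: all unordered combinations of n items."""
    return [list(c) for c in combinations(targets, n)]


def group_chunks(targets, n):
    """`N`: consecutive chunks of size n (the last chunk may be shorter in this sketch)."""
    return [targets[i:i + n] for i in range(0, len(targets), n)]


files = ["a1", "a2", "a3", "a4"]
for name, groups in [
    ("all", group_all(files)),
    ("single", group_single(files)),
    ("pairs", group_pairs(files)),
    ("pairwise", group_pairwise(files)),
    ("combinations", group_combinations(files)),
    ("2", group_chunks(files, 2)),
]:
    print(f"{name:>12}: {groups}")
```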
You can group input targets in many different combinations based on their order in the input list. For example, with the following SoS script, the inputs are grouped pairwise:
To demonstrate more acceptable values, the following example uses the `sos_run` action to execute a step with different grouping methods.
We did not include options `pairsN` and `pairwiseN` in the example because we need more input files to see what is going on. As you can see from the following example, the `N` groups input targets into small groups of size `N` before `pairs` and `pairwise` are applied.
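The "chunk first, then pair" rule can be sketched in plain Python. This is an illustration of the description above rather than SoS's own implementation; `chunk`, `pairs_n`, and `pairwise_n` are hypothetical helper names, and the eight file names are made up for the example.

```python
def chunk(targets, n):
    """Split targets into consecutive blocks of size n."""
    return [targets[i:i + n] for i in range(0, len(targets), n)]


def pairs_n(targets, n):
    """Emulate `pairsN`: chunk by n, then match first-half blocks with second-half blocks."""
    blocks = chunk(targets, n)
    half = len(blocks) // 2
    return [a + b for a, b in zip(blocks[:half], blocks[half:])]


def pairwise_n(targets, n):
    """Emulate `pairwiseN`: chunk by n, then join each block with the next one."""
    blocks = chunk(targets, n)
    return [a + b for a, b in zip(blocks, blocks[1:])]


files = [f"a{i}" for i in range(1, 9)]  # a1 ... a8

print("pairs2:   ", pairs_n(files, 2))
print("pairwise2:", pairwise_n(files, 2))
```

With eight inputs and `N=2`, the sketch yields two groups of four for `pairs2` and three overlapping groups of four for `pairwise2`.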
As we recall from the `labels` attribute of `sos_targets`, input targets can carry the `label` of the present step (if specified directly) or of the previously executed steps that produced them. Option `group_by` allows you to group input by label (`group_by='label'`), or to pair labels (`group_by='pairlabel'` and `group_by='pairlabelN'`).
Labeled input is useful when you have input data of different natures. For example:
Here we would like to apply `group_by=1` only to `_input["data"]`, so we pair `_input["data"]` and `_input["reference"]` and group them together with `pairlabel`.
As a more complete example,
The options `pairlabel` and `pairlabel2` need some explanation here because our groups do not have the same size. What these options do is:
- Determine the number of groups `m` from `N` and the longest source.
- Either group or repeat items in sources to create `m` groups.
For example, with `pairlabel2`, we are creating two groups because the largest source has 4 targets (`m=4/2=2`). Then `a1` is repeated twice, `b1` and `b2` are placed in the two groups, and `c1`, `c2` and `c3`, `c4` are placed in the two groups.
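The two-step rule above (determine `m`, then split or repeat each source) can be sketched in plain Python. This is a hedged illustration, not SoS's implementation: `pair_by_label` is a hypothetical helper, sources are modeled as a dict from label to a list of targets, and the order in which a short source is repeated is an assumption.

```python
def pair_by_label(sources, n=1):
    """Pair targets across labels, taking up to n targets from the longest label per group.

    sources: dict mapping a label to its list of targets (hypothetical model).
    """
    longest = max(len(targets) for targets in sources.values())
    m = longest // n  # number of groups
    groups = [[] for _ in range(m)]
    for label, targets in sources.items():
        size = len(targets)
        if size >= m and size % m == 0:
            # Long enough: split evenly into m consecutive pieces.
            per_group = size // m
            pieces = [targets[i * per_group:(i + 1) * per_group] for i in range(m)]
        elif m % size == 0:
            # Too short: repeat items (assumed order: each item repeated consecutively).
            repeat = m // size
            pieces = [[targets[i // repeat]] for i in range(m)]
        else:
            raise ValueError(f"cannot spread {size} targets of {label!r} over {m} groups")
        for group, piece in zip(groups, pieces):
            group.extend(piece)
    return groups


sources = {
    "a": ["a1"],
    "b": ["b1", "b2"],
    "c": ["c1", "c2", "c3", "c4"],
}
print(pair_by_label(sources, n=2))
```

With `n=2` the sketch produces two groups, one containing `a1`, `b1`, `c1`, `c2` and the other `a1`, `b2`, `c3`, `c4`, matching the description above.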
Finally, if none of the predefined grouping mechanisms works, it can be easier to specify a function that takes `step_input` and returns a list of `sos_targets` as `_input`.
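As an illustration of what such a function might compute, the sketch below groups files that share the same name without extension. It operates on plain strings rather than `sos_targets`, and both the function name `by_stem` and the file names are hypothetical; how the function is wired into a step is not shown here.

```python
from itertools import groupby
from pathlib import Path


def by_stem(targets):
    """Group targets that share the same file name without its extension."""
    # sorted() is stable, so files with the same stem keep their input order.
    ordered = sorted(targets, key=lambda t: Path(t).stem)
    return [list(group) for _, group in groupby(ordered, key=lambda t: Path(t).stem)]


step_input = ["s1.bam", "s2.bam", "s1.bai", "s2.bai"]
groups = by_stem(step_input)
for index, group in enumerate(groups):
    print(index, group)
```

A custom function is most useful for rules like this one, where group membership depends on the file names rather than on their position in the input list.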
Pairing input from multiple sources is complicated when we apply `group_by` to a list of targets with different sources. It is actually a lot easier if you apply `group_by` to the sources separately. Fortunately, the function `output_from` accepts `group_by`, so you can regroup the targets before merging them with other sources.
For example, in the following example, `step_10` has 2 output files and `step_20` has 4. By applying `group_by=1` to `output_from('step_10')` and `group_by=2` to `output_from('step_20')`, we create two `sos_targets`, each with two subgroups. The two `sos_targets` will be joined to create a single `_input` for each substep.
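The joining of two separately grouped sources can be pictured with plain lists. The sketch below is an emulation only: the output file names are invented, and `chunk` is a hypothetical helper standing in for the grouping applied to each source.

```python
def chunk(targets, n):
    """Split targets into consecutive groups of size n."""
    return [targets[i:i + n] for i in range(0, len(targets), n)]


# Invented output names for the two upstream steps.
step_10_output = ["a1.txt", "a2.txt"]
step_20_output = ["b1.txt", "b2.txt", "b3.txt", "b4.txt"]

groups_10 = chunk(step_10_output, 1)  # two groups of one target
groups_20 = chunk(step_20_output, 2)  # two groups of two targets

# Join the i-th group of each source to form the i-th `_input`.
joined = [g10 + g20 for g10, g20 in zip(groups_10, groups_20)]
for index, group in enumerate(joined):
    print(index, group)
```

Because both sources end up with the same number of groups, they can be joined group by group, and each substep receives three targets.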
As explained in named input, keyword arguments override the labels of targets, so you can assign names to groups with keyword arguments:
Things can become tricky if you specify both "regular" input and grouped targets from `output_from`. In this case, the regular input will be considered as a `sos_targets` with a single group, and will be merged into every group of the other `sos_targets`.
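A plain-Python picture of this merging rule is given below. It is a sketch under assumptions: `merge_regular` is a hypothetical helper, the file names are invented, and placing the regular input before the grouped targets is an arbitrary choice made for the illustration.

```python
def merge_regular(regular, grouped):
    """Treat `regular` as a single group and merge it into every group of `grouped`."""
    return [list(regular) + list(group) for group in grouped]


regular = ["reference.fa"]           # ungrouped, "regular" input
grouped = [["a1.txt"], ["a2.txt"]]   # targets already grouped by 1

for index, group in enumerate(merge_regular(regular, grouped)):
    print(index, group)
```

Every substep therefore sees the regular input in addition to its own group.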
However, if option `group_by` is specified outside of `output_from`, it will group all targets regardless of their original grouping. For example, in the following example, output from `step_10` will be grouped by 2.
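In list terms, an outer grouping discards the existing groups and starts over from the flat list of targets. The sketch below emulates this with invented file names and a hypothetical `chunk` helper; it is not SoS code.

```python
def chunk(targets, n):
    """Split targets into consecutive groups of size n."""
    return [targets[i:i + n] for i in range(0, len(targets), n)]


# Targets that were already grouped one by one upstream (invented names).
original_groups = [["a1.txt"], ["a2.txt"], ["a3.txt"], ["a4.txt"]]

# An outer grouping ignores the existing groups: flatten first, then regroup by 2.
flat = [target for group in original_groups for target in group]
regrouped = chunk(flat, 2)
print(regrouped)
```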