
Input option group_by

  • Difficulty level: intermediate
  • Time needed to learn: 20 minutes or less
  • Key points:
    • Option group_by creates groups (subsets) of input targets
    • Groups are persistent and can be passed from step to step

Parameter group_by and substeps

By default, all input targets are processed at once by the step. If you need to process input files one by one or in pairs, you can define substeps, which apply the step to subgroups of input targets. Within each substep, the current subgroup is available as variable _input.

In the trivial case when all input targets are processed together, _input is the same as step_input.

In [1]:
step input is a.txt b.txt
substep input is a.txt b.txt

Using option group_by, you can group the input targets in a number of ways, the simplest being group_by=1:

In [2]:
input of step is a.txt b.txt
input of substep 0 is a.txt
input of step is a.txt b.txt
input of substep 1 is b.txt

As you can see, the step is now executed twice. While step_input is the same for both substeps, _input is a.txt for the first substep and b.txt for the second. Here we use the internal variable _index to show the index of the substep.
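The substep loop described above can be mimicked in plain Python. This is only an illustrative sketch of the semantics, not SoS code: `substeps` is a hypothetical helper, and the names `step_input`, `_input`, and `_index` are borrowed from the text for readability.

```python
def substeps(step_input, size):
    """Yield (_index, _input) pairs, one per chunk of `size` targets.

    Hypothetical helper that imitates what group_by=<size> does.
    """
    for _index, start in enumerate(range(0, len(step_input), size)):
        yield _index, step_input[start:start + size]


step_input = ["a.txt", "b.txt"]
lines = []
for _index, _input in substeps(step_input, 1):
    # step_input stays the same; only _input changes between substeps
    lines.append(f"input of step is {' '.join(step_input)}")
    lines.append(f"input of substep {_index} is {' '.join(_input)}")
print("\n".join(lines))
```

Calling `substeps(step_input, len(step_input))` gives the default behavior, a single substep whose `_input` equals `step_input`.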

SoS allows you to group input in a number of ways:

option                          group by
all                             all in a single group, the default
single                          individual target
pairs                           match first half of files with the second half, take one from each half each time
combinations                    all unordered combinations of 2-sets
pairwise                        all adjacent 2-sets
label                           by labels of input
pairlabel                       pair input files by their labels and take one from each label each time
N = 1, 2, ...                   chunks of size N
pairsN, N=2, 3, ...             match first half of files with the second half, take N from each half each time
pairlabelN, N=2, 3, ...         pair input files by their labels and take N from each label (if equal size) each time
pairwiseN, N=2, 3, ...          all adjacent 2-sets, but each set has N items
combinationsN, N=2, 3, ...      all unordered combinations of N items
function (e.g. lambda x: ...)   a function that returns groups of inputs

Group by order of input targets

You can group input targets in many different ways based on their order in the input list. For example, with the following SoS script, the input targets are grouped pairwise:

In [3]:
file1 file2
file2 file3
file3 file4

To demonstrate more acceptable values, the following example uses the sos_run action to execute a step with different grouping methods.

In [4]:
group_by=1
0: file1
1: file2
2: file3
3: file4

group_by=2
0: file1 file2
1: file3 file4

group_by=single
0: file1
1: file2
2: file3
3: file4

group_by=pairs
0: file1 file3
1: file2 file4

group_by=pairwise
0: file1 file2
1: file2 file3
2: file3 file4

group_by=combinations
0: file1 file2
1: file1 file3
2: file1 file4
3: file2 file3
4: file2 file4
5: file3 file4

group_by=combinations3
0: file1 file2 file3
1: file1 file2 file4
2: file1 file3 file4
3: file2 file3 file4
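The order-based groupings listed above can be reproduced with a few lines of plain Python. This is a sketch of the semantics only, not how SoS implements them; the helper names `chunks`, `pairs`, `pairwise`, and `combos` are hypothetical.

```python
from itertools import combinations


def chunks(targets, n):
    """group_by=N: consecutive chunks of size n."""
    return [targets[i:i + n] for i in range(0, len(targets), n)]


def pairs(targets):
    """group_by='pairs': match the first half with the second half."""
    half = len(targets) // 2
    return [[a, b] for a, b in zip(targets[:half], targets[half:])]


def pairwise(targets):
    """group_by='pairwise': all adjacent 2-sets."""
    return [[a, b] for a, b in zip(targets, targets[1:])]


def combos(targets, n=2):
    """group_by='combinations' / 'combinationsN': unordered n-sets."""
    return [list(c) for c in combinations(targets, n)]


files = ["file1", "file2", "file3", "file4"]
for name, groups in [
    ("2", chunks(files, 2)),
    ("pairs", pairs(files)),
    ("pairwise", pairwise(files)),
    ("combinations3", combos(files, 3)),
]:
    print(f"group_by={name}")
    for index, group in enumerate(groups):
        print(f"{index}: {' '.join(group)}")
```

`group_by='single'` corresponds to `chunks(files, 1)`, and `group_by='all'` to a single group holding every target.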

We did not include options pairsN and pairwiseN in the example because more input files are needed to see what is going on. As the following example shows, the suffix N first groups input targets into chunks of size N, and pairs and pairwise are then applied to these chunks.

In [5]:
group_by=pairs2
0: A1 B1 A3 B3
1: A2 B2 A4 B4

group_by=pairwise2
0: A1 B1 A2 B2
1: A2 B2 A3 B3
2: A3 B3 A4 B4
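The "chunk first, then pair" behavior can be sketched in plain Python as follows. The input list is an assumption inferred from the output above, and `pairs_n` and `pairwise_n` are hypothetical helpers rather than SoS functions.

```python
def chunks(targets, n):
    """Split targets into consecutive blocks of size n."""
    return [targets[i:i + n] for i in range(0, len(targets), n)]


def pairs_n(targets, n):
    """pairsN: pair blocks of the first half with blocks of the second half."""
    blocks = chunks(targets, n)
    half = len(blocks) // 2
    return [a + b for a, b in zip(blocks[:half], blocks[half:])]


def pairwise_n(targets, n):
    """pairwiseN: join each block with the block that follows it."""
    blocks = chunks(targets, n)
    return [a + b for a, b in zip(blocks, blocks[1:])]


# Assumed input order, inferred from the output shown above
files = ["A1", "B1", "A2", "B2", "A3", "B3", "A4", "B4"]
for index, group in enumerate(pairs_n(files, 2)):
    print(f"{index}: {' '.join(group)}")
```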

Group by label of input

As we recall from the labels attribute of sos_targets, input targets are labeled either with the name of the present step (if they are specified directly) or with the name of the previously executed step that produced them. Option group_by allows you to group input by label (group_by='label'), or to pair labels (group_by='pairlabel' and group_by='pairlabelN').

Labeled input is useful when you have input data of different natures, for example:

In [6]:
Process data sample1.txt with reference reference.txt
Process data sample2.txt with reference reference.txt

Here we would like to apply group_by=1 only to _input["data"], so we pair _input["data"] with _input["reference"] and group them together with pairlabel.

As a more complete example,

In [7]:
group_by=label
0: c1 c2 c3 c4 from ['group_step', 'group_step', 'group_step', 'group_step']
1: a1 from ['step_10']
2: b1 b2 from ['step_20', 'step_20']

group_by=pairlabel
0: c1 a1 b1 from ['group_step', 'step_10', 'step_20']
1: c2 a1 b1 from ['group_step', 'step_10', 'step_20']
2: c3 a1 b2 from ['group_step', 'step_10', 'step_20']
3: c4 a1 b2 from ['group_step', 'step_10', 'step_20']

group_by=pairlabel2
0: c1 c2 a1 b1 from ['group_step', 'group_step', 'step_10', 'step_20']
1: c3 c4 a1 b2 from ['group_step', 'group_step', 'step_10', 'step_20']

The options pairlabel and pairlabel2 need some explanation here because our label groups do not have the same size. What these options do is:

  1. Determine the number of groups m from N and the size of the largest label group.
  2. Either group or repeat the items of each label to create m groups.

For example, with pairlabel2 we create two groups because the largest label group has 4 targets (m=4/2=2). Then a1 is repeated in both groups, b1 and b2 are placed in separate groups, and the c targets are split into c1, c2 and c3, c4.
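The two steps above can be sketched in plain Python. This is a reconstruction from the description and the output shown, not the SoS implementation; `pair_by_label` is a hypothetical helper, and the rule for label groups that divide unevenly (raising an error) is an assumption.

```python
def pair_by_label(targets, labels, n=1):
    """Pair targets by label, taking n from the largest label group each time."""
    # Collect targets per label, keeping the order of first appearance
    by_label = {}
    for target, label in zip(targets, labels):
        by_label.setdefault(label, []).append(target)

    # Step 1: number of groups m from n and the largest label group
    longest = max(len(items) for items in by_label.values())
    if longest % n:
        raise ValueError("largest label group is not a multiple of n")
    m = longest // n

    # Step 2: split or repeat the items of each label to fill m groups
    groups = [[] for _ in range(m)]
    for items in by_label.values():
        size = len(items)
        if size >= m and size % m == 0:
            per_group = size // m
            for i in range(m):
                groups[i].extend(items[i * per_group:(i + 1) * per_group])
        elif size < m and m % size == 0:
            repeat = m // size
            for i in range(m):
                groups[i].append(items[i // repeat])
        else:
            raise ValueError("label group cannot be spread over m groups")
    return groups


targets = ["c1", "c2", "c3", "c4", "a1", "b1", "b2"]
labels = ["group_step"] * 4 + ["step_10"] + ["step_20"] * 2
for index, group in enumerate(pair_by_label(targets, labels, n=2)):
    print(f"{index}: {' '.join(group)}")
```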

Group by user-defined function

Finally, if none of the predefined grouping mechanisms works, you can specify a function that takes step_input and returns a list of sos_targets, each of which becomes the _input of a substep.

In [8]:
0: c1
1: c2 c3
2: c4 c5 c6
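The function used in the cell above is not shown here. As a purely hypothetical illustration, a grouping of this shape (groups of size 1, 2, 3, ...) could be produced by a function like the one below. It works on plain Python lists, whereas a real group_by function would receive step_input.

```python
def growing_groups(targets):
    """Return groups of increasing size: 1 target, then 2, then 3, ..."""
    groups, start, size = [], 0, 1
    while start < len(targets):
        groups.append(targets[start:start + size])
        start += size
        size += 1
    return groups


inputs = ["c1", "c2", "c3", "c4", "c5", "c6"]
for index, group in enumerate(growing_groups(inputs)):
    print(f"{index}: {' '.join(group)}")
```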

Parameter group_by of output_from and named_output

Pairing input from multiple sources is complicated when we apply group_by to a list of targets with different sources. It is a lot easier to apply group_by to each source separately. Fortunately, the functions output_from and named_output accept group_by, so you can regroup the targets of one source before merging them with other sources.

In the following example, step_10 has 2 output files and step_20 has 4. By applying group_by=1 to output_from('step_10') and group_by=2 to output_from('step_20'), we create two sos_targets, each with two groups. The two sos_targets are then joined group by group to create a single _input for each substep.

In [9]:
0: a1 c1 c2 from ['step_10', 'step_20', 'step_20']
1: a2 c3 c4 from ['step_10', 'step_20', 'step_20']
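The group-by-group join can be imitated in plain Python by tagging each target with its label. This is a sketch only: `grouped` is a hypothetical stand-in for a grouped source, and the file names are taken from the output above.

```python
def grouped(targets, label, size):
    """Split one source into groups of (target, label) pairs."""
    tagged = [(target, label) for target in targets]
    return [tagged[i:i + size] for i in range(0, len(tagged), size)]


# Each source is grouped on its own before the sources are joined
s10 = grouped(["a1", "a2"], "step_10", 1)
s20 = grouped(["c1", "c2", "c3", "c4"], "step_20", 2)

# Join the i-th group of one source with the i-th group of the other
joined = [a + b for a, b in zip(s10, s20)]
for index, group in enumerate(joined):
    names = " ".join(target for target, _ in group)
    group_labels = [label for _, label in group]
    print(f"{index}: {names} from {group_labels}")
```

Passing a different label to `grouped`, for example `"s20"` instead of `"step_20"`, mirrors the effect of assigning a name to a source.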

As explained in the discussion of named input, keyword arguments override the labels of targets, so you can assign names to groups with keyword arguments:

In [10]:
0: a1 c1 c2 from ['step_10', 's20', 's20']
1: a2 c3 c4 from ['step_10', 's20', 's20']

Things can become tricky if you specify both "regular" input and grouped targets from output_from. In this case, the regular input is treated as a sos_targets with a single group and is merged into every group of the other sos_targets.

In [11]:
Substep 0
substep input is a1 c1 c2 e1 e2 from ['step_10', 'step_20', 'step_20', 'my', 'my']

Substep 1
substep input is a2 c3 c4 e1 e2 from ['step_10', 'step_20', 'step_20', 'my', 'my']
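The merging rule, where a source with a single group is repeated for every substep, can be sketched in plain Python. `merge_groups` is a hypothetical helper, and rejecting sources with any other number of groups is an assumption made for this sketch.

```python
def merge_groups(*sources):
    """Merge grouped sources; a single-group source joins every substep."""
    n_groups = max(len(source) for source in sources)
    merged = []
    for i in range(n_groups):
        group = []
        for source in sources:
            if len(source) == 1:
                group.extend(source[0])   # repeat the only group
            elif len(source) == n_groups:
                group.extend(source[i])   # take the matching group
            else:
                raise ValueError("sources have incompatible group counts")
        merged.append(group)
    return merged


step_10 = [["a1"], ["a2"]]               # grouped by 1
step_20 = [["c1", "c2"], ["c3", "c4"]]   # grouped by 2
regular = [["e1", "e2"]]                 # ungrouped, so one group
for index, group in enumerate(merge_groups(step_10, step_20, regular)):
    print(f"Substep {index}: {' '.join(group)}")
```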

However, if option group_by is specified outside of output_from, it groups all targets regardless of their original grouping. In the following example, the output from step_10 and the regular input are grouped by 2.

In [12]:
Substep 0
substep input is c1 c2 from ['step_10', 'step_10']

Substep 1
substep input is c3 c4 from ['step_10', 'step_10']

Substep 2
substep input is e1 e2 from ['my', 'my']
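In plain Python terms, an outer group_by behaves as if all targets were first flattened into one list and then regrouped. The sketch below assumes, based on the output above, that step_10 produced c1 to c4 and that the regular input is e1 and e2. It illustrates the semantics only, not the SoS implementation.

```python
# Each source as a list of groups (assumed contents, inferred from the output)
from_step_10 = [["c1", "c2", "c3", "c4"]]
regular = [["e1", "e2"]]

# An outer group_by ignores existing groups: flatten everything first ...
flat = [
    target
    for source in (from_step_10, regular)
    for group in source
    for target in group
]

# ... then regroup the flat list into chunks of 2
regrouped = [flat[i:i + 2] for i in range(0, len(flat), 2)]
for index, group in enumerate(regrouped):
    print(f"Substep {index}: {' '.join(group)}")
```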