Working with R

Difficulty level: easy
Time needed to learn: 15 minutes or less
Key points:
- Most Python (SoS) data types have an intuitive corresponding R data type

Installation

There are several ways to install R and its Jupyter kernel irkernel. The easiest might be conda, but installing third-party R libraries into a conda environment can be tricky, and mixing R packages from the base and r channels can lead to devastating results.

Once you have a working R installation with irkernel installed, you will need to install:

- the sos-r language module,
- the arrow library of R, and
- the feather-format module of Python.

The arrow and feather-format modules are needed to exchange data frames between Python and R.
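As a quick sanity check on the Python side, the following minimal sketch reports whether the two Python modules can be found. The import names `sos_r` and `feather` are assumptions about how the sos-r and feather-format packages are exposed, and `check_python_modules` is a hypothetical helper, not part of SoS. The R-side arrow library is not checked here.

```python
from importlib.util import find_spec


def check_python_modules(names=("sos_r", "feather")):
    """Report which top-level modules can be located without importing them.

    The default names are assumptions about the import names of the
    sos-r and feather-format packages.
    """
    return {name: find_spec(name) is not None for name in names}


status = check_python_modules()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```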

Overview

SoS converts Python variables of the following types to R as follows:

Python	condition	R
`None`		`NULL`
`integer`		`integer`
`integer`	`large`	`numeric`
`float`		`numeric`
`boolean`		`logical`
`complex`		`complex`
`str`		`character`
Sequence (`list`, `tuple`, ...)	homogeneous type	`c()`
Sequence (`list`, `tuple`, ...)	multiple types	`list`
`set`		`list`
`dict`		`list` with names
`numpy.ndarray`		`array`
`numpy.matrix`		`matrix`
`pandas.DataFrame`		R `data.frame`
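The built-in rows of this table can be read as a decision procedure. The sketch below is only an illustration of the table, not the SoS implementation: `r_type_for` is a hypothetical helper, "large" is assumed to mean outside R's 32-bit integer range, and "homogeneous" is assumed to mean that all items share one Python type. The numpy and pandas rows are left out to keep the example dependency-free.

```python
from collections.abc import Sequence

# Assumption: an integer is "large" when it does not fit R's 32-bit integer.
R_INT_MAX = 2**31 - 1


def r_type_for(value):
    """Return the R type named in the table for a built-in Python value."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # must precede int: bool is a subclass of int
        return "logical"
    if isinstance(value, int):
        return "integer" if abs(value) <= R_INT_MAX else "numeric"
    if isinstance(value, float):
        return "numeric"
    if isinstance(value, complex):
        return "complex"
    if isinstance(value, str):  # must precede Sequence: str is a Sequence
        return "character"
    if isinstance(value, dict):
        return "list with names"
    if isinstance(value, (set, frozenset)):
        return "list"
    if isinstance(value, Sequence):
        # Assumption: homogeneous means every item has the same Python type.
        item_types = {type(item) for item in value}
        return "c()" if len(item_types) <= 1 else "list"
    raise TypeError(f"type {type(value).__name__} is not covered by this sketch")


for sample in (None, 123, 10**18, 3.14, True, "a", [1, 2, 3], [1, 2, "3"], {"a": 1}):
    print(repr(sample), "->", r_type_for(sample))
```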

SoS converts R variables of the following types to Python as follows (n in the condition column is the length of the R object):

R	condition	Python
`NULL`		`None`
`logical`	`n == 1`	`boolean`
`integer`	`n == 1`	`integer`
`numeric`	`n == 1`	`float`
`character`	`n == 1`	`str`
`complex`	`n == 1`	`complex`
`logical`	`n > 1`	`list`
`integer`	`n > 1`	`list`
`complex`	`n > 1`	`list`
`numeric`	`n > 1`	`list`
`character`	`n > 1`	`list`
`list` without names		`list`
`list` with names		`dict` (with ordered keys)
`matrix`		`numpy.array`
`data.frame`		`DataFrame`
`array`		`numpy.array`

One of the key problems in mapping R data types to Python is that R does not have scalar types: every scalar variable is actually an array of size 1. In theory, the R variable a=1 should therefore be represented in Python as a=[1]. However, because Python does differentiate between scalar and array values, we chose to represent R arrays of size 1 as scalar types in Python.

a=1 with type <class 'int'>
b=[1, 2] with type <class 'list'>
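The length rule can be sketched in plain Python. This is an illustration of the convention described above, not SoS code; `from_r_vector` is a hypothetical name, and Python lists stand in for R vectors.

```python
def from_r_vector(values):
    """Apply the length rule: one element becomes a scalar, otherwise a list.

    An empty vector is not covered by the table; this sketch returns [].
    """
    values = list(values)
    return values[0] if len(values) == 1 else values


a = from_r_vector([1])      # stands in for the R vector c(1)
b = from_r_vector([1, 2])   # stands in for the R vector c(1, 2)
print(f"a={a} with type {type(a)}")
print(f"b={b} with type {type(b)}")
```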

Simple data types

Most simple Python data types can be converted to R types easily:

 NULL

 num 123

 num 3.14

 logi TRUE

 chr "1\"23"

 cplx 1+2i

The variables can be sent back to SoS without losing information:

None

123

3.1415925

True

'1"23'

(1+2j)

However, Python allows integers of arbitrary precision, which R does not support. Large integers are therefore represented in R as floating-point numbers, which might not keep the full precision of the original number.

For example, if we send a large integer with 18 significant digits to R

the last digit would be different because of its floating-point representation.

This is not a problem specific to SoS: you would get the same result if you entered this number directly in R.

Consequently, if you send large_int back to SoS, the number would be different:

123456789123456784
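The same loss of precision can be reproduced in Python alone by passing the integer through a 64-bit float, which is what R's `numeric` type uses. The starting value below is an assumption chosen to have 18 significant digits; the text above only shows the value that comes back.

```python
# Hypothetical starting value with 18 significant digits.
large_int = 123456789123456789

# A double can represent every integer exactly only up to 2**53.
print(large_int > 2**53)  # True

# Converting to float and back mimics a round trip through R's numeric type.
round_trip = int(float(large_int))
print(round_trip)         # 123456789123456784
```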

Array, matrix, and dataframe

One-dimensional (vector) data is converted from SoS to R as follows:

 chr [1:3] "1" "2" "3"

List of 3
 $ : num 1
 $ : num 2
 $ : chr "3"

List of 3
 $ a: num 1
 $ b: num 2
 $ c: chr "3"

List of 3
 $ : num 1
 $ : num 2
 $ : chr "3"

List of 2
 $ a:List of 1
  ..$ b: num 123
 $ c: logi TRUE

 logi [1:3] TRUE FALSE TRUE

 Named num [1:6] 1 2 3 3 3 3
 - attr(*, "names")= chr [1:6] "0" "1" "2" "3" ...

Multi-dimensional data is converted from SoS to R as follows:

 num [1:2, 1:2] 1 3 2 4

 num [1:2, 1:2] 1 3 2 4
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "0" "1"
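The element order `1 3 2 4` in the output above reflects that R stores and prints matrices column by column. A minimal sketch of that flattening, assuming the Python-side matrix had rows `[1, 2]` and `[3, 4]` (the input is not shown above, and `column_major` is a hypothetical helper):

```python
def column_major(matrix):
    """Flatten a rectangular nested list column by column, as R stores matrices."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [matrix[i][j] for j in range(n_cols) for i in range(n_rows)]


flat = column_major([[1, 2], [3, 4]])
print(flat)  # [1, 3, 2, 4]
```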

Scalar data is converted from R to SoS as follows:

None

123

True

'1"23'

(1+2j)

One-dimensional (vector) data is converted from R to SoS as follows:

[1, 2, 3]

[True, False, True]

['1', '2', '3']

[1, 2, '3']

{'a': 1, 'b': 2, 'c': '3'}

{'a': 1, 'b': {'c': 3, 'd': 'whatever'}}

0    1
1    2
2    3
3    3
4    3
5    3
dtype: int64

Multi-dimensional data is converted from R to SoS as follows:

array([[1., 3.],
       [2., 4.]])

array([[[[ 1,  3],
         [ 2,  4]],

        [[ 5,  7],
         [ 6,  8]]],


       [[[ 9, 11],
         [10, 12]],

        [[13, 15],
         [14, 16]]]])

It is worth noting that R's named lists are transferred to Python as dictionaries, but SoS preserves the order of the keys so that you can recover the order of the list. For example,

Although the dictionary might appear to have a different order

{'A': 1, 'C': 'C', 'B': 3, 'D': [2, 3]}

The order of the keys and values is actually preserved

dict_keys(['A', 'C', 'B', 'D'])

dict_values([1, 'C', 3, [2, 3]])

so it is safe to enumerate the R list in Python as

1 item of Rlist has key A and value 1
2 item of Rlist has key C and value C
3 item of Rlist has key B and value 3
4 item of Rlist has key D and value [2, 3]
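Output of this shape can be produced with an ordinary loop over the dictionary. In the sketch below, a dictionary literal stands in for the `Rlist` variable received from R, and it relies on Python 3.7+ dictionaries keeping insertion order.

```python
# Stand-in for the variable received from R; keys are in the R list's order.
Rlist = {'A': 1, 'C': 'C', 'B': 3, 'D': [2, 3]}

lines = [
    f"{idx} item of Rlist has key {key} and value {val}"
    for idx, (key, val) in enumerate(Rlist.items(), start=1)
]
print("\n".join(lines))
```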