
Working with R

  • Difficulty level: easy
  • Time needed to learn: 15 minutes or less
  • Key points:
    • Most Python (SoS) and R data types have an intuitive counterpart in the other language

Installation

There are several ways to install R and its Jupyter kernel irkernel. The easiest might be conda, but installing third-party R libraries into conda can be tricky, and mixing R packages from the base and r channels can cause serious problems.

Anyway, after you have a working R installation with irkernel installed, you will need to install

  • The sos-r language module,
  • The arrow library of R, and
  • The feather-format module of Python

The arrow and feather-format modules are needed to exchange data frames between Python and R.

Overview

SoS transfers Python variables of the following types to R as follows:

Python                        Condition         R
None                                            NULL
integer                                         integer
integer                       large             numeric
float                                           numeric
boolean                                         logical
complex                                         complex
str                                             character
Sequence (list, tuple, ...)   homogenous type   c()
Sequence (list, tuple, ...)   multiple types    list
set                                             list
dict                                            list with names
numpy.ndarray                                   array
numpy.matrix                                    matrix
pandas.DataFrame                                R data.frame
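
The homogeneous-versus-mixed rule for sequences in the table above can be sketched in plain Python. The helper name `r_container_for` is hypothetical and not part of SoS or sos-r; it only illustrates the decision described in the table, not how the conversion is actually implemented.

```python
def r_container_for(seq):
    """Return the R container the table suggests for a Python sequence.

    Hypothetical helper for illustration only; not part of SoS or sos-r.
    """
    # Collect the exact Python type of every item in the sequence.
    kinds = {type(item) for item in seq}
    # One shared type maps to an atomic vector built with c();
    # mixed types map to an R list. Empty sequences are not covered
    # by the table, and this sketch simply returns "list" for them.
    return "c()" if len(kinds) == 1 else "list"


print(r_container_for(["1", "2", "3"]))       # c()
print(r_container_for([True, False, True]))   # c()
print(r_container_for([1, 2, "3"]))           # list
```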

SoS transfers R variables of the following types to Python (SoS) as follows (n in the condition column is the length of the R object):

R                    Condition   Python
NULL                             None
logical              n == 1      boolean
integer              n == 1      integer
numeric              n == 1      double
character            n == 1      string
complex              n == 1      complex
logical              n > 1       list
integer              n > 1       list
complex              n > 1       list
numeric              n > 1       list
character            n > 1       list
list without names               list
list with names                  dict (with ordered keys)
matrix                           numpy.array
data.frame                       DataFrame
array                            numpy.array

One of the key problems in mapping R data types to Python is that R does not have scalar types: every scalar variable is actually an array of size 1. In theory, the R variable a=1 should therefore be represented in Python as a=[1]. However, because Python does differentiate between scalar and array values, we chose to represent R arrays of size 1 as scalar types in Python.

In [1]:
In [2]:
a=1 with type <class 'int'>
b=[1, 2] with type <class 'list'>
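
The length rule can be written down as a small Python sketch. The function `from_r_vector` is a hypothetical stand-in for what the conversion does conceptually, with a Python list playing the role of the R vector; it is not the SoS implementation.

```python
def from_r_vector(values):
    """Mimic the documented rule for an unnamed R atomic vector.

    Hypothetical helper: `values` is a Python list standing in for the R vector.
    """
    if len(values) == 1:
        # An R vector of length 1 becomes a Python scalar.
        return values[0]
    # Longer vectors become Python lists.
    return list(values)


a = from_r_vector([1])
b = from_r_vector([1, 2])
print(f"a={a} with type {type(a)}")  # a=1 with type <class 'int'>
print(f"b={b} with type {type(b)}")  # b=[1, 2] with type <class 'list'>
```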

Simple data types

Most simple Python data types can be converted to R types easily:

In [3]:
In [4]:
> null_var:
 NULL
> int_var:
 num 123
> float_var:
 num 3.14
> logic_var:
 logi TRUE
> char_var:
 chr "1\"23"
> comp_var:
 cplx 1+2i

The variables can be sent back to SoS without losing information:

In [5]:
> null_var: NoneType
None
> int_var: int
123
> float_var: float
3.1415925
> logic_var: bool
True
> char_var: str of length 4
'1"23'
> comp_var: complex
(1+2j)

However, because Python allows integers of arbitrary precision, which R does not support, large integers are represented in R as floating-point numbers, which might not keep the precision of the original number.

For example, if we send a large integer with 18 significant digits to R

In [6]:

The last digit would be different because of the floating-point representation

In [7]:
123456789123456784

This is not a problem with SoS because you would get the same result if you entered this number directly in R

In [8]:
123456789123456784

Consequently, if you send large_int back to SoS, the number differs from the original

In [9]:
Out[9]:
123456789123456784
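
The precision loss can be reproduced in Python alone, because Python's `float` uses the same double-precision format as R's numeric type. This is a minimal sketch; the input value `123456789123456789` is an assumption, chosen as an 18-digit integer that reproduces the output shown above.

```python
# Assumed 18-digit input, chosen to reproduce the output shown above.
large_int = 123456789123456789

# Converting to a double and back mimics the round trip through R's numeric type.
as_double = float(large_int)
round_trip = int(as_double)

print(round_trip)               # 123456789123456784
print(round_trip == large_int)  # False

# Integers up to 2**53 in magnitude survive the conversion unchanged.
print(int(float(2**53)) == 2**53)          # True
print(int(float(2**53 + 1)) == 2**53 + 1)  # False
```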

Array, matrix, and dataframe

One-dimensional (vector) data is converted from SoS to R as follows:

In [10]:
In [11]:
> char_arr_var:
 chr [1:3] "1" "2" "3"
> list_var:
List of 3
 $ : num 1
 $ : num 2
 $ : chr "3"
> dict_var:
List of 3
 $ a: num 1
 $ b: num 2
 $ c: chr "3"
> set_var:
List of 3
 $ : num 1
 $ : num 2
 $ : chr "3"
> recursive_var:
List of 2
 $ a:List of 1
  ..$ b: num 123
 $ c: logi TRUE
> logic_arr_var:
 logi [1:3] TRUE FALSE TRUE
> seri_var:
 Named num [1:6] 1 2 3 3 3 3
 - attr(*, "names")= chr [1:6] "0" "1" "2" "3" ...

Multi-dimensional data is converted from SoS to R as follows:

In [12]:
In [13]:
> num_arr_var:
 num [1:2, 1:2] 1 3 2 4
> mat_var:
 num [1:2, 1:2] 1 3 2 4
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "0" "1"
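
The element order `1 3 2 4` printed by `str()` above follows from R listing matrix elements column by column. A small plain-Python sketch, assuming the Python-side input was the 2x2 nested list `[[1, 2], [3, 4]]` (an assumption consistent with the output, since the input is not shown):

```python
# Assumed 2x2 input, written row by row as is usual in Python.
rows = [[1, 2], [3, 4]]

# Read the values column by column, the order in which R stores and prints matrices.
column_major = [value for column in zip(*rows) for value in column]

print(column_major)  # [1, 3, 2, 4]
```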

Scalar data is converted from R to SoS as follows:

In [14]:
In [15]:
> null_var: NoneType
None
> num_var: int
123
> logic_var: bool
True
> char_var: str of length 4
'1"23'
> comp_var: complex
(1+2j)

One-dimensional (vector) data is converted from R to SoS as follows:

In [16]:
In [17]:
> num_vector_var: list of length 3
[1, 2, 3]
> logic_vector_var: list of length 3
[True, False, True]
> char_vector_var: list of length 3
['1', '2', '3']
> list_var: list of length 3
[1, 2, '3']
> named_list_var: dict of length 3
{'a': 1, 'b': 2, 'c': '3'}
> recursive_var: dict of length 2
{'a': 1, 'b': {'c': 3, 'd': 'whatever'}}
> seri_var: Series of shape (6,)
0    1
1    2
2    3
3    3
4    3
5    3
dtype: int64

Multi-dimensional data is converted from R to SoS as follows:

In [18]:
In [19]:
> mat_var: ndarray of shape (2, 2)
array([[1., 3.],
       [2., 4.]])
> arr_var: ndarray of shape (2, 2, 2, 2)
array([[[[ 1,  3],
         [ 2,  4]],

        [[ 5,  7],
         [ 6,  8]]],


       [[[ 9, 11],
         [10, 12]],

        [[13, 15],
         [14, 16]]]])

It is worth noting that R's named lists are transferred to Python as dictionaries, but SoS preserves the order of the keys so that you can recover the order of the list. For example,

In [20]:

Although the dictionary might appear to have a different order

In [21]:
Out[21]:
{'A': 1, 'C': 'C', 'B': 3, 'D': [2, 3]}

the order of the keys and values is actually preserved

In [22]:
Out[22]:
dict_keys(['A', 'C', 'B', 'D'])
In [23]:
Out[23]:
dict_values([1, 'C', 3, [2, 3]])

so it is safe to enumerate the R list in Python as follows:

In [24]:
1 item of Rlist has key A and value 1
2 item of Rlist has key C and value C
3 item of Rlist has key B and value 3
4 item of Rlist has key D and value [2, 3]
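
A loop of the following shape produces output like the above. This is a sketch in which `Rlist` is written as a Python dict literal standing in for the dictionary received from R; Python dictionaries keep insertion order (Python 3.7 and later), so iterating over `items()` visits the entries in the order of the R list.

```python
# Stand-in for the dictionary received from R; keys are in the order of the R list.
Rlist = {"A": 1, "C": "C", "B": 3, "D": [2, 3]}

# Number the entries from 1 while walking keys and values in stored order.
lines = [
    f"{idx} item of Rlist has key {key} and value {value}"
    for idx, (key, value) in enumerate(Rlist.items(), start=1)
]
print("\n".join(lines))
```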