Writing a new language module

Difficulty level: difficult
Time needed to learn: 30 minutes or more
Key points:
- It is relatively easy to write a language module with basic functions
- Data exchange for different data types is handled independently, so you can start with the most common types and add more types gradually

Role of a language module

SoS can interact with any Jupyter kernel. As shown in the SoS notebook tutorial, SoS can

- List the kernel in the language dropdown box and use it to execute associated cells
- Use %expand magic to prepare input before sending to the kernel
- Use %capture magic to capture the output from the kernel
- Use %render magic to render output from the kernel

without knowing what the kernel does.

However, if the kernel supports the concept of variables (not all kernels do), a language module for the kernel allows SoS to work more efficiently with the kernel. More specifically, SoS can

- Mark the prompt areas of each cell to differentiate cells that belong to different kernels
- Preview variables when an assignment is executed during line-by-line execution
- Change current directory of all subkernels with %cd magic
- Exchange variables between subkernels using magics %put, %get and %with
- Expand (markdown) texts in subkernel using magic %expand --in
- Preview the content of variables using magic %preview
- Show the version information of kernels using magic %sessioninfo

Although data exchange among subkernels is really powerful, it is important to understand that SoS does not transfer any variables among kernels. Instead, it creates independent homonymous variables of similar types that are native to the destination language. For example, if you have the following two variables

a = 1
b = c(1, 2)

in R and execute the magic

%get a b --from R

in a SoS cell, SoS actually executes the following statements in the background to create variables a and b in Python

a = 1
b = [1, 2]

These variables are independent so that changing the value of variables a or b in one kernel will not affect another. We also note that a and b are of different types in Python although they are of the same numeric type in R (a is technically speaking an array of size 1).
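The copy semantics described above can be mimicked in plain Python. The sketch below is not SoS code: the helper name from_r_numeric is hypothetical, and Python lists stand in for R numeric vectors. It only illustrates why a vector of length 1 could arrive as a scalar while a longer vector arrives as a list, and that the result is independent of the source.

```python
def from_r_numeric(values):
    """Convert a list standing in for an R numeric vector to a Python value.

    Hypothetical helper: a vector of length 1 becomes a scalar,
    anything longer becomes a new list.
    """
    if len(values) == 1:
        return values[0]
    return list(values)  # a copy, independent of the source


r_a = [1]     # stands in for `a = 1` in R (a vector of length 1)
r_b = [1, 2]  # stands in for `b = c(1, 2)` in R

a = from_r_numeric(r_a)
b = from_r_numeric(r_b)

r_b.append(3)  # changing the "R side" afterwards ...
print(a, b)    # ... leaves the Python copies untouched
```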

Define a new language Module

The best way to start a new language module is to read the source code of an existing language module and adapt it to your language. Our GitHub organization has a number of language modules. Module sos-r is a good choice, and you should try to match the corresponding items with the code in kernel.py when going through this tutorial.

Class attributes

To support a new language, you will need to write a Python package that defines a class, say mylanguage, that provides the following class attributes:

`supported_kernels`

supported_kernels should be a dictionary that maps each language name to a list of names of the kernels that the language supports. For example, ir is the name of the kernel for language R, so this attribute should be defined as:

supported_kernels =  {'R': ['ir']}

If multiple kernels are supported, SoS will look for a kernel with matched name in the order that is specified. This is the case for JavaScript where multiple kernels are available:

supported_kernels =  {'JavaScript': ['ijavascript', 'inodejs']}

Multiple languages can be specified if a language module supports multiple languages. For example, MATLAB and Octave share the same language module

supported_kernels = {'MATLAB': ['imatlab', 'matlab'], 'Octave': ['octave']}

Wildcard characters are allowed in kernel names, which is useful for kernels whose names contain version numbers:

supported_kernels = {'Julia': ['julia-?.?']}

Finally, if SoS cannot find any kernel that it recognizes, it will look into the language information of the kernelspec.
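How SoS matches kernel names internally is not shown in this tutorial. As a rough sketch, assuming shell-style wildcards as implemented by Python's fnmatch module, the lookup order described above could work as follows. The function find_kernel and the list of available kernels are hypothetical.

```python
from fnmatch import fnmatch


def find_kernel(supported_kernels, available):
    """Return (language, kernel) for the first supported kernel that is available.

    Hypothetical helper; patterns are tried in the order they are listed.
    """
    for language, patterns in supported_kernels.items():
        for pattern in patterns:
            for kernel in available:
                if fnmatch(kernel, pattern):
                    return language, kernel
    return None


julia = find_kernel({'Julia': ['julia-?.?']}, ['python3', 'julia-1.6'])
js = find_kernel({'JavaScript': ['ijavascript', 'inodejs']}, ['inodejs'])
print(julia)
print(js)
```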

`background_color`

background_color should be a name or #XXXXXX value for a color that will be used in the prompt area of cells that are executed by the subkernel. An empty string can be used for using default notebook color. If the language module defines multiple languages, a dictionary {language: color} can be used to specify different colors for supported languages. For example,

background_color = {'MATLAB': '#8ee7f1', 'Octave': '#dff8fb'}

is used for MATLAB and Octave.

`cd_command`

cd_command is a command to change the current working directory, specified with {dir} interpolated with the option of magic %cd. For example, the command for R is

cd_command = 'setwd({dir!r})'

where !r quotes the provided dir. Note that { } are used as a Python f-string but no f prefix should be used.
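To see what the subkernel would receive, the template can be filled in with str.format, which understands the same {dir!r} syntax. That SoS interpolates the template in exactly this way is an assumption, and the path is a made-up example.

```python
cd_command = 'setwd({dir!r})'

# Assumption: SoS fills in {dir} roughly the way str.format does.
# The !r conversion applies repr(), which adds the quotes.
statement = cd_command.format(dir='/path/to/project')
print(statement)
```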

`options`

A Python dictionary with options that will be passed to the frontend. Currently two options variable_pattern and assignment_pattern are supported. Both options should be regular expressions in JS style.

Option variable_pattern is used to identify if a statement is a simple variable (nothing else). If this option is defined and the input text (if executed at the side panel) matches the pattern, SoS will prepend %preview to the code. This option is useful only when %preview var displays more information than var.
Option assignment_pattern is used to identify if a statement is an assignment operation. If this option is defined and the input text matches the pattern, SoS will prepend %preview var to the code where var should be the first matched portion of the pattern (use ( )). This mechanism allows SoS to automatically display result of an assignment when you step through the code.

Both options are optional.
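The following sketch shows what the two patterns might look like for an R-like language. The patterns themselves are hypothetical, and Python's re module is used only to try them out; these simple expressions behave the same way in JavaScript.

```python
import re

# Hypothetical patterns for an R-like language; a real module would
# tune these to the syntax of its own language.
options = {
    'variable_pattern': r'^\s*[_A-Za-z.][_A-Za-z0-9.]*\s*$',
    'assignment_pattern': r'^\s*([_A-Za-z.][_A-Za-z0-9.]*)\s*(=|<-).*$',
}


def previewed_name(code):
    """Return the variable name that would be previewed for `code`, or None."""
    match = re.match(options['assignment_pattern'], code)
    return match.group(1) if match else None


print(previewed_name('b <- c(1, 2)'))
print(previewed_name('print(b)'))
print(bool(re.match(options['variable_pattern'], 'my.data')))
```

The first group of assignment_pattern captures the variable name, which is the part that would follow %preview.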

`__version__`

This attribute, if provided, will be included in the debug message when the language module is loaded. This helps you, for example, to check if the correct version of the language module has been loaded if you have multiple instances of python, sos, and/or language module available.

Instance attributes

An instance of the class would be initialized with the sos kernel and the name of the subkernel, which does not have to be one of the supported_kernels (could be self-defined) and should provide the following attributes and functions. Because these attributes are instantiated with kernel name, they can vary (slightly) from kernel to kernel.

String `init_statements`

init_statements is a statement that will be executed by the sub-kernel when the kernel starts. This statement usually defines a number of utility functions.

Function `get_vars(self, names)`

Function get_vars should be a Python function that transfers the specified Python variables to the subkernel. We will discuss this in detail below.

Function `put_vars(self, items, to_kernel=None)`

Function put_vars should be a Python function that puts one or more variables from the subkernel to SoS or another subkernel. We will discuss this in detail below.

Function `expand(self, text, sigil)` (new in SoS Notebook 0.20.8)

Function expand should be a Python function that accepts text (most likely in Markdown format) with inline expressions, evaluates the expressions in the subkernel, and returns the expanded text. This can be used by the markdown kernel for the execution of inline expressions of, for example, R markdown text.
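A minimal sketch of the expansion step is shown below. It assumes that sigil is a string such as '{ }' whose two halves, separated by a space, delimit an expression. The function expand_text and its evaluate callback are hypothetical, and the callback stands in for a round trip to the subkernel.

```python
import re


def expand_text(text, sigil, evaluate):
    """Replace every inline expression in `text` with its evaluated value.

    `sigil` is assumed to look like '{ }'; `evaluate` stands in for
    sending the expression to the subkernel and reading back the result.
    """
    left, right = sigil.split(' ')
    pattern = re.compile(re.escape(left) + r'(.+?)' + re.escape(right))
    return pattern.sub(lambda m: str(evaluate(m.group(1).strip())), text)


# A dictionary lookup plays the role of the subkernel here.
values = {'n': 3, 'name': 'iris'}
expanded = expand_text('Dataset {name} has {n} columns.', '{ }', values.get)
print(expanded)
```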

Function `preview(self, item)`

Function preview accepts a name, which should be the name of a variable in the subkernel. This function should return a tuple of two items (desc, preview) where

desc should be a text (can be empty) that describes the type, size, dimension, or other general information of the variable, which will be displayed after variable name.
preview can be
- A single str that is printed as stdout
- A dictionary, which should contain keys such as text/plain, text/html, image/png and corresponding data. The data will be sent directly as display_data and allows you to return different types of preview result, even images.
- A list or tuple of two dictionaries, with the first being the data dictionary, and the second being the metadata dictionary for a display_data message.

Function `sessioninfo(self)`

Function sessioninfo should be a Python function that returns information about the running kernel, usually including the version of the language, the kernel, and currently used packages and their versions. For R, this means a call to the sessionInfo() function. The return value of this function can be

- A string,
- A list of strings or (key, value) pairs, or
- A dictionary.

The function will be called by the %sessioninfo magic of SoS.
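The attributes and functions described in this section fit together roughly as in the skeleton below. It is a sketch for a hypothetical kernel named mylanguage: the class name, the color, and all return values are placeholders, and the optional version attribute is left out.

```python
class sos_MyLanguage:
    """Skeleton of a language module for a hypothetical kernel 'mylanguage'."""

    # Class attributes; all values are placeholders.
    supported_kernels = {'MyLanguage': ['mylanguage']}
    background_color = '#E6EEFF'
    cd_command = 'setwd({dir!r})'
    options = {}

    def __init__(self, sos_kernel, kernel_name='mylanguage'):
        self.sos_kernel = sos_kernel
        self.kernel_name = kernel_name
        # Executed by the subkernel when it starts.
        self.init_statements = ''

    def get_vars(self, names):
        """Create the named SoS variables in the subkernel."""
        raise NotImplementedError

    def put_vars(self, items, to_kernel=None):
        """Return a dictionary of variables to hand over to SoS."""
        return {}

    def expand(self, text, sigil):
        """Return text unchanged until inline expressions are supported."""
        return text

    def preview(self, item):
        """Return (desc, preview), here with a plain-text data dictionary."""
        return 'placeholder', {'text/plain': f'<no preview for {item}>'}

    def sessioninfo(self):
        """Return a list of (key, value) pairs describing the kernel."""
        return [('Kernel', self.kernel_name)]


# No real SoS kernel is needed just to inspect the skeleton.
lang = sos_MyLanguage(sos_kernel=None)
print(lang.preview('x'))
print(lang.sessioninfo())
```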

Obtain variable from SoS

The get_vars function should be defined as

def get_vars(self, var_names)

where

- self is the language instance with access to the SoS kernel, and
- var_names are names in the sos dictionary.

This function is responsible for probing the type of each Python variable and creating a similar object in the subkernel.

For example, to create a Python object b = [1, 2] in R (magic %get), this function could

- Obtain an R expression to create this variable (e.g. b <- c(1, 2))
- Execute the expression in the subkernel to create variable b in it.

Note that the function get_vars can change the variable name because a valid variable name in Python might not be a valid variable name in another language. The function should give a warning (call self.sos_kernel.warn()) if this happens.
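The two steps and the renaming rule can be sketched as follows for R as the target language. This is not the sos-r implementation. The helpers to_r_expr and r_name, the FakeKernel class, the plain dictionary used in place of the sos dictionary, and the parameter names of run_cell are all assumptions made for illustration. Only a few simple types are handled.

```python
def to_r_expr(value):
    """Return R source code that recreates a simple Python value."""
    if isinstance(value, bool):  # check bool before int
        return 'TRUE' if value else 'FALSE'
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return '"' + value.replace('\\', '\\\\').replace('"', '\\"') + '"'
    if isinstance(value, (list, tuple)):
        return 'c(' + ', '.join(to_r_expr(v) for v in value) + ')'
    raise TypeError(f'unsupported type {type(value).__name__}')


def r_name(name):
    """Map a Python name to an R name; R names cannot start with '_'."""
    return '.' + name[1:] if name.startswith('_') else name


class FakeKernel:
    """Stand-in for the SoS kernel that records what it is asked to do."""

    def __init__(self):
        self.statements, self.warnings = [], []

    def warn(self, msg):
        self.warnings.append(msg)

    def run_cell(self, statement, silent, store_history, on_error=''):
        self.statements.append(statement)


def get_vars(sos_kernel, sos_dict, var_names):
    for name in var_names:
        new_name = r_name(name)
        if new_name != name:
            sos_kernel.warn(f'Variable {name} is passed to R as {new_name}')
        sos_kernel.run_cell(
            f'{new_name} <- {to_r_expr(sos_dict[name])}', True, False,
            on_error=f'Failed to get variable {name}')


kernel = FakeKernel()
get_vars(kernel, {'b': [1, 2], '_tmp': 'x'}, ['b', '_tmp'])
print(kernel.statements)
print(kernel.warnings)
```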

Send variables to other kernels

The put_vars function should be defined as

def put_vars(self, var_names, to_kernel=None)

where

- self is the language instance with access to the SoS kernel,
- var_names is a list of variables that should exist in the subkernel, and
- to_kernel is the destination kernel to which the variables should be passed.

Depending on the destination kernel, this function can return one of two things:

- If direct variable transfer is not supported by the language, the function returns a Python dictionary. In this case the language module transfers the variables to SoS and lets SoS pass them along to the destination kernel.
- If direct variable transfer is supported, the function returns a string. SoS evaluates the string in the destination kernel to pass the variables directly to that kernel.

So basically, a language can start with an implementation of put_vars(to_kernel='sos') and let SoS handle the rest. If the need arises, it can

- Implement variable exchange between instances of the same language. This can be useful because there are usually lossless and more efficient methods in this case.
- Put variables to another language where direct variable transfer is much more efficient than transferring through SoS.

NOTE: SoS Notebook before version 0.20.5 supported a feature called automatic variable transfer, which automatically transferred variables with names starting with sos between kernels. This feature has been deprecated (#253).

For example, to send an R object b <- c(1, 2) from subkernel R to SoS (magic %put), this function can

- Execute a statement in the subkernel to get the value(s) of variable(s) in some format, for example, a string "{'b': [1, 2]}".
- Post-process these variables to return a dictionary to SoS.

The R sos extension provides a good example to get you started.
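As a rough illustration of these two steps, the sketch below uses a fake kernel that returns a canned reply. The R-side utility sos_py_repr is hypothetical (it would have to be defined by the init statement), and the layout of the stream message, with the output stored under the key 'text', is an assumption.

```python
import ast


class FakeKernel:
    """Stand-in for the SoS kernel; returns a canned reply from the subkernel."""

    def get_response(self, statement, msg_types, name=()):
        # Assumption: stream messages carry their output under the key 'text'.
        return [('stream', {'name': 'stdout', 'text': "{'b': [1, 2]}"})]


def put_vars(sos_kernel, var_names, to_kernel=None):
    # sos_py_repr is a hypothetical function in the subkernel that prints
    # the given variables as a Python dictionary literal.
    args = ', '.join(f'{n}={n}' for n in var_names)
    statement = f'sos_py_repr(list({args}))'
    response = sos_kernel.get_response(statement, ('stream',), name=('stdout',))
    text = response[0][1]['text']
    # literal_eval only accepts Python literals, so it is safer than eval.
    return ast.literal_eval(text)


result = put_vars(FakeKernel(), ['b'])
print(result)
```

Returning a dictionary corresponds to the case in which SoS receives the variables and passes them along to the destination kernel.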

NOTE: Unlike other language extension mechanisms in which the Python module can get hold of the "engine" of the interpreter (e.g. saspy and MATLAB's Python extension start the interpreter for direct communication) or has access to a lower-level API of the language (e.g. rpy2), SoS only has access to the interface of the language and performs all conversions by executing commands in the subkernels and intercepting their responses. Consequently,

- Data exchange can be slower than other methods.
- Data exchange is less dependent on version of the interpreter.
- Data exchange can happen between a local and a remote kernel.

Also, although it can be more efficient to save large datasets to disk files and load in another kernel, this method does not work for kernels that do not share the same filesystem. We currently ignore this issue and assume all kernels have access to the same file system.

Functions of `sos_kernel`

With access to an instance of SoS kernel, you can call various functions of this kernel. However, the SoS kernel does not provide a stable API yet so you are advised to use only the following functions:

`sos_kernel.warn(msg)`

This function produces a warning message.

`sos_kernel.run_cell(statement, True, False, on_error='msg')`

Execute a statement in the current subkernel, with True, False indicating that the execution should be done in the background and no output should be displayed. A message on_error will be displayed if the statement fails to execute.

`sos_kernel.get_response(statement, msg_type, name)`

This function executes the statement and collects messages sent back from the subkernel. Only messages of the specified msg_type are kept (e.g. stream, display_data), and name can be one or both of stdout and stderr when stream is specified.

The returned value is a list of

msg_type, msg_data
msg_type, msg_data
...

so

self.sos_kernel.get_response('ls()', ('stream', ), 
                name=('stdout', ))[0][1]

runs the function ls() in the subkernel, collects stdout, and gets the content of the first message.

Debugging

If you are having trouble in figuring out what messages have been returned (e.g. display_data and stream can look alike) from subkernels, you can use the %capture magic to show them in the console panel.

You can also define environment variable SOS_DEBUG=MESSAGE (or MESSAGE,KERNEL etc) before starting the notebook server. This will cause SoS to, among other things, log messages processed by the get_response function to ~/.sos/sos_debug.log.

Logging

If you would like to add your own debug messages to the log file, you can

from sos.utils import env

env.log_to_file('VARIABLE', f'Processing {var} of type {var.__class__.__name__}.')

If the log message can be expensive to format, you can check if SOS_DEBUG is defined before logging to the log file:

if 'VARIABLE' in env.config['SOS_DEBUG'] or 'ALL' in env.config['SOS_DEBUG']:
    env.log_to_file('VARIABLE', f'Processing {var} of type {var.__class__.__name__}.')

Testing

Although you can test your language module in many ways, it is highly recommended that you adopt a standard set of selenium-based tests that are executed by pytest. To create and run these tests, you should

- Install selenium and pytest
- Install Google Chrome and chrome driver
- Set environment variable JUPYTER_TEST_BROWSER to live if you would like to see the tests running. Otherwise the tests will be run in a virtual Chrome browser without display.
- Copy three test files from the tests for sos-r and adapt them for your language.

Test files

The test suite contains three files:

conftest.py

This is the configuration file for pytest that defines how to start a Jupyter server with the notebook with the right kernel. You can simply copy this file for your purpose.

test_interface.py

This file contains tests on the interface of the language module, including
- Test for prompt color
- Test for magic %cd
- Test for change of variable names for magics %put and %get
- Test for the automatic exchange of sos variables (variables with names starting with sos)
- Test for the %preview magic
- Test for the %sessioninfo magic

test_data_exchange.py

This file should contain tests for data exchange between SoS (Python) and the language, and optionally between subkernels. The tests should be separated by data type and direction of data transfer.

All tests should be derived from the NotebookTest class in sos_notebook.test_utils and use the pytest fixture notebook as follows:

from sos_notebook.test_utils import NotebookTest

class TestDataExchange(NotebookTest):
    def test_something(self, notebook):
        pass

The `notebook` fixture

The notebook fixture that is passed to each test function contains a notebook instance that you can operate on. Although there are a large number of functions, you most likely only need to learn two of them for your tests:

notebook.call(statement, kernel, expect_error=False)

This function appends a new cell to the end of the notebook, inserts the specified statement as its content, changes the kernel of the cell to kernel, and executes the cell. It automatically dedents statement, so you can indent multi-line statements and call the function as follows:

notebook.call('''\
          %put df --to R
          import pandas as pd
          import numpy as np
          arr = np.random.randn(1000)
          arr[::10] = np.nan
          df = pd.DataFrame({'column_{0}'.format(i): arr for i in range(10)})
          ''', kernel='SoS')

This function returns the index of the cell so that you can call notebook.get_cell_output(idx) if needed. If you are supposed to see some warning messages, use expect_error=True. Otherwise the function will raise an exception that fails the test.

notebook.check_output(statement, kernel, expect_error=False, selector=None, attribute=None)

This function calls notebook.call(statement, kernel) and then notebook.get_cell_output(idx, selector, attribute) to get the output. The output contains all the text of the output, and additional text from non-text elements. For example, selector='img', attribute='src' would return the text in <img src="blah"> output. Using this function, most of your unit tests can look like the following

def test_sessioninfo(self, notebook):
    assert 'R version' in notebook.check_output(
        '%sessioninfo', kernel="SoS")

Registering the new language module

To register a language module with SoS, you will need to add your module to an entry point under section sos_language. This can be done by adding something like the following to your setup.py:

entry_points='''
[sos_language]
Perl = sos_perl.kernel:sos_Perl
'''

With the installation of this package, sos would be able to import a class sos_Perl from module sos_perl.kernel, and use it to work with the Perl language.
