Threading considerations

Project: SmallForth

This implementation of Forth was always intended to support threads of some sort. Any code that would need to support atomic operations has a // TODO against it - such as compiling words.

There is already support for thread local storage - these are stored in the ExecState of each thread. Each thread is expected to have its own ExecState.

To initiate a new thread of execution, a word would be defined threadst or similar. The next word would be the one to start a new thread of execution, similar to starting a go routine.

The thread would end when the word that was called ended.

One issue would be what to do with the data (and return, and temporary) stack. The method being called will not need all the items on the stack, it may need a specific number, or it may need a variable number depending on the data. It could simply leave the spurious items on the bottom of the stack, and on the thread ending the stack would be cleared. However this seems wasteful, and could lead to stack overflow situations, especially if the thread 'recursively' created another thread.

The continuance in the current thread would likely not need the top n stack elements either.

Threading strategies

There are several ways to do separate threads of execution:

Cooperative multitasking. The execution of DOCOL could swap between execution states. This would mean that DOCOL could not be called recursively, as the C++ stack would not be in a comparible state to the Forth-thread being executed - executing the Forth word EXIT would cause the C++ method to return to whatever called this iteration of DOCOL, which may not be the same Forth-Thread. The DOCOL implementation would need altering to no longer be recursive.
Start a new kernel-thread per thread-of-execution. This would not clone the C++ stack, but the current execution state including the Forth-stacks could be cloned. A new kernel thread could be started and the word being called executed on it.
m:n threading. Create m Forth-threads and execute them on n kernel threads, cooperatively (swapping in the DOCOL)
Fork the process. This would be quite a heavy way of doing things, and probably is not needed here.
Create a VM and compile all C++ code to byte code. Support cooperative multi-tasking in this bytecode, with the VM itself supporting kernel threads.

Of these options, m:n threading seems the best option - allowing thousands of Forth-threads (fthreads) to execute without the kernal overhead associated with that many kernel-threads (kthread). However, a thread dispatcher will need to be written to determine which fthreads are blocking, and which are due to execute next.

To do this, the most sensible place to swap fthreads would be the DOCOL C++ method, that iterates over the second-word definitions, calling (recursively) either itself, or a first-level word. To manage multi-threading, the DOCOL implementation should not recurse into itself, but implement that functionality using iterative behaviour (still calling first-level words). This would then be extended so that each kthread would have its own running DOCOL instance, that fthreads can be assigned to.

The DOCOL implementation should be taken out of the class it is in, and moving to a threading implementation class.

First level words

The execution of first-level words (written in C++) becomes an issue. These can call second-level words, for instance when converting the TOS to a string, or in the debugger when calling .s to output the stack. Many first-level words have already been converted to second-level words, such as if,then and loop. More could be converted, and some others could be prevented from calling second-level words. However, there will always be a need for this functionality.

There is no easy way for a C++ method to temporarily return to its caller (the DOCOL method) and be re-called at a later date to continue where it left off with exactly the same local state, the way that yield works in C#. (It might be possible with coroutines.)

One way to implement this would be for the C++ code to call the current version of DOCOL, that executes without regard to cooperative multitasking. Care must be taken that these second-level words do not call anything that will block execution, such as i/o, or waiting on thread signalling.

Thread of execution

On starting the thread, the first thing the word should do is call a new word keepn, to keep only the top items in the stack. It should also clear the temporary and return stack, as the values in these would not be needed.

Current thread continuance

After starting the new thread, the current thread needs to drop the top n stack elements. A new word dropn taking a single integer should be defined for this task.

Thread-safe objects

Reference counted objects are not currently thread-safe, and will not be altered to be made thread safe.

Static arrays and variables may be considered thread safe, but this will be tested. Putting the pointer to an array or variable increments the element's internal reference count and so will not be deleted by a different thread. Incrementing/decrementing reference counts will need to be made thread-safe, but at which point in the process will need to be determined. It may mean using a simple std::atomic< int > , or the whole process may need to be guarded by a std::mutex.

As well as ref count changings, setting pointer values (or pter to pter values) may need std::mutex as a guard too - the code triggers the ref count code inside TypeSystem.cpp in weird and wonderful ways.

Inter-thread communications

Typically when I implement IPC between threads in C++ or C#, ultimately what I create will use a vector or list, with an event to signal when new items are available. Ultimately this looks like a Go channel. Therefore I should implement a channel of some kind.

A channel should be implemented as a ref-counted object, one that is thread-safe communication. It be needed to be constructed by pushing the type on to the stack that it is to contain:

string_type channel_type construct

This should create an object of type channel. This could be stored in a variable to be shared with the threads communicating, or on the stack.

A way of waiting for a single channel, multiple channels, other signals types, a timeout, or a process-kill event, will need to be implemented. awaitOne or awaitAll, taking an array of signalling objects, should be implemented.

Thread dispatcher

If the m:n, or cooperative multitasking is chosen, the C++ implementation of DOCOL should become the thread dispatcher. Executing words, be them first or second level, will involve sending to one of the n running DOCOL methods, running on n kthread threads.

A first-level word be scheduled to be called against the fthread without looking at any CFA/Word Body Elements but with the thread's ExecState. A second-level word would execute as it is now, but different threads Word Body Elements will be executed in turn on a specific kthread.

This model would assign a specific fthread to a specific kthread. This seems the wisest way of doing things for cache coherency, in case different threads are on different cores/CPUs. The amount of kthreads available will need to be increased and shrank depending on demand by the fthread count, and possibly what the fthreads are up to. There is lots of CS-research devoted to this, so I will start with a basic implementation of a ratio and go from there.

Thread state

Threads will be in one of four states:

Sleeping. The thread has be asked to sleep for a set duration.
Active. The thread is active, but not actually executing on a kthread.
Running. The thread is running on a kthread
Blocked. The thread is blocked on something

This is based on my thinking now. In fact, active and running may be merged into one during implementation.

Threads become blocked when:

Awaiting a signal from a channel, or other signal
Awaiting for i/o to complete. Currently i/o is all synchronous, but this should be changed to asynchronous. This change would be invisible to the Forth language, but calls to i/o would then allow the thread dispatcher to swap to a different fthread.

Clearly, the changes to the i/o methods will need to work closely with the thread-dispatcher.

Conclusion

Implementing threads will be a lot of work, but could be done iteratively. The DOCOL implementation could be altered to be non-recursive at first, with first-level words still calling DOCOL directly. Then a copy of DOCOL should be moved into a threading implementation class to support swapping between different fthreads. First-level words should still call the current version of DOCOL.

Next, the thread dispatcher should be written, and the ExecState::ExecuteWordDirectly() method altered so when calling a second-level word, it added the word it wants to execute to a fthreads DOCOL instance's burden.

Then, the signalling objects, i/o routines updates and thread-safety can be implemented.

Alterrnatively, the byte-code VM solution should be considered - it may be slower to execute that C++, and would need a lot of functions (like trig') to be recreated, but then it could cross-compile to Web Assembly.