I enjoy implementing VMs in PHP; for example, I have already implemented a JVM and a RubyVM. I am the author of some books. My hobby is reading binary files. I have been writing Ruby for 4 months. memory1994 m3m0r7
YARV (Yet Another Ruby VM).
- YARV consists of a set of instruction sequences (lists of instructions to be executed), called ISeq (Instruction Sequence) in Ruby, and meta-information about those instruction sequences.
How RubyVM works
The structure of YARV
Note: The figure is my interpretation of Ruby's core code.

Ruby 3.2:
- The RUBY_PLATFORM name section (string)
- The alignment section (filled with 0xff to align to every 2 bytes)
- A part of instruction sequences
  - An instruction sequence information section (normally 44 entries * 4 bytes = 176 bytes)
  - A code section (N>0 bytes)
  - A local table section (N>=0 bytes)
  - A call info entry section (N>=0 bytes)
- The instruction sequence offsets section (one offset per instruction sequence; N>0 * 4 bytes)
- A part of global objects
  - Information on string / class / fixed number / bool types and their data
- The global object offsets section (one offset per global object; N>0 * 4 bytes)
- The extra data (only if extra data is embedded; N>=0 bytes)

Ruby 3.3:
- The endian section (2 bytes)
- The word size section (2 bytes)

The platform name section changed to the endian section and the word size section. There are no other changes in the YARV structure between Ruby 3.2 and 3.3.
on the left is actually sufficient to output "Hello World!" as a string. However, it becomes more difficult when it comes to YARV.
- Not only for RubyVM: the hardest part of implementing any VM is getting it to output "Hello World!".
- It is so interesting that it drives me crazy.
The JVM has an extensive document called the "Java Virtual Machine Specification" (i.e., the JVM Specification) [1].
- While Java documentation is maintained by a company, RubyVM documentation is maintained by the community. Therefore, the maintenance of the documentation is inevitably limited compared to that of a company.
- In such a situation, how can we implement a RubyVM?
How to implement a RubyVM with PHP?
[1]: https://docs.oracle.com/javase/specs/jvms/se8/html/
How to implement a RubyVM with PHP?
- Process the YARV Header
- Process the Instruction Sequence offsets
- Process the Global Object offsets
- Process the Instruction Sequence byte-code
- Execute the byte-code of the 0th Instruction Sequence
file; Ruby provides "RubyVM::InstructionSequence.compile" to compile Ruby code into instruction sequences.
- For example, the command below creates a YARV file named HelloWorld.yarv using "ruby -e".

"fread", "fseek", and "unpack" are very useful if you want to read binary files in PHP.
- Of course, an implementation without the "unpack" function is also possible by using bitwise operations.
- PHP is unlike the C language: it cannot read binary files directly into a specified type (e.g., an integer type). Therefore, it is necessary to read the binary once as a string (using "fread") and then convert it to an integer (using "unpack").
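The read-as-a-string-then-convert pattern with "fread", "fseek", and "unpack" can be sketched as follows. This is only a minimal sketch: the helper names `readUint32LE` and `uint32FromBytesLE`, the in-memory stream standing in for a real file, and the little-endian byte order (the `V` format code) are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: read 4 bytes as a string, then convert them to an integer.
function readUint32LE($handle): int
{
    $bytes = fread($handle, 4);
    if ($bytes === false || strlen($bytes) !== 4) {
        throw new RuntimeException('Could not read 4 bytes');
    }
    // "V" means unsigned 32-bit, little-endian (an assumption about the file).
    return unpack('V', $bytes)[1];
}

// The same conversion using only bitwise operations instead of unpack().
function uint32FromBytesLE(string $bytes): int
{
    return ord($bytes[0])
        | (ord($bytes[1]) << 8)
        | (ord($bytes[2]) << 16)
        | (ord($bytes[3]) << 24);
}

// An in-memory stream stands in for a real .yarv file.
$handle = fopen('php://memory', 'w+b');
fwrite($handle, 'YARB' . pack('V', 3) . pack('V', 2));
rewind($handle);

$magic = fread($handle, 4);
$major = readUint32LE($handle);
$minor = readUint32LE($handle);

// fseek() moves the cursor; here it jumps back to the major version field.
fseek($handle, 4);
$majorAgain = uint32FromBytesLE(fread($handle, 4));

printf("%s %d.%d\n", $magic, $major, $minor); // prints "YARB 3.2"
```

Both helpers return the same value for the same 4 bytes; "unpack" is simply shorter to write.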
The header of YARV
It will look like the figure on the right.
- 4 bytes: The fixed magic string "YARB" (Yet Another Ruby Binary)
- 4 bytes: The major version of the Ruby that compiled the file. In the example, it is "3".
- 4 bytes: The minor version of the Ruby that compiled the file. In the example, it is "2".
- 4 bytes: The binary payload size
- 4 bytes: The extra binary payload size
- 4 bytes (iseq size): The number of instruction sequences
- 4 bytes (global object list size): The number of symbols (these symbols are different from Ruby's Symbol)
- 4 bytes (iseq list offset): The offset of the offsets list for the instruction sequences
- 4 bytes (global object list offset): The offset of the offsets list for the symbols
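The header fields listed above can be read in one "unpack" call. The following is a sketch under assumptions: the function name `parseYarvHeader`, the array key names, the little-endian byte order, and the sample values are all hypothetical, and the field order follows my understanding of the header rather than anything guaranteed by the slides.

```php
<?php
declare(strict_types=1);

// Hypothetical parser for the nine header fields (4 bytes each, 36 bytes total).
function parseYarvHeader(string $binary): array
{
    if (strlen($binary) < 36) {
        throw new RuntimeException('Header is shorter than 36 bytes');
    }

    // "a4" reads the 4-byte magic string; each "V" reads one
    // unsigned 32-bit little-endian integer (assumed byte order).
    $header = unpack(
        'a4magic/Vmajor/Vminor/Vsize/VextraSize/ViseqListSize'
        . '/VglobalObjectListSize/ViseqListOffset/VglobalObjectListOffset',
        $binary
    );

    if ($header['magic'] !== 'YARB') {
        throw new RuntimeException('Not a YARB file');
    }

    return $header;
}

// Made-up sample header, only for demonstration.
$sample = pack('a4V8', 'YARB', 3, 2, 120, 0, 1, 4, 100, 104);
$header = parseYarvHeader($sample);

printf(
    "Ruby %d.%d, %d iseq(s)\n",
    $header['major'],
    $header['minor'],
    $header['iseqListSize']
); // prints "Ruby 3.2, 1 iseq(s)"
```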
size" times, reading a 4-byte offset each time.
Move the cursor to the global object list offset.
Loop "global object list size" times, reading a 4-byte offset each time.
For example, get the offsets for the instruction sequences.
For example, get the offsets for the symbols.
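The move-the-cursor-then-loop steps can be sketched as one function that serves both offset lists. The name `readOffsets`, the in-memory stream, and the little-endian byte order are assumptions for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: seek to a list and read $listSize 4-byte offsets.
function readOffsets($handle, int $listOffset, int $listSize): array
{
    fseek($handle, $listOffset);

    $offsets = [];
    for ($i = 0; $i < $listSize; $i++) {
        $bytes = fread($handle, 4);
        if ($bytes === false || strlen($bytes) !== 4) {
            throw new RuntimeException('Unexpected end of data');
        }
        // Assumed byte order: little-endian ("V").
        $offsets[] = unpack('V', $bytes)[1];
    }

    return $offsets;
}

// Demo: 8 bytes of padding, followed by a list of three offsets.
$handle = fopen('php://memory', 'w+b');
fwrite($handle, str_repeat("\0", 8) . pack('V3', 16, 32, 48));

$offsets = readOffsets($handle, 8, 3);
echo implode(', ', $offsets), "\n"; // prints "16, 32, 48"
```

Calling it once with the iseq list offset and size, and once with the global object list offset and size, yields the two offset lists.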
called "ibf_(?:load|dump_write)_small_value" for efficient binary handling [1].
- It uses Hamming weights [2] (also called popcount or population count) to handle variable byte lengths. An example implementation, written in PHP, is shown in the figure on the left.
- In this talk, I will name it "readSmallValue".
[1]: https://github.com/ruby/ruby/blob/2f603bc4/compile.c#L11262-L11273
[2]: https://ja.wikipedia.org/wiki/%E3%83%8F%E3%83%9F%E3%83%B3%E3%82%B0%E9%87%8D%E3%81%BF
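Since the figure is not reproduced here, the following is a sketch of what a "readSmallValue" could look like, based on my reading of the linked "ibf_load_small_value"; it is not the author's exact code. It assumes the first byte encodes the total length in its trailing bits (lowest bit set: 1 byte; otherwise the number of trailing zero bits plus one; a zero byte: 9 bytes), and it counts trailing zeros with a plain loop rather than a Hamming weight. The signature and the by-reference `$offset` are hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch of a variable-length "small value" reader.
function readSmallValue(string $buffer, int &$offset): int
{
    if ($offset < 0 || $offset >= strlen($buffer)) {
        throw new RuntimeException('Offset is out of range');
    }

    $first = ord($buffer[$offset]);

    if ($first & 1) {
        $length = 1;            // lowest bit set: the value fits in this byte
    } elseif ($first === 0) {
        $length = 9;            // marker byte followed by 8 data bytes
    } else {
        // Count trailing zero bits; the total length is that count plus one.
        $trailingZeros = 0;
        while ((($first >> $trailingZeros) & 1) === 0) {
            $trailingZeros++;
        }
        $length = $trailingZeros + 1;
    }

    if ($offset + $length > strlen($buffer)) {
        throw new RuntimeException('Invalid byte sequence');
    }

    // Drop the length marker bits, then append the remaining bytes.
    // Note: PHP integers are signed 64-bit, so very large values may overflow.
    $value = $first >> $length;
    for ($i = 1; $i < $length; $i++) {
        $value = ($value << 8) | ord($buffer[$offset + $i]);
    }

    $offset += $length;

    return $value;
}

$offset = 0;
$data = "\x0b\x06\x2c";
$a = readSmallValue($data, $offset); // 5   (1 byte:  0x0b >> 1)
$b = readSmallValue($data, $offset); // 300 (2 bytes: 0x06, 0x2c)
printf("%d %d (cursor at %d)\n", $a, $b, $offset);
```

Passing the cursor by reference lets consecutive calls walk through a run of small values without tracking lengths by hand.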
as in the previous example, it is possible to read the data structure of the 0th Instruction Sequence.
- The data structure of an Instruction Sequence is actually huge. Much of it is meta-information, such as the exception table and keyword arguments, and there are more than 50 pieces of meta-information.
- It is not necessary to implement all of the data structure if the only goal is to output "HelloWorld!". Therefore, we omit most of it and use only the 4 pieces of meta-information that are needed.
of RubyVM, read the Ruby core implementation (https://github.com/ruby/ruby/blob/ruby_3_3/compile.c#L12514).
- Or see my implementation of "RubyVM on PHP" (https://github.com/m3m0r7/rubyvm-on-php/blob/0.3.3.0/src/VM/Core/Runtime/Kernel/Ruby3_3/InstructionSequence/InstructionSequenceProcessor.php#L59).
Move the cursor to the 0th instruction sequence using the offsets list for the instruction sequences.
Figure: the instruction sequence shown as a series of "sv" fields.
※ "sv" is short for "small value" (a variable number of bytes).
- $bytecodeOffset and $iseqSize were obtained on the previous slide. Using these, we will implement the execution of opcodes while reading the instruction sequence.
- Four instructions are used to output "HelloWorld!": "putself (18)", "putstring (21)", "opt_send_without_block (51)", and "leave (60)".
- Make an array of opcode and mnemonic pairs as shown in the figure on the left.
The pairs of opcodes and mnemonics are implemented in https://github.com/ruby/ruby/blob/ruby_3_3/yjit/src/cruby_bindings.inc.rs#L669-L872.
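An array of opcode and mnemonic pairs could look like the sketch below. The opcode numbers are the four given on the slide; the constant name `INSN_NAMES` and the function `mnemonicFor` are hypothetical.

```php
<?php
declare(strict_types=1);

// Opcode => mnemonic pairs, using the numbers given on the slide.
const INSN_NAMES = [
    18 => 'putself',
    21 => 'putstring',
    51 => 'opt_send_without_block',
    60 => 'leave',
];

// Hypothetical lookup that fails loudly for instructions not implemented yet.
function mnemonicFor(int $opcode): string
{
    $name = INSN_NAMES[$opcode] ?? null;
    if ($name === null) {
        throw new RuntimeException(sprintf('Unsupported opcode: %d', $opcode));
    }

    return $name;
}

foreach (array_keys(INSN_NAMES) as $opcode) {
    printf("%d => %s\n", $opcode, mnemonicFor($opcode));
}
```

Throwing on an unknown opcode makes it obvious which instruction needs to be implemented next when trying a larger Ruby program.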
Instructions: putself → putstring (operand: "HelloWorld!") → opt_send_without_block → leave
- putself: Push the running context onto the stack. Stack: the running context
- putstring: Push the string "HelloWorld!" onto the stack. Stack: the running context, "HelloWorld!"
- opt_send_without_block: Pop two values (the running context and "HelloWorld!") from the stack. Then call the "puts" method on the running context.
- leave: Finish execution. Return the result to the upper context.
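The four stack steps can be sketched as a tiny interpreter loop. To keep it short, this sketch assumes the byte-code has already been decoded into `[mnemonic, operand]` pairs and hard-codes the call to "puts" with one argument; the `execute` function and this decoded form are hypothetical.

```php
<?php
declare(strict_types=1);

// The running context: an object that knows how to "puts".
final class Main
{
    public function puts(string $message)
    {
        echo $message, PHP_EOL;

        return null; // Ruby's puts returns nil
    }
}

// Hypothetical executor over already-decoded instructions.
function execute(array $instructions, object $self)
{
    $stack = [];

    foreach ($instructions as $instruction) {
        switch ($instruction[0]) {
            case 'putself':
                $stack[] = $self;              // push the running context
                break;
            case 'putstring':
                $stack[] = $instruction[1];    // push the string operand
                break;
            case 'opt_send_without_block':
                $argument = array_pop($stack); // "HelloWorld!"
                $receiver = array_pop($stack); // the running context
                $stack[] = $receiver->puts($argument);
                break;
            case 'leave':
                return array_pop($stack);      // hand the result to the caller
            default:
                throw new RuntimeException('Unsupported: ' . $instruction[0]);
        }
    }

    return null;
}

$program = [
    ['putself'],
    ['putstring', 'HelloWorld!'],
    ['opt_send_without_block'],
    ['leave'],
];

$result = execute($program, new Main());
```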
the instruction sequence after being converted to YARV can be obtained with something like "puts RubyVM::InstructionSequence.compile("puts 'HelloWorld!'", "HelloWorld.rb").disasm"; this is similar to the javap command in Java.
loadObject as shown in the figure on the left. loadObject is a function that gets the symbol from its offset.
- In addition, we implement the Main class, which implements the "puts" method.
Here the number comes from https://github.com/ruby/ruby/blob/ruby_3_3/compile.c#L13303-L13336; it is the index into the functions array. For example, string is "5".
An example of the implementation of the "putstring" instruction
An example of the implementation of the "opt_send_without_block" instruction
An example of the implementation of the "leave" instruction
figure shows the output of "HelloWorld!" when executing the previous implementation.
- Example source code is published in the following gist.
- https://gist.github.com/m3m0r7/226e20c8115caf4a9d43b291861f978b
CallInfoEntry contains meta-information about the method to be executed, such as the method name, the number of arguments, the names of keyword arguments, and so on.
- By implementing it as shown on the following page, the method can be called without hard-coding "puts".
defined in the figure on the left are required in the actual implementation, but are omitted in this example because they are not needed to output "HelloWorld!".
resolves the method name using the CallInfoEntry. It also resolves the number of arguments, so that the method can still be called when the number of arguments increases.
- The CallInfoEntry provides a variety of information. If you are interested in more, you can read the Ruby core code or my implementation of "RubyVM on PHP" and try to implement it.
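Resolving the method name and the argument count from a CallInfoEntry can be sketched as follows. This is an illustration only: the `CallInfoEntry` class here carries just two hypothetical fields (`methodName` and `argc`), the entry is constructed by hand rather than read from the binary, and `sendWithoutBlock` is a made-up function name.

```php
<?php
declare(strict_types=1);

// Hypothetical, heavily reduced CallInfoEntry: method name and argument count only.
final class CallInfoEntry
{
    public string $methodName;
    public int $argc;

    public function __construct(string $methodName, int $argc)
    {
        $this->methodName = $methodName;
        $this->argc = $argc;
    }
}

final class Main
{
    public function puts(string ...$messages)
    {
        foreach ($messages as $message) {
            echo $message, PHP_EOL;
        }

        return null;
    }
}

// Pops argc arguments and the receiver, then calls the method named by the entry.
function sendWithoutBlock(array &$stack, CallInfoEntry $callInfo): void
{
    // array_splice with a negative offset of 0 would take everything, so guard it.
    $arguments = $callInfo->argc > 0
        ? array_splice($stack, -$callInfo->argc)
        : [];
    $receiver = array_pop($stack);

    if (!is_object($receiver) || !method_exists($receiver, $callInfo->methodName)) {
        throw new RuntimeException('Undefined method: ' . $callInfo->methodName);
    }

    $stack[] = $receiver->{$callInfo->methodName}(...$arguments);
}

$stack = [new Main(), 'Hello', 'World!'];
sendWithoutBlock($stack, new CallInfoEntry('puts', 2));
```

Because the name and the argument count come from the entry, the same function also handles a call with a different method or a different number of arguments.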
following gist for the code, including the changes made earlier.
- https://gist.github.com/m3m0r7/226e20c8115caf4a9d43b291861f978b?permalink_comment_id=4686761#gistcomment-4686761
How many instructions are there in the RubyVM instruction set?
- RubyVM has about 100 instructions. Strictly speaking it has about 200, but half of them are tracing variants. By the way, the JVM has about 150 instructions (as of SE 13).
- For a simple "HelloWorld!" output, a FizzBuzz algorithm, or a QuickSort algorithm, it is not necessary to implement everything. They can be executed by implementing only a few instructions.
RubyVM.
- https://github.com/ruby/ruby/blob/ruby_3_3/insns.def
- See below for an example implementation in PHP.
- https://github.com/m3m0r7/rubyvm-on-php/tree/0.3.3.0/src/VM/Core/Runtime/Executor/Insn/Processor
- The following 6 instructions have been added as of Ruby 3.3.0.
- In Ruby, when the instruction set grows, the opcodes of other instructions may shift. For example, OPT_SEND_WITHOUT_BLOCK was 51 in Ruby 3.2, but it is 53 in Ruby 3.3. If you want to support multiple versions of RubyVM, you need to take this behavior into account.

OpCode  Mnemonic
33      SPLATKW
45      DEFINEDIVAR
58      OPT_NEWARRAY_SEND
135     TRACE_SPLATKW
147     TRACE_DEFINEDIVAR
160     TRACE_OPT_NEWARRAY_SEND
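One way to cope with shifting opcodes is to keep a separate opcode table per Ruby version and pick one using the major and minor version from the YARV header. The sketch below contains only the opcode numbers mentioned in this talk, so the tables are deliberately incomplete; `OPCODE_TABLES` and `mnemonicFor` are hypothetical names.

```php
<?php
declare(strict_types=1);

// Per-version opcode tables (partial: only numbers mentioned in the talk).
const OPCODE_TABLES = [
    '3.2' => [
        51 => 'opt_send_without_block',
    ],
    '3.3' => [
        33 => 'splatkw',
        45 => 'definedivar',
        53 => 'opt_send_without_block',
        58 => 'opt_newarray_send',
    ],
];

// Hypothetical lookup keyed by the version read from the YARV header.
function mnemonicFor(int $major, int $minor, int $opcode): string
{
    $version = sprintf('%d.%d', $major, $minor);

    $table = OPCODE_TABLES[$version] ?? null;
    if ($table === null) {
        throw new RuntimeException('Unsupported Ruby version: ' . $version);
    }

    $name = $table[$opcode] ?? null;
    if ($name === null) {
        throw new RuntimeException(sprintf('Unknown opcode %d for Ruby %s', $opcode, $version));
    }

    return $name;
}

echo mnemonicFor(3, 2, 51), "\n"; // prints "opt_send_without_block"
echo mnemonicFor(3, 3, 53), "\n"; // prints "opt_send_without_block"
```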
How do we understand RubyVM from Ruby's core code?
- The easiest way to understand the Ruby implementation is to look at compile.c (https://github.com/ruby/ruby/blob/ruby_3_3/compile.c).
- In particular, it is a good idea to follow the `ibf_load_*` functions.
- However, you will have to follow the code to find out in which order the functions are called, so I will draw a flow diagram on the next page to give you a rough idea of how to follow it.
Flow diagram:
- iseq.c: iseqw_s_load_from_binary ← the function called in Ruby's core code when the RubyVM::InstructionSequence.load_from_binary method is called
- compile.c: rb_iseq_ibf_load → ibf_load_setup → ibf_load_iseq (ibf_load_iseq is called) → rb_ibf_load_iseq_complete → ibf_load_iseq_each
- The functions below are called by ibf_load_iseq_each: ibf_load_code, ibf_load_small_value, ibf_load_object/id, ibf_load_local_table, and so on...
- Return the read ISeq
work?
- Local variables are one of the difficult parts of implementing RubyVM. I myself have often failed to implement them.
- When executing a defined method, arguments must be pre-set in a local table at runtime (rather than pushed onto the stack), but the slot in which each argument is placed must be calculated, and that is difficult.
- Although the source code seems to require an Environment Pointer (EP), I explain how to do without the EP in these slides.
- The slot for "varN" (where N is a natural number, N>0) is determined by "VM_ENV_DATA_SIZE + local table size - N". For example, var1 is stored in "slot[6]", calculated as "VM_ENV_DATA_SIZE(3) + local table size(4) - N(1)".
- The arguments passed to a method must be associated with slot indexes in the reverse order of the defined arguments.
- The "[gs]etlocal(?:_WC[01]|)" opcodes follow this rule when reading and writing values.
- The reason for starting from the third position seems to be that the information necessary for RubyVM is embedded there ("VM_ENV_DATA_INDEX_ME_CREF", "VM_ENV_DATA_INDEX_SPECVAL", "VM_ENV_DATA_INDEX_FLAGS").
- If there are arguments, it is necessary to prepopulate the slots with values before calling the method. For example, "var1" needs to be placed in "slot[6]" and "var2" in "slot[5]" in advance. Note that "var3" and "var4" do not need to be prepopulated, because "setlocal" is called for them inside the method's own instruction sequence. (Implementation hint: https://github.com/m3m0r7/rubyvm-on-php/blob/0.3.3.0/src/VM/Core/Runtime/Executor/CallBlockHelper.php#L121)
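The slot calculation and the prepopulation of arguments can be sketched as below. It uses the form of the rule that matches the worked example (var1 in slot[6] and var2 in slot[5] for a local table of size 4), that is, VM_ENV_DATA_SIZE + local table size - N. The function names `slotIndex` and `prepopulateArguments` are hypothetical.

```php
<?php
declare(strict_types=1);

// Three slots are reserved for VM bookkeeping (ME_CREF, SPECVAL, FLAGS).
const VM_ENV_DATA_SIZE = 3;

// Slot index of the N-th local variable (N starts at 1).
function slotIndex(int $localTableSize, int $n): int
{
    if ($n < 1 || $n > $localTableSize) {
        throw new OutOfRangeException('N must be between 1 and the local table size');
    }

    return VM_ENV_DATA_SIZE + $localTableSize - $n;
}

// Hypothetical helper: place the call arguments into their slots
// before the method's instruction sequence starts running.
function prepopulateArguments(array $arguments, int $localTableSize): array
{
    $slots = [];
    foreach (array_values($arguments) as $i => $value) {
        $slots[slotIndex($localTableSize, $i + 1)] = $value;
    }

    return $slots;
}

// A method with 2 arguments (var1, var2) and 2 more locals (var3, var4).
$slots = prepopulateArguments(['first', 'second'], 4);
printf("var1 -> slot[%d], var2 -> slot[%d]\n", slotIndex(4, 1), slotIndex(4, 2));
// prints "var1 -> slot[6], var2 -> slot[5]"
```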
- In addition, RubyVM's local variables have the concept of "level", where level = 0 represents the current execution context. Each time the level increases (1, 2, 3, ...), the local variables of one more previous context are referenced (this corresponds to the VM_ENV_PREV_EP macro in the Ruby core code; see the figure on the next page).
- It is easy to understand if you think of level as a position relative to the context that is currently being executed.
- Therefore, it is necessary to be able to trace back through the contexts from the one being executed.
level is 0 and "setlocal_WC0" is executed.
- var3 is defined in the previous context. Therefore, level is 1 relative to the running context, and "getlocal_WC1" is executed.
- var3 is defined in the previous context. Therefore, level is 1 relative to the running context, and "setlocal_WC1" is executed.
- var1 and var2 are defined in the running context. Therefore, level is 0, and "getlocal_WC0" is executed.
- When a call is made (internally, when an opcode such as send/opt_send_without_block is executed), the execution context changes.
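The "level" idea can be sketched without an Environment Pointer by giving each context a link to the previous one and walking that link "level" times. The class `ExecutionContext` and its methods are hypothetical names, and the slot numbers in the demo are made up.

```php
<?php
declare(strict_types=1);

// Hypothetical execution context: its own slots plus a link to the previous context.
final class ExecutionContext
{
    private array $slots = [];
    private ?ExecutionContext $previous;

    public function __construct(?ExecutionContext $previous = null)
    {
        $this->previous = $previous;
    }

    // Walk back "level" contexts; level 0 is the running context itself.
    private function resolve(int $level): ExecutionContext
    {
        $context = $this;
        for ($i = 0; $i < $level; $i++) {
            if ($context->previous === null) {
                throw new RuntimeException('No previous context at level ' . $level);
            }
            $context = $context->previous;
        }

        return $context;
    }

    // setlocal_WC0 corresponds to level 0, setlocal_WC1 to level 1.
    public function setLocal(int $slot, int $level, $value): void
    {
        $this->resolve($level)->slots[$slot] = $value;
    }

    // getlocal_WC0 corresponds to level 0, getlocal_WC1 to level 1.
    public function getLocal(int $slot, int $level)
    {
        $context = $this->resolve($level);
        if (!array_key_exists($slot, $context->slots)) {
            throw new RuntimeException('Slot ' . $slot . ' is not set');
        }

        return $context->slots[$slot];
    }
}

$outer = new ExecutionContext();
$outer->setLocal(4, 0, 'var3 in the previous context'); // like setlocal_WC0

$inner = new ExecutionContext($outer);                  // the context changes on a call
$inner->setLocal(6, 0, 'var1 in the running context');  // like setlocal_WC0
$seen = $inner->getLocal(4, 1);                         // like getlocal_WC1
$inner->setLocal(4, 1, 'var3 updated');                 // like setlocal_WC1
$updated = $outer->getLocal(4, 0);
```

Keeping the previous-context link in the object means the trace-back is just a loop, which is the part that replaces the EP arithmetic in this sketch.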