How to implement a RubyVM with PHP?

RubyKaigi 2024

memory

May 17, 2024

Transcript

  1. memory (m3m0r7)

    I am a CTO at Liiga, Inc. I enjoy implementing VMs with PHP; for example, I have already implemented a JVM and a RubyVM. I am the author of some books. My hobby is reading binary files. I have been writing Ruby for 4 months. (memory1994 / m3m0r7)
  2. 

  3. "Yes, let's make a VM."

    "If I make a VM, I can understand how Ruby feels."
  4. Table of Contents 1/2

    1. How does RubyVM work?
    2. How to implement a RubyVM with PHP?
    3. What is "CallInfoEntry"?
    4. How many instructions are there in the RubyVM instruction set?
  5. Table of Contents 2/2

    5. How do we understand RubyVM from Ruby's core code?
    6. How do local variables work?
    7. DEMO
  6. Topics I will not cover are...

    - How the lexical analyzer for Ruby works
    - How the parser for Ruby works
    - How to write PHP and Ruby
  7. What is important when implementing a VM?

    D R Y: Do, Repeat, Yaruki (which means "motivation" in English)
  8. What is "RubyVM"?

    - The RubyVM is also called YARV (Yet Another Ruby VM).
    - YARV consists of instruction sequences (lists of instructions to be executed, called ISeq in Ruby) and meta-information about those instruction sequences.

    Section: How RubyVM works
  9. What is "RubyVM"?

    - The figure on this slide shows the YARV for outputting "Hello World!". You can see the string "Hello World!" in it.
  10. The structure of YARV

    - The header section (36 bytes)
    - The RUBY_PLATFORM name section (string)
    - The payload section
    - A part of instruction sequences
      - Information on the instruction sequence section (normally 44 entries * 4 bytes = 176 bytes; notice: no …)
      - A code section (N>0 bytes)
      - A local table section (N>=0 bytes)
      - A call info entry section (N>=0 bytes)
    - A part of global objects
      - Information on string / class / fixed number / bool types and data
    - The alignment section (filled with 0xff to align every 2 bytes)
    - The instruction sequence offsets section (one offset per instruction sequence; N>0 * 4 bytes)
    - The global object offsets section (one offset per global object; N>0 * 4 bytes)
    - The extra data (if extra data is embedded; N>=0 bytes)

    Note: The figure is my interpretation of Ruby's core code.
  11. The structure of YARV

    - The header section (36 bytes)
    - The endian section (2 bytes)
    - The word size section (2 bytes)
    - The payload section
    - A part of instruction sequences
      - Information on the instruction sequence section (normally 44 entries * 4 bytes = 176 bytes; notice: no …)
      - A code section (N>0 bytes)
      - A local table section (N>=0 bytes)
      - A call info entry section (N>=0 bytes)
    - A part of global objects
      - Information on string / class / fixed number / bool types and data
    - The alignment section (filled with 0xff to align every 2 bytes)
    - The instruction sequence offsets section (one offset per instruction sequence; N>0 * 4 bytes)
    - The global object offsets section (one offset per global object; N>0 * 4 bytes)
    - The extra data (if extra data is embedded; N>=0 bytes)

    Note: The figure is my interpretation of Ruby's core code.
  12. The structure of YARV

    - The platform name section changed to the endian section (2 bytes) and the word size section (2 bytes).
    - There are no other changes in the YARV structure between Ruby 3.2 and 3.3.

    Note: The figure is my interpretation of Ruby's core code.
  13. What is "RubyVM"?

    - In Ruby, the code shown on the left is all you need to output "Hello World!" as a string. However, it becomes more difficult when it comes to YARV.
    - Not only for RubyVM but for any VM, the hardest part of implementing it is getting it to output "Hello World!".
    - It is so interesting that it drives me crazy.
  14. Do you have any idea how to implement a RubyVM?

    - The JVM has an extensive document called the "Java Virtual Machine Specification" (i.e., the JVM Specification) [1].
    - While Java documentation is maintained by companies, RubyVM documentation is maintained by the community. Therefore, the maintenance of documentation is inevitably limited compared to that of a company.
    - In such a situation, how can we implement RubyVM?

    [1]: https://docs.oracle.com/javase/specs/jvms/se8/html/

    Section: How to implement a RubyVM with PHP?
  15. The flow of implementation

    1. Process the YARV header
    2. Process the instruction sequence offsets
    3. Process the global object offsets
    4. Process the instruction sequence byte-code
    5. Execute the byte-code of the 0th instruction sequence
  16. Processing binary structures in PHP

    - First, generate a YARV file. Ruby provides "RubyVM::InstructionSequence.compile" to compile Ruby code into instruction sequences.
    - For example, the command on this slide creates a YARV file named HelloWorld.yarv using "ruby -e".
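As a rough sketch of this step, the Ruby snippet below compiles a one-line program and writes the resulting binary to HelloWorld.yarv. `RubyVM::InstructionSequence.compile` and `to_binary` are CRuby APIs; the exact command on the slide is not reproduced in the transcript, so the source string and the use of a temporary directory are assumptions.

```ruby
require "tmpdir"

# Compile a one-line program into an instruction sequence (CRuby only).
iseq = RubyVM::InstructionSequence.compile("puts 'HelloWorld!'")

# Serialize the instruction sequence into the YARV binary format.
binary = iseq.to_binary

# Write it out; a temporary directory is used only to keep the sketch tidy.
path = File.join(Dir.tmpdir, "HelloWorld.yarv")
File.binwrite(path, binary)

# The first 4 bytes of the file are the magic string.
puts binary[0, 4]
```

The layout of the bytes after the magic string depends on the Ruby version that produced the file, so a reader has to check the version fields before going further.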
  17. Processing binary structures in PHP

    - The "fread", "fseek", and "unpack" functions are very useful if you want to read binary files in PHP.
    - Of course, an implementation without the "unpack" function is also possible by using bitwise operations.
    - PHP is unlike the C language: it cannot read binary files directly into a specified type (e.g., an integer type). Therefore, it is necessary to read the binary once as a string (using "fread") and then convert it to an integer type (using "unpack").
  18. The flow of implementation (the same five steps as on slide 15)
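Before moving on to the header, the read-then-convert pattern described for PHP's "fread", "fseek", and "unpack" can be sketched as follows. The sketch is in Ruby rather than PHP: `IO#read` and `IO#seek` stand in for "fread" and "fseek", and `String#unpack1` stands in for "unpack". The sample bytes and the little-endian byte order are assumptions made for illustration.

```ruby
require "stringio"

# Eight sample bytes standing in for a binary file (assumed little-endian).
io = StringIO.new([0x59, 0x41, 0x52, 0x42, 0x03, 0x00, 0x00, 0x00].pack("C*"))

# Skip the first 4 bytes, then read the next 4 bytes as a raw string.
io.seek(4)
raw = io.read(4)

# Convert the raw string to an integer: "V" is a 32-bit little-endian unsigned value.
with_unpack = raw.unpack1("V")

# The same conversion using only bitwise operations.
with_bitwise = raw.bytes.each_with_index.sum { |byte, i| byte << (8 * i) }

puts with_unpack
puts with_bitwise
```

Both conversions give the same integer; the bitwise version is what an implementation without "unpack" would fall back to.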
  19. The header of YARV

    - magic (4 bytes): the specified magic string, YARB (Yet Another Ruby Binary)
    - major version (4 bytes): the compiled Ruby major version. In the example, it is "3".
    - minor version (4 bytes): the compiled Ruby minor version. In the example, it is "2".
    - size (4 bytes): the binary payload size
    - extra size (4 bytes): the extra binary payload size
    - iseq list size (4 bytes): the number of instruction sequences
    - global object list size (4 bytes): the number of symbols (these symbols are different from Ruby's symbols)
    - iseq list offset (4 bytes): offsets for the instruction sequences
    - global object list offset (4 bytes): offsets for the Ruby symbols

    It will look like the figure on the right.
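The header fields listed above can be read in one pass. The following is a minimal sketch, assuming the nine fields appear in the order magic, major version, minor version, size, extra size, iseq list size, global object list size, iseq list offset, global object list offset, and that integers are little-endian. `parse_header` and the sample bytes are hypothetical and do not come from a real file.

```ruby
# Parse the 36-byte header: a 4-byte magic followed by eight 4-byte integers.
def parse_header(bytes)
  fields = %i[
    major_version minor_version size extra_size
    iseq_list_size global_object_list_size
    iseq_list_offset global_object_list_offset
  ]
  magic = bytes[0, 4]
  raise ArgumentError, "not a YARB binary" unless magic == "YARB"

  # "V8" reads eight 32-bit little-endian unsigned integers after the magic.
  values = bytes[4, 32].unpack("V8")
  { magic: magic }.merge(fields.zip(values).to_h)
end

# A hand-made header for illustration (assumed values).
sample = "YARB".b + [3, 2, 200, 0, 1, 5, 160, 164].pack("V8")
header = parse_header(sample)
puts header[:major_version]
puts header[:iseq_list_offset]
```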
  20. The flow of implementation (the same five steps as on slide 15)
  21. Process the offsets

    - Move the cursor to the iseq list offset, then loop "iseq list size" times, reading one 4-byte offset each time. This gets the offsets for the instruction sequences.
    - Move the cursor to the global object list offset, then loop "global object list size" times, reading one 4-byte offset each time. This gets the offsets for the symbols.
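Both loops have the same shape, so one helper can serve the instruction sequence offsets and the global object offsets. This is a sketch under the assumption that offsets are 4-byte little-endian integers; `read_offsets` and the sample payload are hypothetical.

```ruby
# Read `count` 4-byte little-endian offsets starting at `list_offset`.
def read_offsets(bytes, list_offset, count)
  bytes[list_offset, count * 4].unpack("V#{count}")
end

# A hand-made payload: 8 filler bytes, two iseq offsets, then three object offsets.
payload = ("\x00" * 8).b + [36, 120].pack("V2") + [200, 210, 224].pack("V3")

iseq_offsets   = read_offsets(payload, 8, 2)
object_offsets = read_offsets(payload, 16, 3)
p iseq_offsets
p object_offsets
```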
  22. The flow of implementation (the same five steps as on slide 15)
  23. Process the Instruction Sequence byte-code

    - RubyVM has an implementation called "ibf_(?:load|dump_write)_small_value" for efficient binary handling [1].
    - It uses Hamming weights [2] (also called popcount or population count) to handle variable byte lengths. An example implementation written in PHP is shown in the left figure.
    - In this talk, I will call it "readSmallValue".

    [1]: https://github.com/ruby/ruby/blob/2f603bc4/compile.c#L11262-L11273
    [2]: https://ja.wikipedia.org/wiki/%E3%83%8F%E3%83%9F%E3%83%B3%E3%82%B0%E9%87%8D%E3%81%BF
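The PHP code on the slide is not part of the transcript, so the following is only a sketch of the decoding side, modeled on my reading of `ibf_load_small_value` in compile.c. It assumes that the number of trailing zero bits in the first byte gives the encoded length, and it computes that number with a population count. `read_small_value` and `trailing_zeros` are hypothetical names.

```ruby
# Count trailing zero bits of a non-zero byte via a population count:
# (~c & (c - 1)) has a 1 bit for every trailing zero of c.
def trailing_zeros(c)
  (~c & (c - 1)).to_s(2).count("1")
end

# Decode one small value starting at `offset`.
# Returns [value, next_offset].
def read_small_value(bytes, offset)
  c = bytes.getbyte(offset)
  # The low bits of the first byte encode the total length in bytes.
  n = if c.odd?
        1
      elsif c.zero?
        9
      else
        trailing_zeros(c) + 1
      end
  value = c >> n
  (1...n).each do |i|
    value = (value << 8) | bytes.getbyte(offset + i)
  end
  [value, offset + n]
end

# 0x0B is a one-byte value; 0x06 0x2C is a two-byte value.
bytes = [0x0B, 0x06, 0x2C].pack("C*")
first, cursor = read_small_value(bytes, 0)
second, cursor = read_small_value(bytes, cursor)
puts first
puts second
```

Returning the next offset together with the value lets the caller read several small values back to back, which is how the instruction sequence fields are laid out.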
  24. Process the Instruction Sequence byte-code

    - If readSmallValue is implemented as in the previous example, the data structure of the 0th instruction sequence can be read.
    - The data structure of an instruction sequence is actually huge. Much of it is meta-information, such as the exception table and keyword arguments, and there are more than 50 items of meta-information.
    - It is not necessary to implement all of the data structure if the only goal is to output "HelloWorld!". Therefore, we omit the rest and use only the 4 necessary items of meta-information.
  25. Process the Instruction Sequence byte-code

    - For an example implementation of RubyVM, read the Ruby core implementation (https://github.com/ruby/ruby/blob/ruby_3_3/compile.c#L12514).
    - Or see my implementation, "RubyVM on PHP" (https://github.com/m3m0r7/rubyvm-on-php/blob/0.3.3.0/src/VM/Core/Runtime/Kernel/Ruby3_3/InstructionSequence/InstructionSequenceProcessor.php#L59).
  26. Read an instruction sequence

    - Move the cursor to the 0th instruction sequence, using the offsets list for instruction sequences.
    - Read, in order: type (sv), iseq size (sv), bytecode offset (sv), bytecode size (sv).

    ※ "sv" is short for "small value" (a variable number of bytes).
  27. The flow of implementation (the same five steps as on slide 15)
  28. Execute the byte-code of 0th Instruction Sequence

    - The $bytecodeOffset and $iseqSize were obtained on the previous slide. Using these, we will implement execution of the operation codes while reading the instruction sequence.
    - Use 4 instructions, "putself (18)", "putstring (21)", "opt_send_without_block (51)", and "leave (60)", for outputting "HelloWorld!".
    - Make an array of opcode and mnemonic pairs as shown in the left figure.

    The pairs of opcodes and mnemonics are implemented in https://github.com/ruby/ruby/blob/ruby_3_3/yjit/src/cruby_bindings.inc.rs#L669-L872.
  29. Execute the byte-code of 0th Instruction Sequence

    Instructions: putself, putstring (operand: "HelloWorld!"), opt_send_without_block, leave
    - putself: Push the running context to the stack.
    - Stack: the running context
  30. Execute the byte-code of 0th Instruction Sequence

    - putstring (operand: "HelloWorld!"): Push the string "HelloWorld!" to the stack.
    - Stack: the running context, "HelloWorld!"
  31. Execute the byte-code of 0th Instruction Sequence

    - opt_send_without_block: Pop two values from the stack. Then call the "puts" method in the running context.
    - Stack: the running context, "HelloWorld!"
  32. Execute the byte-code of 0th Instruction Sequence

    - leave: Finish execution. Return the result to the upper context.
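The four steps above can be sketched as a small stack machine. The opcode numbers are the ones quoted on the slides; everything else is an assumption for illustration: the instruction list is already decoded (so operand decoding is skipped), the method name is passed as a plain operand, and a hash of lambdas stands in for the Main class.

```ruby
# Run a list of [opcode, operand] pairs against a running context.
def run(instructions, context)
  mnemonics = { 18 => :putself, 21 => :putstring, 51 => :opt_send_without_block, 60 => :leave }
  stack = []
  instructions.each do |opcode, operand|
    case mnemonics.fetch(opcode)
    when :putself
      stack.push(context)          # push the running context
    when :putstring
      stack.push(operand)          # push the string operand
    when :opt_send_without_block
      argument = stack.pop         # pop the argument, then the receiver
      receiver = stack.pop
      stack.push(receiver.fetch(operand).call(argument))
    when :leave
      return stack.pop             # hand the result to the upper context
    end
  end
end

output = []
# A stand-in for the Main class: a hash of callable methods (hypothetical).
main = { puts: ->(text) { output << text; nil } }

program = [[18, nil], [21, "HelloWorld!"], [51, :puts], [60, nil]]
result = run(program, main)
p output
```

The result of the call is pushed back onto the stack so that "leave" has something to return, which mirrors how the return value of "puts" reaches the upper context.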
  33. Execute the byte-code of 0th Instruction Sequence

    - In addition, the instruction sequence after conversion to YARV can be printed with something like "puts RubyVM::InstructionSequence.compile("puts 'HelloWorld!'", "HelloWorld.rb").disasm"; this is similar to the javap command in Java.
  34. Execute the byte-code of 0th Instruction Sequence

    - Next, implement loadObject as shown in the left figure. loadObject is a function to get the symbol from the offset.
    - In addition, we implement the Main class, which implements the "puts" method.

    The number here comes from https://github.com/ruby/ruby/blob/ruby_3_3/compile.c#L13303-L13336; it is the index into the functions array. For example, a string is "5".
  35. Example implementations

    - An example of the implementation of the "putself" instruction
    - An example of the implementation of the "putstring" instruction
    - An example of the implementation of the "opt_send_without_block" instruction
    - An example of the implementation of the "leave" instruction
  36. Execute the byte-code of 0th Instruction Sequence

    - The left figure shows the output of "HelloWorld!" when executing the previous implementation.
    - The example source code is published in the following gist.
      - https://gist.github.com/m3m0r7/226e20c8115caf4a9d43b291861f978b
  37. What is "CallInfoEntry"?

    - The CallInfoEntry contains meta-information about the method to be executed, such as the method name, the number of arguments, and the names of keyword arguments.
    - By implementing it as shown on the following page, the method can be called without specifying "puts" directly in the code.
  38. What is "CallInfoEntry"?

    - Only these two variables are used. The other variables defined in the figure on the left are required in the actual implementation, but are omitted in this example because they are not required for the output of "HelloWorld!".
    - Add this one ($ciSize) below.
    - Result
  39. What is "CallInfoEntry"?

    - This automatically resolves the method name using the CallInfoEntry. It also resolves the number of arguments, so the method can be called even when the number of arguments increases.
    - The CallInfoEntry provides a variety of information. If you are interested in more, you can read the Ruby core code or my implementation of "RubyVM on PHP" and try to implement it.
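The idea can be sketched with a reduced call info entry that carries only a method name and an argument count. This is an illustration under assumptions, not the layout used by Ruby: `send_with_call_info`, the `:mid` and `:argc` keys, and the hash of lambdas standing in for the receiver are all hypothetical.

```ruby
# Dispatch a call using the method name and argc taken from a call info entry.
def send_with_call_info(stack, call_info)
  # Pop argc arguments (they were pushed left to right), then the receiver.
  arguments = stack.pop(call_info[:argc])
  receiver = stack.pop
  stack.push(receiver.fetch(call_info[:mid]).call(*arguments))
end

# A stand-in receiver with two callable methods (hypothetical).
main = {
  puts: ->(text) { text },
  join: ->(a, b) { "#{a}#{b}" }
}

stack = [main, "Hello", "World!"]
send_with_call_info(stack, { mid: :join, argc: 2 })
p stack
```

Because the name and the argument count both come from the entry, the same dispatch code handles one-argument and two-argument calls without any method name written into the executor.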
  40. What is "CallInfoEntry"?

    - See the following gist for the code, including the changes made earlier.
      - https://gist.github.com/m3m0r7/226e20c8115caf4a9d43b291861f978b?permalink_comment_id=4686761#gistcomment-4686761
  41. How many instructions are there in the RubyVM instruction set?

    - RubyVM has about 100 instructions. Actually, it has about 200, but half of them are tracing instructions. By the way, the JVM has about 150 instructions (as of SE 13).
    - For simple HelloWorld! output, the FizzBuzz algorithm, or the QuickSort algorithm, it is not necessary to implement everything. It is possible to execute them by implementing a few instructions.
  42. How many instructions are there in the RubyVM instruction set?

    - See below for the instruction set provided by RubyVM.
      - https://github.com/ruby/ruby/blob/ruby_3_3/insns.def
    - See below for an example implementation in PHP.
      - https://github.com/m3m0r7/rubyvm-on-php/tree/0.3.3.0/src/VM/Core/Runtime/Executor/Insn/Processor
  43. How many instructions are there in the RubyVM instruction set?

    - The following 6 instructions were added in Ruby 3.3.0.

      OpCode  Mnemonic
      33      SPLATKW
      45      DEFINEDIVAR
      58      OPT_NEWARRAY_SEND
      135     TRACE_SPLATKW
      147     TRACE_DEFINEDIVAR
      160     TRACE_OPT_NEWARRAY_SEND

    - In Ruby, when instructions are added, the opcodes of the other instructions may shift. For example, OPT_SEND_WITHOUT_BLOCK was 51 in Ruby 3.2, but it is 53 in Ruby 3.3. If you want to support multiple versions of RubyVM, you need to take this behavior into account.
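One way to cope with shifting opcodes is to keep a separate table per Ruby version and select it from the version fields in the YARV header. The sketch below lists only the opcode numbers quoted on this slide; a real table would cover every instruction of each supported version. `opcode_for` is a hypothetical helper.

```ruby
# Look up the opcode of a mnemonic for a given Ruby major/minor version.
def opcode_for(mnemonic, major, minor)
  tables = {
    [3, 2] => { opt_send_without_block: 51 },
    [3, 3] => { opt_send_without_block: 53, splatkw: 33, definedivar: 45, opt_newarray_send: 58 }
  }
  table = tables.fetch([major, minor]) do
    raise ArgumentError, "unsupported version #{major}.#{minor}"
  end
  table.fetch(mnemonic)
end

puts opcode_for(:opt_send_without_block, 3, 2)
puts opcode_for(:opt_send_without_block, 3, 3)
```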
  44. How do we understand RubyVM from Ruby's core code?

    - The easiest way to understand the Ruby implementation is to look at compile.c (https://github.com/ruby/ruby/blob/ruby_3_3/compile.c).
    - In particular, it is a good idea to follow the `ibf_load_*` functions.
    - However, you will have to follow the code to find out in which order the functions are called, so I will draw a flow diagram on the next page to give you a rough idea of how to follow it.
  45. How do we understand RubyVM from Ruby's core code?

    - RubyVM::InstructionSequence.load_from_binary
    - iseqw_s_load_from_binary (iseq.c): the function called when calling the RubyVM::InstructionSequence.load_from_binary method in Ruby's core code
    - rb_iseq_ibf_load (compile.c)
    - ibf_load_setup
    - ibf_load_iseq
    - rb_ibf_load_iseq_complete
    - ibf_load_iseq_each
    - Functions called by ibf_load_iseq_each: ibf_load_code, ibf_load_small_value, ibf_load_object/id, ibf_load_local_table, and so on...
    - Return the read ISeq
  46. How do local variables work?

    - Local variables are one of the difficult parts of implementing RubyVM. I myself have often failed to implement them.
    - When executing a defined method, arguments must be pre-set in a local table at runtime (rather than pushed onto the stack), but the location of the arguments to be set must be calculated, and that is very difficult.
    - Although the source code seems to require an Environment Pointer (EP), I explain how to do without EP in these slides.
  47. [Figure: slot layout for local table size: 4 and call info argc: 2]

    - slot[0] to slot[2]: VM_ENV_DATA_SIZE
    - slot[3], slot[4]: variables in methods (var4 -> slot[3], var3 -> slot[4])
    - slot[5], slot[6]: arguments (var2 -> slot[5], var1 -> slot[6])
  48. How do local variables work?

    - The slot index of "varN" (where N is a natural number, N>0) is determined by "VM_ENV_DATA_SIZE + local table size - N". This means var1 is stored in "slot[6]", as calculated by "VM_ENV_DATA_SIZE(3) + local table size(4) - N(1)".
    - The arguments passed to a method must be associated with slot indexes in the reverse order of the defined arguments.
    - The opcodes for "[gs]etlocal(?:_WC[01]|)" follow this rule for getting values.
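The calculation can be checked with a few lines of code. This sketch follows the worked example on the slide (var1 in slot[6] with VM_ENV_DATA_SIZE 3 and local table size 4), so it subtracts N itself; `slot_index` is a hypothetical helper name.

```ruby
# Slot index of varN for a given local table size.
# VM_ENV_DATA_SIZE is 3 on the slide; the local table size comes from the ISeq.
def slot_index(n, local_table_size, vm_env_data_size = 3)
  vm_env_data_size + local_table_size - n
end

(1..4).each { |n| puts "var#{n} -> slot[#{slot_index(n, 4)}]" }
```

The results match the slot layout figure: var1 and var2 land in slot[6] and slot[5], and var3 and var4 land in slot[4] and slot[3].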
  49. How do local variables work?

    - The reason for starting from the third position seems to be that information necessary for RubyVM is embedded there ("VM_ENV_DATA_INDEX_ME_CREF", "VM_ENV_DATA_INDEX_SPECVAL", "VM_ENV_DATA_INDEX_FLAGS").
    - If there are arguments, it is necessary to prepopulate the slots with values before calling the method. For example, "var1" needs to be placed in "slot[6]" and "var2" in "slot[5]" in advance. Note that for "var3" and "var4", it is not necessary to prepopulate the values, as "setlocal" is called within the internal instruction sequence. (Implementation hint: https://github.com/m3m0r7/rubyvm-on-php/blob/0.3.3.0/src/VM/Core/Runtime/Executor/CallBlockHelper.php#L121)
  50. How do local variables work?

    - In addition, RubyVM's local variables have the concept of "level", where level = 0 represents the current execution context. Each time the level is increased by 1, 2, 3 ..., the local variable of the previous context (corresponding to the VM_ENV_PREV_EP macro in the Ruby core code) is referenced (see the figure on the next page).
    - It is easy to understand if you think of the level as a position relative to the context that is currently being executed.
    - Therefore, it is necessary to be able to trace back through the contexts from the one being executed.
  51. [Figure: levels of local variables]

    - The assignment to var3 happens in the running context. Therefore, level is 0 and "setlocal_WC0" is executed.
    - The var3 is defined in the previous context. Therefore, level is 1 in terms of the running context and "getlocal_WC1" is executed.
    - The var3 is defined in the previous context. Therefore, level is 1 in terms of the running context and "setlocal_WC1" is executed.
    - It is in the running context that var1 and var2 are defined. Therefore, level is 0 and "getlocal_WC0" is executed.
    - When a call is made (internally, when an opcode such as send/opt_send_without_block is called), the execution context changes.
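Tracing back through contexts by level can be sketched with a linked chain of contexts. The `:previous` link is a stand-in for what VM_ENV_PREV_EP provides in the Ruby core code; the hash layout and the helper names `context_at`, `getlocal`, and `setlocal` are assumptions for illustration.

```ruby
# Walk `level` steps back from the running context.
def context_at(context, level)
  level.times { context = context.fetch(:previous) }
  context
end

# Read a local variable slot, `level` contexts above the running one.
def getlocal(context, slot, level)
  context_at(context, level)[:slots][slot]
end

# Write a local variable slot, `level` contexts above the running one.
def setlocal(context, slot, level, value)
  context_at(context, level)[:slots][slot] = value
end

outer = { slots: {}, previous: nil }
inner = { slots: {}, previous: outer }

setlocal(outer, 3, 0, 10)   # like setlocal_WC0 in the outer context
setlocal(inner, 3, 1, 20)   # like setlocal_WC1 from the inner context
puts getlocal(inner, 3, 1)
puts getlocal(outer, 3, 0)
```

Both reads return the same value because level 1 from the inner context and level 0 from the outer context point at the same slot.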
  52.