Hi,



after some work on pcode optimisation I've tried to compare the execution 
speed of various virtual machines (VMs). I took the best-known ones for 
the test:

- Harbour 1.1.0dev (Rev. 10008)
- PHP 5.2.8
- Python 2.6.1
- Ruby 1.8.7
- Java, Sun J2SE 1.5.0 Update 17

These are the latest versions (except Java) downloaded from the 
developers' web pages. I also downloaded the sources (php-5.2.8.tar.bz2, 
Python-2.6.1.tar.bz2, ruby-1.8.7-p72.tar.gz) and looked at the internals. 
Of course, Sun's JavaVM is built from closed source, so I had no 
chance to look inside. I'll use the notation source:path/file when I want
to refer to a file inside these compressed source archives.

Only a few tests have been done. One expression was evaluated in a 
loop and the execution time measured. For example,
  PROC main()
  LOCAL nTime, nI, nSum
    nTime := seconds()
    nSum := 0
    FOR nI := 0 TO 9999999
      nSum += nI * 2 + 17
    NEXT
    ?  "Time:", seconds() - nTime
    ?  "Value:", nSum
  RETURN
I will denote this test "nJ+=nI*2+17" in the table below (each test 
also has a number, used to name the test files; it is shown in the No
column of the table). The last rows of the table show memory usage and
the number of the VM's opcode instructions.
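For comparison, here is how the same loop might look in Python. This is a sketch of my own, not necessarily identical to the test sources in the archive linked below:

```python
import time

def bench(n):
    """Same computation as the Harbour test: accumulate nI * 2 + 17."""
    n_sum = 0
    for n_i in range(n):      # FOR nI := 0 TO n - 1
        n_sum += n_i * 2 + 17
    return n_sum

start = time.time()
value = bench(10000000)       # 10'000'000 iterations, as in the table
print("Time:", time.time() - start)
print("Value:", value)
```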
The tests have been selected to exercise not only virtual machine 
execution speed, but also the optimizer. For example, I've added a test 
for the expression:
    nSum += 17 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + nI * 2
since it can easily be optimized to:
    nSum += 27 + nI * 2
and also 
    nSum += nI * 2 + 17 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1
The latter cannot be optimized to
    nSum += nI * 2 + 27
without the assumption that nI is an integer. If a language has untyped 
variables and allows operator overloading for objects, then nI can 
be an object with overloaded * and + operators, and nI*2+17+1+...+1 
cannot be optimized. This is the case in Harbour.
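Python, which also has untyped variables and operator overloading, can demonstrate this concretely. The Tracked class below is hypothetical, invented just for this illustration; it counts how many overloaded operations each ordering performs. Both orderings produce the same value, but in the constants-first form the constant part collapses before the object is ever touched:

```python
class Tracked:
    """Hypothetical numeric wrapper that counts overloaded operator calls."""
    calls = 0

    def __init__(self, v):
        self.v = v

    def __mul__(self, other):
        Tracked.calls += 1
        return Tracked(self.v * other)

    def __add__(self, other):
        Tracked.calls += 1
        return Tracked(self.v + other)

    def __radd__(self, other):
        Tracked.calls += 1
        return Tracked(other + self.v)

nI = Tracked(5)

Tracked.calls = 0
r1 = 17 + 1 + 1 + 1 + nI * 2   # constant prefix is plain integer math
first = Tracked.calls          # 2 calls: one __mul__, one __radd__

Tracked.calls = 0
r2 = nI * 2 + 17 + 1 + 1 + 1   # every '+' dispatches to the object
second = Tracked.calls         # 5 calls: one __mul__, four __add__

print(r1.v, first)   # 30 2
print(r2.v, second)  # 30 5
```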
    It was also interesting to compare the speed of more complex 
structures than integer operations, so I've added tests for string 
concatenation and arrays. String concatenation has been done in two 
ways: cI:=cI+"a" and cI+="a". The array test has also been done in two 
ways; the reason for selecting the AADD(aI,{aI[nI-1][1]+1}) test will
be described later. Java has typed variables and its arrays must be 
preallocated, so it would not be fair to compare its execution speed 
against the dynamically allocated arrays/lists of the other languages. 
That's why an ArrayList object was used for the dynamic array tests.
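As a sketch, the two string concatenation variants map to Python like this (the function names are mine, not those used in the test archive):

```python
def concat_plus(n):
    """cI := cI + "a" -- the right-hand side builds a brand-new string."""
    c = ""
    for _ in range(n):
        c = c + "a"
    return c

def concat_augmented(n):
    """cI += "a" -- an implementation may grow the string in place."""
    c = ""
    for _ in range(n):
        c += "a"
    return c

# both produce the same string; only the speed can differ
assert concat_plus(1000) == concat_augmented(1000)
```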

All test sources for all languages can be found at 
http://www.dbtopas.lt/hrb/vmcomparison.zip


No Test code                 Harbour     PHP  Python    Ruby     Java Java+JIT
------------------------------------------------------------------------------
   10'000'000 times:
11 nJ+=nI*2+17                  4.61    3.41    6.54   25.53     1.36     0.08
12 nJ+=17+1+...+1+nI*2          4.61    6.10    6.62   45.86     1.36     0.08
13 nJ+=nI*2+17+1+...+1         10.86    5.95   12.96   45.78     1.92     0.17
   200'000 times:
21 cI:=cI+"a"                   6.56    6.70    0.11   25.80 1641.766 1650.437
22 cI+="a"                      0.08    0.07    0.12   25.83 1649.344 1654.735
31 AADD(aI,{nI})                0.46    0.29    0.32    0.20     0.76     0.40
32 AADD(aI,{aI[nI-1][1]+1})     0.53    0.34    0.38    0.39     0.92     0.40
------------------------------------------------------------------------------
11 Mem usage (KB)               2680    4296  162400    3244     6684     6892
   Opcode instruction count      180     150     143     105      202


Harbour
=======
The execution times for the cI:=cI+"a" and cI+="a" tests are very 
different. This could be optimized at the pcode (or expression) level, 
but I'm not sure it can be done: objects can have different behaviour 
for the "+" and "+=" operators. Am I wrong here?

cI+="a" has small execution times because string preallocation is 
used. But I do not understand why we always call realloc() (see 
hb_vmPlus(), hb_itemReSizeString(), and hb_xRefResize()). We do not 
need it if item.asString.allocated > ulSize, do we?

We could increase the performance of Harbour arrays/hashes by 
preallocation, if the memory manager does not already preallocate 
realloc'ed memory blocks.
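What I mean, modeled in Python (a sketch of the C-level logic, not actual Harbour code): keep the allocated capacity next to the length, and call the allocator only when the capacity is exhausted, so appends are amortized O(1):

```python
class GrowString:
    """Model of a preallocated string item: 'allocated' tracks capacity."""
    def __init__(self):
        self.allocated = 16
        self.buf = bytearray(self.allocated)
        self.size = 0

    def append(self, byte):
        if self.size >= self.allocated:
            # only here would C code call realloc(); doubling the
            # capacity makes each append amortized O(1)
            self.allocated *= 2
            self.buf.extend(bytearray(self.allocated - len(self.buf)))
        self.buf[self.size] = byte
        self.size += 1

s = GrowString()
for _ in range(1000):
    s.append(ord("a"))
print(s.size, s.allocated)   # 1000 1024
```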


PHP
===

PHP has 151 opcodes (see source:Zend/zend_vm_opcodes.h), but it seems 
the interpreter does not compile code to opcodes. PHP is an interpreted 
language, and an expression is evaluated right after bison creates the 
expression tree. This eliminates the need to push/pop arguments. It's a 
kind of:
   operation->itemResult = MULT_handler( operation->itemLeft, 
                                         operation->itemRight );
instead of:
   stackPush( MULT_handler( stackPop(), stackPop() ) );

So there is no need to decode an opcode and its parameters, which makes 
execution faster. This can be the reason why PHP is faster at executing 
code (see nJ+=nI*2+17, PHP vs. Harbour). Interpreting the tree does not 
allow the expression to be optimized (or this is simply not implemented 
in PHP), and the whole expression tree is evaluated on every calculation 
of the expression's value (compare nJ+=nI*2+17 and nJ+=17+1+...+1+nI*2, 
PHP vs. Harbour). 
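The two execution models can be sketched in Python; the node layout, opcode names, and handlers below are invented for illustration, not PHP's or Harbour's actual structures:

```python
# PHP-style: walk the expression tree directly, no opcode decoding.
def eval_tree(node, env):
    if isinstance(node, tuple):
        op, left, right = node
        a, b = eval_tree(left, env), eval_tree(right, env)
        return a * b if op == '*' else a + b
    return env[node] if isinstance(node, str) else node

# Harbour-style: decode a linear opcode stream against a value stack.
def eval_stack(code, env):
    stack = []
    for op, arg in code:
        if op == 'PUSHVAR':
            stack.append(env[arg])
        elif op == 'PUSHINT':
            stack.append(arg)
        elif op == 'MULT':
            b = stack.pop(); stack.append(stack.pop() * b)
        elif op == 'PLUS':
            b = stack.pop(); stack.append(stack.pop() + b)
    return stack.pop()

tree = ('+', ('*', 'nI', 2), 17)                      # nI * 2 + 17
code = [('PUSHVAR', 'nI'), ('PUSHINT', 2), ('MULT', None),
        ('PUSHINT', 17), ('PLUS', None)]
env = {'nI': 5}
print(eval_tree(tree, env), eval_stack(code, env))    # 27 27
```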

PHP has opcode handler table, i.e., each opcode is handled using:
   operation->handler( ... )
Instead of Harbour's:
   switch( operation->opcode )
   {
      case MULT:
         MULT_handler( ... );
         break;
   }
Actually, this is optimized by most compilers: a long switch statement 
is replaced by:
   goto handler_table[ operation->opcode ];
This makes Harbour's and PHP's opcode handling equivalent, despite the 
fact that PHP and Harbour have different opcode handler tables. PHP's 
opcode handler table is much larger (3776 handlers, see 
source:Zend/zend_vm_execute.h) and also depends on the operand type. 
For example, a different handler is used if an operand is a constant. 
Compare:
   MULT_handler( itemGetNumeric( itemLeft ), 
                 itemGetNumeric( itemRight ) );
vs.
   MULT_handler( itemGetNumeric( itemLeft ), 
                 itemRight->asNumeric.value );
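A Python sketch of the idea (all names hypothetical): the compiler picks a specialized handler once, when an operand is a literal, so the generic unboxing step disappears from the hot path:

```python
def item_get_numeric(item, env):
    """Generic unboxing: look the operand up at run time."""
    return env[item] if isinstance(item, str) else item

def mult_generic(left, right, env):
    return item_get_numeric(left, env) * item_get_numeric(right, env)

def mult_const_right(left, right_value, env):
    # the right operand was a constant; its value is baked in
    return item_get_numeric(left, env) * right_value

def compile_mult(left, right):
    """Choose the handler variant at compile time, not per execution."""
    if isinstance(right, (int, float)):
        return lambda env: mult_const_right(left, right, env)
    return lambda env: mult_generic(left, right, env)

op = compile_mult('nI', 2)      # specialized: right operand is constant
print(op({'nI': 5}))            # 10
```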

BTW, it seems the multi-threading problems were solved in PHP in a 
similar way as in Harbour: many functions are stuffed with 
TSRMLS_FETCH(). In Harbour it's called HB_STACK_TLS_PRELOAD. 


Python
======

Python has 143 opcodes (see source:Include/opcode.h). "hb_vmExecute()" 
is called PyEval_EvalFrameEx() here (see source:Python/ceval.c). The 
code is very familiar to Harbour developers, but they sometimes use 
"goto" instead of "break":
   for( ;; ) {
      ...
   fast_next_opcode:
      ...
      switch(opcode) {
         case NOP:
            goto fast_next_opcode;
         ...
         case ROT_TWO: /* rotate two aka swap */
            v = TOP();
            w = SECOND();
            SET_TOP(w);
            SET_SECOND(v);
            goto fast_next_opcode;
         ...
      }
      ...
   }

Python does not use a bison parser. It has very short grammar rules (see 
source:Grammar/Grammar, source:Parser/Python.asdl), and it uses its own 
parser (source:Parser/*, source:Modules/parsermodule.c).

Python has the same nJ+=nI*2+17+1+...+1 optimisation problem as Harbour,
but it handles the cI:=cI+"a" case better. I guess array preallocation 
could help Harbour reach the speed of Python lists.


Ruby
====

Ruby's "hb_vmExecute()" is called rb_eval() (see source:eval.c). It 
uses bison for parsing. It has 105 opcodes (see source:node.h), but 
it is not a real virtual machine: it evaluates the expression tree just 
like PHP does.

Ruby was very slow in comparison to the other VMs. The reason could 
be that Ruby does not have built-in numeric operators like + and *; 
everything is a method of the numeric class. The syntax 5+3 is 
equivalent to 5.+(3), where "." is the method call delimiter. Just 
imagine a numeric class overload in Harbour and 5:plus(3).
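Python exposes the same equivalence explicitly, which gives a feel for how much method-call dispatch such a design pays per arithmetic operation:

```python
# every arithmetic operator is sugar for a method call on the operand
assert 5 + 3 == (5).__add__(3)
assert 5 * 3 == (5).__mul__(3)
```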

But I was very surprised when I did the array test AADD(aI,{nI}). 
Ruby shows perfect performance here! Where is the trick?! I had an 
idea that Ruby does not create the arrays in my code, but uses 
lazy array creation. That's why I wrote the last test, 
AADD(aI,{aI[nI-1][1]+1}). Here the value of every element depends on 
another element's value. At the end of the loop I print the value of 
the last element; this should force lazy creation of all array elements 
and give "the true" execution speed. To my great surprise, Ruby's 
performance here remains great! I have no idea why it has great 
performance on array creation and poor performance on simple numeric 
operations! BTW, the array creation test also contains + and -, but 
not *, and I don't believe * is especially difficult for Ruby.


Java
====

I cannot see the Java source of the selected JavaVM implementation, but 
the spec (http://java.sun.com/docs/books/jvms/second_edition/html/
Mnemonics.doc.html) says it has 202 (+3 reserved, -1 unused) opcodes. 
Java can use just-in-time compilation (JIT): Java opcodes can be 
compiled to real CPU code, giving a significant execution speed 
improvement. Tests were executed in both modes: JIT (the default) and 
no-JIT. No-JIT mode is selected using the -Xint execution option. 

Java has typed variables, so its execution speed is faster than that 
of the typeless languages, but the simple string concatenation test 
gives extremely bad performance (27 minutes against 0.08..26 seconds 
for the other languages). If I reduce the loop count from 200'000 to 
20'000, execution time decreases from 27 minutes to 2.6 seconds (!) in 
both JIT and no-JIT modes. The only reason I can imagine is that no 
reference counters are used and variables are released by the garbage 
collector. Memory allocation during the 27-minute run is 10244 KB, 
only 3560 KB more than for the simple integer test.
If no reference counters are used, the garbage collector should be called
  20000 * 20001 / 2 / ( 3560 * 1024 ) = 54.87
i.e., about 54 times for the allocation of 20'000 (augmented) strings. 
If the number of strings (and the string lengths) is increased to 
200'000, the garbage collector should be called
  200000 * 200001 / 2 / ( 3560 * 1024 ) = 5486.33
i.e., about 5486 times. If garbage collection takes most of the 
execution time (i.e., 1650/5486 ~= 0.3 seconds per collection), we could 
get a similar situation. But this cannot be the only reason, because 
then the execution time for the 20'000-iteration loop should be approx. 
0.3*54 = 16 seconds (it is 2.6 seconds).
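The arithmetic above can be checked quickly in Python (3560 KB is the observed heap delta; the triangular number is the total bytes allocated by a string that grows by one character per iteration):

```python
HEAP_DELTA = 3560 * 1024                # extra bytes available between collections

def estimated_collections(loops):
    total_allocated = loops * (loops + 1) // 2   # 1 + 2 + ... + loops bytes
    return total_allocated / HEAP_DELTA

print(round(estimated_collections(20000), 2))    # 54.87
print(round(estimated_collections(200000), 2))   # 5486.33
```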



Best regards,
Mindaugas



P.S. Here is another thing (not related to virtual machine and pcode 
evaluation): pcode tracing allows finding out various things, for 
example tracing the types of variables or the usage of constant values. 
This allows constant folding and other optimisations 
(see http://en.wikipedia.org/wiki/Constant_folding). E.g.:

   FOR nI := 1 TO 50
      QOUT(nI)
   NEXT

can be compiled to:

   int   local1;
   local1 = 1;
 lab00001: 
   hb_xvmPushFuncSymbol( symbols + 1 );
   hb_xvmPushInteger( local1 );
   if( hb_xvmDo( 1 ) ) break;
   local1++;
 lab00002: 
   fValue = local1 > 50;
   if( !fValue )
      goto lab00001;

instead of:

   hb_xvmPushInteger( 1 );
   hb_xvmPushUnRef();
   hb_xvmPopLocal( 1 );
   goto lab00002;
 lab00001: ;
   hb_xvmPushFuncSymbol( symbols + 1 );
   hb_xvmPushLocal( 1 );
   if( hb_xvmDo( 1 ) ) break;
   if( hb_xvmLocalIncPush( 1 ) ) break;
 lab00002: ;
   if( hb_xvmGreaterThenIntIs( 50, &fValue ) ) break;
   if( !fValue )
      goto lab00001;

But perhaps we should wait for type declarations in Harbour instead of
trying to guess variable types from the pcode.

