Tag Archives: Performance

About Erlang/OTP and Multi-core performance in particular – Kenneth Lundin

I attended an awesome talk by Kenneth Lundin about Erlang/OTP at the Erlang Factory in London. The main topic was SMP and its improvements in the latest release(s). That’s exactly one of Erlang’s main selling points: parallelizing computations across many cores without worrying about locks on shared memory.

Some of the issues they’ve been working on:

  1. Erlang now detects the CPU topology automatically at startup.
  2. Multiple run queues
  3. You can lock schedulers to logical CPUs
  4. Improved message passing – reduced lock time

They improved more things, of course, but as far as SMP is concerned these are the most important ones.

  1. Erlang now detects the CPU topology of your system automatically at startup. You may still override this automatic setup using:
    erl +sct L0-3c0-3
    erlang:system_flag(cpu_topology,CpuTopology).
  2. Multiple run queues … what does that mean? We should first take a look at how Erlang does SMP:
    • Erlang without SMP:
      Without SMP support the Erlang VM had one scheduler with one run queue. All jobs were pushed onto that single queue and fetched by the single scheduler.
    • Erlang SMP / before R13
      They started more schedulers, all pulling jobs from a single queue. That sounds more parallel, but it still did not perform as well as desired on many cores, since every scheduler contends for the same shared queue.
    • Erlang SMP R13
      Several schedulers like in the former solution, but each of them has its own run queue. The problem with this approach is that you can end up with some empty and some full queues, because processes have different runtimes. So they built something called migration logic, which controls and balances the different run queues.

    The migration logic:

    • collects statistics about the maximum length of every scheduler’s run queue
    • sets up migration paths
    • takes jobs away from fully loaded schedulers and pushes them onto the queues of schedulers with low load

    Full load or not makes a difference: if not all schedulers are fully loaded, jobs are migrated to schedulers with lower IDs, making some schedulers inactive.

    This makes perfect sense, because the more schedulers and run queues you use, the more migration has to be done. SMP support with many schedulers only pays off if you are really optimizing for many cores; on systems with few cores you will see decreased performance.

  3. Binding schedulers to CPUs is really worth looking at. The more cores your CPU has, the more important it becomes and the more performance you will gain. You can force the Erlang VM to bind schedulers with:
    erl +sbt db
    erlang:system_flag(scheduler_bind_type,default_bind).
    1>erlang:system_info(cpu_topology).
    [{processor,[{core,{logical,0}},
    {core,{logical,3}},
    {core,{logical,1}},
    {core,{logical,2}}]}]
    2> erlang:system_info(scheduler_bindings).
    {unbound,unbound,unbound,unbound}
    fabrizio@machine:~$ erl +sbt db
    1> erlang:system_info(scheduler_bindings).
    {0,1,3,2}

Benchmark - Scheduler Binding - Kenneth Lundin
Source: presentation Kenneth Lundin – Erlang-Factory

You can test and benchmark SMP using the following flags:
fabrizio@machine:~$ erl -smp disable    (default is auto)
fabrizio@machine:~$ erl +S 2:4          (number of schedulers : schedulers online)

With erlang:system_info/1 you can use the following atoms

  # cpu_topology
  # multi_scheduling
  # scheduler_bind_type
    scheduler_bindings
    logical_processors
    multi_scheduling_blockers
    scheduler_id
    schedulers
  # schedulers_online
    smp_support

The ones marked with # can be set using system_flag/2
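
As a sketch of how the two calls fit together: erlang:system_info/1 reads a value, and for the atoms marked with # the matching erlang:system_flag/2 call sets it, returning the previous value. The session below assumes a machine with 4 logical cores, so the actual numbers on your system will differ:

    1> erlang:system_info(schedulers_online).
    4
    2> erlang:system_flag(schedulers_online, 2).
    4
    3> erlang:system_info(schedulers_online).
    2
    4> erlang:system_info(smp_support).
    true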

PHP SPL Data Structures Benchmark

Data structures and collections have been among the most wanted features for the Standard PHP Library (SPL) over the last few years. With PHP 5.3 we’ll finally get a little of what we want, and this is good news. With data structures like stack, queue, heap, or priority queue implemented in C, we expect PHP programming to become somewhat more efficient.

Inspired by this post http://blueparabola.com/blog/spl-deserves-some-reiteration we decided to run our own benchmarks to either verify or disprove the results posted. Our benchmarks were executed on a 64-bit RHEL with PHP 5.3.0beta1. As you may expect, we carefully excluded startup and compilation time and measured only the code blocks in question. We used getrusage() to determine CPU time consumption. A huge number of iterations guaranteed smooth results.

The first structure under consideration was the SplFixedArray. If you only need numerical indices, you can now create an array of fixed size that does not have to grow as more and more items are inserted. Using an SplFixedArray saves you about 10 percent of runtime compared to a plain old PHP array.
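
A minimal sketch of the two variants being compared (assuming PHP 5.3+; the timing code with getrusage() is omitted here, only the API usage is shown):

```php
<?php
$n = 100000;

// Plain PHP array: grows internally as items are appended.
$plain = array();
for ($i = 0; $i < $n; $i++) {
    $plain[$i] = $i;
}

// SplFixedArray: the size is allocated once up front,
// and only numeric indices are allowed.
$fixed = new SplFixedArray($n);
for ($i = 0; $i < $n; $i++) {
    $fixed[$i] = $i;
}

var_dump($fixed[42]);        // int(42)
var_dump($fixed->getSize()); // int(100000)
```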

Next we tried the SplStack and SplQueue. It is usually easy to implement a stack and a queue with plain arrays: using array_push(), array_pop(), and array_shift() is straightforward. It may surprise the average PHP programmer to learn about the runtime behaviour of these functions. The worst is array_shift(), because it reindexes the whole array on every call; the experienced PHP programmer may – for critical code at least – access arrays by index and maintain counters instead, which is much more efficient. Compared to these functions, SplQueue at least comes as something of an upset, but comparable solutions are possible with plain PHP.
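
The two queue/stack variants side by side, as a small sketch (again without the timing harness):

```php
<?php
// Stack: plain array with array_push()/array_pop() vs. SplStack.
$stack = array();
array_push($stack, 'a');
array_push($stack, 'b');
assert(array_pop($stack) === 'b');    // LIFO

$splStack = new SplStack();
$splStack->push('a');
$splStack->push('b');
assert($splStack->pop() === 'b');     // LIFO

// Queue: array_shift() reindexes the remaining elements on every
// call, which is what makes the plain-array queue slow.
$queue = array('a', 'b');
assert(array_shift($queue) === 'a');  // FIFO, but linear per shift

$splQueue = new SplQueue();
$splQueue->enqueue('a');
$splQueue->enqueue('b');
assert($splQueue->dequeue() === 'a'); // FIFO
```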

[Figure: benchmark results, plain arrays vs. SPL structures]

There is a little danger of comparing apples and pears when turning towards SplHeap and SplPriorityQueue. What is the proper representation of a heap implemented using plain old arrays only? A sorted array, ok. But a heap maintains its order on every insert, so do we really have to sort the array on every insert? Who would do that in real life?

It’s the use case that decides the sorting strategy. If you can strictly separate writing to the heap from reading from it, it is sufficient to sort once; that way you beat the SPL. But if you have to mix reads and writes arbitrarily, the SPL beats plain arrays by far. This is shown in the pictures below. For the mixed strategy we read once per 5 inserts, and SplMinHeap scales very well. The same holds for SplMaxHeap and SplPriorityQueue.
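
The mixed read/write pattern can be sketched like this (one read per five inserts, as in the benchmark; the concrete values are illustrative):

```php
<?php
// SplMinHeap keeps its order incrementally on every insert(),
// while a plain array would need a full sort() before every read.
$heap = new SplMinHeap();
$out  = array();

for ($i = 20; $i >= 1; $i--) {
    $heap->insert($i);
    if ($i % 5 === 1) {             // read once per five inserts
        $out[] = $heap->extract();  // always the current minimum
    }
}

// First read happens after inserting 20..16, so the minimum is 16.
print_r($out); // 16, 11, 6, 1
```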

[Figure: SplMinHeap vs. sorted plain array, reads and writes separated]

[Figure: SplMinHeap vs. sorted plain array, mixed reads and writes]

Lessons learned:

  • SPL rules
  • use SPL data structures where they fit your use case; they are efficient and comfortable
  • benchmarking is error prone
  • anyway, always benchmark performance critical code blocks