Senzala digital

EPx professional blog and repository for braindumps

2010/03/12

Optimizing with Cython

Ok, the quest for Python optimization continues. A friend of mine suggested that my stock options simulation in Python could be made much faster by using Cython, since it is a pure matematical simulation.

Cython is very easy to work with. Basically, it compiles a Python script into C, which then is compiled natively. Both steps are done automatically by distutils. Each .pyx module becomes a .so module, which can be imported by Python normally. (The main usage of Cython seems to be writing bindings.)

Actually, Cython understands a restricted dialect, but my script compiled almost without changes. No pain, no gain: the performance boost of compiled version was just 10% or so. Adding more and more of Cython 'dialect', like typed parameters, typed functions, increased the performance gain, but not much, not even near Psyco (which gives 4x for absolutely free).

The Activity Monitor's profiling showed that program spent too much time in Python interpreter, which was not right, since we compiled the thing, right? Time to take a look into generated C code. The main problem was usage of math and random modules in my code. Each time Cython code calls a Python module function, it needs to convert parameters from native to Python objects and reconvert the function returns.

Cython documentation tells how to use the C math functions directly. Also, I replaced seemingly innocent functions like max() by cdef'ed functions. Doing that increased the gain to around 1.4-1.5x.

Replacing random.gauss() was more difficult, since it does not have a C counterpart. I found a Java version of it in the Internet and reimplemented in Cython. This time the effort finally paid off: 23x performance gain over vanilla Python, 6x over Psyco.

(@#d00dz note the sacred number, 23!)

The "modular" approach plus the fact that code needs some Cythonization work, means that the programmer should profile the application and try to convert only the critical functions/classes/modules to Cython.

Since the scripts would need extensive rewriting to work under Cython, and I am happy enough with the current performance provided by Psyco, I am probably not going to use Cython in my simulations. But it is good to know that there are alternatives beyond Psyco (and before resorting to C++), should I need more performance.

Small test with PyPy 1.2

Ok, just some days after the Unladen Swallow experiment, PyPy 1.2 JIT was out. I used the Mac OS X binary offered by PyPy site.

The results were good. Using exactly the same script with the same parameters (that one that simulates stock option operations), PyPy delivers a 3.1x performance gain, which is twice the Unladen's 1.6x, and almost as good as Psyco (3.6x, in my specific case).

The only strange thing that happened: I protected the psyco invocation by a try..except block, so the script wouldn't fail in case of Pysyco absence:


try:
import psyco
psyco.full()
except:
pass


Even though the psyco import fails under PyPy interpreter, the presence of this block causes very strange errors in certain math functions:


File "/Users/epx/agenda/novolivro/gen/blackscholes.py", line 24, in N
return 1 - N(-x)
File "/Users/epx/agenda/novolivro/gen/blackscholes.py", line 24, in N
return 1 - N(-x)
RuntimeError: internal error (stack overflow?)


or, sometimes,


OverflowError: math range error


and sometimes the very same script runs just fine.

Actually, Psyco does not need to be involved; any try..except block which raises exceptions will exercise the error (a simple "print unknown_var" will do). In the other hand, removing the block (or not raising exceptions in it) avoids the problem entirely.

2010/03/07

Small test with Unladen Swallow

I have been working on a new book about stock options, this time with a "dirty hands" approach. As part of the effort, I have been developing a strategy simulation tool in Python language. Given some market and simulation parameters, the application "does" 100 thousand "operations" and spits out average return and volatility:


The above output is just one scenario. Thousands of scenarios must be analyzed. For easier examination of the data mass, it is plotted as a graph by a Python/TkInter script:


Each simulation scenario becomes just six points in the above graph (one for each color). It takes a lot of processing time to simulate all of this, especially because each operation must be simulated in a day-by-day basis, in order to test strategy against "gaps" and "stops" and all this investment stuff.

Soon it was clear that the task would take weeks with standard Python. Then I remembered about Psyco, the Python JIT compiler. Using Psyco gave an immediate 4x performance boost:


Compounding this with concurrent simulation of two scenarios (putting both CPUs into use), simulation time became a matter of days, not weeks. The most attractive trait of Psyco is simplicity of usage -- just two lines of code at the beginning of script:



In the last few weeks, we have heard a lot about Unladen Swallow, Google's approach to add JIT into standard Python interpreter (with great chance of integration into vanilla Python). I have compiled the trunk version, and it worked normally with my script:


Gain was around 1.6x, less than Psyco, but still noteworthy. Note that Psyco is i386-only, while Unladen works for every architecture supported by LLVM. The above Unladen exacutable is truly 64-bit, while Snow Leopard's Python must be configured to run in 32-bit in order to accomodate Psyco.

I could not compila 32-bit version of Unladen (not even an Universal one, which I could lipo 64-bit out). That would allow for a more fair comparison with Psyco. Still, I feel Unladen already provides palpable performance gains and it is in the right path. It is good to see a Python optimization project to go forward. Even better if they manage to remove GIL.

Um pequeno teste com o Unladen Swallow

Tenho trabalhado num novo livro sobre opções, desta vez com um enfoque muito mais prático que o anterior. Uma ferramenta que estou desenvolvendo para fundamentar o texto (além da experiência empírica, com que a crise de 2008 nos "presenteou") é a simulação de estratégias.

O programa simulador é escrito em Python. Dados os parâmetros, o aplicativo simula um grande número de operações (geralmente 100 mil) e devolve rentabilidade. Segue um exemplo de rodada de simulação:


Milhares de combinações de parâmetros são simuladas; a saída acima foi apenas uma delas. Para facilitar a análise da massa de dados, os dados são jogados num gráfico como o exemplo abaixo (feito em Tkinter):


Cada rodada de simulação vira apenas seis pontos nesse gráfico -- um de cada cor. São 5000 combinações, 100.000 operações por combinação. E para cada operação, é preciso simular cada dia, de modo a testar a estratégia com "gaps", "stops", e toda essa baboseira de investimento.

Logo ficou claro que, com o Python padrão, ia demorar semanas para gerar a massa de dados. Aí lembrei do Pysco, aquele compilador JIT. Com ele, consegui um ganho médio de 4 vezes na performance:


Assim, o que ia levar uma semana, passa a levar um dia. (Também tirei partido do fato do computador ter 2 processadores, o que permite o teste simultâneo de 2 combinações, traduzindo num ganho total de 8x). O mais atraente do Pysco é a incrível simplicidade de aplicação num programa convencional. Bastam duas linhas no início:


(Ao fundo, pode-se ver o vídeo do YouTube que eu estava vendo quando tirei os snapshots.)

Também adicionei estas linhas no programa gerador gráfico. O ganho de performance foi menos estelar, mas ainda assim palpável, certamente eu usaria o Unladen se o Psyco não estivesse à mão.

Nos últimos dias, ouvimos falar muito do Unladen Swallow, o projeto do Google para adicionar JIT ao interpretador Python, mas já trabalhando a coisa (tecnica e politicamente) para integração ao Python "baunilha".

Compilei a versão trunk do Unladen, e funcionou normalmente com meu aplicativo:


O ganho foi em torno de 1.6x. Menos que o Psyco, mas ainda assim considerável. Detalhe: o Unladen otimiza em qualquer arquitetura suportada pelo LLVM. O executável Python do exemplo acima é 64 bits real, enquanto o Python padrão do Snow Leopard tem de ser configurado para 32 bits para funcionar com o Psyco, já que este último só "cospe" código i386.

Não consegui compilar o Unladen para 32 bits, o que permitiria uma comparação mais justa (vai que o LLVM para i386 esteja mais otimizado). De qualquer forma, é bom ver um projeto de otimização Python no caminho certo, e melhor ainda se eles conseguirem eliminar o GIL, conforme estão prometendo.

2010/01/30

AGC control in Python modem, version 3 finished

This afternoon I put the last block into Python QAM "modem" version 3: Automatic Gain Control (AGC). The sources can be found at http://epx.com.br/wav, the m3*.py files, as well as gray.py as a dependency. The feature can be tested by e.g. changing volume of qam.wav file using Audacity.

AGC allows the modem to adapt to different power levels. Up to now, V3 modem was not adapting to the "volume" in WAV file; it assumed that TX always used 90% of dynamic range. V2 already had a crude amplitude measurer which only paid attention to the training header sequence. But a true AGC must adapt to changes mid-flight.

This is easier said than done. The main problem is that AGC must figure out the full dynamic range, that is, the maximum amplitude a symbol may have. But there is no guarantee that the "strongest" symbols will actually appear in QAM signal. And, if they appear, how does AGC "know" that they are the ones?

A crude solution would be to send the training sequence at maximum amplitude every few seconds, which is easily detectable as a no-data sound, but this wastes bandwidth and is not 1337 enough :)

The actual solution that real-world modems employ is to work with power level averages. It is trivial to calculate the average amplitude of our constellation. Then, we maintain a moving average of QAM signal reception, and compensate for the difference.

V3 uses a one-second moving average, which is short enough to profit from initial training sequence, and adapts fast enough while actual data is being received. TX was changed to send a training sequence at this average power level instead of the maximum possible power level, so AGC does not have to distinguish between training and data signal parts.

Then we have another problem: will actual signal power average match the theoretical constellation average? In theory, this will happen only if all symbols are equally probable. Real-world modems employ a randomizer on bit stream to ensure that.

I was optimistic that AGC would get a good average without a randomizer, but I was wrong. AGC will not work at all without a very random bit stream! It is incredible how ordinary data, like a text file, is so visibly non-random at such a low level.

So, out of necessity, I have implemented a very simple bit randomizer for V3, with a "4th order polynominal". This is a very pedantic way to say that randomizer takes into account the last 4 bits to figure out the next one, which is calculated with simple XOR operations. To be even more pedantic, I even employed a good programming practice this time: a self-test.


# Randomizer: 1 + x**-3 + x **-4

def randomize(b0):
global b1, b2, b3, b4
bs = b0 ^ (b3 ^ b4)
b1, b2, b3, b4 = bs, b1, b2, b3
return bs

def derandomize(b0):
global b1, b2, b3, b4
bt = b0 ^ (b3 ^ b4)
b1, b2, b3, b4 = b0, b1, b2, b3
return bt

rseq = [ int(random.random() * 2) for x in range(0, 25) ]
b1, b2, b3, b4 = (0, 0, 0, 0)
rseqS = [ randomize(b) for b in rseq ]
b1, b2, b3, b4 = (0, 0, 0, 0)
rseqT = [ derandomize(b) for b in rseqS ]
b1, b2, b3, b4 = (0, 0, 0, 0)

if rseqT != rseq:
print "Randomizer is broken"
print rseq
print rseqS
print rseqT
sys.exit(1)


This randomization scheme was copied directly from Fabio Montoro's "Modem e transmissão de dados" book.

It was funny to realize in practice how important the randomizer is, when you simply don't know the power level at RX side and must figure it out based on data stream itself.

Determining power level using averages, even with the help of randomizer, does not have the same precision as knowing it beforehand. This reduces the number of bits per symbol that modem can handle. V3 had reached 16 bits/symbol, now the highest reliable level is 12. (Turning off AGC in RX side makes it go back to 16.)

AGC was the last thing to implement in V3. The next major milestone, V4, will add some form of convolutional coding to increase noise resistance.