Senzala digital

EPx professional blog and repository for braindumps

2010/03/07

Small test with Unladen Swallow

I have been working on a new book about stock options, this time with a "dirty hands" approach. As part of the effort, I have been developing a strategy simulation tool in Python language. Given some market and simulation parameters, the application "does" 100 thousand "operations" and spits out average return and volatility:


The above output is just one scenario. Thousands of scenarios must be analyzed. For easier examination of the data mass, it is plotted as a graph by a Python/TkInter script:


Each simulation scenario becomes just six points in the above graph (one for each color). It takes a lot of processing time to simulate all of this, especially because each operation must be simulated in a day-by-day basis, in order to test strategy against "gaps" and "stops" and all this investment stuff.

Soon it was clear that the task would take weeks with standard Python. Then I remembered about Psyco, the Python JIT compiler. Using Psyco gave an immediate 4x performance boost:


Compounding this with concurrent simulation of two scenarios (putting both CPUs into use), simulation time became a matter of days, not weeks. The most attractive trait of Psyco is simplicity of usage -- just two lines of code at the beginning of script:



In the last few weeks, we have heard a lot about Unladen Swallow, Google's approach to add JIT into standard Python interpreter (with great chance of integration into vanilla Python). I have compiled the trunk version, and it worked normally with my script:


Gain was around 1.6x, less than Psyco, but still noteworthy. Note that Psyco is i386-only, while Unladen works for every architecture supported by LLVM. The above Unladen exacutable is truly 64-bit, while Snow Leopard's Python must be configured to run in 32-bit in order to accomodate Psyco.

I could not compila 32-bit version of Unladen (not even an Universal one, which I could lipo 64-bit out). That would allow for a more fair comparison with Psyco. Still, I feel Unladen already provides palpable performance gains and it is in the right path. It is good to see a Python optimization project to go forward. Even better if they manage to remove GIL.

Um pequeno teste com o Unladen Swallow

Tenho trabalhado num novo livro sobre opções, desta vez com um enfoque muito mais prático que o anterior. Uma ferramenta que estou desenvolvendo para fundamentar o texto (além da experiência empírica, com que a crise de 2008 nos "presenteou") é a simulação de estratégias.

O programa simulador é escrito em Python. Dados os parâmetros, o aplicativo simula um grande número de operações (geralmente 100 mil) e devolve rentabilidade. Segue um exemplo de rodada de simulação:


Milhares de combinações de parâmetros são simuladas; a saída acima foi apenas uma delas. Para facilitar a análise da massa de dados, os dados são jogados num gráfico como o exemplo abaixo (feito em Tkinter):


Cada rodada de simulação vira apenas seis pontos nesse gráfico -- um de cada cor. São 5000 combinações, 100.000 operações por combinação. E para cada operação, é preciso simular cada dia, de modo a testar a estratégia com "gaps", "stops", e toda essa baboseira de investimento.

Logo ficou claro que, com o Python padrão, ia demorar semanas para gerar a massa de dados. Aí lembrei do Pysco, aquele compilador JIT. Com ele, consegui um ganho médio de 4 vezes na performance:


Assim, o que ia levar uma semana, passa a levar um dia. (Também tirei partido do fato do computador ter 2 processadores, o que permite o teste simultâneo de 2 combinações, traduzindo num ganho total de 8x). O mais atraente do Pysco é a incrível simplicidade de aplicação num programa convencional. Bastam duas linhas no início:


(Ao fundo, pode-se ver o vídeo do YouTube que eu estava vendo quando tirei os snapshots.)

Também adicionei estas linhas no programa gerador gráfico. O ganho de performance foi menos estelar, mas ainda assim palpável, certamente eu usaria o Unladen se o Psyco não estivesse à mão.

Nos últimos dias, ouvimos falar muito do Unladen Swallow, o projeto do Google para adicionar JIT ao interpretador Python, mas já trabalhando a coisa (tecnica e politicamente) para integração ao Python "baunilha".

Compilei a versão trunk do Unladen, e funcionou normalmente com meu aplicativo:


O ganho foi em torno de 1.6x. Menos que o Psyco, mas ainda assim considerável. Detalhe: o Unladen otimiza em qualquer arquitetura suportada pelo LLVM. O executável Python do exemplo acima é 64 bits real, enquanto o Python padrão do Snow Leopard tem de ser configurado para 32 bits para funcionar com o Psyco, já que este último só "cospe" código i386.

Não consegui compilar o Unladen para 32 bits, o que permitiria uma comparação mais justa (vai que o LLVM para i386 esteja mais otimizado). De qualquer forma, é bom ver um projeto de otimização Python no caminho certo, e melhor ainda se eles conseguirem eliminar o GIL, conforme estão prometendo.

2010/01/30

AGC control in Python modem, version 3 finished

This afternoon I put the last block into Python QAM "modem" version 3: Automatic Gain Control (AGC). The sources can be found at http://epx.com.br/wav, the m3*.py files, as well as gray.py as a dependency. The feature can be tested by e.g. changing volume of qam.wav file using Audacity.

AGC allows the modem to adapt to different power levels. Up to now, V3 modem was not adapting to the "volume" in WAV file; it assumed that TX always used 90% of dynamic range. V2 already had a crude amplitude measurer which only paid attention to the training header sequence. But a true AGC must adapt to changes mid-flight.

This is easier said than done. The main problem is that AGC must figure out the full dynamic range, that is, the maximum amplitude a symbol may have. But there is no guarantee that the "strongest" symbols will actually appear in QAM signal. And, if they appear, how does AGC "know" that they are the ones?

A crude solution would be to send the training sequence at maximum amplitude every few seconds, which is easily detectable as a no-data sound, but this wastes bandwidth and is not 1337 enough :)

The actual solution that real-world modems employ is to work with power level averages. It is trivial to calculate the average amplitude of our constellation. Then, we maintain a moving average of QAM signal reception, and compensate for the difference.

V3 uses a one-second moving average, which is short enough to profit from initial training sequence, and adapts fast enough while actual data is being received. TX was changed to send a training sequence at this average power level instead of the maximum possible power level, so AGC does not have to distinguish between training and data signal parts.

Then we have another problem: will actual signal power average match the theoretical constellation average? In theory, this will happen only if all symbols are equally probable. Real-world modems employ a randomizer on bit stream to ensure that.

I was optimistic that AGC would get a good average without a randomizer, but I was wrong. AGC will not work at all without a very random bit stream! It is incredible how ordinary data, like a text file, is so visibly non-random at such a low level.

So, out of necessity, I have implemented a very simple bit randomizer for V3, with a "4th order polynominal". This is a very pedantic way to say that randomizer takes into account the last 4 bits to figure out the next one, which is calculated with simple XOR operations. To be even more pedantic, I even employed a good programming practice this time: a self-test.


# Randomizer: 1 + x**-3 + x **-4

def randomize(b0):
global b1, b2, b3, b4
bs = b0 ^ (b3 ^ b4)
b1, b2, b3, b4 = bs, b1, b2, b3
return bs

def derandomize(b0):
global b1, b2, b3, b4
bt = b0 ^ (b3 ^ b4)
b1, b2, b3, b4 = b0, b1, b2, b3
return bt

rseq = [ int(random.random() * 2) for x in range(0, 25) ]
b1, b2, b3, b4 = (0, 0, 0, 0)
rseqS = [ randomize(b) for b in rseq ]
b1, b2, b3, b4 = (0, 0, 0, 0)
rseqT = [ derandomize(b) for b in rseqS ]
b1, b2, b3, b4 = (0, 0, 0, 0)

if rseqT != rseq:
print "Randomizer is broken"
print rseq
print rseqS
print rseqT
sys.exit(1)


This randomization scheme was copied directly from Fabio Montoro's "Modem e transmissão de dados" book.

It was funny to realize in practice how important the randomizer is, when you simply don't know the power level at RX side and must figure it out based on data stream itself.

Determining power level using averages, even with the help of randomizer, does not have the same precision as knowing it beforehand. This reduces the number of bits per symbol that modem can handle. V3 had reached 16 bits/symbol, now the highest reliable level is 12. (Turning off AGC in RX side makes it go back to 16.)

AGC was the last thing to implement in V3. The next major milestone, V4, will add some form of convolutional coding to increase noise resistance.

2010/01/29

Reaching 48000 bps



The graph above is the waveform for the 38400 bps QAM signal. While I believe in math, my common sense refuses to believe that any information can be recovered from such a chaotic wave.

The "version 3" QAM modem written in Python is a real winner. Today, some of the FIXMEs in code were implemented, and the thing reached 38400 bps at 2400 baud, and 48000 bps at 3000 baud -- both using 16 bits per symbol. Note that 3000 baud is almost twice the rate of 1800Hz carrier!

It "wanted" to work at 18 bits/symbol, but showed too many errors. I am not sure whether 16-bit WAV quantization is the hard limit here, but it seems to be the case. I was quite surprised to see that transmitted message was still recognizable, albeit corrupted.

The changes in v3 which allowed this performance jump are:

1) added code to detect adjacent symbols which are equal in amplitude and phase. While all V3 constellation symbols cause a phase change, sometimes it is very small, especially if we are sending many bits per symbol.

The algorithm is simple: if a symbol change is not detected within the expected interval, the complex signal is sampled anyway. Idiot as it seems, it allows RX to handle up to 16 bits per symbol, while it handled only 7 before.

2) Added a carrier Hz compensation, PLL-like. Any difference between TX and RX carriers cause a "phase drift" in decoded QAM signal -- that is, phase shows a small but continuous rotation in absolutely every sample. This can be distinguished from sudden, time-limited symbol phase rotations, and compensated for.

I discovered that phase drift is more stable (and hence easier to handle) if RX carrier frequency is smaller than TX. If TX carrier is above TX, phase drift oscilates in a difficult way. So, now we make sure that RX carrier is always "low". For symbols with 16 bits each, RX can tolerate up to 15 Hz carrier deviation, which seems very good to me.

3) Constellation uses Gray code now, so adjacent points are guaranteed to have only one different bit. This "softens" errors (even if RX resolves to the wrong symbol, chances are just one bit will be corrupted), and helps whatever error-correction algorithms in use (currently we use none, but I plan to implement convolutional codes in V4).

Towards v4

Actually, the v3 performance numbers are getting meaningless; of course they are possible just because the medium is a noiseless 16-bit WAV file, whose Shannon capacity is almost 700 kbps. (The Shannon limit of a phone line is around 24kbps.)

The next challenge is to make RX work well on noisy and band-limited signals. Rates like 14400bps will be possible only by implementation of convolutional code, I guess.

2010/01/26

QAM modem: 16800 bps!

As I said in the last post, I was thinking about direct (time-domain) analysis of QAM signal at RX part, in order to allow baud rates comparable to carrier frequency. I did not plan to put that into use... but I could not resist doing that.

The thing was working sooner than expected, and performance went beyond my expectations. It can handle 2400 baud with a 1800Hz carrier, like faster modems used to do. It can currently handle 7 bits per symbol, which translates to 16800bps. Going beyond this will demand improvements in symbol detection, which is currently the same as V2 (which is suboptimal).

V3 modem is basically the same as V2, except by decodification of QAM into complex samples, which V2 did in a "classical" fashion (demodulation + FIR filtering), and V3 does by direct analysis of QAM signal. So, I will list only the relevant V3 RX part:


# Proof that we don't need exact carrier frequency
CARRIER += random.random() * 20 - 10
# Proof that we don't need exact carrier phase
ra = random.random() * 2 * math.pi

symbols = []
f = CARRIER * 2 * math.pi / SAMPLE_RATE
carrier_90 = SAMPLE_RATE / CARRIER / 4.0
carrier_90 = int(carrier_90)
phase_error = 2 * math.pi * CARRIER / SAMPLE_RATE / 2.0
recovered = []

for t in range(0, len(qam) - carrier_90 - 2):
# Take derivatives of now and 90 degrees in future
d = (qam[t+1] - qam[t]) / f
d90 = (qam[t + carrier_90 + 1] - qam[t + carrier_90]) / f

st = math.sin(f * t + ra)
ct = math.cos(f * t + ra)

a = d90 * ct + d * st
b = d90 * st - d * ct

try:
tphase = -b / a
phase = math.atan(tphase)
except ZeroDivisionError:
phase = math.pi / 2
if (-b * a) < 0:
phase = -phase

if (abs(d) > abs(d90)):
amplitude = -d / (st * math.cos(phase) + ct * math.sin(phase))
else:
amplitude = -d90 / (-st * math.sin(phase) + ct * math.cos(phase))

if amplitude < 0:
amplitude = -amplitude
phase += math.pi

phase += math.pi * 2
phase %= math.pi * 2

cpx = crect(amplitude, phase)
# print t, int(phase * 180 / math.pi) % 360, cpx

recovered.append(cpx)


This technique is based on the fact that QAM signal is generated by a very simple formula:

s(t) = m . cos(2πft + p)

where "m" and "p" are the polar coordinates of constellation symbol being currently transmitted. The derivative of that is also a simple formula:

s'(t) = 2πf.m.sin(2πft + p)

The QAM decoding problem is: given s(t), find the unknown variables "m" and "p". We need two samples of s'(t) which are near enough to belong to the same symbol, to make an equation system and solve for "m" and "p".

That's exactly what the piece of code listed above does. The samples are taken 90 degrees apart because the equations would simplify very nicely.

I chose the s'(t) equation instead of s(t) because the QAM signal may not be centered at zero; that is, it may have some offset, or DC component. But the difference between each sample and the next is unaffected by the offset. "Difference" in a discrete signal is equivalent to "derivative" in a continous signal.

It is important to measure samples EXACTLY 90 degrees apart, which means that the "distance" between samples must be an integer number. This implies that WAV must be a multiple of 1800 / 4 = 450 Hz. Using a WAV of 43200 Hz works beautifully, while a WAV of 45000 Hz, which is better in theory and matches carrier, will not bear more than 4 bits/symbol. 44100 Hz won't work perfectly even for 3 bits/symbol.

Maybe I have "cheated" by choosing a "perfect" WAV sampling rate, but I suspect that a real-world modem does just that. Anyway, this technique of mine is not intended to be the best in town, it is just an exercise to try to go beyond "classical" QAM. Certainly there are ways to compensate for non-perfect alignment of carrier and sampling rate,