Rust ❤️ Bela – SIMD Sines

Last time I wrote about how to cross-com­pile code for and run it on the Bela us­ing Rust, but there is still a lot of ground to cov­er. I haven't gone in­to any de­tail about what I'm even plan­ning to do with the Bela or how to ac­cess the Bela-spe­cif­ic APIs. In this post, I'll go in­to the for­mer, in­clud­ing some ba­sic fea­si­bil­i­ty checks which will in­volve — you may have guessed it from the ti­tle — SIMD and sines.

As men­tioned last time, the Bela is an ex­ten­sion board that al­lows you to achieve ul­tra-low la­ten­cy while pro­cess­ing sen­sors and au­dio by us­ing a cus­tom­ized Lin­ux with hard re­al-time ex­ten­sions. If you are in­ter­est­ed in what can be done with the Bela in gen­er­al, check out their blog for a bunch of cre­ative elec­tron­ic in­stru­ments, art in­stal­la­tions, and oth­er projects by mak­ers around the world. In any case, I am plan­ning two projects:

  1. To get to know the Bela hard­ware, the Bela APIs, and Bela's per­for­mance char­ac­ter­is­tics, I de­cid­ed a sim­ple(ish) project that doesn't re­quire spe­cial ex­ter­nal hard­ware or oth­er large up front in­vest­ments — aside from the Bela it­self — would be a good idea. Es­sen­tial­ly some­thing that would just as well work as a VST plug­in. A vir­tu­al in­stru­ment. In this par­tic­u­lar case: a vir­tu­al tone wheel or­gan. Ok, maybe not that sim­ple.

    Photo of the complex interior wiring of a Hammond C3 organ
    Hen­ry Zbyszyn­s­ki, CC BY 2.0 <>, via Wiki­me­dia Com­mons

  2. When (if?!) I fin­ish that project, the sec­ond step is to go for some­thing more… phys­i­cal. While I wrote a num­ber of VST plug­ins, most­ly work “in the box,” and just wrote that I plan to cre­ate a vir­tu­al in­stru­ment, I am not an elec­tron­ic mu­si­cian… or even a pass­able key­board play­er. I most­ly play bass (elec­tric and oc­ca­sion­al­ly dou­ble) and some gui­tar.

    I have been fas­ci­nat­ed by the pos­si­bil­i­ties of poly­phon­ic pick­ups for quite some time now. There­fore, project num­ber two will be tak­ing a kit (or cheap) gui­tar or bass (haven't quite de­cid­ed yet) and adding a set of Cy­c­fi Nu poly­phon­ic pick­ups, and pro­cess­ing the poly­phon­ic sig­nals on the Bela. And if I want to get re­al­ly fan­cy, I may add an XY-pad or oth­er weird in­put chan­nel. Maybe a gy­ro?!

    Note to fu­ture self: don't let that “if” be­come re­al­i­ty, you al­ready have the pick­ups! Does any­one want to fund fu­ture self a Crim­son Gui­tars gui­tar build­ing course for ex­tra mo­ti­va­tion? 😅

    Photo of two boxes of Cycfi Nu capsules

Now let's get start­ed. So the plan is to mod­el a clas­sic tone wheel or­gan as close­ly as pos­si­ble — based on specs avail­able around the web — with as low la­ten­cy as pos­si­ble. One im­por­tant thing to note about low la­ten­cy and re­al time: that doesn't mean fast! In fact, re­duc­ing buf­fer sizes (and the Bela can go re­al small, i.e. very low la­ten­cy) in­creas­es the num­ber of con­text switch­es, us­ing up more pro­cess­ing pow­er for things that aren't ac­tu­al au­dio pro­cess­ing. Al­so, the Bea­gle­Bone Black and its 32-bit Cor­tex-A8 run­ning at 1 GHz were in­tro­duced in 2013, so it's not ex­act­ly the fastest sin­gle board com­put­er on the mar­ket (for com­par­i­son: a cur­rent Rasp­ber­ry Pi has a quad-core 64-bit Cor­tex-A72 run­ning at 1.5 GHz!). While work is cur­rent­ly un­der way to get Bela work­ing with the re­cent­ly re­leased (well, in 2019) Bea­gle­Bone AI and its du­al-core Cor­tex-A15 run­ning at 1.5 GHz, we'll have to wait for that a lit­tle longer.

The clas­sic tone wheel or­gan (I'm pret­ty sure al­most ev­ery mu­si­cian knows which one I'm talk­ing about) has a to­tal of 91 ap­prox­i­mate­ly si­nu­soidal tone gen­er­a­tors. Ac­tu­al­ly, the low­est oc­tave has a much weird­er wave­form, but let's as­sume si­nu­soidal for now. Since I want to mod­el as much of the or­gan's re­al-world weird­ness, in­clud­ing cross-talk be­tween tone wheels, fixed phase be­tween cer­tain pairs of wheels, and full polypho­ny, there is no re­al al­ter­na­tive to ac­tu­al­ly gen­er­at­ing 91 sines all the time. Not even com­plex os­cil­la­tors, which can be re­al­ly high qual­i­ty for very lit­tle ef­fort, since I re­al­ly hope to be able to mod­el the phase noise due to im­per­fect cou­pling. And maybe even sim­u­late turn­ing off the mo­tor for the or­gan equiv­a­lent of a wham­my bar dive bomb. Al­so, if we can't gen­er­ate them all the time we can't oc­ca­sion­al­ly gen­er­ate them in time ei­ther.

First things first, now that we know what the project will be, let's re­name our hello project to bela-3. We have to re­name both the fold­er and the pack­age name in Cargo.toml.

For a va­ri­ety of rea­sons, I am a big fan of us­ing an un­signed fixed-point num­ber to rep­re­sent the cur­rent phase. When us­ing an un­signed fixed-point num­ber in Q0.32 for­mat, i.e., us­ing all 32 bits of a u32 as frac­tion­al bits, you can rep­re­sent val­ues in the open range [0,1),[0, 1), at a con­stant res­o­lu­tion of 232,2^{-32}, and get wrap­ping for free. No need for ex­pen­sive float­ing-point re­main­der cal­cu­la­tions (when us­ing [0,2π)[0, 2\pi) , for [0,1)[0, 1) f32.fract will do) or con­di­tion­als to keep the val­ue in range. Ad­di­tion­al­ly, this de­scrip­tion works re­al­ly well with pow­er-of-two waveta­bles: you just use the top nn bits to in­dex a 2n2^n -sam­ple wavetable and use the re­main­ing 32n32 - n bits for in­ter­po­la­tion. Fur­ther­more, it is easy to de­ter­mine in which quad­rant of a sine/co­sine we are, just by look­ing at the top two bits — you'll see why this is rel­e­vant soon.

Since pre­ma­ture op­ti­miza­tion is the root of all evil, to quote Don­ald Knuth, let's first try the di­rect way: just rescal­ing our num­ber in [0,1)[0, 1) (well, tech­ni­cal­ly in [0,11232][0, 1 - \frac{1}{2^{32}}] , rep­re­sent­ed by in­te­gers in [0,2321][0, 2^{32} - 1] ) to [0,2π)[0, 2\pi) by mul­ti­ply­ing it with 2π232\frac{2\pi}{2^{32}} and us­ing the built-in f32::sin func­tion. So let's add a new file src/ to our project:

use std::num::Wrapping;

pub fn sines_std(phasors: &[Wrapping<u32>; 91]) -> [f32; 91] {
    let mut ret = [0.0; 91];
    for (p, r) in phasors.iter().zip(ret.iter_mut()) {
        *r = (std::f32::consts::TAU / (1u64 << 32) as f32 * p.0 as f32)

Yes, Rust sup­ports the full cir­cle con­stant τ=2π\tau = 2\pi . Since we want to mea­sure how much time gen­er­at­ing our 91 sines takes, we need to bench­mark. I will be us­ing the awe­some criterion crate for the bench­mark­ing. How­ev­er, writ­ing a bench­mark re­quires a li­brary crate and so far we on­ly have a bi­na­ry crate. But our re­cent­ly-re­named bela_3 isn't ac­tu­al­ly a crate. It's a pack­age that con­tains a bi­na­ry crate. So we can sim­ply add a li­brary crate to our ex­ist­ing pack­age that so far on­ly con­tains a bi­na­ry crate. In fact, do­ing so and plac­ing most of the im­ple­men­ta­tion in­side the same pack­age and us­ing the li­brary crate with­in the bi­na­ry crate isn't un­com­mon. How to we do this? Just add src/ and while we're at it — in­tro­duce and use our sines mod­ule, oth­er­wise wouldn't be com­piled.

// introduce sines privately and use it publically to avoid having to
// write bela_3::sines::XY everywhere
mod sines;
pub use sines::*;

Now we need to add the criterion crate to our [dev-dependencies] in Cargo.toml and de­clare our bench­mark:

criterion = { version = "0.3", features = [ "html_reports" ] }

name = "sine_benchmark"
harness = false

Fi­nal­ly, we de­fine our bench­mark in benches/

use bela_3::*;
use criterion::{
    black_box, criterion_group, criterion_main, Criterion,
use std::num::Wrapping;

fn criterion_benchmark(c: &mut Criterion) {
    // create a set of phasors as Wrapping<u32>, since Rust u32 are
    // generally not wrapping (overflow checking is only enabled in
    // debug mode by default) in Rust
    let phasors = {
        let mut phasors = [Wrapping(0u32); 91];
        let mut x = Wrapping(0u32);
        for phasor in phasors.iter_mut() {
            *phasor = x;
            x += Wrapping(0x3fff_ffff);

    // create a benchmark group as we want to compare variants of a
    // particular function
    let mut group = c.benchmark_group("sines");
    group.bench_function("std", |b| {
        b.iter(|| sines_std(black_box(&phasors)))

criterion_group!(benches, criterion_benchmark);

Now let's run our bench­mark us­ing cargo bench --bench sine_benchmark -- --noplot, since cargo bench will al­so run all tests — in­clud­ing the au­to-gen­er­at­ed emp­ty unit- and doctest ex­e­cuta­bles — which don't sup­port the --noplot op­tion. And we def­i­nite­ly want to pass the --noplot op­tion, as ren­der­ing the plots and cor­re­spond­ing HTML re­ports will take for­ev­er and a day on the Bela. Didn't I ex­plic­it­ly en­able the html_reports fea­ture in our Cargo.toml? Well, yes, but that was on­ly to si­lence a dep­re­ca­tion warn­ing, which pops up even if we pass --noplot.

sines/std               time:   [55.405 us 55.494 us 55.609 us]
Found 22 outliers among 100 measurements (22.00%)
  1 (1.00%) high mild
  21 (21.00%) high severe

55.5 μs? Quick back of the en­ve­lope cal­cu­la­tion…

144100Hz=22.7μs\frac{1}{44\,100\,\mathrm{Hz}} = 22.7\,\mathrm{\mu s}

Leav­ing us with… −32.8 μs to do ev­ery­thing else when com­put­ing one out­put sam­ple. Well, that's no good. I think we would have to vi­o­late causal­i­ty to get this ver­sion to work, so let's op­ti­mize a lit­tle. Since we don't need our sines to be pre­cise with­in a few ULPS, those tone wheels aren't per­fect ei­ther, a va­ri­ety of ap­prox­i­ma­tions are pos­si­ble. Waveta­bles come to mind, how­ev­er those don't vec­tor­ize too well (gath­er/scat­ter op­er­a­tions for in­dex­ing/in­di­rec­tion are on­ly avail­able in some high-end in­struc­tion sets) and mem­o­ry clocks haven't been in­creas­ing as quick­ly as CPU clocks for many years. The lat­ter isn't too bad if the ta­bles stay in L1 cache, but since we want to keep la­ten­cy low — in­cur­ring many con­text switch­es — fre­quent cache evic­tions are very like­ly. So in­stead I chose a sim­ple, cu­bic ap­prox­i­ma­tion with­in the quad­rant, with the top two bits of our phase be­ing used to flip the poly­no­mi­al / change the sign. I'm us­ing the same ap­proach as in this ar­ti­cle, ex­cept that I on­ly have a fixed-point phase, not a fixed-point re­sult.

fn sin_quadrant_cubic(phasor: Wrapping<u32>) -> f32 {
    let x = phasor.0;
    // set q13 to u32::MAX if we are in the second or fourth quadrant,
    // otherwise set it to 0, using one shift to move the corresponding
    // bit to the most significant bit and a sign-extending shift to
    // set the other bits
    let q13 = (x.wrapping_shl(1) as i32 >> 31) as u32;
    // flip xm, depending on the quadrant we're in and discard the sign
    // bit which we'll be adding back later
    let xm = ((x.wrapping_neg() & q13) | (x & !q13)) & 0x7fff_ffff;
    // convert our Q1.30 value (we discarded the topmost bit) to an f32
    // in [0, 1]
    let xmf = xm as f32 / 0x4000_0000 as f32;
    // use the top bit of the input phasor as the sign of our cubic
    // polynomial approximation
    let sign = f32::from_bits(x);
    f32::copysign(1.5 * xmf - 0.5 * xmf.powi(3), sign)

pub fn sines_quadrant_cubic(
    phasors: &[Wrapping<u32>; 91],
) -> [f32; 91] {
    let mut ret = [0.0; 91];
    for (p, r) in phasors.iter().zip(ret.iter_mut()) {
        *r = sin_quadrant_cubic(*p);

While oth­er cu­bic ap­prox­i­ma­tions have low­er er­ror, this low­er er­ror comes at the cost of dis­con­ti­nu­ities at quad­rant edges, cre­at­ing some very un­pleas­ant har­mon­ics and alias­ing. So in this case, the ab­so­lute er­ror in val­ue may be larg­er, but the (per­ceived) har­mon­ic er­ror is much small­er. So then we can add a bench­mark… “Wait a sec­ond,” I hear you scream at your mon­i­tor “, aren't you about to skip a crit­i­cal step there?” And as a mat­ter of fact, I was! While this isn't pre­ma­ture op­ti­miza­tion, as the naïve ver­sion is def­i­nite­ly too slow, we shouldn't start look­ing at per­for­mance be­fore check­ing if all that bit twid­dling is even cor­rect! There's a good rea­son that cargo bench al­so runs all tests. So let's take care of that over­sight and add some tests, us­ing Rust's won­der­ful built-in test­ing fea­tures (and the approx crate for ap­prox­i­mate float­ing-point as­ser­tions, which I added to my [dev-dependencies]), by adding the fol­low­ing to src/

mod tests {
    use approx::assert_abs_diff_eq;
    use std::num::Wrapping;

    fn reference_phasors() -> [Wrapping<u32>; 91] {
        let scale = (1u64 << 32) as f64;
        let scale = |f| (scale * f) as u32;
        let mut phasors = [Wrapping(0); 91];
        phasors[0] = Wrapping(scale(0.0)); // 0
        phasors[1] = Wrapping(scale(0.125)); // τ/8
        phasors[2] = Wrapping(scale(0.25)); // τ/4
        phasors[3] = Wrapping(scale(0.375)); // 3τ/8
        phasors[4] = Wrapping(scale(0.5)); // τ/2 = π
        phasors[5] = Wrapping(scale(0.625)); // 5τ/8
        phasors[6] = Wrapping(scale(0.75)); // 3τ/4
        phasors[7] = Wrapping(scale(0.875)); // 7τ/8
        phasors[8] = Wrapping(u32::MAX); // ~τ

    fn check_reference_results(sines: &[f32; 91]) {
        use std::f32::consts::TAU;
        // As we are looking at approximations of sines, allow for an
        // error of 0.02
            f32::sin(TAU * 0.0),
            epsilon = 0.02
            f32::sin(TAU * 0.125),
            epsilon = 0.02
            f32::sin(TAU * 0.25),
            epsilon = 0.02
            f32::sin(TAU * 0.375),
            epsilon = 0.02
            f32::sin(TAU * 0.5),
            epsilon = 0.02
            f32::sin(TAU * 0.625),
            epsilon = 0.02
            f32::sin(TAU * 0.75),
            epsilon = 0.02
            f32::sin(TAU * 0.875),
            epsilon = 0.02
            f32::sin(TAU * 1.0),
            epsilon = 0.02

    fn test_sines_std() {
        use crate::sines_std;
        let phasors = reference_phasors();
        let sines = sines_std(&phasors);

    fn test_quadrant_cubic() {
        use crate::sines_quadrant_cubic;
        let phasors = reference_phasors();
        let sines = sines_quadrant_cubic(&phasors);

Drum­roll (and cargo test), please… Two tests passed! Awe­some, but al­ways bet­ter to be safe than sor­ry, es­pe­cial­ly with bit-twid­dling hacks like these. And truth be told, there was a mis­take in an ear­li­er draft: I dis­card­ed the top two bits (i.e., the quad­rant) be­fore the wrap­ping nega­tion, while I should have dropped on­ly the top­most bit af­ter wrap­ping nega­tion, so xmf wasn't in [0,1][0, 1] any­more. Hav­ing made sure that ev­ery­thing works, we can add an ad­di­tion­al bench­mark (be­tween our cur­rent one and group.finish()) to

    group.bench_function("quadrant_cubic", |b| {
        b.iter(|| sines_quadrant_cubic(black_box(&phasors)))

And run it.

sines/quadrant_cubic    time:   [8.1137 us 8.1412 us 8.1934 us]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

Much bet­ter, but still more than a third of our time bud­get. How do we find out if the code the com­pil­er pro­duces is any­where near op­ti­mal? Or if it us­es the NEON in­struc­tion set we en­abled at the end of the last post? The most di­rect way is by look­ing at the as­sem­bly out­put, which we can get at us­ing objdump, or in this case, as we're look­ing at a cross-com­piled ex­e­cutable, our cross-com­pil­er's objdump:

arm-linux-gnueabihf-objdump -dr target\armv7-unknown-linux-gnueabihf\release\deps\sine_benchmark-46d3108f9d479ee1 > disassembly.asm

The ex­act name of the ex­e­cutable, in par­tic­u­lar the hash at the end, will vary, so make sure to dump the ex­e­cutable list­ed while run­ning cargo bench. If you're on Lin­ux or Mac, you'll have to use for­ward slash­es too. In ei­ther case, though, make ab­so­lute­ly sure to pipe the out­put to a file, as it can get very, very large. Then search in the out­put file for the name of our func­tion sines_quadrant_cubic. There will be ad­di­tion­al stuff at the be­gin­ning and the end of the sym­bol's name due to name man­gling.

001b8618 <_ZN6bela_35sines20sines_quadrant_cubic17hf39864046bb5dcf6E>:
  1b8618:   e92d4070    push    {r4, r5, r6, lr}
  1b861c:   e1a04001    mov r4, r1
  1b8620:   e3a01000    mov r1, #0
  1b8624:   e3a02f5b    mov r2, #364    ; 0x16c
  1b8628:   e1a05000    mov r5, r0
  1b862c:   e3a06000    mov r6, #0
  1b8630:   ebf94991    bl  ac7c <memset@plt>
  1b8634:   eeb71a08    vmov.f32    s2, #120    ; 0x3fc00000  1.5
  1b8638:   ed9f0a16    vldr    s0, [pc, #88]   ; 1b8698 <_ZN6bela_35sines20sines_quadrant_cubic17hf39864046bb5dcf6E+0x80>
  1b863c:   eebe2a00    vmov.f32    s4, #224    ; 0xbf000000 -0.5
  1b8640:   f3c00610    vmov.i32    d16, #-2147483648   ; 0x80000000
  1b8644:   e7940006    ldr r0, [r4, r6]
  1b8648:   f26011b0    vorr    d17, d16, d16
  1b864c:   e3100101    tst r0, #1073741824 ; 0x40000000
  1b8650:   ee030a10    vmov    s6, r0
  1b8654:   12600000    rsbne   r0, r0, #0
  1b8658:   e3c00102    bic r0, r0, #-2147483648    ; 0x80000000
  1b865c:   ee040a10    vmov    s8, r0
  1b8660:   e0850006    add r0, r5, r6
  1b8664:   e2866004    add r6, r6, #4
  1b8668:   eeb84a44    vcvt.f32.u32    s8, s8
  1b866c:   e3560f5b    cmp r6, #364    ; 0x16c
  1b8670:   ee244a00    vmul.f32    s8, s8, s0
  1b8674:   ee245a04    vmul.f32    s10, s8, s8
  1b8678:   ee245a05    vmul.f32    s10, s8, s10
  1b867c:   ee244a01    vmul.f32    s8, s8, s2
  1b8680:   ee255a02    vmul.f32    s10, s10, s4
  1b8684:   ee344a05    vadd.f32    s8, s8, s10
  1b8688:   f3531114    vbsl    d17, d3, d4
  1b868c:   f4c0183f    vst1.32 {d17[0]}, [r0 :32]
  1b8690:   1affffeb    bne 1b8644 <_ZN6bela_35sines20sines_quadrant_cubic17hf39864046bb5dcf6E+0x2c>
  1b8694:   e8bd8070    pop {r4, r5, r6, pc}
  1b8698:   30800000    .word   0x30800000

Looks ok, aside from that an­noy­ing memset which the com­pil­er re­al­ly should be able to avoid, but it seems no au­to-vec­tor­iza­tion was per­formed (s6, s8, etc. are sin­gle-el­e­ment reg­is­ters). Just for fun, let's see what hap­pens when run­ning it na­tive­ly on our x86_64 desk­top ma­chine.

sines/std               time:   [446.49 ns 447.54 ns 448.68 ns]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
sines/quadrant_cubic    time:   [98.704 ns 99.057 ns 99.386 ns]
Found 10 outliers among 100 measurements (10.00%)
  10 (10.00%) high mild

That is… a bit faster. But while my desk­top can per­form the DSP much more quick­ly, there is ab­so­lute­ly no chance to get the buf­fer size down to 2 sam­ples (the low­est set­ting is 32 sam­ples on my RME HD­SPe AIO, which is a high-qual­i­ty pro­fes­sion­al au­dio in­ter­face), which the Bela can, due to its RTOS ker­nel. And the code got au­to-vec­tor­ized too, as ev­i­denced by the mulps, addps, and oth­er packed SSE2 in­struc­tions (you can al­so see the three fi­nal sam­ples be­ing com­put­ed us­ing scalar in­struc­tions, as 91 isn't di­vis­i­ble by four — the point­less memset re­mains, how­ev­er):

00000000001e7890 <_ZN6bela_35sines20sines_quadrant_cubic17hf4a7fc3f0238a853E>:
  1e7890:   41 57                   push   %r15
  1e7892:   41 56                   push   %r14
  1e7894:   53                      push   %rbx
  1e7895:   49 89 f7                mov    %rsi,%r15
  1e7898:   49 89 fe                mov    %rdi,%r14
  1e789b:   31 db                   xor    %ebx,%ebx
  1e789d:   ba 6c 01 00 00          mov    $0x16c,%edx
  1e78a2:   31 f6                   xor    %esi,%esi
  1e78a4:   ff 15 06 03 0e 00       callq  *0xe0306(%rip)        # 2c7bb0 <memset@GLIBC_2.2.5>
  1e78aa:   66 0f 6f 05 9e de 04    movdqa 0x4de9e(%rip),%xmm0        # 235750 <anon.be0cd5bf402d6cd1684266fd1b25d153.1.llvm.9828885523271912942+0xa3>
  1e78b1:   00
  1e78b2:   0f 28 0d f7 b0 06 00    movaps 0x6b0f7(%rip),%xmm1        # 2529b0 <anon.eab9b9e9888d57a198cd6903a1003a27.17.llvm.10396227554844612960+0x119>
  1e78b9:   0f 28 15 00 b1 06 00    movaps 0x6b100(%rip),%xmm2        # 2529c0 <anon.eab9b9e9888d57a198cd6903a1003a27.17.llvm.10396227554844612960+0x129>
  1e78c0:   0f 28 1d 09 b1 06 00    movaps 0x6b109(%rip),%xmm3        # 2529d0 <anon.eab9b9e9888d57a198cd6903a1003a27.17.llvm.10396227554844612960+0x139>
  1e78c7:   66 0f 6f 25 11 b1 06    movdqa 0x6b111(%rip),%xmm4        # 2529e0 <anon.eab9b9e9888d57a198cd6903a1003a27.17.llvm.10396227554844612960+0x149>
  1e78ce:   00
  1e78cf:   90                      nop
  1e78d0:   f3 41 0f 6f 2c 9f       movdqu (%r15,%rbx,4),%xmm5
  1e78d6:   66 0f 6f f5             movdqa %xmm5,%xmm6
  1e78da:   66 0f 72 f6 01          pslld  $0x1,%xmm6
  1e78df:   66 0f 72 e6 1f          psrad  $0x1f,%xmm6
  1e78e4:   66 0f 6f fd             movdqa %xmm5,%xmm7
  1e78e8:   66 0f ef fe             pxor   %xmm6,%xmm7
  1e78ec:   66 0f fa fe             psubd  %xmm6,%xmm7
  1e78f0:   66 0f db f8             pand   %xmm0,%xmm7
  1e78f4:   0f 5b f7                cvtdq2ps %xmm7,%xmm6
  1e78f7:   0f 59 f1                mulps  %xmm1,%xmm6
  1e78fa:   0f 28 fe                movaps %xmm6,%xmm7
  1e78fd:   0f 59 fe                mulps  %xmm6,%xmm7
  1e7900:   0f 59 fe                mulps  %xmm6,%xmm7
  1e7903:   0f 59 f2                mulps  %xmm2,%xmm6
  1e7906:   0f 59 fb                mulps  %xmm3,%xmm7
  1e7909:   0f 58 fe                addps  %xmm6,%xmm7
  1e790c:   0f 54 f8                andps  %xmm0,%xmm7
  1e790f:   66 0f db ec             pand   %xmm4,%xmm5
  1e7913:   66 0f eb ef             por    %xmm7,%xmm5
  1e7917:   f3 41 0f 7f 2c 9e       movdqu %xmm5,(%r14,%rbx,4)
  1e791d:   48 83 c3 04             add    $0x4,%rbx
  1e7921:   48 83 fb 58             cmp    $0x58,%rbx
  1e7925:   75 a9                   jne    1e78d0 <_ZN6bela_35sines20sines_quadrant_cubic17hf4a7fc3f0238a853E+0x40>
  1e7927:   41 8b 87 60 01 00 00    mov    0x160(%r15),%eax
  1e792e:   89 c1                   mov    %eax,%ecx
  1e7930:   f7 d9                   neg    %ecx
  1e7932:   a9 00 00 00 40          test   $0x40000000,%eax
  1e7937:   0f 44 c8                cmove  %eax,%ecx
  1e793a:   81 e1 ff ff ff 7f       and    $0x7fffffff,%ecx
  1e7940:   0f 57 db                xorps  %xmm3,%xmm3
  1e7943:   f3 0f 2a d9             cvtsi2ss %ecx,%xmm3
  1e7947:   f3 0f 10 15 a9 b0 06    movss  0x6b0a9(%rip),%xmm2        # 2529f8 <anon.eab9b9e9888d57a198cd6903a1003a27.17.llvm.10396227554844612960+0x161>
  1e794e:   00
  1e794f:   f3 0f 59 da             mulss  %xmm2,%xmm3
  1e7953:   66 0f 6e e0             movd   %eax,%xmm4
  1e7957:   f3 0f 10 0d 9d b0 06    movss  0x6b09d(%rip),%xmm1        # 2529fc <anon.eab9b9e9888d57a198cd6903a1003a27.17.llvm.10396227554844612960+0x165>
  1e795e:   00
  1e795f:   0f 28 eb                movaps %xmm3,%xmm5
  1e7962:   f3 0f 59 eb             mulss  %xmm3,%xmm5
  1e7966:   f3 0f 59 eb             mulss  %xmm3,%xmm5
  1e796a:   0f 28 f3                movaps %xmm3,%xmm6
  1e796d:   f3 0f 59 f1             mulss  %xmm1,%xmm6
  1e7971:   f3 0f 10 1d 87 b0 06    movss  0x6b087(%rip),%xmm3        # 252a00 <anon.eab9b9e9888d57a198cd6903a1003a27.17.llvm.10396227554844612960+0x169>
  1e7978:   00
  1e7979:   f3 0f 59 eb             mulss  %xmm3,%xmm5
  1e797d:   f3 0f 5c f5             subss  %xmm5,%xmm6
  1e7981:   0f 54 f0                andps  %xmm0,%xmm6
  1e7984:   66 0f db 25 54 b0 06    pand   0x6b054(%rip),%xmm4        # 2529e0 <anon.eab9b9e9888d57a198cd6903a1003a27.17.llvm.10396227554844612960+0x149>
  1e798b:   00
  1e798c:   66 0f eb e6             por    %xmm6,%xmm4
  1e7990:   41 8b 87 64 01 00 00    mov    0x164(%r15),%eax
  1e7997:   89 c1                   mov    %eax,%ecx
  1e7999:   f7 d9                   neg    %ecx
  1e799b:   a9 00 00 00 40          test   $0x40000000,%eax
  1e79a0:   0f 44 c8                cmove  %eax,%ecx
  1e79a3:   81 e1 ff ff ff 7f       and    $0x7fffffff,%ecx
  1e79a9:   0f 57 ed                xorps  %xmm5,%xmm5
  1e79ac:   f3 0f 2a e9             cvtsi2ss %ecx,%xmm5
  1e79b0:   66 41 0f 7e a6 60 01    movd   %xmm4,0x160(%r14)
  1e79b7:   00 00
  1e79b9:   f3 0f 59 ea             mulss  %xmm2,%xmm5
  1e79bd:   66 0f 6e e0             movd   %eax,%xmm4
  1e79c1:   0f 28 f5                movaps %xmm5,%xmm6
  1e79c4:   f3 0f 59 f5             mulss  %xmm5,%xmm6
  1e79c8:   f3 0f 59 f5             mulss  %xmm5,%xmm6
  1e79cc:   f3 0f 59 e9             mulss  %xmm1,%xmm5
  1e79d0:   f3 0f 59 f3             mulss  %xmm3,%xmm6
  1e79d4:   f3 0f 5c ee             subss  %xmm6,%xmm5
  1e79d8:   0f 54 e8                andps  %xmm0,%xmm5
  1e79db:   66 0f 6f f0             movdqa %xmm0,%xmm6
  1e79df:   66 0f df f4             pandn  %xmm4,%xmm6
  1e79e3:   66 0f eb f5             por    %xmm5,%xmm6
  1e79e7:   41 8b 87 68 01 00 00    mov    0x168(%r15),%eax
  1e79ee:   89 c1                   mov    %eax,%ecx
  1e79f0:   f7 d9                   neg    %ecx
  1e79f2:   a9 00 00 00 40          test   $0x40000000,%eax
  1e79f7:   0f 44 c8                cmove  %eax,%ecx
  1e79fa:   81 e1 ff ff ff 7f       and    $0x7fffffff,%ecx
  1e7a00:   0f 57 e4                xorps  %xmm4,%xmm4
  1e7a03:   f3 0f 2a e1             cvtsi2ss %ecx,%xmm4
  1e7a07:   66 41 0f 7e b6 64 01    movd   %xmm6,0x164(%r14)
  1e7a0e:   00 00
  1e7a10:   f3 0f 59 e2             mulss  %xmm2,%xmm4
  1e7a14:   66 0f 6e d0             movd   %eax,%xmm2
  1e7a18:   f3 0f 59 cc             mulss  %xmm4,%xmm1
  1e7a1c:   0f 28 ec                movaps %xmm4,%xmm5
  1e7a1f:   f3 0f 59 ec             mulss  %xmm4,%xmm5
  1e7a23:   f3 0f 59 ec             mulss  %xmm4,%xmm5
  1e7a27:   f3 0f 59 eb             mulss  %xmm3,%xmm5
  1e7a2b:   f3 0f 5c cd             subss  %xmm5,%xmm1
  1e7a2f:   0f 54 c8                andps  %xmm0,%xmm1
  1e7a32:   66 0f df c2             pandn  %xmm2,%xmm0
  1e7a36:   66 0f eb c1             por    %xmm1,%xmm0
  1e7a3a:   66 41 0f 7e 86 68 01    movd   %xmm0,0x168(%r14)
  1e7a41:   00 00
  1e7a43:   4c 89 f0                mov    %r14,%rax
  1e7a46:   5b                      pop    %rbx
  1e7a47:   41 5e                   pop    %r14
  1e7a49:   41 5f                   pop    %r15
  1e7a4b:   c3                      retq
  1e7a4c:   0f 1f 40 00             nopl   0x0(%rax)

So the lack of au­to­mat­ic vec­tor­iza­tion is not due to our code not be­ing prop­er­ly vec­tor­iz­able, but more like­ly due to a lack of prop­er cost heuris­tics for NEON-in­struc­tions in the LLVM com­pil­er back­end (this is an as­sump­tion, if you have bet­ter in­for­ma­tion, feel free to con­tact me!).

Since the com­pil­er's au­to-vec­tor­iza­tion won't help us, is there any­thing we can do? Not on sta­ble Rust, as the NEON in­trin­sics un­der core::arch::arm are gat­ed be­hind the night­ly-on­ly ex­per­i­men­tal fea­ture stdsimd. I guess we'll have to brave that dark, left-hand path that is night­ly Rust. So first up, let's in­stall night­ly Rust!

rustup toolchain install nightly
rustup target add armv7-unknown-linux-gnueabihf --toolchain nightly

You may want to/have to add --allow-downgrade and de­fine a de­sired --profile and re­quired --components to the toolchain install, as de­scribed in the doc­u­men­ta­tion. To use the night­ly tool chain, you can pass +nightly as a first ar­gu­ment to all your cargo calls or — and this is the ap­proach I pre­fer — set a fold­er over­ride. So let's nav­i­gate to our pack­age fold­er and set an over­ride:

rustup override set nightly

At this point, there are two pos­si­ble ap­proach­es to achiev­ing ex­plic­it SIMD ac­cel­er­a­tion:

  1. Tar­get-spe­cif­ic in­trin­sics
  2. Rusts' (work-in-progress) por­ta­ble SIMD API portable-simd
  3. portable-simd's pre­de­ces­sor, packed_simd

In the hope of be­ing able to write my code in such a way that it works with a va­ri­ety of plat­forms, I de­cid­ed to try the sec­ond op­tion, but quick­ly found out that many nec­es­sary fea­tures aren't avail­able yet, so end­ed up go­ing with the third op­tion, which seems to be more com­plete at the time of writ­ing. To use packed_simd, add

packed_simd = { version = "0.3.6", package = "packed_simd_2", features = [ "into_bits" ] }

to your Cargo.toml. Now we can fair­ly di­rect­ly trans­late the (fixed) sin_quadrant_cubic func­tion to a SIMD ver­sion and add an ex­plic­it­ly split loop ver­sion of sines_quadrant_cubic:

use packed_simd::*;
fn sin_quadrant_cubic_x4(phasor: u32x4) -> f32x4 {
    let x = phasor;
    // no explicitly wrapping operations on u32x4, so we use a regular
    // left shift
    let q13 = u32x4::from_cast(i32x4::from_cast(x << 1) >> 31);
    // no wrapping negation on u32x4, so cast to i32x4 for negation
    let xm = ((u32x4::from_cast(-i32x4::from_cast(x)) & q13)
        | (x & !q13))
        & 0x7fff_ffff;
    // reciprocal multiplication is exact for powers of two
    let xmf = f32x4::from_cast(xm) * (0x4000_0000 as f32).recip();
    // no f32x4::powi
    let ret = 1.5 * xmf - 0.5 * (xmf * xmf * xmf);
    // no f32x4::copysign (one of the few things portable-simd has)
    ((u32x4::from_bits(ret) & 0x7fff_ffff) | (x & 0x8000_0000))

pub fn sines_quadrant_cubic_simd(
    phasors: &[Wrapping<u32>; 91],
) -> [f32; 91] {
    let mut ret = [0.0; 91];
    let mut phasor_chunks = phasors.chunks_exact(4);
    let mut ret_chunks = ret.chunks_exact_mut(4);
    for (p, r) in (&mut phasor_chunks).zip(&mut ret_chunks) {
        // transmute p from a slice of Wrapping<u32> to a slice of u32.
        // this is safe, as Wrapping is repr(transparent)
        use std::mem::transmute;
        let p = unsafe { transmute::<_, &[u32]>(p) };
        let p = u32x4::from_slice_unaligned(p);
    for (p, r) in phasor_chunks
        *r = sin_quadrant_cubic(*p);

Af­ter adding a cor­re­spond­ing bench­mark (and test!), we can eval­u­ate our SIMD-op­ti­mized func­tion.

sines/cubic_simd        time:   [1.7692 us 1.7733 us 1.7796 us]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high severe

Some­what sur­pris­ing­ly, the SIMD ver­sion is more than 4× faster than the scalar ver­sion. For our (al­ready au­to-vec­tor­ized) x86_64-ver­sion, the per­for­mance is un­sur­pris­ing­ly ef­fec­tive­ly iden­ti­cal to the ba­sic sines/quadrant_cubic. If you are in­ter­est­ed in how the ARM/NEON as­sem­bly looks like, here goes:

001b723c <_ZN6bela_35sines25sines_quadrant_cubic_simd17h963bbd989285d3cdE>:
  1b723c:   e92d4070    push    {r4, r5, r6, lr}
  1b7240:   e1a05001    mov r5, r1
  1b7244:   e3a01000    mov r1, #0
  1b7248:   e3a02f5b    mov r2, #364    ; 0x16c
  1b724c:   e1a04000    mov r4, r0
  1b7250:   e3a06000    mov r6, #0
  1b7254:   ebf94e7c    bl  ac4c <memset@plt>
  1b7258:   e3a005c2    mov r0, #813694976  ; 0x30800000
  1b725c:   f2c40650    vmov.i32    q8, #1073741824 ; 0x40000000
  1b7260:   f2c74f58    vmov.f32    q10, #1.5   ; 0x3fc00000
  1b7264:   eea20b90    vdup.32 q9, r0
  1b7268:   f3c3665f    vmov.i32    q11, #-1090519040   ; 0xbf000000
  1b726c:   f3c08670    vmvn.i32    q12, #-2147483648   ; 0x80000000
  1b7270:   e0850006    add r0, r5, r6
  1b7274:   e3560e15    cmp r6, #336    ; 0x150
  1b7278:   f460aa8f    vld1.32 {d26-d27}, [r0]
  1b727c:   e0840006    add r0, r4, r6
  1b7280:   f26ac8f0    vtst.32 q14, q13, q8
  1b7284:   f3f9e3ea    vneg.s32    q15, q13
  1b7288:   f35ec1fa    vbsl    q14, q15, q13
  1b728c:   f3c0c770    vbic.i32    q14, #-2147483648   ; 0x80000000
  1b7290:   f3fbc6ec    vcvt.f32.u32    q14, q14
  1b7294:   f34ccdf2    vmul.f32    q14, q14, q9
  1b7298:   f34cedfc    vmul.f32    q15, q14, q14
  1b729c:   f34cedfe    vmul.f32    q15, q14, q15
  1b72a0:   f34ccdf4    vmul.f32    q14, q14, q10
  1b72a4:   f34eedf6    vmul.f32    q15, q15, q11
  1b72a8:   f24ccdee    vadd.f32    q14, q14, q15
  1b72ac:   f36ca1f8    vbit    q13, q14, q12
  1b72b0:   f440aa8f    vst1.32 {d26-d27}, [r0]
  1b72b4:   0a000001    beq 1b72c0 <_ZN6bela_35sines25sines_quadrant_cubic_simd17h963bbd989285d3cdE+0x84>
  1b72b8:   e2866010    add r6, r6, #16
  1b72bc:   eaffffeb    b   1b7270 <_ZN6bela_35sines25sines_quadrant_cubic_simd17h963bbd989285d3cdE+0x34>
  1b72c0:   e5952168    ldr r2, [r5, #360]  ; 0x168
  1b72c4:   eeb76a08    vmov.f32    s12, #120   ; 0x3fc00000  1.5
  1b72c8:   e5956160    ldr r6, [r5, #352]  ; 0x160
  1b72cc:   eeb67a00    vmov.f32    s14, #96    ; 0x3f000000  0.5
  1b72d0:   e3120101    tst r2, #1073741824 ; 0x40000000
  1b72d4:   e5950164    ldr r0, [r5, #356]  ; 0x164
  1b72d8:   e1a03002    mov r3, r2
  1b72dc:   ed9f3a31    vldr    s6, [pc, #196]  ; 1b73a8 <_ZN6bela_35sines25sines_quadrant_cubic_simd17h963bbd989285d3cdE+0x16c>
  1b72e0:   12633000    rsbne   r3, r3, #0
  1b72e4:   e3160101    tst r6, #1073741824 ; 0x40000000
  1b72e8:   e1a01006    mov r1, r6
  1b72ec:   e3c33102    bic r3, r3, #-2147483648    ; 0x80000000
  1b72f0:   12611000    rsbne   r1, r1, #0
  1b72f4:   e3100101    tst r0, #1073741824 ; 0x40000000
  1b72f8:   e3c11102    bic r1, r1, #-2147483648    ; 0x80000000
  1b72fc:   ee013a10    vmov    s2, r3
  1b7300:   f3c00610    vmov.i32    d16, #-2147483648   ; 0x80000000
  1b7304:   ee001a10    vmov    s0, r1
  1b7308:   e1a01000    mov r1, r0
  1b730c:   12611000    rsbne   r1, r1, #0
  1b7310:   eeb81a41    vcvt.f32.u32    s2, s2
  1b7314:   e3c11102    bic r1, r1, #-2147483648    ; 0x80000000
  1b7318:   eeb80a40    vcvt.f32.u32    s0, s0
  1b731c:   f26011b0    vorr    d17, d16, d16
  1b7320:   ee021a10    vmov    s4, r1
  1b7324:   f26021b0    vorr    d18, d16, d16
  1b7328:   eeb82a42    vcvt.f32.u32    s4, s4
  1b732c:   ee211a03    vmul.f32    s2, s2, s6
  1b7330:   ee200a03    vmul.f32    s0, s0, s6
  1b7334:   ee222a03    vmul.f32    s4, s4, s6
  1b7338:   ee214a01    vmul.f32    s8, s2, s2
  1b733c:   ee203a00    vmul.f32    s6, s0, s0
  1b7340:   ee225a02    vmul.f32    s10, s4, s4
  1b7344:   ee214a04    vmul.f32    s8, s2, s8
  1b7348:   ee203a03    vmul.f32    s6, s0, s6
  1b734c:   ee200a06    vmul.f32    s0, s0, s12
  1b7350:   ee211a06    vmul.f32    s2, s2, s12
  1b7354:   ee225a05    vmul.f32    s10, s4, s10
  1b7358:   ee222a06    vmul.f32    s4, s4, s12
  1b735c:   ee244a07    vmul.f32    s8, s8, s14
  1b7360:   ee233a07    vmul.f32    s6, s6, s14
  1b7364:   ee062a10    vmov    s12, r2
  1b7368:   ee255a07    vmul.f32    s10, s10, s14
  1b736c:   ee076a10    vmov    s14, r6
  1b7370:   ee311a44    vsub.f32    s2, s2, s8
  1b7374:   ee300a43    vsub.f32    s0, s0, s6
  1b7378:   f3571110    vbsl    d17, d7, d0
  1b737c:   ee030a10    vmov    s6, r0
  1b7380:   e2840e16    add r0, r4, #352    ; 0x160
  1b7384:   f3562111    vbsl    d18, d6, d1
  1b7388:   ee322a45    vsub.f32    s4, s4, s10
  1b738c:   f3530112    vbsl    d16, d3, d2
  1b7390:   f4c0183f    vst1.32 {d17[0]}, [r0 :32]
  1b7394:   e2840f59    add r0, r4, #356    ; 0x164
  1b7398:   f4c0083f    vst1.32 {d16[0]}, [r0 :32]
  1b739c:   e2840f5a    add r0, r4, #360    ; 0x168
  1b73a0:   f4c0283f    vst1.32 {d18[0]}, [r0 :32]
  1b73a4:   e8bd8070    pop {r4, r5, r6, pc}
  1b73a8:   30800000    .word   0x30800000

This looks pret­ty good. No ex­cess bounds check­ing, the on­ly re­al an­noy­ance be­ing the point­less memset call at 001b7254 we just can't seem to get rid of. But that's al­so a lot of du­pli­cate code for the last three scalar val­ues. And both is­sues can eas­i­ly be fixed! The for­mer by pass­ing in an ex­ist­ing mu­ta­ble ar­ray ref­er­ence. If you are a fre­quent Rust us­er, you may think of us­ing an ar­ray of MaybeUninit<f32x4>, but the fi­nal array_assume_init re­sults in an even more ex­pen­sive memcpy. Be­sides, that would in­volve unsafe code, and a &mut as out-ref­er­ence — while not ex­act­ly id­iomat­ic Rust — works fine and is very sim­ple. (Note to self: check out the uninit crate) The lat­ter can be solved by round­ing up the num­ber of sines from 91 to 92 and dis­card­ing all the com­pli­cat­ed chunks_exact(_mut) stuff:

pub fn sines_full_simd(phasors: &[u32x4; 23], ret: &mut [f32x4; 23]) {
    for (p, r) in phasors.iter().zip(ret.iter_mut()) {
        *r = sin_quadrant_cubic_x4(*p);

Now let's bench­mark what the ef­fect of do­ing slight­ly more work to make ev­ery­thing more ef­fi­cient has:

sines/full_simd         time:   [1.1176 us 1.1183 us 1.1192 us]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high severe

Nice! And the gen­er­at­ed as­sem­bly has be­come a lot neater too:

001b880c <_ZN6bela_35sines15sines_full_simd17he84230b156fb0d27E>:
  1b880c:   e3a035c2    mov r3, #813694976  ; 0x30800000
  1b8810:   f2c40650    vmov.i32    q8, #1073741824 ; 0x40000000
  1b8814:   f2c74f58    vmov.f32    q10, #1.5   ; 0x3fc00000
  1b8818:   eea23b90    vdup.32 q9, r3
  1b881c:   f3c3665f    vmov.i32    q11, #-1090519040   ; 0xbf000000
  1b8820:   e3a02000    mov r2, #0
  1b8824:   f3c08670    vmvn.i32    q12, #-2147483648   ; 0x80000000
  1b8828:   e0803002    add r3, r0, r2
  1b882c:   f463aacf    vld1.64 {d26-d27}, [r3]
  1b8830:   e0813002    add r3, r1, r2
  1b8834:   f26ac8f0    vtst.32 q14, q13, q8
  1b8838:   e2822010    add r2, r2, #16
  1b883c:   f3f9e3ea    vneg.s32    q15, q13
  1b8840:   e3520e17    cmp r2, #368    ; 0x170
  1b8844:   f35ec1fa    vbsl    q14, q15, q13
  1b8848:   f3c0c770    vbic.i32    q14, #-2147483648   ; 0x80000000
  1b884c:   f3fbc6ec    vcvt.f32.u32    q14, q14
  1b8850:   f34ccdf2    vmul.f32    q14, q14, q9
  1b8854:   f34cedfc    vmul.f32    q15, q14, q14
  1b8858:   f34cedfe    vmul.f32    q15, q14, q15
  1b885c:   f34ccdf4    vmul.f32    q14, q14, q10
  1b8860:   f34eedf6    vmul.f32    q15, q15, q11
  1b8864:   f24ccdee    vadd.f32    q14, q14, q15
  1b8868:   f36ca1f8    vbit    q13, q14, q12
  1b886c:   f443aacf    vst1.64 {d26-d27}, [r3]
  1b8870:   1affffec    bne 1b8828 <_ZN6bela_35sines15sines_full_simd17he84230b156fb0d27E+0x1c>
  1b8874:   e12fff1e    bx  lr

And on our desk­top ma­chine, this code is faster than the au­to-vec­tor­ized code as well! Due both to the sim­pler gen­er­at­ed code and the use of aligned loads/stores in­stead of un­aligned loads/stores.

sines/full_simd         time:   [42.771 ns 42.886 ns 42.999 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

At this point, there is very lit­tle we can do to in­crease per­for­mance of our ba­sic tone gen­er­a­tion fur­ther and this post is al­ready very long, so I'll leave the FFI for the Bela C API for the next post. Un­til then, feel free to fol­low me and send me a DM on Twit­ter if you have any ques­tions or com­ments. I'm al­so ac­tive on the Bela fo­rum.