
FPGA implementation

Posted: Tue Oct 28, 2014 12:59 am
by synthop
Hey guys, first let me say what amazing work you have all already done in accurately recreating the original OPL3. So much of the work in matching the bit-accuracy of the original design has been done by you guys. It's very impressive.

So as some of you may know, I've begun implementing the OPL3 in an FPGA. I'm using SystemVerilog for the design and Octave for analysis. I'll be targeting a Xilinx Zynq-7000 on a Digilent Zybo board--it's a pretty cheap board that has a pretty nice DAC on it. The ARM cores are there for when I get around to the software end of things.

So far I've got the start of an Operator: the phase increment and accumulator, including the original log sine and exp LUTs implemented in ROMs. The paper that Steffen wrote helped a lot. The details of all the math are a bit over my head, but it's very interesting and clever how they were able to sneak the gain in there as an addition using no multipliers. Though multipliers are trivial now in FPGAs I suppose it was different back in the early 90s (when I was barely a teenager).
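
To make the "gain as an addition" point concrete, here is a minimal Python sketch (Python rather than SystemVerilog only so it can be run without a simulator). It assumes the commonly cited closed forms for the two ROMs, a 256-entry quarter-wave log-sine table and a 256-entry exp table. The names `LOG_SIN`, `EXP` and `exp_lookup` are made up for this illustration and are not taken from the actual design.

```python
import math

# Assumed closed forms for the ROM contents (not copied from the actual design).
LOG_SIN = [round(-math.log2(math.sin((i + 0.5) * math.pi / 512)) * 256)
           for i in range(256)]
EXP = [round((2 ** (i / 256) - 1) * 1024) for i in range(256)]


def exp_lookup(level):
    """Convert a log2-domain attenuation (8 fractional bits) back to linear."""
    mantissa = (1024 + EXP[~level & 0xFF]) << 1   # restore the implicit leading one
    return mantissa >> (level >> 8)               # integer part becomes a right shift


full_scale = exp_lookup(0)        # no attenuation
half_scale = exp_lookup(256)      # 256 added in the log domain = one octave down
print(LOG_SIN[0], full_scale, half_scale)
```

Because the level is a base-2 logarithm with 8 fractional bits, adding 256 to it halves the output, so applying the envelope reduces to an add followed by a shift.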

So the output of my Operator looks pretty good so far, I think. Increasing env decreases the gain, as we'd expect. Interestingly, once env gets above about 300 the output becomes so tiny it's barely usable. What I've noticed, however, are fairly prominent glitches on my output that show up occasionally but at regular, periodic intervals:

Image

At first I thought it was due to errors in 1s complement math; I changed to 2s complement as a test and the errors became less frequent but still apparent. Perhaps these are present in the original design as well. Is this something you guys have noticed? It could very well be that I'm introducing some errors somewhere. I'm using a 20-bit accumulator, and here's the relevant code (I know it's a language you might not be familiar with, but you can probably get the gist):

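Code:

    // Quarter-wave log-sine lookup: the index is mirrored when phase_acc[18] is set
    opl3_log_sine_lut log_sine_lut_inst (
        .theta(phase_acc[18] ? ~phase_acc[17:10] : phase_acc[17:10]),
        .out(log_sin_out),
        .*
    );

    // Attenuation is added in the log domain, so no multiplier is needed
    always_ff @(posedge clk)
        log_sin_plus_gain <= log_sin_out + (env << 3);

    // The inverted low 8 bits index the exp LUT
    opl3_exp_lut exp_lut_inst (
        .in(~log_sin_plus_gain[7:0]),
        .out(exp_out),
        .*
    );

    // Add back the implicit leading one of the mantissa
    always_ff @(posedge clk)
        tmp_out0 <= (2**10 + exp_out) << 1;

    // The upper bits give the shift amount; phase_acc[19] selects the negative half
    always_ff @(posedge clk)
        if (phase_acc[19])
            out <= ~(tmp_out0 >> log_sin_plus_gain[LOG_SIN_PLUS_GAIN_WIDTH-1:8]);
        else
            out <= tmp_out0 >> log_sin_plus_gain[LOG_SIN_PLUS_GAIN_WIDTH-1:8];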
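
For cross-checking the RTL above against a software reference, the same datapath can be sketched in Python. This is a rough behavioral model under stated assumptions: the LUT contents come from the commonly cited closed-form formulas, the pipeline registers and the exact width of `log_sin_plus_gain` are ignored, and `operator_out` is a hypothetical name used only here.

```python
import math

# Assumed ROM contents (closed-form formulas, not the actual module's tables).
LOG_SIN = [round(-math.log2(math.sin((i + 0.5) * math.pi / 512)) * 256)
           for i in range(256)]
EXP = [round((2 ** (i / 256) - 1) * 1024) for i in range(256)]


def operator_out(phase_acc, env):
    """Combinational model of the posted datapath for a 20-bit phase_acc."""
    theta = (phase_acc >> 10) & 0xFF              # phase_acc[17:10]
    if (phase_acc >> 18) & 1:                     # mirror the quarter-wave index
        theta ^= 0xFF
    level = LOG_SIN[theta] + (env << 3)           # log-domain attenuation
    mantissa = (1024 + EXP[~level & 0xFF]) << 1   # exp LUT plus implicit one
    magnitude = mantissa >> (level >> 8)          # integer part is a right shift
    # phase_acc[19] selects the negative half-wave (one's complement negate)
    return ~magnitude if (phase_acc >> 19) & 1 else magnitude


peak = operator_out(255 << 10, 0)          # top of the first quarter, no attenuation
quiet_peak = operator_out(255 << 10, 300)  # same phase with env = 300
print(peak, quiet_peak)
```

In this model, env = 300 adds 2400 to the level, which shifts the mantissa right by 9 bits and leaves a single-digit peak. That is consistent with the observation that the output becomes nearly unusable above roughly 300.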


I notice the glitches seem to occur when the MSB of the output of the log sin LUT doesn't quite toggle:
Image

It sort of makes sense that it happens because there's only one value in the log sin LUT that has the MSB set, so if it's off by one the value will be off by a lot. That quantization error is then multiplied by the exp. If it's inherent to the design I'm fine with it, I just want to verify that it's correct. What do you guys think?

Re: FPGA implementation

Posted: Tue Oct 28, 2014 5:01 am
by synthop
I'm sorry, I spoke too soon. I figured out I was accidentally truncating the MSB of the log sin LUT output inside that module. You can see in the above waveform that log_sin_out is only 11 bits when it should be 12. So every time 0 was input, which should give an output of 2137, the value ended up being 89.
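
The arithmetic behind that bug is easy to reproduce. The sketch below assumes the commonly cited closed form for the log-sine ROM (the table name `LOG_SIN` is just for illustration) and shows what dropping bit 11 does to the first entry, and that it is the only entry affected.

```python
import math

# Assumed closed form for the 256-entry quarter-wave log-sine ROM.
LOG_SIN = [round(-math.log2(math.sin((i + 0.5) * math.pi / 512)) * 256)
           for i in range(256)]

full = LOG_SIN[0]                 # value for theta = 0, needs 12 bits
truncated = full & 0x7FF          # what an 11-bit output port keeps
msb_entries = [i for i, v in enumerate(LOG_SIN) if v & 0x800]

print(full, truncated, msb_entries)
```

Only index 0 has bit 11 set, which is why the glitch appeared once per quarter wave rather than everywhere.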

Output sine wave looks nice and smooth now!

Image

Re: FPGA implementation

Posted: Tue Oct 28, 2014 8:27 am
by sto
Hi, nice to see you're making progress (and thanks for the mail with the master clock bug). I just wanted to say that I'm still working on the document; here's the WIP so far, but I'm working on it only very occasionally: http://earvillage.square7.ch/opl3math.pdf.

Re: FPGA implementation

Posted: Wed Nov 05, 2014 3:18 am
by synthop
I've got the generic operator functioning:

All 8 ws selections:
Image
Image
Image
Image
Image
Image
Image
Image

And the envelope generator:
Image

Close up of start of attack:
Image

In addition, I've also coded up the register file, though it's not wired up to anything yet. I'm going to hang it off the AXI bus of one of the ARM cores in order to talk to it with software.

The output portion is also complete. The Zybo board has a DAC on it that expects an I2S input and that's all working. Pretty cool to hear the thing make sound (I tied a single operator to the output so it's just the carrier for now).

Re: FPGA implementation

Posted: Wed Nov 05, 2014 5:42 am
by sto
This looks awesome :D It makes me think again about buying an FPGA myself, although I know I would use it only for this one thing and then put it aside...

Re: FPGA implementation

Posted: Wed Nov 05, 2014 8:26 pm
by synthop
Yeah, I guess the "problem" with FPGAs is that you can already do SO much with cheaper and easier-to-develop software solutions on microcontrollers like the Arduino. However, I think with your math background you could do some pretty interesting DSP stuff.

Re: FPGA implementation

Posted: Mon Nov 17, 2014 5:53 am
by synthop
Well bad news, the design is larger than my FPGA:

Code:
+----------------------------+-------+-------+-----------+--------+
|          Site Type         |  Used | Fixed | Available |  Util% |
+----------------------------+-------+-------+-----------+--------+
| Slice LUTs*                | 18383 |     0 |     17600 | 104.44 |
|   LUT as Logic             | 18338 |     0 |     17600 | 104.19 |
|   LUT as Memory            |    45 |     0 |      6000 |   0.75 |
|     LUT as Distributed RAM |     0 |     0 |           |        |
|     LUT as Shift Register  |    45 |     0 |           |        |
| Slice Registers            | 11806 |     0 |     35200 |  33.53 |
|   Register as Flip Flop    | 11806 |     0 |     35200 |  33.53 |
|   Register as Latch        |     0 |     0 |     35200 |   0.00 |
| F7 Muxes                   |   570 |     0 |      8800 |   6.47 |
| F8 Muxes                   |     0 |     0 |      4400 |   0.00 |
+----------------------------+-------+-------+-----------+--------+


I also have yet to add the rhythm section and the timers. I could possibly tweak the design a bit to get the utilization down, but these devices usually don't like utilization over about 70-80% as they become very hard to route. There are definitely a few things I need to optimize before I throw in the towel however (e.g. global tremolo).

The Zybo board does have the smallest Zynq FPGA, the xc7z010. If I target the next size up, the xc7z020, the design easily fits:

Code:
+----------------------------+-------+-------+-----------+-------+
|          Site Type         |  Used | Fixed | Available | Util% |
+----------------------------+-------+-------+-----------+-------+
| Slice LUTs*                | 19257 |     0 |     53200 | 36.19 |
|   LUT as Logic             | 19212 |     0 |     53200 | 36.11 |
|   LUT as Memory            |    45 |     0 |     17400 |  0.25 |
|     LUT as Distributed RAM |     0 |     0 |           |       |
|     LUT as Shift Register  |    45 |     0 |           |       |
| Slice Registers            | 11866 |     0 |    106400 | 11.15 |
|   Register as Flip Flop    | 11866 |     0 |    106400 | 11.15 |
|   Register as Latch        |     0 |     0 |    106400 |  0.00 |
| F7 Muxes                   |   644 |     0 |     26600 |  2.42 |
| F8 Muxes                   |    10 |     0 |     13300 |  0.07 |
+----------------------------+-------+-------+-----------+-------+


The logical board to get would be the Avnet Zedboard, which has an xc7z020. At $400 US it costs twice as much as the Zybo, but with the much fatter FPGA it's a more versatile board. It also has an I2S audio codec on it--transitioning the design to this board should be a piece of cake.

For now I'll continue working with this board and assuming an OPL2. With half the operators instantiated I'm only at about 63% slice utilization. If I can't seem to get utilization under control, when I save up some dough, I'll spring for the Zedboard.

Re: FPGA implementation

Posted: Mon Nov 17, 2014 6:24 pm
by sto
If you can reduce the size by reducing the operator count, I assume you are trying to replicate that sub-logic 36 times, which isn't necessary. The OPL3 needs 288 cycles per output sample, so there are 288/36 = 8 clock cycles available per operator if you only implement the logic once and reuse it. Channels with up to 4 operators can then be implemented by routing the operators' output from/to temporary registers. But I guess it's a bit tricky to squeeze the logic into 8 cycles.

The OPL2 has only 18 operators, but there are only 72/18 = 4 cycles per operator, which is interesting, because AFAIK the OPL2 has a delay between writing to a register and a reaction to that, whereas the OPL3 seems to show an immediate reaction. (This information is a bit vague, as I can't cite any sources, but I hope I remember correctly.)

Re: FPGA implementation

Posted: Mon Nov 17, 2014 7:50 pm
by synthop
Yeah, I'll think about how to do that. I'm actually using a 12.727MHz master clock instead of the 14.318MHz because my DAC uses 256x oversampling, and this way I can keep everything in one clock domain for simplicity. But yeah, I see what you're saying. This just got more interesting.

The clock frequency is so slow that I can shove a ton of combinational logic in between registers, so I can get a lot done per clock cycle.
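
The cycle budgets mentioned in the last two posts can be worked out with a few lines of Python. This is only arithmetic on the numbers quoted in the thread; the helper names `cycle_budget` and `sample_rate` are invented for the example.

```python
def cycle_budget(clocks_per_sample, operators):
    """Return (whole clocks per operator slot, clocks left over per sample)."""
    return divmod(clocks_per_sample, operators)


def sample_rate(master_hz, clocks_per_sample):
    """Output sample rate for a given master clock and divider."""
    return master_hz / clocks_per_sample


opl3_budget = cycle_budget(288, 36)   # original OPL3 timing
opl2_budget = cycle_budget(72, 18)    # original OPL2 timing
fpga_budget = cycle_budget(256, 36)   # 256x-oversampled DAC clock, 36 operators

opl3_rate = sample_rate(14.318e6, 288)
fpga_rate = sample_rate(12.727e6, 256)
print(opl3_budget, opl2_budget, fpga_budget, round(opl3_rate), round(fpga_rate))
```

With 256 clocks per sample and 36 operators, each slot gets 7 whole clocks with 4 to spare, and both master clocks land at roughly the same 49.7 kHz output rate.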

Re: FPGA implementation

Posted: Tue Nov 18, 2014 6:25 am
by synthop
Yeah, wow, that made such a difference. It's so obvious now that this is what the original chip did too. I guess this is why it's good to talk about a design with others--if your head is down in the details too much, sometimes you miss the big picture! I created the state machine to time-share the operator resources, and replicated the registers that must be saved for each slot (phase accumulator, envelope state and value, etc.). Here's the new utilization:

Code:
+----------------------------+------+-------+-----------+-------+
|          Site Type         | Used | Fixed | Available | Util% |
+----------------------------+------+-------+-----------+-------+
| Slice LUTs*                | 6492 |     0 |     17600 | 36.88 |
|   LUT as Logic             | 6447 |     0 |     17600 | 36.63 |
|   LUT as Memory            |   45 |     0 |      6000 |  0.75 |
|     LUT as Distributed RAM |    0 |     0 |           |       |
|     LUT as Shift Register  |   45 |     0 |           |       |
| Slice Registers            | 6563 |     0 |     35200 | 18.64 |
|   Register as Flip Flop    | 6563 |     0 |     35200 | 18.64 |
|   Register as Latch        |    0 |     0 |     35200 |  0.00 |
| F7 Muxes                   |  764 |     0 |      8800 |  8.68 |
| F8 Muxes                   |  214 |     0 |      4400 |  4.86 |
+----------------------------+------+-------+-----------+-------+


I still have a lot of verification to do as the architecture has changed quite a bit, but it's looking good so far. With my master clock I have 7 cycles to complete each operator slot update--not a problem with this chip and this clock speed; as I mentioned I can cram a ton of combinational logic between registers.
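
The time-sharing idea described above can be modeled in software as one shared update function plus an array of per-slot state. The sketch below is a simplified illustration, not the actual state machine: it keeps only a 20-bit phase accumulator per slot (envelope state would be stored the same way), and the phase increments and names such as `update_slot` and `run_time_shared` are hypothetical.

```python
PHASE_BITS = 20          # accumulator width quoted earlier in the thread
NUM_SLOTS = 36           # OPL3 operator count
CYCLES_PER_SLOT = 7      # budget with a 256-clock sample period
PHASE_MASK = (1 << PHASE_BITS) - 1


def update_slot(phase_acc, phase_inc):
    """The single shared 'operator' logic: advance one slot's phase."""
    return (phase_acc + phase_inc) & PHASE_MASK


def run_time_shared(phase_incs, num_samples):
    """Visit every slot once per sample, reusing update_slot for all of them."""
    state = [0] * len(phase_incs)          # per-slot registers
    clocks = 0
    for _ in range(num_samples):
        for slot, inc in enumerate(phase_incs):
            state[slot] = update_slot(state[slot], inc)
            clocks += CYCLES_PER_SLOT
    return state, clocks


# Hypothetical phase increments, one per operator slot.
incs = [1000 + 37 * i for i in range(NUM_SLOTS)]
final_state, total_clocks = run_time_shared(incs, num_samples=5000)
print(total_clocks // 5000)   # clocks consumed per output sample
```

Each slot ends up exactly where a dedicated, fully parallel accumulator would, while the slot loop consumes 36 x 7 = 252 of the 256 clocks available per sample.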