Let's talk about the types of positioning solutions.

Autonomous. This means a solution based solely on GNSS satellite data. It is called autonomous because the receiver computes it on its own, without external corrections. The simplest variant is the single-frequency autonomous solution. Next is the dual-frequency autonomous solution; its accuracy is approximately 2.5-3 meters CEP50.

Question:

Based on the example here, I tried a very similar design, but instead of multiplying two matrices I just multiply every element of a matrix by 2.0. However, when comparing the results of multiplying a 32x32 matrix by 2.0 on the ARM (with -O3 optimization) against the hardware (i.e. the FPGA side), I noticed that the first took 1425 clock cycles while the second took 3654 clock cycles. So basically, the FPGA is almost 3 times slower. (See this for the acceleration factors I'm talking about in the matrix-multiplication example.) I already tried changing the port that connects the ARM to the AXI DMA block from ACP to HP, and the results are the same.

I'm also using AXI DMA to transfer data to and from the DDR. I measured the MM2S (Memory-Mapped to Stream) transfer at 1343 clock cycles for 4096 bytes, which works out to a transfer speed of 290.8 MBytes/s. The S2MM (Stream to Memory-Mapped) transfer in turn runs at 167.2 MBytes/s, because it transferred 4096 bytes in 2336 clock cycles.

I have several questions I hope you can help with:

- Why is my FPGA design slower than the ARM when multiplying a matrix by 2.0, but not when multiplying two matrices?
- Do these AXI DMA speeds look okay to you? Comparing them with Sadri's video, it seems I should be able to transfer much faster.
- I saw somewhere that S2MM transfers are expected to be slower than MM2S transfers on the Zedboard. Can you tell me why, and whether this large a difference makes sense? What can I do to improve these transfer speeds?
- I measured the time my PC takes to multiply a 32x32 matrix by 2.0: 3.84×10⁻⁵ is not what I get — it takes 3.84×10⁻⁶ seconds. Given that the same multiplication takes 1.42×10⁻⁵ s on the ARM and 3.85×10⁻⁵ s on the FPGA, the PC's CPU is almost 4 times faster than the ARM and almost 10 times faster than the FPGA. If my objective was to design an FPGA model that accelerates software, why am I so far off when I'm following an example?

Note: my clock frequency is 100 MHz, so each clock cycle is 10 ns.

Answer:

Don't forget that the ARM processor runs at a much higher speed than the programmable logic: it runs somewhere between 666 MHz and 1 GHz, while your logic runs at 100 MHz. 100 MHz is pretty slow; you can probably ramp it up to 150-200 MHz.

Multiplying two matrices requires more operations, more data dependencies, more memory accesses, and so on. In that case it's easier to take advantage of the FPGA's parallelism; multiplying by a constant is simply not complex enough.

The 1343 cycles to transfer 4096 bytes seems a little slow, but not too far off if your design is under stress. That said, you should see better results: you would get higher rates with a 64-bit AXI bus (I'm guessing you used 32 bits) and by configuring the AXI DMA to use a larger burst length.

The thing that worries me in your results is the 3654 cycles it took to perform the matrix-by-constant multiply. I would expect something closer to the 1343 cycles your DMA transfer took, which is what you would get if you pipelined your operations properly. It seems you transfer data from RAM to your IP, then multiply the matrix, then transfer it from your IP back to RAM, taking around 1024 cycles for each step. It should all happen at the same time: stream data from RAM to the IP, multiply the incoming data on the fly (without storing it), and send it straight out through the S2MM port. In that case it would take 1024 cycles plus the latency through the cores.