Poor performance on STM32 board? #15
Ah, yes, I suspect that not having an FPU makes that delay object too
heavy.
I’ve been using LEAF mostly on the STM32F7 and STM32H7, both of which have an FPU,
so it’s built around floats being efficient on the hardware. You can check
out some LEAF usage in my other repos, like electrosteel_embedded, in the daisy
audio folder.
…On Fri, Sep 22, 2023 at 1:34 PM Artem Synytsyn ***@***.***> wrote:
Hello everyone! I'm interested in running your framework on my Blackpill
board (STM32F411CEU6). I've noticed that I'm experiencing poor performance,
and I'm wondering if I might be using your framework incorrectly or if this
is expected behavior. Since there are no examples provided for embedded
applications, I'm concerned that I may have made a mistake somewhere.
To provide some context, I've connected a PCM5102 DAC along with DMA to
ensure that the MCU isn't overwhelmed with a simple I2S operation. Below,
I've included the relevant portions of my code from the main.c file, along
with comments for clarity:
#include "leaf.h"
I2S_HandleTypeDef hi2s1;

// Constants
#define SAMPLERATE 44000
#define LEAF_BUFFER_SIZE (2*44100)
#define BUFFER_SIZE 8192

// Buffers
char mempool[10000]; // LEAF memory pool
uint16_t samplebuffer[BUFFER_SIZE] = {0}; // Buffer used for transmission to the I2S codec with DMA

// Pointers used for switching between buffer halves in the DMA transfer
volatile uint16_t *current_buffer_element_ptr = samplebuffer;
volatile size_t current_buffer_element_cntr = 0;

// LEAF objects
LEAF leaf;
tCycle cycle;
tHermiteDelay delay;

// Utility functions
float rnd_func()
{
    return ((float)rand() / (float)(RAND_MAX));
}

// Callbacks used for DMA transfers. When the first half of the buffer has been sent
// (i2s_transfer_half_complited_callback called), I point current_buffer_element_ptr at the
// beginning and allow LEAF to fill it while the other half of the buffer is sent to the
// DAC. And vice versa.
void i2s_transfer_complited_callback(I2S_HandleTypeDef *hi2s)
{
    if (current_buffer_element_cntr >= BUFFER_SIZE / 2)
    {
        current_buffer_element_ptr = samplebuffer + BUFFER_SIZE / 2;
        current_buffer_element_cntr = 0;
    } else {
        printf("buffer overrun");
    }
}

void i2s_transfer_half_complited_callback(I2S_HandleTypeDef *hi2s)
{
    if (current_buffer_element_cntr >= BUFFER_SIZE / 2)
    {
        current_buffer_element_ptr = samplebuffer;
        current_buffer_element_cntr = 0;
    } else {
        printf("buffer overrun");
    }
}

int main(void)
{
    // CubeMX-generated init
    HAL_Init();
    SystemClock_Config();
    MX_GPIO_Init();
    MX_DMA_Init();
    MX_I2S1_Init();
    MX_NVIC_Init();

    // Callbacks for the DMA transfer, where I switch buffer halves
    HAL_I2S_RegisterCallback(&hi2s1, HAL_I2S_TX_COMPLETE_CB_ID, &i2s_transfer_complited_callback);
    HAL_I2S_RegisterCallback(&hi2s1, HAL_I2S_TX_HALF_COMPLETE_CB_ID, &i2s_transfer_half_complited_callback);
    HAL_I2S_Transmit_DMA(&hi2s1, samplebuffer, sizeof(samplebuffer)/sizeof(samplebuffer[0]));

    // LEAF init
    LEAF_init(&leaf, SAMPLERATE, mempool, LEAF_BUFFER_SIZE, &rnd_func);
    tCycle_init(&cycle, &leaf);
    tCycle_setFreq(&cycle, 220);
    tHermiteDelay_init(&delay, 2000, 2500, &leaf);
    tHermiteDelay_setGain(&delay, 0.5f);

    uint64_t counter = 0;
    while (1)
    {
        // If the DMA controller has successfully finished its transfer to the audio codec, we can put new data there.
        // This part works OK when only simple processing is done here.
        if (current_buffer_element_cntr < BUFFER_SIZE / 2)
        {
            counter++;

            if ((counter % 100000) == 10000)
                tCycle_setFreq(&cycle, 220);
            else if ((counter % 100000) == 20000)
                tCycle_setFreq(&cycle, 330);
            else if ((counter % 100000) == 30000)
                tCycle_setFreq(&cycle, 220);
            else if ((counter % 100000) == 40000)
                tCycle_setFreq(&cycle, 0);

            float processed_value = tCycle_tick(&cycle);
            //float delayed_value = tHermiteDelay_tick(&delay, processed_value); // <<<< LOOK HERE
            //processed_value = delayed_value; // <<<<< LOOK HERE
            *(current_buffer_element_ptr + current_buffer_element_cntr) = (uint16_t) (0x0fff * (1.0f + processed_value));
            current_buffer_element_cntr++;
        }
    }
}
We can assume that the code is functioning correctly. I have provided a
recording of the sound when the sequence is running as expected:
Record - sequence, works ok
<https://recorder.google.com/7a84e457-62ad-42a7-b041-f76fc30f43e0>
Next, I uncommented a section marked as "<<<< LOOK HERE." This enabled a
delay for the audio, and as a result, the sound became severely distorted:
Record - sequence + delay, broken
<https://recorder.google.com/e2904a43-fd83-4bb8-99f6-43528ba1f1db>
I also tested similar code on a host machine (you can find it in my fork
and example:
https://github.com/leechwort/LEAF/blob/master/Examples/sawtooth-sequence.c)
and it worked ok. This leads me to suspect that the issue might be related
to performance limitations on the STM32 board.
In summary, I have a few questions:
- Can you suggest what might be causing this behavior on the STM32
board? Is there a specific way I should be using your framework for
embedded systems that differs from using it on a host machine?
- Do you have any example projects specifically designed for the STM32
platform that I could refer to for guidance?
- It appears that the FPU (Floating-Point Unit) is not utilized in
this framework. Do you have plans to implement FPU support in the future?
|
Oh wait, are you saying there is an FPU on your F4? Using it isn’t something
the LEAF library turns on; you just need to enable it with a compiler flag.
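One way to confirm at build time that the flag actually took effect is to look at the macros the compiler predefines. The sketch below assumes a GCC- or Clang-style ARM toolchain that follows ACLE, where `__ARM_FP` has bit 2 (0x4) set when single-precision hardware instructions are generated and `__ARM_PCS_VFP` is defined for the hard-float calling convention. `fpu_build_mode` is a hypothetical helper name, not part of LEAF.

```c
/* Reports which floating-point mode this translation unit was compiled for.
 * Assumption: an ACLE-conforming ARM compiler (e.g. arm-none-eabi-gcc).
 * On non-ARM hosts the function only reports that it is not an ARM build. */
static const char *fpu_build_mode(void)
{
#if defined(__arm__)
#  if defined(__ARM_FP) && (__ARM_FP & 0x4)
#    if defined(__ARM_PCS_VFP)
    /* FPU instructions are emitted and floats are passed in FPU registers. */
    return "hardware single-precision FPU, hard-float ABI";
#    else
    /* FPU instructions are emitted, but arguments travel in core registers. */
    return "hardware single-precision FPU, soft-float calling convention";
#    endif
#  else
    /* Every float operation becomes a call into a software library. */
    return "no FPU instructions (software floating point)";
#  endif
#else
    return "not a 32-bit ARM build";
#endif
}
```

Printing the returned string once at startup (or checking it in a debugger) shows whether the object files were really built with hardware float; every source file, including the LEAF sources, needs the same flags.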
|
Hi Artem and Jeff,
I believe the F411CE has a single-precision (32-bit float) FPU.
…-mfloat-abi=hard -mfpu=fpv4-sp-d16
should turn it on.
Using doubles will cause it to slow to a crawl, though, since the FPU only handles single precision.
And even with an FPU, integer operations are often much faster.
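The point about doubles is easy to trip over by accident, because an unsuffixed literal such as `0.5` has type `double` in C and promotes the whole expression. The sketch below is only an illustration with hypothetical function names; it uses `sizeof` to show which type the compiler picks for each expression.

```c
#include <stddef.h>

/* x * 0.5 is evaluated as double: the float operand is promoted because the
 * literal 0.5 is a double. On a single-precision-only FPU this multiply is
 * done in software. */
static size_t size_with_double_literal(float x)
{
    return sizeof(x * 0.5);
}

/* x * 0.5f stays in float, so a single-precision FPU can execute it directly. */
static size_t size_with_float_literal(float x)
{
    return sizeof(x * 0.5f);
}

/* Gain stage written so that no double ever appears in the expression. */
static float scale_single_precision(float x)
{
    return x * 0.5f;
}
```

With GCC, the `-Wdouble-promotion` warning flags the first pattern, which can help when auditing audio code for accidental double math.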
|
Thanks for your replies, guys! I've re-checked that the FPU is enabled on my board, and even benchmarked a piece of code:
~19 sec without the FPU and ~6 sec with it. But there is no change in LEAF performance. I still think I may be missing something. Sure, the F411 has much less performance than the STM32F7, but is the delay really that computationally expensive? Also, my mistake on the last question: I meant "CMSIS DSP", not the FPU. |
OK, it looks like things get better with larger buffers, but since RAM is limited on the Blackpill, maybe it's time to switch to something with additional onboard RAM :) |
I looked over the LEAF delay code, and there is nothing particularly slow in it. I can see some places to speed things up (use & and power-of-two buffers for the circular wrap; hand-optimize the Hermite interpolation to minimize operations).
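As a rough illustration of the power-of-two idea, here is a minimal integer-sample delay line that wraps its indices with `&` instead of a compare or modulo. It is a sketch, not LEAF's tHermiteDelay: the names `MaskedDelay`, `masked_delay_init` and `masked_delay_tick` are made up for this example, interpolation is left out, and the 4096-sample length (16 KB of floats) is an assumed rounding-up of the 2500-sample maximum used in the original post.

```c
#include <stdint.h>

#define DELAY_LEN  4096u              /* must be a power of two */
#define DELAY_MASK (DELAY_LEN - 1u)   /* 0x0FFF: keeps indices in range */

typedef struct {
    float    buf[DELAY_LEN];
    uint32_t write_pos;               /* next slot to write */
    uint32_t delay;                   /* delay in samples, 0..DELAY_LEN-1 */
} MaskedDelay;

static void masked_delay_init(MaskedDelay *d, uint32_t delay_samples)
{
    for (uint32_t i = 0; i < DELAY_LEN; ++i) {
        d->buf[i] = 0.0f;
    }
    d->write_pos = 0;
    d->delay = delay_samples & DELAY_MASK;
}

/* Writes one input sample and returns the sample written `delay` ticks ago. */
static float masked_delay_tick(MaskedDelay *d, float in)
{
    d->buf[d->write_pos] = in;
    /* Unsigned subtraction wraps modulo 2^32; masking then gives the index
     * modulo DELAY_LEN without any branch or division. */
    uint32_t read_pos = (d->write_pos - d->delay) & DELAY_MASK;
    float out = d->buf[read_pos];
    d->write_pos = (d->write_pos + 1u) & DELAY_MASK;
    return out;
}
```

The trade-off is memory: the buffer has to be rounded up to the next power of two, which matters on a part with limited RAM.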
However, a couple of tips that might help speed things up:
Try computing the first half of the buffer after the half-complete callback and the second half after the complete callback. That’ll give you a little more leeway.
Here’s what I’m doing with the 12-bit DAC (at 96k, 64-sample blocks):
void HAL_DAC_ConvCpltCallbackCh1(DAC_HandleTypeDef *hdac)
{
    ChaosHalfBlock(32);
}

void HAL_DAC_ConvHalfCpltCallbackCh1(DAC_HandleTypeDef* hdac)
{
    ChaosHalfBlock(0);
}
Also, I call the audio code in the callback (as you see above). In my code, the main while() loop
is lower priority.
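To make the half-buffer scheme concrete, here is a host-side sketch of the same pattern with the hardware removed. Everything in it is an assumption for illustration: `fill_half`, `on_half_complete`, `on_complete` and `next_sample` are hypothetical names, `next_sample` is a stand-in ramp rather than real synthesis code, and on the target the two handlers would be called from the DMA half-complete and complete callbacks.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64u                 /* whole circular DMA buffer */
#define HALF_BLOCK (BLOCK_SIZE / 2u)

static uint16_t dma_buffer[BLOCK_SIZE];

/* Stand-in for the per-sample audio code: a 16-step ramp in [-1, 1). */
static float next_sample(void)
{
    static uint32_t n = 0;
    float s = (float)(n % 16u) / 8.0f - 1.0f;
    n++;
    return s;
}

/* Renders one half block and converts it to unsigned 12-bit DAC codes. */
static void fill_half(size_t offset)
{
    for (size_t i = 0; i < HALF_BLOCK; ++i) {
        float s = next_sample();
        dma_buffer[offset + i] = (uint16_t)(2047.5f * (1.0f + s));
    }
}

/* DMA has just finished sending the first half and is now reading the
 * second half, so the first half is free to be refilled. */
static void on_half_complete(void)
{
    fill_half(0);
}

/* DMA has just finished the second half and wrapped around to the first,
 * so the second half is free to be refilled. */
static void on_complete(void)
{
    fill_half(HALF_BLOCK);
}
```

Because each half is rendered in one burst right after the DMA leaves it, the audio code gets a full half-buffer period to finish, and nothing depends on how often the main loop happens to run.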
Use at least -O3 with F4 processors, and -Ofast if in a tight spot.
Tom
|