WIFI Issue - random reboot when STA is lost (investigation) #276

Open
stuartpittaway opened this issue Feb 6, 2024 · 37 comments

@stuartpittaway
Owner

This ticket is to investigate seemingly random reboots of the controller (often related to also losing WIFI STA) with latest firmware version Release-2023-12-27-12-02

May be related to #239

@stuartpittaway stuartpittaway added the help wanted Extra attention is needed label Feb 6, 2024
@stuartpittaway stuartpittaway self-assigned this Feb 6, 2024
@stuartpittaway
Owner Author

stuartpittaway commented Feb 6, 2024

Test on my development rig:

  • no sd card
  • no TFT display
  • V4.2 PCB controller + current shunt addon
  • 6 cells being monitored (v4.90 board)
  • ESP32 reported at boot up: ESP32 Chip model = 1, Rev 1, Cores=2, Features=50
  • Powered from USB cable into ESP32 (from desktop PC)
  • MQTT enabled, INFLUX disabled, Home Assistant API not used
  • MQTT broker configured for mqtt://test.mosquitto.org:1884, port 1884, username: rw, password: readwrite

The ESP32 was connected to a WIFI hot spot on a mobile phone (Android). On boot up, the controller reports the following (filtered for wifi events only, with MQTT logging increased to DEBUG level):

D (6469) diybms: starting wifi_init_sta
I (6492) diybms: WIFI SSID: XXXXXXXXXXXXXXXX
I (6569) diybms: Hostname: DIYBMS-005CED90
D (6570) diybms: wifi_init_sta finished
I (13606) diybms: WIFI_EVENT_STA_START
D (13707) diybms: total_free_byte=156976 total_allocated_byte=132724 largest_free_blk=110580 min_free_byte=154632 alloc_blk=360 free_blk=5 total_blk=365
I (15392) diybms: WIFI_EVENT_STA_DISCONNECTED
I (15395) diybms: WIFI connect quick retry 1
I (17809) diybms: WIFI_EVENT_STA_DISCONNECTED
I (17812) diybms: WIFI connect quick retry 2
I (17941) diybms: WIFI_EVENT_STA_CONNECTED channel=11, rssi=-41
I (17966) diybms: IP ADDRESS HAS CHANGED
I (17969) diybms: Request time from time.google.com
I (17970) diybms: Timezone=UTC0DST
I (17971) diybms: The current date/time is: Thu Jan  1 00:00:10 1970
I (17996) diybms: You can access DIYBMS interface at http://DIYBMS-005CED90.local or http://192.168.1.87
W (18512) diybms-mqtt: MQTT enabled, but not yet init
W (19608) diybms-mqtt: MQTT enabled, but not yet init
I (43745) diybms-mqtt: MQTT counters: Err_Con=0,Err_Trans=0,Conn=0,Disc=0
I (43746) diybms-mqtt: esp_mqtt_client_init
I (43750) diybms-mqtt: esp_mqtt_client_start
I (44254) diybms-mqtt: MQTT_EVENT_CONNECTED
I (46619) diybms-mqtt: Rule status payload
D (46627) diybms-mqtt: Topic:emon/diybms2/rule, ID:0, Length:103
I (46628) diybms-mqtt: Outputs status payload
D (46634) diybms-mqtt: Topic:emon/diybms2/output, ID:0, Length:25
I (48542) diybms-mqtt: MQTT Payload for cell data

Data is successfully transmitted to MQTT server and web interface is working as expected.

Upon terminating the WIFI hot spot on the Android phone:

I (284130) diybms: WIFI_EVENT_STA_DISCONNECTED
E (284132) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 51
E (284133) MQTT_CLIENT: Poll read error: 119, aborting connection
I (284140) diybms-mqtt: MQTT_EVENT_DISCONNECTED
I (284207) diybms-mqtt: MQTT counters: Err_Con=0,Err_Trans=0,Conn=1,Disc=1
I (284233) diybms-mqtt: Stopping MQTT client
W (286282) diybms-mqtt: MQTT enabled, but not connected
W (289710) diybms-mqtt: MQTT enabled, but not connected
W (291285) diybms-mqtt: MQTT enabled, but not connected
W (291285) diybms-mqtt: MQTT enabled, but not connected
W (291286) diybms-mqtt: MQTT enabled, but not connected
I (299155) diybms: WIFI connect quick retry 1
W (301288) diybms-mqtt: MQTT enabled, but not yet init
I (301569) diybms: WIFI_EVENT_STA_DISCONNECTED
I (301571) diybms: WIFI connect quick retry 2
I (301709) diybms-rules: Set error 2:ModuleCountMismatch
I (301710) diybms: Active errors=1
W (301711) diybms-mqtt: MQTT enabled, but not yet init
I (303985) diybms: WIFI_EVENT_STA_DISCONNECTED
I (303988) diybms: WIFI connect quick retry 3

** removed similar messages **

I (313650) diybms: WIFI_EVENT_STA_DISCONNECTED
I (313653) diybms: WIFI connect quick retry 7
I (313713) diybms-rules: Set error 2:ModuleCountMismatch
I (313714) diybms: Active errors=1
W (313715) diybms-mqtt: MQTT enabled, but not yet init
I (314266) diybms: Trying to connect WIFI
E (314267) wifi:sta is connecting, return error
ESP_ERROR_CHECK_WITHOUT_ABORT failed: esp_err_t 0x3007 (ESP_ERR_WIFI_CONN) at 0x4008ea0b
file: "src/main.cpp" line 4187
func: void loop()
expression: esp_wifi_connect()
I (316066) diybms: WIFI_EVENT_STA_DISCONNECTED

** removed similar messages **
I (554684) diybms: Trying to connect WIFI
I (436909) diybms: WIFI_EVENT_STA_DISCONNECTED
E (436910) diybms: Connect to WIFI AP failed, tried 28 times

Upon re-enabling the WIFI hot spot on the Android phone:

I (765022) diybms: Trying to connect WIFI
I (765122) diybms: WIFI_EVENT_STA_CONNECTED channel=11, rssi=-48
I (765150) diybms: IP ADDRESS HAS CHANGED
I (765150) diybms: Request time from time.google.com
I (765151) diybms: Timezone=UTC0DST
I (765152) diybms: The current date/time is: Tue Feb  6 11:56:59 2024
I (765174) diybms: You can access DIYBMS interface at http://DIYBMS-005CED90.local or http://192.168.1.87
I (795081) diybms-mqtt: MQTT counters: Err_Con=0,Err_Trans=0,Conn=0,Disc=0
I (795081) diybms-mqtt: esp_mqtt_client_init
I (795086) diybms-mqtt: esp_mqtt_client_start
I (795276) diybms-mqtt: MQTT_EVENT_CONNECTED

The code in the controller is designed to take the following actions when a loss of WIFI is detected (event WIFI_EVENT_STA_DISCONNECTED):

  • Calls ShutdownAllNetworkServices() (stop_webserver / stopMqtt / stopMDNS)
  • Sets wifi_isconnected = false
  • Attempts to call esp_wifi_connect() up to 25 times - log messages reported as "connect quick retry"

After 25 times, the message reported is Connect to WIFI AP failed, tried XXX times.

Once the 25 attempts have failed, esp_wifi_connect() is called inside the main loop, approx. every 30 seconds, reported as "Trying to connect WIFI"
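
The retry behaviour described above can be sketched as a small desktop model. This is only an illustration of the described intent, not the code in main.cpp: the struct name, its members, and the reset of the counter on reconnect are assumptions, and the real firmware calls esp_wifi_connect() where this sketch only returns a message or a flag.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical model of the reconnect policy; names are illustrative only.
struct WifiReconnectPolicy {
    static constexpr int kMaxQuickRetries = 25;       // "up to 25 times"
    static constexpr uint32_t kSlowRetryMs = 30000;   // "approx. every 30 seconds"

    int quickRetries = 0;
    bool connected = false;
    uint32_t lastSlowAttemptMs = 0;

    // Models one WIFI_EVENT_STA_DISCONNECTED event and returns the log text.
    std::string onDisconnected() {
        connected = false;
        assert(quickRetries <= kMaxQuickRetries);
        if (quickRetries < kMaxQuickRetries) {
            ++quickRetries;
            return "WIFI connect quick retry " + std::to_string(quickRetries);
        }
        return "Connect to WIFI AP failed, tried " +
               std::to_string(quickRetries) + " times";
    }

    // Models WIFI_EVENT_STA_CONNECTED. Resetting the counter is an assumption.
    void onConnected() {
        connected = true;
        quickRetries = 0;
    }

    // Called from the main loop: true when a slow retry is due
    // ("Trying to connect WIFI").
    bool shouldSlowRetry(uint32_t nowMs) {
        if (connected || quickRetries < kMaxQuickRetries) return false;
        if (nowMs - lastSlowAttemptMs < kSlowRetryMs) return false;
        lastSlowAttemptMs = nowMs;
        return true;
    }
};
```

In this model the main loop stays quiet until the quick retries are used up. The log above shows "Trying to connect WIFI" appearing while quick retries are still in progress, so the real firmware evidently does not separate the two phases this strictly.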

As can be seen from the above logs, the development rig environment as described appears to work correctly and recovers from WIFI disconnection and errors successfully.

@stuartpittaway
Owner Author

Related to #220

@stuartpittaway
Owner Author

OK, I managed to get a Guru Meditation Error by repeatedly disabling the WIFI hotspot and quickly re-enabling it.

I (1759137) diybms: WIFI connect quick retry 1
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x401b5f3e  PS      : 0x00060a30  A0      : 0x801b6023  A1      : 0x3ffd8f00
A2      : 0x3ffb62d4  A3      : 0xffffffff  A4      : 0x00000000  A5      : 0xffffffff  
A6      : 0x00000000  A7      : 0x3ffe3458  A8      : 0x3ffdae70  A9      : 0x3ffd8e70
A10     : 0x00000000  A11     : 0x00000001  A12     : 0x3ffe2928  A13     : 0x3ffe2928  
A14     : 0x3ffe3428  A15     : 0x3ffe3462  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000  LBEG    : 0x4008c0e1  LEND    : 0x4008c0f1  LCOUNT  : 0xfffffffe  


Backtrace: 0x401b5f3b:0x3ffd8f00 0x401b6020:0x3ffd8f50

  #0  0x401b5f3b:0x3ffd8f00 in handler_execute at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:139
      (inlined by) esp_event_loop_run at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:590
  #1  0x401b6020:0x3ffd8f50 in esp_event_loop_run_task at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:115 (discriminator 15)  

stuartpittaway added a commit that referenced this issue Feb 6, 2024
@stuartpittaway
Owner Author

Possible fix firmware (experimental)
diybms_controller_firmware_experimental_bug276.zip

@jetronic18s

jetronic18s commented Feb 6, 2024

Hello Stuart, I also noticed that the DIYBMS (Firmware 2023-11-28) was restarting. It seems to have restarted 3 times in a very short time. Unfortunately, I cannot yet say whether this is related to the WLAN. I will try to do tests with WLAN until the end of the week.

Screenshot_20240206_143232

Screenshot_20240206_143246

I could see from the uptime of the controller that it has really restarted.

@stuartpittaway
Owner Author

It seems to have restarted 3 times in a very short time.

It seems to trigger a reboot if the WIFI connection is lost and restored within a second or two. It looks like a bug in the controller code (as expected!), so I'm hoping this experimental version fixes it.

@jetronic18s

A few days ago I also observed an internal BMS error, which is really strange. I have been using the system for over a year without ever seeing anything like this. It may be important for the analysis.

Screenshot_20240129_221621_nl victronenergy_edit_1054046487623017

@red0909

red0909 commented Feb 6, 2024

The experimental firmware does not start on my ESP32; I just get a black screen.
I tried with two different ESP32 boards and two different computers.

@stuartpittaway
Owner Author

The experimental firmware does not start on my ESP32; I just get a black screen. I tried with two different ESP32 boards and two different computers.

This isn't a complete flash image. Re-flash the "release" version first, then use the over-the-air upgrade feature to apply this experimental one.

@red0909

red0909 commented Feb 6, 2024

OK, now it is running. I disconnected the WiFi several times with no reboot.
Now I need to wait a few days and watch how my inverter behaves.

@red0909

red0909 commented Feb 9, 2024

It has now been running for two days with no issues so far.
However, I noticed that the controller refuses to connect to a network with a hidden SSID. This was possible with the December firmware, although the reconnect problem was there even when the SSID was not hidden.

If this is the trade-off for a stable controller I can live with it, but maybe not every user can?

@stuartpittaway
Owner Author

stuartpittaway commented Feb 9, 2024

I've not made any changes to the wifi stack - so a hidden SSID shouldn't be a problem.

I have a log file from another user who has tested this firmware, and unfortunately it didn't solve his reboots. He uses a Fritzbox, which does appear to be a common source of problems with ESP32 hardware.

CONTROLLER - ver:cbe2f3314cf6ac9e3db3e1cdb27aa386e6facbcc compiled 2024-02-06T12:40:00.542Z
ESP32 Chip model = 1, Rev 1, Cores=2, Features=50

I (245621) diybms: WIFI_EVENT_STA_DISCONNECTED
I (245621) diybms: ShutdownAllNetworkServices
I (245621) diybms-web: httpd_stop
I (245722) diybms: stop mdns
I (245734) diybms: WIFI connect quick retry 1
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x401b5f92  PS      : 0x00060030  A0      : 0x801b6077  A1      : 0x3ffd8da0  
A2      : 0x3ffb6328  A3      : 0xffffffff  A4      : 0x00000000  A5      : 0xffffffff  
A6      : 0x00000000  A7      : 0x3ffe2fc8  A8      : 0x3ffdad40  A9      : 0x3ffd8d10  
A10     : 0x00000000  A11     : 0x00000001  A12     : 0x3ffe2438  A13     : 0x3ffe2438  
A14     : 0x3ffe2f98  A15     : 0x3ffe2fd2  SAR     : 0x00000004  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000000  LBEG    : 0x4008c0e1  LEND    : 0x4008c0f1  LCOUNT  : 0xfffffffe  


Backtrace: 0x401b5f8f:0x3ffd8da0 0x401b6074:0x3ffd8df0

which decodes as

0x401b5f92: handler_execute at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:145
0x401b5f92: esp_event_loop_run at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:590
0x401b5f8f: handler_execute at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:139
0x401b5f8f: esp_event_loop_run at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:590
0x401b6074: esp_event_loop_run_task at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:115
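
As a reading aid for dumps like the one above, here is a minimal sketch that splits a "Backtrace:" line into PC:SP pairs and names a few Xtensa exception causes. The helper names (Frame, parseBacktrace, excCauseName) are made up for this example and are not part of the firmware or of ESP-IDF. EXCCAUSE 0x1c is decimal 28, which corresponds to the LoadProhibited text in the panic message, and EXCVADDR 0x00000000 is the address that was being read.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One backtrace entry: program counter and stack pointer.
struct Frame {
    uint32_t pc;
    uint32_t sp;
};

// Parses "Backtrace: 0xPC:0xSP 0xPC:0xSP ..." into frames.
// Returns an empty vector if the line has no "Backtrace:" prefix.
std::vector<Frame> parseBacktrace(const std::string& line) {
    std::vector<Frame> frames;
    const std::string prefix = "Backtrace:";
    const auto pos = line.find(prefix);
    if (pos == std::string::npos) return frames;

    std::istringstream in(line.substr(pos + prefix.size()));
    std::string token;
    while (in >> token) {
        const auto colon = token.find(':');
        if (colon == std::string::npos) break;
        try {
            Frame f;
            // Base 16 accepts an optional "0x" prefix.
            f.pc = static_cast<uint32_t>(std::stoul(token.substr(0, colon), nullptr, 16));
            f.sp = static_cast<uint32_t>(std::stoul(token.substr(colon + 1), nullptr, 16));
            frames.push_back(f);
        } catch (const std::exception&) {
            break;  // stop at the first token that is not a hex pair
        }
    }
    return frames;
}

// Names for a few Xtensa EXCCAUSE values; all others are reported as "Unknown".
const char* excCauseName(uint32_t cause) {
    switch (cause) {
        case 0:  return "IllegalInstruction";
        case 28: return "LoadProhibited";
        case 29: return "StoreProhibited";
        default: return "Unknown";
    }
}
```

The extracted PC values are the addresses a symbol decoder needs, together with the matching firmware ELF file, to produce the file and line output shown above.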

@N1c084
Contributor

N1c084 commented Feb 9, 2024 via email

@stuartpittaway
Owner Author

Do you think an ESP32 with an Ethernet port could solve the problem?

No idea, I don't have one and it would also need significant code changes to make it work

@HerrFrodo1

YES! LAN is the solution!!! ;-)

@red0909

red0909 commented Feb 9, 2024

I have a log file from another user who has tested this firmware, and unfortunately it didn't solve his reboots. He uses a Fritzbox, which does appear to be a common source of problems with ESP32 hardware.

Has he tested this with another ESP32?

Well, I don't have a Fritzbox, but I also had to replace my WiFi router because some ESP32 boards would not connect to my previous one...

@Linusten

Linusten commented Feb 9, 2024

Sadly I have no logs, but I also have a Fritz!Box and the same issues.

@red0909

red0909 commented Feb 9, 2024

Sadly I have no logs, but I also have a Fritz!Box and the same issues.

Try setting up a WiFi hotspot on your phone and connecting to that. If it does not reboot, then the Fritzbox is the issue.

@HerrFrodo1

HerrFrodo1 commented Feb 12, 2024

@red0909

Well, I don't have a Fritzbox, but I also had to replace my WiFi router because some ESP32 boards would not connect to my previous one...

Which router did you buy instead?

@HerrFrodo1

@red0909

Try setting up a WiFi hotspot on your phone and connecting to that. If it does not reboot, then the Fritzbox is the issue.

A hotspot on an iPhone is not a suitable way to test this.
It only shares the Internet with a connected WiFi subscriber. It does not create an internal network that can be accessed.
Opening the web app seems to be a possible source of the problem, possibly in connection with MQTT.
I tried it.

Yesterday I switched off the Fritz!Box WiFi and tested a TP-Link access point (TL-WR841N).
The ESP32 still crashed. Interestingly, with the Fritz!WLAN Repeater the crashes usually occurred after the WiFi was switched off. With the TP-Link, the crashes now happen when the WiFi is switched on, and only after the web app is opened.

@stuartpittaway
Owner Author

It only shares the Internet with a connected WiFi subscriber. It does not create an internal network that can be accessed.

You can access the DIYBMS web interface directly from the phone web browser, when testing in this fashion.

@red0909

red0909 commented Feb 12, 2024

Which router did you buy instead?

D-Link DSR-250N

It needed a tricky series of 5 firmware updates to reach the newest firmware, but this router is no longer supported and should not be used for Internet access.

I use it for an offline network for my inverters, and this BMS is offline too.
I could use just an 8-port switch, but the diyBMS requires WiFi; it is the only device in my network using WiFi.
I don't trust WiFi for critical devices. The diyBMS also needs a password, or at least a simple 4-digit PIN.

@stuartpittaway
I disconnect the WiFi sometimes to see what happens.
This experimental firmware is still running well with no reboots here, on a cheap fake ESP32.

@HerrFrodo1

@red0909

... I don't trust WiFi for critical devices ...

It's the same for me. diyBMS is the only device on my network without LAN :-(
Our WiFi is switched off from 9 p.m. to 6 a.m.
During that time the most important data from the diyBMS reaches the Victron Cerbo GX via the battery CAN bus.

@stuartpittaway
Owner Author

The diyBMS also needs a password, or at least a simple 4-digit PIN.

Security isn't really possible on this sort of device (ESP32), at least not without a full TLS encryption layer and certificates. Otherwise any sort of password or PIN is pointless, as it could be sniffed off the network.

@red0909

red0909 commented Feb 20, 2024

So, 14 days now with the experimental firmware: no reboots and no problems with the WiFi.

@stuartpittaway
Owner Author

Hi @red0909, it's been 3 weeks now. What's the feedback?

@red0909

red0909 commented Mar 13, 2024

No problems as far as I can see, but I am not using MQTT or Home Assistant.
It is running stably with no reboots on a cheap ESP32 module.
The CAN bus signal is stable too.

@jetronic18s

jetronic18s commented Apr 11, 2024

Hello Stuart,

I installed a DIYBMS a few days ago.
A controller board v4.5 on a 18s1p battery.

I have installed the last 4 official releases on the controller, and whenever the Fritzbox was rebooted or the WiFi was turned off, the controller board restarted.

I then installed the beta "diybms_controller_firmware_experimental_bug276.zip" and the problem was gone. I must have restarted the Fritzbox 2-3 times without a problem.

Today the power was probably off for about 1h during the installation. So the Fritzbox was off and the controller board restarted.
I was able to determine this through the uptime and also the undervoltage error (relay dropped out briefly).

The DIYBMS is connected to the router as follows (MESH is active):
Fritzbox <--> Repeater 1750e <--> DIYBMS

Unfortunately I have no access to the serial console of the controller

Nobody wants to hear that here, sorry.

@red0909

red0909 commented Apr 12, 2024

@jetronic18s
What power supply do you use for the controller? Have you measured the voltage at the controller's screw terminals?
I think this is some sort of power issue.

@jetronic18s

@red0909
I supply the controller via a DCDC (Mean Well DDR-30L-5) from the battery. I have exactly 5V in idle mode. In the event of a fault, I would not be able to measure the voltage.

@stuartpittaway
Owner Author

The DIYBMS is connected to the router as follows (MESH is active):
Fritzbox <--> Repeater 1750e <--> DIYBMS

Yes, that appears to be the common pattern of failure - using Fritzbox along with mesh/repeater wifi units.

Very similar to this problem... arendst/Tasmota#14986

@ruza87
Contributor

ruza87 commented Oct 24, 2024

Hi Stuart,

I think I experienced the same issue recently: I needed to update configuration on my WiFi router, so I rebooted it and just a few seconds later the whole house plunged into darkness, as diyBMS rebooted and relay controlling the inverter went off :)

I managed to replicate the issue by stopping the WiFi for ~5 seconds and then enabling it again. I have a MikroTik cAP ac router, no TFT, MQTT enabled. The backtrace is pretty much the same as yours:

Core  0 register dump:
PC      : 0x401b31a2  PS      : 0x00060730  A0      : 0x801b3287  A1      : 0x3ffd7c80  
A2      : 0x3ffb6304  A3      : 0xffffffff  A4      : 0x00000000  A5      : 0xffffffff
A6      : 0x00000000  A7      : 0x3ffe23ac  A8      : 0x3ffdffe8  A9      : 0x3ffd7bf0  
A10     : 0x00000000  A11     : 0x00000001  A12     : 0x3ffe3500  A13     : 0x3ffe3500
A14     : 0x3ffe237c  A15     : 0x3ffe23b6  SAR     : 0x0000001f  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000000  LBEG    : 0x4008a4dc  LEND    : 0x4008a4f2  LCOUNT  : 0xffffffff

Backtrace: 0x401b319f:0x3ffd7c80 0x401b3284:0x3ffd7cd0

  #0  0x401b319f:0x3ffd7c80 in handler_execute at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:139
      (inlined by) esp_event_loop_run at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:590
  #1  0x401b3284:0x3ffd7cd0 in esp_event_loop_run_task at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:115 (discriminator 15)

Apparently, the ESP32 is reading from an invalid address (EXCVADDR is 0x00000000, a null pointer). My guess is that the ESP was trying to deliver an event to a component that had recently been freed. I've made a slight change to the event_handler() code in main.cpp so that it never calls ShutdownAllNetworkServices(). Once the services are started, they keep running (MQTT reconnects automatically once the WiFi is up again). The second (and probably irrelevant) change was to block simultaneous esp_wifi_connect() calls from the main loop() while the controller is still attempting "WIFI connect quick retry" from the event_handler() routine.
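
The second change could look roughly like the sketch below. It is not the actual patch: ConnectGuard and fakeWifiConnect() are hypothetical names, and fakeWifiConnect() only counts calls so that the example runs on a desktop in place of esp_wifi_connect(). The idea is a shared flag that the event handler sets during its quick retries and that the main loop checks before issuing its own connect attempt.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for esp_wifi_connect(): it only counts calls.
int g_connectCalls = 0;
bool fakeWifiConnect() {
    ++g_connectCalls;
    return true;
}

// Keeps the main loop from calling connect while the event handler
// is still working through its quick retries.
class ConnectGuard {
public:
    // Event handler side: mark the start and end of the quick retry phase.
    void beginQuickRetries() { quickRetryActive_.store(true); }
    void endQuickRetries() { quickRetryActive_.store(false); }

    // Event handler side: one quick retry attempt.
    bool quickRetry() {
        assert(quickRetryActive_.load());
        return fakeWifiConnect();
    }

    // Main loop side: skipped (returns false) while quick retries are active.
    bool slowRetryFromLoop() {
        if (quickRetryActive_.load()) return false;
        return fakeWifiConnect();
    }

private:
    std::atomic<bool> quickRetryActive_{false};
};
```

The check in slowRetryFromLoop() and the connect call that follows are not one atomic step, so a disconnect event arriving in between could still overlap with it. A mutex around both paths would close that gap, at the cost of blocking inside the event handler.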

With the modified firmware I cannot make diyBMS crash. I've tried several WiFi dropouts from 1 second to 1 hour, and it keeps running and reconnects once the WiFi is back. I can prepare a PR if you want to take a look.

@Jestergnet

I thought it was normal, I have the same problem...

@stuartpittaway
Owner Author

Yes @ruza87 a PR would be fabulous.

I've chased this bug for months, the main problem is that I cannot reproduce the error with my router.

@jetronic18s

Great work, with 2 installations it is no longer a problem if the WLAN fails. No reboot when the Fritzbox (with Mesh) restarts.

It looks really good, I'll keep watching.

Thank you very much
@ruza87 @stuartpittaway

@jetronic18s

I have switched another installation to the new firmware with the WiFi fix; again, no reboot after a WiFi failure.
On this installation you could previously always trigger the reboot by unplugging the AVM Powerline adapter.
The setup was a Fritzbox with AVM Powerline WLAN Mesh.

How are the other affected users doing?
Please give us some feedback, then Stuart can close all the Wifi Reboot Issues.

@Linusten

Linusten commented Nov 6, 2024

I am also on the FritzBox Mesh setup. No more issues :) Great work!
