SIGSEGV to perd on openwrt mt7621 #74
Comments
Hi Ansuel, Thanks for letting us know. Do you have any idea what measurement the unit was running at the time of the segfault? Is it consistent or intermittent? Regards, Michel |
I get this every time I restart the atlas binaries, so I can reproduce it easily. I have also noticed that it is sometimes triggered at other times, but not in normal testing. If it would be useful, I can provide my probe ID. |
Hi @Ansuel, Can you check what the memory usage is for this system? Also, I have noticed various other references to OpenWRT and processes segfaulting in this manner. What release of OpenWRT are you on? Regards, Michel |
@michel-stam Hi, normally there is plenty of free memory... for example, on my router, which has been running for 3 days with the atlas tools, I have 350 MB of free RAM. I normally run everything on master.
From what I have read online, this kind of error is related to memory not being correctly freed or handled by the binary, but I could be totally wrong. Again, I can reproduce the error with a simple /etc/init.d/atlas restart, so we have plenty of ways to investigate this. |
Same issue here, the 5040 probe is correctly recognised as online but not sending data. Target Platform ramips/mt7621
Available for further testing. |
Hi, Thanks for the debug. It seems indeed eperd is having some issues. Would you be able to test on a stable OpenWRT release? Just to make sure there's not a development effort that causes the problem? I'm not excluding at this point there's a bug, but at the same time I don't have any other platform with this problem (amongst which is an ath79, which is also a MIPS, 2 ARM systems and some x86_64 systems). We are in the process of pushing 5080 to the GIT repo, which is delayed somewhat. Regards, Michel |
Hi Michel, Yesterday I updated to the 22.03.2 test release and I still encounter the same issue. Thanks |
Hey Andrea, Sorry I thought I was responding to Ansuel, but thank you for testing with a stable release of OpenWRT. It seems you're using the public OpenWRT package, which has changed the internals somewhat from the version RIPE NCC develops in-house. I cannot guarantee its function, sorry. That being said, I'm not unwilling to help, I'll do what I can. Looking at the logfile:
eperd is complaining that its "crontab" file 'root' does not exist. That may mean that the path where the data files are stored does not exist. Can you check whether all the paths referenced by eperd/perd/eooqd exist? Another area to test is the "telnetd" process. I recall that on the public OpenWRT versions there were problems starting that particular process. Please make sure that it's running. This is a directory structure you'll typically find on a working atlas software probe in /var/atlas-probe/crons:
I hope this helps, please let me know. Best regards Michel |
Just to add some info... between master and 22.03 there are not many differences related to this package (but I understand you want info from a stable build).
Could it be that crond is trying to find a home directory for root but is not finding it? I have no idea how it works internally, but I agree that we should understand where that comes from...
From the scripts, base_dir is set to /usr/libexec/atlas-probe-scripts and it does exist... from there, the cron dir and the run dir exist and the PIDs are written to the run dir... what I notice is that the cron directories never had files in them... every directory is empty...
Do you have a way to test this? From what I can see the process should be running, but I wonder if there is a communication problem... but thinking about it, if telnetd wasn't working then the probe should not register at all, right? Actually, now that I'm checking, I do have files in the cron directory... but I can clearly find the problem here... this is from main. The cron values are absolute paths and don't follow the BASE_DIR variable... let me do a test, but this seems to be a problem that should be fixed on your end, as it doesn't make sense to provide a BASE_DIR variable where each internal script can use a different base_dir and then hardcode it in the cron that actually executes stuff... But correct me if I'm wrong... anyway, while at it, I will try to make a symbolic link from that hardcoded path to the real one. |
Hi Ansuel, Difficult to say. The author of the package wrote his own scripts around the atlas startup process, which doesn't seem to work perfectly. The existing process in the RIPE NCC version isn't perfect either, and will be looked at at some point in the future. Not that that helps you right now. The code indeed seems to reference /home/atlas (probe-busybox/include/libbb.h), which as you suggest could be a softlink to where you'd like to store the data. It's not ideal, but it is a quick fix until we have had the time to refactor this bit of code. I am hoping you should start to see improvements in this area as of 5090. |
The main problem is that, even with the workaround in place, we still have
and now data/new/main is empty :( The command works correctly though (when run manually)
|
@michel-stam OK, I can confirm that both eperd and perd crash with the following command. These are the main cron programs, and because of this nothing is sent, as the periodic cron jobs are not run... the processes are not restarted, so the probe is left in this unstable state where it's registered but does nothing. The fun thing is that after they crash, the PID file is still there. (Would it be an idea to add a check that these 2 crucial programs are actually running, by verifying that the PID in the run dir actually exists in the system?)
|
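The stale-PID check suggested above could look roughly like the following. This is only a sketch under assumptions: `pid_file_alive` is a hypothetical helper, not something in the atlas code base, and it assumes the PID file holds a single decimal PID. It uses `kill(pid, 0)`, which sends no signal and only reports whether the process exists.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of the atlas sources).
 * Returns 1 if the PID stored in pid_path belongs to a live process,
 * 0 if the PID file is stale, and -1 if the file is missing or malformed.
 */
int pid_file_alive(const char *pid_path)
{
    FILE *fp = fopen(pid_path, "r");
    long pid;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &pid) != 1 || pid <= 0) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* Signal 0 performs the existence and permission checks only. */
    if (kill((pid_t)pid, 0) == 0)
        return 1;

    /* EPERM means the process exists but belongs to another user. */
    return errno == EPERM ? 1 : 0;
}
```

A watchdog could call this for the perd and eperd PID files and restart the daemon when it returns 0. Note that a PID can be reused by an unrelated process, so this check can report a false "alive".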
Hi Ansuel, Thanks for the help, appreciate it. We actually added a restart functionality in 5060, and fixed some issues in 5070 with that. However, if the process continuously crashes that isn't gonna help a lot, it will just restart it over and over. Do you have the option of running the perd and eperd process through strace to see where the crash occurs? I'm half expecting some directory or path to not exist. Regards, Michel |
Is a debug build needed (for the atlas busybox), or can I just run the command with strace? |
@michel-stam this is strace
|
also running
I have
Why does this look to me like a problem with condmv? |
Is it normal that the file descriptor is 3 when writing the file? |
Additional info: the segfault happens when -A is used... without it, condmv works correctly and the file is moved. |
@michel-stam I found the cause of the segfault... I'm preparing a PR for stable ripe-busybox... this fixes both perd and eperd: perd crashes because it uses condmv from crond, and eperd crashes because it has its own internal condmv functions that are affected by the same problem. |
All of these commands share the same codebase (a derivative of busybox), so it may be some common code. As to your second question, the -A seems to add a record to the source file prior to renaming to the second file. That record contains the argument to -A. So this seems to be normal. I've executed the command on the MIPS probe I have next to me:
As you can see, the interesting part happens after the open. The IOCTL is different. The stat makes sense, followed by some ioctl (which I presume is related to the C library), followed by the write (fprintf), followed by a close and a rename. This is uClibc btw. In the case where it goes wrong, we're using musl, not uClibc. I've looked at the __fdopen function in src/stdio/__fdopen.c, which is one of the few functions that calls TIOCGWINSZ. It is also called from within fopen( ) in src/stdio/fopen.c, so that makes sense. Nothing much happens after the ioctl, so I think we can conclude fopen( ) returns. I do not see a call to sys_write( ) or syscall( ), which I expect to be called from the fprintf( ) function, so my guess is that something goes wrong there. That would be around here, or an equivalent line in the musl libc src/stdio/__stdio_write.c. So far I haven't seen any reason why this is the case, though. |
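The sequence described above (stat, open, write via fprintf, close, rename) can be sketched as follows. This is not the condmv source: `append_and_move` is a hypothetical name, the record here is just the -A argument on its own line (the real record format may contain more), and treating an existing destination as an error is an assumption.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical sketch of "append a record, then move" (not condmv itself).
 * Returns 0 on success and -1 on error, with errno set.
 */
int append_and_move(const char *from, const char *to, const char *record)
{
    struct stat sb;
    FILE *fp;

    /* Assumption: the move only happens when the destination is absent. */
    if (stat(to, &sb) == 0) {
        errno = EEXIST;
        return -1;
    }

    /* Append mode keeps the data already written to the source file. */
    fp = fopen(from, "a");
    if (fp == NULL)
        return -1;

    /* This is the fprintf step after which the failing trace shows no write. */
    if (fprintf(fp, "%s\n", record) < 0) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    return rename(from, to);
}
```

On the earlier question about the descriptor: fopen returning descriptor 3 is expected in a process that only has stdin, stdout and stderr (0, 1 and 2) open.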
@michel-stam it seems this target is very strict about long int versus long long int casts... the real culprit is this... and it makes condmv crash... but checking other files, we have wrong casts for time_t all over the code. Now I'm testing this fix together with the /home/atlas link to see if there are any crashes... |
OK, yes, I can confirm that with the custom busybox, data is finally sent and nothing crashes anymore |
After replacing the original busybox with the one provided by @Ansuel it's working fine. |
On 32-bit systems time_t can be long long int (as on this musl-based target), while on 64-bit systems it is long int. This is problematic because the busybox code does not handle it correctly: some values are cast to long and some are not cast at all. Some architectures (this problem was found on an mt7621) may be strict about this and crash with a segmentation fault if a time_t is printed with %ld instead of the correct %lld. This is the cause of RIPE-NCC/ripe-atlas-software-probe#74. Use the correct type and cast every time_t to (long long) so that eperd and condmv don't crash anymore and the measurements work correctly. Signed-off-by: Christian Marangi <[email protected]>
@michel-stam I posted the PR that fixes this and improves a few other small things |
Hi Ansuel, I saw that the pr converts time to unsigned long long in some cases, and long long in others. Can you cast everything to unsigned long long? I will take the patch in when reworked and test it here on a couple of platforms as well. Regards, Michel |
@michel-stam sure! I hope it's not a problem to change %lu to %lld in some places... this is why I had the mixed time casts |
On 32-bit systems time_t can be long long int (as on this musl-based target), while on 64-bit systems it is long int. This is problematic because the busybox code does not handle it correctly: some values are cast to long and some are not cast at all. Some architectures (this problem was found on an mt7621) may be strict about this and crash with a segmentation fault if a time_t is printed with %ld instead of the correct %lld. This is the cause of RIPE-NCC/ripe-atlas-software-probe#74. Use the correct type and cast every time_t to (unsigned long long) so that eperd and condmv don't crash anymore and the measurements work correctly. Signed-off-by: Christian Marangi <[email protected]>
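The cast described in the commit message can be illustrated with a minimal sketch. `format_timestamp` is a hypothetical helper, not a function from the patch. The point is that a printf-style conversion must match the width of the argument actually passed: when time_t is wider than long, `%ld` makes printf read the wrong number of bytes, which is undefined behaviour and can also misalign every following argument (for example a `%s` pointer).

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not taken from the patch).
 * Writes t as a decimal number into buf and returns the snprintf result.
 */
int format_timestamp(char *buf, size_t len, time_t t)
{
    /*
     * Broken where time_t is wider than long:
     *     snprintf(buf, len, "%ld", t);
     * Casting to a fixed, known type and using the matching conversion
     * behaves the same whether time_t is 32 or 64 bits wide.
     */
    return snprintf(buf, len, "%llu", (unsigned long long)t);
}
```

The unsigned cast follows the reworked commit. It would print a very large number for a negative time_t, which should not matter for timestamps taken from the current clock.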
@michel-stam OK, I should now have replaced the mixed casts everywhere with unsigned long long and %llu |
Nice work @Ansuel, I'll pull the reworked commit into our git repo and start testing on our platforms. Thanks! Regards, Michel |
@michel-stam btw, are these spikes normal? |
Please note the graphs show the network traffic for the host, not only the part sent/received by the probe software, so those peaks likely belong to your (or your OS's) activity. A typical probe has some single-digit kbit/s traffic |
@robert-kisteleki the txrx report is enabled by default just as extra info. If the entire network traffic is reported, then it makes sense to have these spikes. |
I'm running a probe on OpenWrt on an mt7621 and I notice lots of these errors. I wonder if you are interested in these crashes.
Feel free to ask any questions or for help with debugging this.