Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault caused by Asyncio concurrent access to properties #210

Open
vindict-gh opened this issue Oct 24, 2022 · 4 comments
Open

Comments

@vindict-gh
Copy link

vindict-gh commented Oct 24, 2022

OS : Debian Bullseye (11)
Kernel : 5.10.0-15-amd64
ZFS Version : 2.1.4-1~bpo11+1
Impacted version : release/22.02.4

While using py-libzfs with Asyncio, accessing properties of several datasets concurrently causes a segfault in accessing getmntany to retrieve the 'mountpoint' property.

We were able to reproduce easily with this simple script.

from concurrent.futures import ThreadPoolExecutor
import time

import libzfs

dataset_name_fmt = 'pool/tank/my-dataset'
num = 200


def async_test(zfs_hdl, i):
   ds_name = f"{dataset_name_fmt}-{i}"
   ds = zfs_hdl.get_dataset(ds_name)
   return ds.properties['used'].rawvalue

def main():
   with ThreadPoolExecutor() as executor:
       res = []
       zfs_hdl = libzfs.ZFS()
       for i in range(num):
           res.append(executor.submit(async_test, zfs_hdl, i))

       for r in res:
           print(r.result())

if __name__ == '__main__':
   main()

Below, the gdb --args python3 test_script.py output :

Thread 13 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdcff9700 (LWP 634278)]
0x00007ffff7cd6b94 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffff7cd6b94 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7cb4a1c in _IO_getline_info () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff7cbe675 in fgets_unlocked () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ffff7d35736 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ffff7d35fd4 in getmntent_r () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007ffff74a2ce4 in _sol_getmntent () from /lib/x86_64-linux-gnu/libzfs_core.so.3
#6  0x00007ffff74a2d73 in getmntany () from /lib/x86_64-linux-gnu/libzfs_core.so.3
#7  0x00007ffff74c60d9 in libzfs_mnttab_find () from /lib/x86_64-linux-gnu/libzfs.so.4
#8  0x00007ffff74c6a18 in ?? () from /lib/x86_64-linux-gnu/libzfs.so.4
#9  0x00007ffff74c882e in zfs_prop_get () from /lib/x86_64-linux-gnu/libzfs.so.4
#10 0x00007ffff757d6cf in __pyx_pf_6libzfs_11ZFSProperty_8refresh (__pyx_v_self=0x7fffc406e580) at libzfs.c:39479
#11 __pyx_pw_6libzfs_11ZFSProperty_9refresh (__pyx_v_self=<libzfs.ZFSProperty at remote 0x7fffc406e580>, unused=<optimized out>) at libzfs.c:39405
#12 0x00007ffff7594af6 in __Pyx_PyObject_CallMethO (arg=0x0, func=<built-in method refresh of libzfs.ZFSProperty object at remote 0x7fffc406e580>)
    at libzfs.c:108486
#13 __Pyx_PyObject_CallNoArg (func=<built-in method refresh of libzfs.ZFSProperty object at remote 0x7fffc406e580>) at libzfs.c:42976
#14 0x00007ffff75adf55 in __pyx_pf_6libzfs_15ZFSPropertyDict_2refresh (__pyx_v_self=0x7fffdc63fd10) at libzfs.c:64773
#15 __pyx_pw_6libzfs_15ZFSPropertyDict_3refresh (__pyx_v_self={}, unused=<optimized out>) at libzfs.c:64533
#16 0x00007ffff7594af6 in __Pyx_PyObject_CallMethO (arg=0x0, func=<built-in method refresh of libzfs.ZFSPropertyDict object at remote 0x7fffdc63fd10>)
    at libzfs.c:108486
#17 __Pyx_PyObject_CallNoArg (func=<built-in method refresh of libzfs.ZFSPropertyDict object at remote 0x7fffdc63fd10>) at libzfs.c:42976
#18 0x00007ffff7594de8 in __pyx_pf_6libzfs_9ZFSObject_10properties___get__ (__pyx_v_self=<optimized out>) at libzfs.c:67695
#19 __pyx_pw_6libzfs_9ZFSObject_10properties_1__get__ (__pyx_v_self=<optimized out>) at libzfs.c:2096
#20 __pyx_getprop_6libzfs_9ZFSObject_properties (o=<optimized out>, x=<optimized out>) at libzfs.c:30081
#21 0x0000000000529f3e in getset_get (descr=0x7ffff768d640, obj=<libzfs.ZFSDataset at remote 0x7fffdc641200>, type=<optimized out>)
    at ../Objects/descrobject.c:185
#22 0x0000000000526f68 in _PyObject_GenericGetAttrWithDict (obj=<optimized out>, name='properties', dict=<optimized out>, suppress=0)
    at ../Objects/object.c:1201
#23 0x0000000000511ed1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:2994
#24 0x0000000000528b63 in _PyEval_EvalFrame (throwflag=0,
    f=Frame 0x7ffff6f18200, for file /root/test_segv_lib.py, line 16, in test_coucou (zfs_hdl=<libzfs.ZFS at remote 0x7ffff6f61990>, i=56, v='test-constantin-56', ds=<libzfs.ZFSDataset at remote 0x7fffdc641200>), tstate=0xa8a5c0) at ../Include/internal/pycore_ceval.h:40
#25 function_code_fastcall (globals=<optimized out>, nargs=2, args=<optimized out>, co=<optimized out>, tstate=0xa8a5c0) at ../Objects/call.c:330
#26 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:367
#27 0x0000000000513e8b in do_call_core (kwdict={}, callargs=(<libzfs.ZFS at remote 0x7ffff6f61990>, 56), func=<function at remote 0x7ffff7873040>,
    tstate=<optimized out>) at ../Python/ceval.c:5120

I was able to prevent the segfault by reactivating the GIL during libzfs.zfs_prop_get() call in ZFSProperty.refresh() and in ZFS.__dataset_handles()

I do believe that the nogil behaviour was intended. Could you please explain why the nogil is important for you here ? If you want I already have a patch and can provide a PR to re-acquire the GIL upon libzfs.zfs_prop_get() calls.

If any information is missing I remain at your disposal.

@ghost
Copy link

ghost commented Oct 26, 2022

In general we don't guarantee libzfs is even thread-safe, let alone reentrant. Since you have 200 tasks on an executor that probably has fewer than 200 threads, there are multiple tasks trying to use the same thread-local buffer buf when calling getmntent_r() in _sol_getmntent().

nogil is important because we're calling into a C function that does a whole bunch of syscalls and IO. That would block the entire Python process, which may have plenty of other things to do in the meantime. Removing nogil avoids the reentrance in this test case, but at too great a cost.

A potential solution I think would be to use an asyncio.Lock around async use of py-libzfs. If I understand asyncio.Lock correctly this will block tasks on the same thread but not on different threads, preventing simultaneous use of the thread-local static buffer. At least that's how I'm interpreting "not thread-safe" in the documentation. Or maybe some other mechanism is a better fit.

@vindict-gh
Copy link
Author

I'll be taking your feedback into account to find the least-cost path while protecting py-libzfs for async.

If this is something that interest you, I would be happy to share with you my findings and contribute them back to the upstream. If you believe you don't want to support asyncio on libzfs (which is fine, no argument here whatsoever), we will propose a fork. In either case your feedback would be invaluable.

Depending on the course of action (official support or not) feel free to close this issue (or not).

If you think of anyone who could help by reviewing this, this would come a long way.

In the meantime, I thank you for your time.

@yocalebo
Copy link
Contributor

I'll be taking your feedback into account to find the least-cost path while protecting py-libzfs for async.

If this is something that interest you, I would be happy to share with you my findings and contribute them back to the upstream. If you believe you don't want to support asyncio on libzfs (which is fine, no argument here whatsoever), we will propose a fork. In either case your feedback would be invaluable.

Depending on the course of action (official support or not) feel free to close this issue (or not).

If you think of anyone who could help by reviewing this, this would come a long way.

In the meantime, I thank you for your time.

Hi @vindict-gh . We are definitely open for you to contribute the required changes to make this safe to be used with asyncio. We would, of course, have to review it closely.

@yocalebo
Copy link
Contributor

Just keep in mind there are 2 "gotchas" to look out for that Ryan commented above.

1.Reentrancy
2. Thread safety

Some of the libzfs methods do not use reentrant safe syscalls and so that will have to be considered when making any changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants