cxl list, no matching device found #246
Comments
The issue occurs because no CXL device is found in sysfs. Common causes of this could be:
Can you provide the following info, please:
|
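A quick way to confirm the "no CXL device in sysfs" diagnosis is to look at the CXL bus directory directly. The sketch below is illustrative only and not part of cxl-cli: it assumes the usual `/sys/bus/cxl/devices` location, and the helper name `list_cxl_devices` is made up for this example.

```python
from pathlib import Path


def list_cxl_devices(sysfs_root="/sys"):
    """Return sorted entry names under <sysfs_root>/bus/cxl/devices, or [] if absent."""
    devices_dir = Path(sysfs_root) / "bus" / "cxl" / "devices"
    if not devices_dir.is_dir():
        # Either the CXL core is not loaded or nothing was enumerated.
        return []
    return sorted(entry.name for entry in devices_dir.iterdir())


if __name__ == "__main__":
    names = list_cxl_devices()
    # Memory devices normally show up as mem0, mem1, ...
    memdevs = [name for name in names if name.startswith("mem")]
    print("CXL bus entries:", names if names else "none")
    print("memdevs:", memdevs if memdevs else "none")
```

If the list of memdevs is empty, the kernel has not registered a CXL memory device, which would be consistent with `cxl list` finding nothing even though `lspci` shows the card.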
Is it enough? Since the device is an FPGA, I see some traffic around every second. The BIOS has the |
@alexisfrjp Thanks for the info. You have everything needed from the Kernel and cxl-cli perspective to work with CXL Type 3 devices. However, officially, Intel doesn't support Type 3 in the Sapphire Rapids time frame; only Type 2 is supported. Unofficially, if the BIOS vendors kept the 'Type 3 Legacy Mode' BIOS feature, CXL devices do work on the platforms I'm using (Supermicro with Xeon Gold & Platinum CPUs).
Q) Do you see any DeviceDAX entries under
Q) Does
If YES to the above, then you need to convert the 'devdax' namespace to a 'system-ram' type and it'll show a new cpu-less/memory-only NUMA node. Use:
You can read my blog entry on how this should work - How To Extend Volatile System Memory (RAM) using Persistent Memory on Linux. Ignore the early part of the blog as it's outdated. Also, ignore the references to PMem. CXL Type 3 works the same as PMem in this use case.
Q) Do you see any entries in
Q) Do you see any other
If the above actions don't yield much info, the next steps are
|
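To check whether a cpu-less/memory-only NUMA node has appeared after a conversion to system-ram, the node directories in sysfs can be inspected. This is a minimal sketch, assuming the standard `/sys/devices/system/node/nodeN/cpulist` layout; the function name `memory_only_nodes` is hypothetical and the check (an empty `cpulist`) is a heuristic, not what `numactl` itself does.

```python
from pathlib import Path


def memory_only_nodes(node_root="/sys/devices/system/node"):
    """Return names of NUMA nodes whose cpulist is empty (no CPUs attached)."""
    root = Path(node_root)
    if not root.is_dir():
        return []
    node_dirs = sorted(root.glob("node[0-9]*"), key=lambda p: int(p.name[4:]))
    result = []
    for node_dir in node_dirs:
        cpulist = node_dir / "cpulist"
        # A cpu-less node exposes a cpulist file that contains only a newline.
        if cpulist.is_file() and cpulist.read_text().strip() == "":
            result.append(node_dir.name)
    return result


if __name__ == "__main__":
    nodes = memory_only_nodes()
    print("memory-only NUMA nodes:", nodes if nodes else "none")
```

A node reported here should match the node that `numactl -H` lists with an empty `cpus:` line.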
Thanks for your reply @sscargal !
But
Exactly. It's been confirmed by Intel, my workstation CPU supports CXL, and I got an unofficial BIOS from Asus enabling the "Type3 legacy" BIOS feature. I also have the project for the Type2, the result is the same.
There is no /dev/dax* files.
Nothing...
Nothing...
I have the source-code of the project, it was validated and all the BIOS options are enabled.
The special BIOS version has been provided by Intel (provided by Asus).
It has been confirmed it does. The whole CXL workflow is very unclear. It's a mix of everything and nothing at the same time. It's only a use case, not what CXL is. My understanding is that CXL, like PCIe, just provides a memory bus to any memory: it can be DRAM, flash, Persistent Memory, or even just registers (seen as flip-flops from the FPGA). I just have a CXL.Mem-enabled device and I'd like to map the memory space so that I can use it from user space. Why is it so complicated... All the tools give different results,
Thanks for your help! Edit: Let me know if I'm in the wrong project for my use-case. |
Which devices |
CXL.mem |
If we look back at the
You're not getting the /dev/cxl/mem and corresponding /dev/dax0.0 device. This happens when the CXL device isn't exposing itself as Special Purpose (SP) to UEFI (EFI_MEMORY_SP), or the BIOS is ignoring it. Special Purpose memory devices will have a 'soft reserved' e820 entry. The Kernel will then provision the CXL.mem as a memdev with a corresponding devdax device that applications can use directly, or it can be converted to a
For non-Special Purpose memory devices, the BIOS/UEFI maps the memory as 'usable,' and the Kernel treats the CXL memory as main memory (DRAM), meaning it's put into the ZONE_NORMAL zone and used as main memory with no user control (cxl-cli/daxctl). For SP memory, you should see 'ZONE_MOVABLE' for your CXL memory. You can look at the memory zones using
The next obvious step is to look at the ACPI tables to see what you've configured in the FPGA. I would also look at resolving this kernel warning
The CXL 2.0 spec defines several DVSEC bits as mandatory. Could you check which ones you've set and are missing that will fix that warning? If the kernel driver isn't initialized, this would explain some of the Kernel behavior. |
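One way to see whether the firmware marked the CXL range as Special Purpose is to look for "Soft Reserved" entries in `/proc/iomem`. The sketch below only parses text in that file's `start-end : label` format; the helper name `soft_reserved_ranges` and the sample addresses in the usage note are made up for illustration.

```python
def soft_reserved_ranges(iomem_text):
    """Return (start, end) address pairs for every 'Soft Reserved' line."""
    ranges = []
    for line in iomem_text.splitlines():
        # Lines look like "880000000-a7fffffff : Soft Reserved", possibly indented.
        span, sep, label = line.partition(" : ")
        if not sep or label.strip() != "Soft Reserved":
            continue
        start, _, end = span.strip().partition("-")
        ranges.append((int(start, 16), int(end, 16)))
    return ranges


def range_size_gib(start, end):
    """Size of an inclusive address range in GiB."""
    return (end - start + 1) / 2**30
```

As a usage example, feeding it the contents of `/proc/iomem` (read as root, since unprivileged reads show zeroed addresses) and finding no ranges would be consistent with the memory having been mapped as ordinary 'usable' RAM. A hypothetical line such as `880000000-a7fffffff : Soft Reserved` would be reported as an 8 GiB range.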
Exactly, I'm mapping 8GB from the FPGA via CXL.Mem.
Nevertheless, the user-guide I'm following for the CXL example design mentions the use of
Moreover, thank you for the links but like lots of CXL-related documentations, there are lots of acronyms and they target BIOS/UEFI/Kernel experts, not engineers who just want to work with a CXL.Mem device.
It looks like it's system-ram only, and what I would like to have is SP, correct? I will try to compile the same kernel with
I will have a look at it.
Indeed, I will open a ticket with Intel. They advertised the CXL IP core as CXL 2.0. I now understand why I can see memory traffic in the FPGA even though nothing is running; it is actually the kernel using it as system-ram. I can even see it in top/htop. Thank you very much, it has been very fruitful and I have more directions to investigate. |
Thanks for the update.
Correct. The FPGA is working correctly in the non-SP mode, which is fine for some use cases, but you have no control or management of the CXL device. As you observed, the Kernel is free to randomly allocate memory from DRAM or CXL, which is not what most people want (IMHO). As an aside, what you have now is the default behavior on AMD Genoa servers, and there's a BIOS option to switch CXL to be SP memory - like Intel. Intel's default behavior is to treat CXL as special purpose memory and manage it with the
I recommend removing the
Lastly, there are some notes on CXL Kernel development here that you will find useful in your CXL journey. |
Thank you Steve! After removing the Regarding the dmesg warning :
https://github.com/torvalds/linux/blob/929ed21dfdb6ee94391db51c9eedb63314ef6847/drivers/cxl/pci.c#L681 Is
Thank you for this link, it'll definitely be useful! And thanks for your help; it's now much clearer and I will keep reading resources and stay tuned! |
Great progress! Thanks for the update. Hopefully, this command will allow you to convert the devdax device into a system-ram and see the new NUMA node that applications can use. It'll look like your first
Note: This is not persistent across reboots, so you'll need to write an With the NUMA node established, you can use
I don't see the error on real CXL Type 3 devices, including some that are FPGA implementations, so it's up to you whether to continue resolving this, as it will require you to implement the missing data within the FPGA. Looking at the code, this shouldn't impact what you need to move forward. Your FPGA stream is not 100% feature complete with respect to the DVSEC, and likely elsewhere too, which is what I'd expect. The FPGA route takes you down the 'CXL Device Vendor' path, where some input/development is required from your side to fill in the gaps. As you said, the error originates from trying to access the DVSEC features. From https://github.com/torvalds/linux/blob/929ed21dfdb6ee94391db51c9eedb63314ef6847/drivers/pci/pci.c#L753, the comment says why:
Your device implements the Vendor ID (1e98) and some DVSEC info, but the problem is accessing the Extended DVSEC info:
Which is looking for fields described in the comment:
You can see some of this is missing in the
This is an exercise for you to implement.
The Kernel drivers are still under development relative to the CXL Specifications, mainly CXL 3.0 at this point, but they have matured enough for CXL 1.1 and 2.0 to be production ready for real CXL devices (including FPGA). The drivers are further along the CXL roadmap than the CPUs. The drivers implement the CXL specifications so they are vendor neutral and should work with all CXL devices. CXL device vendors are free to write their own custom drivers to add additional features and functionality for their devices beyond what the CXL specs say. You have the necessary Kernel requirements (6.3 or newer), although if you encounter OS/Kernel problems, it's worth reaching out to the CXL Kernel community for help or to file bugs. |
Unfortunately, it's still loaded as system-ram.
I see the dax device file but it's already in
I tried your command:
Checking
and numactl:
Sorry, I still don't see any /dev/cxl/mem* entries. Removing
DVSEC issues
Of course I want to fix all the issues even if they aren't blocking. The more I solve, the more I learn. Moreover, my company is a member of the CXL consortium, so I have access to all the specs. The ultimate goal is to develop our own core.
Regarding the Here is the result:
I still think there is an issue in the kernel/driver code, since lspci is able to detect all the PCIe extended DVSEC capabilities. The DVSEC ID=0000 is clearly present. I'm definitely not a big fan of these CXL1.1 devices with 2.0 features... It's very confusing. Understood, thank you! It's perfect if it's production-ready for CXL1.1 and 2.0! |
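To compare what lspci reports with what a DVSEC lookup would find, the PCIe extended capability list can be walked by hand over a raw config-space dump. The sketch below follows the general PCIe layout (extended capabilities start at offset 0x100, DVSEC capability ID 0x23, vendor ID in the first DVSEC header, DVSEC ID in the second) and is similar in spirit to the kernel's DVSEC search, but it is not the kernel code. The function name `find_dvsecs` is hypothetical.

```python
import struct

PCI_EXT_CAP_START = 0x100      # extended capabilities begin at this offset
PCI_EXT_CAP_ID_DVSEC = 0x23    # Designated Vendor-Specific Extended Capability
CXL_VENDOR_ID = 0x1E98         # vendor ID carried by CXL DVSECs


def find_dvsecs(config):
    """Return (offset, vendor_id, dvsec_id) for each DVSEC in a config-space dump."""
    found = []
    visited = set()
    offset = PCI_EXT_CAP_START
    while PCI_EXT_CAP_START <= offset <= len(config) - 4 and offset not in visited:
        visited.add(offset)  # guard against malformed, looping capability lists
        (header,) = struct.unpack_from("<I", config, offset)
        if header in (0, 0xFFFFFFFF):
            break  # empty list or unreadable config space
        if (header & 0xFFFF) == PCI_EXT_CAP_ID_DVSEC and offset + 10 <= len(config):
            (header1,) = struct.unpack_from("<I", config, offset + 4)
            (dvsec_id,) = struct.unpack_from("<H", config, offset + 8)
            found.append((offset, header1 & 0xFFFF, dvsec_id))
        offset = (header >> 20) & 0xFFC  # bits 31:20 hold the next offset
    return found


def cxl_dvsec_ids(config):
    """Return the DVSEC IDs whose vendor ID is the CXL one (0x1e98)."""
    return [dvsec for _, vendor, dvsec in find_dvsecs(config) if vendor == CXL_VENDOR_ID]
```

The input would be the bytes of the device's `config` file under `/sys/bus/pci/devices/` (read as root so the full 4 KiB is returned). If DVSEC ID 0x0000 appears in the result while the driver still warns, that would point at the contents of the capability rather than its presence.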
I was able to force it:
It lost 2GB.
|
Great progress & updates! Regarding this message:
Fedora is one of those distros that choose to compile the Kernel to auto-online memory. This is a distro choice. You can change this behavior using the following. The Kernel config file should include:
To disable this feature:
This change doesn't persist across system reboots. See the Kernel Hot Plug documentation to learn more on how to enable/disable memory blocks and to change their zone. I love the changes to https://lore.kernel.org/linux-cxl/ is useful for viewing the Linux Kernel CXL developer mailing list. Feel free to join the list if you want to. It's highly active. This is the list to email for support, suggestions, or patch submissions. Regarding this action:
Since the Kernel auto-onlines memory, the
You lost 2GB because only 3 of the 4 memory blocks are ONLINE
The Kernel uses 2GiB memory block sizes. I suspect the cause of the loss of one block is likely an alignment problem caused by forcing the memory online of an already online devdax -> |
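The block arithmetic can be checked from sysfs: with 2GiB blocks, an 8GB device spans 4 blocks, and 3 online blocks give 6GiB. The following is a rough sketch assuming the standard `/sys/devices/system/memory` layout (`block_size_bytes` in hex, one `memoryN/state` file per block). The function name `memory_block_summary` is made up, and it counts every block in the system, not only the CXL ones.

```python
from pathlib import Path


def memory_block_summary(memory_root="/sys/devices/system/memory"):
    """Summarise hotplug memory blocks; returns None if the directory is unusable."""
    root = Path(memory_root)
    size_file = root / "block_size_bytes"
    if not size_file.is_file():
        return None
    # block_size_bytes is a hexadecimal string without a 0x prefix.
    block_size = int(size_file.read_text().strip(), 16)
    states = {}
    for block_dir in root.glob("memory[0-9]*"):
        state_file = block_dir / "state"
        if state_file.is_file():
            states[block_dir.name] = state_file.read_text().strip()
    online = sum(1 for state in states.values() if state == "online")
    return {
        "block_size": block_size,
        "blocks": len(states),
        "online": online,
        "offline_blocks": sorted(n for n, s in states.items() if s != "online"),
        "online_bytes": online * block_size,
    }
```

Any name listed in `offline_blocks` is a candidate for the missing 2GiB; its `state` and `valid_zones` files are the place to look next.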
Now I have a lot to read and to understand and I will come back to you later. (I also have to check exactly why the kernel doesn't find the PCIe extended capability of the CXL's DVSEC whereas lspci does.) |
@sscargal Hello, I have read the entire conversation. The issue I am facing is that I want to use CXL as universal RAM and make it an independent NUMA node. However, I am using a QEMU-emulated DRAM device, and for some reason, when I run 'numactl -H', it cannot find any zNUMA nodes. |
QEMU Version: 8.0.50 |
Your help would be greatly appreciated! |
@Yemaoxin Please open your own ticket, your problem is different than mine. |
I have a CXL Type3 device plugged into the server. I can see it with lspci but not with cxl-cli.
I'm very new to the CXL world and it seems a bit messy, all the searches I do are unclear.