
AMDGPU support #135

Merged
merged 21 commits into Syllo:master on Apr 4, 2022
Conversation

zhuyifei1999
Contributor

@zhuyifei1999 zhuyifei1999 commented Mar 21, 2022

A picture is worth a thousand words:
Screenshot_2022-03-21_10-53-19 c

This PR isn't fully ready for merge yet. I think I'd like some feedback and do some polishing first (hence the RFC). My AMDGPU is also an iGPU / APU which limits how much I can test myself.

Most of the access goes through libdrm and kernel APIs. <drm/amdgpu_drm.h> is a kernel UAPI header, so I include it directly. For libdrm I'm using dlopen and re-declaring the constants to avoid a compile-time dependency, just like the original code does with NVML.
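The dlopen approach described above can be sketched roughly like this (a minimal illustration, not the PR's actual code; the loader function name and error handling are my own, though drmGetDevices2 is a real libdrm entry point):

```c
// Sketch: declare the libdrm entry point locally, then resolve it at
// runtime with dlopen/dlsym so there is no link-time dependency on
// libdrm. If the library is absent, the feature simply stays disabled.
#include <dlfcn.h>
#include <stdint.h>
#include <stddef.h>

// Locally re-declared prototype matching libdrm's drmGetDevices2.
typedef int (*drmGetDevices2_t)(uint32_t flags, void *devices, int max_devices);

static drmGetDevices2_t pfn_drmGetDevices2;

static int load_libdrm(void) {
  void *handle = dlopen("libdrm.so.2", RTLD_LAZY);
  if (!handle)
    return -1; // libdrm not installed: report failure, caller disables support
  pfn_drmGetDevices2 = (drmGetDevices2_t)dlsym(handle, "drmGetDevices2");
  return pfn_drmGetDevices2 ? 0 : -1;
}
```

The same pattern repeats for each libdrm symbol the code needs, which is why nvtop can build and run on systems without the library present.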

I also did some code refactoring to make my life easier, hope that is okay. I really cannot deal with typedef structs

CC #106; the i915 kernel changes don't seem to have been merged into mainline yet.

Some of them are pretty useless. If we really want, we can fix them
later, but -Wall already contains most of the useful warnings.
There's no reason to cast from void * to another pointer type in C.
We do that all the time without a cast with malloc.
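The point in the commit message above can be shown in two lines (an illustrative function, not from the PR): in C, unlike C++, void * converts implicitly to any object pointer type.

```c
// malloc returns void *, which converts implicitly to int * in C,
// so the common (int *) cast is pure noise.
#include <stdlib.h>

static int *make_counter(void) {
  int *p = malloc(sizeof *p); // no (int *) cast needed
  if (p)
    *p = 0;
  return p;
}
```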
Using typedef struct aliases makes it more difficult to understand
what exactly is being abstracted away [1]; I need this change to
understand the code.

Continue to use nvmlDevice_t to stay consistent with the nvml headers.

[1] https://www.kernel.org/doc/html/latest/process/coding-style.html#typedefs
Since we are adding more vendors, having some sort of generic structure
simplifies the code a lot, compared to using unions.
So we don't need to hardcode supported vendors in the generic
extract_gpuinfo code.
* Use shifts for the mask and make it a ssize_t (so it gets an
  arithmetic shift instead of a logical one)
* Remove hardcoded nvidia names from variables outside of the nvidia
  file.
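The signed-vs-unsigned shift distinction behind the first bullet can be demonstrated directly (an illustrative snippet, not the PR's code; note that arithmetic right shift of negative values is what gcc and clang do, though strictly it is implementation-defined in C):

```c
// Right-shifting a negative signed value is an arithmetic shift on
// mainstream compilers (the sign bit is replicated), so an all-ones
// ssize_t mask stays all-ones. An unsigned value gets a logical,
// zero-filling shift instead.
#include <stdint.h>
#include <sys/types.h>

static ssize_t shift_signed(ssize_t mask, int n) { return mask >> n; }
static size_t shift_unsigned(size_t mask, int n) { return mask >> n; }
```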
Mostly this is just interacting with DRM kernel code and libdrm.
This is via the fdinfo interface. The enumeration of all the visible
fds on a system is slightly expensive, but I'm not sure how to avoid it.

The logic is partly adapted from unmerged intel_gpu_top code [1],
adapted for AMDGPU. The i915 kernel changes have not been merged into
mainline yet, AFAICT.

[1] https://patchwork.freedesktop.org/series/100571/
@Syllo
Owner

Syllo commented Mar 22, 2022

Amazing!
I will try to take a look at it this weekend.

@Syllo Syllo self-assigned this Mar 22, 2022
@Syllo
Owner

Syllo commented Mar 26, 2022

I successfully tested on a system I have access to with an AMD Radeon RX 6800 XT.
I also fixed some issues I encountered while browsing the code.

For the fan speed, I saw that you added a comment. I did not find a way to get it through AMDGPU. lm_sensors seems to get the correct info though, so I will investigate where to retrieve it.

Edit: Never mind I found that for the fan https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
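Per the amdgpu documentation linked above, the fan is exposed through the standard hwmon sysfs interface: fan1_input holds the current RPM as plain text. A hedged sketch of reading it (function name and error handling are illustrative; the real code first has to resolve the hwmonX directory under the device):

```c
// Read the current fan RPM from an amdgpu hwmon directory, e.g.
// /sys/class/drm/card0/device/hwmon/hwmon1. Returns 0 on success,
// -1 if no fan is exposed (APUs typically do not expose one).
#include <stdio.h>

static int read_fan_rpm(const char *hwmon_path, unsigned *rpm) {
  char path[512];
  snprintf(path, sizeof path, "%s/fan1_input", hwmon_path);
  FILE *f = fopen(path, "r");
  if (!f)
    return -1; // attribute absent: treat fan speed as unavailable
  int ok = fscanf(f, "%u", rpm) == 1;
  fclose(f);
  return ok ? 0 : -1;
}
```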

@zhuyifei1999
Contributor Author

Awesome!

Yeah, I didn't bother checking for close(-1) since it's practically a no-op, similar to free(NULL); though unlike free it has syscall overhead.

The reason I didn't do fan speed was that it seemed slightly convoluted to retrieve. Rather than a libdrm API call, I have to go through sysfs; doable, but the code can get messy and I can't test it.

@zhuyifei1999 zhuyifei1999 changed the title [RFC] AMDGPU support AMDGPU support Mar 27, 2022
    close(allocated->hwmonFD);
  if (allocated->sysfsFD >= 0)
    close(allocated->sysfsFD);
  fclose(allocated->fanSpeedFILE);
Contributor Author

@zhuyifei1999 zhuyifei1999 Mar 27, 2022


Hmm, I'm actually relying on program exit to free all the resources; it doesn't actually go through the device list to free each device's handles. This allocate_list is more for freeing the array of devices itself.

How I made it work is by pre-allocating an array of devices and, when we successfully get the handles for one device, adding that device to the exported linked list. The whole array is added to the "allocations" linked list just once.
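The scheme described above can be sketched like this (names and types are illustrative, not the PR's actual ones): one allocation for the whole device array, registered exactly once on a file-static allocations list, so teardown frees the array rather than each device.

```c
// One calloc covers every device slot; the allocations list tracks the
// array as a single entry, not one entry per device.
#include <stdlib.h>

struct device { int fd; };
struct allocation { void *ptr; struct allocation *next; };

static struct allocation *allocations;

static struct device *allocate_device_array(size_t count) {
  struct device *devs = calloc(count, sizeof *devs);
  if (!devs)
    return NULL;
  struct allocation *a = malloc(sizeof *a);
  if (!a) { free(devs); return NULL; }
  a->ptr = devs;
  a->next = allocations; // the whole array joins the list exactly once
  allocations = a;
  return devs;
}
```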

Should we add this free-each-device's handles?

Owner


I see. Why are you using a list if there is only one element in the "allocations" list, which is static to this file?

I guess that relying on the program termination to close the file descriptors is fine in that case.

Contributor Author


Yeah, at first I wasn't sure how many times get_device_handles gets called :)

// There should be one directory inside hwmon, with a name having the following pattern hwmon[0-9]+
if (dirEntry->d_type == DT_DIR) {
  size_t matchLen = 0;
  for (matchLen = 0; matchLen < sizeof(hwmon) - 1; ++matchLen) {
Contributor Author


Could this be strncmp(dirEntry->d_name, "hwmon", sizeof("hwmon") - 1)?
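For the record, the suggested one-liner is equivalent to the manual prefix-compare loop in the quoted diff (a small self-contained illustration, with a helper name of my own):

```c
// sizeof("hwmon") - 1 is the prefix length (5), excluding the
// terminating NUL, so this matches hwmon0, hwmon1, hwmon12, etc.
#include <string.h>

static int has_hwmon_prefix(const char *name) {
  return strncmp(name, "hwmon", sizeof("hwmon") - 1) == 0;
}
```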

Owner


Correct, I'll fix that.

if (dirEntry) {
  gpu_info->hwmonFD = openat(dirfd(hwmonDir), dirEntry->d_name, O_RDONLY);
} else
  gpu_info->hwmonFD = -1;
Contributor Author


How about, instead of adding these else branches, pre-initializing gpu_info->hwmonFD = -1; and only assigning it if things go well? Could also avoid nested ifs by returning early.
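The suggested pattern looks roughly like this (a sketch using the names from the quoted diff; the surrounding struct and function are simplified stand-ins):

```c
// Assume failure up front, bail out early on each error, and only
// overwrite hwmonFD on the happy path. No nested else branches needed.
#include <dirent.h>
#include <fcntl.h>
#include <stddef.h>

struct gpu_info { int hwmonFD; };

static void open_hwmon(struct gpu_info *gpu_info, DIR *hwmonDir,
                       struct dirent *dirEntry) {
  gpu_info->hwmonFD = -1; // pre-initialize: "not found"
  if (!hwmonDir || !dirEntry)
    return;               // early return replaces the else branch
  gpu_info->hwmonFD = openat(dirfd(hwmonDir), dirEntry->d_name, O_RDONLY);
}
```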

Owner


You are right.
I do the convoluted if/else spaghetti as a first pass when the code is still changing many times.
I will clean this up.

@Syllo
Owner

Syllo commented Mar 27, 2022

Other than that, does the patch work on your APU?
I am wondering because, in that case, the fan is shared with the CPU and might not be registered with hwmon as such.

@zhuyifei1999
Contributor Author

The APU does not expose fan information on the GPU's hwmon, AFAICT.

zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ lspci | grep VGA
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne (rev d1)
zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ ls
ari_enabled                devspec           hwmon          max_link_speed           msi_irqs                           pp_dpm_sclk            rescan         smartshift_apu_power
backlight                  dma_mask_bits     i2c-0          max_link_width           numa_node                          pp_dpm_socclk          reset          smartshift_dgpu_power
boot_vga                   driver            i2c-1          mem_info_gtt_total       pcie_replay_count                  pp_force_state         reset_method   subsystem
broken_parity_status       driver_override   i2c-12         mem_info_gtt_used        power                              pp_mclk_od             resource       subsystem_device
class                      drm               i2c-2          mem_info_preempt_used    power_dpm_force_performance_level  pp_num_states          resource0      subsystem_vendor
config                     enable            i2c-3          mem_info_vis_vram_total  power_dpm_state                    pp_od_clk_voltage      resource0_wc   thermal_throttling_logging
consistent_dma_mask_bits   firmware_node     iommu          mem_info_vis_vram_used   power_state                        pp_power_profile_mode  resource2      uevent
consumer:pci:0000:08:00.1  fw_version        iommu_group    mem_info_vram_total      pp_cur_state                       pp_sclk_od             resource2_wc   vbios_version
current_link_speed         gpu_busy_percent  irq            mem_info_vram_used       pp_dpm_dcefclk                     pp_table               resource4      vendor
current_link_width         gpu_metrics       link           mem_info_vram_vendor     pp_dpm_fclk                        product_name           resource5
d3cold_allowed             graphics          local_cpulist  modalias                 pp_dpm_mclk                        product_number         revision
device                     hdcp_srm          local_cpus     msi_bus                  pp_dpm_pcie                        remove                 serial_number
zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ ls hwmon/
hwmon1
zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ ls hwmon/hwmon1/
device  freq1_input  freq1_label  in0_input  in0_label  in1_input  in1_label  name  power  power1_average  power1_label  subsystem  temp1_input  temp1_label  uevent

@Syllo
Owner

Syllo commented Apr 2, 2022

I think that we are almost good.
I will update the README and check for the amdgpu_drm.h to conditionally compile the support for AMDGPU.

@zhuyifei1999
Contributor Author

zhuyifei1999 commented Apr 2, 2022

Nice!

I'm including amdgpu_drm.h directly because it's a kernel UAPI header, and I think any sane compiler toolchain on Linux ships the kernel UAPI headers. Though it's up to you; you could copy the declarations too, since the kernel-userspace ABI has guaranteed stability.

@Syllo
Owner

Syllo commented Apr 3, 2022

The problem with NVIDIA was that the nvml header was not packaged on all the distributions, and downloading a recent header could break the build. Also I was not certain I could redistribute a version of the nvidia header. That is why I included the definitions directly.

Things are different with libdrm. The headers are consistently available on any distribution and updated with the kernel. Much cleaner than NVIDIA.

Hence, relying on libdrm for AMDGPU support is fine in my opinion!

If you are fine with the latest changes I will merge it into master.

@zhuyifei1999
Contributor Author

The problem with NVIDIA was that the nvml header was not packaged on all the distributions, and downloading a recent header could break the build. Also I was not certain I could redistribute a version of the nvidia header. That is why I included the definitions directly.

I see.

Neat Videocard TOP

Nice naming lol

If you are fine with the latest changes I will merge it into master.

No objections. Go ahead. Thanks!

@Syllo Syllo merged commit 346021b into Syllo:master Apr 4, 2022