
AMDGPU support #135

Merged
merged 21 commits into Syllo:master on Apr 4, 2022
Conversation

zhuyifei1999
Contributor

@zhuyifei1999 zhuyifei1999 commented Mar 21, 2022

A picture is worth a thousand words:
Screenshot_2022-03-21_10-53-19 c

This PR isn't fully ready for merge yet. I think I'd like some feedback and do some polishing first (hence the RFC). My AMDGPU is also an iGPU / APU which limits how much I can test myself.

Most of the access goes through libdrm and kernel APIs. <drm/amdgpu_drm.h> is a kernel UAPI header, so I include it directly. For libdrm I'm using dlopen and re-declaring the constants to avoid a compile-time dependency, just like the original code does with NVML.
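The dlopen approach described above can be sketched roughly like this (a minimal illustration, not the PR's actual code; the loader function name and error handling are my own, though drmGetDevices2 is a real libdrm entry point):

```c
// Sketch: declare the libdrm entry point locally, then resolve it at
// runtime with dlopen/dlsym so there is no link-time dependency on
// libdrm. If the library is absent, the feature simply stays disabled.
#include <dlfcn.h>
#include <stdint.h>
#include <stddef.h>

// Locally re-declared prototype matching libdrm's drmGetDevices2.
typedef int (*drmGetDevices2_t)(uint32_t flags, void *devices, int max_devices);

static drmGetDevices2_t pfn_drmGetDevices2;

static int load_libdrm(void) {
  void *handle = dlopen("libdrm.so.2", RTLD_LAZY);
  if (!handle)
    return -1; // libdrm not installed: report failure, caller disables support
  pfn_drmGetDevices2 = (drmGetDevices2_t)dlsym(handle, "drmGetDevices2");
  return pfn_drmGetDevices2 ? 0 : -1;
}
```

The same pattern repeats for each libdrm symbol the code needs, which is why nvtop can build and run on systems without the library present.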

I also did some code refactoring to make my life easier, hope that is okay. I really cannot deal with typedef structs

CC #106; the i915 kernel changes don't seem to have been merged into mainline yet.

Some of them are pretty useless. If we really want, we can fix them
later, but -Wall already contains most of the useful warnings.
There's no reason to cast from void * to another pointer type in C.
We do that all the time without a cast with malloc.
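The point in the commit message above can be shown in two lines (an illustrative function, not from the PR): in C, unlike C++, void * converts implicitly to any object pointer type.

```c
// malloc returns void *, which converts implicitly to int * in C,
// so the common (int *) cast is pure noise.
#include <stdlib.h>

static int *make_counter(void) {
  int *p = malloc(sizeof *p); // no (int *) cast needed
  if (p)
    *p = 0;
  return p;
}
```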
Using typedef struct aliases makes it more difficult to understand
what exactly is being abstracted away [1]; I need this change to
understand the code.

Continue to use nvmlDevice_t to stay consistent with the nvml headers.

[1] https://www.kernel.org/doc/html/latest/process/coding-style.html#typedefs
Since we are adding more vendors, having some sort of generic structure
simplifies the code a lot, compared to using unions.
So we don't need to hardcode supported vendors in the generic
extract_gpuinfo code.
* Use shifts for the mask and make it a ssize_t (so it gets an
  arithmetic shift instead of a logical one)
* Remove hardcoded nvidia names from variables outside of the nvidia
  file.
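The signed-vs-unsigned shift distinction behind the first bullet can be demonstrated directly (an illustrative snippet, not the PR's code; note that arithmetic right shift of negative values is what gcc and clang do, though strictly it is implementation-defined in C):

```c
// Right-shifting a negative signed value is an arithmetic shift on
// mainstream compilers (the sign bit is replicated), so an all-ones
// ssize_t mask stays all-ones. An unsigned value gets a logical,
// zero-filling shift instead.
#include <stdint.h>
#include <sys/types.h>

static ssize_t shift_signed(ssize_t mask, int n) { return mask >> n; }
static size_t shift_unsigned(size_t mask, int n) { return mask >> n; }
```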
Mostly this is just interacting with DRM kernel code and libdrm.
This is via the fdinfo interface. The enumeration of all the visible
fds on a system is slightly expensive, but I'm not sure how to avoid it.

The logic is partly adapted from unmerged intel_gpu_top code [1],
adapted for AMDGPU. The i915 kernel changes have not been merged into
mainline yet, AFAICT.

[1] https://patchwork.freedesktop.org/series/100571/
@Syllo
Owner

Syllo commented Mar 22, 2022

Amazing!
I will try to take a look at it this weekend.

@Syllo Syllo self-assigned this Mar 22, 2022
@Syllo
Owner

Syllo commented Mar 26, 2022

I successfully tested on a system I have access to with an AMD Radeon RX 6800 XT.
I also fixed some issues I encountered while browsing the code.

For the fan speed, I saw that you added a comment. I did not find a way to get it through AMDGPU. lm_sensors seems to get the correct info though, so I will investigate where to retrieve it.

Edit: Never mind I found that for the fan https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
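Per the amdgpu documentation linked above, the fan is exposed through the standard hwmon sysfs interface: fan1_input holds the current RPM as plain text. A hedged sketch of reading it (function name and error handling are illustrative; the real code first has to resolve the hwmonX directory under the device):

```c
// Read the current fan RPM from an amdgpu hwmon directory, e.g.
// /sys/class/drm/card0/device/hwmon/hwmon1. Returns 0 on success,
// -1 if no fan is exposed (APUs typically do not expose one).
#include <stdio.h>

static int read_fan_rpm(const char *hwmon_path, unsigned *rpm) {
  char path[512];
  snprintf(path, sizeof path, "%s/fan1_input", hwmon_path);
  FILE *f = fopen(path, "r");
  if (!f)
    return -1; // attribute absent: treat fan speed as unavailable
  int ok = fscanf(f, "%u", rpm) == 1;
  fclose(f);
  return ok ? 0 : -1;
}
```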

@zhuyifei1999
Contributor Author

Awesome!

Yeah, I didn't bother checking for close(-1) since it's practically a no-op, similar to free(NULL); though unlike free it has syscall overhead.

The reason I didn't do fan speed was that it seemed slightly convoluted to retrieve. Rather than a libdrm API call, I have to go through sysfs; doable, but the code can get messy and I can't test it.

@zhuyifei1999 zhuyifei1999 changed the title [RFC] AMDGPU support AMDGPU support Mar 27, 2022
    close(allocated->hwmonFD);
  if (allocated->sysfsFD >= 0)
    close(allocated->sysfsFD);
  fclose(allocated->fanSpeedFILE);
Contributor Author

@zhuyifei1999 zhuyifei1999 Mar 27, 2022


Hmm, I'm actually relying on program exit to free all the resources; it doesn't actually go through the device list to free each device's handles. This allocate_list is more for freeing the array of devices itself.

How I made it work is by pre-allocating an array of devices and, when we successfully get the handles for one device, adding that device to the exported linked list. The whole array is added to the "allocations" linked list just once.
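The scheme described above can be sketched like this (names and types are illustrative, not the PR's actual ones): one allocation for the whole device array, registered exactly once on a file-static allocations list, so teardown frees the array rather than each device.

```c
// One calloc covers every device slot; the allocations list tracks the
// array as a single entry, not one entry per device.
#include <stdlib.h>

struct device { int fd; };
struct allocation { void *ptr; struct allocation *next; };

static struct allocation *allocations;

static struct device *allocate_device_array(size_t count) {
  struct device *devs = calloc(count, sizeof *devs);
  if (!devs)
    return NULL;
  struct allocation *a = malloc(sizeof *a);
  if (!a) { free(devs); return NULL; }
  a->ptr = devs;
  a->next = allocations; // the whole array joins the list exactly once
  allocations = a;
  return devs;
}
```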

Should we add this free-each-device's handles?

Owner


I see. Why are you using a list if there is only one element in the "allocations" list, which is static to this file?

I guess that relying on the program termination to close the file descriptors is fine in that case.

Contributor Author


Yeah, at first I wasn't sure how many times get_device_handles gets called :)

// There should be one directory inside hwmon, with a name having the following pattern hwmon[0-9]+
if (dirEntry->d_type == DT_DIR) {
  size_t matchLen = 0;
  for (matchLen = 0; matchLen < sizeof(hwmon) - 1; ++matchLen) {
Contributor Author


Could this be strncmp(dirEntry->d_name, "hwmon", sizeof("hwmon") - 1)?
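For the record, the suggested one-liner is equivalent to the manual prefix-compare loop in the quoted diff (a small self-contained illustration, with a helper name of my own):

```c
// sizeof("hwmon") - 1 is the prefix length (5), excluding the
// terminating NUL, so this matches hwmon0, hwmon1, hwmon12, etc.
#include <string.h>

static int has_hwmon_prefix(const char *name) {
  return strncmp(name, "hwmon", sizeof("hwmon") - 1) == 0;
}
```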

Owner


Correct, I'll fix that.

if (dirEntry) {
  gpu_info->hwmonFD = openat(dirfd(hwmonDir), dirEntry->d_name, O_RDONLY);
} else
  gpu_info->hwmonFD = -1;
Contributor Author


How about, instead of adding these else branches, pre-initializing gpu_info->hwmonFD = -1; and only assigning it if things go well? Could also avoid nested ifs by returning early.
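The suggested pattern looks roughly like this (a sketch using the names from the quoted diff; the surrounding struct and function are simplified stand-ins):

```c
// Assume failure up front, bail out early on each error, and only
// overwrite hwmonFD on the happy path. No nested else branches needed.
#include <dirent.h>
#include <fcntl.h>
#include <stddef.h>

struct gpu_info { int hwmonFD; };

static void open_hwmon(struct gpu_info *gpu_info, DIR *hwmonDir,
                       struct dirent *dirEntry) {
  gpu_info->hwmonFD = -1; // pre-initialize: "not found"
  if (!hwmonDir || !dirEntry)
    return;               // early return replaces the else branch
  gpu_info->hwmonFD = openat(dirfd(hwmonDir), dirEntry->d_name, O_RDONLY);
}
```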

Owner


You are right.
I do the convoluted if/else spaghetti as a first pass when the code is still changing many times.
I will clean this up.

@Syllo
Owner

Syllo commented Mar 27, 2022

Other than that, does the patch work on your APU?
I am wondering because, in that case, the fan is shared with the CPU and might not be registered with hwmon as such.

@zhuyifei1999
Contributor Author

The APU does not expose fan information on the GPU's hwmon, AFAICT.

zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ lspci | grep VGA
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne (rev d1)
zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ ls
ari_enabled                devspec           hwmon          max_link_speed           msi_irqs                           pp_dpm_sclk            rescan         smartshift_apu_power
backlight                  dma_mask_bits     i2c-0          max_link_width           numa_node                          pp_dpm_socclk          reset          smartshift_dgpu_power
boot_vga                   driver            i2c-1          mem_info_gtt_total       pcie_replay_count                  pp_force_state         reset_method   subsystem
broken_parity_status       driver_override   i2c-12         mem_info_gtt_used        power                              pp_mclk_od             resource       subsystem_device
class                      drm               i2c-2          mem_info_preempt_used    power_dpm_force_performance_level  pp_num_states          resource0      subsystem_vendor
config                     enable            i2c-3          mem_info_vis_vram_total  power_dpm_state                    pp_od_clk_voltage      resource0_wc   thermal_throttling_logging
consistent_dma_mask_bits   firmware_node     iommu          mem_info_vis_vram_used   power_state                        pp_power_profile_mode  resource2      uevent
consumer:pci:0000:08:00.1  fw_version        iommu_group    mem_info_vram_total      pp_cur_state                       pp_sclk_od             resource2_wc   vbios_version
current_link_speed         gpu_busy_percent  irq            mem_info_vram_used       pp_dpm_dcefclk                     pp_table               resource4      vendor
current_link_width         gpu_metrics       link           mem_info_vram_vendor     pp_dpm_fclk                        product_name           resource5
d3cold_allowed             graphics          local_cpulist  modalias                 pp_dpm_mclk                        product_number         revision
device                     hdcp_srm          local_cpus     msi_bus                  pp_dpm_pcie                        remove                 serial_number
zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ ls hwmon/
hwmon1
zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ ls hwmon/hwmon1/
device  freq1_input  freq1_label  in0_input  in0_label  in1_input  in1_label  name  power  power1_average  power1_label  subsystem  temp1_input  temp1_label  uevent

@Syllo
Owner

Syllo commented Apr 2, 2022

I think that we are almost good.
I will update the README and check for the amdgpu_drm.h to conditionally compile the support for AMDGPU.

@zhuyifei1999
Contributor Author

zhuyifei1999 commented Apr 2, 2022

Nice!

I'm including amdgpu_drm.h directly because it's a kernel UAPI header, and I think any sane compiler toolchain on Linux ships the kernel UAPI headers. Though it's up to you; you could copy the declarations too, since the kernel-userspace ABI has guaranteed stability.

@Syllo
Owner

Syllo commented Apr 3, 2022

The problem with NVIDIA was that the nvml header was not packaged on all the distributions, and downloading a recent header could break the build. Also I was not certain I could redistribute a version of the nvidia header. That is why I included the definitions directly.

Things are different with libdrm. The headers are consistently available on any distribution and updated with the kernel. Much cleaner than NVIDIA.

Hence, relying on libdrm for AMDGPU support is fine in my opinion!

If you are fine with the latest changes I will merge it into master.

@zhuyifei1999
Contributor Author

The problem with NVIDIA was that the nvml header was not packaged on all the distributions, and downloading a recent header could break the build. Also I was not certain I could redistribute a version of the nvidia header. That is why I included the definitions directly.

I see.

Neat Videocard TOP

Nice naming lol

If you are fine with the latest changes I will merge it into master.

No objections. Go ahead. Thanks!

@Syllo Syllo merged commit 346021b into Syllo:master Apr 4, 2022