
OpenMPI 5.0.x maps/binds to cores even if asked to use hwthreads #12967

Open
LourensVeen opened this issue Dec 7, 2024 · 0 comments
Background information

I'm trying to start an MPI program on a particular subset of resources within an allocation (or locally; the results are the same). I'm mapping with a rankfile, binding the MPI processes to a specific set of CPUs that way. My software is set up to specify logical CPUs, which are hwthreads on my 8C16T laptop.

My test suite includes OpenMPI 3.1.6 and 4.1.6, which both work fine, but I can't get OpenMPI 5.0.x to bind to hwthreads: it insists on interpreting the numbers in the rankfile as core IDs. As a result, everything runs fine, but in the wrong place.
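For reference, a minimal sketch of the kind of setup I mean (the hostname, file name and program name are illustrative, not my actual setup):

```
# myrankfile: slot numbers are intended as hwthread (logical CPU) ids
rank 0=headnode slot=2
rank 1=headnode slot=3
```

launched with something like

```shell
mpirun --use-hwthread-cpus --rankfile myrankfile -np 2 ./my_mpi_program
```

With 3.1.6 and 4.1.6 the ranks end up on hwthreads 2 and 3 as intended (both on core 1); with 5.0.x they are bound to cores 2 and 3 instead.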

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

I tested 5.0.1, 5.0.3 and 5.0.5, which don't work, and 4.1.6 and 3.1.6, which do. The 5.0.x versions use hwloc 2.11.1, while the older ones use hwloc 1.11.13.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

I used Spack v0.22 and v0.23, with SLURM support and external PMIx, hwloc and libevent.

Please describe the system on which you are running

  • Operating system/version:

Kubuntu 22.04

I'm running inside of a Docker container, and can reproduce the problem both in a SLURM allocation and outside of one, but only on 5.0.x.

  • Computer hardware:

Lenovo P14s, AMD Ryzen 7 PRO 5850U CPU with 8 cores and 16 hwthreads

(output of ompi_info --all omitted because the message is too long)

Output of lstopo -v (using hwloc 2.11.1):
Machine (P#0 total=45015732KB DMIProductName=21A0000SMH DMIProductVersion="ThinkPad P14s Gen 2a" DMIBoardVendor=LENOVO DMIBoardName=21A0000SMH DMIBoardVersion="SDK0J40697 WIN" DMIBoardAssetTag="Not Available" DMIChassisVendor=LENOVO DMIChassisType=10 DMIChassisVersion=None DMIChassisAssetTag="No Asset Information" DMIBIOSVendor=LENOVO DMIBIOSVersion="R1MET58W (1.28 )" DMIBIOSDate=08/13/2024 DMISysVendor=LENOVO Backend=Linux LinuxCgroup=/ OSName=Linux OSRelease=5.15.0-124-generic OSVersion="#134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024" HostName=headnode Architecture=x86_64 hwlocVersion=2.11.1 ProcessName=lstopo)
  Package L#0 (P#0 total=45015732KB CPUVendor=AuthenticAMD CPUFamilyNumber=25 CPUModelNumber=80 CPUModel="AMD Ryzen 7 PRO 5850U with Radeon Graphics     " CPUStepping=0)
    NUMANode L#0 (P#0 local=45015732KB total=45015732KB)
    L3Cache L#0 (P#0 size=16384KB linesize=64 ways=16 Inclusive=0)
      L2Cache L#0 (P#0 size=512KB linesize=64 ways=8 Inclusive=1)
        L1dCache L#0 (P#0 size=32KB linesize=64 ways=8 Inclusive=0)
          L1iCache L#0 (P#0 size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#0 (P#0)
              PU L#0 (P#0)
              PU L#1 (P#1)
      L2Cache L#1 (P#1 size=512KB linesize=64 ways=8 Inclusive=1)
        L1dCache L#1 (P#1 size=32KB linesize=64 ways=8 Inclusive=0)
          L1iCache L#1 (P#1 size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#1 (P#1)
              PU L#2 (P#2)
              PU L#3 (P#3)
      L2Cache L#2 (P#2 size=512KB linesize=64 ways=8 Inclusive=1)
        L1dCache L#2 (P#2 size=32KB linesize=64 ways=8 Inclusive=0)
          L1iCache L#2 (P#2 size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#2 (P#2)
              PU L#4 (P#4)
              PU L#5 (P#5)
      L2Cache L#3 (P#3 size=512KB linesize=64 ways=8 Inclusive=1)
        L1dCache L#3 (P#3 size=32KB linesize=64 ways=8 Inclusive=0)
          L1iCache L#3 (P#3 size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#3 (P#3)
              PU L#6 (P#6)
              PU L#7 (P#7)
      L2Cache L#4 (P#4 size=512KB linesize=64 ways=8 Inclusive=1)
        L1dCache L#4 (P#4 size=32KB linesize=64 ways=8 Inclusive=0)
          L1iCache L#4 (P#4 size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#4 (P#4)
              PU L#8 (P#8)
              PU L#9 (P#9)
      L2Cache L#5 (P#5 size=512KB linesize=64 ways=8 Inclusive=1)
        L1dCache L#5 (P#5 size=32KB linesize=64 ways=8 Inclusive=0)
          L1iCache L#5 (P#5 size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#5 (P#5)
              PU L#10 (P#10)
              PU L#11 (P#11)
      L2Cache L#6 (P#6 size=512KB linesize=64 ways=8 Inclusive=1)
        L1dCache L#6 (P#6 size=32KB linesize=64 ways=8 Inclusive=0)
          L1iCache L#6 (P#6 size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#6 (P#6)
              PU L#12 (P#12)
              PU L#13 (P#13)
      L2Cache L#7 (P#7 size=512KB linesize=64 ways=8 Inclusive=1)
        L1dCache L#7 (P#7 size=32KB linesize=64 ways=8 Inclusive=0)
          L1iCache L#7 (P#7 size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#7 (P#7)
              PU L#14 (P#14)
              PU L#15 (P#15)
  HostBridge L#0 (buses=0000:[00-07])
    PCIBridge L#1 (busid=0000:00:02.1 id=1022:1634 class=0604(PCIBridge) link=3.94GB/s buses=0000:[01-01])
      PCI L#0 (busid=0000:01:00.0 id=144d:a80a class=0108(NVMExp) link=3.94GB/s)
        Block(Disk) L#0 (Size=1953514584 SectorSize=512 LinuxDeviceID=259:0) "nvme0n1"
    PCIBridge L#2 (busid=0000:00:02.2 id=1022:1634 class=0604(PCIBridge) link=0.25GB/s buses=0000:[02-02])
      PCI L#1 (busid=0000:02:00.0 id=10ec:8168 class=0200(Ethernet) link=0.25GB/s)
    PCIBridge L#3 (busid=0000:00:02.3 id=1022:1634 class=0604(PCIBridge) link=0.62GB/s buses=0000:[03-03])
      PCI L#2 (busid=0000:03:00.0 id=8086:2725 class=0280(Network) link=0.62GB/s PCISlot=0)
    PCIBridge L#4 (busid=0000:00:02.6 id=1022:1634 class=0604(PCIBridge) link=0.25GB/s buses=0000:[05-05])
      PCI L#3 (busid=0000:05:00.0 id=10ec:8168 class=0200(Ethernet) link=0.25GB/s)
    PCIBridge L#5 (busid=0000:00:08.1 id=1022:1635 class=0604(PCIBridge) link=15.75GB/s buses=0000:[07-07])
      PCI L#4 (busid=0000:07:00.0 id=1002:1638 class=0300(VGA) link=15.75GB/s)
depth 0:           1 Machine (type #0)
 depth 1:          1 Package (type #1)
  depth 2:         1 L3Cache (type #6)
   depth 3:        8 L2Cache (type #5)
    depth 4:       8 L1dCache (type #4)
     depth 5:      8 L1iCache (type #9)
      depth 6:     8 Core (type #2)
       depth 7:    16 PU (type #3)
Special depth -3:  1 NUMANode (type #13)
Special depth -4:  6 Bridge (type #14)
Special depth -5:  5 PCIDev (type #15)
Special depth -6:  1 OSDev (type #16)
CPU kind #0 efficiency 0 cpuset 0x0000ffff
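As the Core/PU listing above shows, each core L#k on this machine exposes PUs P#2k and P#2k+1, so hwthread and core IDs diverge for everything past CPU 1. A small sketch of that mapping (assuming exactly this layout: two SMT threads per core, numbered consecutively):

```python
# Mapping between hwthread (PU) ids and core ids for the topology above,
# where core k hosts PUs 2k and 2k+1 (8 cores, 16 hwthreads).
THREADS_PER_CORE = 2

def core_of_pu(pu: int) -> int:
    """Core that a given hwthread (logical CPU) belongs to."""
    return pu // THREADS_PER_CORE

def pus_of_core(core: int) -> list[int]:
    """Hwthreads hosted by a given core."""
    first = core * THREADS_PER_CORE
    return list(range(first, first + THREADS_PER_CORE))

# A rankfile slot of "5" is meant as PU 5 (core 2's second thread), but
# read as a core id it instead selects core 5, i.e. PUs 10 and 11.
```

This is why the job still runs but lands in the wrong place: any slot number below 8 is also a valid core ID here, just not the one intended.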
Output of lstopo --of xml (using hwloc 2.11.1):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc2.dtd">
<topology version="2.0">
  <object type="Machine" os_index="0" cpuset="0x0000ffff" complete_cpuset="0x0000ffff" allowed_cpuset="0x0000ffff" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001" gp_index="1">
    <info name="DMIProductName" value="21A0000SMH"/>
    <info name="DMIProductVersion" value="ThinkPad P14s Gen 2a"/>
    <info name="DMIBoardVendor" value="LENOVO"/>
    <info name="DMIBoardName" value="21A0000SMH"/>
    <info name="DMIBoardVersion" value="SDK0J40697 WIN"/>
    <info name="DMIBoardAssetTag" value="Not Available"/>
    <info name="DMIChassisVendor" value="LENOVO"/>
    <info name="DMIChassisType" value="10"/>
    <info name="DMIChassisVersion" value="None"/>
    <info name="DMIChassisAssetTag" value="No Asset Information"/>
    <info name="DMIBIOSVendor" value="LENOVO"/>
    <info name="DMIBIOSVersion" value="R1MET58W (1.28 )"/>
    <info name="DMIBIOSDate" value="08/13/2024"/>
    <info name="DMISysVendor" value="LENOVO"/>
    <info name="Backend" value="Linux"/>
    <info name="LinuxCgroup" value="/"/>
    <info name="OSName" value="Linux"/>
    <info name="OSRelease" value="5.15.0-124-generic"/>
    <info name="OSVersion" value="#134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024"/>
    <info name="HostName" value="headnode"/>
    <info name="Architecture" value="x86_64"/>
    <info name="hwlocVersion" value="2.11.1"/>
    <info name="ProcessName" value="lstopo"/>
    <object type="Package" os_index="0" cpuset="0x0000ffff" complete_cpuset="0x0000ffff" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="3">
      <info name="CPUVendor" value="AuthenticAMD"/>
      <info name="CPUFamilyNumber" value="25"/>
      <info name="CPUModelNumber" value="80"/>
      <info name="CPUModel" value="AMD Ryzen 7 PRO 5850U with Radeon Graphics     "/>
      <info name="CPUStepping" value="0"/>
      <object type="NUMANode" os_index="0" cpuset="0x0000ffff" complete_cpuset="0x0000ffff" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="52" local_memory="46096109568">
        <page_type size="4096" count="11253933"/>
        <page_type size="2097152" count="0"/>
        <page_type size="1073741824" count="0"/>
      </object>
      <object type="L3Cache" os_index="0" cpuset="0x0000ffff" complete_cpuset="0x0000ffff" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="8" cache_size="16777216" depth="3" cache_linesize="64" cache_associativity="16" cache_type="0">
        <info name="Inclusive" value="0"/>
        <object type="L2Cache" os_index="0" cpuset="0x00000003" complete_cpuset="0x00000003" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="7" cache_size="524288" depth="2" cache_linesize="64" cache_associativity="8" cache_type="0">
          <info name="Inclusive" value="1"/>
          <object type="L1Cache" os_index="0" cpuset="0x00000003" complete_cpuset="0x00000003" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="5" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="1">
            <info name="Inclusive" value="0"/>
            <object type="L1iCache" os_index="0" cpuset="0x00000003" complete_cpuset="0x00000003" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="6" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="2">
              <info name="Inclusive" value="0"/>
              <object type="Core" os_index="0" cpuset="0x00000003" complete_cpuset="0x00000003" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="2">
                <object type="PU" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="4"/>
                <object type="PU" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="9"/>
              </object>
            </object>
          </object>
        </object>
        <object type="L2Cache" os_index="1" cpuset="0x0000000c" complete_cpuset="0x0000000c" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="14" cache_size="524288" depth="2" cache_linesize="64" cache_associativity="8" cache_type="0">
          <info name="Inclusive" value="1"/>
          <object type="L1Cache" os_index="1" cpuset="0x0000000c" complete_cpuset="0x0000000c" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="12" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="1">
            <info name="Inclusive" value="0"/>
            <object type="L1iCache" os_index="1" cpuset="0x0000000c" complete_cpuset="0x0000000c" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="13" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="2">
              <info name="Inclusive" value="0"/>
              <object type="Core" os_index="1" cpuset="0x0000000c" complete_cpuset="0x0000000c" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="10">
                <object type="PU" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="11"/>
                <object type="PU" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="15"/>
              </object>
            </object>
          </object>
        </object>
        <object type="L2Cache" os_index="2" cpuset="0x00000030" complete_cpuset="0x00000030" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="20" cache_size="524288" depth="2" cache_linesize="64" cache_associativity="8" cache_type="0">
          <info name="Inclusive" value="1"/>
          <object type="L1Cache" os_index="2" cpuset="0x00000030" complete_cpuset="0x00000030" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="18" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="1">
            <info name="Inclusive" value="0"/>
            <object type="L1iCache" os_index="2" cpuset="0x00000030" complete_cpuset="0x00000030" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="19" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="2">
              <info name="Inclusive" value="0"/>
              <object type="Core" os_index="2" cpuset="0x00000030" complete_cpuset="0x00000030" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="16">
                <object type="PU" os_index="4" cpuset="0x00000010" complete_cpuset="0x00000010" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="17"/>
                <object type="PU" os_index="5" cpuset="0x00000020" complete_cpuset="0x00000020" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="21"/>
              </object>
            </object>
          </object>
        </object>
        <object type="L2Cache" os_index="3" cpuset="0x000000c0" complete_cpuset="0x000000c0" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="26" cache_size="524288" depth="2" cache_linesize="64" cache_associativity="8" cache_type="0">
          <info name="Inclusive" value="1"/>
          <object type="L1Cache" os_index="3" cpuset="0x000000c0" complete_cpuset="0x000000c0" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="24" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="1">
            <info name="Inclusive" value="0"/>
            <object type="L1iCache" os_index="3" cpuset="0x000000c0" complete_cpuset="0x000000c0" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="25" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="2">
              <info name="Inclusive" value="0"/>
              <object type="Core" os_index="3" cpuset="0x000000c0" complete_cpuset="0x000000c0" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="22">
                <object type="PU" os_index="6" cpuset="0x00000040" complete_cpuset="0x00000040" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="23"/>
                <object type="PU" os_index="7" cpuset="0x00000080" complete_cpuset="0x00000080" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="27"/>
              </object>
            </object>
          </object>
        </object>
        <object type="L2Cache" os_index="4" cpuset="0x00000300" complete_cpuset="0x00000300" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="32" cache_size="524288" depth="2" cache_linesize="64" cache_associativity="8" cache_type="0">
          <info name="Inclusive" value="1"/>
          <object type="L1Cache" os_index="4" cpuset="0x00000300" complete_cpuset="0x00000300" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="30" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="1">
            <info name="Inclusive" value="0"/>
            <object type="L1iCache" os_index="4" cpuset="0x00000300" complete_cpuset="0x00000300" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="31" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="2">
              <info name="Inclusive" value="0"/>
              <object type="Core" os_index="4" cpuset="0x00000300" complete_cpuset="0x00000300" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="28">
                <object type="PU" os_index="8" cpuset="0x00000100" complete_cpuset="0x00000100" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="29"/>
                <object type="PU" os_index="9" cpuset="0x00000200" complete_cpuset="0x00000200" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="33"/>
              </object>
            </object>
          </object>
        </object>
        <object type="L2Cache" os_index="5" cpuset="0x00000c00" complete_cpuset="0x00000c00" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="38" cache_size="524288" depth="2" cache_linesize="64" cache_associativity="8" cache_type="0">
          <info name="Inclusive" value="1"/>
          <object type="L1Cache" os_index="5" cpuset="0x00000c00" complete_cpuset="0x00000c00" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="36" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="1">
            <info name="Inclusive" value="0"/>
            <object type="L1iCache" os_index="5" cpuset="0x00000c00" complete_cpuset="0x00000c00" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="37" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="2">
              <info name="Inclusive" value="0"/>
              <object type="Core" os_index="5" cpuset="0x00000c00" complete_cpuset="0x00000c00" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="34">
                <object type="PU" os_index="10" cpuset="0x00000400" complete_cpuset="0x00000400" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="35"/>
                <object type="PU" os_index="11" cpuset="0x00000800" complete_cpuset="0x00000800" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="39"/>
              </object>
            </object>
          </object>
        </object>
        <object type="L2Cache" os_index="6" cpuset="0x00003000" complete_cpuset="0x00003000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="44" cache_size="524288" depth="2" cache_linesize="64" cache_associativity="8" cache_type="0">
          <info name="Inclusive" value="1"/>
          <object type="L1Cache" os_index="6" cpuset="0x00003000" complete_cpuset="0x00003000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="42" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="1">
            <info name="Inclusive" value="0"/>
            <object type="L1iCache" os_index="6" cpuset="0x00003000" complete_cpuset="0x00003000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="43" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="2">
              <info name="Inclusive" value="0"/>
              <object type="Core" os_index="6" cpuset="0x00003000" complete_cpuset="0x00003000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="40">
                <object type="PU" os_index="12" cpuset="0x00001000" complete_cpuset="0x00001000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="41"/>
                <object type="PU" os_index="13" cpuset="0x00002000" complete_cpuset="0x00002000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="45"/>
              </object>
            </object>
          </object>
        </object>
        <object type="L2Cache" os_index="7" cpuset="0x0000c000" complete_cpuset="0x0000c000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="50" cache_size="524288" depth="2" cache_linesize="64" cache_associativity="8" cache_type="0">
          <info name="Inclusive" value="1"/>
          <object type="L1Cache" os_index="7" cpuset="0x0000c000" complete_cpuset="0x0000c000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="48" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="1">
            <info name="Inclusive" value="0"/>
            <object type="L1iCache" os_index="7" cpuset="0x0000c000" complete_cpuset="0x0000c000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="49" cache_size="32768" depth="1" cache_linesize="64" cache_associativity="8" cache_type="2">
              <info name="Inclusive" value="0"/>
              <object type="Core" os_index="7" cpuset="0x0000c000" complete_cpuset="0x0000c000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="46">
                <object type="PU" os_index="14" cpuset="0x00004000" complete_cpuset="0x00004000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="47"/>
                <object type="PU" os_index="15" cpuset="0x00008000" complete_cpuset="0x00008000" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="51"/>
              </object>
            </object>
          </object>
        </object>
      </object>
    </object>
    <object type="Bridge" gp_index="78" bridge_type="0-1" depth="0" bridge_pci="0000:[00-07]">
      <object type="Bridge" gp_index="63" bridge_type="1-1" depth="1" bridge_pci="0000:[01-01]" pci_busid="0000:00:02.1" pci_type="0604 [1022:1634] [17aa:5094] 00" pci_link_speed="3.938462">
        <object type="PCIDev" gp_index="62" pci_busid="0000:01:00.0" pci_type="0108 [144d:a80a] [144d:a801] 00" pci_link_speed="3.938462">
          <object type="OSDev" gp_index="79" name="nvme0n1" subtype="Disk" osdev_type="0">
            <info name="Size" value="1953514584"/>
            <info name="SectorSize" value="512"/>
            <info name="LinuxDeviceID" value="259:0"/>
          </object>
        </object>
      </object>
      <object type="Bridge" gp_index="74" bridge_type="1-1" depth="1" bridge_pci="0000:[02-02]" pci_busid="0000:00:02.2" pci_type="0604 [1022:1634] [17aa:5094] 00" pci_link_speed="0.250000">
        <object type="PCIDev" gp_index="58" pci_busid="0000:02:00.0" pci_type="0200 [10ec:8168] [17aa:5094] 0e" pci_link_speed="0.250000"/>
      </object>
      <object type="Bridge" gp_index="60" bridge_type="1-1" depth="1" bridge_pci="0000:[03-03]" pci_busid="0000:00:02.3" pci_type="0604 [1022:1634] [17aa:5094] 00" pci_link_speed="0.615385">
        <object type="PCIDev" gp_index="53" pci_busid="0000:03:00.0" pci_type="0280 [8086:2725] [8086:0024] 1a" pci_link_speed="0.615385">
          <info name="PCISlot" value="0"/>
        </object>
      </object>
      <object type="Bridge" gp_index="69" bridge_type="1-1" depth="1" bridge_pci="0000:[05-05]" pci_busid="0000:00:02.6" pci_type="0604 [1022:1634] [17aa:5094] 00" pci_link_speed="0.250000">
        <object type="PCIDev" gp_index="72" pci_busid="0000:05:00.0" pci_type="0200 [10ec:8168] [17aa:5094] 15" pci_link_speed="0.250000"/>
      </object>
      <object type="Bridge" gp_index="67" bridge_type="1-1" depth="1" bridge_pci="0000:[07-07]" pci_busid="0000:00:08.1" pci_type="0604 [1022:1635] [5094:17aa] 00" pci_link_speed="15.753846">
        <object type="PCIDev" gp_index="64" pci_busid="0000:07:00.0" pci_type="0300 [1002:1638] [17aa:509b] d1" pci_link_speed="15.753846"/>
      </object>
    </object>
  </object>
  <support name="discovery.pu"/>
  <support name="discovery.numa"/>
  <support name="discovery.numa_memory"/>
  <support name="discovery.disallowed_pu"/>
  <support name="discovery.disallowed_numa"/>
  <support name="discovery.cpukind_efficiency"/>
  <support name="cpubind.set_thisproc_cpubind"/>
  <support name="cpubind.get_thisproc_cpubind"/>
  <support name="cpubind.set_proc_cpubind"/>
  <support name="cpubind.get_proc_cpubind"/>
  <support name="cpubind.set_thisthread_cpubind"/>
  <support name="cpubind.get_thisthread_cpubind"/>
  <support name="cpubind.set_thread_cpubind"/>
  <support name="cpubind.get_thread_cpubind"/>
  <support name="cpubind.get_thisproc_last_cpu_location"/>
  <support name="cpubind.get_proc_last_cpu_location"/>
  <support name="cpubind.get_thisthread_last_cpu_location"/>
  <support name="membind.set_thisthread_membind"/>
  <support name="membind.get_thisthread_membind"/>
  <support name="membind.set_area_membind"/>
  <support name="membind.get_area_membind"/>
  <support name="membind.alloc_membind"/>
  <support name="membind.firsttouch_membind"/>
  <support name="membind.bind_membind"/>
  <support name="membind.interleave_membind"/>
  <support name="membind.migrate_membind"/>
  <support name="membind.get_area_memlocation"/>
  <support name="custom.exported_support"/>
  <cpukind cpuset="0x0000ffff">
    <info name="FrequencyMaxMHz" value="1900"/>
  </cpukind>
</topology>
Output of mpirun test command:
shell$ mpirun --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc hostname
[headnode:46785] mca: base: component_find: searching NULL for plm components
[headnode:46785] mca: base: find_dyn_components: checking NULL for plm components
[headnode:46785] pmix:mca: base: components_register: registering framework plm components
[headnode:46785] pmix:mca: base: components_register: found loaded component slurm
[headnode:46785] pmix:mca: base: components_register: component slurm register function successful
[headnode:46785] pmix:mca: base: components_register: found loaded component ssh
[headnode:46785] pmix:mca: base: components_register: component ssh register function successful
[headnode:46785] mca: base: components_open: opening plm components
[headnode:46785] mca: base: components_open: found loaded component slurm
[headnode:46785] mca: base: components_open: component slurm open function successful
[headnode:46785] mca: base: components_open: found loaded component ssh
[headnode:46785] mca: base: components_open: component ssh open function successful
[headnode:46785] mca:base:select: Auto-selecting plm components
[headnode:46785] mca:base:select:(  plm) Querying component [slurm]
[headnode:46785] mca:base:select:(  plm) Querying component [ssh]
[headnode:46785] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[headnode:46785] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[headnode:46785] mca:base:select:(  plm) Selected component [ssh]
[headnode:46785] mca: base: close: component slurm closed
[headnode:46785] mca: base: close: unloading component slurm
[headnode:46785] [prterun-headnode-46785@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive start comm
[headnode:46785] mca: base: component_find: searching NULL for rmaps components
[headnode:46785] mca: base: find_dyn_components: checking NULL for rmaps components
[headnode:46785] pmix:mca: base: components_register: registering framework rmaps components
[headnode:46785] pmix:mca: base: components_register: found loaded component ppr
[headnode:46785] pmix:mca: base: components_register: component ppr register function successful
[headnode:46785] pmix:mca: base: components_register: found loaded component rank_file
[headnode:46785] pmix:mca: base: components_register: component rank_file has no register or open function
[headnode:46785] pmix:mca: base: components_register: found loaded component round_robin
[headnode:46785] pmix:mca: base: components_register: component round_robin register function successful
[headnode:46785] pmix:mca: base: components_register: found loaded component seq
[headnode:46785] pmix:mca: base: components_register: component seq register function successful
[headnode:46785] mca: base: components_open: opening rmaps components
[headnode:46785] mca: base: components_open: found loaded component ppr
[headnode:46785] mca: base: components_open: component ppr open function successful
[headnode:46785] mca: base: components_open: found loaded component rank_file
[headnode:46785] mca: base: components_open: found loaded component round_robin
[headnode:46785] mca: base: components_open: component round_robin open function successful
[headnode:46785] mca: base: components_open: found loaded component seq
[headnode:46785] mca: base: components_open: component seq open function successful
[headnode:46785] mca:rmaps:select: checking available component ppr
[headnode:46785] mca:rmaps:select: Querying component [ppr]
[headnode:46785] mca:rmaps:select: checking available component rank_file
[headnode:46785] mca:rmaps:select: Querying component [rank_file]
[headnode:46785] mca:rmaps:select: checking available component round_robin
[headnode:46785] mca:rmaps:select: Querying component [round_robin]
[headnode:46785] mca:rmaps:select: checking available component seq
[headnode:46785] mca:rmaps:select: Querying component [seq]
[headnode:46785] [prterun-headnode-46785@0,0]: Final mapper priorities
[headnode:46785]        Mapper: rank_file Priority: 100
[headnode:46785]        Mapper: ppr Priority: 90
[headnode:46785]        Mapper: seq Priority: 60
[headnode:46785]        Mapper: round_robin Priority: 10
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_vm
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_vm creating map
[headnode:46785] [prterun-headnode-46785@0,0] setup:vm: working unmanaged allocation
[headnode:46785] [prterun-headnode-46785@0,0] using default hostfile /opt/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.4.0/openmpi-5.0.5-um6gykzrcb4d2xkmsf53ce5eswpj42zz/etc/prte-default-hostfile

======================   ALLOCATED NODES   ======================
    headnode: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_vm only HNP in allocation
        aliases: headnode
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setting slots for node headnode by core
=================================================================

======================   ALLOCATED NODES   ======================
    headnode: slots=8 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: headnode
=================================================================
[headnode:46785] [prterun-headnode-46785@0,0] rmaps:base set policy with ppr:1:node
[headnode:46785] [prterun-headnode-46785@0,0] rmaps:base policy ppr modifiers 1:node provided
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive processing msg
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive job launch command from [prterun-headnode-46785@0,0]
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive adding hosts
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive calling spawn
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive done processing commands
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_job
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_vm
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_vm no new daemons required
[headnode:46785] mca:rmaps: mapping job prterun-headnode-46785@1

======================   ALLOCATED NODES   ======================
    headnode: slots=8 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: headnode
=================================================================
[headnode:46785] mca:rmaps: setting mapping policies for job prterun-headnode-46785@1 inherit TRUE hwtcpus FALSE
[headnode:46785] [prterun-headnode-46785@0,0] using known nodes
[headnode:46785] [prterun-headnode-46785@0,0] Starting with 1 nodes in list
[headnode:46785] [prterun-headnode-46785@0,0] Filtering thru apps
[headnode:46785] [prterun-headnode-46785@0,0] Retained 1 nodes in list
[headnode:46785] [prterun-headnode-46785@0,0] node headnode has 8 slots available
[headnode:46785] AVAILABLE NODES FOR MAPPING:
[headnode:46785]     node: headnode daemon: 0 slots_available: 8
[headnode:46785] setdefaultbinding[366] binding not given - using bycore
[headnode:46785] mca:rmaps:rf: job prterun-headnode-46785@1 not using rankfile policy
[headnode:46785] mca:rmaps:ppr: mapping job prterun-headnode-46785@1 with ppr 1:node
[headnode:46785] mca:rmaps:ppr: job prterun-headnode-46785@1 assigned policy BYNODE:SLOT
[headnode:46785] [prterun-headnode-46785@0,0] using known nodes
[headnode:46785] [prterun-headnode-46785@0,0] Starting with 1 nodes in list
[headnode:46785] [prterun-headnode-46785@0,0] Filtering thru apps
[headnode:46785] [prterun-headnode-46785@0,0] Retained 1 nodes in list
[headnode:46785] [prterun-headnode-46785@0,0] node headnode has 8 slots available
[headnode:46785] AVAILABLE NODES FOR MAPPING:
[headnode:46785]     node: headnode daemon: 0 slots_available: 8
[headnode:46785] [prterun-headnode-46785@0,0] get_avail_ncpus: node headnode has 0 procs on it
[headnode:46785] mca:rmaps: compute bindings for job prterun-headnode-46785@1 with policy CORE:IF-SUPPORTED[1007]
[headnode:46785] mca:rmaps: bind [prterun-headnode-46785@1,INVALID] with policy CORE:IF-SUPPORTED
[headnode:46785] [prterun-headnode-46785@0,0] BOUND PROC [prterun-headnode-46785@1,INVALID][headnode] TO package[0][core:0]
[headnode:46785] [prterun-headnode-46785@0,0] complete_setup on job prterun-headnode-46785@1
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:launch_apps for job prterun-headnode-46785@1
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:send launch msg for job prterun-headnode-46785@1
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:launch wiring up iof for job prterun-headnode-46785@1
headnode
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:prted_cmd sending prted_exit commands
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive stop comm
[headnode:46785] mca: base: close: component ssh closed
[headnode:46785] mca: base: close: unloading component ssh
- Network type:

No special hardware, connections are either local or over TCP between different Docker containers representing fake cluster nodes.


Details of the problem

When I use mpirun to start a program and ask for OpenMPI to map processes using a rankfile and hwthreads, it assigns whole cores to each process instead of individual threads. That is, the slots in the rankfile are always interpreted as core ids, not as hwthread (logical CPU) ids:

shell$ cat rankfile
rank 0=localhost slot=0,1
rank 1=localhost slot=2,3

shell$ mpirun -n 2 -rankfile rankfile python -c 'import os; print(os.sched_getaffinity(0))'
--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.

  Deprecated option:   rankfile
  Corrected option:    --map-by rankfile:file=rankfile

We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------

--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.

  Deprecated option:   rankfile
  Corrected option:    --map-by rankfile:file=rankfile

We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------

{0, 1, 2, 3}
{4, 5, 6, 7}
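These sets are consistent with the slot numbers being taken as core ids and then expanded to each core's hwthreads. A minimal sketch of that expansion (my own illustration, not OpenMPI code), assuming hwthreads are numbered contiguously per core with 2 per core as on this machine; the real numbering comes from hwloc:

```python
# Hypothetical helper: expand core ids to hwthread (logical CPU) ids,
# assuming 2 contiguously-numbered hwthreads per core.
def hwthreads_for_cores(cores, threads_per_core=2):
    return {core * threads_per_core + t
            for core in cores
            for t in range(threads_per_core)}

print(hwthreads_for_cores({0, 1}))  # rank 0: {0, 1, 2, 3}
print(hwthreads_for_cores({2, 3}))  # rank 1: {4, 5, 6, 7}
```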

This is as expected, because OpenMPI interprets the numbers as core ids by default, and cores 0 and 1 map to hwthreads {0, 1} and {2, 3} respectively. So let's use --use-hwthread-cpus to fix that:

shell$ mpirun -n 2 --use-hwthread-cpus --rankfile rankfile python -c 'import os; print(os.sched_getaffinity(0))'
--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.

  Deprecated option:   rankfile
  Corrected option:    --map-by rankfile:file=rankfile

We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------

--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.

  Deprecated option:   rankfile
  Corrected option:    --map-by rankfile:file=rankfile

We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------

{4, 5, 6, 7}
{0, 1, 2, 3}

On pre-5.0.x versions, the above prints {0, 1} and {2, 3}, but 5.0.x persists in using cores rather than hwthreads. It does complain about the deprecated syntax, however, so let's try the newer one:

shell$ mpirun -n 2 --map-by rankfile:file=rankfile:hwtcpus python -c 'import os; print(os.sched_getaffinity(0))'
{0, 1, 2, 3}
{4, 5, 6, 7}

That fixes the warning, but not the problem. Let's see what it's actually doing:

shell$ mpirun -n 2 --map-by rankfile:file=rankfile:hwtcpus -v python -c 'import os; print(os.sched_getaffinity(0))'
--------------------------------------------------------------------------
ERROR: The "map-by" command line option was listed more than once on the command line.
Only one instance of this option is permitted.
Please correct your command line.
--------------------------------------------------------------------------

Okay, I don't think that was supposed to happen. Let's try a different way:

shell$ mpirun -n 2 --map-by rankfile:file=rankfile:hwtcpus --display-map --report-bindings python -c 'import os; print(os.sched_getaffinity(0))'

========================   JOB MAP   ========================
Data for JOB prterun-headnode-49764@1 offset 0 Total slots allocated 8
    Mapping policy: BYUSER:NOOVERSUBSCRIBE  Ranking policy: BYUSER Binding policy: CORE:IF-SUPPORTED
    Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE


Data for node: headnode Num slots: 8    Max slots: 0    Num procs: 2
        Process jobid: prterun-headnode-49764@1 App: 0 Process rank: 0 Bound: package[0][core:0-1]
        Process jobid: prterun-headnode-49764@1 App: 0 Process rank: 1 Bound: package[0][core:2-3]

=============================================================
[headnode:49764] Rank 0 bound to package[0][core:0-1]
[headnode:49764] Rank 1 bound to package[0][core:2-3]
{4, 5, 6, 7}
{0, 1, 2, 3}

Ah, it's trying to bind to cores, maybe that's it?

shell$ mpirun -n 2 --map-by rankfile:file=rankfile:hwtcpus --bind-to hwthread --display-map --report-bindings python -c 'import os; print(os.sched_getaffinity(0))'

========================   JOB MAP   ========================
Data for JOB prterun-headnode-50283@1 offset 0 Total slots allocated 8
    Mapping policy: BYUSER:NOOVERSUBSCRIBE  Ranking policy: BYUSER Binding policy: HWTHREAD
    Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE


Data for node: headnode Num slots: 8    Max slots: 0    Num procs: 2
        Process jobid: prterun-headnode-50283@1 App: 0 Process rank: 0 Bound: package[0][core:0-1]
        Process jobid: prterun-headnode-50283@1 App: 0 Process rank: 1 Bound: package[0][core:2-3]

=============================================================
[headnode:50283] Rank 0 bound to package[0][core:0-1]
[headnode:50283] Rank 1 bound to package[0][core:2-3]
{4, 5, 6, 7}
{0, 1, 2, 3}

Nope. Maybe try the old syntax again?

shell$ mpirun -n 2 --use-hwthread-cpus --rankfile rankfile --bind-to hwthread --display-map --report-bindings python -c 'import os; print(os.sched_getaffinity(0))'
--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.

  Deprecated option:   rankfile
  Corrected option:    --map-by rankfile:file=rankfile

We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------

--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.

  Deprecated option:   rankfile
  Corrected option:    --map-by rankfile:file=rankfile

We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------


========================   JOB MAP   ========================
Data for JOB prterun-headnode-51016@1 offset 0 Total slots allocated 16
    Mapping policy: BYUSER:NOOVERSUBSCRIBE  Ranking policy: BYUSER Binding policy: HWTHREAD
    Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE


Data for node: headnode Num slots: 16   Max slots: 0    Num procs: 2
        Process jobid: prterun-headnode-51016@1 App: 0 Process rank: 0 Bound: package[0][core:0-1]
        Process jobid: prterun-headnode-51016@1 App: 0 Process rank: 1 Bound: package[0][core:2-3]

=============================================================
[headnode:51016] Rank 0 bound to package[0][core:0-1]
[headnode:51016] Rank 1 bound to package[0][core:2-3]
{4, 5, 6, 7}
{0, 1, 2, 3}

That Cpu Type: CORE may be the problem, but how do I convince OpenMPI that I have hwthreads to bind to? And why does this work on earlier versions?

I can't find anything in the 5.0.x docs that suggests that this is intended, so I think it's a bug, either in the code (the ORTE to PRRTE switch maybe?) or in the docs. Or perhaps in my brain, if I missed something. At any rate, any help in fixing it would be much appreciated!
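For completeness, the kernel's own view of which hwthread ids belong to which core can be checked independently of hwloc. A small diagnostic sketch (a hypothetical helper of mine, not part of OpenMPI), assuming the usual Linux sysfs layout:

```python
# Diagnostic sketch (hypothetical helper, not part of OpenMPI): ask the
# Linux kernel which hwthread ids belong to each core, via sysfs.
from pathlib import Path

def parse_cpu_list(text):
    """Parse a sysfs cpulist such as '0-1,4' into a set of cpu ids."""
    ids = set()
    for part in text.strip().split(','):
        if '-' in part:
            lo, hi = part.split('-')
            ids.update(range(int(lo), int(hi) + 1))
        elif part:
            ids.add(int(part))
    return ids

def core_siblings(sysfs='/sys/devices/system/cpu'):
    """Map each core id to the set of hwthread ids sharing that core."""
    cores = {}
    for topo in Path(sysfs).glob('cpu[0-9]*/topology'):
        core = int((topo / 'core_id').read_text())
        threads = parse_cpu_list((topo / 'thread_siblings_list').read_text())
        cores.setdefault(core, set()).update(threads)
    return cores
```

On the machine above I'd expect `core_siblings()` to report something like `{0: {0, 1}, 1: {2, 3}, ...}`, i.e. exactly the sets that pre-5.0.x binds to.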

@LourensVeen LourensVeen changed the title OpenMPI 5.0.x binds to cores even if asked to use hwthreads OpenMPI 5.0.x maps/binds to cores even if asked to use hwthreads Dec 8, 2024