Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsigned values dysfunction #457

Closed
Benco11-developement opened this issue Jun 20, 2024 · 9 comments
Closed

Unsigned values dysfunction #457

Benco11-developement opened this issue Jun 20, 2024 · 9 comments
Assignees
Labels
bug Something isn't working compiler

Comments

@Benco11-developement
Copy link


Describe the bug

Unsigning bytes somehow doesn't work like it should normally.

How To Reproduce

Here is a test case where we make a 24-bit unsigned int from 3 bytes :

import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.exceptions.TornadoExecutionPlanException;
import uk.ac.manchester.tornado.api.types.arrays.ByteArray;
import uk.ac.manchester.tornado.api.types.arrays.IntArray;

public class Main {

    public static void main(String[] args) throws TornadoExecutionPlanException {
        int byte1 = Integer.parseInt("10011100", 2); // Store the byte as an int because the parse function expects a signed input but the binary number provided can't fit into a signed byte
        int byte2 = Integer.parseInt("00100111", 2);
        int byte3 = Integer.parseInt("11001001", 2);

        ByteArray a = new ByteArray(3);
        a.set(0, (byte) byte3);
        a.set(1, (byte) byte2);
        a.set(2, (byte) byte1);

        IntArray theoreticalResult = new IntArray(4); // Executed normally
        IntArray actualResult = new IntArray(4); // Executed with Tornado
        theoreticalResult.set(0, 0);
        actualResult.set(0, 0);

        TaskGraph graph = new TaskGraph("s0")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, actualResult)
                .task("t0", Main::unsignedByte, a, actualResult)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, actualResult);

        ImmutableTaskGraph immutableTaskGraph = graph.snapshot();
        try(TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph)) {
            executionPlan.execute();
            unsignedByte(a, theoreticalResult);
        }

        System.out.println("Expected : " + Integer.toBinaryString(theoreticalResult.get(0)) + "\nFound : " + Integer.toBinaryString(actualResult.get(0)));
        for(int i = 1; i < 4; i++) {
            System.out.println("Expected byte " + i + " : " + Integer.toBinaryString(theoreticalResult.get(4-i)) + "\nFound : " + Integer.toBinaryString(actualResult.get(4-i)));
        }
    }

    public static void unsignedByte(ByteArray a, IntArray result) {
        for(@Parallel int i = 0; i < 3; i++) {
            result.set(0, result.get(0) | ((a.get(i) & 0xFF) << (8*i)));
            result.set(i+1, a.get(i) & 0xFF);
        }
    }
}

Result :

Expected : 100111000010011111001001
Found : 11111111100111000000000000000000
Expected byte 1 : 10011100
Found : 11111111111111111111111110011100
Expected byte 2 : 100111
Found : 100111
Expected byte 3 : 11001001
Found : 11111111111111111111111111001001

Expected behavior

Bytes should be normally unsigned thanks to the & 0xFF

Computing system setup (please complete the following information):

  • OS: Windows 10
  • CUDA v12.5
  • TornadoVM 1.0.5 release

Additional context

Changing the backend does not help


@jjfumero
Copy link
Member

Thanks for the report. At a first look, this code cannot be parallelized:

 public static void unsignedByte(ByteArray a, IntArray result) {
        for(@Parallel int i = 0; i < 3; i++) {
            result.set(0, result.get(0) | ((a.get(i) & 0xFF) << (8*i)));   << Shared position across **all threads**: it needs a blocking access, The TornadoVM Parallel API does not offer barriers, but the TornadoVM Kernel API does. Besides, TornadoVM supports atomics for the OpenCL backend. 
            result.set(i+1, a.get(i) & 0xFF);                                   
        }
 }   

Note that TornadoVM does not solve data dependencies. It takes annotations as hints to parallelize. It is up to the user to ensure those regions can be parallelized. This is a similar concept to OpenMP, or OpenACC.

@Benco11-developement
Copy link
Author

Benco11-developement commented Jun 20, 2024

Thanks for your response. I modified the test case to make it parallelizable and added a control group to verify if the code was executed correctly :

import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.exceptions.TornadoExecutionPlanException;
import uk.ac.manchester.tornado.api.types.arrays.ByteArray;
import uk.ac.manchester.tornado.api.types.arrays.IntArray;

import java.util.Arrays;
import java.util.Random;

public class Main {

    public static void main(String[] args) throws TornadoExecutionPlanException {
        Random r = new Random();
        int size = 32;

        ByteArray a = new ByteArray(size);

        // First half of input data includes bytes using all 8 bits, last half includes bytes using less than 8 bits
        for(int i = 0; i < size/2; i++) {
            a.set(i, (byte) (128 + r.nextInt(128)));
            a.set(i+size/2, (byte) (r.nextInt(128)));
        }

        IntArray theoreticalResult = new IntArray(size);
        IntArray theoreticalControlResult = new IntArray(size);

        IntArray actualResult = new IntArray(size);
        IntArray actualControlResult = new IntArray(size);

        theoreticalResult.init(0);
        theoreticalControlResult.init(0);

        actualResult.init(0);
        actualControlResult.init(0);

        TaskGraph graph = new TaskGraph("s0")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, actualResult, actualControlResult)
                .task("t0", Main::unsignedByte, a, actualResult, actualControlResult, size)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, actualResult, actualControlResult);

        ImmutableTaskGraph immutableTaskGraph = graph.snapshot();
        try(TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph)) {
            executionPlan.execute();
            unsignedByte(a, theoreticalResult, theoreticalControlResult, size);
        }

        if(Arrays.equals(theoreticalControlResult.toHeapArray(), actualControlResult.toHeapArray())) {
            System.out.println("Executed successfully");
        } else {
            System.out.println("Error during execution");
        }

        for(int i = 1; i < size; i++) {
            System.out.println("Expected byte " + i + " : " + Integer.toBinaryString(theoreticalResult.get(i-1)) + "\nFound : " + Integer.toBinaryString(actualResult.get(i-1)));
        }
    }

    public static void unsignedByte(ByteArray a, IntArray result, IntArray controlResult, int size) {
        for(@Parallel int i = 0; i < size; i++) {
            result.set(i, a.get(i) & 0xFF);
            controlResult.set(i, 2 * a.get(i));
        }
    }
}

Here is the output :

Executed successfully
Expected byte 1 : 10110011
Found : 11111111111111111111111110110011
Expected byte 2 : 10011100
Found : 11111111111111111111111110011100
Expected byte 3 : 10011100
Found : 11111111111111111111111110011100
Expected byte 4 : 11100110
Found : 11111111111111111111111111100110
Expected byte 5 : 11000000
Found : 11111111111111111111111111000000
Expected byte 6 : 10010100
Found : 11111111111111111111111110010100
Expected byte 7 : 10111010
Found : 11111111111111111111111110111010
Expected byte 8 : 11101000
Found : 11111111111111111111111111101000
Expected byte 9 : 10000101
Found : 11111111111111111111111110000101
Expected byte 10 : 11110111
Found : 11111111111111111111111111110111
Expected byte 11 : 11111011
Found : 11111111111111111111111111111011
Expected byte 12 : 11000010
Found : 11111111111111111111111111000010
Expected byte 13 : 10010111
Found : 11111111111111111111111110010111
Expected byte 14 : 11011110
Found : 11111111111111111111111111011110
Expected byte 15 : 10000001
Found : 11111111111111111111111110000001
Expected byte 16 : 10111011
Found : 11111111111111111111111110111011
Expected byte 17 : 1000101
Found : 1000101
Expected byte 18 : 110010
Found : 110010
Expected byte 19 : 1100011
Found : 1100011
Expected byte 20 : 100110
Found : 100110
Expected byte 21 : 1110011
Found : 1110011
Expected byte 22 : 1111010
Found : 1111010
Expected byte 23 : 111110
Found : 111110
Expected byte 24 : 1111
Found : 1111
Expected byte 25 : 1010101
Found : 1010101
Expected byte 26 : 11110
Found : 11110
Expected byte 27 : 1100101
Found : 1100101
Expected byte 28 : 1111100
Found : 1111100
Expected byte 29 : 111100
Found : 111100
Expected byte 30 : 1111011
Found : 1111011
Expected byte 31 : 111001
Found : 111001

As you can see, the bytes aren't unsigned well by the & 0xFF.

@jjfumero
Copy link
Member

Ok, thanks for the update. I would need to analyse the generated code as well as if there any compiler phase that changes this. We will take a look.

@jjfumero jjfumero added compiler bug Something isn't working labels Jun 21, 2024
@Benco11-developement
Copy link
Author

Benco11-developement commented Aug 6, 2024

Hello, after looking at the following generated kernel

.version 7.6 
.target sm_61 
.address_size 64 

.visible .entry s0_t0_unsignedbyte_arrays_bytearray_arrays_intarray_arrays_intarray_32(.param .u64 .ptr .global .align 8 kernel_context, .param .u64 .ptr .global .align 8 a, .param .u64 .ptr .global .align 8 result, .param .u64 .ptr .global .align 8 controlResult, .param .align 8 .u64 size) {
	.reg .s8 rsb<3>;
	.reg .u64 rud<9>;
	.reg .s64 rsd<4>;
	.reg .s32 rsi<10>;
	.reg .pred rpb<2>;
	.reg .u32 rui<5>;

BLOCK_0:
	ld.param.u64	rud0, [kernel_context];
	ld.param.u64	rud1, [a];
	ld.param.u64	rud2, [result];
	ld.param.u64	rud3, [controlResult];
	mov.u32	rui0, %nctaid.x;
	mov.u32	rui1, %ntid.x;
	mul.wide.u32	rud4, rui0, rui1;
	cvt.s32.u64	rsi0, rud4;
	mov.u32	rui2, %tid.x;
	mov.u32	rui3, %ctaid.x;
	mad.lo.s32	rsi1, rui3, rui1, rui2;

BLOCK_1:
	mov.s32	rsi2, rsi1;
LOOP_COND_1:
	setp.lt.s32	rpb0, rsi2, 32;
	@!rpb0 bra	BLOCK_3;

BLOCK_2:
	add.s32	rsi3, rsi2, 24;
	cvt.s64.s32	rsd0, rsi3;
	add.u64	rud5, rud1, rsd0;
	ld.global.s8	rsb0, [rud5];
	add.s32	rsi4, rsi2, 6;
	cvt.s64.s32	rsd1, rsi4;
	shl.b64	rsd2, rsd1, 2;
	add.u64	rud6, rud2, rsd2;
	cvt.s32.s8	rsi5, rsb0;
	st.global.s32	[rud6], rsi5;
	ld.global.s8	rsb1, [rud5];
	add.u64	rud7, rud3, rsd2;
	cvt.s32.s8	rsi6, rsb1;
	shl.b32	rsi7, rsi6, 1;
	st.global.s32	[rud7], rsi7;
	add.s32	rsi8, rsi0, rsi2;
	mov.s32	rsi2, rsi8;
	bra.uni	LOOP_COND_1;

BLOCK_3:
	ret;
}

I saw that the result int rsi5 is signed, explaining why we obtain this error, because the byte is not zero-extended unlike it would be on Java. I also tried using Byte.toUnsignedInt(a.get(i)) instead of a.get(i) & 0xFF but without success, the generated kernel still uses a signed int. Is there any way to force the usage of an unsigned type ?

@jjfumero
Copy link
Member

jjfumero commented Aug 8, 2024

Thanks for the report. It is possible that the code gen generates the wrong signed value. TornadoVM generates the code as it sees in the Graal IR / Tornado IR, but we might have missed something. Have you check with any other backend, like OpenCL or SPIR-V?

@Benco11-developement
Copy link
Author

The problem is the same in OpenCL :

#pragma OPENCL EXTENSION cl_khr_fp64 : enable  
#pragma OPENCL EXTENSION cl_khr_fp16 : enable  
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable  
__kernel void unsignedByte(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *a, __global uchar *result, __global uchar *controlResult, __private int size)
{
  ulong ul_8, ul_13, ul_1, ul_2, ul_0, ul_16; 
  long l_7, l_11, l_12; 
  char ch_9, ch_15; 
  int i_5, i_6, i_3, i_19, i_4, i_10, i_14, i_17, i_18; 

  // BLOCK 0
  ul_0  =  (ulong) a;
  ul_1  =  (ulong) result;
  ul_2  =  (ulong) controlResult;
  i_3  =  get_global_size(0);
  i_4  =  get_global_id(0);
  // BLOCK 1 MERGES [0 2 ]
  i_5  =  i_4;
  for(;i_5 < 32;)
  {
    // BLOCK 2
    i_6  =  i_5 + 24;
    l_7  =  (long) i_6;
    ul_8  =  ul_0 + l_7;
    ch_9  =  *((__global char *) ul_8);
    i_10  =  i_5 + 6;
    l_11  =  (long) i_10;
    l_12  =  l_11 << 2;
    ul_13  =  ul_1 + l_12;
    i_14  =  ch_9;
    *((__global int *) ul_13)  =  i_14;
    ch_15  =  *((__global char *) ul_8);
    ul_16  =  ul_2 + l_12;
    i_17  =  (int) ch_15;
    i_18  =  i_17 << 1;
    *((__global int *) ul_16)  =  i_18;
    i_19  =  i_3 + i_5;
    i_5  =  i_19;
  }  // B2
  // BLOCK 3
  return;
}  //  kernel

i_14 is a signed int, and thus, when converting from the signed byte ch_9, the sign is kept. And because of that, the results are different from Java. Moreover, after adding an & 0xFF to unsign the results computed by the kernel (like this Integer.toBinaryString(actualResult.get(i-1) & 0xFF)), the computed results match the expected results. So the problem is not that the signed value is wrong, but it is the fact that the result int, because it is signed itself, keeps the sign of the original byte. A solution to this could be to use an unsigned int, but I don't know if such a type exists in the current API and if it is recognized by the Tornado/Graal compiler. Manually implementing a two's complement could also be a solution, but I suppose it would lead to worse performance, and it would be a pain to implement on PTX backend because of the inability to use the not operator (see #339).

@Benco11-developement
Copy link
Author

I'm stupid... I forgot that char is unsigned. The problem can be solved by using char types instead of bytes. But that doesn't change the inconsistency between the behavior of Java conversion and OpenCL/Cuda conversion. Also, for longer numerical types such as short, int or long, the problem remains and I don't know of a way to unsign them in the compiled kernel.

@jjfumero
Copy link
Member

jjfumero commented Aug 9, 2024

I think it should be a way to register a TornadoVM/Graal plugin to do so, at least to force it from an API. We haven't look at this yet, but we will. A temporary solution could be to use the TornadoVM prebuilt API in which you can pass a directly the GPU code:

https://github.com/beehive-lab/TornadoVM/blob/master/tornado-unittests/src/main/java/uk/ac/manchester/tornado/unittests/prebuilt/PrebuiltTest.java#L68-L122

@jjfumero
Copy link
Member

Hi @Benco11-developement , we just merged a fix for this. You can take a look at our tests:
https://github.com/beehive-lab/TornadoVM/blob/master/tornado-unittests/src/main/java/uk/ac/manchester/tornado/unittests/numpromotion/TestZeroExtend.java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working compiler
Projects
None yet
Development

No branches or pull requests

3 participants