**Is your feature request related to a problem? Please describe.**
`get_json_object` and `multi-get_json_object` currently allocate a temporary output buffer that is the same size as the input buffer. This speeds up processing, because computing the exact size of the output before running the command is very expensive; instead, the result is fixed up after the expensive kernel runs with a much cheaper operation. The size allocated is not the worst case for the output, but it is close to it. When doing performance testing on the `multi-get_json_object` command it was clear that the more paths we could add, the better the performance got.
But we are limited by the amount of GPU memory available to a given task, and the heuristic used in the plugin to avoid OOM errors when combining paths together puts us, for this one use case, close to a parallelism of 8. It would be really great if we could find some ways to reduce that memory usage.
**Describe the solution you'd like**
At a minimum we could look at the path and reduce the amount of memory used by something close to `path length - 1`. The reasoning behind this is that a null is output if a path does not match anything, but if it does match, then the path itself is effectively removed from the answer.
For example, if the path

```
$.this.is.a.test
```

is matched up against

```
{"this":{"is":{"a":{"test":"This is a value string"}}}}
```

the result is

```
This is a value string
```
Of the 55 bytes in the input string, only 22 ended up in the output; `"this"`, `"is"`, `"a"`, and `"test"` were all stripped out.
If we want to be even more exact about this, the algorithm would be something like: for a named `PathInstructionJni`, the amount removed from a matched JSON string would be 5 + the size of the name encoded in UTF-8. The 5 extra bytes removed are the leading `{`, the opening `"` for the key name, the closing `"`, the `:` separator, and finally the trailing `}` character. There could be even more removed, but this is fairly minimal. Also, if we look at escaping in the JSON key, we know that the amount removed would be at least as large as the key passed in. Most escape sequences take 2 bytes to encode a single byte, and even a `\u` escape takes 6 bytes to encode a character that is at most 4 bytes long in UTF-8.
For non-wildcard array values the amount removed would be at least 2: one for an opening `[` and another for a closing `]`.
For wildcard values there is no guarantee of anything being removed, because the `[` and `]` characters may be kept.
Note that this means a path of `$` would have to have the same output as the input, and in the worst case it might be larger.
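The per-instruction bound above can be sketched as a small helper. This is only an illustration; the `(kind, value)` tuple representation and the `min_bytes_removed` name are hypothetical, not the actual JNI types:

```python
def min_bytes_removed(path_instructions):
    """Lower bound on bytes guaranteed to be removed from a matched row.

    Each instruction is a hypothetical (kind, value) tuple:
      - a named key removes at least 5 + len(UTF-8 name) bytes
        (the leading {, the opening ", the closing ", the :, and the trailing })
      - a non-wildcard array index removes at least 2 bytes ([ and ])
      - a wildcard guarantees no savings at all
    """
    removed = 0
    for kind, value in path_instructions:
        if kind == "named":
            removed += 5 + len(value.encode("utf-8"))
        elif kind == "index":
            removed += 2
        # "wildcard" (and a bare $ path): nothing guaranteed
    return removed

# $.this.is.a.test: (5+4) + (5+2) + (5+1) + (5+4) = 31 bytes removed, so the
# 55-byte example above needs an output allocation of at most 55 - 31 = 24.
path = [("named", k) for k in ("this", "is", "a", "test")]
```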
A few other options that are somewhat less safe, but probably a lot more effective, are:
Just guess. Most of the time we are pulling back a single value (long, boolean, double, string) that is not very long, especially compared to the size of the input. We could guess that the output is at most something like 1 KiB per row, and fall back to another option if 1 KiB appears to be too large (a.k.a. the average input length is <= 1 KiB, or the smaller of 1 KiB and the length minus the original heuristic).
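As a sketch of this guessing option, the per-row allocation could combine the 1 KiB cap with the path-based bound. The names and the exact fallback policy here are assumptions for illustration, not the plugin's real heuristic:

```python
GUESS_BYTES_PER_ROW = 1024  # the 1 KiB guess from above

def guessed_output_budget(row_lengths, min_removed):
    """Per-row output allocation under the 'just guess' option.

    For each row, take the smaller of the 1 KiB guess and the input
    length minus the guaranteed path savings (min_removed, computed
    from the path as described earlier). Hypothetical helper.
    """
    return [min(GUESS_BYTES_PER_ROW, max(0, n - min_removed))
            for n in row_lengths]
```

For the 55-byte example row the path bound (24 bytes) wins; for a long row the 1 KiB cap wins.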
Run a stripped-down kernel that looks for `["']FINAL_KEY["']:`, where `FINAL_KEY` is the name of the final named `PathInstructionJni` in the query. It would have to do some basic escape processing when doing the match, and then it would try to calculate the length of the value that could be returned. We would probably add up all of those lengths if there are multiple matches, just in case. We would need to be able to match opening and closing `{}`, `[]`, and `"` or `'`, but we would not need to actually do validation of any kind on them.
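A host-side sketch of what such a stripped-down scan might compute for one row follows. A real implementation would be a GPU kernel; the quote and escape handling here is deliberately minimal, and the function name is made up:

```python
def estimate_value_lengths(doc, final_key):
    """Rough upper bound on the output size for one input row.

    Scans for "FINAL_KEY": (double or single quoted), then measures the
    span of the value that follows by tracking {} / [] depth and skipping
    over quoted strings, with no real JSON validation. Sums the lengths
    of all matches, just in case the path hits more than one.
    """
    total = 0
    i = 0
    needles = ['"%s":' % final_key, "'%s':" % final_key]
    while i < len(doc):
        hit = next((n for n in needles if doc.startswith(n, i)), None)
        if hit is None:
            i += 1
            continue
        j = i + len(hit)
        start, depth = j, 0
        while j < len(doc):
            c = doc[j]
            if c in "\"'":
                # skip a quoted string, honoring backslash escapes
                quote = c
                j += 1
                while j < len(doc) and doc[j] != quote:
                    j += 2 if doc[j] == "\\" else 1
            elif c in "{[":
                depth += 1
            elif c in "}]":
                if depth == 0:
                    break
                depth -= 1
            elif c == "," and depth == 0:
                break
            j += 1
        total += j - start
        i = j
    return total
```

Note the bound is loose on purpose: for the example row it counts the quoted string (24 bytes) even though the final output is the unquoted 22 bytes.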
In all of these cases it is now much more likely that we are going to get a result that is longer than we expected. We should make sure that we can recover from that properly, and try to see how quickly we can recover.
We probably also want to provide an API for this estimation, because we need the plugin code to get a good idea of how much parallelism it should try for. That, or we need a version of the API where we pass down a memory budget along with all of the paths, and then the JNI code handles the rest.
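The budget-passing variant of the API could look something like the following greedy grouping. The function name, the per-path cost inputs, and the greedy strategy are all hypothetical, just to show the shape of the contract:

```python
def group_paths_by_budget(path_costs, budget_bytes):
    """Split paths into groups so each multi-get call fits a memory budget.

    path_costs is a list of (path, estimated_scratch_bytes) pairs, where
    the estimate could come from any of the schemes described above.
    Greedily packs paths in order until the budget would be exceeded.
    """
    groups, current, used = [], [], 0
    for path, cost in path_costs:
        if current and used + cost > budget_bytes:
            groups.append(current)
            current, used = [], 0
        current.append(path)
        used += cost
    if current:
        groups.append(current)
    return groups
```

With this shape the caller only supplies a budget, and the grouping (and therefore the achieved parallelism) is decided on the JNI side.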
To be clear, unless the recovery is super fast I don't think that guessing is a good option. Even though it would totally work for the average row, I suspect that we are going to be closer to something like 80% for the average batch, which is not great.