Skip to content

Hash Join

Guodong Jin edited this page Apr 8, 2021 · 6 revisions

Related PRs/Issues:

Hash Join Build

Storage Layout

The storage of a simple hash table is divided into two parts: tuple blocks (htBlocks) and directory (htDirectory). Tuple blocks hold all materialized tuples, and the directory contains slots with pointers. Inside tuple blocks, each tuple reserves a pointer space pointing to a previous tuple within the same hash slot, in this way, we implement a chain-based collision handling. For each tuple, the storage layout is: |key|payloads|prev|. Inside the tuple payloads, we might have variable-length items, whose content will be stored separately in the overflowBlocks.

The materialization of build side tuples

Hash (based on the join key) can happen on both a flat vector and a list. For the join key, we only consider equality of nodeID for now (no join on property values, and no multiple join conditions).

Query example for join on a flat vector: (a)->(b)->(c)->(d)->(e) WHERE a.id=1 AND e.id=1. this is a bidirectional query, hash join happens on c.id. and the build side is the subquery (a)->(b)->(c). which starts from SCAN(a), then EXTEND(b) and EXTEND(c). thus, c might be a list chunk.

Query example for join on a list: (a)->(b)->(c), (b)->(d)->(e). the hash join happens on b.id, and the build side is the subquery (a)->(b)->(c), which starts from SCAN(a), then EXTEND(b). thus, b might be a flat chunk.

During materialization, there are two cases for a list chunk:

  • If it's the key, we would duplicate tuples in build chunks;
  • Else, each list vector is treated as a variable-sized value in the tuple.

Hash Join Probe

Construction of the output dataChunks:

  • probe side non-key data chunk stay unchanged. (flat will be flat, unflat will be unflat)
  • build side key data chunk and probe side key data chunk are merged. (flat/unflat depends on build side non-key data chunks)
  • build side non-key data chunk: if flat, merge into key; else, keep as separate data chunks
  • if there is no unflat non-key data chunk from build side, we always keep key data chunk as unflat, because all variables/data chunks from build side are merged into the key data chunk