Sorted dict #14

Draft
wants to merge 77 commits into
base: main

Collaborator

@amascolo amascolo commented Oct 27, 2024

Adds support for sort-based and hybrid queries (i.e. queries using SortedDict and Range). Includes all queries ported over to SDQL.

@amascolo amascolo self-assigned this Oct 27, 2024
@amascolo
Collaborator Author

Details that came up during implementation:

  1. added a Timer expression in SDQL to mark where timing should start, since benchmarks must exclude the initial data sorting time (a nice side effect is that load expressions no longer need to be treated as a special case)

  2. added external function SortedIndices - to sort the initial data

  3. added external function SortedVec - to sort the tries

  4. special case to call emplace_back: when constructing @vec { <...> -> 1 } we interpret the relational form <...> -> 1 as i -> <...> and append elements to a std::vector

  5. special case to call at: on a SortedDict we can't use [ ], which has special insertion semantics (this wasn't an issue previously for vecdict, since after construction we only iterated over it, i.e. set semantics, whereas here we access individual elements)

  6. cache the last found element, to avoid calling SortedDict::find twice when we check with contains and then retrieve the element using at (in handwritten C++ we could assign the result of find, i.e. an iterator, to a variable, but this isn't representable in SDQL)
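
Points 5 and 6 can be illustrated with a small C++ sketch. It assumes a hypothetical `SortedDict` backed by a sorted `std::vector` of key-value pairs; the class layout, the binary search via `std::lower_bound`, and the names `append`, `last_` and `searches` are illustrative assumptions, not the runtime added in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the runtime's SortedDict (not the PR's class).
// Keys are kept sorted, lookups use binary search, and the position of the
// last successful lookup is cached.
template <typename K, typename V>
class SortedDict {
public:
    // Assumes keys arrive in ascending order (the data was sorted up front).
    void append(K key, V value) {
        data_.emplace_back(std::move(key), std::move(value));
    }

    // Membership test; remembers where the key was found.
    bool contains(const K& key) const {
        auto it = std::lower_bound(
            data_.begin(), data_.end(), key,
            [](const std::pair<K, V>& kv, const K& k) { return kv.first < k; });
        ++searches_;
        if (it != data_.end() && it->first == key) {
            last_ = static_cast<std::size_t>(it - data_.begin());
            return true;
        }
        last_ = npos;
        return false;
    }

    // Read-only access: unlike operator[] on std::map, this never inserts.
    const V& at(const K& key) const {
        // Reuse the cached position when it matches, skipping a second search.
        if (last_ != npos && data_[last_].first == key) return data_[last_].second;
        if (!contains(key)) throw std::out_of_range("key not found");
        return data_[last_].second;
    }

    std::size_t searches() const { return searches_; }  // binary searches run so far
    std::size_t size() const { return data_.size(); }

private:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);
    std::vector<std::pair<K, V>> data_;
    mutable std::size_t last_ = npos;
    mutable std::size_t searches_ = 0;
};
```

With this sketch, a `contains(k)` check followed by `at(k)` performs a single binary search, which mirrors what handwritten C++ gets by keeping the iterator returned from `find`.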

@amascolo
Collaborator Author

Have ported query FJ 3a to SDQL and checked its performance is identical to handwritten C++.

Codegen and runtime should work everywhere; we just need to generate all the other queries in SDQL.

case DictNode(seq, hint) =>
case DictNode(Nil, _) => raise("Type inference needs backtracking to infer empty type { }")
// @vec { <...> -> 1 } treats the relational form <...> -> 1 as a mapping i -> <...>
case DictNode(Seq((r: RecNode, Const(1))), hint @ Vec(_)) => DictType(IntType, run(r), hint)
Collaborator

It's unclear why you need to return this type. Is it because you later look it up by index? Otherwise, this makes the type system very complicated.

Collaborator Author

Yes. It's used by the "pure" sorting FJ queries in progs/job/sorting/fj.

For example, in query 3a the last line interm0(i).col0 is a lookup by index:

let interm0_unsort = sum(<mk_off, _> <- range(mk.size))
  let x0 = mk.movie_id(mk_off) in
    if (x0 ∈ t_trie0) then
      let t_trie1 = t_trie0(x0) in
      sum(<t_i, _> <- t_trie1)
        let t_off = t_offsets(t_i)
        @vec { <col0=mk.movie_id(mk_off), col1=mk.keyword_id(mk_off), col2=t.title(t_off)> -> 1 }

let interm0 = ext(`SortedVec`, 0, interm0_unsort)

let interm0_trie0 = sum(<i, _> <- range(ext(`Size`, interm0)))
  @st(ext(`Size`, interm0)) { interm0(i).col0 -> @range { i -> 1 } }
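
As a rough C++ analogue of the last two steps of this query (sort the appended rows, then index them by position), the sketch below may help. The `Row` field types, the `Range` struct and the function names are assumptions made for illustration; the code generated by this PR may look quite different.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Row type mirroring <col0, col1, col2> from the query (field types assumed).
struct Row {
    int col0;
    int col1;
    std::string col2;
};

// Half-open index range [begin, end) into the sorted vector.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Rough analogue of ext(`SortedVec`, 0, v): sort rows by column 0.
inline void sort_by_col0(std::vector<Row>& rows) {
    std::stable_sort(rows.begin(), rows.end(),
                     [](const Row& a, const Row& b) { return a.col0 < b.col0; });
}

// Rough analogue of building interm0_trie0: because the rows are sorted, each
// key maps to one contiguous range of indices i, read via rows[i].col0.
inline std::vector<std::pair<int, Range>> build_trie(const std::vector<Row>& rows) {
    std::vector<std::pair<int, Range>> trie;
    trie.reserve(rows.size());
    for (std::size_t i = 0; i < rows.size(); ++i) {
        if (!trie.empty() && trie.back().first == rows[i].col0) {
            trie.back().second.end = i + 1;  // extend the current key's range
        } else {
            trie.emplace_back(rows[i].col0, Range{i, i + 1});  // start a new key
        }
    }
    return trie;
}
```

The lookup `rows[i].col0` is the by-index access that motivates typing `@vec { <...> -> 1 }` as a mapping from an integer index to the record.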

@@ -222,3 +224,6 @@ sealed trait LLQL
case class Initialise(tpe: Type, e: Exp) extends Exp with LLQL
case class Update(e: Exp, agg: Aggregation, dest: Sym) extends Exp with LLQL
case class Modify(e: Exp, dest: Sym) extends Exp with LLQL

/** Marks which section of the program to time in benchmarks */
case class Timer(exp: Exp) extends Exp
Collaborator

What does exp contain here? The entire program that will be timed?

Collaborator Author

exp contains the rest of the program from the point where the timer expression appears. That is exactly what we want to time, since the initial load and sorting expressions must be excluded.

This is simplest for now. In future, we could add more fine-grained control of the timer, e.g. timer_start / timer_elapsed / timer_stop to time different subsections of the program.
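
To make the intended effect concrete, here is a minimal C++ sketch of timing only the tail of a program. The `Timed` struct and `run_query` function are hypothetical, and the sort and sum merely stand in for the untimed prelude and the timed query; this is not the code emitted for `Timer`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <numeric>
#include <vector>

// Hypothetical result type: the query's value plus the time spent in the
// timed section only.
struct Timed {
    long long result;
    std::chrono::steady_clock::duration elapsed;
};

inline Timed run_query(std::vector<int> data) {
    // Untimed prelude: stands in for the load and initial sorting expressions.
    std::sort(data.begin(), data.end());

    // Timer boundary: everything after this point is measured.
    auto start = std::chrono::steady_clock::now();
    long long total = std::accumulate(data.begin(), data.end(), 0LL);
    auto stop = std::chrono::steady_clock::now();

    return Timed{total, stop - start};
}
```

Using `std::chrono::steady_clock` rather than `system_clock` keeps the measurement monotonic, which is the usual choice for benchmarks.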

@amascolo
Collaborator Author

@amirsh this PR is now ready to merge, subject to your approval:

  • Supports all sort-based and hybrid JOB queries in SDQL ✅
  • Performance within 2% of C++ (no regression on hash-based queries)

@amascolo amascolo requested a review from amirsh November 28, 2024 16:19
@amascolo amascolo marked this pull request as draft November 29, 2024 13:15
3 participants