
Parallel pipeline #2779

Merged
90 commits merged into main from parallel-pipeline on May 31, 2024

Conversation

Collaborator
@janmasrovira janmasrovira commented May 13, 2024

This PR introduces parallelism in the pipeline to improve performance. I've included benchmarks at the end.

Flags:

There are two new global flags:

  1. -N / --threads. It sets the number of capabilities. According to the GHC documentation: "Set the number of Haskell threads that can run truly simultaneously (on separate physical processors) at any given time. When compiling in parallel, we create this many worker threads." The default value is -N auto, which sets -N to half the number of logical cores, capped at 8 (see the sketch after this list).
  2. --dev-show-thread-ids. When given, the thread id is printed in the compilation progress log, e.g.
    [screenshot: compilation progress log with thread ids]
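
As an illustration of the -N auto default, here is a minimal sketch (the helper name is hypothetical; the actual option handling lives elsewhere in the CLI code):

import qualified GHC.Conc as GHC

-- Hypothetical helper for the -N auto default: half the logical cores,
-- at least 1 and capped at 8.
defaultNumThreads :: IO Int
defaultNumThreads = do
  procs <- GHC.getNumProcessors
  return (max 1 (min 8 (procs `div` 2)))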

Parallel compilation

  1. I've added src/Parallel/ParallelTemplate.hs, which contains all the concurrency-related code (see the sketch after this list). I think it is good to keep this code separate from the actual compiler code.
  2. I've added a progress log (only for the parallel driver) that reports compilation progress, similar to what stack/cabal do.
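
To give an idea of what the driver in ParallelTemplate does, here is a rough sketch (all names are hypothetical stand-ins; the real code runs inside effectful handlers, waits for every module to finish, and propagates errors). A module is enqueued once all of its imports are compiled, and a fixed number of workers drain the queue:

import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forM_, forever, replicateM_, when)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type ModulePath = FilePath

-- Hypothetical sketch: schedule compilation of an import tree in parallel.
compileInParallel ::
  Int ->                         -- number of worker threads
  Map ModulePath [ModulePath] -> -- reverse dependencies: module -> its dependents
  Map ModulePath Int ->          -- number of not-yet-compiled imports per module
  (ModulePath -> IO ()) ->       -- compile a single module
  IO ()
compileInParallel n revDeps pending0 compile = do
  queue <- newTQueueIO
  pending <- newTVarIO pending0
  -- Seed the queue with modules that import nothing.
  forM_ (Map.keys (Map.filter (== 0) pending0)) (atomically . writeTQueue queue)
  replicateM_ n . forkIO . forever $ do
    m <- atomically (readTQueue queue)
    compile m
    -- A finished module unlocks any dependent whose imports are now all done.
    forM_ (Map.findWithDefault [] m revDeps) $ \dep -> atomically $ do
      c <- subtract 1 . Map.findWithDefault 1 dep <$> readTVar pending
      modifyTVar' pending (Map.insert dep c)
      when (c == 0) (writeTQueue queue dep)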

Code changes:

  1. I've removed the setup stage where we were registering dependencies. Instead, the dependencies are registered the first time the pathResolver runs, which is safer.
  2. Now the ImportTree is needed to run the pipeline. Cycles are detected during the construction of this tree, so I've removed Reader ImportParents from the pipeline.
  3. For the package pathresolver, we do not support parallelism yet (we could add support in the future, but the gains would be small).
  4. With -N1, the pipeline remains unchanged, so performance should be the same as on the main branch (except for a small degradation due to adding the -threaded flag).
  5. I've introduced PipelineOptions, which are used to pass options to the effects in the pipeline.
  6. The PathResolver constraint has been removed from the upTo* functions in the pipeline because it was redundant.
  7. I've added a lot of NFData instances. They are needed to force the full evaluation of Stored.ModuleInfo in each of the threads.
  8. The Cache effect uses SharedState as opposed to LocalState. Perhaps we should provide different versions.
  9. I've added a Cache handler that accepts a setup function. The setup is triggered when a miss is detected and is used to lazily compile the modules in parallel (see the sketch after this list).
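
As a rough illustration of item 9 (hypothetical names and types; the actual handler is an interpreter for the Cache effect), here is a shared cache whose misses trigger a setup action. The value is fully forced before it is stored, which is also why the NFData instances of item 7 are needed:

import Control.Concurrent.MVar
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Hypothetical sketch of a cache with a setup-on-miss hook.
cacheLookup ::
  (Ord k, NFData v) =>
  MVar (Map k v) -> -- state shared by all threads (cf. SharedState)
  (k -> IO ()) ->   -- setup action run on a miss (kicks off compilation)
  (k -> IO v) ->    -- compute the value for a key
  k ->
  IO v
cacheLookup cache setup compute k = do
  hit <- Map.lookup k <$> readMVar cache
  case hit of
    Just v -> return v
    Nothing -> do
      setup k
      -- Force full evaluation before sharing the result across threads.
      v <- evaluate . force =<< compute k
      modifyMVar_ cache (return . Map.insert k v)
      return v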

Tests

  1. I've adapted the smoke test suite to ignore the progress log in stderr.
  2. I've had to adapt tests/positive/Internal/Lambda.juvix. Due to laziness, a crash happening in this file was not being caught. The problem is that the file contains a lambda whose clauses have differing numbers of patterns, which we currently do not support (Functions with clauses with differing number of patterns type-check but are not correctly compiled #1706).
  3. I've had to comment out the definition
    x : Box ((A : Type) → A → A) := box λ {A a := a};
    
    from the test, as it was causing a crash (Bug in typechecking (inference generates ill-scoped terms) #2247).

Future Work

  1. It should be investigated how much performance we lose by fully evaluating the Stored.ModuleInfo, since some information in it will be discarded. It may be possible to be more fine-grained when forcing evaluation.
  2. The scanning of imports to build the import tree is sequential. Currently, we build the import tree from the entry point module, so only the modules transitively imported from it are in the tree. However, we have discussed that at some point we should distinguish between Juvix the compiler and Juvix the build tool. When using Juvix as a build tool, it makes sense to typecheck/compile (to stored core) all modules in the project. When/if we do this, scanning the imports of all modules in parallel becomes trivial.
  3. The implementation of the ParallelTemplate uses low-level primitives such as forkIO. At some point it should be refactored to use safer functions from the Effectful.Concurrent.Async module (see the sketch after this list).
  4. The number of cores and worker threads we spawn is determined by the command line. Ideally, we could use the import tree to compute an upper bound on the ideal number of cores to use.
  5. We could add an animation that displays which modules are being compiled in parallel and which have finished being compiled.
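
For item 3, here is a tiny sketch of the safer, structured style using the async package (Effectful.Concurrent.Async mirrors this API for the Eff monad, as far as I know):

import Control.Concurrent.Async (mapConcurrently_)

-- Unlike raw forkIO, mapConcurrently_ waits for all workers, cancels the
-- remaining ones if any worker throws, and rethrows that exception here.
compileAll :: (FilePath -> IO ()) -> [FilePath] -> IO ()
compileAll = mapConcurrently_

-- Toy usage with a stand-in "compiler":
main :: IO ()
main = compileAll (\p -> putStrLn ("compiling " ++ p)) ["A.juvix", "B.juvix"]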

Benchmarks

On some benchmarks I include the GHC runtime option -A, which sets the garbage collector's allocation area size and can have a significant impact on performance. Thanks to @paulcadman for pointing this out. I've found a good combination of -N and -A through trial and error (but this obviously depends on the CPU and the Juvix project).

Typecheck the standard library

Clean run (88% faster than main):

 hyperfine --warmup 1 --prepare 'juvix clean' 'juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432'  'juvix -N 4 typecheck Stdlib/Prelude.juvix' 'juvix-main typecheck Stdlib/Prelude.juvix'
Benchmark 1: juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432
  Time (mean ± σ):     444.1 ms ±   6.5 ms    [User: 1018.0 ms, System: 77.7 ms]
  Range (min … max):   432.6 ms … 455.9 ms    10 runs

Benchmark 2: juvix -N 4 typecheck Stdlib/Prelude.juvix
  Time (mean ± σ):     628.3 ms ±  23.9 ms    [User: 1227.6 ms, System: 69.5 ms]
  Range (min … max):   584.7 ms … 670.6 ms    10 runs

Benchmark 3: juvix-main typecheck Stdlib/Prelude.juvix
  Time (mean ± σ):     835.9 ms ±  12.3 ms    [User: 788.5 ms, System: 31.9 ms]
  Range (min … max):   816.0 ms … 853.6 ms    10 runs

Summary
  juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432 ran
    1.41 ± 0.06 times faster than juvix -N 4 typecheck Stdlib/Prelude.juvix
    1.88 ± 0.04 times faster than juvix-main typecheck Stdlib/Prelude.juvix

Cached run (43% faster than main):

hyperfine --warmup 1 'juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432'  'juvix -N 4 typecheck Stdlib/Prelude.juvix' 'juvix-main typecheck Stdlib/Prelude.juvix'
Benchmark 1: juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432
  Time (mean ± σ):     241.3 ms ±   7.3 ms    [User: 538.6 ms, System: 101.3 ms]
  Range (min … max):   231.5 ms … 251.3 ms    11 runs

Benchmark 2: juvix -N 4 typecheck Stdlib/Prelude.juvix
  Time (mean ± σ):     235.1 ms ±  12.0 ms    [User: 405.3 ms, System: 87.7 ms]
  Range (min … max):   216.1 ms … 253.1 ms    12 runs

Benchmark 3: juvix-main typecheck Stdlib/Prelude.juvix
  Time (mean ± σ):     336.7 ms ±  13.3 ms    [User: 269.5 ms, System: 67.1 ms]
  Range (min … max):   316.9 ms … 351.8 ms    10 runs

Summary
  juvix -N 4 typecheck Stdlib/Prelude.juvix ran
    1.03 ± 0.06 times faster than juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432
    1.43 ± 0.09 times faster than juvix-main typecheck Stdlib/Prelude.juvix

Typecheck the test suite of the containers library

At the moment this is the biggest Juvix project that we have.

Clean run (105% faster than main):

hyperfine --warmup 1 --prepare 'juvix clean' 'juvix -N 6 typecheck Main.juvix +RTS -A67108864' 'juvix -N 4 typecheck Main.juvix' 'juvix-main typecheck Main.juvix'
Benchmark 1: juvix -N 6 typecheck Main.juvix +RTS -A67108864
  Time (mean ± σ):      1.006 s ±  0.011 s    [User: 2.171 s, System: 0.162 s]
  Range (min … max):    0.991 s …  1.023 s    10 runs

Benchmark 2: juvix -N 4 typecheck Main.juvix
  Time (mean ± σ):      1.584 s ±  0.046 s    [User: 2.934 s, System: 0.149 s]
  Range (min … max):    1.535 s …  1.660 s    10 runs

Benchmark 3: juvix-main typecheck Main.juvix
  Time (mean ± σ):      2.066 s ±  0.010 s    [User: 1.939 s, System: 0.089 s]
  Range (min … max):    2.048 s …  2.077 s    10 runs

Summary
  juvix -N 6 typecheck Main.juvix +RTS -A67108864 ran
    1.57 ± 0.05 times faster than juvix -N 4 typecheck Main.juvix
    2.05 ± 0.03 times faster than juvix-main typecheck Main.juvix

Cached run (54% faster than main):

hyperfine --warmup 1 'juvix -N 6 typecheck Main.juvix +RTS -A33554432'  'juvix -N 4 typecheck Main.juvix' 'juvix-main typecheck Main.juvix'
Benchmark 1: juvix -N 6 typecheck Main.juvix +RTS -A33554432
  Time (mean ± σ):     551.8 ms ±  13.2 ms    [User: 1419.8 ms, System: 199.4 ms]
  Range (min … max):   535.2 ms … 570.6 ms    10 runs

Benchmark 2: juvix -N 4 typecheck Main.juvix
  Time (mean ± σ):     636.7 ms ±  17.3 ms    [User: 1006.3 ms, System: 196.3 ms]
  Range (min … max):   601.6 ms … 655.3 ms    10 runs

Benchmark 3: juvix-main typecheck Main.juvix
  Time (mean ± σ):     847.2 ms ±  58.9 ms    [User: 710.1 ms, System: 126.5 ms]
  Range (min … max):   731.1 ms … 890.0 ms    10 runs

Summary
  juvix -N 6 typecheck Main.juvix +RTS -A33554432 ran
    1.15 ± 0.04 times faster than juvix -N 4 typecheck Main.juvix
    1.54 ± 0.11 times faster than juvix-main typecheck Main.juvix

@janmasrovira janmasrovira self-assigned this May 13, 2024
@janmasrovira janmasrovira force-pushed the parallel-pipeline branch 4 times, most recently from fb5cbf8 to 6fcb89a on May 19, 2024 17:33
@janmasrovira janmasrovira force-pushed the parallel-pipeline branch 7 times, most recently from 5e083b7 to cd32788 on May 24, 2024 10:57
@paulcadman paulcadman self-requested a review May 30, 2024 09:17
@janmasrovira janmasrovira marked this pull request as ready for review May 30, 2024 13:33
@@ -193,6 +193,12 @@ executables:
- string-interpolate == 0.3.*
verbatim:
default-language: GHC2021
ghc-options:
- -threaded
- -rtsopts
Collaborator

@paulcadman paulcadman May 30, 2024

Is -rtsopts required to be able to pass the -A flag? It's strange: I seem to be able to set some RTS flags with the default option.

GHC gives a security warning about using this option, so we need to be careful.

https://downloads.haskell.org/ghc/latest/docs/users_guide/phases.html#ghc-flag--rtsopts[=⟨none|some|all|ignore|ignoreAll⟩]

"""
In GHC 6.12.3 and earlier, the default was to process all RTS options. However, since RTS options can be used to write logging data to arbitrary files under the security context of the running program, there is a potential security problem. For this reason, GHC 7.0.1 and later default to -rtsopts=some.
"""

Collaborator Author

I've added a comment

NumThreads i -> return i
NumThreadsAuto -> do
  nc <- liftIO GHC.getNumCapabilities
  return (max 1 (min 6 (nc - 2)))
Collaborator

Is this based on the benchmark experiments you've done?

Collaborator Author

@janmasrovira janmasrovira May 31, 2024

It is. I have now refined it a bit based on our previous conversation. Basically, we'll use the minimum of half the number of processors and 8 (a magic number that I've found to be an OK limit). @paulcadman mentioned that there is a bug in GHC where using more cores can lead to unexpected performance loss.

Collaborator

This might not have the same cause, but here's the issue I found: https://gitlab.haskell.org/ghc/ghc/-/issues/9221

@janmasrovira janmasrovira requested a review from paulcadman May 31, 2024 10:34
@paulcadman paulcadman merged commit e9afdad into main May 31, 2024
4 checks passed
@paulcadman paulcadman deleted the parallel-pipeline branch May 31, 2024 11:41
janmasrovira added a commit that referenced this pull request Jul 30, 2024
This PR addresses a bug/missing case present since v0.6.2, introduced specifically by

- PR #2779

That PR involves detecting imports in Juvix files before type checking, and that's the issue. Detecting/scanning imports is done by running a flat parser (which ignores the Juvix Markdown structure), and when it fails, a Megaparsec parse is run. So, for simplicity, we could just continue using the same Megaparsec parser as before for Juvix Markdown files.

---------

Co-authored-by: Jan Mas Rovira <[email protected]>
Development

Successfully merging this pull request may close these issues:

Detect import dependencies during setup phase