Add "Vertical Scalability" section #133
This is an excellent topic choice since this has a huge impact on Haskell's suitability in several corporate environments. Were you interested in writing this up? If not, I can also speak to this a little bit myself |
exactly
I'm rather interested in reading it to evaluate the suitability of Haskell for my corporate needs :) Currently we use an ancient legacy codebase of Perl and C++ with some 32 GB of mostly static heaps (because neither has GC, and everybody in the house hates Java, irrationally). And things like external sorting tend to require low-level features to avoid disk cache pollution, etc. |
Alright, I'll dump some unassorted notes of mine soon (most likely tomorrow morning) for you to look over, and then based on your feedback I'll write it up into a new section |
You can use this issue as a draft. |
First off, here are some notes from @j6carey, who has done a lot of performance work on our internal network protocol parser (a process with a large heap that consumes a high rate of network traffic):
|
My first draft of notes:
|
One more thing: If your parsers suspend pending further input, then beware of making them monadic, as doing so may lead to space leaks. Consider:
In the rather likely event that a suspended binding retains references it no longer needs, memory accumulates. Perhaps with enough explicit evaluation you could avoid this problem, but in realistically large parsers we have found it very difficult to clean out every last way in which monadic bindings can trigger leaks.

Now, the standard resumable decoders are one option. Another alternative is to keep the span of input that gave rise to the suspension, discard the suspension, and then re-parse the augmented input from the beginning once new data arrive. Of course, this approach involves redundant parsing, but for small records that can be quite acceptable. In fact, the memory footprint may actually be smaller, because most external data formats are reasonably compact in comparison with a parser suspension.
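The code example that originally followed "Consider:" above did not survive extraction. Below is a minimal sketch of the kind of resumable monadic parser under discussion; the `Parser` and `Result` types are my own illustration, not the original code:

```haskell
import Control.Monad (ap, liftM)
import Data.ByteString (ByteString)

-- A resumable parser: it either finishes with a value plus leftover
-- input, or suspends with a continuation awaiting more input.
data Result a
  = Done a ByteString
  | Partial (ByteString -> Result a)

newtype Parser a = Parser { runParser :: ByteString -> Result a }

instance Functor Parser where fmap = liftM
instance Applicative Parser where
  pure x = Parser (Done x)
  (<*>)  = ap

instance Monad Parser where
  Parser p >>= f = Parser (go . p)
    where
      -- Each bind wraps any pending suspension in one more closure.
      -- If a long chain of binds suspends mid-stream, that whole chain
      -- of closures (and everything captured by `f`) stays live until
      -- more input arrives: the space leak described above.
      go (Done a rest) = runParser (f a) rest
      go (Partial k)   = Partial (go . k)
```
|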
@nponeccop: Did the initial notes answer your questions or do you still have remaining questions? |
They answer a different question: "What knobs are available to tune GHC on large machines?" And I'd like to see "where does tuned GHC stand compared to tuned Java on large machines?". The knobs info is usable, but not in its current form. Below is my version of the "first draft" (a concrete illustration of such knobs follows the outline):

Vertical scalability

This chapter covers single-server GHC performance considerations for "large" tasks, such as:

Heap scalability

Manycore scalability

A solution to both these problems is to run many processes with message-passing IPC.

Network scalability

Haskell's network stack is mostly tuned to the needs of the Web (HTTP and websocket servers with a large number of connections but low per-connection bandwidth). But MPI bindings exist.

Disk scalability
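For concreteness, here is roughly what such knobs look like in practice. This is an illustrative sketch only - the flag values below are placeholders to measure against, not recommendations:

```
$ ghc -O2 -threaded -rtsopts Main.hs
$ ./Main +RTS -N12 -A64m -H8g -qg1 -I0 -RTS
```

Here -N sets the number of capabilities, -A the per-capability nursery size, -H the suggested overall heap size, -qg1 restricts the parallel GC to the old generation, and -I0 disables the idle-time GC. The right values depend entirely on measuring the particular workload.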
|
What is this based on? I'm pretty sure GHC's concurrency runtime and garbage collection will scale past 4 cores. I could be wrong, though, because I haven't formally benchmarked this or studied this recently.
I think it's important to clarify here that "blocking" means that it only blocks the current green thread, not the underlying OS thread or the runtime.

Also, one thing that needs to be highlighted quite prominently is that, unlike Java or Go, the Haskell runtime can wrap userland FFI calls to C to be "safe" (meaning that they will also only block the current green thread) at the expense of less than 1 microsecond per call. I'm not aware of any other language that provides this guarantee, and I think this is one of the killer features of Haskell in this domain, since the absence of this feature is a common stability pitfall on the JVM. The only thing that will ever block an OS thread or the runtime is an "unsafe" C call (which has much lower overhead, on the order of nanoseconds, designed for very quick calls).
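A minimal sketch of the two import flavors being contrasted (the C function names here are made up for illustration):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

-- "safe": the call may block or call back into Haskell; the RTS
-- arranges things so that only the calling green thread waits.
foreign import ccall safe "slow_network_read"
  c_slowNetworkRead :: IO Int

-- "unsafe": nanosecond-scale overhead, but the call must be quick and
-- must not call back into Haskell, or it can stall a capability.
foreign import ccall unsafe "fast_counter_read"
  c_fastCounterRead :: IO Int
```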
I would suggest removing that reference.

Finally, another thing that needs to be highlighted prominently is that Haskell's concurrency is preemptive, not cooperative, which is another big benefit for stability in this domain. |
My assumption was based on this quote:
Remember, I don't know what I'm talking about. It's just the type of thing I'd like to see correctly described in the document (but without turning the document into a tutorial).
The API visible to end users is blocking calls and threads. That's what I meant. There is no event-loop/async-monad style API, because we have really good green threads. As for blocking all the threads: what happens if the number of green threads blocked by IO operations exceeds the number of RTS capabilities? Do calls that cannot be implemented without blocking use some thread pool independent of the capabilities, or something else? It seems that the degree of non-blocking in the GHC runtime cannot be explained here concisely, so we need to find a link.
Ah, it's a common misconception about Linux aio: https://www.kernel.org/doc/ols/2003/ols2003-pages-351-366.pdf

Unfortunately a definitive guide on aio is hard to find: most people don't understand storage or syscalls and just assume that it "just works". Here is a suggestion that it helped in the case of MySQL: https://lists.mysql.com/benchmarks/154 (but the link from there is broken). So you need to find a hardcore C storage guru to see if aio is beneficial for your application. I didn't test aio on Linux, but on Windows a similar facility improves performance for long disk queues (and fewer than 2 outstanding IOs is generally a bad idea even for sequential access).
I have always thought that sparks cannot be preempted in the middle. Can you find a reference?
I think green threads describe the spark scheduler better than the async monad. Also, it seems that we need a link that describes the spark concurrency model, as it's a pretty unique native-code green-threads solution. |
The following paper is essential reading for this discussion. The reason I believe GHC can scale past 4 cores is in the abstract of that paper:
The short answer to your question is that GHC can have many more blocking threads than RTS capabilities (i.e. you can have millions of threads blocking if you want), but the more detailed answer is to read the above paper.

Also, note that there appears to be a Haskell binding to POSIX asynchronous IO: https://hackage.haskell.org/package/posix-realtime-0.0.0.4/docs/System-Posix-Realtime-Aio.html

When I say that Haskell concurrency is preemptive I mean that (A) you don't need to insert explicit yield points in userland code and (B) it's rare for a thread to not automatically yield (the only scenario I'm aware of is a CPU-intensive loop that never allocates memory). The required reading here is the Control.Concurrent documentation: https://hackage.haskell.org/package/base/docs/Control-Concurrent.html

I also think it is important to distinguish between sparks and green threads. Those are two separate abstractions and RTS features.
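To make the sparks-versus-green-threads distinction concrete, here is a small sketch (my own illustration, not from the thread; it assumes the `parallel` package for `Control.Parallel`):

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Parallel (par, pseq)

-- Sparks: hints for parallel evaluation of *pure* expressions. `par`
-- records a spark that an idle capability may pick up; sparks are not
-- scheduled threads and can "fizzle" (get evaluated by the main thread).
sumBoth :: [Int] -> [Int] -> Int
sumBoth xs ys = a `par` (b `pseq` (a + b))
  where
    a = sum xs
    b = sum ys

-- Green threads: units of IO concurrency, preemptively scheduled by
-- the RTS across capabilities.
main :: IO ()
main = do
  _ <- forkIO (putStrLn "hello from a green thread")
  print (sumBoth [1 .. 1000000] [1 .. 1000000])
  threadDelay 100000  -- crude: give the forked thread time to print
```
|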
"Performance seems to suffer as the number of GHC RTS capabilities increases past about 4" was based on experience with only a single program--hence the "seems to". It is a real program doing lots of real work, but still, other programs would probably scale differently. Furthermore, when staying on a single NUMA node I have not (yet) seen more than a 13% throughput difference between 3 processes with 4 capabilities and 1 process with 12 capabilities. So if there is much difficulty in going to a multi-process scenario, or if it is difficult to balance the load evenly when doing so explicitly, then you might still be better off with a single process that can easily redirect capabilities to available work. Ideally one would experiment with the particular application in question to see what works best. Regarding green threads and blocking: according to the documentation and what I have seen in practice, each "capability" has a pool of threads, most of which are dormant at any given time. When the active thread blocks in a system call, another thread takes over for that capability, so that the CPUs stay busy. Now, if there is a way to use aio libraries to queue up more than one IO read at a time, so that when the first finishes the second starts without waiting for the kernel scheduler to schedule a user-level request for it, then that sounds rather interesting. Short of that, green threads should do everything, at least in principle. |
It's not a kernel queue, but a hardware queue in the disk controller - SAS TCQ or SATA NCQ. You can fill the hardware queue so the controller is always busy either with a normal thread pool or with aio - it doesn't matter. The only difference is CPU utilization (and that aio is limited and provides neither caching nor full POSIX semantics). See http://dba-oracle.com/real_application_clusters_rac_grid/asynchronous.htm

And yes, TCQ/NCQ is indispensable both for SSDs and HDDs, for different reasons, and its job cannot be done by the kernel-side IO scheduler.
I didn't know that. So sparks (and parallel programming in general, as opposed to mere IO concurrency) should be covered separately in the document.
I don't know what the right name for it is, but currently there's no data in SOTU on Haskell performance on larger and more stressed servers. For example:
A bad IO stack can hurt performance, and so far I've seen only the performance of the socket listener. And I think there was some work on the NUMA scalability of GHC at Facebook.
Another issue is that if you cannot buy enough RAM (and many people assume that you can always have 4x more RAM than you actually need), many strange things happen: