Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow serialization with large Array data (> 60 Mbytes) #269

Open
hernanmd opened this issue Feb 10, 2023 · 23 comments
Open

Slow serialization with large Array data (> 60 Mbytes) #269

hernanmd opened this issue Feb 10, 2023 · 23 comments
Labels
pinned Never mark this issue stale

Comments

@hernanmd
Copy link

It seems that Fuel takes a long time serializing an Array of Arrays containing multiple Strings each one. When serialization passes around 60 Mbytes, it starts writing data in chunks of 100Kbytes and very slowly:

To reproduce please download this Fuel serialization: https://filesender.renater.fr/?s=download&token=b3809275-6258-4ee4-8d93-3b8871b7a0bd

Materialize it in a fresh Pharo 11 image:

FLMaterializer 
	materializeFromFileNamed: 'acImdb_49582_nodups_noartfcts_tokenized.fuel'.

and then from the Inspector window evaluate:

FLSerializer 
	serialize: self 
	toFileNamed: 'acImdb_49582_nodups_noartfcts_tokenized_test.fuel'.

We tested in Pharo 11 under macOS.

@theseion
Copy link
Owner

Interesting. I don't see any reason for Fuel to do that. There must be another mechanism at play. I currently don't have to look into this tough, in a week or two maybe. @tinchodias, do you have some time to spare maybe?

@theseion
Copy link
Owner

Can you give me a rough timing estimate of what you mean by "slowly"?

@hernanmd
Copy link
Author

Can you give me a rough timing estimate of what you mean by "slowly"?

It was around 100K/sec

@theseion
Copy link
Owner

This is very interesting. I found the issue to be time spent in FLLargeIdentityDictionary. Replacing one particular use of FLLargeIdentityDictionary with IdentityDictionary cut time down to ~40 seconds. FLLargeIdentityDictionary was specifically designed to be faster for large collections, so it's very curious that this happens.

@theseion
Copy link
Owner

Still investigating but interesting find: in Pharo 11 the string hashes appear to be badly distributed.

@hernanmd
Copy link
Author

Probably @guillep will be interested

@theseion
Copy link
Owner

I'm not sure the hash distribution is bad per se, it's just that the hashes have only uneven numbers when using modulo 4096.

It looks like FLLargeIdentityDictionary has been slower than we wanted it to be since at least Pharo 6.

@guillep
Copy link

guillep commented Feb 13, 2023

Interesting finding :)

@guillep
Copy link

guillep commented Feb 23, 2023

I checked the usage of identity dictionaries.
I wrote a silly (and unfinished) identity dictionary implementation that keeps a hash table with hash tables inside. The outer hash table uses the modulo of the identity hash and benefits from them being evenly distributed. This silly implementation outperformed the other identity dictionaries for the larger charges (yellow in the plot below). It's 4x faster than FLLargeIdentityDictionary and 50x faster than the standard identity dictionary for the largest set of objects.

imagen

This kind of tells me that a better identity dictionary will drastically improve the performance of fuel.
I'll add this to the GSOC proposal here: https://gsoc.pharo.org/serialization

@theseion
Copy link
Owner

Brilliant, thanks!

@jecisc
Copy link

jecisc commented Mar 24, 2023

Oh nice, not just Fuel. The system relies a lot on IdentityDictionaries :)

@stale
Copy link

stale bot commented May 23, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

@stale stale bot added the stale label May 23, 2023
@theseion theseion removed the stale label May 24, 2023
@stale
Copy link

stale bot commented Jul 23, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

@stale stale bot added the stale label Jul 23, 2023
@theseion theseion removed the stale label Jul 24, 2023
@stale
Copy link

stale bot commented Sep 23, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

@stale stale bot added the stale label Sep 23, 2023
@theseion theseion added pinned Never mark this issue stale and removed stale labels Sep 24, 2023
@seandenigris
Copy link
Collaborator

I recently had a 40MB serialized graph balloon to almost 400MB on disk, which translated into even more memory in the image. It turned out to be caused by a single OrderedCollection that had 99 million strings as items. Sounds potentially related...

By the way, are there any tricks to investigate the cause of such a problem? It took me many hours to track down by selectively nil-ing out parts of the model until I found the collection. I never would've guessed it because it was not part of the model that seemed likely to grow that large. I tried the debugger, but memory use grew to about 16GB before the image started to fail e.g. "file does not exist" errors

@theseion
Copy link
Owner

Just the standard "tricks":

  • count number of instances per class
  • check collection sizes

With v5 I had introduced "graph culling", so you could have configured the depth of the graph, and maybe tried to increase the depth continuously, until you started seeing the issue. Then you could have set a break point at that depth level to inspect candidates.

@tinchodias
Copy link
Collaborator

@seandenigris Did you use SpaceTally in your investigation?
Anyway, I don't see a clear relation between the too-big .fuel file in your case, and the badly distributed keys in the in the large identity dictionary discovered by Max in this issue.

@ELePors
Copy link

ELePors commented Sep 5, 2024

Hi i have the same problem with a 10mb fuel file... which takes less than 500ms to read, but almost 59seconds to produce...
STON produces a serialization of the same objects in less than 5 seconds... (10 times less).

I tried a workaround published here by Mariano : http://forum.world.st/Fuel-serialize-of-70MB-takes-forever-on-Linux-vs-Mac-td5061002.html

| set dict |
set := FLLargeIdentitySet.
dict := FLLargeIdentityDictionary.
Smalltalk at: #FLLargeIdentitySet put: IdentitySet.
Smalltalk at: #FLLargeIdentityDictionary put: IdentityDictionary.

my serialization with fuel of the same data is now done in 4,6 seconds...
ok ... ?

@ELePors
Copy link

ELePors commented Sep 5, 2024

I don't know the impact on my Pharo image with this kind of workaround... what should we do ?

Eric.

@theseion
Copy link
Owner

theseion commented Sep 6, 2024

Well, it looks like it's time to switch to the standard collections :)

I quickly checked and all tests pass using the standard collections instead of the Fuel custom ones. You should be good using the workaround you posted.

@ELePors
Copy link

ELePors commented Sep 6, 2024

i can propose a pull request on Pharo 12 branch

@seandenigris
Copy link
Collaborator

seandenigris commented Sep 6, 2024 via email

@theseion
Copy link
Owner

theseion commented Sep 9, 2024

i can propose a pull request on Pharo 12 branch

That would be nice.

What was the purpose of the custom collections vs. standard ones?

Unfortunately, the original research paper doesn't say anything about those. I presume that Fuel exhibited performance issues with large collections due to hash collisions. Meanwhile, the VMs use 64 bit words and longer hashes, which probably has mitigated the issue. Maybe @marianopeck or @tinchodias can share some more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pinned Never mark this issue stale
Projects
None yet
Development

No branches or pull requests

7 participants