Use pagination to import data #74

Open
romainguerrero opened this issue Feb 1, 2023 · 2 comments

Comments

@romainguerrero
Member

Is your feature request related to a problem? Please describe.
Currently, the typesense:import command parses all collections and, for each one, fetches all entities and sends them to the Typesense instance in a single call. If a collection contains a lot of entities, that call can fail with a 413 Request Entity Too Large error, depending on the server configuration.

Describe the solution you'd like
To avoid this issue, I suggest updating the ImportCommand to send data in batches of, for example, 100 documents, as FOSElasticaBundle does. Following the approach of its AsyncPagerPersister (see https://github.com/FriendsOfSymfony/FOSElasticaBundle/blob/master/src/Persister/AsyncPagerPersister.php#L55), the best solution would probably be to use a (configurable?) pager to fetch entities from the database and send them in batches.
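
Roughly, the idea could look like the sketch below (just an illustration, not the bundle's actual implementation: $em, $transformer, $documentManager, $class, $collectionName and $action stand for the services/variables the current ImportCommand already uses, and $batchSize is a hypothetical, possibly configurable, value):

// Sketch of a pager-based import: fetch one page of entities at a time
// and send each page to Typesense, instead of loading everything at once.
$batchSize = 100;
$page = 0;

do {
    // NB: in practice an ORDER BY (e.g. on the identifier) would make the paging deterministic.
    $entities = $em->createQuery('SELECT e FROM '.$class.' e')
        ->setFirstResult($page * $batchSize)
        ->setMaxResults($batchSize)
        ->getResult();

    $data = [];
    foreach ($entities as $entity) {
        $data[] = $transformer->convert($entity);
    }

    if ([] !== $data) {
        $documentManager->import($collectionName, $data, $action);
    }

    $em->clear(); // detach the processed entities so memory stays bounded
    ++$page;
} while (count($entities) === $batchSize);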

Describe alternatives you've considered
An easier solution would be to keep fetching all entities at once, as is done today, but send the data in batches of 100 documents (or any configurable batch size?) inside the foreach loop. However, I fear this could still hit a memory limit error when the entities are very numerous or large.
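
For comparison, that simpler variant could look roughly like this (again only a sketch; array_chunk() slices the already-converted documents, but all entities still have to be loaded in memory first):

// Sketch of the simpler alternative: load everything as today, then send
// the documents to Typesense in slices of $batchSize (hypothetical value).
$batchSize = 100;

$data = [];
foreach ($entities as $entity) {
    $data[] = $this->transformer->convert($entity);
}

foreach (array_chunk($data, $batchSize) as $chunk) {
    $this->documentManager->import($collectionName, $chunk, $action);
}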

Additional context
Just for information, here is the list of available options in the foselastica populate command (a sketch of how similar options could be declared follows the list):

  • max_per_page - Integer. Tells how many objects should be processed by a single worker at a time.
  • first_page - Integer. Tells from what page to start rebuilding the index.
  • last_page - Integer. Tells on what page to stop rebuilding the index.
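
For illustration, declaring similar options on this bundle's ImportCommand could look roughly like the sketch below (using Symfony Console's standard addOption() API; the option names, descriptions and defaults here are only illustrative, not what PR #75 actually implements):

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputOption;

class ImportCommand extends Command
{
    protected function configure(): void
    {
        $this
            ->setName('typesense:import')
            ->addOption('max-per-page', null, InputOption::VALUE_REQUIRED, 'How many documents are sent per request', 100)
            ->addOption('first-page', null, InputOption::VALUE_REQUIRED, 'Page to start importing from', 1)
            ->addOption('last-page', null, InputOption::VALUE_REQUIRED, 'Page to stop importing at');
    }
}
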
@npotier
Member

npotier commented Feb 4, 2023

Hello @romainguerrero and thank you for this wonderful and well-documented issue.

It is indeed a good idea to have more arguments on the import command, to choose the number of objects to be processed or to choose the collection to index.

I've created a PR to add these arguments: #75

I need to add some tests before merging it. A good start would be to add cases in tests/Functional/TypesenseInteractionsTest.php.

@N2oo

N2oo commented Mar 22, 2023

Hi @npotier, I was trying to import about 400k entries from my database. About 20 seconds after running the command
symfony typesense:import
the process threw a fatal error: an OutOfMemoryError coming from Doctrine's classes.
The test was made on 22/03/2023.

I tried the import with two versions of the bundle:
dev-master
v0.7.1

I've made some changes locally and the problem seems to be solved, so I thought it would be great to share them with you.
The changes were made on the dev-master version of the bundle.

Here is the key point: you must detach objects from Doctrine by clearing the entity manager:
$this->em->clear()
Here is the resource that made me think about it:
https://www.doctrine-project.org/projects/doctrine-orm/en/2.14/reference/batch-processing.html

Here is the change I made to the ImportCommand class:

private function populateIndex(InputInterface $input, OutputInterface $output, string $index)
{
    /*...*/

    for ($i = $firstPage; $i <= $lastPage; ++$i) {
        // Fetch only one page of entities instead of the whole table at once.
        $q = $this->em->createQuery('select e from '.$class.' e')
            ->setFirstResult(($i - 1) * $maxPerPage)
            ->setMaxResults($maxPerPage)
        ;

        if ($io->isDebug()) {
            $io->text('<info>Running request : </info>'.$q->getSQL());
        }

        $entities = $q->toIterable();

        $data = [];
        foreach ($entities as $entity) {
            $data[] = $this->transformer->convert($entity);
        }

        $io->text('Import <info>['.$collectionName.'] '.$class.'</info> Page '.$i.' of '.$lastPage.' ('.count($data).' items)');

        $result = $this->documentManager->import($collectionName, $data, $action);

        if ($this->printErrors($io, $result)) {
            $this->isError = true;

            throw new \Exception('Error happened during the import of the collection : '.$collectionName.' (you can see them with the option -v)');
        }

        $populated += count($data);

        if ($i % 25 === 0) {
            $this->em->clear(); // detach managed entities every 25 pages to free memory
        }
    }

    $this->em->clear(); // detach any remaining entities after processing all pages

    $io->newLine();

    return $populated;
}

Hope it helps,
have a good day.
