Crawler stopped importing data #1112
Maybe this is a coincidence, but the data stopped being imported on the date when I ran the group update which, as you know, failed. |
One more comment: xxxx/yyyy - I am not sure why the crawler is trying to fetch this repo. I executed both REST and GraphQL requests, and this repo is not returned by GitHub. It is not shown in the Monocle web interface, and we have never had this repo. |
Had to reindex :( |
Looks like we have a similar case again. After some "event" it stops importing data :( with symptoms similar to what was described before. As far as I remember, the repository the crawler complained about did not exist; this time I also cannot find the repo. Last time the only solution was to completely rebuild the index, but I am afraid this is not a good option. Any idea how we can troubleshoot it? Here is another error message I see regularly in the log:
|
To me, given that it refers to a non-existing repo, this looks like some internal cache, maybe a corrupted one. So maybe it is possible to clean it, and then I can reset the date to re-scan the data? I really do not want to re-index again; plus, it has now happened for the second time, so it will probably happen again :( |
One more question, about the last date. |
Yes, there is a cache. The CLI does not provide a way to clear such entries for repositories that no longer exist. Perhaps you could then try to remove the related state object in the Elasticsearch DB: https://github.com/change-metrics/monocle/blob/master/src/Monocle/Backend/Index.hs#L204 |
I am trying to find the index in Elasticsearch where you store this metadata and cannot find it. It is not visible in Kibana, or I am doing something wrong. |
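For anyone hitting this later, the state-object lookup discussed here can be sketched as an Elasticsearch query. This is a minimal sketch only: the field names (`type`, `crawler_metadata.crawler_type_value`) and the index naming are assumptions on my part, not confirmed schema; check src/Monocle/Backend/Index.hs before relying on them.

```python
import json

def build_state_object_query(repo_full_name: str) -> dict:
    """Build an Elasticsearch query for crawler state objects that
    mention a removed repository.

    The field names below are assumptions about Monocle's schema;
    verify them against src/Monocle/Backend/Index.hs before running
    anything destructive.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Hypothetical discriminator for crawler metadata documents.
                    {"term": {"type": "Crawler"}},
                    # Hypothetical field holding the tracked repository name.
                    {"term": {"crawler_metadata.crawler_type_value": repo_full_name}},
                ]
            }
        }
    }

# Example: print the request body you would send against the workspace
# index (the index name pattern is an assumption as well):
query = build_state_object_query("xxxx/yyyy")
print("POST /<workspace-index>/_search")
print(json.dumps(query, indent=2))
```

Running the search first lets you confirm that exactly one state object matches before deleting anything.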
I think that is because it is an object without the usual date field, so you need to select the right parameter when creating the Kibana index pattern. |
But what is the index name? Is it not the index where all the workspace data is stored? |
same index |
Fabien, sorry for the trouble, but I cannot find the required information. What is the document type used to cache crawler information? |
I was not very attentive when looking at the index. Removing the repo from Elasticsearch looks like it solved the problem. Should I register a bug for it? Leonid |
Hi, thank you for confirming this. We can keep this issue open for us to investigate the fact that the crawler stops when a no-longer-existing repository is still in the "cache". Such objects can stay in the cache, but they should not prevent the crawler from processing the rest of the repos. |
Looking at it, it looks like:
- the error comes from here (when the repository is empty): https://github.com/change-metrics/monocle/blob/d2232bffee1a381c991854dea7abc6784926137c/src/Lentille/GitHub/PullRequests.hs#L83
- and this stops the PR crawler because postResult considers it a fatal error: https://github.com/change-metrics/monocle/blob/d2232bffee1a381c991854dea7abc6784926137c/src/Macroscope/Worker.hs#L208-L217
I think we could:
- add a new EntityRemoved error to https://github.com/change-metrics/monocle/blob/d2232bffee1a381c991854dea7abc6784926137c/src/Lentille.hs#L113-L114
- clean up the crawler metadata when it happens, in the Worker module
- ignore it to keep the crawler running in https://github.com/change-metrics/monocle/blob/d2232bffee1a381c991854dea7abc6784926137c/src/Macroscope/Worker.hs#L113-L119
Note that the comment above is not correct; it should say "This is likely an error we *can* recover". |
Thank you. The most important thing is to make this error non-fatal so that the crawler continues running. All three of your points make sense. |
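The three-step fix proposed above amounts to treating a removed entity as a recoverable condition. Here is a sketch of that control flow (in Python purely for illustration; the real change would live in Monocle's Haskell Worker module, and all names below are invented):

```python
class EntityRemovedError(Exception):
    """Hypothetical counterpart of the proposed EntityRemoved error."""

def run_crawler(entities, fetch, cleanup_metadata):
    """Process each entity; a removed entity has its cached metadata
    cleaned up instead of aborting the whole crawler run."""
    results = []
    for entity in entities:
        try:
            results.append(fetch(entity))
        except EntityRemovedError:
            # Proposed behaviour: clean up the stale crawler metadata
            # and keep going, rather than treating this as fatal.
            cleanup_metadata(entity)
    return results

# Usage sketch: the second repo no longer exists upstream.
def fetch(repo):
    if repo == "org/removed":
        raise EntityRemovedError(repo)
    return {"repo": repo, "prs": []}

removed = []
out = run_crawler(["org/alive", "org/removed", "org/also-alive"],
                  fetch, removed.append)
# out covers only the two live repos; "org/removed" ends up in `removed`.
```

The key design point is that the removed entity is handled per-entity inside the loop, so one stale cache entry can no longer stall every other repository.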
Any chance to fix this error? The problem is that we are dropping "old" repos, and I have to manually remove them from the cache every single time :( |
What's the easiest way to clear the cache? |
I also ran into this issue.
For me, using elasticvue to connect to Elastic allowed me to search for any documents containing the deleted repo (i.e. the cached repo info as well as all crawler errors for this repo) and mass-delete them. This allowed the crawler to resume crawling all my repos. |
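The elasticvue approach described here can also be done with a plain `_delete_by_query` request. A minimal sketch, assuming the repository name is searchable across indexed fields with a `query_string` query (an assumption; always run the same body against `_search` first and review what would be deleted):

```python
import json

def build_delete_by_query(repo_full_name: str) -> dict:
    """Build a _delete_by_query body matching every document that
    mentions the removed repository (cached crawler state as well as
    crawler error documents), mirroring the manual elasticvue cleanup.

    query_string over all fields is a blunt instrument: review the
    matching documents via _search before deleting anything.
    """
    return {
        "query": {
            "query_string": {
                # Quote the repo name so the "/" is not parsed as
                # query syntax by Elasticsearch.
                "query": '"%s"' % repo_full_name,
            }
        }
    }

body = build_delete_by_query("xxxx/yyyy")
# Review first:  POST /<workspace-index>/_search           with this body
# Then delete:   POST /<workspace-index>/_delete_by_query  with this body
print(json.dumps(body, indent=2))
```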
I connect to Elastic using Kibana and search for the repo, for example.
Then I usually reset the date using this command:
docker-compose -p monocle -f docker-compose.prod.yaml run --rm --no-deps api monocle janitor set-crawler-commit-date --elastic elastic:9200 --config /etc/monocle/config.yaml --workspace xxx --crawler-name --commit-date 2024-07-20
You obviously need to adjust the parameters for your setup. |
I have noticed that a crawler stopped importing data. I see the following errors in the log.
Actually, the repository which cannot be found does not exist.
I thought it could be cached, so I restarted the services, but it looks like I still have the problem.
Any suggestions?