Crawler stopped importing data #1112

Open
leonid-deriv opened this issue Jan 31, 2024 · 20 comments

@leonid-deriv

leonid-deriv commented Jan 31, 2024

I have noticed that a crawler stopped importing data. I see the following errors in the log

2024-01-31 14:30:00 INFO    Macroscope.Worker:183: Looking for oldest entity {"index":"demo","crawler":"xxxx-monocle-demo","stream":"TaskDatas","offset":0}
2024-01-31 14:30:00 INFO    Macroscope.Worker:199: Processing {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","entity":{"contents":"xxxx/yyyy","tag":"Project"},"age":"2023-11-29T01:41:16Z"}
2024-01-31 14:30:00 WARNING Lentille.GitHub.RateLimit:66: Repository not found. Will not retry. {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes"}
2024-01-31 14:30:00 INFO    Lentille.GraphQL:232: Fetched from current page {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","count":0,"total":0,"pageInfo":{"endCursor":null,"hasNextPage":false,"totalCount":null},"ratelimit":null}
2024-01-31 14:30:00 WARNING Lentille.GraphQL:276: Fetched partial result {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","err":[{"locations":[{"column":7,"line":8}],"message":"Could not resolve to a Repository with the name 'xxxx/yyyy'.","path":["repository"],"type":"NOT_FOUND"}]}
2024-01-31 14:30:00 INFO    Macroscope.Worker:204: Posting documents {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","count":2}
2024-01-31 14:30:00 INFO    Macroscope.Worker:189: Unable to find entity to update {"index":"demo","crawler":"xxxx-monocle-demo","stream":"TaskDatas"}
2024-01-31 14:30:00 INFO    Macroscope.Worker:183: Looking for oldest entity {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","offset":0}
2024-01-31 14:30:00 WARNING Macroscope.Worker:167: Stream produced a fatal error {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","err":["2024-01-31T14:30:00.901152636Z",{"contents":["Unknown GetProjectPullRequests response: GetProjectPullRequests {rateLimit = Just (GetProjectPullRequestsRateLimit {used = 183, remaining = 4817, resetAt = DateTime \"2024-01-31T14:53:42Z\"}), repository = Nothing}"],"tag":"DecodeError"}]}

Actually, the repository that cannot be found does indeed not exist.
I thought it could be cached, so I restarted the services, but it looks like I still have the problem.
Any suggestions?

@leonid-deriv
Author

Maybe this is a coincidence, but the data stopped being imported on the date I ran the group update which, as you know, failed.

@leonid-deriv
Author

leonid-deriv commented Jan 31, 2024

One more comment about xxxx/yyyy: I am not sure why the crawler is trying to fetch this repo. I executed both REST and GraphQL requests, and GitHub does not return it. The repo is not shown in the Monocle web interface, and we have never had this repo.

@leonid-deriv
Author

Had to reindex :(

@leonid-deriv
Author

Looks like we have a similar case again. After some "event" it stopped importing data :(. The symptoms are similar to what I described before. From what I remember, the repository the crawler complained about did not exist; this time I also cannot find this repo. Last time the only solution was to completely rebuild the index, but I am afraid this is not a good option. Any idea how we can troubleshoot it? Here is another error message I see regularly in the log:

2024-03-07 19:16:09 WARNING Macroscope.Worker:167: Stream produced a fatal error {"index":"xxxx","crawler":"xxx-monocle-xxxx","stream":"Changes","err":["2024-03-07T19:16:09.507202765Z",{"contents":["Unknown GetProjectPullRequests response: GetProjectPullRequests {rateLimit = Just (GetProjectPullRequestsRateLimit {used = 130, remaining = 4870, resetAt = DateTime \"2024-03-07T19:32:37Z\"}), repository = Nothing}"],"tag":"DecodeError"}]}

@leonid-deriv
Author

To me, taking into account that it refers to a non-existing repo, this looks like some internal cache, maybe corrupted. So maybe it is possible to clean it, and then I can reset the date to re-scan the data? I really do not want to re-index again, and since it has now happened for the second time it will probably happen again :(

@leonid-deriv
Author

Another question about the last date: Monocle crawlers keep track of the last date (commit date) at which a successful document fetch happened.
Where does the crawler store this data?

@morucci
Collaborator

morucci commented Mar 13, 2024

Yes, there is a cache. The CLI does not provide a way to clear such entries for repositories that no longer exist.
Perhaps you could then try to remove the related state object in the Elasticsearch DB: https://github.com/change-metrics/monocle/blob/master/src/Monocle/Backend/Index.hs#L204
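As a rough illustration (not an official procedure), the state object can be located with a plain Elasticsearch query. The host, the index name (monocle.changes.1.demo, guessed from the "demo" workspace in the logs above), and the crawler_metadata.crawler_type_value field used later in this thread are assumptions; list your indices and check the mapping before deleting anything.

# Sketch only: host, index name, and field are assumptions; verify them first.
# Inside the docker-compose network the host is typically elastic:9200.
curl -s 'http://localhost:9200/_cat/indices?v'

# Look up the crawler metadata document(s) that still reference the removed repo.
# Depending on the mapping you may need a match query or a .keyword sub-field.
curl -s -H 'Content-Type: application/json' \
  'http://localhost:9200/monocle.changes.1.demo/_search?pretty' \
  -d '{"query": {"term": {"crawler_metadata.crawler_type_value": "xxxx/yyyy"}}}'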

@leonid-deriv
Author

leonid-deriv commented Mar 14, 2024 via email

@morucci
Collaborator

morucci commented Mar 14, 2024

I think that's because it is an object without the usual date field, so you need to select the right parameter when creating the Kibana index pattern.

@leonid-deriv
Author

leonid-deriv commented Mar 14, 2024 via email

@morucci
Collaborator

morucci commented Mar 14, 2024

same index

@leonid-deriv
Author

leonid-deriv commented Mar 14, 2024 via email

@leonid-deriv
Author

leonid-deriv commented Mar 16, 2024 via email

@morucci
Collaborator

morucci commented Mar 17, 2024

Hi, thank you for confirming this. We can keep this issue open for us to investigate the fact that the crawler stops when a no-longer-existing repository is still in the "cache". Such objects can stay in the cache, but they should not prevent the crawler from processing the rest of the repos.

@TristanCacqueray
Contributor

Looking at it, it looks like:

  • the error comes from (when the repository is empty):

    _anyOtherResponse ->

  • and this stops the PR crawler because the postResult is treated as a fatal error:

    case catMaybes postResult of
      [] -> do
        -- Post the commit date
        res <- httpRetry "api/commit" $ commitTimestamp entity
        case res of
          Just err -> pure [CommitError err]
          Nothing -> do
            logInfo_ "Continuing on next entity"
            go
      xs -> pure xs

I think we could:

  • add a new EntityRemoved error to monocle/src/Lentille.hs (lines 113 to 114 in d2232bf):

    data LentilleErrorKind
      = DecodeError [Text]

  • cleanup the crawler metadata when it happens in the Worker module
  • ignore it to keep the crawler running in:

    let addStreamError :: [Maybe ProcessError] -> [Maybe ProcessError]
        addStreamError = case edoc of
          Right _ -> id
          -- This is likely an error we can't recover, so don't add stream error
          Left (LentilleError _ (PartialErrors _)) -> id
          -- Every other 'LentilleError' are fatal
          Left err -> (Just (StreamError err) :)

Note that the comment above is not correct; it should say "This is likely an error we *can* recover".

@leonid-deriv
Author

leonid-deriv commented Mar 17, 2024 via email

@leonid-deriv
Author

Any chance of fixing this error? The problem is that we are dropping "old" repos, and I have to manually remove them from the cache every single time :(

@gekitsuu

gekitsuu commented Sep 5, 2024

What's the easiest way to clear the cache?

@christophe-kamphaus-jemmic

I also ran into this issue.

What's the easiest way to clear the cache?

For me, using elasticvue to connect to Elastic allowed me to search for any documents containing the deleted repo (i.e. the cached repo info as well as all crawler errors for this repo) and mass-delete them. This allowed the crawler to resume crawling all my repos.
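For reference, a similar mass delete can be done without elasticvue using Elasticsearch's _delete_by_query API. This is only a sketch: the host and index name are assumptions (see the notes earlier in the thread), and the query only matches the crawler metadata entries, so run the equivalent _search first and extend the query if you also want to drop the crawler error documents for that repo.

# Sketch: remove the cached crawler metadata entries for the deleted repo.
# Verify with a _search using the same query before running the delete.
curl -s -X POST -H 'Content-Type: application/json' \
  'http://localhost:9200/monocle.changes.1.demo/_delete_by_query?pretty' \
  -d '{"query": {"term": {"crawler_metadata.crawler_type_value": "owner/repo"}}}'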

@leonid-deriv
Author

leonid-deriv commented Dec 4, 2024

I connect to Elastic using Kibana and search for the repo, for example:

  • crawler_metadata.crawler_type_value:"owner/repo"
  • get the _id of the doc
  • delete the doc (for example from Kibana devtools)

Then I usually reset the date using this command

docker-compose -p monocle -f docker-compose.prod.yaml run --rm --no-deps api monocle janitor set-crawler-commit-date --elastic elastic:9200 --config /etc/monocle/config.yaml --workspace xxx --crawler-name --commit-date 2024-07-20

You obviously need to adjust the parameters for your setup.
