Crawler stopped importing data #1112

Open
leonid-deriv opened this issue Jan 31, 2024 · 20 comments

@leonid-deriv

leonid-deriv commented Jan 31, 2024

I have noticed that a crawler stopped importing data. I see the following errors in the log

2024-01-31 14:30:00 INFO    Macroscope.Worker:183: Looking for oldest entity {"index":"demo","crawler":"xxxx-monocle-demo","stream":"TaskDatas","offset":0}
2024-01-31 14:30:00 INFO    Macroscope.Worker:199: Processing {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","entity":{"contents":"xxxx/yyyy","tag":"Project"},"age":"2023-11-29T01:41:16Z"}
2024-01-31 14:30:00 WARNING Lentille.GitHub.RateLimit:66: Repository not found. Will not retry. {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes"}
2024-01-31 14:30:00 INFO    Lentille.GraphQL:232: Fetched from current page {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","count":0,"total":0,"pageInfo":{"endCursor":null,"hasNextPage":false,"totalCount":null},"ratelimit":null}
2024-01-31 14:30:00 WARNING Lentille.GraphQL:276: Fetched partial result {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","err":[{"locations":[{"column":7,"line":8}],"message":"Could not resolve to a Repository with the name 'xxxx/yyyy'.","path":["repository"],"type":"NOT_FOUND"}]}
2024-01-31 14:30:00 INFO    Macroscope.Worker:204: Posting documents {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","count":2}
2024-01-31 14:30:00 INFO    Macroscope.Worker:189: Unable to find entity to update {"index":"demo","crawler":"xxxx-monocle-demo","stream":"TaskDatas"}
2024-01-31 14:30:00 INFO    Macroscope.Worker:183: Looking for oldest entity {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","offset":0}
2024-01-31 14:30:00 WARNING Macroscope.Worker:167: Stream produced a fatal error {"index":"demo","crawler":"xxxx-monocle-demo","stream":"Changes","err":["2024-01-31T14:30:00.901152636Z",{"contents":["Unknown GetProjectPullRequests response: GetProjectPullRequests {rateLimit = Just (GetProjectPullRequestsRateLimit {used = 183, remaining = 4817, resetAt = DateTime \"2024-01-31T14:53:42Z\"}), repository = Nothing}"],"tag":"DecodeError"}]}

Actually, the repository that cannot be found does indeed not exist.
I thought it could be cached, so I restarted the services, but it looks like I still have the problem.
Any suggestions?

@leonid-deriv
Author

Maybe this is a coincidence, but the data stopped being imported on the date I ran the group update which, as you know, failed.

@leonid-deriv
Author

leonid-deriv commented Jan 31, 2024

One more comment about xxxx/yyyy: I am not sure why the crawler is trying to fetch this repo. I executed both REST and GraphQL requests, and GitHub does not return it. The repo is not shown in the Monocle web interface, and we have never had this repo.

@leonid-deriv
Author

Had to reindex :(

@leonid-deriv
Author

Looks like we have a similar case again. After some "event" it stopped importing data :(. The symptoms are similar to what I described before. From what I remember, the repository the crawler complained about did not exist; this time I also cannot find this repo. Last time the only solution was to completely rebuild the index, but I am afraid this is not a good option. Any idea how we can troubleshoot it? Here is another error message I see regularly in the log:

2024-03-07 19:16:09 WARNING Macroscope.Worker:167: Stream produced a fatal error {"index":"xxxx","crawler":"xxx-monocle-xxxx","stream":"Changes","err":["2024-03-07T19:16:09.507202765Z",{"contents":["Unknown GetProjectPullRequests response: GetProjectPullRequests {rateLimit = Just (GetProjectPullRequestsRateLimit {used = 130, remaining = 4870, resetAt = DateTime \"2024-03-07T19:32:37Z\"}), repository = Nothing}"],"tag":"DecodeError"}]}

@leonid-deriv
Author

To me, taking into account that it refers to a non-existing repo, this looks like some internal cache, maybe corrupted. So maybe it is possible to clean it, and then I can reset the date to re-scan the data? I really do not want to re-index again, and since it has now happened for the second time it will probably happen again :(

@leonid-deriv
Author

Another question about the last date: Monocle crawlers keep track of the last date (commit date) at which a successful document fetch happened.
Where does the crawler store this data?

@morucci
Collaborator

morucci commented Mar 13, 2024

Yes, there is a cache. The CLI does not provide a way to clear such entries for repositories that no longer exist.
Perhaps you could then try to remove the related state object in the Elasticsearch DB: https://github.com/change-metrics/monocle/blob/master/src/Monocle/Backend/Index.hs#L204
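As a rough illustration (not an official procedure), the state object can be located with a plain Elasticsearch query. The host, the index name (monocle.changes.1.demo, guessed from the "demo" workspace in the logs above), and the crawler_metadata.crawler_type_value field used later in this thread are assumptions; list your indices and check the mapping before deleting anything.

# Sketch only: host, index name, and field are assumptions; verify them first.
# Inside the docker-compose network the host is typically elastic:9200.
curl -s 'http://localhost:9200/_cat/indices?v'

# Look up the crawler metadata document(s) that still reference the removed repo.
# Depending on the mapping you may need a match query or a .keyword sub-field.
curl -s -H 'Content-Type: application/json' \
  'http://localhost:9200/monocle.changes.1.demo/_search?pretty' \
  -d '{"query": {"term": {"crawler_metadata.crawler_type_value": "xxxx/yyyy"}}}'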

@leonid-deriv
Author

leonid-deriv commented Mar 14, 2024 via email

@morucci
Collaborator

morucci commented Mar 14, 2024

I think that's because it is an object without the usual date field, so you need to select the right parameter when creating the Kibana index pattern.

@leonid-deriv
Author

leonid-deriv commented Mar 14, 2024 via email

@morucci
Collaborator

morucci commented Mar 14, 2024

same index

@leonid-deriv
Author

leonid-deriv commented Mar 14, 2024 via email

@leonid-deriv
Author

leonid-deriv commented Mar 16, 2024 via email

@morucci
Collaborator

morucci commented Mar 17, 2024

Hi, thank you for confirming this. We can keep this issue open for us to investigate the fact that the crawler stops when a no-longer-existing repository is still in the "cache". Such objects can stay in the cache, but they should not prevent the crawler from processing the rest of the repos.

@TristanCacqueray
Contributor

Looking at it, it looks like:

  • the error comes from (when the repository is empty):

    _anyOtherResponse ->

  • and this stops the PR crawler because the postResult is treated as a fatal error:

    case catMaybes postResult of
      [] -> do
        -- Post the commit date
        res <- httpRetry "api/commit" $ commitTimestamp entity
        case res of
          Just err -> pure [CommitError err]
          Nothing -> do
            logInfo_ "Continuing on next entity"
            go
      xs -> pure xs

I think we could:

  • add a new EntityRemoved error to monocle/src/Lentille.hs (lines 113 to 114 in d2232bf):

    data LentilleErrorKind
      = DecodeError [Text]

  • cleanup the crawler metadata when it happens in the Worker module
  • ignore it to keep the crawler running in:

    let addStreamError :: [Maybe ProcessError] -> [Maybe ProcessError]
        addStreamError = case edoc of
          Right _ -> id
          -- This is likely an error we can't recover, so don't add stream error
          Left (LentilleError _ (PartialErrors _)) -> id
          -- Every other 'LentilleError' are fatal
          Left err -> (Just (StreamError err) :)

Note that the comment above is not correct; it should say "This is likely an error we *can* recover".

@leonid-deriv
Author

leonid-deriv commented Mar 17, 2024 via email

@leonid-deriv
Author

Any chance of fixing this error? The problem is that we are dropping "old" repos, and I have to manually remove them from the cache every single time :(

@gekitsuu

gekitsuu commented Sep 5, 2024

What's the easiest way to clear the cache?

@christophe-kamphaus-jemmic

I also ran into this issue.

What's the easiest way to clear the cache?

For me, using elasticvue to connect to Elastic allowed me to search for any documents containing the deleted repo (i.e. the cached repo info as well as all crawler errors for this repo) and mass-delete them. This allowed the crawler to resume crawling all my repos.
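For reference, a similar mass delete can be done without elasticvue using Elasticsearch's _delete_by_query API. This is only a sketch: the host and index name are assumptions (see the notes earlier in the thread), and the query only matches the crawler metadata entries, so run the equivalent _search first and extend the query if you also want to drop the crawler error documents for that repo.

# Sketch: remove the cached crawler metadata entries for the deleted repo.
# Verify with a _search using the same query before running the delete.
curl -s -X POST -H 'Content-Type: application/json' \
  'http://localhost:9200/monocle.changes.1.demo/_delete_by_query?pretty' \
  -d '{"query": {"term": {"crawler_metadata.crawler_type_value": "owner/repo"}}}'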

@leonid-deriv
Author

leonid-deriv commented Dec 4, 2024

I connect to Elastic using Kibana and search for the repo, for example:

  • crawler_metadata.crawler_type_value:"owner/repo"
  • get the _id of the doc
  • delete the doc (for example from Kibana devtools)

Then I usually reset the date using this command

docker-compose -p monocle -f docker-compose.prod.yaml run --rm --no-deps api monocle janitor set-crawler-commit-date --elastic elastic:9200 --config /etc/monocle/config.yaml --workspace xxx --crawler-name --commit-date 2024-07-20

You obviously need to adjust the parameters for your setup.
