HTTPS via a Proxy #64

PsypherPunk · 2016-08-24T09:53:55Z

I'm trying to crawl an HTTPS site through a Squid proxy and keep seeing errors like these:

java.io.IOException: RIS already open for ToeThread #12: https://XXX/robots.txt
   at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:84)
   at org.archive.util.Recorder.inputWrap(Recorder.java:185)
   at org.archive.modules.fetcher.FetchHTTPRequest$RecordingHttpClientConnection.getSocketInputStream(FetchHTTPRequest.java:649)
   at org.apache.http.impl.BHttpConnectionBase.ensureOpen(BHttpConnectionBase.java:131)

HTTP sites are fine, but HTTPS just doesn't seem to work. The problem seems to come down to RecordingInputStream and RecordingOutputStream, both of which throw an IOException if the underlying stream is not null (i.e. it is still open from an earlier use).

If, however, I comment out those checks, the HTTPS crawl works perfectly (as far as I can tell...). I'm not sure whether this is the webarchive-commons library being overly cautious or heritrix3 failing to do something for HTTPS sites.

kris-sigur · 2016-08-24T12:40:02Z

My first thought is that, when crawling HTTPS via a proxy, Heritrix fails to properly close the RecordingInputStream (these are thread-local), so the next open on the same ToeThread finds it still open.
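
To illustrate the suspected failure mode, here is a minimal sketch. It is not the webarchive-commons or Heritrix code: `GuardedRecorder`, `RisReuseSketch`, `simulate` and the example URL are hypothetical names made up for this illustration. It only assumes what is described above, namely a per-thread recorder whose open call throws if the previous stream was never closed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RisReuseSketch {

    /** Hypothetical stand-in for a recording stream; NOT the real RecordingInputStream. */
    static class GuardedRecorder {
        private InputStream in; // non-null while a stream is considered open

        void open(InputStream wrapped, String uri) throws IOException {
            // The guard described in the report: refuse to open twice.
            if (in != null) {
                throw new IOException("RIS already open for "
                        + Thread.currentThread().getName() + ": " + uri);
            }
            in = wrapped;
        }

        void close() throws IOException {
            if (in != null) {
                in.close();
                in = null; // makes the recorder reusable
            }
        }
    }

    // One recorder per thread, mirroring the "thread-local" remark above.
    private static final ThreadLocal<GuardedRecorder> RECORDER =
            ThreadLocal.withInitial(GuardedRecorder::new);

    /** Opens the recorder twice on the current thread, optionally closing in between. */
    static String simulate(boolean closeBetweenOpens) {
        GuardedRecorder recorder = RECORDER.get();
        String uri = "https://example.invalid/robots.txt"; // placeholder URL
        try {
            recorder.open(new ByteArrayInputStream(new byte[0]), uri);
            if (closeBetweenOpens) {
                recorder.close();
            }
            recorder.open(new ByteArrayInputStream(new byte[0]), uri);
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        } finally {
            // Reset state so the sketch can be run repeatedly on one thread.
            try {
                recorder.close();
            } catch (IOException ignored) {
                // nothing useful to do in a sketch
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate(true));
        System.out.println(simulate(false));
    }
}
```

If this picture is right, commenting out the guard only hides the symptom. The sketch does not show where in the HTTPS-via-proxy path the second open (or the missing close) actually happens.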

marhop mentioned this issue Jan 31, 2019

"RIS already open for ToeThread..." exception during https pages crawl over proxy internetarchive/heritrix3#191

Closed
