I've been trying to crawl an HTTPS site through a Squid proxy and keep seeing errors like these:
java.io.IOException: RIS already open for ToeThread #12: https://XXX/robots.txt
at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:84)
at org.archive.util.Recorder.inputWrap(Recorder.java:185)
at org.archive.modules.fetcher.FetchHTTPRequest$RecordingHttpClientConnection.getSocketInputStream(FetchHTTPRequest.java:649)
at org.apache.http.impl.BHttpConnectionBase.ensureOpen(BHttpConnectionBase.java:131)
HTTP sites are fine, but HTTPS doesn't work at all. The problem seems to come down to RecordingInputStream and RecordingOutputStream, both of which throw an IOException if their underlying stream is already set (!= null).
If, however, I comment out those checks, the HTTPS crawl works perfectly (as far as I can tell...). I'm not sure whether this is the webarchive-commons library being overly cautious or heritrix3 failing to do something for HTTPS sites.
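To make the failure mode concrete, here is a minimal sketch of the kind of guard described above. It is not the webarchive-commons code: `GuardedStreamSketch`, `GuardedStream`, and `secondOpenRejected` are hypothetical names made up for illustration, and the only behaviour modelled is "open() throws if the wrapped stream is still non-null".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class GuardedStreamSketch {

    /** Hypothetical stand-in for a recording wrapper; NOT the real RecordingInputStream. */
    static class GuardedStream {
        private InputStream in; // null while the wrapper is closed

        void open(InputStream wrapped) throws IOException {
            // The guard under discussion: refuse to open while a stream is still attached.
            if (in != null) {
                throw new IOException("already open");
            }
            in = wrapped;
        }

        void close() throws IOException {
            if (in != null) {
                in.close();
                in = null; // reset so a later open() is allowed
            }
        }
    }

    /** Returns true if a second open() is rejected by the guard. */
    static boolean secondOpenRejected(boolean closeBetween) {
        GuardedStream s = new GuardedStream();
        try {
            s.open(new ByteArrayInputStream(new byte[0]));
            if (closeBetween) {
                s.close();
            }
        } catch (IOException e) {
            // Not expected in this sketch: the first open() starts from a closed wrapper.
            throw new UncheckedIOException(e);
        }
        try {
            s.open(new ByteArrayInputStream(new byte[0]));
            return false; // second open() was accepted
        } catch (IOException e) {
            return true; // second open() hit the guard
        }
    }

    public static void main(String[] args) {
        System.out.println("reopen without close rejected: " + secondOpenRejected(false));
        System.out.println("reopen after close rejected:   " + secondOpenRejected(true));
    }
}
```

With a guard like this, anything that asks for the socket stream twice on the same recorder without closing in between will fail on the second call. Whether that is what happens on the proxied HTTPS path is only my guess; commenting the check out hides the symptom but does not tell me which side should be closing or reusing the stream.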