Thanks to @gerhardgossen's pull request #36, the most important problems with the redir field are now fixed. Investigating one of our crawls in more depth, I found further redir values that break the CDX file format because they contain spaces (I anonymized the mail addresses):
mailto: [email protected]
mailto:john.doe @Informatik.Uni-Oldenburg.DE
mailto:john.doe@blicher Tbinger Anhang
mailto:[email protected]?subject=Antrag auf SAP Zugang
E:/SmartSource Data Collector/util/content/wt_dcs.gif
ttp://find.galegroup.com/bncn/infomark.do?serQuery=Locale%28en%2C%2C%29%3AFQE%3D%28JX%2CNone%2C16%29%22Dublin Gazette%22%24&queryType=PH&type=pubIssues&prodId=BBCN&version=1.0&source=library
So the main causes I found are
spaces in e-mail addresses (in all parts),
links to local files (without protocol), and
broken protocol names.
In short, broken URIs can produce broken CDX files, which I think should not be the case.
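One possible way to keep such values from splitting a CDX line is to percent-encode any whitespace in a field before it is written. The following is only a minimal sketch in Python, not the project's actual code: the function name `sanitize_cdx_field` is hypothetical, and using `-` for an empty value is an assumption based on the common CDX placeholder convention.

```python
import re


def sanitize_cdx_field(value: str) -> str:
    """Make a value safe for a single space-delimited CDX column.

    Hypothetical helper: every whitespace character is replaced by the
    percent-encoded form of its UTF-8 bytes, so the value can no longer
    be mistaken for several columns. All other characters are left as-is.
    """
    if not value:
        # Assumption: "-" marks an empty field, as is common in CDX files.
        return "-"

    def encode(match: re.Match) -> str:
        return "".join(f"%{byte:02X}" for byte in match.group().encode("utf-8"))

    return re.sub(r"\s", encode, value)


if __name__ == "__main__":
    print(sanitize_cdx_field("mailto:john.doe @Informatik.Uni-Oldenburg.DE"))
    print(sanitize_cdx_field("E:/SmartSource Data Collector/util/content/wt_dcs.gif"))
```

This does not repair the broken URI itself (a missing protocol stays missing); it only guarantees that the field occupies exactly one column.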
Another issue I found was a CDX line without a MIME type column, which causes similar problems.
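Both kinds of breakage (extra columns from spaces, missing columns such as the MIME type) can be detected by comparing each line's column count with the header. Below is a rough sketch in Python, assuming a space-delimited file whose first line is a ` CDX ...` header listing one letter per field; the function name `find_malformed_lines` and the sample data are made up for illustration.

```python
from typing import Iterable, List, Tuple


def find_malformed_lines(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line number, column count) for lines that do not match the header.

    Assumes the first line is a CDX header such as " CDX a b m s",
    where each letter after "CDX" names one column.
    """
    iterator = iter(lines)
    header = next(iterator).split()
    if not header or header[0] != "CDX":
        raise ValueError("missing CDX header line")
    expected = len(header) - 1  # do not count the "CDX" token itself

    malformed = []
    # Data lines start at line 2 because line 1 is the header.
    for number, line in enumerate(iterator, start=2):
        count = len(line.rstrip("\r\n").split(" "))
        if count != expected:
            malformed.append((number, count))
    return malformed


if __name__ == "__main__":
    sample = [
        " CDX a b m s",
        "http://example.org/ 20140101000000 text/html 200",
        "http://example.org/x 20140101000000 200",  # MIME type column missing
        "mailto:john.doe @example.org 20140101000000 text/html 200",  # space in URL
    ]
    print(find_malformed_lines(sample))
```

A check like this only flags the affected lines; it cannot tell which column is missing or which value contained the space.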