requestBetter() and requestLikeBrowser() functions #353
Conversation
I think we could also give an option to use the Googlebot UA ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"). Most sites will probably allow Google to index public content.
It can be an option, but I'm not sure how much it would help. There's an advanced Googlebot verification mechanism that many websites might use - https://support.google.com/webmasters/answer/80553?hl=en
What do you think about this package: https://www.npmjs.com/package/iltorb? I need something that supports stream decompression for Brotli. I think the stream implementation should be faster than the promise-based one.
@jancurn There are also other methods to detect a browser running in headless mode. For example, these 2 sites detect & block headless Chrome. To do this, they use the DataDome Real-Time Bot Protection service.
@LeMoussel Thanks for the tip! However, in this PR, the requests are made using an HTTP client that pretends to be a browser via its request headers.
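Pretending to be a browser at the HTTP level mostly comes down to sending browser-like headers. A hypothetical sketch (the exact header values in the PR may differ; these are only illustrative):

```javascript
// Hypothetical browser-like header set for an HTTP client; not the
// PR's actual list, just an illustration of the approach.
const browserLikeHeaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
};

console.log(Object.keys(browserLikeHeaders).length); // 4 headers
```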
It looks great! I have some comments, but it's nothing major.
test/utils_request.js
Outdated
```js
});

describe('Apify.requestAsBrowser', async () => {
    it('passes crunchbase.com non browser request blocking', async () => {
```
Not sure this will work on the CI, but let's see
I'd not do this; we need to keep tests as stable as possible!
When we use something external, it must be absolutely stable. If CrunchBase deploys some anti-scraping protection, it breaks our tests.
I was thinking about it. In general, I think it would be good to set up a kind of lab with the most common protection strategies and test our approaches to bypassing them.
That's definitely a good idea, but it could be a separate project or group of tests somewhere, as we don't want unit tests to fail when some website changes its protection.
src/utils_request.js
Outdated
```js
}
// Errors are often sent as JSON, so attempt to parse them,
// despite Accept header being set to something different.
if (type === 'application/json') {
```
This behavior should be described in the function comments; actually, it would be good to have a short section about errors there.
I think the error handling should be separated into another function. It's quite complicated already. Having it separate might actually point us to the necessary configuration options.
But at least the function should have a `throwOnHttpErrors` option; it's super simple functionality and useful.
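A hypothetical sketch of what such an option could look like (`throwOnHttpErrors` is the name from the discussion; `checkResponse` is a made-up helper, not the PR's actual code):

```javascript
// Sketch: throw on 4xx/5xx by default, but let callers that want to
// inspect error responses themselves opt out of throwing.
function checkResponse(response, { throwOnHttpErrors = true } = {}) {
    if (throwOnHttpErrors && response.statusCode >= 400) {
        throw new Error(`Request failed with status code ${response.statusCode}.`);
    }
    return response;
}

// Caller that handles errors itself:
const res = checkResponse({ statusCode: 404 }, { throwOnHttpErrors: false });
console.log(res.statusCode); // 404
```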
… 400+ error codes by default. Added better contentType check.
Thanks, this looks really good. Besides the other comments:

- I don't see the fix that's in `CheerioCrawler` that takes care of the out-of-scope tunnel agent error. It's in the `_suppressTunnelAgentAssertError()` function. Without it, `request` will crash the process occasionally.
- We should really think about the error handling. If this is to be used in `CheerioCrawler`, it needs much better errors than just plain errors with a message. We would want to inspect the response to see what to do next, or at least get some error object with a status code.
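A hypothetical shape for such richer errors (the `RequestError` name and fields are illustrative, not the PR's actual class): instead of a plain `Error` with a message, carry the status code and response so callers like `CheerioCrawler` can decide what to do next.

```javascript
// Sketch of an error type that exposes the response for inspection.
class RequestError extends Error {
    constructor(message, response) {
        super(message);
        this.name = 'RequestError';
        this.response = response;
        this.statusCode = response ? response.statusCode : undefined;
    }
}

const err = new RequestError('Request failed.', { statusCode: 503 });
console.log(err.statusCode); // 503
```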
test/utils_request.js
Outdated
```js
import zlib from 'zlib';
import express from 'express';
import { compress } from 'iltorb';
import { requestBetter, requestLikeBrowser } from '../build/utils_request';
```
We have a lot of `utils` files now. Perhaps we should think about a different structure; "utils" really doesn't say anything about anything.
Sure, but let's not complicate this PR any further; we can do this in another PR.
```js
.on('response', async (res) => {
    const shouldAbort = opts.abortFunction(res);
    if (shouldAbort) {
        request.abort();
```
Perhaps missing a `return`?
Also, does `request.abort()` destroy the `res` stream as well when using the response event listener?
No, it does not. I added the `res.destroy()` call.
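A self-contained sketch of the abort path discussed above, with stand-in objects instead of the real request/response streams (the helper name and boolean return are assumptions for illustration): return right after aborting so the rest of the handler never runs, and destroy the response stream explicitly, because `request.abort()` does not destroy it.

```javascript
// Returns true when the response was aborted; the caller must then
// stop reading the body. res.destroy() frees the response stream,
// which request.abort() alone does not do.
function handleResponse(request, res, abortFunction) {
    if (abortFunction(res)) {
        request.abort();
        res.destroy();
        return true; // aborted - do not keep processing
    }
    return false; // not aborted - caller continues with the body
}
```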
src/utils_request.js
Outdated
```js
}

// No need to catch invalid content header - it is already caught by request
const { type, encoding } = contentType.parse(res);
```
What if the `request` and `contentType` validations are different? We still need to handle this, because it's an error in a callback and it could kill the process.
You are right. I thought that the request package uses the content-type package, but I was wrong. The error that put me on the wrong path was coming from express.
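A defensive sketch of the point above: the content-type parsing runs inside a response callback, so an unhandled throw there can crash the process. This hypothetical helper (a simplified stand-in for `contentType.parse()` from the content-type package, with an assumed fallback type) swallows the parse error instead:

```javascript
// Parse a content-type header, but never throw from inside a callback;
// fall back to a safe default on malformed input. (Simplified stand-in
// for contentType.parse(); fallback values are an assumption.)
function safeParseContentType(header) {
    try {
        if (typeof header !== 'string' || !header.includes('/')) {
            throw new Error(`Invalid content-type header: ${header}`);
        }
        const [type] = header.split(';');
        return { type: type.trim().toLowerCase(), encoding: 'utf-8' };
    } catch (err) {
        return { type: 'application/octet-stream', encoding: 'utf-8' };
    }
}

console.log(safeParseContentType('text/html; charset=utf-8').type); // text/html
```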
src/utils_request.js
Outdated
```js
    throw new Error(`requestLikeBrowser: Resource ${options.url} is not available in HTML format. Skipping resource.`);
}

if (type.toLowerCase() !== 'text/html') {
```
Same here, it's pointless to download the wrong content type.
It is the same problem as in the comment above: it should be in `requestBetter`. However, I am not sure whether `requestBetter` should use HTML-specific things. I might check the request `Accept` header first, and if it is HTML, I could make this check.
src/utils_request.js
Outdated
```js
const { body } = response;

if (!body) throw new Error('The response body is empty');
```
Not sure that we should throw on an empty string. Does `request` throw?
Actually, I'm 100% positive it does not, because no body could be a 204 response code. However, in this context, I think that `requestLikeBrowser` wants to get some kind of body, since we are checking whether the `content-type` is HTML.
Tried it in a browser and it just shows an empty screen; no error in the console.
That is true. I wanted to make a different point here. The purpose of `requestLikeBrowser` is to get the HTML of a page, so I would consider returning anything except HTML an error. If we returned an empty or missing body, it would, in my opinion, force a no-body check everywhere the function is used.
Empty content with a 200 OK is still content when you're scraping. I don't think the crawlers should retry and then mark the request as failed just because some page had empty HTML.
src/utils_request.js
Outdated
```js
// Handle situations where the server explicitly states that
// it will not serve the resource as text/html by skipping.
if (response.statusCode === 406) {
    throw new Error(`requestLikeBrowser: Resource ${options.url} is not available in HTML format. Skipping resource.`);
```
`Skipping resource` should not be there.
It should be in `requestBetter`. However, I am not sure whether `requestBetter` should use HTML-specific things. I might check the request `Accept` header first, and if it is HTML, I could make this check.
We'll need to do it somehow in `CheerioCrawler`, so I guess having an option in `requestBetter` to abort on certain `content-type`s could be handy. This could then be used in `requestAsBrowser` for `text/html` specifically, and `CheerioCrawler` would just use that.
I think that I could do this in the `abortFunction` 💡
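A hypothetical `abortFunction` along those lines (the exact signature is assumed, not taken from the PR): abort as soon as the response headers show a non-HTML content type, so the body is never downloaded.

```javascript
// Abort any response whose Content-Type header is not text/html.
// A missing header counts as non-HTML, so it is aborted too.
const abortNonHtml = (res) => {
    const type = (res.headers['content-type'] || '').toLowerCase();
    return !type.startsWith('text/html');
};

console.log(abortNonHtml({ headers: { 'content-type': 'application/pdf' } })); // true
```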
@petrpatek
@mnmkng Should I attach/unmount the listener every time a request is made, or just initialize it once and let it be?
src/utils_request.js
Outdated
```js
if (status >= 400 && throwOnHttpError) {
    const error = await getMoreErrorInfo(res, cType);
    reject(error);
```
A `return` is missing here.
src/utils_request.js
Outdated
```js
}

if (type === 'application/json') {
    const errorResponse = JSON.parse(body);
```
IMHO we should have a try-catch here, and if parsing throws, use the truncated body as the error message. It's quite common in the case of a network error that the body is incomplete and cannot be parsed; the error that gets thrown then will be confusing.
See #255
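A sketch of the suggested try-catch (`parseErrorBody` and the 100-character truncation limit are assumptions for illustration): if the possibly truncated body cannot be parsed as JSON, fall back to a truncated plain-text message instead of surfacing a confusing `SyntaxError`.

```javascript
// Try to parse an error body as JSON; on failure, return a truncated
// plain-text message so incomplete bodies don't produce parse errors.
function parseErrorBody(body, maxLength = 100) {
    try {
        return JSON.parse(body);
    } catch (err) {
        return { message: String(body).substring(0, maxLength) };
    }
}

console.log(parseErrorBody('{"error":"Not Found"}').error); // Not Found
```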