requestBetter() and requestLikeBrowser() functions #353

Merged
merged 12 commits into from
Apr 15, 2019

Conversation

jancurn
Member

@jancurn jancurn commented Mar 28, 2019

See #255

@petrpatek
Contributor

I think that we can also give an option to use the Mozilla UA enhanced by the Googlebot token ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"). Most sites will probably allow Google to index public content.

@jancurn
Member Author

jancurn commented Apr 2, 2019

It can be an option, but I'm not sure how much it will help. There's an advanced Googlebot verification mechanism that many websites might use - https://support.google.com/webmasters/answer/80553?hl=en

@petrpatek
Contributor

What do you think about this package: https://www.npmjs.com/package/iltorb? I need something that supports stream decompression for Brotli. I think the stream implementation should be faster than the promise-based one.

@LeMoussel

@jancurn There are also other methods to detect a browser running in headless mode.
Despite the tips described in this article, it still seems possible to detect Chrome Headless.

For example, these two sites detect and block Chrome Headless:
https://programmetv.ouest-france.fr/1
https://www.orexad.com/collier-bande-pleine-largeur-9-mm-w1/p-0445-8013441

They do this using the DataDome real-time bot protection service.
I have no idea how it detects Chrome Headless.

@petrpatek
Contributor

@LeMoussel Thanks for the tip! However, in this PR the requests are made using an HTTP client that pretends to be a browser via its request headers.

@petrpatek petrpatek changed the title [WIP] requestBetter() and requestLikeBrowser() functions requestBetter() and requestLikeBrowser() functions Apr 3, 2019
@petrpatek petrpatek requested review from mnmkng and mtrunkat April 3, 2019 15:28
Member Author

@jancurn jancurn left a comment

It looks great! I have some comments, but it's nothing major.

test/_helper.js (resolved)
});

describe('Apify.requestAsBrowser', async () => {
it('passes crunchbase.com non browser request blocking', async () => {
Member Author

Not sure this will work on the CI, but let's see

Member

I'd not do this; we need to keep tests as stable as possible!

Member

When we use something external, it must be absolutely stable. If CrunchBase deploys some anti-scraping protection, it breaks our tests.

Contributor

I was thinking about it. In general, I think it would be good to set up a kind of lab with the most common protection strategies and test our approaches for bypassing them.

Member

@mtrunkat mtrunkat Apr 6, 2019

That's definitely a good idea, but it could be a separate project or a group of tests somewhere, as we don't want unit tests to fail when some website changes its protection.

src/utils_request.js (outdated, resolved)
src/utils_request.js (outdated, resolved)
}
// Errors are often sent as JSON, so attempt to parse them,
// despite Accept header being set to something different.
if (type === 'application/json') {
Member Author

This behavior should be described in the function's comments; actually, it would be good to have a short section about errors there.

Member

I think the error handling should be separated into another function. It's quite complicated already. Having it separate might actually point us to the necessary configuration options.

Member Author

But the function should at least have a throwOnHttpErrors option; it's super simple functionality and useful.
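A minimal sketch of what such an option might look like (the helper name and error shape are hypothetical, not the PR's actual code):

```javascript
// Hypothetical helper: with throwOnHttpErrors enabled, any response
// with status >= 400 becomes an exception carrying the status code,
// so callers can still inspect it.
function checkHttpError(response, { throwOnHttpErrors = false } = {}) {
    if (throwOnHttpErrors && response.statusCode >= 400) {
        const error = new Error(`Request failed with status code ${response.statusCode}`);
        error.statusCode = response.statusCode;
        throw error;
    }
    return response;
}
```

With the option off (the default here), error responses pass through untouched, so callers that want to inspect 4xx/5xx bodies can still do so.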

src/utils_request.js (outdated, resolved)
src/utils_request.js (outdated, resolved)
src/utils_request.js (outdated, resolved)
src/utils_request.js (outdated, resolved)
Member

@mnmkng mnmkng left a comment

Thanks, this looks really good. Besides the other comments:

  1. I don't see the fix from CheerioCrawler that takes care of the out-of-scope tunnel agent error. It's in _suppressTunnelAgentAssertError(). Without it, request will occasionally crash the process.

  2. We should really think about error handling. If this is to be used in CheerioCrawler, it needs much better errors than plain errors with a message. We would want to inspect the response to decide what to do next, or at least get an error object with a status code.

src/utils_request.js (outdated, resolved)
import zlib from 'zlib';
import express from 'express';
import { compress } from 'iltorb';
import { requestBetter, requestLikeBrowser } from '../build/utils_request';
Member

We have a lot of utils files now. Perhaps we should think about a different structure. Utils really doesn't say anything about anything.

Member Author

Sure, but let's not complicate this PR any further, we can do this in another PR.

.on('response', async (res) => {
const shouldAbort = opts.abortFunction(res);
if (shouldAbort) {
request.abort();
Member

Perhaps a missing return?

Member

Also, does the request.abort() destroy the res stream as well when using the response event listener?

Contributor

No, it does not. I added res.destroy().
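The resulting pattern, sketched with a hypothetical handler function (in the PR the logic lives inside the 'response' event listener):

```javascript
// Sketch of the abort branch discussed above: return after aborting so
// the rest of the handler does not run, and destroy the response stream,
// which request.abort() does not do for a stream obtained from the
// 'response' event.
function handleResponse(request, res, opts, onOk) {
    const shouldAbort = opts.abortFunction(res);
    if (shouldAbort) {
        request.abort();
        res.destroy(); // abort() alone leaves the res stream open
        return;        // without this, processing would continue below
    }
    onOk(res);
}
```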

}

// No need to catch invalid content header - it is already caught by request
const { type, encoding } = contentType.parse(res);
Member

What if the request and content-type validations differ? We still need to handle this, because the error is thrown in a callback and could kill the process.

Contributor

You are right. I thought that the request package uses the content-type package, but I was wrong. The error that put me on the wrong path was coming from express.
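A defensive wrapper along the lines discussed above; the tiny parser below is illustrative only (the real code uses the content-type package, whose parse() throws on malformed headers, so the point is the surrounding try-catch):

```javascript
// Hypothetical safe wrapper: never let a malformed Content-Type header
// throw inside a callback; fall back to a generic binary type instead.
function safeParseContentType(header) {
    try {
        if (typeof header !== 'string' || !header.includes('/')) throw new Error('invalid content-type');
        const [typePart, ...params] = header.split(';');
        const type = typePart.trim().toLowerCase();
        let encoding;
        for (const p of params) {
            const [key, value] = p.split('=');
            if (key && key.trim().toLowerCase() === 'charset' && value) encoding = value.trim().toLowerCase();
        }
        return { type, encoding };
    } catch (err) {
        return { type: 'application/octet-stream', encoding: undefined };
    }
}
```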

src/utils_request.js (outdated, resolved)
throw new Error(`requestLikeBrowser: Resource ${options.url} is not available in HTML format. Skipping resource.`);
}

if (type.toLowerCase() !== 'text/html') {
Member

Same here, pointless to download the wrong content type.

Contributor

It is the same problem as in:

It should be in requestBetter. However, I am not sure if requestBetter should use HTML-specific things. I might check the request's Accept header first, and if it is HTML, make this check.

src/utils_request.js (outdated, resolved)
src/utils_request.js (outdated, resolved)

const { body } = response;

if (!body) throw new Error('The response body is empty');
Member

Not sure we should throw on an empty string. Does request throw?

Contributor

Actually, I'm 100% positive it does not, because an empty body could be a 204 response. However, in this context, I think requestLikeBrowser wants to get some kind of body, since we are checking whether the content type is HTML.

Member

Tried it in browser and it just shows an empty screen. No error in console.

Contributor

@petrpatek petrpatek Apr 9, 2019

That is true. I wanted to make a different point here. The purpose of requestLikeBrowser is to get the HTML of a page, so I would consider returning anything other than HTML an error. If we returned an empty or missing body, in my opinion we would end up needing an empty-body check everywhere the function is used.

Member

Empty content with a 200 OK is still content when you're scraping. I don't think the crawlers should retry and then mark the request as failed just because some page had empty HTML.

// Handle situations where the server explicitly states that
// it will not serve the resource as text/html by skipping.
if (response.statusCode === 406) {
throw new Error(`requestLikeBrowser: Resource ${options.url} is not available in HTML format. Skipping resource.`);
Member

The "Skipping resource" part should not be there.

Contributor

It should be in requestBetter. However, I am not sure if requestBetter should use HTML-specific things. I might check the request's Accept header first, and if it is HTML, make this check.

Member

We'll need to do it somehow in CheerioCrawler, so I guess having an option in requestBetter to abort on certain content types could be handy. It could then be used in requestAsBrowser for text/html specifically, and CheerioCrawler would just use that.

Contributor

I think that I could do this in the abortFunction 💡
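One possible shape for that abortFunction (the factory name is hypothetical): abort as soon as the response headers show a content type we do not want, before downloading the body.

```javascript
// Hypothetical factory: builds an abortFunction that rejects any
// response whose Content-Type does not match the allowed type.
function makeContentTypeAbortFunction(allowedType) {
    return (res) => {
        const header = (res.headers && res.headers['content-type']) || '';
        const type = header.split(';')[0].trim().toLowerCase();
        return type !== allowedType;
    };
}
```

requestAsBrowser could then pass `makeContentTypeAbortFunction('text/html')`, and CheerioCrawler would get the behavior for free.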

src/utils_request.js (outdated, resolved)
src/utils_request.js (outdated, resolved)
src/utils_request.js (outdated, resolved)
src/utils_request.js (resolved)

@petrpatek petrpatek requested review from mtrunkat and mnmkng April 8, 2019 13:05
@petrpatek
Contributor

petrpatek commented Apr 8, 2019

Thanks, this looks really good. Besides the other comments:

  1. I don't see the fix from CheerioCrawler that takes care of the out-of-scope tunnel agent error. It's in _suppressTunnelAgentAssertError(). Without it, request will occasionally crash the process.
  2. We should really think about error handling. If this is to be used in CheerioCrawler, it needs much better errors than plain errors with a message. We would want to inspect the response to decide what to do next, or at least get an error object with a status code.

Regarding _suppressTunnelAgentAssertError(): should we move it somewhere else so it could be shared by both parts?

@mnmkng
Member

mnmkng commented Apr 9, 2019

Regarding _suppressTunnelAgentAssertError(): should we move it somewhere else so it could be shared by both parts?

@petrpatek
I think it should be integrated directly into requestExtended, because it actually fixes an existing issue with the request package.

@petrpatek
Contributor

Regarding _suppressTunnelAgentAssertError(): should we move it somewhere else so it could be shared by both parts?

@petrpatek
I think it should be integrated directly into requestExtended, because it actually fixes an existing issue with the request package.

@mnmkng Should I attach/detach the listener every time a request is made, or just initialize it once and let it be?
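The "initialize once" option can be sketched with a module-level flag (a hypothetical shape, not the actual Apify implementation): repeated calls become no-ops, so the suppression listener is attached at most once no matter how many requests are made.

```javascript
// Module-level guard so the process-wide listener is attached only once.
let tunnelErrorListenerAttached = false;

function attachTunnelAgentErrorSuppression(handler) {
    if (tunnelErrorListenerAttached) return false; // already attached
    process.on('uncaughtException', handler);
    tunnelErrorListenerAttached = true;
    return true;
}
```

Attaching and detaching per request would risk races between concurrent requests; a single process-wide attachment sidesteps that.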


if (status >= 400 && throwOnHttpError) {
const error = await getMoreErrorInfo(res, cType);
reject(error);
Member

A return is missing here.
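Why the return matters: a promise settles only once, so the reject wins either way, but without the return the code after it still executes. A synchronous sketch of the fixed executor body (names hypothetical), with resolve/reject injected so the effect is directly observable:

```javascript
// Sketch of the fix: bail out of the promise executor after rejecting.
function executorBody(status, resolve, reject, { throwOnHttpError = true } = {}) {
    if (status >= 400 && throwOnHttpError) {
        reject(new Error(`HTTP ${status}`));
        return; // without this, execution would fall through to resolve()
    }
    resolve({ status });
}
```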

}

if (type === 'application/json') {
const errorResponse = JSON.parse(body);
Member

IMHO we should have a try-catch here, and if parsing throws, use the truncated body as the error message. It's quite common in case of a network error that the body is incomplete and cannot be parsed; the error that gets thrown then would be confusing.
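A sketch of the suggested try-catch (function name and fallback length are hypothetical):

```javascript
// If the (possibly truncated) body is not valid JSON, fall back to a
// truncated snippet of the raw body so the error message stays meaningful.
function parseErrorBody(body, maxLength = 200) {
    try {
        const parsed = JSON.parse(body);
        return parsed.message || JSON.stringify(parsed);
    } catch (err) {
        return body.slice(0, maxLength); // truncated raw body as the message
    }
}
```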

@mnmkng mnmkng merged commit 73a23e7 into develop Apr 15, 2019
@mnmkng mnmkng deleted the feature/request-like-browser branch April 15, 2019 12:51