requestBetter() and requestLikeBrowser() functions #353
Conversation
I think we could also give an option to use the Googlebot UA ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"). Most sites will probably allow Google to index public content.
It can be an option, but I'm not sure how much it would help. There's an advanced Googlebot verification mechanism that many websites might use - https://support.google.com/webmasters/answer/80553?hl=en
What do you think about this package: https://www.npmjs.com/package/iltorb? I need something that supports stream decompression for Brotli. I think the stream implementation should be faster than the promise-based one.
@jancurn There are also other methods to detect a browser running in headless mode. For example, these 2 sites detect & block headless Chrome. To do this, they use the DataDome Real-Time Bot Protection service.
@LeMoussel Thanks for the tip! However, in this PR, the requests are made using an HTTP client that pretends to be a browser via its request headers.
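Pretending to be a browser at the HTTP level mostly comes down to sending browser-like headers. A hypothetical sketch (the exact header values in the PR may differ; these are only illustrative):

```javascript
// Hypothetical browser-like header set for an HTTP client; not the
// PR's actual list, just an illustration of the approach.
const browserLikeHeaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
};

console.log(Object.keys(browserLikeHeaders).length); // 4 headers
```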
It looks great! I have some comments, but it's nothing major.
test/utils_request.js
Outdated
```js
});

describe('Apify.requestAsBrowser', async () => {
    it('passes crunchbase.com non browser request blocking', async () => {
```
Not sure this will work on the CI, but let's see
I'd not do this; we need to keep tests as stable as possible!
When we use something external, it must be absolutely stable. If CrunchBase deploys some anti-scraping protection, it breaks our tests.
I was thinking about it. In general, I think it would be good to set up a kind of lab with the most common protection strategies and test our approaches to bypassing them.
That's definitely a good idea, but it could be a separate project or group of tests somewhere, as we don't want unit tests to fail when some website changes its protection.
src/utils_request.js
Outdated
```js
}
// Errors are often sent as JSON, so attempt to parse them,
// despite Accept header being set to something different.
if (type === 'application/json') {
```
This behavior should be described in the function comments; actually, it would be good to have a short section about errors there.
I think the error handling should be separated into another function. It's quite complicated already. Having it separate might actually point us to the necessary configuration options.
But at least the function should have a `throwOnHttpErrors` option; it's super simple functionality and useful.
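A hypothetical sketch of what such an option could look like (`throwOnHttpErrors` is the name from the discussion; `checkResponse` is a made-up helper, not the PR's actual code):

```javascript
// Sketch: throw on 4xx/5xx by default, but let callers that want to
// inspect error responses themselves opt out of throwing.
function checkResponse(response, { throwOnHttpErrors = true } = {}) {
    if (throwOnHttpErrors && response.statusCode >= 400) {
        throw new Error(`Request failed with status code ${response.statusCode}.`);
    }
    return response;
}

// Caller that handles errors itself:
const res = checkResponse({ statusCode: 404 }, { throwOnHttpErrors: false });
console.log(res.statusCode); // 404
```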
… 400+ error codes by default. Added better contentType check.
Thanks, this looks really good. Besides the other comments:

- I don't see the fix that's in `CheerioCrawler` that takes care of the out-of-scope tunnel agent error. It's in the `_suppressTunnelAgentAssertError()` function. Without it, `request` will crash the process occasionally.
- We should really think about the error handling. If this is to be used in `CheerioCrawler`, it needs much better errors than just plain errors with a message. We would want to inspect the response to see what to do next, or at least get some error object with a status code.
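A hypothetical shape for such richer errors (the `RequestError` name and fields are illustrative, not the PR's actual class): instead of a plain `Error` with a message, carry the status code and response so callers like `CheerioCrawler` can decide what to do next.

```javascript
// Sketch of an error type that exposes the response for inspection.
class RequestError extends Error {
    constructor(message, response) {
        super(message);
        this.name = 'RequestError';
        this.response = response;
        this.statusCode = response ? response.statusCode : undefined;
    }
}

const err = new RequestError('Request failed.', { statusCode: 503 });
console.log(err.statusCode); // 503
```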
test/utils_request.js
Outdated
```js
import zlib from 'zlib';
import express from 'express';
import { compress } from 'iltorb';
import { requestBetter, requestLikeBrowser } from '../build/utils_request';
```
We have a lot of `utils` files now. Perhaps we should think about a different structure; "utils" really doesn't say anything about anything.
Sure, but let's not complicate this PR any further; we can do this in another PR.
```js
.on('response', async (res) => {
    const shouldAbort = opts.abortFunction(res);
    if (shouldAbort) {
        request.abort();
```
Perhaps missing a `return`?
Also, does `request.abort()` destroy the `res` stream as well when using the response event listener?
No, it does not. I added the `res.destroy()` call.
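A self-contained sketch of the abort path discussed above, with stand-in objects instead of the real request/response streams (the helper name and boolean return are assumptions for illustration): return right after aborting so the rest of the handler never runs, and destroy the response stream explicitly, because `request.abort()` does not destroy it.

```javascript
// Returns true when the response was aborted; the caller must then
// stop reading the body. res.destroy() frees the response stream,
// which request.abort() alone does not do.
function handleResponse(request, res, abortFunction) {
    if (abortFunction(res)) {
        request.abort();
        res.destroy();
        return true; // aborted - do not keep processing
    }
    return false; // not aborted - caller continues with the body
}
```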
src/utils_request.js
Outdated
```js
}

// No need to catch invalid content header - it is already caught by request
const { type, encoding } = contentType.parse(res);
```
What if the `request` and `contentType` validations are different? We still need to handle this, because it's an error in a callback and it could kill the process.
You are right. I thought that the request package uses the content-type package, but I was wrong. The error that put me on the wrong path was coming from express.
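A defensive sketch of the point above: the content-type parsing runs inside a response callback, so an unhandled throw there can crash the process. This hypothetical helper (a simplified stand-in for `contentType.parse()` from the content-type package, with an assumed fallback type) swallows the parse error instead:

```javascript
// Parse a content-type header, but never throw from inside a callback;
// fall back to a safe default on malformed input. (Simplified stand-in
// for contentType.parse(); fallback values are an assumption.)
function safeParseContentType(header) {
    try {
        if (typeof header !== 'string' || !header.includes('/')) {
            throw new Error(`Invalid content-type header: ${header}`);
        }
        const [type] = header.split(';');
        return { type: type.trim().toLowerCase(), encoding: 'utf-8' };
    } catch (err) {
        return { type: 'application/octet-stream', encoding: 'utf-8' };
    }
}

console.log(safeParseContentType('text/html; charset=utf-8').type); // text/html
```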
src/utils_request.js
Outdated
```js
    throw new Error(`requestLikeBrowser: Resource ${options.url} is not available in HTML format. Skipping resource.`);
}

if (type.toLowerCase() !== 'text/html') {
```
Same here, it's pointless to download the wrong content type.
It is the same problem as in the comment above: it should be in `requestBetter`. However, I am not sure whether `requestBetter` should use HTML-specific things. I might check the request `Accept` header first, and if it is HTML, I could make this check.
src/utils_request.js
Outdated
```js
const { body } = response;

if (!body) throw new Error('The response body is empty');
```
Not sure that we should throw on an empty string. Does `request` throw?
Actually, I'm 100% positive it does not, because no body could be a 204 response code. However, in this context, I think that `requestLikeBrowser` wants to get some kind of body, since we are checking whether the `content-type` is HTML.
Tried it in a browser and it just shows an empty screen; no error in the console.
That is true. I wanted to make a different point here. The purpose of `requestLikeBrowser` is to get the HTML of a page, so I would consider returning anything except HTML an error. If we returned an empty or missing body, it would, in my opinion, force a no-body check everywhere the function is used.
Empty content with a 200 OK is still content when you're scraping. I don't think the crawlers should retry and then mark the request as failed just because some page had empty HTML.
src/utils_request.js
Outdated
```js
// Handle situations where the server explicitly states that
// it will not serve the resource as text/html by skipping.
if (response.statusCode === 406) {
    throw new Error(`requestLikeBrowser: Resource ${options.url} is not available in HTML format. Skipping resource.`);
```
`Skipping resource` should not be there.
It should be in `requestBetter`. However, I am not sure whether `requestBetter` should use HTML-specific things. I might check the request `Accept` header first, and if it is HTML, I could make this check.
We'll need to do it somehow in `CheerioCrawler`, so I guess having an option in `requestBetter` to abort on certain `content-type`s could be handy. This could then be used in `requestAsBrowser` for `text/html` specifically, and `CheerioCrawler` would just use that.
I think that I could do this in the `abortFunction` 💡
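A hypothetical `abortFunction` along those lines (the exact signature is assumed, not taken from the PR): abort as soon as the response headers show a non-HTML content type, so the body is never downloaded.

```javascript
// Abort any response whose Content-Type header is not text/html.
// A missing header counts as non-HTML, so it is aborted too.
const abortNonHtml = (res) => {
    const type = (res.headers['content-type'] || '').toLowerCase();
    return !type.startsWith('text/html');
};

console.log(abortNonHtml({ headers: { 'content-type': 'application/pdf' } })); // true
```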
@petrpatek
@mnmkng Should I attach/unmount the listener every time a request is made, or just initialize it once and let it be?
src/utils_request.js
Outdated
```js
if (status >= 400 && throwOnHttpError) {
    const error = await getMoreErrorInfo(res, cType);
    reject(error);
```
A `return` is missing here.
src/utils_request.js
Outdated
```js
}

if (type === 'application/json') {
    const errorResponse = JSON.parse(body);
```
IMHO we should have a try-catch here, and if parsing throws, use the truncated body as the error message. It's quite common in the case of a network error that the body is incomplete and cannot be parsed; the error that gets thrown then will be confusing.
See #255
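A sketch of the suggested try-catch (`parseErrorBody` and the 100-character truncation limit are assumptions for illustration): if the possibly truncated body cannot be parsed as JSON, fall back to a truncated plain-text message instead of surfacing a confusing `SyntaxError`.

```javascript
// Try to parse an error body as JSON; on failure, return a truncated
// plain-text message so incomplete bodies don't produce parse errors.
function parseErrorBody(body, maxLength = 100) {
    try {
        return JSON.parse(body);
    } catch (err) {
        return { message: String(body).substring(0, maxLength) };
    }
}

console.log(parseErrorBody('{"error":"Not Found"}').error); // Not Found
```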