Create requestLikeBrowser function #255

jancurn · 2018-12-12T09:48:23Z

The function will download HTML using the request package, but it will emulate the HTTP headers of a normal browser to reduce the chance of bot detection. Once it is done and tested, we should use this function in CheerioCrawler by default.

In the first version, let's just emulate Firefox with the latest user agent. In the future, we could support other browsers and user agents, so design the function so that it can be extended later, e.g. by giving it an options parameter.
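As a rough illustration of the "options param" idea, header construction could be pulled into a small helper. This is only a sketch: `buildBrowserHeaders`, its option names, and the two user-agent strings are assumptions made for the example, not part of any existing API.

```javascript
// Hypothetical user-agent strings; real values should track current Firefox releases.
const FIREFOX_DESKTOP_USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0';
const FIREFOX_MOBILE_USER_AGENT = 'Mozilla/5.0 (Android 9; Mobile; rv:64.0) Gecko/64.0 Firefox/64.0';

/**
 * Builds Firefox-like HTTP headers for the given URL.
 * The function name and the options are illustrative assumptions.
 */
function buildBrowserHeaders(url, options = {}) {
    const { useMobileVersion = false, languageCode, countryCode } = options;
    const { host } = new URL(url);

    // Fall back to a wildcard when no language is given.
    let acceptLanguage = '*';
    if (languageCode && countryCode) {
        acceptLanguage = `${languageCode}-${countryCode},${languageCode};q=0.5`;
    } else if (languageCode) {
        acceptLanguage = languageCode;
    }

    return {
        Host: host,
        'User-Agent': useMobileVersion ? FIREFOX_MOBILE_USER_AGENT : FIREFOX_DESKTOP_USER_AGENT,
        Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': acceptLanguage,
        'Accept-Encoding': 'gzip, deflate, br',
        DNT: '1',
        Connection: 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    };
}
```

Keeping the browser-specific values behind one function would make it easier to add other browsers later, for example through a hypothetical `browser` option.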

Here's a code snippet that can be used as a starting point.

const gzip = Promise.promisify(zlib.gzip, { context: zlib });
const gunzip = Promise.promisify(zlib.gunzip, { context: zlib });
const deflate = Promise.promisify(zlib.deflate, { context: zlib });

const reqOpts = {
            url,
            // Emulate Firefox HTTP headers
            // TODO: We should move this to apify-js or apify-shared-js
            headers: {
                Host: parsedUrlModified.host,
                'User-Agent': useMobileVersion ? FIREFOX_MOBILE_USER_AGENT : FIREFOX_DESKTOP_USER_AGENT,
                Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language': languageCode ? `${languageCode}-${countryCode},${languageCode};q=0.5` : '*', // TODO: get this from country !
                'Accept-Encoding': 'gzip, deflate, br',
                DNT: '1',
                Connection: 'keep-alive',
                'Upgrade-Insecure-Requests': '1',
            },
            // Return response as raw Buffer
            encoding: null,
        };
       
        const result = await utils.requestPromised(reqOpts, false);
        let body;

        try {
            // eslint-disable-next-line prefer-destructuring
            body = result.body;

            // Decode response body
            const contentEncoding = result.response.headers['content-encoding'];
            switch (contentEncoding) {
                case 'br':
                    body = await brotli.decompress(body);
                    break;
                case 'gzip':
                    body = await gunzip(body);
                    break;
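                case 'deflate':
                    // A deflate-encoded body must be decompressed with inflate;
                    // zlib.deflate would compress it again.
                    body = await Promise.promisify(zlib.inflate, { context: zlib })(body);
                    break;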
                case 'identity':
                case null:
                case undefined:
                    break;
                default:
                    throw new Error(`Received unexpected Content-Encoding: ${contentEncoding}`);
            }
            body = body.toString('utf8');

            const { statusCode } = result;
            if (statusCode !== 200) {
                throw new Error(`Received HTTP error response status ${statusCode}`);
            }

            const contentType = result.response.headers['content-type'];
            if (contentType !== 'text/html; charset=UTF-8') {
                throw new Error(`Received unexpected Content-Type: ${contentType}`);
            }

            if (!body) throw new Error('The response body is empty');
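
The decoding step in the snippet above could also be isolated into a helper that relies only on Node's built-in `zlib` module. This is a sketch under assumptions: `decodeBody` is a hypothetical name, and `zlib.brotliDecompress` requires a Node.js version that ships Brotli support (the snippet above uses a separate `brotli` package instead).

```javascript
const zlib = require('zlib');
const { promisify } = require('util');

const gunzip = promisify(zlib.gunzip);
const inflate = promisify(zlib.inflate);
const brotliDecompress = promisify(zlib.brotliDecompress);

/**
 * Decompresses a raw response Buffer according to its Content-Encoding header.
 * Resolves with a Buffer; rejects on an unknown encoding.
 */
async function decodeBody(body, contentEncoding) {
    switch (contentEncoding) {
        case 'br':
            return brotliDecompress(body);
        case 'gzip':
            return gunzip(body);
        case 'deflate':
            // "deflate" bodies are decompressed with inflate.
            return inflate(body);
        case 'identity':
        case null:
        case undefined:
            return body;
        default:
            throw new Error(`Received unexpected Content-Encoding: ${contentEncoding}`);
    }
}
```

A caller would pass the raw body and the header value, e.g. `decodeBody(result.body, result.response.headers['content-encoding'])`, and convert the returned Buffer to a string afterwards.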


jancurn · 2018-12-19T15:10:23Z

The function should also have an option to abort downloading responses whose content type does not match a selected set of types. Then we can use this function in CheerioCrawler to avoid downloading non-HTML content, and greatly simplify CheerioCrawler.
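
One possible shape for such a check is a small predicate that runs on the response headers before the body is read. This is a minimal sketch: `isAllowedContentType` and its default list are assumptions made for the example.

```javascript
/**
 * Returns true if the Content-Type header matches one of the allowed MIME types.
 * Parameters such as "charset=UTF-8" are ignored and the comparison is case-insensitive.
 */
function isAllowedContentType(contentTypeHeader, allowedTypes = ['text/html']) {
    if (!contentTypeHeader) return false;
    const mimeType = contentTypeHeader.split(';')[0].trim().toLowerCase();
    return allowedTypes.includes(mimeType);
}
```

With a streaming HTTP client, the predicate could be called as soon as the response headers arrive, and the request aborted when it returns false, so that the body of a non-HTML response is never downloaded.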

mnmkng · 2018-12-19T16:38:27Z

Maybe we can create an HttpClient class to encapsulate this and also make it more usable by other SDK users.

jancurn · 2018-12-19T16:48:24Z

Sure, although I wouldn't over-engineer this. Developers are familiar with the request-style functions.

jancurn · 2018-12-24T09:27:49Z

During this task, we might also try to fix bug #266

jancurn · 2019-01-04T11:40:12Z

BTW the downloadListOfUrls function should also use this new approach, to reduce blocking e.g. when we download sitemaps. For example, fetching URLs from https://beachwaver.com/sitemap_products_1.xml doesn't work by default.

jancurn · 2019-01-04T15:44:15Z

Also, the function should be able to report the final URL (after all redirects).
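
To illustrate what "final URL" means here: each redirect's `Location` header may be relative, so it has to be resolved against the URL that produced it. The sketch below assumes the chain of `Location` values has already been collected; `resolveFinalUrl` is a hypothetical helper, not an existing function.

```javascript
/**
 * Resolves a chain of Location header values, in the order they were received,
 * starting from the originally requested URL. Returns the final absolute URL.
 */
function resolveFinalUrl(startUrl, locations) {
    return locations.reduce(
        // Each Location is resolved relative to the URL that returned it.
        (currentUrl, location) => new URL(location, currentUrl).href,
        new URL(startUrl).href,
    );
}
```

In practice the HTTP client would follow the redirects itself; the point is only that the function's result should expose the last URL of this chain rather than the one originally requested.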

jancurn · 2019-02-14T16:08:23Z

It would be great if the new function also addressed the issue where SSL connections over proxy leak sockets in CLOSE_WAIT state, which eventually leads to EMFILE errors. See request/request#2440 for details

mnmkng · 2019-02-15T09:43:57Z

Considering that this would be the second monkey patch of the request package we need to do, maybe we could explore other options too and switch to a more actively maintained HTTP client.

jancurn · 2019-03-28T15:06:21Z

@petrpatek @mnmkng @mtrunkat I put together the specification for this new function. It's in the form of a pull request where the new functions are defined and commented. See #353

IMHO the best way to get this done is to first implement requestLikeBrowser() using the new requestBetter() function, to see whether the interface of requestBetter() is defined correctly. If yes, then we can implement requestBetter(), write tests, and it's done.

metalwarrior665 · 2019-09-04T09:48:57Z

Isn't this already done by @petrpatek? Can we close?

mnmkng · 2019-09-04T11:12:07Z

Yes, it's in latest already. Development is still ongoing though.

jancurn added the feature label Dec 12, 2018

mnmkng added the low priority label Dec 12, 2018

jancurn mentioned this issue Dec 16, 2018

CheerioCrawler is downloading uncompressed data! #261

Closed

jancurn added high priority and removed low priority labels Dec 21, 2018

jancurn mentioned this issue Mar 28, 2019

requestBetter() and requestLikeBrowser() functions #353

Merged

mnmkng closed this as completed Sep 4, 2019
