Create requestLikeBrowser function #255

Closed
jancurn opened this issue Dec 12, 2018 · 11 comments
Labels
feature Issues that represent new features or improvements to existing features.

Comments

@jancurn
Member

jancurn commented Dec 12, 2018

It will download HTML using the request package, but it will emulate the HTTP headers of a normal browser to reduce the chance of bot detection. Once it is done and tested, we should use this function in CheerioCrawler by default.

In the first version, let's just emulate Firefox with the latest user agent. In the future, we could support other browsers and user agents, so design the function so that it can be extended later, e.g. by giving it an options param.
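
A minimal sketch of how the header emulation could sit behind an options param so that more browsers can be added later. The names `buildBrowserHeaders` and `DEFAULT_USER_AGENTS` and the option names are hypothetical, and the user agent string is only an example value, not necessarily the latest Firefox one.

```javascript
// Hypothetical lookup table; more browsers could be added here later.
const DEFAULT_USER_AGENTS = {
    // Example value only, not necessarily the latest Firefox user agent.
    firefox: 'Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0',
};

// Builds browser-like request headers for the given URL.
// All option names (browser, userAgent, languageCode, countryCode) are assumptions.
function buildBrowserHeaders(url, options = {}) {
    const { browser = 'firefox', userAgent, languageCode, countryCode } = options;

    const defaultUserAgent = DEFAULT_USER_AGENTS[browser];
    if (!userAgent && !defaultUserAgent) {
        throw new Error(`Unsupported browser: ${browser}`);
    }

    // Fall back to a wildcard when no language is requested.
    let acceptLanguage = '*';
    if (languageCode && countryCode) {
        acceptLanguage = `${languageCode}-${countryCode},${languageCode};q=0.5`;
    } else if (languageCode) {
        acceptLanguage = languageCode;
    }

    return {
        Host: new URL(url).host,
        'User-Agent': userAgent || defaultUserAgent,
        Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': acceptLanguage,
        'Accept-Encoding': 'gzip, deflate, br',
        DNT: '1',
        Connection: 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    };
}
```

Keeping the browser-specific values in a lookup table means that supporting another browser would only need a new table entry, not a change to the function signature.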

Here's a code snippet that can be used as a starting point.

const zlib = require('zlib');
const Promise = require('bluebird'); // Promise.promisify() with a `context` option is Bluebird's API

const gzip = Promise.promisify(zlib.gzip, { context: zlib });
const gunzip = Promise.promisify(zlib.gunzip, { context: zlib });
// A body sent with "Content-Encoding: deflate" is decompressed with inflate, not deflate
const inflate = Promise.promisify(zlib.inflate, { context: zlib });

const reqOpts = {
    url,
    // Emulate Firefox HTTP headers
    // TODO: We should move this to apify-js or apify-shared-js
    headers: {
        Host: parsedUrlModified.host,
        'User-Agent': useMobileVersion ? FIREFOX_MOBILE_USER_AGENT : FIREFOX_DESKTOP_USER_AGENT,
        Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': languageCode ? `${languageCode}-${countryCode},${languageCode};q=0.5` : '*', // TODO: get this from country !
        'Accept-Encoding': 'gzip, deflate, br',
        DNT: '1',
        Connection: 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    },
    // Return response as raw Buffer
    encoding: null,
};

const result = await utils.requestPromised(reqOpts, false);
let body;

try {
    // eslint-disable-next-line prefer-destructuring
    body = result.body;

    // Decode response body
    const contentEncoding = result.response.headers['content-encoding'];
    switch (contentEncoding) {
        case 'br':
            body = await brotli.decompress(body);
            break;
        case 'gzip':
            body = await gunzip(body);
            break;
        case 'deflate':
            body = await inflate(body);
            break;
        case 'identity':
        case null:
        case undefined:
            break;
        default:
            throw new Error(`Received unexpected Content-Encoding: ${contentEncoding}`);
    }
    body = body.toString('utf8');

    const { statusCode } = result;
    if (statusCode !== 200) {
        throw new Error(`Received HTTP error response status ${statusCode}`);
    }

    const contentType = result.response.headers['content-type'];
    if (contentType !== 'text/html; charset=UTF-8') {
        throw new Error(`Received unexpected Content-Type: ${contentType}`);
    }

    if (!body) throw new Error('The response body is empty');
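
The decoding step of the snippet above could also be isolated into a small helper. The following is only a sketch under assumptions: `decodeBody` is a hypothetical name, and it relies on the Brotli support built into Node's zlib module (available in newer Node versions) instead of a separate brotli package.

```javascript
const zlib = require('zlib');

// Hypothetical helper: decompresses a raw response Buffer according to
// the value of the Content-Encoding response header.
function decodeBody(body, contentEncoding) {
    const encoding = (contentEncoding || 'identity').trim().toLowerCase();
    switch (encoding) {
        case 'br':
            // Assumes a Node version whose zlib ships Brotli support.
            return zlib.brotliDecompressSync(body);
        case 'gzip':
            return zlib.gunzipSync(body);
        case 'deflate':
            // "deflate" bodies are decompressed with inflate.
            return zlib.inflateSync(body);
        case 'identity':
            return body;
        default:
            throw new Error(`Received unexpected Content-Encoding: ${contentEncoding}`);
    }
}
```

The synchronous zlib calls keep the sketch short; for large bodies or high concurrency, the asynchronous or streaming variants would probably be a better fit.
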
@jancurn jancurn added the feature Issues that represent new features or improvements to existing features. label Dec 12, 2018
@jancurn
Member Author

jancurn commented Dec 19, 2018

The function should also have an option to abort downloading responses whose content-type doesn't match any of the selected ones. Then we can use this function in CheerioCrawler to avoid downloading non-HTML content, and greatly simplify CheerioCrawler.
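
To illustrate the idea, here is a rough sketch that checks the Content-Type header as soon as the response headers arrive and destroys the request before the body is downloaded. It uses Node's built-in http and https modules rather than the request package, and the names `isAllowedContentType` and `downloadIfContentType` are hypothetical.

```javascript
const http = require('http');
const https = require('https');

// Compares only the media type, ignoring parameters such as "; charset=UTF-8".
function isAllowedContentType(contentTypeHeader, allowedContentTypes) {
    if (!contentTypeHeader) return false;
    const type = contentTypeHeader.split(';')[0].trim().toLowerCase();
    return allowedContentTypes.includes(type);
}

// Resolves with the raw body Buffer, or rejects without downloading the
// body when the content type is not in the allowed list.
function downloadIfContentType(url, allowedContentTypes = ['text/html']) {
    const client = url.startsWith('https:') ? https : http;
    return new Promise((resolve, reject) => {
        const req = client.get(url, (res) => {
            const contentType = res.headers['content-type'];
            if (!isAllowedContentType(contentType, allowedContentTypes)) {
                req.destroy(); // Close the connection before reading the body.
                reject(new Error(`Aborted, unexpected Content-Type: ${contentType}`));
                return;
            }
            const chunks = [];
            res.on('data', (chunk) => chunks.push(chunk));
            res.on('end', () => resolve(Buffer.concat(chunks)));
            res.on('error', reject);
        });
        req.on('error', reject);
    });
}
```

Comparing only the media type avoids rejecting valid HTML responses that differ merely in the charset parameter.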

@mnmkng
Member

mnmkng commented Dec 19, 2018

Maybe we can create a HttpClient class to encapsulate this and also make it more usable by other SDK users.

@jancurn
Member Author

jancurn commented Dec 19, 2018

Sure, although I wouldn't over-engineer this. Developers are familiar with the request-style functions.

@jancurn
Member Author

jancurn commented Dec 24, 2018

During this task, we might also try to fix bug #266

@jancurn
Member Author

jancurn commented Jan 4, 2019

BTW the downloadListOfUrls function should also use this new approach, to reduce blocking e.g. when we download sitemaps. For example, fetching URLs from https://beachwaver.com/sitemap_products_1.xml doesn't work by default.

@jancurn
Member Author

jancurn commented Jan 4, 2019

Also, the function should be able to report the final URL (after all redirects).
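
One possible way to track the final URL is to follow redirects manually and resolve each Location header against the current URL. This is only a sketch using Node's built-in http and https modules; `resolveRedirectUrl` and `getFinalUrl` are hypothetical names, and the redirect limit of 10 is an arbitrary assumption.

```javascript
const http = require('http');
const https = require('https');

// A Location header may be relative, so resolve it against the current URL.
function resolveRedirectUrl(currentUrl, locationHeader) {
    return new URL(locationHeader, currentUrl).href;
}

// Follows redirects one hop at a time and resolves with the last URL reached.
function getFinalUrl(url, maxRedirects = 10) {
    const client = url.startsWith('https:') ? https : http;
    return new Promise((resolve, reject) => {
        const req = client.get(url, (res) => {
            res.resume(); // Discard the body, only the headers matter here.
            const { statusCode, headers } = res;
            const isRedirect = statusCode >= 300 && statusCode < 400 && headers.location;
            if (!isRedirect) {
                resolve({ finalUrl: url, statusCode });
                return;
            }
            if (maxRedirects <= 0) {
                reject(new Error('Too many redirects'));
                return;
            }
            const nextUrl = resolveRedirectUrl(url, headers.location);
            resolve(getFinalUrl(nextUrl, maxRedirects - 1));
        });
        req.on('error', reject);
    });
}
```

A caller would then read `finalUrl` from the resolved object, e.g. `const { finalUrl } = await getFinalUrl(startUrl);`.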

@jancurn
Member Author

jancurn commented Feb 14, 2019

It would be great if the new function also addressed the issue where SSL connections over proxy leak sockets in CLOSE_WAIT state, which eventually leads to EMFILE errors. See request/request#2440 for details

@mnmkng
Member

mnmkng commented Feb 15, 2019

Considering this would be the second monkey patch of the request package we'd need to do, maybe we could explore other options too and switch to a better-maintained HTTP client.

@jancurn
Member Author

jancurn commented Mar 28, 2019

@petrpatek @mnmkng @mtrunkat I put together the specification for this new function. It's in the form of a pull request where the new functions are defined and commented. See #353

IMHO the best way to get this done is to first implement requestLikeBrowser() using the new requestBetter() function, to see whether the interface of requestBetter() is defined correctly. If yes, then we can implement requestBetter(), write tests and it's done.

@metalwarrior665
Member

Isn't this already done by @petrpatek? Can we close?

@mnmkng
Member

mnmkng commented Sep 4, 2019

Yes, it's in latest already. Development is still ongoing though.

@mnmkng mnmkng closed this as completed Sep 4, 2019
3 participants