Read proxy list from an URL #62

Open · wants to merge 1 commit into master
Conversation

@datawookie commented Oct 1, 2021

Hi!

We build a lot of web scrapers using Scrapy and I've been using your package for a while now. It's great for managing our multi-proxy setup.

We have been developing a proxy system that shares the proxy list via a URL. I have been dumping the contents of that URL to a file so that I can read it in via ROTATING_PROXY_LIST_PATH, but this has become a bit of a pain. It occurred to me that it should be possible to read the proxy list directly from a URL.

This pull request makes a simple change to the RotatingProxyMiddleware.from_crawler() method to make that possible.
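The idea can be sketched as follows. This is a minimal standalone sketch, not the actual diff; the helper name `load_proxy_list` is assumed, and the real middleware would call something like it from `from_crawler()`:

```python
import urllib.request
from urllib.parse import urlparse


def load_proxy_list(path: str) -> list[str]:
    # Hypothetical helper: if the path looks like an HTTP(S) URL, fetch it;
    # otherwise fall back to reading a local file, as the middleware does today.
    if urlparse(path).scheme in ("http", "https"):
        with urllib.request.urlopen(path) as response:
            text = response.read().decode("utf-8")
    else:
        with open(path, encoding="utf-8") as f:
            text = f.read()
    # One proxy per line, skipping blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Either way, ROTATING_PROXY_LIST_PATH keeps a single meaning: "the place the proxy list lives", whether that is a file or a URL.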

Example: Sharing proxy list at http://127.0.0.1:8800.


In settings.py I then have:

```python
ROTATING_PROXY_LIST_PATH = 'http://127.0.0.1:8800'
```

For context, here's a blog post about the proxy system that we are using in conjunction with scrapy-rotating-proxies.

Best regards,
Andrew.

@kaybeudeker

The link to your blog post should be: https://datawookie.dev/blog/2021/10/medusa-multi-headed-tor-proxy/ (instead of pointing to localhost) ;) Great work btw!

@datawookie (Author) commented Nov 28, 2021

Thanks, @kaybeudeker, I've updated the URL. Appreciate you bringing that to my attention.

Have you tried this out? I'd really appreciate any feedback.

@SashiDareddy commented Feb 20, 2022

I had a similar use case, reading proxies from a URL (specifically, an API call to a third party which returns a list of proxies, exactly like you have). I created a small utility function which uses requests.get to fetch the proxies and assigns the result to ROTATING_PROXY_LIST in settings.py.

The utility function:

```python
import random
from typing import List

import requests  # third-party: pip install requests


def get_proxies(proxy_json_end_point: str) -> List[str]:
    r = requests.get(proxy_json_end_point)
    proxies = r.json()

    # Each entry has the form "host:port;user;pwd"; build authenticated proxy URLs.
    proxy_urls = [
        f"http://{user}:{pwd}@{host_port}"
        for (host_port, user, pwd) in [p.split(";") for p in proxies]
    ]
    random.shuffle(proxy_urls)
    print("Proxies:", proxy_urls)
    return proxy_urls
```

In settings.py:

```python
ROTATING_PROXY_LIST = get_proxies(os.getenv("PROXY_JSON_ENDPOINT"))
```

Note: the PROXY_JSON_ENDPOINT environment variable points to the third party's API endpoint which returns the proxies. I used a similar approach to fetch proxies listed in a text file hosted on S3.
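For the plain-text case (for example a public or presigned S3 object URL), a similar helper needs only the standard library. The function name and line-per-proxy format here are illustrative assumptions, not part of the package:

```python
import urllib.request


def get_proxies_from_text_url(url: str) -> list[str]:
    # Fetch a plain-text proxy list (one proxy per line), e.g. from a
    # public or presigned S3 object URL. Illustrative sketch only.
    with urllib.request.urlopen(url, timeout=10) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```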

@datawookie (Author)

Hi @TeamHG-Memex, any progress on this? This PR has been languishing for a few months now. Thanks, Andrew.

3 participants