Scraping JS-rendered page content with Selenium + PhantomJS + XPath #40

Open
lovecn opened this issue Nov 20, 2016 · 1 comment

lovecn commented Nov 20, 2016

Source: http://www.zhidaow.com/post/selenium-phantomjs-xpath
sudo pip install selenium
sudo apt-get install phantomjs

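Selenium download: https://pypi.python.org/pypi/selenium#downloads
PhantomJS download: http://phantomjs.org/download.html
PhantomJS can be thought of as a browser without a visible window: it has a rendering engine (QtWebKit) and a JS engine (JavaScriptCore). It provides DOM rendering, JavaScript execution, network access, page screenshots, and more.

PhantomJS is used here instead of ChromeDriver or Firefox mainly because it runs silently (in the background, without opening a browser window).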
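from selenium import webdriver

# webdriver.PhantomJS and find_elements_by_xpath exist in Selenium 2/3; both were removed in Selenium 4.
# Initialize the browser. On Windows, pass the path to phantomjs.exe; on Linux the argument can be omitted.
browser = webdriver.PhantomJS(r'D:\phantomjs.exe')
url = 'http://www.zhidaow.com'  # URL to visit
browser.get(url)  # open the page
titles = browser.find_elements_by_xpath('//h2')  # select elements with XPath

for t in titles:  # iterate over the matches
    print(t.text)  # text content of the element
    print(t.get_attribute('class'))  # value of the class attribute

# Close the browser. If an exception occurs before this line, remember to end PhantomJS in the task manager;
# otherwise several PhantomJS processes keep running and slow the machine down.
browser.quit()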
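
The note about leftover PhantomJS processes can also be handled in code instead of in the task manager. Below is a minimal sketch, assuming only that the driver object exposes `quit()`. The names `managed_browser`, `FakeDriver`, and `demo` are hypothetical and not part of Selenium; `FakeDriver` merely stands in for `webdriver.PhantomJS` so the snippet runs without a browser installed.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a driver via factory() and always call quit() on it."""
    browser = factory()
    try:
        yield browser
    finally:
        # Runs even if the body raises, so no browser process is left behind.
        browser.quit()


class FakeDriver:
    """Hypothetical stand-in for a real WebDriver, for illustration only."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        # Simulate a failure while loading the page.
        raise RuntimeError("simulated page-load failure: " + url)

    def quit(self):
        self.closed = True


def demo():
    """Fail inside the with-block and report whether quit() still ran."""
    holder = {}

    def factory():
        holder["driver"] = FakeDriver()
        return holder["driver"]

    try:
        with managed_browser(factory) as browser:
            browser.get("http://www.zhidaow.com")
    except RuntimeError:
        pass
    return holder["driver"].closed
```

With a real driver, the same helper would presumably be used as `with managed_browser(lambda: webdriver.PhantomJS(r'D:\phantomjs.exe')) as browser:`, with the scraping code placed inside the block.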

from selenium import webdriver

browser = webdriver.PhantomJS(r'D:\phantomjs.exe')
url = 'http://www.aizhan.com/siteall/tuniu.com/'
browser.get(url)
# Select the first row of the table under the element with id "history1" using XPath
rows = browser.find_elements_by_xpath('//*[@id="history1"]/table/tbody/tr[1]')

for row in rows:
    print(row.text)

browser.quit()
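
The XPath expression above picks the first `tr` of the table inside the element whose id is `history1`. The same selection logic can be tried offline on static markup. Below is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed input. The `SAMPLE` markup, its values, and the helper names `first_history_row` and `headings` are made up for illustration and do not reflect the real aizhan.com page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup that mimics the structure the XPath expects.
SAMPLE = """
<html><body>
  <h2 class="title">First heading</h2>
  <div id="history1">
    <table><tbody>
      <tr><td>2016-11-20</td><td>100</td></tr>
      <tr><td>2016-11-19</td><td>90</td></tr>
    </tbody></table>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())


def first_history_row(tree):
    """Return the cell texts of the first <tr> under id="history1"."""
    # ElementTree paths are relative, so the leading // becomes .//
    rows = tree.findall(".//*[@id='history1']/table/tbody/tr[1]")
    return [[td.text for td in row.findall("td")] for row in rows]


def headings(tree):
    """Return (text, class) pairs for every <h2> in the document."""
    return [(h.text, h.get("class")) for h in tree.findall(".//h2")]


print(first_history_row(root))
print(headings(root))
```

Real pages are rarely well-formed XML, which is one reason the scripts above let the browser build the DOM and run the XPath query there.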

lovecn commented Nov 20, 2016

PHP spider (crawler) development documentation: https://doc.phpspider.org/
