Web-scraping JavaScript page with Python(使用 Python 抓取网页的 JavaScript 页面)
问题描述
I'm trying to develop a simple web scraper. I want to extract text without the HTML code. In fact, I achieve this goal, but I have seen that in some pages where JavaScript is loaded I didn't obtain good results.
For example, if some JavaScript code adds some text, I can't see it, because when I call
response = urllib2.urlopen(request)
I get the original text without the added one (because JavaScript is executed in the client).
So, I'm looking for some ideas to solve this problem.
EDIT Sept 2021: phantomjs
isn't maintained any more, either
EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end.
dryscape isn't maintained anymore and the library dryscape developers recommend is Python 2 only. I have found using Selenium's python library with Phantom JS as a web driver fast enough and easy to get the work done.
Once you have installed Phantom JS, make sure the phantomjs
binary is available in the current path:
phantomjs --version
# result:
2.1.1
#Example To give an example, I created a sample page with following HTML code. (link):
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Javascript scraping test</title>
</head>
<body>
<p id='intro-text'>No javascript support</p>
<script>
document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
</script>
</body>
</html>
without javascript it says: No javascript support
and with javascript: Yay! Supports javascript
#Scraping without JS support:
import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text)
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>
#Scraping with JS support:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'
You can also use Python library dryscrape to scrape javascript driven websites.
#Scraping with JS support:
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>
这篇关于使用 Python 抓取网页的 JavaScript 页面的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:使用 Python 抓取网页的 JavaScript 页面
基础教程推荐
- 使用 Google App Engine (Python) 将文件上传到 Google Cloud Storage 2022-01-01
- 哪些 Python 包提供独立的事件系统? 2022-01-01
- 将 YAML 文件转换为 python dict 2022-01-01
- 症状类型错误:无法确定关系的真值 2022-01-01
- 如何在Python中绘制多元函数? 2022-01-01
- Python 的 List 是如何实现的? 2022-01-01
- 如何在 Python 中检测文件是否为二进制(非文本)文 2022-01-01
- 合并具有多索引的两个数据帧 2022-01-01
- 使用Python匹配Stata加权xtil命令的确定方法? 2022-01-01
- 使 Python 脚本在 Windows 上运行而不指定“.py";延期 2022-01-01