Beautiful Soup findAll doesn#39;t find them all(Beautiful Soup findAll 没有找到它们)
问题描述
I'm trying to parse a website and get some info with the find_all()
method, but it doesn't find them all.
This is the code:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen
page = urlopen ("http://mangafox.me/directory/")
# print (page.read ())
soup = BeautifulSoup (page.read ())
manga_img = soup.findAll ('a', {'class' : 'manga_img'}, limit=None)
for manga in manga_img:
print (manga['href'])
It only prints half of them...
Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml
parser is not dealing very well with it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18
The standard library html.parser
has less trouble with this specific page:
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
Translating that to your specific code sample using urllib
, you would specify the parser thus:
soup = BeautifulSoup(page, 'html.parser') # BeatifulSoup can do the reading
这篇关于Beautiful Soup findAll 没有找到它们的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:Beautiful Soup findAll 没有找到它们
基础教程推荐
- 我什么时候应该在导入时使用方括号 2022-01-01
- 在for循环中使用setTimeout 2022-01-01
- 在 JS 中获取客户端时区(不是 GMT 偏移量) 2022-01-01
- Karma-Jasmine:如何正确监视 Modal? 2022-01-01
- 当用户滚动离开时如何暂停 youtube 嵌入 2022-01-01
- 有没有办法使用OpenLayers更改OpenStreetMap中某些要素 2022-09-06
- 悬停时滑动输入并停留几秒钟 2022-01-01
- 动态更新多个选择框 2022-01-01
- 角度Apollo设置WatchQuery结果为可用变量 2022-01-01
- 响应更改 div 大小保持纵横比 2022-01-01