A Python Crawler Example for Downloading the Images from Every Floor of a Baidu Tieba Thread

Updated: 2020-05-30 20:54:01 Author: startmvc
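This article presents an example Python crawler that downloads the images from every floor (reply) of a Baidu Tieba thread. It is shared here for reference; the details follow: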

The script downloads the images posted in a Baidu Tieba thread, which are nice to browse.

Python 2.7 version:


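# coding=utf-8
# Python 2.7 only: urllib.urlretrieve moved to urllib.request in Python 3.
import os
import re
import time
import urllib

import requests
from bs4 import BeautifulSoup

SAVE_DIR = 'C:/pic4'  # target folder for the downloaded images
time1 = time.time()


def getHtml(url):
    # Fetch a page and return its HTML as text.
    page = requests.get(url)
    html = page.text
    return html


def getImg(html):
    # Download every in-post image (class "BDE_Image") found on one page.
    soup = BeautifulSoup(html, 'html.parser')
    img_info = soup.find_all('img', class_='BDE_Image')
    global index  # running image counter shared across pages
    for index, img in enumerate(img_info, index + 1):
        print("Downloading image {}".format(index))
        urllib.urlretrieve(img.get("src"), '%s/%s.jpg' % (SAVE_DIR, index))


def getMaxPage(url):
    # Read the total page count from the max-page="N" attribute.
    html = getHtml(url)
    reg = re.compile(r'max-page="(\d+)"')
    page = re.findall(reg, html)
    page = int(page[0])
    return page


if __name__ == '__main__':
    url = "https://tieba.baidu.com/p/5113603072"
    page = getMaxPage(url)
    index = 0
    if not os.path.isdir(SAVE_DIR):
        os.makedirs(SAVE_DIR)
    # range() excludes its end value, so use page + 1 to include the last page.
    for i in range(1, page + 1):
        url = "%s%s" % ("https://tieba.baidu.com/p/5113603072?pn=", str(i))
        html = getHtml(url)
        getImg(html)
    print("OK! All downloads finished!")
    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')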
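
The script above targets Python 2.7, and `urllib.urlretrieve` does not exist under that name in Python 3. As a rough sketch of how the same idea might look in Python 3, the block below uses only the standard library: `html.parser.HTMLParser` stands in for BeautifulSoup, and `urllib.request` stands in for `requests` and `urllib.urlretrieve`. The names `PostImageParser`, `extract_image_urls`, `page_urls`, and `download_all` are hypothetical, the sample HTML is made up for illustration, and the real Tieba markup may differ from what is assumed here.

```python
# Python 3 sketch: collect in-post image URLs using only the standard library.
from html.parser import HTMLParser
from urllib.request import urlopen, urlretrieve  # used only by download_all


class PostImageParser(HTMLParser):
    """Collects the src of every <img> whose class list contains BDE_Image."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "BDE_Image" in classes and attr_map.get("src"):
            self.image_urls.append(attr_map["src"])


def extract_image_urls(html):
    """Return the image URLs found in one page of HTML."""
    parser = PostImageParser()
    parser.feed(html)
    return parser.image_urls


def page_urls(thread_url, max_page):
    """Build the URL of every page, from 1 to max_page inclusive."""
    return ["{}?pn={}".format(thread_url, n) for n in range(1, max_page + 1)]


def download_all(thread_url, max_page, save_dir):
    """Network helper (not called here): fetch each page and save its images.

    Assumes save_dir already exists.
    """
    count = 0
    for url in page_urls(thread_url, max_page):
        with urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        for src in extract_image_urls(html):
            count += 1
            urlretrieve(src, "{}/{}.jpg".format(save_dir, count))
    return count


# Offline check against a made-up snippet of HTML.
sample = (
    '<div><img class="BDE_Image" src="https://example.com/a.jpg">'
    '<img class="other" src="https://example.com/b.jpg"></div>'
)
print(extract_image_urls(sample))
print(page_urls("https://tieba.baidu.com/p/5113603072", 2))
```

Keeping the parsing and URL-building steps separate from the network calls makes them easy to check without touching the site; `download_all` would only be called once those pieces behave as expected.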

PS: Here are two handy regular expression tools for reference:

JavaScript regular expression online testing tool: http://tools.jb51.net/regex/javascript

Regular expression online generator: http://tools.jb51.net/regex/create_reg
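
The `max-page="(\d+)"` pattern used by the script can also be tried locally with Python's `re` module. The sketch below is one possible way to do that, and it returns a default when the attribute is missing instead of failing on an empty match list. The helper name `parse_max_page`, its `default` parameter, and the sample strings are assumptions made for illustration, not part of the original script.

```python
import re

# Same pattern as in the script: captures N from max-page="N".
MAX_PAGE_RE = re.compile(r'max-page="(\d+)"')


def parse_max_page(html, default=1):
    """Return the page count from max-page="N", or `default` if it is absent."""
    match = MAX_PAGE_RE.search(html)
    return int(match.group(1)) if match else default


# Made-up inputs: one with the attribute, one without.
print(parse_max_page('<input max-page="12" type="text">'))  # 12
print(parse_max_page('<p>no pager here</p>'))  # 1
```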
