For the Selenium environment setup, see the Bilibili (B站) crawler tutorial:
(https://www.bilibili.com/video/BV1I54y1h75w?p=85)
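The code in this post uses the selenium 3 style element lookups (find_element_by_id and friends), which were removed in selenium 4. A minimal environment sketch, assuming Chrome plus a matching chromedriver on the PATH (the version pin is my assumption, not from the original):

# Assumed setup: pip install "selenium<4"
# (selenium 4 dropped the find_element_by_* helpers used below)
from selenium.webdriver import Chrome

web = Chrome()   # needs a chromedriver matching your Chrome version on PATH
print(web.name)  # smoke test: prints 'chrome' when the driver starts correctly
web.close()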
I. Opening Baidu and searching
Open Baidu:
from selenium.webdriver import Chrome

web = Chrome()
web.get('https://www.baidu.com/')
Locate the search input box in the developer tools.
Type the value to query into it and press the Enter key.
input_btn = web.find_element_by_id('kw')
input_btn.send_keys('Jackie Chan', Keys.ENTER)
Full code for the test:
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys

web = Chrome()
web.get('https://www.baidu.com/')
web.maximize_window()
input_btn = web.find_element_by_id('kw')
input_btn.send_keys('Jackie Chan', Keys.ENTER)
II. Scraping famous quotes
The task: scrape the quotes and their corresponding authors from the first 5 pages of the quotes website and save them to a CSV file.
1. Scraping one page
First scrape page 1 as a test.
In the developer tools you can see that every quote (quote text plus author) sits in a div with class="quote", and no other tag carries class="quote".
Inside that div, the quote text is in a span with class="text" and the author is in a small tag with class="author":
So the code to scrape the first page is as follows:
div_list = web.find_elements_by_class_name('quote')
print(len(div_list))
for div in div_list:
    saying = div.find_element_by_class_name('text').text
    author = div.find_element_by_class_name('author').text
    info = [saying, author]
    print(info)
The result is as follows:
2. Scraping 5 pages
After scraping one page, we need to click the next-page button to turn the page.
The Next button has only an href attribute, so it cannot be located by id. It cannot be located by a fixed XPath either: the first page has only a Next button, while later pages have both Previous and Next buttons. Its child span (the arrow), however, is the only element with the aria-hidden attribute on the first page; later pages contain several elements with that attribute, but the Next arrow is always the last one.
So we can locate the last span carrying the aria-hidden attribute and click it to jump to the next page:
web.find_elements_by_css_selector('[aria-hidden]')[-1].click()
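As a side note (not part of the original tutorial): if the target is the quotes demo site, the Next link sits inside an li with class="next", so it can also be located directly. A hedged alternative, assuming that markup:

# assumes the quotes site wraps the Next link in <li class="next"><a ...>
web.find_element_by_css_selector('li.next > a').click()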
Test:
import time

n = 5
for i in range(0, n):
    div_list = web.find_elements_by_class_name('quote')
    print(len(div_list))
    for div in div_list:
        saying = div.find_element_by_class_name('text').text
        author = div.find_element_by_class_name('author').text
        info = [saying, author]
        print(info)
    if i == n - 1:
        break
    web.find_elements_by_css_selector('[aria-hidden]')[-1].click()
    time.sleep(2)  # wait for the next page to load
Paging works:
3. Saving the data
sayingAndAuthor = []
n = 5
for i in range(0, n):
    div_list = web.find_elements_by_class_name('quote')
    for div in div_list:
        saying = div.find_element_by_class_name('text').text
        author = div.find_element_by_class_name('author').text
        info = [saying, author]
        sayingAndAuthor.append(info)
    print('Scraped page ' + str(i + 1) + ' successfully')
    if i == n - 1:
        break
    web.find_elements_by_css_selector('[aria-hidden]')[-1].click()
    time.sleep(2)  # wait for the next page to load

with open('名人名言.csv', 'w', encoding='utf-8') as fp:
    fileWrite = csv.writer(fp)
    fileWrite.writerow(['名言', '名人'])  # header row: quote, author
    fileWrite.writerows(sayingAndAuthor)
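One caveat not mentioned in the original: on Windows, csv.writer inserts blank lines between rows unless the file is opened with newline=''. A safer variant of the open call:

# newline='' keeps csv.writer from writing blank rows on Windows
with open('名人名言.csv', 'w', encoding='utf-8', newline='') as fp:
    fileWrite = csv.writer(fp)
    fileWrite.writerow(['名言', '名人'])
    fileWrite.writerows(sayingAndAuthor)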
4. Full code
from selenium.webdriver import Chrome
import time
import csv

web = Chrome()
web.get('http://quotes.toscrape.com/')  # URL lost in the original; the quotes demo site is assumed

sayingAndAuthor = []
n = 5
for i in range(0, n):
    div_list = web.find_elements_by_class_name('quote')
    for div in div_list:
        saying = div.find_element_by_class_name('text').text
        author = div.find_element_by_class_name('author').text
        info = [saying, author]
        sayingAndAuthor.append(info)
    print('Scraped page ' + str(i + 1) + ' successfully')
    if i == n - 1:
        break
    web.find_elements_by_css_selector('[aria-hidden]')[-1].click()
    time.sleep(2)  # wait for the next page to load

with open('名人名言.csv', 'w', encoding='utf-8') as fp:
    fileWrite = csv.writer(fp)
    fileWrite.writerow(['名言', '名人'])  # header row: quote, author
    fileWrite.writerows(sayingAndAuthor)

web.close()
The scraping result:
III. Scraping JD.com book information
Scrape the book listings for a keyword across the first three pages; this article uses science fiction ('科幻小说') as the example.
1. Scraping the first page
Open JD.com and search for science fiction:
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys

web = Chrome()
web.get('https://www.jd.com/')  # URL lost in the original; the JD home page is assumed
web.maximize_window()
web.find_element_by_id('key').send_keys('科幻小说', Keys.ENTER)  # type into the search box and press Enter
In the developer tools you can see that every product sits in an li whose class contains "gl-item" (hovering the mouse over an li changes its class to "gl-item hover"):
On first load there are 30 li elements, but after scrolling to the bottom of the page the count becomes 60, so we need to scroll down first:
Fetching all the li elements therefore looks like this:
web.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(2)  # wait for the lazy-loaded items
page_text = web.page_source
tree = etree.HTML(page_text)
li_list = tree.xpath('//li[contains(@class,"gl-item")]')
print(len(li_list))
Then extract each book's information inside the loop:

for li in li_list:
    pass  # field extraction goes here
Get the title and price:

book_name = ''.join(li.xpath('.//div[@class="p-name"]/a/em/text()'))  # title
price = '¥' + li.xpath('.//div[@class="p-price"]/strong/i/text()')[0]  # price
Some books have no author; in that case record '无' (none):

author_span = li.xpath('.//span[@class="p-bi-name"]/a/text()')
if len(author_span) > 0:
    author = author_span[0]
else:
    author = '无'
The publisher is fetched the same way as the author:

store_span = li.xpath('.//span[@class="p-bi-store"]/a[1]/text()')
if len(store_span) > 0:
    store = store_span[0]
else:
    store = '无'
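Since the author and publisher follow the same first-match-or-default pattern, a small helper (hypothetical, not in the original code) would avoid the repetition:

def first_or(li, xpath_expr, default='无'):
    # return the first XPath match, or the default when nothing matches
    result = li.xpath(xpath_expr)
    return result[0] if result else default

author = first_or(li, './/span[@class="p-bi-name"]/a/text()')
store = first_or(li, './/span[@class="p-bi-store"]/a[1]/text()')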
The cover image URL sometimes sits in the src attribute and sometimes in the data-lazy-img attribute:
So it is fetched as follows (the attribute values are protocol-relative '//...' URLs, hence the 'https:' prefix):
img_url_a = li.xpath('.//div[@class="p-img"]/a/img')[0]
if len(img_url_a.xpath('./@src')) > 0:
    img_url = 'https:' + img_url_a.xpath('./@src')[0]  # cover image URL
else:
    img_url = 'https:' + img_url_a.xpath('./@data-lazy-img')[0]
Putting it together, scraping one page looks like this:
# scrape one page
def get_onePage_info(web):
    web.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # wait for the lazy-loaded items
    page_text = web.page_source
    # parse the page source
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//li[contains(@class,"gl-item")]')
    book_infos = []
    for li in li_list:
        book_name = ''.join(li.xpath('.//div[@class="p-name"]/a/em/text()'))  # title
        price = '¥' + li.xpath('.//div[@class="p-price"]/strong/i/text()')[0]  # price
        author_span = li.xpath('.//span[@class="p-bi-name"]/a/text()')
        if len(author_span) > 0:  # author
            author = author_span[0]
        else:
            author = '无'
        store_span = li.xpath('.//span[@class="p-bi-store"]/a[1]/text()')  # publisher
        if len(store_span) > 0:
            store = store_span[0]
        else:
            store = '无'
        img_url_a = li.xpath('.//div[@class="p-img"]/a/img')[0]
        if len(img_url_a.xpath('./@src')) > 0:
            img_url = 'https:' + img_url_a.xpath('./@src')[0]  # cover image URL
        else:
            img_url = 'https:' + img_url_a.xpath('./@data-lazy-img')[0]
        one_book_info = [book_name, price, author, store, img_url]
        book_infos.append(one_book_info)
    return book_infos
2. Scraping 3 pages
Clicking the next-page button:
web.find_element_by_class_name('pn-next').click()  # click the next-page button
Scrape three pages:

for i in range(0, 3):
    all_book_info += get_onePage_info(web)
    web.find_element_by_class_name('pn-next').click()  # click the next-page button
    time.sleep(2)
3. Saving the data

with open('京东-科幻小说.csv', 'w', encoding='utf-8') as fp:
    writer = csv.writer(fp)
    writer.writerow(['书名', '价格', '作者', '出版社', '预览图片地址'])  # header: title, price, author, publisher, image URL
    writer.writerows(all_book_info)
4. Full code
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
import time
from lxml import etree
import csv


# scrape one page
def get_onePage_info(web):
    web.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # wait for the lazy-loaded items
    page_text = web.page_source
    # with open('3-.html', 'w', encoding='utf-8') as fp:
    #     fp.write(page_text)
    # parse the page source
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//li[contains(@class,"gl-item")]')
    book_infos = []
    for li in li_list:
        book_name = ''.join(li.xpath('.//div[@class="p-name"]/a/em/text()'))  # title
        price = '¥' + li.xpath('.//div[@class="p-price"]/strong/i/text()')[0]  # price
        author_span = li.xpath('.//span[@class="p-bi-name"]/a/text()')
        if len(author_span) > 0:  # author
            author = author_span[0]
        else:
            author = '无'
        store_span = li.xpath('.//span[@class="p-bi-store"]/a[1]/text()')  # publisher
        if len(store_span) > 0:
            store = store_span[0]
        else:
            store = '无'
        img_url_a = li.xpath('.//div[@class="p-img"]/a/img')[0]
        if len(img_url_a.xpath('./@src')) > 0:
            img_url = 'https:' + img_url_a.xpath('./@src')[0]  # cover image URL
        else:
            img_url = 'https:' + img_url_a.xpath('./@data-lazy-img')[0]
        one_book_info = [book_name, price, author, store, img_url]
        book_infos.append(one_book_info)
    return book_infos


def main():
    web = Chrome()
    web.get('https://www.jd.com/')  # URL lost in the original; the JD home page is assumed
    web.maximize_window()
    web.find_element_by_id('key').send_keys('科幻小说', Keys.ENTER)  # search for science fiction
    time.sleep(2)
    all_book_info = []
    for i in range(0, 3):
        all_book_info += get_onePage_info(web)
        print('Scraped page ' + str(i + 1) + ' successfully')
        web.find_element_by_class_name('pn-next').click()  # click the next-page button
        time.sleep(2)
    with open('京东-科幻小说.csv', 'w', encoding='utf-8') as fp:
        writer = csv.writer(fp)
        writer.writerow(['书名', '价格', '作者', '出版社', '预览图片地址'])  # header: title, price, author, publisher, image URL
        writer.writerows(all_book_info)


if __name__ == '__main__':
    main()
Result:
IV. Summary
Selenium is very convenient for scraping dynamically rendered pages, but it is comparatively slow.
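One common way to trim the fixed time.sleep delays used above is Selenium's explicit waits, which block only until a condition holds. A minimal sketch (a general Selenium technique, not covered in this tutorial; the quotes URL is an assumed example site):

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

web = Chrome()
web.get('http://quotes.toscrape.com/')  # assumed example site
# wait up to 10 seconds for the quote divs to appear, then continue immediately
quotes = WebDriverWait(web, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'quote'))
)
print(len(quotes))
web.close()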