https://webscraping.pro/extract-browsers-local-storage-with-python/
https://github.com/Python3WebSpider/PyppeteerTest
https://chengjun.github.io/mybook/04-crawler-pyppeteer.html
https://blog.csdn.net/weixin_45961774/article/details/112848584
https://bbs.huaweicloud.com/blogs/232462
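- Launch the browser and open a page:
  browser = await launch()
  page = await browser.newPage()
  await page.goto('https://www.sina.com.cn')
  Optional argument to `launch`: `headless=False` shows the browser window instead of running headless.
- Simulate a click: find the element you want to click on the page, right-click it and choose Inspect, then on the corresponding HTML node choose Copy -> Copy selector. For example, if the copied selector is `#close_donation_button > a`:
  await page.waitForSelector('#close_donation_button > a')
  await page.click('#close_donation_button > a')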
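The launch, wait, and click steps above can be combined into one coroutine. The following is a minimal sketch rather than a definitive implementation: the helper name `click_when_ready`, the `timeout_ms` parameter, and its 5-second default are assumptions made for illustration, and `main` takes the URL and selector as arguments because the two examples above do not necessarily belong to the same page.

```python
import asyncio


async def click_when_ready(page, selector, timeout_ms=5000):
    """Wait until `selector` exists in the DOM, then click it.

    `page` is expected to be a pyppeteer Page. The helper name and the
    timeout default are illustrative assumptions.
    """
    # waitForSelector fails with a timeout error if the element does not
    # appear within `timeout_ms` milliseconds.
    await page.waitForSelector(selector, {'timeout': timeout_ms})
    await page.click(selector)


async def main(url, selector):
    # Imported here so the helper above can be reused without a browser.
    from pyppeteer import launch

    # headless=False opens a visible browser window.
    browser = await launch(headless=False)
    try:
        page = await browser.newPage()
        await page.goto(url)
        await click_when_ready(page, selector)
    finally:
        # Close the browser even if the selector never appears.
        await browser.close()
```

To run it, pass a page that actually contains the selector, for example `asyncio.get_event_loop().run_until_complete(main(url, '#close_donation_button > a'))`.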
- Execute JavaScript code:
  - Use the `evaluate` method of the Page object.
  - For example, read the contents of localStorage:
    local_storage = await page.evaluate('''() => Object.assign({}, window.localStorage)''')
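As a sketch of how `evaluate` moves data between Python and the page: extra positional arguments given to `page.evaluate` are passed to the JavaScript function, and a JSON-serializable return value comes back as a Python object. The helper names `read_local_storage`, `get_local_storage_item`, and `local_storage_to_json` are hypothetical and only serve the illustration.

```python
import asyncio
import json


async def read_local_storage(page):
    """Return a snapshot of window.localStorage as a plain dict."""
    # Object.assign copies the Storage entries into a plain object,
    # which can be serialized back to Python.
    return await page.evaluate(
        '''() => Object.assign({}, window.localStorage)'''
    )


async def get_local_storage_item(page, key):
    """Return one localStorage value, or None if the key is absent."""
    # The extra argument to evaluate() is handed to the JS function.
    return await page.evaluate(
        '''(key) => window.localStorage.getItem(key)''', key
    )


def local_storage_to_json(storage):
    """Serialize the snapshot, e.g. for saving to disk or logging."""
    return json.dumps(storage, ensure_ascii=False, indent=2, sort_keys=True)
```

Inside a coroutine that already has a `page`, this could be used as `storage = await read_local_storage(page)` followed by `print(local_storage_to_json(storage))`.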
- Intercept AJAX data:
import asyncio
import json

from pyppeteer import launch


async def intercept_network_response(response):
    # In this example, we care only about responses returning JSON
    if "application/json" in response.headers.get("content-type", ""):
        # Print some info about the response
        print("URL:", response.url)
        print("Method:", response.request.method)
        print("Response headers:", response.headers)
        print("Request headers:", response.request.headers)
        print("Response status:", response.status)
        # Print the content of the response
        try:
            # await response.json() returns the response as a Python object
            print("Content: ", await response.json())
        except json.decoder.JSONDecodeError:
            # NOTE: Use await response.text() if you want the raw response text
            print("Failed to decode JSON from", await response.text())


async def main():
    browser = await launch()
    page = await browser.newPage()
    # The handler must be called with the response; passing the bare
    # function to ensure_future would raise a TypeError.
    page.on('response', lambda response: asyncio.ensure_future(intercept_network_response(response)))
    await page.goto('https://instagram.com')
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
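One thing the listing above does not handle: the handler tasks are scheduled but never awaited, so `browser.close()` can run before a response body has been read. A possible variation, sketched under assumptions, keeps the scheduled tasks and awaits them before closing. The class name `JsonResponseCollector` and its methods are hypothetical and not part of pyppeteer; only `page.on('response', ...)`, `response.headers`, `response.url`, and `response.json()` are taken from the code above.

```python
import asyncio
import json


class JsonResponseCollector:
    """Collect decoded JSON bodies from responses (hypothetical helper)."""

    def __init__(self):
        self.results = []  # list of (url, decoded_body) tuples
        self._tasks = []   # handler tasks that still need to be awaited

    def on_response(self, response):
        # page.on() expects a plain callable, so schedule the coroutine
        # and remember the task instead of awaiting it here.
        self._tasks.append(asyncio.ensure_future(self._handle(response)))

    async def _handle(self, response):
        if "application/json" not in response.headers.get("content-type", ""):
            return
        try:
            self.results.append((response.url, await response.json()))
        except json.decoder.JSONDecodeError:
            # The body was labelled JSON but could not be parsed; skip it.
            pass

    async def wait(self):
        # Let every scheduled handler finish before the browser closes.
        await asyncio.gather(*self._tasks)
```

Assumed usage inside `main()`: create `collector = JsonResponseCollector()`, register it with `page.on('response', collector.on_response)`, then call `await collector.wait()` after `page.goto(...)` and before `browser.close()`. The collected data is then available in `collector.results`.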