https://webscraping.pro/extract-browsers-local-storage-with-python/
https://github.com/Python3WebSpider/PyppeteerTest
https://chengjun.github.io/mybook/04-crawler-pyppeteer.html
https://blog.csdn.net/weixin_45961774/article/details/112848584
https://bbs.huaweicloud.com/blogs/232462

  • Launch the browser and open a page

    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.sina.com.cn')

    Optional argument to launch: headless=False shows the browser window.
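
    The three lines above only work inside a coroutine. A minimal sketch of a complete wrapper follows; the names launch_options and fetch_title are made up for illustration and are not part of pyppeteer.

```python
def launch_options(show_browser: bool = False) -> dict:
    """Build keyword arguments for pyppeteer's launch().

    headless=False opens a visible browser window; headless=True hides it.
    """
    return {"headless": not show_browser}


async def fetch_title(url: str, show_browser: bool = False) -> str:
    """Open url in a new tab and return the page title (hypothetical helper)."""
    # Imported here so launch_options() stays usable without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(**launch_options(show_browser))
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await page.title()
    finally:
        # Always close the browser so no Chromium process is left behind.
        await browser.close()
```

    It can be run with asyncio.get_event_loop().run_until_complete(fetch_title('https://www.sina.com.cn', show_browser=True)).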

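  • Simulate a click: in the page, right-click the element you want to click and choose Inspect, then right-click the matching HTML node and choose Copy -> Copy selector. For example, if the copied selector is #close_donation_button > a: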

    await page.waitForSelector('#close_donation_button > a')
    await page.click('#close_donation_button > a')
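
    The wait-then-click pair can be wrapped in one helper, sketched below. The name wait_and_click and the timeout_ms parameter are assumptions for illustration; the timeout option itself is passed through to waitForSelector in milliseconds.

```python
async def wait_and_click(page, selector: str, timeout_ms: int = 30000) -> None:
    """Wait until selector appears in the page, then click it.

    page is expected to be a pyppeteer Page (or any object that offers
    the same waitForSelector and click coroutines).
    """
    # Waits for the element to exist in the DOM, for at most timeout_ms.
    await page.waitForSelector(selector, {"timeout": timeout_ms})
    await page.click(selector)
```

    Inside a coroutine this is called as await wait_and_click(page, '#close_donation_button > a', timeout_ms=5000). If the element never appears, waitForSelector raises pyppeteer.errors.TimeoutError, which the caller can catch when the element is optional.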
  • Execute JavaScript code:

    • Use the evaluate method of the Page object.
    • For example, to read the contents of localStorage: local_storage = await page.evaluate('''() => Object.assign({}, window.localStorage)''')
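
    localStorage values are always strings, and many sites store JSON in them. The sketch below reads the storage with evaluate and then decodes whatever parses as JSON. The helper names dump_local_storage and decode_json_values are hypothetical, and whether a given site stores JSON is an assumption.

```python
import json

# Copies localStorage into a plain object so it can be serialized back to Python.
READ_LOCAL_STORAGE_JS = "() => Object.assign({}, window.localStorage)"


async def dump_local_storage(page) -> dict:
    """Return the current page's localStorage as a dict of strings."""
    return await page.evaluate(READ_LOCAL_STORAGE_JS)


def decode_json_values(storage: dict) -> dict:
    """Parse values that hold JSON; leave every other value unchanged."""
    decoded = {}
    for key, value in storage.items():
        try:
            decoded[key] = json.loads(value)
        except (TypeError, ValueError):
            # Not JSON (or not a string): keep the raw value.
            decoded[key] = value
    return decoded
```

    Inside a coroutine: storage = decode_json_values(await dump_local_storage(page)).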
  • Intercept AJAX data

import asyncio
import json
from pyppeteer import launch
 
async def intercept_network_response(response):
    # In this example, we care only about responses returning JSONs
    if "application/json" in response.headers.get("content-type", ""):
        # Print some info about the responses
        print("URL:", response.url)
        print("Method:", response.request.method)
        print("Response headers:", response.headers)
        print("Request Headers:", response.request.headers)
        print("Response status:", response.status)
        # Print the content of the response
        try:
            # await response.json() returns the response as Python object
            print("Content: ", await response.json())
        except json.decoder.JSONDecodeError:
            # NOTE: Use await response.text() if you want to get raw response text
            print("Failed to decode JSON from", await response.text())
 
async def main():
    browser = await launch()
    page = await browser.newPage()
    
    # Schedule the coroutine for every response; it must be called with the response object
    page.on('response', lambda response: asyncio.ensure_future(intercept_network_response(response)))
            
    await page.goto('https://instagram.com')
    await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())
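
The script above calls browser.close() as soon as goto returns, so response handlers may not have finished reading their bodies by then. A possible variant, sketched below, collects the JSON bodies in a list, waits for the network to go idle, and waits for the handlers before closing. The names is_json and collect_json_responses are made up for illustration, and using waitUntil: networkidle0 is one choice among several.

```python
import asyncio


def is_json(headers: dict) -> bool:
    """True if the content-type header announces a JSON body."""
    return "application/json" in headers.get("content-type", "")


async def collect_json_responses(url: str) -> list:
    """Load url and return (response URL, parsed body) pairs for JSON responses."""
    # Imported here so is_json() stays usable without pyppeteer installed.
    from pyppeteer import launch

    collected = []
    pending = []

    async def handle(response):
        if not is_json(response.headers):
            return
        try:
            collected.append((response.url, await response.json()))
        except ValueError:
            # The body was not valid JSON despite the header; skip it.
            pass

    browser = await launch()
    try:
        page = await browser.newPage()
        # Schedule one task per response and remember it for later.
        page.on("response",
                lambda r: pending.append(asyncio.ensure_future(handle(r))))
        # networkidle0: navigation counts as done once no connection is open.
        await page.goto(url, {"waitUntil": "networkidle0"})
        # Let the scheduled handlers finish before the browser goes away.
        await asyncio.gather(*pending, return_exceptions=True)
    finally:
        await browser.close()
    return collected
```

Run it with asyncio.get_event_loop().run_until_complete(collect_json_responses('https://instagram.com')). Returning the data instead of printing it makes the result easier to filter or save afterwards.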