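Playwright async: fetching dynamic web page content

Notes from learning Playwright: using its async API to drive a browser and fetch the content of dynamically rendered pages.

The fake async version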

import asyncio
import time

from playwright.async_api import async_playwright


async def as_get_url_list(browser, i):
    # url_list is the module-level list created in the __main__ block below
    global url_list
    page = await browser.new_page()
    await page.goto(f"https://spa2.scrape.center/page/{i}")
    print(f"page goto https://spa2.scrape.center/page/{i}")
    # The movie links are rendered by JavaScript, so wait until they exist
    await page.wait_for_selector('a.name')
    elements = await page.query_selector_all('a.name')

    for e in elements:
        href = await e.get_attribute('href')
        url_list.append(f'https://spa2.scrape.center{href}')

async def pre_main():
    async with async_playwright() as playwright:
        chrom = playwright.chromium
        browser = await chrom.launch()

        for i in range(1, 11):
            # Awaiting here finishes page i before page i + 1 is even started
            await as_get_url_list(browser, i)

if __name__ == "__main__":
    url_list = []
    start_time = time.time()

    # Fetch asynchronously
    asyncio.run(pre_main())

    with open('urlList.tsv', 'w') as f:
        f.write('\n'.join(url_list))

    print(f'Time to collect url_list: {time.time() - start_time}')

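pre_main creates the browser and loops over all 10 list pages. For each page it calls the async function as_get_url_list, which collects the links to the individual movie pages.

In practice, pre_main gets no benefit from async here. Every iteration awaits as_get_url_list, so it waits for the current page to finish loading before the next one is started, and the pages are fetched strictly one after another. The cause is the for loop, which awaits each call in turn.

The real async version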
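To see the effect without launching a browser, here is a minimal sketch that replaces the page load with asyncio.sleep. The names fake_fetch, sequential, timed_sequential and the DELAY value are made up for illustration and are not part of Playwright. Awaiting inside the loop makes the total time roughly the sum of the individual waits.

```python
import asyncio
import time

DELAY = 0.1  # hypothetical per-page load time in seconds (assumption)


async def fake_fetch(i: int) -> str:
    # Stand-in for page.goto + wait_for_selector: it only sleeps for DELAY.
    await asyncio.sleep(DELAY)
    return f"page-{i}"


async def sequential(n: int) -> list[str]:
    results = []
    for i in range(1, n + 1):
        # This await suspends the loop until fake_fetch(i) has finished,
        # so fetch i + 1 cannot start any earlier.
        results.append(await fake_fetch(i))
    return results


def timed_sequential(n: int) -> tuple[list[str], float]:
    # Run the sequential version and measure the wall-clock time it takes.
    start = time.perf_counter()
    results = asyncio.run(sequential(n))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_sequential(5)
    print(results, f"{elapsed:.2f}s")
```

With 5 simulated pages, the elapsed time should come out at about 5 times DELAY, the same as a plain synchronous loop would take.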

async def pre_main():
    async with async_playwright() as playwright:
        chrom = playwright.chromium
        browser = await chrom.launch()
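        tasks = []

        for i in range(1, 11):
            # create_task schedules the coroutine on the event loop and returns immediately
            task = asyncio.create_task(as_get_url_list(browser, i))
            tasks.append(task)

        # Wait until all ten page tasks have finished
        await asyncio.gather(*tasks)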

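Rewrite pre_main so that it builds a list of tasks with asyncio.create_task and then awaits them all with asyncio.gather. The ten pages now load concurrently instead of one at a time, and that is what makes the code truly asynchronous.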
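The same pattern can be tried without a browser. The sketch below simulates each page load with asyncio.sleep; fake_fetch, concurrent, timed_concurrent and DELAY are hypothetical names chosen for this example. Because every task is scheduled before any of them is awaited, the waits overlap and the total time stays close to a single DELAY.

```python
import asyncio
import time

DELAY = 0.1  # hypothetical per-page load time in seconds (assumption)


async def fake_fetch(i: int) -> str:
    # Stand-in for page.goto + wait_for_selector: it only sleeps for DELAY.
    await asyncio.sleep(DELAY)
    return f"page-{i}"


async def concurrent(n: int) -> list[str]:
    # Schedule every fetch first; none of them is awaited yet.
    tasks = [asyncio.create_task(fake_fetch(i)) for i in range(1, n + 1)]
    # gather waits for all tasks and returns their results in input order.
    return await asyncio.gather(*tasks)


def timed_concurrent(n: int) -> tuple[list[str], float]:
    # Run the concurrent version and measure the wall-clock time it takes.
    start = time.perf_counter()
    results = asyncio.run(concurrent(n))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_concurrent(5)
    print(results, f"{elapsed:.2f}s")
```

Since asyncio.gather returns the results in the order the tasks were passed in, one possible variation is to have as_get_url_list return its links and merge the lists afterwards, rather than appending to a global url_list. That is a design option, not something the code above requires.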

Original link: https://blog.csdn.net/sakura553/article/details/132148718