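Playwright async: fetching dynamic web page content
These are notes from learning Playwright, using its async API to drive a browser and fetch the content of dynamically rendered pages.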
Fake async
import asyncio
import time

from playwright.async_api import async_playwright


async def as_get_url_list(browser, i):
    # Open listing page i and collect the detail-page links it contains.
    global url_list
    page = await browser.new_page()
    await page.goto(f"https://spa2.scrape.center/page/{i}")
    print(f"page goto https://spa2.scrape.center/page/{i}")
    # The links are rendered by JavaScript, so wait until they exist.
    await page.wait_for_selector('a.name')
    elements = await page.query_selector_all('a.name')
    for e in elements:
        href = await e.get_attribute('href')
        url_list.append(f'https://spa2.scrape.center{href}')


async def pre_main():
    async with async_playwright() as playwright:
        chrom = playwright.chromium
        browser = await chrom.launch()
        # Each await finishes before the next iteration starts.
        for i in range(1, 11):
            await as_get_url_list(browser, i)


if __name__ == "__main__":
    url_list = []
    start_time = time.time()
    # Fetch asynchronously
    asyncio.run(pre_main())
    with open('urlList.tsv', 'w') as f:
        f.write('\n'.join(url_list))
    print(f'Time to fetch url_list: {time.time() - start_time}')
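The pre_main function creates the browser and loops over all 10 listing pages. For each page it calls the async function as_get_url_list, which collects the links to the individual movie pages.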
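In practice, this pre_main gets no benefit from async. Each iteration awaits as_get_url_list to completion, so the next page is not requested until the current one has finished loading. The cause is the for loop, which awaits the calls one after another.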
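The difference can be seen without a browser. The following is a minimal sketch in which fake_fetch is a hypothetical stand-in for the page load (it only sleeps for an assumed 0.1 seconds), comparing an await inside a loop with asyncio.gather:

```python
import asyncio
import time


async def fake_fetch(i, delay=0.1):
    # Hypothetical stand-in for page.goto + wait_for_selector: it only sleeps.
    await asyncio.sleep(delay)
    return i


async def sequential(n):
    # Awaiting inside the loop: each call finishes before the next one starts.
    results = []
    for i in range(n):
        results.append(await fake_fetch(i))
    return results


async def concurrent(n):
    # Hand all coroutines to gather so their waits overlap.
    return await asyncio.gather(*(fake_fetch(i) for i in range(n)))


def timed(coro_fn, n):
    # Run one variant and return (result, elapsed seconds).
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start
```

With five fake pages, timed(sequential, 5) should take roughly five times the delay, while timed(concurrent, 5) should take roughly one delay, since all the sleeps run at the same time.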
Real async
async def pre_main():
    async with async_playwright() as playwright:
        chrom = playwright.chromium
        browser = await chrom.launch()
        tasks = []
        for i in range(1, 11):
            # create_task schedules the coroutine without waiting for it.
            task = asyncio.create_task(as_get_url_list(browser, i))
            tasks.append(task)
        # Wait here until every page has been processed.
        await asyncio.gather(*tasks)
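The modified pre_main wraps each call in asyncio.create_task, which schedules it on the event loop immediately, and collects the tasks in a list. It then waits for all of them with asyncio.gather. Only this way do the 10 page loads overlap and run truly asynchronously.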
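One side effect of the concurrent version is that each task appends to the shared url_list whenever it finishes, so the links are no longer guaranteed to be in page order. A possible variation, sketched here under assumptions, is to have each task return its links and let asyncio.gather assemble them, since gather returns results in the order the awaitables were passed in. The fake_get_url_list function and the URLs it produces are made up for illustration and do not touch the real site:

```python
import asyncio


async def fake_get_url_list(i):
    # Hypothetical stand-in for as_get_url_list that returns links
    # instead of appending to a global. Page 1 deliberately sleeps longest.
    await asyncio.sleep(0.01 / i)
    return [f"/detail/{i}-{k}" for k in range(2)]  # made-up paths


async def collect(pages):
    # gather keeps the input order, regardless of which task finishes first.
    per_page = await asyncio.gather(*(fake_get_url_list(i) for i in pages))
    # Flatten the per-page lists into one list of links.
    return [url for links in per_page for url in links]
```

Calling asyncio.run(collect(range(1, 4))) yields the links for page 1 first, then page 2, then page 3, even though page 1 finishes last.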
Original article: https://blog.csdn.net/sakura553/article/details/132148718