Pyppeteer Introduction

1. Introduction

Puppeteer is a Node.js-based tool developed by Google that lets us control Chrome through JavaScript. It can also be used as a web crawler, and it has a sophisticated and powerful API.

And what is Pyppeteer? It is a Python implementation of Puppeteer. It is not developed by Google, though: it is an unofficial port, written by a Japanese engineer, that covers part of Puppeteer’s features.

Behind Pyppeteer there is a Chromium browser that does the actual work of rendering web pages. Let’s start with how Chrome and Chromium are related.

Chromium is the open-source project that Google started in order to develop Chrome. Both browsers are built on the same source code: new features are implemented in Chromium first and ported to Chrome once they have been verified as stable. Chromium is therefore updated more frequently and includes many newer features, but as a standalone browser it has a much smaller user base. The two browsers have the same “roots” and share the same logo with different colour schemes: Chrome’s logo uses four colours (blue, red, green and yellow), while Chromium’s uses different shades of blue.

Pyppeteer relies on Chromium to run. If Chromium is not installed the first time you run Pyppeteer, it automatically downloads and configures it for you, which removes the need to set up the environment by hand.

So let’s take a look at how Pyppeteer is used.

2. Installation

The first step is installation. Since Pyppeteer uses Python’s async/await mechanism, it requires Python 3.5 or later; more recent Pyppeteer releases may require a newer Python version.

The installation is very simple:

pip3 install pyppeteer

Once the installation is complete, we can test it by starting a Python interpreter from the terminal or cmd and running:

import pyppeteer

If no errors are reported, then the installation has been successful.
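
The same check can be scripted. The following is only a minimal sketch: check_environment is a made-up helper name, not something Pyppeteer provides, and it uses nothing but the standard library to confirm the Python version and whether the pyppeteer package can be found.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 5)):
    """Report whether this interpreter looks ready for Pyppeteer.

    check_environment is a hypothetical helper, not a Pyppeteer API.
    """
    return {
        # async/await syntax needs at least Python 3.5
        'python_ok': sys.version_info >= min_version,
        # find_spec returns None when the package is not installed
        'pyppeteer_installed': importlib.util.find_spec('pyppeteer') is not None,
    }


print(check_environment())
```

If pyppeteer_installed is False, run the pip3 command above again and check that it targets the same interpreter.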

Now let’s try Pyppeteer out. The code can be written as follows:

import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq

async def main():
   browser = await launch()
   page = await browser.newPage()
   await page.goto('https://dynamic2.scrape.cuiqingcai.com/')
   await page.waitForSelector('.item .name')
   doc = pq(await page.content())
   names = [item.text() for item in doc('.item .name').items()]
   print('Names:', names)
   await browser.close()

asyncio.get_event_loop().run_until_complete(main())

So what exactly is happening in this process? Let’s look at it line by line.

The launch method starts a browser. Awaiting it returns a Browser object, which we assign to the browser variable.
The browser object then calls the newPage method, which opens a new tab in the browser and returns a new Page object. At this point the tab exists but no page has been visited yet, so it is still blank.
The Page object then calls the goto method, which is the equivalent of typing the URL into the address bar: the browser navigates to the corresponding page and loads it.
The Page object calls the waitForSelector method with a selector. The page then waits for a node matching the selector to appear, returning immediately if it already exists and otherwise waiting until it appears or the wait times out. If all goes well, the page has loaded successfully at this point.
Once the page has loaded, the content method is called to get the source code of the current page, which is the result after JavaScript rendering.
Finally, we use pyquery to parse that source and extract the movie names from the page to get the final result.
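
One thing the walkthrough does not cover is failure: if goto or waitForSelector raises, the final browser.close() call is never reached. The sketch below shows one possible way to guard against that with try/finally. The names run_with_browser and scrape_names are invented for illustration, and the launcher parameter exists only so that the helper can be exercised without a real browser.

```python
async def run_with_browser(task, launcher=None):
    """Launch a browser, run task(browser), and always close the browser.

    run_with_browser is a hypothetical helper, not part of Pyppeteer.
    """
    if launcher is None:
        # Imported here so the helper can be defined without Pyppeteer installed
        from pyppeteer import launch as launcher
    browser = await launcher()
    try:
        return await task(browser)
    finally:
        # Reached on success and on failure alike
        await browser.close()


async def scrape_names(browser):
    """The same steps as above: new tab, goto, wait, read content, parse."""
    from pyquery import PyQuery as pq

    page = await browser.newPage()
    await page.goto('https://dynamic2.scrape.cuiqingcai.com/')
    await page.waitForSelector('.item .name')
    doc = pq(await page.content())
    return [item.text() for item in doc('.item .name').items()]

# Run it the same way as the example above:
# asyncio.get_event_loop().run_until_complete(run_with_browser(scrape_names))
```

Passing the task in as a coroutine function keeps the cleanup logic in one place, whatever the task does with the browser.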

3. Detailed Usage

Almost all the functions of Pyppeteer can be found in the API Reference in the official documentation at https://miyakogi.github.io/pyppeteer/reference.html. Just look up the methods you use here.

3.1 launch

The first step in using Pyppeteer is to launch the browser, which is the equivalent of clicking the browser icon on the desktop. With Pyppeteer, you just need to call the launch method:

pyppeteer.launcher.launch(options: dict = None, **kwargs) → pyppeteer.browser.Browser

You can see that it lives in the launcher module, that its parameters are not listed individually in the declaration (they are passed as an options dict or as keyword arguments), and that it returns a Browser object from the browser module.

Next look at its parameters:

ignoreHTTPSErrors (bool): Whether to ignore HTTPS errors. Defaults to False.
headless (bool): Whether to run browser in headless mode. Defaults to True unless appMode or devtools options is True.
executablePath (str): Path to a Chromium or Chrome executable to run instead of default bundled Chromium.
slowMo (int|float): Slow down pyppeteer operations by the specified amount of milliseconds.
args (List[str]): Additional arguments (flags) to pass to the browser process.
ignoreDefaultArgs (bool): Do not use pyppeteer’s default args. This is dangerous option; use with care.
handleSIGINT (bool): Close the browser process on Ctrl+C. Defaults to True.
handleSIGTERM (bool): Close the browser process on SIGTERM. Defaults to True.
handleSIGHUP (bool): Close the browser process on SIGHUP. Defaults to True.
dumpio (bool): Whether to pipe the browser process stdout and stderr into process.stdout and process.stderr. Defaults to False.
userDataDir (str): Path to a user data directory.
env (dict): Specify environment variables that will be visible to the browser. Defaults to same as python process.
devtools (bool): Whether to auto-open a DevTools panel for each tab. If this option is True, the headless option will be set False.
logLevel (int|str): Log level to print logs. Defaults to same as the root logger.
autoClose (bool): Automatically close browser process when script completed. Defaults to True.
loop (asyncio.AbstractEventLoop): Event loop (experimental).
appMode (bool): Deprecated.

Well, knowing these parameters, we can try it out first.

3.2 Headless

The first thing to try is the most commonly used parameter, headless. If it is set to True, or left at its default, no browser window is shown at startup; if it is set to False, the window is visible. It is usually set to False for debugging and to True in production environments.
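
As a rough sketch of how the two modes might be switched without editing code, the example below builds the launch arguments from a flag. The helper name launch_options and the environment variable SCRAPER_DEBUG are assumptions made up for this illustration; only headless and slowMo come from the parameter list above.

```python
import os


def launch_options(debug=False):
    """Build keyword arguments for launch (hypothetical helper)."""
    options = {'headless': not debug}
    if debug:
        # slowMo delays each operation by the given number of milliseconds,
        # which makes the visible browser easier to follow
        options['slowMo'] = 50
    return options


async def main():
    from pyppeteer import launch  # imported lazily; needs Pyppeteer installed

    # SCRAPER_DEBUG is an assumed variable name, not one Pyppeteer reads itself
    debug = os.environ.get('SCRAPER_DEBUG') == '1'
    browser = await launch(**launch_options(debug))
    page = await browser.newPage()
    await page.goto('https://www.baidu.com')
    await browser.close()


print(launch_options(debug=True))
print(launch_options(debug=False))

# Run main() the same way as the other examples:
# asyncio.get_event_loop().run_until_complete(main())
```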

3.3 Devtools

We can also enable DevTools. When writing crawlers we often need to analyse the structure of a page and its network requests, so having DevTools open is very convenient. If we set the devtools parameter to True, a DevTools panel opens automatically for every tab. For example:

import asyncio
from pyppeteer import launch
 
async def main():
   browser = await launch(devtools=True)
   page = await browser.newPage()
   await page.goto('https://www.baidu.com')
   await asyncio.sleep(100)
 
asyncio.get_event_loop().run_until_complete(main())

3.4 Block infobar

At this point we can see a message at the top of the browser window: “Chrome is being controlled by automated testing software”, which is a bit annoying. This is where the args parameter comes into play. The infobar can be disabled as follows:

browser = await launch(headless=False, args=['--disable-infobars'])

3.5 Prevent Detection

Turning off the prompt is not enough on its own: some sites will still detect that the browser is driven by WebDriver, for example by checking the navigator.webdriver property.

The Pyppeteer Page object has a method called evaluateOnNewDocument, which runs a given script every time a new document is loaded, before the page’s own scripts. We can use it to run a statement that hides the WebDriver flag:

import asyncio
from pyppeteer import launch
 
async def main():
   browser = await launch(headless=False, args=['--disable-infobars'])
   page = await browser.newPage()
   await page.evaluateOnNewDocument('Object.defineProperty(navigator, "webdriver", {get: () => undefined})')
   await page.goto('https://antispider1.scrape.cuiqingcai.com/')
   await asyncio.sleep(100)
 
asyncio.get_event_loop().run_until_complete(main())

3.6 Set Page Size

The browser window and the page viewport are sized separately: the --window-size flag sets the former and the setViewport method sets the latter (the viewport defaults to 800x600), so the following example sets both to the same size.

import asyncio
from pyppeteer import launch
 
width, height = 1920, 1080
 
async def main():
   browser = await launch(headless=False, args=['--disable-infobars', f'--window-size={width},{height}'])
   page = await browser.newPage()
   await page.setViewport({'width': width, 'height': height})
   await page.evaluateOnNewDocument('Object.defineProperty(navigator, "webdriver", {get: () => undefined})')
   await page.goto('https://antispider1.scrape.cuiqingcai.com/')
   await asyncio.sleep(100)
 
asyncio.get_event_loop().run_until_complete(main())

3.7 User Data Persistence

Every time we start Pyppeteer we get a new, blank browser. If we log in to a site during one run, the next launch starts blank again and we have to log in once more, which is a real nuisance.

Compare this with shopping on Taobao in an ordinary browser: in many cases we can close the browser, open it again and still be logged in. This is because Taobao’s key cookies have been saved locally and are read again the next time the browser starts, so the session is preserved.

So where is this information stored? It is saved in the user data directory, which contains not only the basic configuration of the browser but also the cache, cookies and other data.

This explains the problem: when you start Selenium or Pyppeteer you always get a brand new browser because no user data directory has been set.

So how do we fix it? It’s as simple as setting userDataDir at launch, for example:

import asyncio
from pyppeteer import launch
 
async def main():
   browser = await launch(headless=False, userDataDir='./userdata', args=['--disable-infobars'])
   page = await browser.newPage()
   await page.goto('https://www.taobao.com')
   await asyncio.sleep(100)
 
asyncio.get_event_loop().run_until_complete(main())

For details, see the official description of the user data directory at https://chromium.googlesource.com/chromium/src/+/master/docs/user_data_dir.md.

3.8 Browser

Above we covered the launch method, which returns a Browser object. We normally assign it to a variable named browser; it is an instance of the Browser class.

Let’s look at the definition of the Browser class:

class pyppeteer.browser.Browser(connection: pyppeteer.connection.Connection, contextIds: List[str], ignoreHTTPSErrors: bool, setDefaultViewport: bool, process: Optional[subprocess.Popen] = None, closeCallback: Callable[[], Awaitable[None]] = None, **kwargs)

Here we can see that the constructor method has many parameters, but in most cases we can just use the launch method or connect method to create it.

As an object, the browser naturally has a number of methods for manipulating the browser itself, so let’s pick out some of the more useful ones and introduce them.

3.8.1 Incognito Window

Chrome has an Incognito mode, which provides a cleaner environment that does not share cache, cookies and so on with other browser instances. It can be enabled with the createIncognitoBrowserContext method, as shown in the following example:

import asyncio
from pyppeteer import launch
 
width, height = 1200, 768
 
async def main():
   browser = await launch(headless=False,
                          args=['--disable-infobars', f'--window-size={width},{height}'])
   context = await browser.createIncognitoBrowserContext()
   page = await context.newPage()
   await page.setViewport({'width': width, 'height': height})
   await page.goto('https://www.baidu.com')
   await asyncio.sleep(100)
 
asyncio.get_event_loop().run_until_complete(main())

3.8.2 Close

import asyncio
from pyppeteer import launch
 
async def main():
   browser = await launch()
   page = await browser.newPage()
   await page.goto('https://dynamic2.scrape.cuiqingcai.com/')
   await browser.close()

asyncio.get_event_loop().run_until_complete(main())

3.9 Page

3.9.1 Selector

The Page object has a number of built-in selector methods for selecting nodes. The J method takes a selector and returns the first node that matches, equivalent to querySelector; the JJ method returns a list of all nodes that match the selector, equivalent to querySelectorAll.

Let’s look at the usage and results of the following example:

import asyncio
from pyppeteer import launch
 
async def main():
   browser = await launch()
   page = await browser.newPage()
   await page.goto('https://dynamic2.scrape.cuiqingcai.com/')
   await page.waitForSelector('.item .name')
   j_result1 = await page.J('.item .name')
   j_result2 = await page.querySelector('.item .name')
   jj_result1 = await page.JJ('.item .name')
   jj_result2 = await page.querySelectorAll('.item .name')
   print('J Result1:', j_result1)
   print('J Result2:', j_result2)
   print('JJ Result1:', jj_result1)
   print('JJ Result2:', jj_result2)
   await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())

Here we call the four methods J, querySelector, JJ and querySelectorAll to compare their behaviour and the types of their results. The output is as follows:

J Result1: <pyppeteer.element_handle.ElementHandle object at 0x1166f7dd0>
J Result2: <pyppeteer.element_handle.ElementHandle object at 0x1166f07d0>
JJ Result1: [<pyppeteer.element_handle.ElementHandle object at 0x11677df50>, <pyppeteer.element_handle.ElementHandle object at 0x1167857d0>, <pyppeteer.element_handle.ElementHandle object at 0x116785110>,
...
<pyppeteer.element_handle.ElementHandle object at 0x11679db10>, <pyppeteer.element_handle.ElementHandle object at 0x11679dbd0>]
JJ Result2: [<pyppeteer.element_handle.ElementHandle object at 0x116794f10>, <pyppeteer.element_handle.ElementHandle object at 0x116794d10>, <pyppeteer.element_handle.ElementHandle object at 0x116794f50>,
...
<pyppeteer.element_handle.ElementHandle object at 0x11679f690>, <pyppeteer.element_handle.ElementHandle object at 0x11679f750>]

Here we can see that J and querySelector both return a single matched node as an ElementHandle object, while JJ and querySelectorAll both return a list of ElementHandle objects.
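
An ElementHandle prints as an opaque object reference, so to get at the text of the nodes we have to evaluate something in the page. The sketch below shows two ways this might be done: passing each handle to evaluate, and using JJeval (the short form of querySelectorAllEval) to select and map in a single call. The helper name element_texts is made up for this illustration.

```python
async def element_texts(page, selector):
    """Return the text of every node matching selector, in two ways."""
    # 1) Evaluate a JavaScript function against each ElementHandle
    handles = await page.JJ(selector)
    via_handles = [
        await page.evaluate('(el) => el.textContent', handle)
        for handle in handles
    ]
    # 2) Let JJeval do the lookup and the mapping inside the page in one call
    via_eval = await page.JJeval(
        selector, '(nodes) => nodes.map(n => n.textContent)')
    return via_handles, via_eval


async def main():
    from pyppeteer import launch  # imported lazily; needs Pyppeteer installed

    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://dynamic2.scrape.cuiqingcai.com/')
    await page.waitForSelector('.item .name')
    print(await element_texts(page, '.item .name'))
    await browser.close()

# Run main() the same way as the other examples:
# asyncio.get_event_loop().run_until_complete(main())
```

The second form needs only one round trip to the browser, which tends to matter when a page has many matching nodes.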

3.9.2 Tab Operation

We have already used the newPage method several times to open a new tab. How do we list the open tabs and switch between them afterwards? Let’s look at an example:

import asyncio
from pyppeteer import launch
 
async def main():
   browser = await launch(headless=False)
   page = await browser.newPage()
   await page.goto('https://www.baidu.com')
   page = await browser.newPage()
   await page.goto('https://www.bing.com')
   pages = await browser.pages()
   print('Pages:', pages)
   page1 = pages[1]
   await page1.bringToFront()
   await asyncio.sleep(100)
 
asyncio.get_event_loop().run_until_complete(main())

Here we start Pyppeteer, call the newPage method twice to create two new tabs, and visit a different website in each. To switch tabs, we call the pages method to get all the pages, pick one, and call its bringToFront method to bring the corresponding tab to the front.

3.9.3 Other Common Operations

import asyncio
from pyppeteer import launch
 
async def main():
   browser = await launch()  # headless: page.pdf() is only supported in headless mode
   page = await browser.newPage()
   await page.goto('https://dynamic1.scrape.cuiqingcai.com/')
   await page.goto('https://dynamic2.scrape.cuiqingcai.com/')
   # go back
   await page.goBack()
   # go forward
   await page.goForward()
   # reload page
   await page.reload()
   # save PDF
   await page.pdf()
   # screenshot
   await page.screenshot()
   # set page HTML document
   await page.setContent('<h2>Hello World</h2>')
   # set User-Agent
   await page.setUserAgent('Python')
   # set Headers
   await page.setExtraHTTPHeaders(headers={})
   # close
   await page.close()
   await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())

3.9.4 Click

Pyppeteer can also simulate clicks by calling its click method.

import asyncio
from pyppeteer import launch
 
async def main():
   browser = await launch(headless=False)
   page = await browser.newPage()
   await page.goto('https://dynamic2.scrape.cuiqingcai.com/')
   await page.waitForSelector('.item .name')
   await page.click('.item .name', options={
       'button': 'right',
       'clickCount': 1,  # 1 or 2
       'delay': 3000,  # milliseconds
   })
   await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())

Here the first parameter of the click method is the selector, i.e. which node to click. The second parameter is a dict of options:

button: the mouse button to use: left, middle or right.
clickCount: the number of clicks, e.g. 1 for a single click or 2 for a double click.
delay: the time to wait between mousedown and mouseup, in milliseconds.
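
As a small illustration of these options, a double click with the left button could be sketched as follows. The helper name double_click is invented for this example, and it assumes that clickCount set to 2 is reported to the page as a double click, as described above.

```python
async def double_click(page, selector):
    """Double-click the first node matching selector with the left button."""
    await page.click(selector, {
        'button': 'left',
        'clickCount': 2,  # assumed to be reported to the page as a double click
        'delay': 50,      # wait 50 ms between mousedown and mouseup
    })
```

It is called like the click example above, e.g. await double_click(page, '.item .name').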

3.9.5 Input Text

import asyncio
from pyppeteer import launch
 
async def main():
   browser = await launch(headless=False)
   page = await browser.newPage()
   await page.goto('https://www.taobao.com')

   await page.type('#q', 'iPad')

   await asyncio.sleep(10)
   await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())

3.9.6 Get Page Information

import asyncio
from pyppeteer import launch
 
async def main():
   browser = await launch(headless=False)
   page = await browser.newPage()
   await page.goto('https://dynamic2.scrape.cuiqingcai.com/')
   print('HTML:', await page.content())
   print('Cookies:', await page.cookies())
   await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())

3.9.7 Execute

import asyncio
from pyppeteer import launch
 
width, height = 1366, 768
 
async def main():
   browser = await launch()
   page = await browser.newPage()
   await page.setViewport({'width': width, 'height': height})
   await page.goto('https://dynamic2.scrape.cuiqingcai.com/')
   await page.waitForSelector('.item .name')
   await asyncio.sleep(2)
   await page.screenshot(path='example.png')
   dimensions = await page.evaluate('''() => {
       return {
           width: document.documentElement.clientWidth,
           height: document.documentElement.clientHeight,
           deviceScaleFactor: window.devicePixelRatio,
       }
   }''')

   print(dimensions)
   await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())

Here we have executed JavaScript with the evaluate method and obtained the corresponding result. The exposeFunction, evaluateOnNewDocument and evaluateHandle methods are also available.
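
The evaluate method can also pass Python values into the page: any positional arguments after the function source are handed to the JavaScript function. A minimal sketch, with count_nodes as a made-up helper name:

```python
async def count_nodes(page, selector):
    """Return how many nodes in the page match a CSS selector."""
    # Positional arguments after the function source are passed to it in the page
    return await page.evaluate(
        '(sel) => document.querySelectorAll(sel).length', selector)


async def main():
    from pyppeteer import launch  # imported lazily; needs Pyppeteer installed

    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://dynamic2.scrape.cuiqingcai.com/')
    await page.waitForSelector('.item .name')
    print('Items:', await count_nodes(page, '.item .name'))
    await browser.close()

# Run main() the same way as the other examples:
# asyncio.get_event_loop().run_until_complete(main())
```

Passing the selector as an argument avoids building JavaScript source by string formatting, which is easy to get wrong when the value contains quotes.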

3.9.8 Waiting

At the beginning of this lesson we demonstrated waitForSelector, which makes the page wait for nodes matching a condition to load before returning.

waitForSelector takes a CSS selector. It returns immediately if a matching node is found, and otherwise waits until one appears or the wait times out.

In addition to the waitForSelector method, there are a number of other wait methods, which are described below:

waitForFunction: waits until a JavaScript function returns a truthy value.
waitForNavigation: waits for a page navigation to finish and raises an error if it does not complete before the timeout.
waitForRequest: waits for a specific request to be made.
waitForResponse: waits for the response to a specific request.
waitFor: a generic wait method that accepts a selector, an XPath, a function or a timeout.
waitForSelector: waits for a node that matches the CSS selector to be loaded.
waitForXPath: waits for a node that matches the XPath to be loaded.

By waiting for conditions, we can control how the page loads.
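
Two common patterns built on these methods are sketched below: turning a timeout into a plain True/False result, and clicking a link while waiting for the navigation it triggers. The helper names node_appeared and click_and_wait are invented for this example, and the first one assumes that Pyppeteer’s TimeoutError derives from asyncio.TimeoutError.

```python
import asyncio


async def node_appeared(page, selector, timeout_ms=5000):
    """Return True if selector shows up within timeout_ms, else False."""
    try:
        # The timeout option is given in milliseconds
        await page.waitForSelector(selector, {'timeout': timeout_ms})
        return True
    except asyncio.TimeoutError:
        # Assumption: pyppeteer.errors.TimeoutError subclasses asyncio.TimeoutError
        return False


async def click_and_wait(page, selector):
    """Click a link and wait for the navigation it triggers."""
    # Start waiting together with the click so the navigation is not missed
    await asyncio.gather(
        page.waitForNavigation(),
        page.click(selector),
    )
```

Both helpers take an existing Page object, so they can be dropped into any of the examples above after the goto call.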

4. More

In addition, Pyppeteer has many other features, such as keyboard events, mouse events and dialog events, which I won’t go into here. For more information, refer to the official documentation: https://miyakogi.github.io/pyppeteer/reference.html

Published by jamie on 8 April 2023