Jansiel Notes

It's 2024 and You Still Don't Know How to Use Puppeteer?

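Preface:

As everyone knows, data is a key link in the business chain during development, and crawling and refreshing data with a crawler is routine work. Many languages support crawlers today: Python, Java, Ruby, and Node.js, which brings us to today's protagonist, Puppeteer. It is an open-source tool built on Node.js and maintained by the official Google Chrome team, used mainly to control and automate Google Chrome or other compatible browsers.

Without further ado, let's go step by step, from the basics to the more advanced, into the world of crawlers.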

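Introduction to Puppeteer

Puppeteer is a Node.js library developed by Google that provides an API for controlling headless Chrome or Chromium. It can simulate user actions in the browser, such as clicking, filling in forms, and taking screenshots, and it lets developers read the HTML after the browser has rendered it. Its high-level API makes browser automation simple and reliable, and covers automated control, page manipulation, network request interception, screenshots and PDF generation, and automated testing.

In short, Puppeteer is a powerful, easy-to-use, and flexible browser automation tool that helps developers carry out all kinds of browser operations and automation tasks.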

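Environment Setup

Since v1.7.0, Puppeteer has been published as two packages: puppeteer and puppeteer-core.

  • puppeteer: the full package, which downloads a runnable Chromium browser. The overall size is large (suitable for local debugging).
  • puppeteer-core: does not download a browser, so it is small; you must configure and update the browser yourself (suitable for production deployment).

Supported Node.js version: >= v16.20.0

npm i puppeteer   # or: npm i puppeteer -g (latest version at the time of writing: v21.7.0)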

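Basic Puppeteer APIs

Using Headless Mode

Puppeteer launches in headless mode by default. You can turn it off with the headless option; running with a visible browser window (headless: false) is recommended for local debugging.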

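// Default: headless mode
const browser = await puppeteer.launch();
// Equivalent to
// const browser = await puppeteer.launch({ headless: true });

// For local debugging, show the browser window instead:
// const browser = await puppeteer.launch({ headless: false });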

Note that Chrome 112 introduced a new headless mode, which can be enabled through a new value for the same option:

const browser = await puppeteer.launch({ headless: "new" });

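Using puppeteer-core

When deploying puppeteer-core to production, pay attention to the version. In the author's experience, v16.2.0 works without problems, while the latest v21.7.0 ran into some issues when deployed online.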

// const puppeteer = require("puppeteer");
const puppeteer = require("puppeteer-core");

// Wrapped in a function so that `await` and `return` are valid
async function launchBrowser() {
  const browser = await puppeteer.launch({
    // executablePath: "/usr/bin/google-chrome", // production
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // [local path]
  });
  return browser;
}
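
Because the production and local paths differ, the choice of executablePath can be isolated in a small helper. The following is only a sketch under stated assumptions: the function name resolveExecutablePath and the CHROME_PATH environment variable are hypothetical, and the two default paths are simply the ones used in the snippet above.

```javascript
// Pick a Chrome binary path for puppeteer-core.
// CHROME_PATH is a hypothetical environment variable used for illustration.
function resolveExecutablePath(platform = process.platform, env = process.env) {
  // An explicit override always wins.
  if (env.CHROME_PATH) return env.CHROME_PATH;

  switch (platform) {
    case "darwin": // local macOS path from the example above
      return "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome";
    case "linux": // production path from the example above
      return "/usr/bin/google-chrome";
    default:
      // No guess for other platforms: fail loudly instead.
      throw new Error(`No default Chrome path known for platform "${platform}"`);
  }
}
```

The returned string would then be passed as executablePath to puppeteer.launch(). Failing loudly on unknown platforms is a deliberate choice here, since a wrong path only surfaces later as a launch error.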

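To find the value for executablePath, open chrome://version/ and check the "Executable Path" entry.

Setting Additional Command-Line Arguments for the Browser Instance

The args option passes command-line switches to the browser instance being launched, to restrict or adjust its behavior. See the list of Chromium command-line switches for details.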

const puppeteer = require("puppeteer-core");

async function launchBrowser() {
  const browser = await puppeteer.launch({
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // [local path]
    args: [
      "--no-sandbox", // disable the sandbox
      "--disable-setuid-sandbox", // disable the setuid sandbox (Linux only)
      "--disable-extensions", // disable extensions
      "--incognito", // run in incognito mode
      "--disable-gpu", // disable GPU hardware acceleration
      "--no-zygote", // disable the zygote process model (no shared parent process is created at startup for forking child processes)
    ],
  });
  return browser;
}

Setting the Browser Viewport Resolution

Use defaultViewport to set the default viewport resolution for desktop (PC) pages.

const puppeteer = require("puppeteer-core");

async function launchBrowser() {
  const browser = await puppeteer.launch({
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // [local path]
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
  });
  return browser;
}

Emulating a Mobile Device

const puppeteer = require("puppeteer");
const iPhone = puppeteer.devices["iPhone 6"];

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  // A page must exist before a device can be emulated on it
  const page = await browser.newPage();
  await page.emulate(iPhone);
})();

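Other

If you deploy with Docker, refer to the related resources.

Getting Started with Simple Cases

With the basic APIs covered, the following three small examples show how Puppeteer works in practice.

Screenshot with an Emulated Device

Use Puppeteer to emulate an iPhone 6, visit Baidu, and take a screenshot of the current page.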

const puppeteer = require("puppeteer");
const iPhone = puppeteer.devices["iPhone 6"];

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto("https://baidu.com/");
  await page.screenshot({
    path: "full.png",
    fullPage: true,
  });
  console.log(await page.title());
  await browser.close();
})();

Screenshot of a User Search

Use Puppeteer to search Baidu for "puppeteer" and save a screenshot locally.

// baidu search
const puppeteer = require("puppeteer");
const screenshot = "baidu.png";

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  try {
    const page = await browser.newPage();
    await page.goto("https://baidu.com");
    await page.type("#kw", "puppeteer");
    await page.click("#su");
    // Fixed 2 s pause for the results to render
    // (page.waitForTimeout is deprecated in recent Puppeteer versions)
    await new Promise((resolve) => setTimeout(resolve, 2000));
    await page.screenshot({ path: screenshot });
  } finally {
    await browser.close();
  }
})().catch((err) => {
  // A try/catch wrapped around an async IIFE never sees its rejections,
  // so errors are handled on the returned promise instead
  console.error(err);
});

Use Puppeteer to plant a cookie and then open PayPal, so that the sign-in page renders with the user name filled in.

// set cookie
const cookie = {
  name: "login_email",
  value: "set_by_cookie@domain.com",
  domain: ".paypal.com",
  url: "https://www.paypal.com/",
  path: "/",
  httpOnly: true,
  secure: true,
};

const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setCookie(cookie);
  await page.goto("https://www.paypal.com/signin");
  await page.screenshot({ path: "paypal_login.png" });
  await browser.close();
})();

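Real-World Cases

After the simple examples above, you should have a basic picture of the overall flow. Next, we work through the mainstream crawling approaches one by one to see how each of them works.

Parsing HTML

Taking codashop as an example, we parse the HTML, filter the relevant DOM nodes, and extract data such as the price and product name of each SKU (product).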

Example code:

// codashop
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://www.codashop.com/en-my/pubg-mobile-uc-redeem-code",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

const run = async () => {
  const roots = ".form-section__denom-group";
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
    args: ["--no-sandbox"],
  });

  try {
    const page = await browser.newPage();
    // Default timeout for the page
    page.setDefaultTimeout(100000);

    // Default navigation timeout for the page
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    await page.goto(config.url, { waitUntil: "domcontentloaded" });

    // Throws on timeout if the SKU list never appears
    await page.waitForSelector(roots);

    const params = { ...config, platform: "codashop" };

    // DOM work, executed inside the page
    const jsons = await page.evaluate(
      async (args, _roots) => {
        // Short delay so that the list items finish rendering
        await new Promise((r) => setTimeout(r, 1000));

        const _lis = Array.from(document.querySelectorAll(_roots + " li"));

        return _lis.map((item) => {
          const sku_name_dom = item.querySelector(
            ".form-section__denom-data-section"
          );
          const sku_price_dom = item.querySelector(".starting-price-value");

          // Declared per item so values never leak from the previous item
          let price, sku_name;

          if (sku_name_dom) {
            sku_name = sku_name_dom.innerText || "SKU_NAME";
          }

          if (sku_price_dom) {
            price = sku_price_dom.innerText;
          }

          return {
            price,
            sku_name,
            currency: args.currency,
            platform: args.platform,
            game: args.gameName,
            country: args.country,
          };
        });
      },
      params,
      roots
    );
    console.log(jsons);
    /**
     * [
     *   { price: 'RM4.50', sku_name: '60 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' },
     *   { price: 'RM22.50', sku_name: '325 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' },
     *   { price: 'RM45.00', sku_name: '660 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' },
     *   { price: 'RM112.50', sku_name: '1800 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' },
     *   { price: 'RM225.00', sku_name: '3850 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' },
     *   { price: 'RM450.00', sku_name: '8100 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' }
     * ]
     */
    return jsons;
  } finally {
    // Close the browser even when a step above throws
    await browser.close();
  }
};

// Errors from the async function are handled on the returned promise
run().catch((err) => {
  console.error(err);
});
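
The config above carries thous_separator and decimal_point_separator, but the scraped price stays a display string such as 'RM4.50'. As a minimal sketch of how those two fields could be used to normalize prices into numbers, the helper below is an assumption for illustration (the name parsePrice is hypothetical and is not part of the original script):

```javascript
// Convert a scraped price label such as "RM4.50" into a number.
// Returns null when no numeric value can be recovered.
function parsePrice(text, { thous_separator, decimal_point_separator }) {
  let cleaned = String(text);

  // Drop thousands separators first (an empty separator means "none").
  if (thous_separator) {
    cleaned = cleaned.split(thous_separator).join("");
  }

  // Normalize the decimal separator to "." so Number() can read it.
  if (decimal_point_separator && decimal_point_separator !== ".") {
    cleaned = cleaned.split(decimal_point_separator).join(".");
  }

  // Keep digits and the decimal point only (strips "RM", spaces, symbols).
  cleaned = cleaned.replace(/[^0-9.]/g, "");

  const value = Number(cleaned);
  return cleaned !== "" && Number.isFinite(value) ? value : null;
}
```

Removing the thousands separator before touching the decimal separator matters for locales where the two are swapped, for example "1.234,50".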

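Parsing SSR-Rendered Data

Taking jollymax as an example: viewing the page source tells us two things, which framework the site uses and whether the page is server-side rendered (SSR), and from these we can locate where the data lives. This site is rendered on the server with Nuxt.js, as shown: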
[Figure: page source showing the Nuxt.js SSR data]

Example code:

// jollymax
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://www.jollymax.com/ru/PUBG",
  gameName: "pubgm",
  currency: "RUB",
  country: "ru",
  thous_separator: "", // thousands separator
  decimal_point_separator: ".", // decimal separator
};

const run = async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
    args: ["--no-sandbox"],
  });

  try {
    // Create a page
    const page = await browser.newPage();

    // Default timeout for the page
    page.setDefaultTimeout(100000);

    // Default navigation timeout for the page
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    // Intercept requests
    await page.setRequestInterception(true);

    page.on("request", async (request) => {
      // Abort resources we do not need, to speed up loading
      if (
        request.resourceType() == "image" ||
        request.resourceType() == "font" ||
        request.resourceType() == "stylesheet"
      ) {
        await request.abort();
      } else {
        await request.continue();
      }
    });

    await page.goto(config.url, { waitUntil: "domcontentloaded" });

    // Wait for the main content container to appear
    await page.waitForSelector(".content-right-part");

    const params = { ...config, platform: "jollymax" };

    const result = await page.evaluate((args) => {
      // Nuxt exposes the server-rendered state on window.__NUXT__;
      // optional chaining guards against any missing level
      const _serverData = window.__NUXT__?.data?.[0]?.serverData;
      const glist = _serverData?.pageData?.pageInfo?.goodsList;
      if (!(glist && glist.length)) return [];

      // By default, take the price of the first payment channel
      const getPrice = (item) => {
        let result = "0";
        if (item.payTypeList && item.payTypeList.length) {
          result = item.payTypeList[0].amount;
        }
        return result.toString();
      };

      return glist.map((item) => {
        const price = getPrice(item);
        return {
          currency: item.currency || args.currency,
          platform: args.platform,
          game: args.gameName,
          country: args.country,
          price,
          sku_name: item?.goodsName || "SKU_NAME",
        };
      });
    }, params);

    console.log(result);
    /**
     * [
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '91', sku_name: '60 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '440', sku_name: '325 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '910', sku_name: '660 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '2248', sku_name: '1800 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '4500', sku_name: '3850 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '9100', sku_name: '8100 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '1442', sku_name: 'RP Upgrade Pack-A3' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '3608', sku_name: 'Elite RP Upgrade Pack-A3' }
     * ]
     */
    return result;
  } finally {
    // Close the browser even when a step above throws
    await browser.close();
  }
};

// Errors from the async function are handled on the returned promise
run().catch((err) => {
  console.error(err);
});
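
The mapping from the Nuxt payload to SKU rows does not depend on the browser, so it can be pulled out and exercised against a hand-written object. The sketch below assumes the payload shape used in the example above (data[0].serverData.pageData.pageInfo.goodsList, with payTypeList[].amount and goodsName); the function name extractGoods is hypothetical.

```javascript
// Map a Nuxt state object (shape assumed from the example above) to SKU rows.
function extractGoods(nuxtState, args) {
  const serverData = nuxtState?.data?.[0]?.serverData;
  const goodsList = serverData?.pageData?.pageInfo?.goodsList;
  if (!Array.isArray(goodsList)) return [];

  return goodsList.map((item) => {
    // Use the first payment channel's amount, falling back to "0".
    const amount = item?.payTypeList?.[0]?.amount ?? "0";
    return {
      currency: item?.currency || args.currency,
      platform: args.platform,
      game: args.gameName,
      country: args.country,
      price: String(amount),
      sku_name: item?.goodsName || "SKU_NAME",
    };
  });
}
```

Assuming the payload is JSON-serializable, one option is to return window.__NUXT__ from page.evaluate and run a function like this on the Node.js side, which keeps the parsing logic testable without launching a browser.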

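HTTP Hijacking / Request Interception

Taking razer as an example: among the network requests, find the one that supplies the data rendered on the current page, intercept the response for the current game name, and analyze it to obtain the product names, prices, and other information.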

Example code:

// razer
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://gold.razer.com/my/en/gold/catalog/pubgm",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

const run = async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
    args: ["--no-sandbox"],
  });

  try {
    const page = await browser.newPage();
    // Default timeout for the page
    page.setDefaultTimeout(100000);

    // Default navigation timeout for the page
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    // Intercept requests
    await page.setRequestInterception(true);

    page.on("request", async (request) => {
      // Abort resources we do not need, to speed up loading
      if (
        request.resourceType() == "image" ||
        request.resourceType() == "font"
      ) {
        await request.abort();
      } else {
        await request.continue();
      }
    });

    function getResValue() {
      return new Promise((resolve) => {
        page.on("response", async (response) => {
          const url = response.url();
          const headers = response.headers();
          // Some responses carry no content-type header
          const contentType = headers["content-type"] || "";
          const _url =
            url && url.indexOf("/") !== -1 ? url.split("/").pop() : "";

          if (_url && contentType.includes("application/json")) {
            let jsons;
            try {
              jsons = await response.json();
            } catch (err) {
              return; // body unavailable or not valid JSON: skip it
            }

            if (jsons && jsons.gameSkus && jsons.gameSkus.length) {
              const result = jsons.gameSkus.map((item) => {
                const price = item.unitGold || item.unitBaseGold || 0;
                const sku_name =
                  item.productName || item.vanityName || "SKU_NAME";

                return {
                  currency: config.currency,
                  country: config.country,
                  platform: "razer",
                  game: _url,
                  price: price.toString(),
                  sku_name,
                };
              });
              resolve(result);
            }
          }
        });
      });
    }

    // Register the response listener before navigating,
    // otherwise the response may arrive before anyone is listening
    const pending = getResValue();
    await page.goto(config.url);
    const result = await pending;
    console.log(result);
    /**
     * [
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '5', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM5)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '10', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM10)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '20', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM20)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '30', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM30)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '40', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM40)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '50', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM50)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '100', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM100)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '200', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM200)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '300', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM300)' }
     * ]
     */
    return result;
  } finally {
    // Close the browser even when a step above throws
    await browser.close();
  }
};

// Errors from the async function are handled on the returned promise
run().catch((err) => {
  console.error(err);
});
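
The promise returned by getResValue() only resolves when a matching JSON response arrives, so if the site changes its API the script would wait forever. One possible safeguard, sketched here under the assumption that a plain timeout is acceptable, is to race the promise against a timer. The helper name withTimeout is hypothetical and is not a Puppeteer API.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = "operation") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards
  // so that it cannot keep the Node.js process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

In the example above this could wrap the wait, for instance withTimeout(getResValue(), 30000, "gameSkus response"), so that a missing response surfaces as an error instead of a hang.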

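Simulating User Clicks

Taking razer as an example again: locate the DOM elements that anchor each product on the page, simulate a click on each one, and read the payment-channel price data that is requested for each product.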

Example code:

// razer: simulate user clicks
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://gold.razer.com/my/en/gold/catalog/pubgm",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

// Delay helper
const waitFor = async (t) => {
  return new Promise((r) => setTimeout(r, t));
};
const gameSkuList = [];

const run = async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
    args: ["--no-sandbox"],
  });

  try {
    const page = await browser.newPage();
    // Default timeout for the page
    page.setDefaultTimeout(100000);

    // Default navigation timeout for the page
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    await page.goto(config.url);

    await waitFor(3000);

    const webshopStepSku = await page.waitForSelector("#webshop_step_sku");
    if (!webshopStepSku) {
      throw new Error("The current IP has been blocked!");
    }

    const skuItem = await page.$$("#webshop_step_sku .sku-list__item");
    const darkFilter = await page.$(".onetrust-pc-dark-filter");

    // Hide the custom popup overlay by default (if it is present)
    await page.evaluate((element) => {
      element && (element.style.display = "none");
    }, darkFilter);

    const params = { ...config };
    const getCards = async (dList, args) => {
      for (let d of dList) {
        const sku_name = await page.evaluate((element) => {
          const res = element.querySelector(".selection-tile__text");
          // Always return a string, even when the node is missing
          return res?.innerText || "";
        }, d);
        await d.click();
        await waitFor(1500);

        const price_text = await page.evaluate(() => {
          const channels = document.querySelector(
            "#webshop_step_payment_channels"
          );
          if (!channels) return "";

          const details = channels.querySelectorAll(
            ".selection-tile-promos__details"
          );
          // Prefer the other payment channel; fall back to the wallet
          const _details = details[1] || details[0] || null;
          if (!_details) return "";

          const _card = _details.querySelector(
            ".align-self-center.text-right"
          );
          if (!_card) return "";

          return _card.innerText || "0";
        });

        const jons = {
          sku_name,
          price: price_text,
          currency: args.currency,
          platform: "pubgm",
          game: args.gameName,
          country: args.country,
        };

        gameSkuList.push(jons);
      }

      return gameSkuList;
    };
    const result = await getCards(skuItem, params);

    console.log(result);
    /**
     * [
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM5)', price: 'RM 5.00', currency: 'RM', platform: 'pubgm', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM10)', price: 'RM 10.00', currency: 'RM', platform: 'pubgm', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM20)', price: 'RM 20.00', currency: 'RM', platform: 'pubgm', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM30)', price: 'RM 30.00', currency: 'RM', platform: 'pubgm', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM40)', price: 'RM 40.00', currency: 'RM', platform: 'pubgm', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM50)', price: 'RM 50.00', currency: 'RM', platform: 'pubgm', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM100)', price: 'RM 100.00', currency: 'RM', platform: 'pubgm', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM200)', price: 'RM 200.00', currency: 'RM', platform: 'pubgm', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM300)', price: 'RM 300.00', currency: 'RM', platform: 'pubgm', game: 'pubgm', country: 'my' }
     * ]
     */
    return result;
  } finally {
    // Close the browser even when a step above throws
    await browser.close();
  }
};

// Errors from the async function are handled on the returned promise
run().catch((err) => {
  console.error(err);
});

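Advanced Usage

In practice, different sites come with different or special scenarios, for example: how to crawl multiple pages, how to get past CAPTCHA checks, or how to defeat bot detection. Let's unlock some of Puppeteer's more powerful features.

Bypassing Bot Detection

We can test against a bot-detection page. On the left is a real user and on the right is a visit made with Puppeteer; the WebDriver row on the right is clearly flagged in red. Tip: results may differ between browsers.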
[Figures: bot-detection test results]

We can use the puppeteer-extra-plugin-stealth plugin, which is part of the puppeteer-extra family. Visiting the page again with it (right-hand image), the red flag is clearly gone.

// Bypass crawler detection
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.goto("https://bot.sannysoft.com/");
  await browser.close();
})();

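Bypassing CAPTCHA Checks

Some sites verify visitors with a CAPTCHA, such as Google's human verification (reCAPTCHA) shown below. The puppeteer-extra-plugin-recaptcha plugin can solve these so that the subsequent data operations can proceed.

Tip: this relies on a paid service.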
[Figure: Google reCAPTCHA human verification]

Example code:

const puppeteer = require("puppeteer-extra");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
puppeteer.use(
  RecaptchaPlugin({
    provider: {
      id: "2captcha",
      token: "xxxxx", // your 2captcha API token (paid service)
    },
    visualFeedback: true,
  })
);

puppeteer.launch({ headless: false }).then(async (browser) => {
  const page = await browser.newPage();
  await page.goto("https://www.google.com/recaptcha/api2/demo");

  // Solve the reCAPTCHA found on the page
  await page.solveRecaptchas();

  // Submit the form and wait for the resulting navigation
  await Promise.all([
    page.waitForNavigation(),
    page.click(`#recaptcha-demo-submit`),
  ]);
  await page.screenshot({ path: "response.png", fullPage: true });
  await browser.close();
});

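Running Multiple Workers in Parallel

In many scenarios we crawl several URLs at the same time. To keep performance under control, puppeteer-cluster can manage multiple workers that handle the different sites in parallel, reducing the performance overhead.

Example code: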

const { Cluster } = require("puppeteer-cluster");

(async () => {
  // Create a cluster with up to 3 workers
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 3,
    puppeteerOptions: {
      headless: false,
    },
  });

  // Define a task (in this case: screenshot of page)
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);

    const path = url.replace(/[^a-zA-Z]/g, "_") + ".png";
    await page.screenshot({ path });
    console.log(`Screenshot of ${url} saved: ${path}`);
  });

  // Add some pages to queue
  cluster.queue("https://www.baidu.com");
  cluster.queue("https://www.bing.com/?mkt=zh-CN");
  cluster.queue("https://github.com/");

  // Shutdown after everything is done
  await cluster.idle();
  await cluster.close();
})();
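
To make the idea behind maxConcurrency concrete, here is a minimal promise pool in plain Node.js. It is only an illustration of capping the number of tasks in flight; it is not how puppeteer-cluster is implemented, and the name runWithLimit is hypothetical.

```javascript
// Run `task` over `items` with at most `limit` tasks in flight at once.
// Results are returned in the same order as `items`.
async function runWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each worker keeps pulling the next unclaimed item until none are left.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With a limit of 3 and the three URLs queued above, all of them would start at once; a fourth URL would wait until one of the first three finished.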

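Summary

Puppeteer helps us automate many operations, but keep its trade-offs in mind: memory-heavy tasks can push memory usage very high, and having to start a real Chrome instance hurts applications that need to run quickly.

Overall, Puppeteer is a powerful and easy-to-use browser automation tool suited to a wide range of scenarios. When deciding whether to use it, however, weigh its two drawbacks: heavy consumption of system resources and slow startup.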