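It's 2024 and you still can't use Puppeteer?
Preface:
As we all know, data is a key link in the whole business chain during development, and crawling and refreshing data with a crawler is a routine task. Many languages support crawlers: Python, Java, Ruby, and Node.js. The Node.js option is today's protagonist, Puppeteer: an open-source, Node.js-based tool maintained by the Google Chrome team, used mainly to control and automate Google Chrome or other compatible browsers.
Without further ado, let's walk into the world of crawlers step by step, from the basics to more advanced topics.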
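Introduction to Puppeteer
Puppeteer is a Node.js library developed by Google that provides a high-level API for controlling headless Chrome or Chromium. It can simulate user actions in the browser, such as clicking, filling in forms, and taking screenshots, and it lets developers read the HTML after the browser has rendered it. Its main capabilities are automation control, page manipulation, network request interception, screenshot and PDF generation, and automated testing.
In short, Puppeteer is a powerful, easy-to-use, and flexible browser automation tool that helps developers carry out all kinds of browser operations and automation tasks.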
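Environment setup
Since v1.7.0, Puppeteer has been published as two packages: puppeteer and puppeteer-core.
- puppeteer: the full package, which downloads an executable Chromium browser on install. It is large (suited to local debugging).
- puppeteer-core: does not download a browser and is small, but you must install and update the browser it drives yourself (suited to production deployment).
Supported Node version: >= v16.20.0
npm i puppeteer   # or install globally: npm i puppeteer -g (latest version at the time of writing: v21.7.0)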
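Puppeteer basic API
Using headless mode
Puppeteer launches in headless mode by default. Set the headless option to false to turn it off, which is recommended for local debugging because you can watch the browser window.
// Headless by default; equivalent to puppeteer.launch({ headless: true })
const browser = await puppeteer.launch();
// For local debugging, launch with a visible window instead:
// const browser = await puppeteer.launch({ headless: false });
Note that Chrome 112 introduced a new headless mode, which can be selected with a new option value:
const browser = await puppeteer.launch({ headless: "new" });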
Using puppeteer-core
When deploying puppeteer-core to production, pay attention to the version. In my experience v16.2.0 works without problems, while the latest v21.7.0 ran into some issues when deployed online.
// const puppeteer = require("puppeteer");
const puppeteer = require("puppeteer-core");

async function launchBrowser() {
  const browser = await puppeteer.launch({
    // executablePath: "/usr/bin/google-chrome", // production
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // [local path]
  });
  return browser;
}
To find the executablePath, open chrome://version/ in the browser and check the "Executable Path" field.
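Since the production and local executable paths differ, one way to avoid editing the launch call by hand is to pick the options from an environment name. The following is only a minimal sketch: `buildLaunchOptions` and `EXECUTABLE_PATHS` are hypothetical names, the two paths are simply the ones quoted above, and the headless choice per environment is an assumption.

```javascript
// Hypothetical helper: choose launch options from an environment name.
// The two paths are the ones quoted in the article; adjust them for your machines.
const EXECUTABLE_PATHS = {
  production: "/usr/bin/google-chrome",
  local: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
};

function buildLaunchOptions(env = "local") {
  const executablePath = EXECUTABLE_PATHS[env];
  if (typeof executablePath !== "string") {
    throw new Error(`Unknown environment: ${env}`);
  }
  return {
    executablePath,
    // Assumption: show the window locally for debugging, stay headless in production.
    headless: env !== "local",
  };
}

console.log(buildLaunchOptions("production"));
```

The returned object could then be passed straight to puppeteer.launch(), for example puppeteer.launch(buildLaunchOptions("local")).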
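Setting other command-line arguments for the browser instance
Use args to pass command-line switches to the browser instance being launched. See the list of Chromium command-line switches for details.
const puppeteer = require("puppeteer-core");

async function launchBrowser() {
  const browser = await puppeteer.launch({
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // [local path]
    args: [
      "--no-sandbox", // disable the sandbox
      "--disable-setuid-sandbox", // disable the setuid sandbox (Linux only)
      "--disable-extensions", // disable extensions
      "--incognito", // run in incognito mode
      "--disable-gpu", // disable GPU hardware acceleration
      "--no-zygote", // disable the zygote process, so child processes are not forked from a shared zygote
    ],
  });
  return browser;
}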
Setting the browser viewport size
Use defaultViewport to set the default viewport size for desktop pages.
const puppeteer = require("puppeteer-core");

async function launchBrowser() {
  const browser = await puppeteer.launch({
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // [local path]
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
  });
  return browser;
}
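Emulating a mobile device
const puppeteer = require("puppeteer");
const iPhone = puppeteer.devices["iPhone 6"];

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  // emulate() is a Page method, so create a page first
  const page = await browser.newPage();
  await page.emulate(iPhone);
})();
Other
If you deploy with Docker, refer to the related resources.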
Getting started with simple cases
With the basic API covered, let's look at three small examples to see how it works in practice.
Screenshot with device emulation
Use Puppeteer to emulate an iPhone 6, visit Baidu, and take a screenshot of the page.
const puppeteer = require("puppeteer");
const iPhone = puppeteer.devices["iPhone 6"];

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto("https://baidu.com/");
  await page.screenshot({
    path: "full.png",
    fullPage: true,
  });
  console.log(await page.title());
  await browser.close();
})();
Screenshot of a user search
Use Puppeteer to search Baidu for "puppeteer" and save a screenshot locally.
// baidu search
const puppeteer = require("puppeteer");
const screenshot = "baidu.png";

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  try {
    const page = await browser.newPage();
    await page.goto("https://baidu.com");
    await page.type("#kw", "puppeteer");
    await page.click("#su");
    // Give the results two seconds to render (page.waitForTimeout is deprecated)
    await new Promise((r) => setTimeout(r, 2000));
    await page.screenshot({ path: screenshot });
  } catch (err) {
    // The try/catch must sit inside the async function to catch awaited errors
    console.error(err);
  } finally {
    await browser.close();
  }
})();
Setting a cookie
Use Puppeteer to open PayPal with a cookie already set, so that the login email is rendered on the sign-in page.
// set cookie
const cookie = {
  name: "login_email",
  value: "set_by_cookie@domain.com",
  domain: ".paypal.com",
  url: "https://www.paypal.com/",
  path: "/",
  httpOnly: true,
  secure: true,
};

const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // Set the cookie before navigating so it is sent with the first request
  await page.setCookie(cookie);
  await page.goto("https://www.paypal.com/signin");
  await page.screenshot({ path: "paypal_login.png" });
  await browser.close();
})();
Practical cases
With the simple examples above, you should now have a basic picture of the whole flow. Next, let's work through the mainstream crawling approaches one by one and see how each works.
Parsing HTML
Taking codashop as an example, we parse the HTML, filter the relevant DOM nodes, and extract each SKU's price, product name, and related data.
Example code:
// codashop
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://www.codashop.com/en-my/pubg-mobile-uc-redeem-code",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

const run = async () => {
  const roots = ".form-section__denom-group";
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
    args: ["--no-sandbox"],
  });
  try {
    const page = await browser.newPage();
    // Default timeout for waits on this page
    page.setDefaultTimeout(100000);

    // Default timeout for navigation
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    await page.goto(config.url, { waitUntil: "domcontentloaded" });

    // waitForSelector throws on timeout, so past this line the list exists
    await page.waitForSelector(roots);

    const params = { ...config, platform: "codashop" };

    // Work with the DOM inside the page
    const jsons = await page.evaluate(
      async (args, _roots) => {
        // Give the page one more second to finish rendering
        await new Promise((r) => setTimeout(r, 1000));

        const _lis = Array.from(document.querySelectorAll(_roots + " li"));

        return _lis.map((item) => {
          const sku_name_dom = item.querySelector(
            ".form-section__denom-data-section"
          );
          const sku_price_dom = item.querySelector(".starting-price-value");

          // Values are scoped per item so nothing leaks from the previous item
          return {
            price: sku_price_dom ? sku_price_dom.innerText : "",
            sku_name: (sku_name_dom && sku_name_dom.innerText) || "SKU_NAME",
            currency: args.currency,
            platform: args.platform,
            game: args.gameName,
            country: args.country,
          };
        });
      },
      params,
      roots
    );
    console.log(jsons);
    /**
     * [
     *   { price: 'RM4.50', sku_name: '60 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' },
     *   { price: 'RM22.50', sku_name: '325 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' },
     *   { price: 'RM45.00', sku_name: '660 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' },
     *   { price: 'RM112.50', sku_name: '1800 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' },
     *   { price: 'RM225.00', sku_name: '3850 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' },
     *   { price: 'RM450.00', sku_name: '8100 UC', currency: 'RM', platform: 'codashop', game: 'pubgm', country: 'my' }
     * ]
     */
  } finally {
    await browser.close();
  }
};

// A try/catch around run() cannot catch async errors; handle the rejected promise instead
run().catch((err) => console.error(err));
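The config object carries thous_separator and decimal_point_separator, but the example never uses them. One plausible use, and this is an assumption rather than something the article states, is to turn scraped strings such as 'RM4.50' into numbers. A minimal sketch, where `parsePrice` is a hypothetical helper:

```javascript
// Hypothetical helper: convert a scraped price string into a number using
// the separators from the config object. Returns null if no number is found.
function parsePrice(text, { thous_separator = ",", decimal_point_separator = "." } = {}) {
  // Keep only ASCII digits and the two separators (drops currency symbols and spaces).
  let cleaned = "";
  for (const ch of String(text)) {
    if (/\d/.test(ch) || ch === thous_separator || ch === decimal_point_separator) {
      cleaned += ch;
    }
  }
  // Remove thousands separators (an empty string means "none").
  if (thous_separator) {
    cleaned = cleaned.split(thous_separator).join("");
  }
  // Normalize the decimal separator to "." so parseFloat understands it.
  if (decimal_point_separator && decimal_point_separator !== ".") {
    cleaned = cleaned.split(decimal_point_separator).join(".");
  }
  const value = Number.parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}

const separators = { thous_separator: ",", decimal_point_separator: "." };
console.log(parsePrice("RM4.50", separators)); // 4.5
console.log(parsePrice("RM1,112.50", separators)); // 1112.5
```

Keeping the separators in config rather than hard-coding them lets the same function handle sites that format prices differently, for example "1.234,56" with the two separators swapped.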
Parsing SSR-rendered data
Taking jollymax as an example, viewing the page source tells us two things: which framework is used and whether the page is server-side rendered. From that we can locate where the data lives. The page below is server-side rendered with Nuxt.js, as shown in the figure:
Example code:
// jollymax
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://www.jollymax.com/ru/PUBG",
  gameName: "pubgm",
  currency: "RUB",
  country: "ru",
  thous_separator: "", // thousands separator
  decimal_point_separator: ".", // decimal separator
};

const run = async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
    args: ["--no-sandbox"],
  });
  try {
    // Create a page
    const page = await browser.newPage();

    // Default timeout for waits on this page
    page.setDefaultTimeout(100000);

    // Default timeout for navigation
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    // Intercept requests
    await page.setRequestInterception(true);

    page.on("request", async (request) => {
      // Abort resources we do not need so the page loads faster
      const type = request.resourceType();
      if (type === "image" || type === "font" || type === "stylesheet") {
        await request.abort();
      } else {
        await request.continue();
      }
    });

    await page.goto(config.url, { waitUntil: "domcontentloaded" });

    // Wait for the main content to appear in the DOM
    await page.waitForSelector(".content-right-part");

    const params = { ...config, platform: "jollymax" };

    const result = await page.evaluate((args) => {
      const nuxtData = window.__NUXT__ && window.__NUXT__.data;
      if (!(nuxtData && nuxtData.length)) return [];

      const _serverData = nuxtData[0]?.serverData;
      // Guard against a missing serverData before using the `in` operator
      if (!(_serverData && "pageData" in _serverData)) return [];

      const glist = _serverData.pageData.pageInfo.goodsList;
      if (!(glist && glist.length)) return [];

      // Take the price of the first payment channel by default
      const getPrice = (item) => {
        let result = "0";
        if (item.payTypeList.length) {
          result = item.payTypeList[0].amount;
        }
        return result.toString();
      };

      return glist.map((item) => {
        const price = getPrice(item);
        return {
          currency: item.currency || args.currency,
          platform: args.platform,
          game: args.gameName,
          country: args.country,
          price,
          sku_name: item?.goodsName || "SKU_NAME",
        };
      });
    }, params);

    console.log(result);
    /**
     * [
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '91', sku_name: '60 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '440', sku_name: '325 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '910', sku_name: '660 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '2248', sku_name: '1800 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '4500', sku_name: '3850 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '9100', sku_name: '8100 UC' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '1442', sku_name: 'RP Upgrade Pack-A3' },
     *   { currency: 'RUB', platform: 'jollymax', game: 'pubgm', country: 'ru', price: '3608', sku_name: 'Elite RP Upgrade Pack-A3' }
     * ]
     */
  } finally {
    await browser.close();
  }
};

// A try/catch around run() cannot catch async errors; handle the rejected promise instead
run().catch((err) => console.error(err));
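The resource filter inside the request handler above can be pulled out into a small predicate, which makes the blocked types easy to change per site. This is only a sketch: `BLOCKED_RESOURCE_TYPES`, `shouldAbort`, and `handleRequest` are hypothetical names, while resourceType(), abort(), and continue() are the request methods the article already uses.

```javascript
// Resource types the article aborts to speed up page loads.
const BLOCKED_RESOURCE_TYPES = new Set(["image", "font", "stylesheet"]);

// Pure predicate: should a request of this resource type be aborted?
function shouldAbort(resourceType, blocked = BLOCKED_RESOURCE_TYPES) {
  return blocked.has(resourceType);
}

// Handler sketch for page.on("request", ...); `request` is expected to expose
// resourceType(), abort() and continue(), as Puppeteer's request object does.
async function handleRequest(request, blocked = BLOCKED_RESOURCE_TYPES) {
  if (shouldAbort(request.resourceType(), blocked)) {
    await request.abort();
  } else {
    await request.continue();
  }
}

console.log(shouldAbort("image")); // true
console.log(shouldAbort("script")); // false
```

With request interception enabled, the handler would be registered as page.on("request", (request) => handleRequest(request)). Passing a different Set, for example one without "stylesheet", keeps styles for sites where a screenshot is needed.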
HTTP interception / requests
Taking razer as an example, we find the request that supplies the data for the current page, intercept the response for the current game, and read the product names, prices, and other fields from it.
Example code:
// razer
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://gold.razer.com/my/en/gold/catalog/pubgm",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

const run = async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
    args: ["--no-sandbox"],
  });
  try {
    const page = await browser.newPage();
    // Default timeout for waits on this page
    page.setDefaultTimeout(100000);

    // Default timeout for navigation
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    // Intercept requests
    await page.setRequestInterception(true);

    page.on("request", async (request) => {
      // Abort resources we do not need so the page loads faster
      const type = request.resourceType();
      if (type === "image" || type === "font") {
        await request.abort();
      } else {
        await request.continue();
      }
    });

    // Resolve with the SKU list from the first matching JSON response
    function getResValue() {
      return new Promise((resolve) => {
        page.on("response", async (response) => {
          const url = response.url();
          // The content-type header may be missing, so default to ""
          const contentType = response.headers()["content-type"] || "";
          const _url =
            url && url.indexOf("/") !== -1 ? url.split("/").pop() : "";

          if (!_url || !contentType.includes("application/json")) return;

          let jsons;
          try {
            jsons = await response.json();
          } catch (err) {
            return; // body unavailable or not valid JSON
          }

          if (jsons && jsons.gameSkus && jsons.gameSkus.length) {
            const result = jsons.gameSkus.map((item) => {
              const price = item.unitGold || item.unitBaseGold || 0;
              const sku_name =
                item.productName || item.vanityName || "SKU_NAME";

              return {
                currency: config.currency,
                country: config.country,
                platform: "razer",
                game: _url,
                price: price.toString(),
                sku_name,
              };
            });
            resolve(result);
          }
        });
      });
    }

    // Attach the response listener before navigating so no response is missed
    const pending = getResValue();
    await page.goto(config.url);
    const result = await pending;
    console.log(result);
    /**
     * [
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '5', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM5)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '10', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM10)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '20', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM20)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '30', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM30)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '40', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM40)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '50', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM50)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '100', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM100)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '200', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM200)' },
     *   { currency: 'RM', country: 'my', platform: 'razer', game: 'pubgm', price: '300', sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM300)' }
     * ]
     */
  } finally {
    await browser.close();
  }
};

// A try/catch around run() cannot catch async errors; handle the rejected promise instead
run().catch((err) => console.error(err));
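The two checks the response listener performs, taking the last URL segment as the game name and testing for a JSON content type, can be written as small pure functions that are easy to test without a browser. This is a sketch under assumptions: `lastPathSegment` and `isJsonResponse` are hypothetical names, the example.com URL is made up for illustration, and unlike the split("/").pop() used above, this version drops any query string.

```javascript
// Return the last non-empty path segment of a URL, ignoring the query string.
function lastPathSegment(url) {
  const segments = new URL(url).pathname.split("/").filter(Boolean);
  return segments.length ? segments[segments.length - 1] : "";
}

// True when the response headers declare a JSON body.
// Expects lower-cased header names, which is what response.headers() returns.
function isJsonResponse(headers) {
  const contentType = headers["content-type"] || "";
  return contentType.includes("application/json");
}

// Illustrative URL only; it is not a real endpoint of any site in the article.
console.log(lastPathSegment("https://example.com/api/catalog/pubgm?lang=en")); // pubgm
console.log(isJsonResponse({ "content-type": "application/json; charset=utf-8" })); // true
console.log(isJsonResponse({})); // false
```

Defaulting a missing content-type to an empty string matters here: responses such as redirects may carry no such header, and calling includes() on undefined would throw inside the listener.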
Simulating user clicks
Taking razer as an example, we locate the DOM elements that anchor each product on the page, simulate a click on each one, and read the price shown in the payment channel area for that product.
Example code:
// razer: simulate user clicks
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://gold.razer.com/my/en/gold/catalog/pubgm",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

// Delay helper
const waitFor = async (t) => {
  return new Promise((r) => setTimeout(r, t));
};
const gameSkuList = [];

const run = async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
    args: ["--no-sandbox"],
  });
  try {
    const page = await browser.newPage();
    // Default timeout for waits on this page
    page.setDefaultTimeout(100000);

    // Default timeout for navigation
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    await page.goto(config.url);

    await waitFor(3000);

    const webshopStepSku = await page.waitForSelector("#webshop_step_sku");
    if (!webshopStepSku) {
      throw new Error("The current IP has been blocked!");
    }

    const skuItem = await page.$$("#webshop_step_sku .sku-list__item");
    const darkFilter = await page.$(".onetrust-pc-dark-filter");

    // Hide the consent overlay so it does not swallow the clicks
    await page.evaluate((element) => {
      element && (element.style.display = "none");
    }, darkFilter);

    const params = { ...config };
    const getCards = async (dList, args) => {
      for (const d of dList) {
        const sku_name = await page.evaluate((element) => {
          const res = element.querySelector(".selection-tile__text");
          return res?.innerText || "";
        }, d);
        await d.click();
        // Wait for the payment channels to refresh after the click
        await waitFor(1500);

        const price_text = await page.evaluate(() => {
          const channels = document.querySelector(
            "#webshop_step_payment_channels"
          );
          if (!channels) return "0";

          const details = channels.querySelectorAll(
            ".selection-tile-promos__details"
          );
          // Prefer the second payment channel; fall back to the first (wallet)
          const _details = details[1] || details[0] || null;
          if (!_details) return "0";

          const _card = _details.querySelector(
            ".align-self-center.text-right"
          );
          return _card?.innerText || "0";
        });

        const jsons = {
          sku_name,
          price: price_text,
          currency: args.currency,
          platform: "razer", // the site name, consistent with the other examples
          game: args.gameName,
          country: args.country,
        };

        gameSkuList.push(jsons);
      }

      return gameSkuList;
    };
    const result = await getCards(skuItem, params);

    console.log(result);
    /**
     * [
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM5)', price: 'RM 5.00', currency: 'RM', platform: 'razer', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM10)', price: 'RM 10.00', currency: 'RM', platform: 'razer', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM20)', price: 'RM 20.00', currency: 'RM', platform: 'razer', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM30)', price: 'RM 30.00', currency: 'RM', platform: 'razer', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM40)', price: 'RM 40.00', currency: 'RM', platform: 'razer', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM50)', price: 'RM 50.00', currency: 'RM', platform: 'razer', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM100)', price: 'RM 100.00', currency: 'RM', platform: 'razer', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM200)', price: 'RM 200.00', currency: 'RM', platform: 'razer', game: 'pubgm', country: 'my' },
     *   { sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM300)', price: 'RM 300.00', currency: 'RM', platform: 'razer', game: 'pubgm', country: 'my' }
     * ]
     */
  } finally {
    await browser.close();
  }
};

// A try/catch around run() cannot catch async errors; handle the rejected promise instead
run().catch((err) => console.error(err));
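Advanced usage
In practice, different sites come with different or special situations, for example: how do you crawl multiple pages, get past CAPTCHA checks, or defeat bot detection? Let's unlock some of Puppeteer's more powerful features.
Bypassing bot detection
We can test against a bot-detection page. In the comparison, the left side is a real user and the right side is a visit through Puppeteer; the WebDriver row on the right is clearly marked in red. Tip: results may differ between browsers.
We can use the puppeteer-extra-plugin-stealth plugin, which is part of the puppeteer-extra family. With it enabled, the same check on the right no longer reports a failure.
// Bypass crawler detection
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.goto("https://bot.sannysoft.com/");
  await browser.close();
})();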
Bypassing CAPTCHA checks
Some sites verify visitors with a CAPTCHA, such as Google's reCAPTCHA shown in the figure below. The puppeteer-extra-plugin-recaptcha plugin can solve it so that the rest of the data collection can proceed.
Tip: this relies on a paid solving service.
Example code:
const puppeteer = require("puppeteer-extra");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
puppeteer.use(
  RecaptchaPlugin({
    provider: {
      id: "2captcha",
      token: "xxxxx", // your 2captcha API key (paid service)
    },
    visualFeedback: true,
  })
);
puppeteer.launch({ headless: false }).then(async (browser) => {
  const page = await browser.newPage();
  await page.goto("https://www.google.com/recaptcha/api2/demo");

  // Find and solve the reCAPTCHAs on the page
  await page.solveRecaptchas();

  await Promise.all([
    page.waitForNavigation(),
    page.click(`#recaptcha-demo-submit`),
  ]);
  await page.screenshot({ path: "response.png", fullPage: true });
  await browser.close();
});
Running tasks concurrently
In many scenarios we crawl several sites at the same time. To keep performance under control, puppeteer-cluster can manage a pool of workers that process the different sites in parallel, which reduces the overhead compared with launching a separate browser for every task.
Example code:
const { Cluster } = require("puppeteer-cluster");

(async () => {
  // Create a cluster with up to 3 concurrent workers
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 3,
    puppeteerOptions: {
      headless: false,
    },
  });

  // Define a task (in this case: screenshot of page)
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);

    const path = url.replace(/[^a-zA-Z]/g, "_") + ".png";
    await page.screenshot({ path });
    console.log(`Screenshot of ${url} saved: ${path}`);
  });

  // Add some pages to queue
  cluster.queue("https://www.baidu.com");
  cluster.queue("https://www.bing.com/?mkt=zh-CN");
  cluster.queue("https://github.com/");

  // Shutdown after everything is done
  await cluster.idle();
  await cluster.close();
})();
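One detail in the task above is worth a second look: the file name is built with /[^a-zA-Z]/g, which also replaces digits, so two URLs that differ only by a number would write to the same file when many pages are queued. A minimal sketch of a variant that keeps digits; `toFileName` is a hypothetical helper and the example.com URLs are illustrative:

```javascript
// Build a screenshot file name from a URL, keeping letters and digits
// and replacing every other character with an underscore.
function toFileName(url) {
  return url.replace(/[^a-zA-Z0-9]/g, "_") + ".png";
}

console.log(toFileName("https://example.com/page1")); // https___example_com_page1.png
console.log(toFileName("https://example.com/page2")); // https___example_com_page2.png
```

Inside the cluster task, this would replace the inline expression: const path = toFileName(url). URLs that differ only in punctuation can still collide, so for large crawls a hash of the URL may be a safer key.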
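Summary
Puppeteer helps us automate many operations, but keep its trade-offs in mind: memory-hungry tasks can drive its memory usage very high, and because it has to start a real Chrome instance, it can be a poor fit for applications that need to execute quickly.
Overall, Puppeteer is a powerful and easy-to-use browser automation tool that suits many scenarios. When deciding whether to use it, however, weigh its two drawbacks: heavy use of system resources and slow startup.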