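Pixiv crawler, downloader, and proxy: one-click deployment, no server required!

Project Overview

I took the Pixiv crawler I wrote in Python last year and rewrote it with AI assistance on a serverless architecture. It supports scheduled tasks, image downloads, and proxied access.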

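Core Features

Serverless architecture: built on Vercel + Cloudflare Workers, with no servers to maintain
Smart crawling: filters by popularity score to surface high-quality works automatically
Image proxy: resolves cross-origin issues and provides fast image access
Data storage: integrates a Supabase database and supports complex queries
Batch download: downloads images in bulk to Cloudflare R2
Scheduled tasks: crawls the rankings automatically, with no manual intervention

System Architecture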

┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│   Vercel API     │    │ Cloudflare Cron  │    │   Supabase DB    │
│  (main service)  │◄──►│ (scheduled jobs) │◄──►│  (data storage)  │
└──────────────────┘    └──────────────────┘    └──────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│   Image proxy    │    │ Crawl scheduler  │    │  Analytics API   │
│ (CORS workaround)│    │ (task dispatch)  │    │ (stats queries)  │
└──────────────────┘    └──────────────────┘    └──────────────────┘

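Core Feature Implementation

1. Smart Crawler Engine

The crawler does more than fetch basic metadata: it also computes a popularity score for each work as a rough measure of quality: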

/**
 * Core Pixiv crawler class.
 * Supports smart recommendations and popularity scoring.
 */
export class PixivCrawler {
  private headers: any;
  private logManager: any;
  private taskId: string;

  // pid is accepted for compatibility with existing callers but is not stored;
  // the illustration ID is passed to getIllustDetail() instead.
  constructor(pid: string, headers: any, logManager: any, taskId: string) {
    this.headers = headers;
    this.logManager = logManager;
    this.taskId = taskId;
  }

  /**
   * Fetch an illustration's details and compute its popularity.
   * @param pid illustration ID
   * @returns illustration info and popularity score
   */
  async getIllustDetail(pid: string) {
    try {
      const url = `https://www.pixiv.net/ajax/illust/${pid}`;
      const response = await fetch(url, { headers: this.headers });
      const data = await response.json();

      if (data.error) {
        throw new Error(`API error: ${data.message}`);
      }

      const illust = data.body;

      // Compute the popularity score
      const popularity = this.calculatePopularity(
        illust.likeCount,
        illust.bookmarkCount,
        illust.viewCount
      );

      return {
        pid: illust.id,
        title: illust.title,
        tags: illust.tags.tags.map((tag: any) => tag.tag),
        likeCount: illust.likeCount,
        bookmarkCount: illust.bookmarkCount,
        viewCount: illust.viewCount,
        popularity,
        createDate: illust.createDate
      };
    } catch (error) {
      // In strict TypeScript a caught value is `unknown`, so narrow it first.
      const message = error instanceof Error ? error.message : String(error);
      this.logManager.addLog(`Failed to fetch illustration ${pid}: ${message}`, 'error', this.taskId);
      throw error;
    }
  }

  /**
   * Popularity algorithm.
   * Combines likes, bookmarks, and views.
   */
  private calculatePopularity(likes: number, bookmarks: number, views: number): number {
    if (views === 0) return 0;

    const likeRate = likes / views;
    const bookmarkRate = bookmarks / views;

    // Weighted score: bookmarks count more than likes, scaled by log10 of views
    return (likeRate * 0.3 + bookmarkRate * 0.7) * Math.log10(views + 1);
  }
}
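
To make the scoring concrete, here is a minimal standalone sketch of the same formula applied to made-up numbers. The function name popularityScore and the sample counts are illustrative assumptions; the 0.22 cutoff is borrowed from the popularityThreshold value that appears in the API usage example further down.

```typescript
// Standalone copy of the weighting used by calculatePopularity above.
// popularityScore and the sample numbers are illustrative, not project code.
function popularityScore(likes: number, bookmarks: number, views: number): number {
  if (views === 0) return 0; // avoid division by zero for works with no views

  const likeRate = likes / views;
  const bookmarkRate = bookmarks / views;

  // Bookmarks weigh more than likes; log10 rewards larger audiences.
  return (likeRate * 0.3 + bookmarkRate * 0.7) * Math.log10(views + 1);
}

// Hypothetical work: 10,000 views, 300 likes, 500 bookmarks.
// (0.03 * 0.3 + 0.05 * 0.7) * log10(10001) is roughly 0.044 * 4.0 = 0.176
const score = popularityScore(300, 500, 10_000);
const passesThreshold = score >= 0.22; // assumed cutoff
console.log(score.toFixed(3), passesThreshold);
```

With these numbers, a 3% like rate and a 5% bookmark rate at 10,000 views give a score of about 0.176, which falls below a 0.22 cutoff.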

2. Image Proxy Service

The proxy resolves cross-origin issues and provides fast image access:

/**
 * Pixiv image proxy service.
 * Supports multiple sizes with automatic fallback.
 */
export class PixivProxy {
  private headers: Record<string, string>;

  constructor() {
    this.headers = {
      'Referer': 'https://www.pixiv.net/',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    };
  }

  /**
   * Fetch a Pixiv image through the proxy.
   * @param pid illustration ID
   * @param size preferred size
   * @returns image data stream
   */
  async proxyImage(pid: string, size: string = 'regular') {
    // Size order: thumb_mini -> small -> regular -> original.
    // Starting at the requested size, later entries are tried as fallbacks.
    const sizeOptions = ['thumb_mini', 'small', 'regular', 'original'];
    const startIndex = sizeOptions.indexOf(size);

    if (startIndex === -1) {
      throw new Error(`Unsupported image size: ${size}`);
    }

    // Try each size in order
    for (let i = startIndex; i < sizeOptions.length; i++) {
      try {
        const currentSize = sizeOptions[i];
        const imageUrl = await this.getImageUrl(pid, currentSize);

        if (imageUrl) {
          const imageResponse = await fetch(imageUrl, {
            headers: this.headers
          });

          if (imageResponse.ok) {
            return {
              data: imageResponse.body,
              contentType: imageResponse.headers.get('content-type'),
              size: currentSize
            };
          }
        }
      } catch {
        console.log(`Size ${sizeOptions[i]} failed, trying the next size`);
        continue;
      }
    }

    throw new Error(`Could not fetch image ${pid} in any size`);
  }

  /**
   * Get the image URL for a given size.
   */
  private async getImageUrl(pid: string, size: string): Promise<string | null> {
    try {
      const response = await fetch(`https://www.pixiv.net/ajax/illust/${pid}`, {
        headers: this.headers
      });

      const data = await response.json();
      const urls = data.body?.urls;

      return urls?.[size] || null;
    } catch {
      return null;
    }
  }
}
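
Note that proxyImage calls getImageUrl, and therefore the Pixiv ajax endpoint, once for every size it attempts. One possible refinement is to fetch the urls object once and choose the fallback locally. The sketch below shows only that selection step; pickImageUrl and the sample URLs are hypothetical, and the size keys simply mirror the list used above.

```typescript
// Hypothetical helper: choose the first available URL, starting at the
// requested size and falling back to later entries in the priority list.
const SIZE_ORDER = ['thumb_mini', 'small', 'regular', 'original'] as const;
type ImageSize = (typeof SIZE_ORDER)[number];

function pickImageUrl(
  urls: Partial<Record<ImageSize, string | null>>,
  requested: ImageSize = 'regular'
): { size: ImageSize; url: string } | null {
  const start = SIZE_ORDER.indexOf(requested);
  for (const size of SIZE_ORDER.slice(start)) {
    const url = urls[size];
    if (url) return { size, url };
  }
  return null; // nothing usable at or after the requested size
}

// Made-up data: the 'regular' URL is missing, so 'original' is chosen.
const picked = pickImageUrl(
  { small: 'https://example.com/s.jpg', original: 'https://example.com/o.jpg' },
  'regular'
);
console.log(picked);
```

Because the function is pure, the fallback order can be unit-tested without touching the network.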

3. API Design

A single HTTP endpoint exposes every operation, selected by the action query parameter:

import { Readable } from 'node:stream';
import type { VercelRequest, VercelResponse } from '@vercel/node';
// PixivProxy, PixivCrawler, SupabaseService, getPixivHeaders, logManager and
// getWebInterface are defined elsewhere in the project.

/**
 * Main API handler.
 * Supports several actions: crawl, download, proxy, stats.
 */
export default async function handler(req: VercelRequest, res: VercelResponse) {
  // Set CORS headers
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.setHeader('Access-Control-Allow-Methods', 'GET, POST, OPTIONS');
  res.setHeader('Access-Control-Allow-Headers', 'Content-Type');

  if (req.method === 'OPTIONS') {
    res.status(200).end();
    return;
  }

  const { action, pid, size } = req.query;

  try {
    // Each case gets its own block so its const declarations stay scoped to it.
    switch (action) {
      case 'proxy-image': {
        // Proxy an image
        if (!pid) {
          res.status(400).json({ error: 'Missing pid parameter' });
          return;
        }

        const proxy = new PixivProxy();
        const imageResult = await proxy.proxyImage(pid as string, size as string);
        if (!imageResult.data) {
          throw new Error('Upstream response has no body');
        }

        res.setHeader('Content-Type', imageResult.contentType ?? 'application/octet-stream');
        res.setHeader('Cache-Control', 'public, max-age=86400'); // cache for 1 day

        // fetch() returns a web ReadableStream, which has no pipe() method;
        // convert it to a Node.js stream before piping to the response.
        Readable.fromWeb(imageResult.data as any).pipe(res);
        return;
      }

      case 'get-pic': {
        // Get illustration info
        if (!pid) {
          res.status(400).json({ error: 'Missing pid parameter' });
          return;
        }

        const crawler = new PixivCrawler(pid as string, getPixivHeaders(), logManager, 'api_request');
        const illustInfo = await crawler.getIllustDetail(pid as string);

        res.status(200).json({
          success: true,
          data: illustInfo
        });
        break;
      }

      case 'stats': {
        // Get statistics
        const supabase = new SupabaseService();
        const stats = await supabase.getStats();

        res.status(200).json({
          success: true,
          stats
        });
        break;
      }

      default: {
        // Return the web interface
        const htmlContent = getWebInterface();
        res.setHeader('Content-Type', 'text/html');
        res.status(200).send(htmlContent);
      }
    }
  } catch (error) {
    res.status(500).json({
      error: 'Internal server error',
      message: error instanceof Error ? error.message : String(error)
    });
  }
}
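
The handler casts pid and size to string, but in Vercel's Node.js runtime a query value is typed as string | string[]: a repeated parameter such as ?pid=1&pid=2 arrives as an array. A small normalizer avoids relying on the cast. This is only a sketch; firstQueryValue and parseSize are hypothetical names, not part of the project.

```typescript
type QueryValue = string | string[] | undefined;

// Hypothetical helper: collapse a query value to a single trimmed string.
function firstQueryValue(value: QueryValue): string | undefined {
  const first = Array.isArray(value) ? value[0] : value;
  const trimmed = first?.trim();
  return trimmed ? trimmed : undefined;
}

const ALLOWED_SIZES = ['thumb_mini', 'small', 'regular', 'original'];

// Hypothetical helper: default to 'regular' when size is absent, and
// reject unknown values before any upstream request is made.
function parseSize(value: QueryValue): string {
  const size = firstQueryValue(value) ?? 'regular';
  if (!ALLOWED_SIZES.includes(size)) {
    throw new Error(`Unsupported image size: ${size}`);
  }
  return size;
}

console.log(firstQueryValue(['123456', '654321'])); // the first value wins
console.log(parseSize(undefined));
```

Inside the handler this would replace the casts, for example const pid = firstQueryValue(req.query.pid), so that a missing or empty value can be answered with a 400 response.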

4. Scheduled Task Dispatch

Automation is handled by a Cloudflare Workers Cron Trigger:

/**
 * Cloudflare Cron Worker.
 * Runs crawl tasks on a schedule.
 * shouldRunDailyRanking, shouldRunWeeklyRanking, shouldCleanLogs and
 * cleanExpiredLogs are helpers defined elsewhere in the project.
 */
export default {
  async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {
    console.log('Scheduled task started:', new Date().toISOString());

    try {
      // Crawl the daily ranking
      if (shouldRunDailyRanking(event.cron)) {
        await triggerRankingCrawl('daily', env);
      }

      // Crawl the weekly ranking
      if (shouldRunWeeklyRanking(event.cron)) {
        await triggerRankingCrawl('weekly', env);
      }

      // Clean up expired logs
      if (shouldCleanLogs(event.cron)) {
        await cleanExpiredLogs(env);
      }

    } catch (error) {
      console.error('Scheduled task failed:', error);
    }
  }
};

/**
 * Trigger a ranking crawl on the main service.
 */
async function triggerRankingCrawl(type: 'daily' | 'weekly' | 'monthly', env: Env) {
  const endpoint = `${env.MAIN_SERVICE_URL}/api/?action=${type}`;

  try {
    const response = await fetch(endpoint, {
      method: 'GET',
      headers: {
        'Authorization': `Bearer ${env.API_TOKEN}`
      }
    });

    if (response.ok) {
      console.log(`${type} ranking crawl triggered`);
    } else {
      throw new Error(`HTTP ${response.status}: ${response.statusText}`);
    }
  } catch (error) {
    // Errors are logged rather than rethrown so one failed trigger
    // does not stop the remaining scheduled steps.
    console.error(`Failed to trigger ${type} ranking crawl:`, error);
  }
}
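
The shouldRunDailyRanking, shouldRunWeeklyRanking and shouldCleanLogs helpers are not shown in this post. In Cloudflare Workers, event.cron holds the cron expression of the trigger that fired, so one plausible implementation is a plain lookup against the expressions configured for the Worker. The schedule strings below are invented for illustration and would have to match the triggers in your own wrangler configuration.

```typescript
type CronTask = 'daily' | 'weekly' | 'clean-logs';

// Assumed schedules (UTC); these must match the Worker's configured cron triggers.
const CRON_TASKS: Record<string, CronTask[]> = {
  '0 2 * * *': ['daily'],        // every day at 02:00
  '0 3 * * MON': ['weekly'],     // every Monday at 03:00
  '0 4 1 * *': ['clean-logs'],   // first day of each month at 04:00
};

// Return the tasks registered for the cron expression that fired.
function tasksForCron(cron: string): CronTask[] {
  return CRON_TASKS[cron] ?? [];
}

const shouldRunDailyRanking = (cron: string) => tasksForCron(cron).includes('daily');
const shouldRunWeeklyRanking = (cron: string) => tasksForCron(cron).includes('weekly');
const shouldCleanLogs = (cron: string) => tasksForCron(cron).includes('clean-logs');

console.log(shouldRunDailyRanking('0 2 * * *'), shouldRunWeeklyRanking('0 2 * * *'));
```

Keeping the mapping in one table means a new schedule only needs a new entry, and an unrecognized expression simply runs nothing.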

Quick Deployment in 3 Minutes

Step 1: Clone the project

git clone https://github.com/your-username/serverless_pixiv_crawler.git
cd serverless_pixiv_crawler
npm install

Step 2: Configure environment variables

Copy .env.example to .env:

# Supabase database settings
SUPABASE_URL=your_supabase_url_here
SUPABASE_ANON_KEY=your_supabase_anon_key_here

# Pixiv settings
PIXIV_COOKIE=your_pixiv_cookie_here

# Cloudflare R2 settings (optional)
CLOUDFLARE_ACCOUNT_ID=your_account_id
CLOUDFLARE_ACCESS_KEY_ID=your_access_key
CLOUDFLARE_SECRET_ACCESS_KEY=your_secret_key
CLOUDFLARE_BUCKET_NAME=your_bucket_name
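
Unset or placeholder values here are an easy way to end up with a broken deployment, so it can help to validate the configuration once at startup. The sketch below checks the variable names from the listing above; the missingEnvVars helper is hypothetical, and treating only the Supabase and Pixiv values as mandatory is an assumption based on the "(optional)" comment on the R2 block.

```typescript
// Variables assumed to be mandatory (the R2 settings are marked optional).
const REQUIRED_VARS = ['SUPABASE_URL', 'SUPABASE_ANON_KEY', 'PIXIV_COOKIE'];

// Hypothetical helper: list required variables that are unset, empty,
// or still hold a "your_..._here" placeholder copied from .env.example.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter((name) => {
    const value = env[name]?.trim();
    return !value || /^your_.*_here$/.test(value);
  });
}

// Example with a made-up environment; in a real handler pass process.env.
const missing = missingEnvVars({
  SUPABASE_URL: 'https://example.supabase.co',
  SUPABASE_ANON_KEY: 'your_supabase_anon_key_here',
});
console.log(missing);
```

Here the anon key is still a placeholder and the cookie is absent, so both names are reported.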

Step 3: Deploy to Vercel

# Install the Vercel CLI
npm i -g vercel

# Log in and deploy
vercel login
vercel --prod

Step 4: Deploy the scheduled task (optional)

cd cron_worker
npm install

# Set up Cloudflare Workers
npx wrangler login
npx wrangler deploy

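Feature Demo

1. Web Admin Interface

Once deployed, open your Vercel domain to see the admin interface:

Live statistics: shows crawl counts, success rate, and other key metrics
Task management: start, stop, and monitor crawl tasks
Log viewer: follow system logs in real time
Data search: quickly find and filter crawled content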

2. Using the API

// Get illustration info
fetch('https://your-domain.vercel.app/api/?action=get-pic&pid=123456')
  .then(res => res.json())
  .then(data => console.log(data));

// Load an image through the proxy
const imageUrl = 'https://your-domain.vercel.app/api/?action=proxy-image&pid=123456&size=regular';
document.getElementById('image').src = imageUrl;

// Start a crawl task
fetch('https://your-domain.vercel.app/api/', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    pid: '123456',
    targetNum: 1000,
    popularityThreshold: 0.22
  })
});
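
Hand-written query strings are easy to get wrong once parameters need encoding. As a small convenience, the GET URLs can be built with the standard URL API instead. The buildApiUrl helper below is hypothetical; the /api/ path and the action, pid and size parameters come from the examples above.

```typescript
// Hypothetical helper: build an API URL with properly encoded query parameters.
function buildApiUrl(
  baseUrl: string,
  action: string,
  params: Record<string, string | number> = {}
): string {
  const url = new URL('/api/', baseUrl);
  url.searchParams.set('action', action);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

const infoUrl = buildApiUrl('https://your-domain.vercel.app', 'get-pic', { pid: '123456' });
console.log(infoUrl);
// https://your-domain.vercel.app/api/?action=get-pic&pid=123456
```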

3. Data Analysis

The system collects data as it crawls, and the results can be queried for statistics such as:

-- Most popular tags
SELECT tag, COUNT(*) as count
FROM illustrations
CROSS JOIN LATERAL unnest(tags) as tag
GROUP BY tag
ORDER BY count DESC
LIMIT 20;

-- Popularity distribution
SELECT
  CASE
    WHEN popularity >= 0.8 THEN 'Very high'
    WHEN popularity >= 0.5 THEN 'High'
    WHEN popularity >= 0.2 THEN 'Medium'
    ELSE 'Low'
  END as level,
  COUNT(*) as count
FROM illustrations
GROUP BY level;
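
The same bucketing can also be done in application code, for example to label rows returned by the API. This sketch mirrors the thresholds of the CASE expression above; the function names and label strings are my own choices for illustration.

```typescript
type PopularityLevel = 'very high' | 'high' | 'medium' | 'low';

// Mirrors the SQL CASE thresholds: 0.8, 0.5, 0.2.
function popularityLevel(popularity: number): PopularityLevel {
  if (popularity >= 0.8) return 'very high';
  if (popularity >= 0.5) return 'high';
  if (popularity >= 0.2) return 'medium';
  return 'low';
}

// Count how many scores fall into each level (the GROUP BY equivalent).
function countByLevel(scores: number[]): Record<PopularityLevel, number> {
  const counts: Record<PopularityLevel, number> = { 'very high': 0, high: 0, medium: 0, low: 0 };
  for (const score of scores) counts[popularityLevel(score)] += 1;
  return counts;
}

// Made-up scores for demonstration.
console.log(countByLevel([0.05, 0.176, 0.22, 0.6, 0.9]));
```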

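Advanced Features

1. Smart Recommendation Algorithm

The system has a built-in recommendation algorithm that offers:

Content discovery: finds similar high-quality works based on existing data
Popularity prediction: predicts how a work's popularity will trend
Style analysis: identifies and classifies different art styles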

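2. Anti-Ban Measures

Header rotation: mimics real browser behavior
Smart delays: adjusts the interval between requests dynamically
Error retry: handles network errors gracefully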
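
The post does not show how the delay and retry logic is implemented. As one common approach, not necessarily the one this project uses, the sketch below retries a failing async operation with exponential backoff plus random jitter. All names (backoffDelay, withRetry) and the default timings are assumptions.

```typescript
// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry an async operation, waiting longer after each failure.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        // Random jitter (0 to 100% of the delay) spreads retries apart.
        const delay = backoffDelay(attempt, baseMs);
        await sleep(delay + Math.random() * delay);
      }
    }
  }
  throw lastError;
}

console.log([0, 1, 2, 3, 4, 5].map((n) => backoffDelay(n)));
// [ 500, 1000, 2000, 4000, 8000, 8000 ]
```

Keep in mind that fetch only rejects on network failures, so a wrapped operation would need to throw on a non-2xx status itself for HTTP errors to be retried.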

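3. Data Quality

Automatic deduplication: avoids storing duplicate records
Content validation: ensures records are complete
Quality scoring: computes a quality score for each work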
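
The post does not say where deduplication and validation happen (a unique constraint in the database would work just as well). With that caveat, here is a minimal in-memory sketch. IllustRecord is a simplified, assumed record shape, and both helper names are hypothetical.

```typescript
interface IllustRecord {
  pid: string;
  title: string;
  popularity: number;
}

// Keep one record per pid; when duplicates occur, the later record wins.
function dedupeByPid(records: IllustRecord[]): IllustRecord[] {
  const byPid = new Map<string, IllustRecord>();
  for (const record of records) {
    byPid.set(record.pid, record);
  }
  return Array.from(byPid.values());
}

// Basic completeness check before a record is stored.
function isValidRecord(record: IllustRecord): boolean {
  return (
    /^\d+$/.test(record.pid) &&
    record.title.trim() !== '' &&
    Number.isFinite(record.popularity)
  );
}

// Made-up sample: pid 1 appears twice, and one record has an empty title.
const sample: IllustRecord[] = [
  { pid: '1', title: 'First', popularity: 0.1 },
  { pid: '2', title: '', popularity: 0.3 },
  { pid: '1', title: 'First (updated)', popularity: 0.25 },
];
const cleaned = dedupeByPid(sample).filter(isValidRecord);
console.log(cleaned);
```

After both steps only the updated record for pid 1 remains: the duplicate is collapsed and the record with the empty title is dropped.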

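Cost Analysis

The biggest advantage of this project is that it can run entirely within free tiers:

Service	Free tier	Enough for
Vercel	100 GB bandwidth/month	Small to medium projects
Supabase	500 MB database	About 500,000 records
Cloudflare Workers	100,000 requests/day	Most use cases
Cloudflare R2	10 GB storage	Tens of thousands of images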

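Troubleshooting

Common Issues

Deployment fails
  Check the environment variable configuration
  Confirm that the Supabase connection works
Crawling fails
  Verify that the Pixiv cookie is still valid
  Check network connectivity
Images do not display
  Confirm that the proxy service is running
  Check the CORS configuration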

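Performance Tips

Enable caching: set a sensible CDN caching policy
Monitor metrics: check system performance metrics regularly
Clean up regularly: remove expired data and logs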

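Summary

This serverless Pixiv crawler shows what modern web development makes possible:

✅ No operations overhead: runs entirely on cloud services, with no servers to manage
✅ Highly scalable: scales automatically with traffic
✅ Full-featured: crawling, storage, analysis, and display in one project
✅ Simple deployment: up and running in about 3 minutes

Whether you want to learn serverless architecture or need a practical data collection tool, this project is a good starting point.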