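Table of Contents

Step 1: Go to Douban
Step 2: Initialize variables
Step 3: HTTP request and anti-crawler strategy (detailed implementation)
Step 4: HTML parsing and data extraction
Step 5: Complete code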

In Depth: A Douban Movie Poster Crawler Built on Spring Boot

木易

Original

Published: 2025-07-12 15:38:09

Tags: coding, Spring Boot, web crawler, Douban, Jsoup

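This tutorial walks through crawling and displaying Douban movie posters with the Spring Boot framework. The page we want to build needs the cover poster and basic details of the latest films. By the end you will know how to combine a crawler with a Spring Boot back end to build a complete movie information system that covers the whole flow of crawling, processing, and storing the data.

Go to Douban

On the Douban movie page we can crawl the films that are currently showing. Click "All now playing" to open the listing page. The page shows your current city and lets you switch to another city to see what is showing there.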
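Switching city on the page changes the city segment of the URL, so a crawler can target a city by building the URL directly. The sketch below is only an illustration: the class and method names are made up for this example, and "beijing" is the only city slug that appears in this article, so any other slug is an assumption you should confirm in the browser first.

```java
import java.util.Locale;

public class NowPlayingUrl {
    // Base path taken from the URL used later in this article.
    private static final String BASE = "https://movie.douban.com/cinema/nowplaying/";

    /** Builds the "now playing" URL for a city slug such as "beijing". */
    public static String forCity(String citySlug) {
        if (citySlug == null || citySlug.trim().isEmpty()) {
            throw new IllegalArgumentException("city slug must not be empty");
        }
        String slug = citySlug.trim().toLowerCase(Locale.ROOT);
        // Reject anything that could change the structure of the URL.
        if (!slug.matches("[a-z0-9_-]+")) {
            throw new IllegalArgumentException("unexpected characters in city slug: " + citySlug);
        }
        return BASE + slug + "/";
    }

    public static void main(String[] args) {
        System.out.println(forCity("beijing"));
    }
}
```

Validating the slug keeps a typo or stray slash from silently producing a URL for a different page.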

Initialize variables

@Component
public class TaskUtil {

    @Resource
    MovieService movieService; // injected database service layer

    // URL of the page to crawl: /nowplaying/<city name>/
    private static final String DOUBAN_MOVIE_URL =
            "https://movie.douban.com/cinema/nowplaying/beijing/";

    // Browser identification string sent with each request
    private static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0";

    // Value of the HTTP Accept header: the content types the client expects, with their priorities
    private static final String ACCEPT_HEADER =
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";

    // Cookie copied from a browser session
    private static final String COOKIE = "ll=\"108288\"; bid=bUEPKsO4-L4; ...";

    // Local directory where images are stored
    private static final String filePath = "haibao/";

    // Base URL under which images are served
    private static final String fileHost = "http://localhost:8080/api/";
}

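To obtain the Cookie, User-Agent, and Accept header values:

Open a browser and go to the target page (for example, the Douban movie page).
Right-click an empty area and choose "Inspect" to open the developer tools.
Switch to the Network tab and set the filter to All.
Refresh the page and pick any HTTP request.
Click the request; the Headers panel shows the following:
User-Agent: the browser identification string.
Accept: the content types the client accepts.
Cookie: the data used to maintain session state.

These values let the crawler imitate a browser request, which helps it get past basic anti-crawler checks.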
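To see how the three copied values end up on a request, here is a minimal sketch that only builds the request and prints its headers, without sending anything. It deliberately uses the JDK's built-in java.net.http API (Java 11 or later) so that it runs without extra dependencies; this article's crawler uses Apache HttpClient instead. The class name and the placeholder header values are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class BrowserLikeRequest {

    /** Builds a GET request that carries the headers copied from the browser. */
    public static HttpRequest build(String url, String userAgent, String accept, String cookie) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .header("Accept", accept)
                .header("Cookie", cookie)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values: paste the ones shown in your own developer tools.
        HttpRequest request = build(
                "https://movie.douban.com/cinema/nowplaying/beijing/",
                "Mozilla/5.0 (placeholder)",
                "text/html",
                "bid=placeholder");
        // Print the headers that would be sent; no network call is made here.
        request.headers().map().forEach((name, values) ->
                System.out.println(name + ": " + values));
    }
}
```

Printing the headers before sending is a cheap way to check that a copied Cookie was not truncated or wrongly escaped.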

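filePath and fileHost:

filePath: the directory on the local file system where images are stored. Its value is "haibao/", so every downloaded image is saved under that directory.
fileHost: the base URL under which the images can be reached over the network. Its value is "http://localhost:8080/api/", and it is used to build the online link for each image.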
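The relationship between the two values can be sketched as follows. This is an illustration under assumptions, not the article's saveImage method: the helper names are made up, the example image URL is hypothetical, and the returned link is assumed to be fileHost + filePath + file name.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PosterPaths {

    /** Takes the last path segment of the image URL as the file name. */
    public static String fileNameOf(String posterUrl) {
        String path = URI.create(posterUrl).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Where the file is written on disk, relative to the working directory. */
    public static Path localPath(String filePath, String fileName) {
        return Paths.get(System.getProperty("user.dir"), filePath, fileName);
    }

    /** The link stored in the database and handed to the front end (assumed layout). */
    public static String publicUrl(String fileHost, String filePath, String fileName) {
        return fileHost + filePath + fileName;
    }

    public static void main(String[] args) {
        // Hypothetical image URL, used only to show the mapping.
        String name = fileNameOf("https://img.example.com/view/photo/p123.jpg");
        System.out.println(localPath("haibao/", name));
        System.out.println(publicUrl("http://localhost:8080/api/", "haibao/", name));
    }
}
```

For the link to work, the Spring Boot application also has to serve the local directory under that URL prefix; how that is configured is not shown in this part of the article.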

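Maven dependencies (pom.xml)

Add the following dependencies to the project's pom.xml to support HTTP requests and HTML parsing. No versions are listed here; add a <version> element to each dependency unless the parent POM or a BOM already manages it.

<dependency>
    <groupId>org.apache.httpcomponents.client5</groupId>
    <artifactId>httpclient5</artifactId>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
</dependency>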

HTTP request and anti-crawler strategy (detailed implementation)

try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
    HttpGet httpGet = new HttpGet(DOUBAN_MOVIE_URL);

    // Set request headers so the request looks like it comes from a browser
    httpGet.setHeader("Accept", ACCEPT_HEADER);
    httpGet.setHeader("User-Agent", USER_AGENT);
    httpGet.setHeader("Cookie", COOKIE);

    try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
        // Read the HTML content
        String htmlContent = EntityUtils.toString(response.getEntity());
        // Hand the HTML over to the parsing step
        parseAndSaveMovies(htmlContent);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

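The User-Agent header imitates the Firefox browser.
The Cookie header carries the user's session information (for example the bid identifier).
try-with-resources makes sure the HTTP connection is closed automatically.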
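The closing behaviour of nested try-with-resources blocks is easy to check without any network access. In the sketch below, the Tracked class is a made-up stand-in for the HTTP client and the response; it only records the order of events, including what happens when reading the body fails.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderDemo {

    /** Stand-in for a client or response; records when it is closed. */
    static class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Mirrors the nested try blocks of the crawler and returns the event order. */
    public static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked client = new Tracked("client", log)) {
            try (Tracked response = new Tracked("response", log)) {
                log.add("read body");
                throw new IllegalStateException("simulated failure");
            } catch (IllegalStateException e) {
                // The response has already been closed when this handler runs.
                log.add("caught " + e.getMessage());
            }
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The response is closed before the catch block runs and the client is closed last, so neither leaks even when parsing throws.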

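HTML parsing and data extraction

Right-click an empty area of the page and choose "Inspect" to open the developer tools. Find the page element that holds the content we want to crawl and note its identifiers, such as its class or id. We can then select that part of the page's HTML by one of those identifiers, after using Jsoup to parse the HTML into a Document Object Model (DOM) for data extraction and further processing.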

private void parseAndSaveMovies(String htmlContent) throws Exception {
    Document doc = Jsoup.parse(htmlContent);
    Elements liList = doc.select("li.list-item"); // locate the movie list items

    for (Element li : liList) {
        // Extract the key data
        String region = li.attr("data-region");          // region
        String releaseDate = li.attr("data-release");    // release date
        String name = li.select("img").attr("alt");      // movie title
        String posterUrl = li.select("img").attr("src"); // poster URL
        String ticketUrl = li.select("a").attr("href");  // ticket purchase link

        // Download and store the poster
        String savedPath = saveImage(posterUrl, fileHost, filePath);

        // Build the Movie object
        Movie movie = new Movie();
        movie.setDiqu(region);
        movie.setShangyingshijian(releaseDate);
        movie.setMingcheng(name);
        movie.setHaibao(savedPath); // store the network-accessible path
        movie.setGoupiaodizhi(ticketUrl);

        movieService.insertMovie(movie); // persist to the database
    }
}

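CSS selector: li.list-item matches exactly the list elements we need.
Attribute extraction: structured data is read from the data-* attributes.
Relative path to absolute URL: saveImage() handles the image download and the path mapping.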
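One detail worth guarding against: saveImage() passes the src value straight to new URL(...), which only works when the value is already an absolute URL. Whether Douban's src attributes are always absolute is not stated here, so the following sketch resolves whatever the attribute contains against the page URL using only the JDK. The class name and the example image hosts are hypothetical.

```java
import java.net.URI;

public class PosterUrlResolver {

    /**
     * Resolves an img src value against the URL of the page it came from.
     * Absolute URLs are returned unchanged.
     */
    public static String toAbsolute(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "https://movie.douban.com/cinema/nowplaying/beijing/";
        // Hypothetical src values covering the common shapes.
        System.out.println(toAbsolute(page, "https://img.example.com/p1.jpg"));
        System.out.println(toAbsolute(page, "//img.example.com/p2.jpg"));
        System.out.println(toAbsolute(page, "/static/p3.jpg"));
    }
}
```

With Jsoup itself, the same result is available through absUrl("src"), provided the document was parsed with a base URI, as in Jsoup.parse(html, baseUri).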

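The main uses of Jsoup are:

Use	Description
HTML parsing	Turns a raw HTML string into a Document Object Model (DOM) structure that can be manipulated
Data extraction (crawling)	Uses CSS-like selectors to quickly extract specific tags, attributes, or text
HTML sanitizing	Filters unsafe content out of HTML to prevent XSS attacks (suitable for rich-text editor input)
HTML manipulation	Modifies, adds, or removes HTML nodes to produce new HTML
Network requests	Provides simple HTTP requests (GET/POST) for fetching web pages the way a browser would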

Complete code

package com.example.task;

import com.example.entity.Movie;
import com.example.service.MovieService;
import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.CloseableHttpResponse;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import javax.annotation.Resource;
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.TimeUnit;

@Component
public class TaskUtil {

    @Resource
    MovieService movieService;

    // URL of the page to crawl: /nowplaying/<city name>/
    private static final String DOUBAN_MOVIE_URL =
            "https://movie.douban.com/cinema/nowplaying/beijing/";

    // Browser identification string sent with each request
    private static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0";

    // Value of the HTTP Accept header: the content types the client expects, with their priorities
    private static final String ACCEPT_HEADER =
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";

    // Cookie copied from a browser session
    private static final String COOKIE = "ll=\"108288\"; bid=bUEPKsO4-L4; ...";

    // Local directory where images are stored
    private static final String filePath = "haibao/";

    // Base URL under which images are served
    private static final String fileHost = "http://localhost:8080/api/";

    // Runs every six hours
    @Scheduled(cron = "0 0 0/6 * * ?")
    public void getDianYingHaiBao() throws IOException, ParseException {
        System.out.println("Scheduled task started");

        // Build the path from its parts so it works on any operating system
        Path targetPath = Paths.get(System.getProperty("user.dir"), filePath);

        // Clear out the old posters; skip this on the first run, when the directory does not exist yet
        if (Files.exists(targetPath)) {
            // Make sure the directory is writable before deleting it
            if (!Files.isWritable(targetPath)) {
                targetPath.toFile().setWritable(true);
            }
            deleteDirectoryWithRetry(targetPath);
        }
        movieService.deleteAllMovie();

        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpGet httpGet = new HttpGet(DOUBAN_MOVIE_URL);

            // Set request headers so the request looks like it comes from a browser
            httpGet.setHeader("Accept", ACCEPT_HEADER);
            httpGet.setHeader("User-Agent", USER_AGENT);
            httpGet.setHeader("Cookie", COOKIE);

            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // Read the HTML content
                String htmlContent = EntityUtils.toString(response.getEntity());
                // Hand the HTML over to the parsing step
                parseAndSaveMovies(htmlContent);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }

            System.out.println("Scheduled task finished");
        }
    }

    /**
     * Recursively deletes a directory and everything in it.
     *
     * @param path directory to delete
     * @throws IOException if a file or directory cannot be deleted
     */
    public void deleteDirectoryWithRetry(Path path) throws IOException {
        // Recursive delete with retries
        Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                deleteWithRetry(file); // delete the file
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                deleteWithRetry(dir); // delete the directory once it is empty
                return FileVisitResult.CONTINUE;
            }

            private void deleteWithRetry(Path path) throws IOException {
                int retryCount = 0;
                final int MAX_RETRY = 3;
                while (retryCount < MAX_RETRY) {
                    try {
                        // Handle Windows file attributes
                        if (Files.exists(path)) {
                            // Clear the read-only attribute
                            path.toFile().setWritable(true);
                            Files.delete(path);
                        }
                        return;
                    } catch (AccessDeniedException ex) {
                        // The file may be locked: wait a little longer before each retry
                        retryCount++;
                        try {
                            TimeUnit.MILLISECONDS.sleep(300 * retryCount);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            throw new IOException("Delete operation was interrupted", e);
                        }
                    } catch (IOException ex) {
                        if (retryCount == MAX_RETRY - 1) {
                            throw new IOException("Unable to delete: " + path + " (retry limit exceeded)", ex);
                        }
                        retryCount++;
                    }
                }
                // Every attempt was denied: report it instead of returning silently
                throw new IOException("Unable to delete: " + path + " (retry limit exceeded)");
            }
        });
    }

    private void parseAndSaveMovies(String htmlContent) throws Exception {
        Document doc = Jsoup.parse(htmlContent);
        Elements liList = doc.select("li.list-item"); // locate the movie list items

        for (Element li : liList) {
            // Extract the key data
            String region = li.attr("data-region");          // region
            String releaseDate = li.attr("data-release");    // release date
            String name = li.select("img").attr("alt");      // movie title
            String posterUrl = li.select("img").attr("src"); // poster URL
            String ticketUrl = li.select("a").attr("href");  // ticket purchase link

            // Download and store the poster
            String savedPath = saveImage(posterUrl, fileHost, filePath);

            // Build the Movie object
            Movie movie = new Movie();
            movie.setDiqu(region);
            movie.setShangyingshijian(releaseDate);
            movie.setMingcheng(name);
            movie.setHaibao(savedPath); // store the network-accessible path
            movie.setGoupiaodizhi(ticketUrl);

            movieService.insertMovie(movie); // persist to the database
        }
    }

    /**
     * Saves an image to local storage.
     *
     * @param urlPath  URL of the image
     * @param fileHost base URL under which saved images are served
     * @param filePath local directory where images are stored
     * @return the full URL at which the saved image can be accessed
     */
    public String saveImage(String urlPath, String fileHost, String filePath) throws Exception {
        // URL of the image to download
        URL url = new URL(urlPath);
        // Open the connection
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Set the key request headers
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36");
        conn.setRequestProperty("Accept", "image/webp,image/apng,image/*,*/*;q=0.9");
        conn.setRequestProperty("Accept-Language", "zh-CN,zh;q=0.9");
        // Use the GET method
        conn.setRequestMethod("GET");
        // Connection timeout: 10 seconds
        conn.setConnectTimeout(10 * 1000);
        // Read the image data through an input stream
        InputStream is = conn.getInputStream();
        // Read the image as raw bytes, which works for any image format
        byte[] data = readInputStream(is);
        // Directory in which the image is saved: "haibao", relative to the working directory
        File dir = new File("haibao");