Spring AI PDF Document Reader: Extract Text with Apache PDFBox in Spring Boot


This example shows how to extract text from PDF documents in a Spring Boot application with Apache PDFBox, the library that the Spring AI PDF Document Reader uses under the hood. The project declares the Spring AI reader as a dependency, which brings PDFBox onto the classpath.

Steps to Implement

1. Setup Spring Boot Application

Make sure you have a Spring Boot project with the necessary dependencies. You can generate a Spring Boot project using Spring Initializr.

  • Dependencies:
    • Spring Web for creating REST endpoints.
    • PDF Document Reader: the Spring AI PDF document reader. It uses Apache PDFBox to extract text from PDF documents and convert it into a list of Spring AI Document objects.

Complete pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.springframework.boot</groupId>
		<artifactId>spring-boot-starter-parent</artifactId>
		<version>3.4.1</version>
		<relativePath/> <!-- lookup parent from repository -->
	</parent>
	<groupId>com.example</groupId>
	<artifactId>demo</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<name>demo</name>
	<description>Demo project for Spring Boot</description>
	<properties>
		<java.version>21</java.version>
		<spring-ai.version>1.0.0-M4</spring-ai.version>
	</properties>
	<dependencies>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>
		<dependency>
			<groupId>org.springframework.ai</groupId>
			<artifactId>spring-ai-pdf-document-reader</artifactId>
		</dependency>

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-test</artifactId>
			<scope>test</scope>
		</dependency>
	</dependencies>
	<dependencyManagement>
		<dependencies>
			<dependency>
				<groupId>org.springframework.ai</groupId>
				<artifactId>spring-ai-bom</artifactId>
				<version>${spring-ai.version}</version>
				<type>pom</type>
				<scope>import</scope>
			</dependency>
		</dependencies>
	</dependencyManagement>

	<build>
		<plugins>
			<plugin>
				<groupId>org.springframework.boot</groupId>
				<artifactId>spring-boot-maven-plugin</artifactId>
			</plugin>
		</plugins>
	</build>
	<repositories>
		<repository>
			<id>spring-milestones</id>
			<name>Spring Milestones</name>
			<url>https://repo.spring.io/milestone</url>
			<snapshots>
				<enabled>false</enabled>
			</snapshots>
		</repository>
	</repositories>

</project>

2. Service to Process PDF Documents

The service calls Apache PDFBox directly to extract text. PDFBox is available as a transitive dependency of the Spring AI PDF Document Reader library.

package com.example.pdfreader.service;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;

@Service
public class PdfReaderService {

    public String extractTextFromPdf(MultipartFile file) {
        // Loader.loadPDF is the PDFBox 3.x API (assumed to be the version pulled in
        // by the Spring AI reader); on PDFBox 2.x use PDDocument.load(...) instead.
        // try-with-resources closes the document even if extraction fails.
        try (PDDocument document = Loader.loadPDF(file.getBytes())) {
            // Extract text from all pages using PDFTextStripper
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            return pdfTextStripper.getText(document);
        } catch (IOException e) {
            throw new RuntimeException("Error processing PDF file: " + e.getMessage(), e);
        }
    }
}

3. REST Controller to Handle Requests

Create a controller to expose the functionality via a REST API.

import com.example.pdfreader.service.PdfReaderService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

@RestController
@RequestMapping("/api/pdf")
public class PdfReaderController {

    @Autowired
    private PdfReaderService pdfReaderService;

    @PostMapping("/extract")
    public ResponseEntity<String> extractText(@RequestParam("file") MultipartFile file) {
        try {
            String extractedText = pdfReaderService.extractTextFromPdf(file);
            return ResponseEntity.ok(extractedText);
        } catch (Exception e) {
            return ResponseEntity.badRequest().body("Error extracting text: " + e.getMessage());
        }
    }
}

4. Application Properties

(Optional) Spring Boot limits uploads to 1MB per file and 10MB per request by default. To raise the limits, add the following to your application.properties (or the equivalent nested keys to application.yml):

spring.servlet.multipart.max-file-size=10MB
spring.servlet.multipart.max-request-size=10MB

5. Testing the Application

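  1. Start the Spring Boot application: Run the PdfReaderApplication class, that is, the main class annotated with @SpringBootApplication (Spring Initializr names it after your project, for example DemoApplication for the pom.xml above).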

  2. Test the API:

    • Use Postman or cURL to send a POST request to http://localhost:8080/api/pdf/extract with a file parameter containing a PDF document.

    Example cURL command:

    curl -X POST -F "file=@sample.pdf" http://localhost:8080/api/pdf/extract

  3. Response:

    • The extracted text from the PDF will be returned in the response body.
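
If you would rather exercise the endpoint from Java than from cURL, the request can be sketched with the JDK's built-in HttpClient. This is only a minimal sketch: PdfUploadClient and multipartBody are hypothetical names introduced here for illustration, and the URL and the "file" part name are taken from the controller above. The JDK has no multipart helper, so the body is assembled by hand.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical test client; class and method names are illustrative only. */
public class PdfUploadClient {

    /** Builds a multipart/form-data body with a single part named "file". */
    public static byte[] multipartBody(String boundary, String fileName, byte[] content) {
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n").getBytes(StandardCharsets.UTF_8);
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);

        // Concatenate header, raw file bytes, and closing boundary.
        byte[] body = new byte[head.length + content.length + tail.length];
        System.arraycopy(head, 0, body, 0, head.length);
        System.arraycopy(content, 0, body, head.length, content.length);
        System.arraycopy(tail, 0, body, head.length + content.length, tail.length);
        return body;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Path to the PDF; defaults to sample.pdf as in the cURL example.
        Path pdf = Path.of(args.length > 0 ? args[0] : "sample.pdf");
        String boundary = "----PdfUploadBoundary" + System.nanoTime();

        HttpRequest request = HttpRequest
                .newBuilder(URI.create("http://localhost:8080/api/pdf/extract"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(
                        multipartBody(boundary, pdf.getFileName().toString(), Files.readAllBytes(pdf))))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Run it with the application started, passing the path to a PDF as the first argument.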

Example Output

For a PDF containing:

Hello, this is a sample PDF.
It contains multiple lines of text.

The controller returns a String, so the API response body will be the extracted text as plain text (not wrapped in JSON):

Hello, this is a sample PDF.
It contains multiple lines of text.

Notes:

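  1. Apache PDFBox Integration: This example uses PDFBox directly. spring-ai-pdf-document-reader builds on the same library; if you need Spring AI Document objects rather than a raw String, use the reader classes it provides, such as PagePdfDocumentReader.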
  2. Error Handling: Enhance error handling for edge cases like corrupt PDFs, unsupported file formats, or very large files.
  3. Unit Testing: Add JUnit tests for your service and controller to validate the behavior with sample PDFs.
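
As one possible starting point for the error handling mentioned above, the sketch below runs cheap checks on the uploaded bytes before they reach PDFBox. PdfUploadValidator is a hypothetical helper, not part of Spring AI or PDFBox, and the 10MB limit is an assumption that mirrors the upload setting used in this post. The check is deliberately strict: conforming PDF files begin with the %PDF- marker, so files with leading junk bytes that lenient parsers tolerate would be rejected here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for pre-checking uploaded bytes; not a library class. */
public final class PdfUploadValidator {

    // Conforming PDF files begin with the ASCII marker "%PDF-" followed by a version.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Assumed limit that mirrors the 10MB multipart setting used in this post.
    public static final long MAX_BYTES = 10L * 1024 * 1024;

    private PdfUploadValidator() {
    }

    /** Returns null when the content passes, otherwise a short reason it was rejected. */
    public static String check(byte[] content) {
        if (content == null || content.length == 0) {
            return "file is empty";
        }
        if (content.length > MAX_BYTES) {
            return "file exceeds " + MAX_BYTES + " bytes";
        }
        // Compare the first bytes of the upload against the PDF marker.
        if (content.length < PDF_MAGIC.length
                || !Arrays.equals(Arrays.copyOf(content, PDF_MAGIC.length), PDF_MAGIC)) {
            return "file does not start with the %PDF- header";
        }
        return null;
    }
}
```

Calling PdfUploadValidator.check(file.getBytes()) at the top of the service method would let the controller return a clear 400 message for non-PDF uploads. A file that passes can still be corrupt further in, so the IOException handling in the service remains necessary.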
