Spring AI and Apache Tika: Document Text Extraction Example


This example shows how to extract text from documents in various formats (such as PDF, Word, and Excel) inside a Spring Boot application. The project pulls in Apache Tika through the Spring AI Tika document reader dependency, and the service below calls Tika's parser API directly.

1. Add Dependency

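Add the following to your pom.xml. The Spring AI BOM manages the version of spring-ai-tika-document-reader, and the Spring Milestones repository is required because 1.0.0-M4 is a milestone release: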

  <properties>
    <java.version>21</java.version>
    <spring-ai.version>1.0.0-M4</spring-ai.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
      <groupId>org.springframework.ai</groupId>
      <artifactId>spring-ai-tika-document-reader</artifactId>
    </dependency>

    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-test</artifactId>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-bom</artifactId>
        <version>${spring-ai.version}</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>

  <build>
    <plugins>
      <plugin>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-maven-plugin</artifactId>
      </plugin>
    </plugins>
  </build>
  <repositories>
    <repository>
      <id>spring-milestones</id>
      <name>Spring Milestones</name>
      <url>https://repo.spring.io/milestone</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
  </repositories>

2. Create a Service for Document Parsing

package com.example.tikareader.service;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.springframework.stereotype.Service;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;

@Service
public class DocumentReaderService {

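    /**
     * Extracts plain text from a document, letting Tika detect the format from the stream content.
     */
    public String extractText(InputStream inputStream) throws TikaException, IOException, SAXException {
        // Collects the body text; -1 lifts BodyContentHandler's default 100,000-character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        // Filled in by the parser with document metadata such as the detected content type
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();

        // Parse the document; Tika consumes the stream but leaves closing it to the caller
        parser.parse(inputStream, handler, metadata, new ParseContext());

        return handler.toString();
    }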
}

3. Create a REST Controller

package com.example.tikareader.controller;

import com.example.tikareader.service.DocumentReaderService;
import org.apache.tika.exception.TikaException;
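import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;

@RestController
@RequestMapping("/api/documents")
public class DocumentController {

    private final DocumentReaderService documentReaderService;

    // Constructor injection: Spring supplies the service when it creates the controller
    public DocumentController(DocumentReaderService documentReaderService) {
        this.documentReaderService = documentReaderService;
    }

    @PostMapping("/read")
    public ResponseEntity<String> readDocument(@RequestParam("file") MultipartFile file) {
        // try-with-resources closes the upload stream once parsing has finished
        try (InputStream inputStream = file.getInputStream()) {
            String content = documentReaderService.extractText(inputStream);
            return new ResponseEntity<>(content, HttpStatus.OK);
        } catch (TikaException | IOException | SAXException e) {
            return new ResponseEntity<>("Error reading document: " + e.getMessage(), HttpStatus.INTERNAL_SERVER_ERROR);
        }
    }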
}

4. Application Properties

No Tika-specific configuration is required for this example, so application.properties (or application.yml) only needs entries for unrelated settings such as the server port or upload limits.
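Although Tika needs no settings, larger documents may run into Spring Boot's default multipart limits (1MB per file and 10MB per request). A possible application.properties fragment for raising them is sketched below; the 20MB values are assumptions chosen for illustration, not recommendations.

```properties
# Illustrative values only; pick limits that match the documents you expect.
spring.servlet.multipart.max-file-size=20MB
spring.servlet.multipart.max-request-size=20MB
```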

5. Test the API

Run the Spring Boot application and use a tool like Postman or cURL to test the API.

Example cURL Request:

curl -X POST -F "file=@example.pdf" http://localhost:8080/api/documents/read
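The same upload can also be sent from Java. The following is a minimal sketch using the JDK's built-in HttpClient, which has no multipart support of its own, so the request body is assembled by hand. The class name DocumentUploadClient, the helper buildMultipartBody, the boundary format, and the application/octet-stream part type are all hypothetical choices for illustration; only the URL and the "file" field name come from the example above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical client class; not part of the original example project.
final class DocumentUploadClient {

    // Builds a multipart/form-data body that contains a single file part.
    static byte[] buildMultipartBody(String boundary, String fieldName,
                                     String fileName, byte[] fileBytes) {
        String partHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n";
        String closing = "\r\n--" + boundary + "--\r\n";

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(partHeader.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(fileBytes);
        out.writeBytes(closing.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumes the application from this example is running on localhost:8080.
        Path file = Path.of(args.length > 0 ? args[0] : "example.pdf");
        String boundary = "----upload-" + UUID.randomUUID();
        byte[] body = buildMultipartBody(boundary, "file",
                file.getFileName().toString(), Files.readAllBytes(file));

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/api/documents/read"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Print the HTTP status followed by the extracted text (or the error message).
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Passing a file path as the first argument uploads that file instead of example.pdf. For anything beyond a quick check, a library with real multipart support is likely a better fit than hand-built bodies.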

With this setup, the endpoint returns the extracted text of any uploaded document that Apache Tika can parse, or an error message with HTTP status 500 if parsing fails.
