Search API Attachments

Extracts text content from file attachments and indexes it with Search API, enabling full-text search within uploaded documents.

search_api_attachments

13,382 sites

drupal.org

Install

Drupal 10 and 11: v10.0.5

composer require 'drupal/search_api_attachments:^10.0'
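
Composer only downloads the module; it still has to be enabled before the processor appears on an index. A minimal sketch, assuming Drush is installed for the site. The function name 'install_attachments' and the COMPOSER_BIN and DRUSH override variables are hypothetical conveniences, not part of the module:

```shell
#!/bin/sh
# Hypothetical helper; COMPOSER_BIN and DRUSH are illustrative overrides
# so the binaries can be pointed at e.g. vendor/bin/drush.
COMPOSER_BIN="${COMPOSER_BIN:-composer}"
DRUSH="${DRUSH:-drush}"

install_attachments() {
  # Download the module, then enable it only if the download succeeded.
  "$COMPOSER_BIN" require 'drupal/search_api_attachments:^10.0' &&
    "$DRUSH" en search_api_attachments -y
}
```

Run 'install_attachments' from the project root, where composer.json lives.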

Overview

Search API Attachments extends the Search API module by extracting text content from file attachments and making it searchable. The module supports extracting text from various document formats including PDF, Microsoft Office documents, and many other file types.

The module provides multiple extraction methods to accommodate different server environments: Apache Tika (JAR application or server mode), Solr's built-in extraction capabilities, pdftotext command-line tool, Python pdf2txt library, and the Go-based docconv extractor. Each method has its own configuration requirements and supported file formats.

Once configured, the module adds a Search API processor that automatically extracts text from file fields and media entity references during indexing. Extracted content is cached to improve performance on subsequent index operations. The module also provides a field formatter for displaying extracted text directly on entities and a Views filter for excluding attachments from search queries.

Features

Extracts text content from file attachments for Search API indexing
Supports six extraction methods: Tika App JAR, Tika JAX-RS Server, Solr Extractor, Pdftotext, Python Pdf2txt, and Docconv
Indexes content from both File fields and Media entity references
Provides configurable caching system with Key Value (database) or Files storage backends
Includes built-in extraction testing functionality with a sample PDF file
Queue worker with automatic retry mechanism for failed extractions (up to 5 attempts)
Field formatter to display extracted text content on entity view pages
Views filter to selectively exclude attachment content from search queries
Hook system for custom control over which files should be indexed
Supports file filtering by extension, MIME type, size, and privacy settings
Automatic cache invalidation when files are updated or deleted
Preserves cache across site-wide cache clears (configurable)
Option to read plain text files directly without extraction tool

Use Cases

Document Library Search

Build a searchable document library where users can search within PDF documents, Word files, and other attachments. Create a content type with a file field, configure Search API with the File attachments processor, and create a View with an exposed fulltext search filter. Users can then find documents by searching for text contained within the files.

Media Asset Management

Enable searching within media documents attached to content. Add a media field referencing the Document media type to your content type, enable the File attachments processor on your index, and add the attachment field to the fulltext search fields. This allows searching across both content and attached media documents.

Pure Media File Index

Create a search specifically for media files without requiring parent content. Create a Search API index with Media as the data source, limit it to the Document bundle, enable the File attachments processor, and add the extracted text field. Build a View for searching directly within media file contents.

Selective Attachment Search

Allow users to choose whether to include attachment content in their search. Use the provided Views filter 'Exclude search in attachments' as an exposed filter. When checked, the search will only look in regular content fields, improving precision when attachment content is not relevant.

Large File Processing with Queue

Handle extraction of large files or unreliable extractors using the queue system. Failed extractions are automatically queued for retry. Run 'drush queue-run search_api_attachments' during cron or manually to process queued items. The system retries up to 5 times before logging an error.
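
The queue run described above can be wrapped so that a scheduled job never spends more than a fixed time on retries. This is a sketch under assumptions: the function name 'run_attachment_queue', the DRUSH override variable, and the 60-second default are illustrative; '--time-limit' is Drush's own option for its queue runner.

```shell
#!/bin/sh
# Hypothetical wrapper; DRUSH may be overridden (e.g. vendor/bin/drush).
DRUSH="${DRUSH:-drush}"

run_attachment_queue() {
  # $1: time budget in seconds (assumed default: 60).
  time_limit="${1:-60}"
  "$DRUSH" queue-run search_api_attachments --time-limit="$time_limit"
}
```

Calling 'run_attachment_queue 120' from a scheduled job caps each run at two minutes; items that are not reached stay queued for the next run.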

Tips

Use Tika Server instead of Tika JAR for better performance in high-volume environments - the server can handle concurrent requests and avoids JVM startup overhead
Set reasonable file size limits to prevent memory issues during extraction - files over 50 MB rarely need full-text indexing
Enable 'Preserve cached extractions' to avoid re-extracting files after cache clears, especially useful with large document libraries
Use the 'read text files directly' option for plain text files to skip unnecessary extractor calls
Monitor the extraction queue during cron runs - a growing queue may indicate extractor configuration issues
For PDF-only sites, pdftotext is faster and lighter than Tika but lacks support for other formats
Consider using private file storage for extracted cache files to keep sensitive document content secure
Test extraction configuration after server changes as paths to executables may change
Use hook_search_api_attachments_indexable to exclude specific files programmatically based on custom criteria
Add the 'Exclude search in attachments' Views filter to give users control over search scope
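
Before pointing the module at a Tika Server, as the first tip suggests, it can help to confirm that the server extracts text on its own. The sketch below assumes Apache Tika Server's documented defaults (port 9998 and a '/tika' endpoint that accepts an uploaded file and returns plain text); the function name 'tika_extract' and the CURL override variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical smoke test for a Tika Server; CURL may be overridden.
CURL="${CURL:-curl}"

tika_extract() {
  # $1: file to send; $2: server base URL (assumed default shown below).
  file="$1"
  base_url="${2:-http://localhost:9998}"
  # -T uploads the file with PUT; the Accept header requests plain text.
  "$CURL" -sS -T "$file" -H "Accept: text/plain" "$base_url/tika"
}
```

If 'tika_extract sample.pdf' prints the document's text, the server side is working and any remaining problem is likely in the module configuration.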

Dependencies

Search API (required)

Related Projects

Search API

Core integration. The module provides a Search API processor that adds extracted file content as indexable fields.

Search API Solr

When using the Solr extractor, the module leverages Search API Solr's extractContentFromFile method for text extraction.

Media (core)

Full support for Media entity references. The module can extract content from files attached via media fields.

Views (core)

Provides a Views filter to selectively exclude attachment content from fulltext searches.

Technical Details

Configuration page: Search API Attachments (/admin/config/search/search_api_attachments)

Configure the text extraction method and caching options for Search API Attachments. This page allows you to select and configure the extraction backend used to extract text from uploaded files for indexing.

Permission: Administer Search API Attachments

Configure the commands used by Search API Attachments to extract data. This permission has restricted access.

hook_search_api_attachments_indexable

Determines whether an attachment should be indexed. Return FALSE to prevent a specific file from being indexed.

hook_search_api_attachments_content_extracted

Allows other modules to react after content extraction for a file. Useful for triggering reindexing of related entities.

hook_text_extractor_info_alter

Alters the text extractor plugin definitions. Allows modifying or removing available extraction methods.

drush queue-list

Lists all queues including search_api_attachments queue to see pending extraction tasks.

drush queue-run search_api_attachments

Processes items in the search_api_attachments queue to retry failed extractions.

Troubleshooting

Extraction test fails with 'Tika Extractor is not available'

Verify Java is installed and accessible. Check the java path in configuration. Ensure the Tika JAR file path is correct and the file exists. Try running the java command manually: java -jar /path/to/tika-app.jar -V
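
The checks above can be run in order from a shell, stopping at the first one that fails. A minimal sketch: the function name 'check_tika' and its exit codes are hypothetical, and the JAR path is whatever is configured on the module's settings page.

```shell
#!/bin/sh
# Hypothetical diagnostic; mirrors the manual checks for the Tika JAR setup.
check_tika() {
  # $1: path to the Tika JAR; $2: java binary (assumed default: java).
  tika_jar="$1"
  java_bin="${2:-java}"
  if ! command -v "$java_bin" >/dev/null 2>&1; then
    echo "java binary not found: $java_bin" >&2
    return 1
  fi
  if [ ! -f "$tika_jar" ]; then
    echo "Tika JAR not found: $tika_jar" >&2
    return 2
  fi
  # Both exist, so ask Tika for its version as a final end-to-end check.
  "$java_bin" -jar "$tika_jar" -V
}
```

Run it as the web server user where possible, since that account's PATH and file permissions are the ones the module actually sees.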

Files are not being indexed

Ensure the File attachments processor is enabled on your index. Check that files are not excluded by extension, size, or privacy settings. Verify the extraction method is configured and working using the test function. Check file permissions and that files exist on disk.

Extraction works but content is not searchable

Verify the attachment field is added to your index fields. Ensure the field type is set to 'Fulltext'. Reindex after making changes. Check that the field is included in the fulltext search fields for your View.
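
The reindex step can be done from the command line with Search API's Drush commands. This sketch assumes those commands are available on the site; the function name 'reindex_attachments' and the DRUSH override variable are hypothetical, and the index ID must be supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper; DRUSH may be overridden (e.g. vendor/bin/drush).
DRUSH="${DRUSH:-drush}"

reindex_attachments() {
  # $1: machine name of the Search API index (required).
  index_id="$1"
  if [ -z "$index_id" ]; then
    echo "usage: reindex_attachments INDEX_ID" >&2
    return 64
  fi
  # Mark every item for reindexing, then index them again.
  "$DRUSH" search-api:reset-tracker "$index_id" &&
    "$DRUSH" search-api:index "$index_id"
}
```

Reindexing is the step that applies field changes such as switching the attachment field to 'Fulltext'.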

Memory issues or timeouts during indexing

Configure the 'Limit size of extracted string' setting to restrict extracted content size (e.g., '1 MB'). Set 'Maximum upload size' to skip very large files. Consider using Tika Server instead of Tika JAR for better resource management.
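
To choose a sensible size cutoff it helps to know which files would be skipped. The sketch below lists files above a byte threshold; the function name 'list_oversized_files' is hypothetical, the 50 MB default is borrowed from the tip earlier in this document, and the directory to scan is an assumption about where the site stores its files.

```shell
#!/bin/sh
# Hypothetical audit helper; does not touch the module or its settings.
list_oversized_files() {
  # $1: directory to scan; $2: threshold in bytes (assumed default: 50 MB).
  dir="$1"
  max_bytes="${2:-52428800}"
  # The 'c' suffix makes find compare sizes in bytes; '+' means "larger than".
  find "$dir" -type f -size +"${max_bytes}c"
}
```

For example, 'list_oversized_files web/sites/default/files' prints every file larger than 50 MB under that (assumed) public files directory.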

Queue items keep failing

Check the logs at /admin/reports/dblog for extraction errors. Verify the extractor is still working using the test function. Items are retried up to 5 times before permanent failure. Clear the queue if items are permanently stuck.

Cache not being cleared

If 'Preserve cached extractions' is enabled, site-wide cache clears do not clear the extraction cache. Use the module's cache service directly or disable the preserve option. When the cache backend is changed, the old cache is cleared automatically.

Security Notes

The 'Administer Search API Attachments' permission has restricted access as it allows configuring system paths and external service connections
When using file-based caching, prefer 'private' scheme to prevent direct access to extracted content
Extracted text from private files may be stored in the database or in public files, depending on cache configuration - review the security implications
External extractors (Tika Server, Solr) receive file content - ensure secure connections in production environments
The module executes shell commands for several extractors - paths are validated but ensure proper server hardening