Search API Attachments
Extracts text content from file attachments and indexes it with Search API, enabling full-text search within uploaded documents.
Machine name: search_api_attachments
Install
composer require 'drupal/search_api_attachments:^10.0'
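Composer only downloads the code; the module still has to be enabled before it does anything. The following is a minimal sketch, assuming Drush is installed and is run from the Drupal project root. The function name `enable_module` is made up for this example.

```shell
# Machine name as listed on this page.
MODULE="search_api_attachments"

# Enable the module with Drush if it is available,
# otherwise point to the admin UI (Extend page).
enable_module() {
  if command -v drush >/dev/null 2>&1; then
    drush en "$MODULE" -y
  else
    echo "drush not found; enable $MODULE at /admin/modules instead"
  fi
}
```

Call `enable_module` once after `composer require` has finished.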
Overview
Search API Attachments extends the Search API module by extracting text content from file attachments and making it searchable. The module supports extracting text from various document formats including PDF, Microsoft Office documents, and many other file types.
The module provides multiple extraction methods to accommodate different server environments: Apache Tika (JAR application or server mode), Solr's built-in extraction capabilities, the pdftotext command-line tool, the Python pdf2txt library, and the Go-based docconv extractor. Each method has its own configuration requirements and supported file formats.
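Which method is practical depends largely on what is installed on the server. As a rough sketch, the check below looks for two of the command-line tools mentioned above. It assumes the binaries are named `java` (needed for the Tika App JAR) and `pdftotext` and that they are on the `PATH`; names and locations may differ on your system. `tool_status` is a helper invented for this example.

```shell
# Report whether a command-line tool can be found on PATH.
tool_status() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# java is required by the Tika App JAR method,
# pdftotext by the Pdftotext method.
for tool in java pdftotext; do
  tool_status "$tool"
done
```

A tool reported as missing may still be installed outside the `PATH`; in that case its full path has to be entered in the module configuration.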
Once configured, the module adds a Search API processor that automatically extracts text from file fields and media entity references during indexing. Extracted content is cached to improve performance on subsequent index operations. The module also provides a field formatter for displaying extracted text directly on entities and a Views filter for excluding attachments from search queries.
Features
- Extracts text content from file attachments for Search API indexing
- Supports six extraction methods: Tika App JAR, Tika JAX-RS Server, Solr Extractor, Pdftotext, Python Pdf2txt, and Docconv
- Indexes content from both File fields and Media entity references
- Provides a configurable caching system with Key Value (database) or Files storage backends
- Includes built-in extraction testing functionality with a sample PDF file
- Queue worker with automatic retry mechanism for failed extractions (up to 5 attempts)
- Field formatter to display extracted text content on entity view pages
- Views filter to selectively exclude attachment content from search queries
- Hook system for custom control over which files should be indexed
- Supports file filtering by extension, MIME type, size, and privacy settings
- Automatic cache invalidation when files are updated or deleted
- Preserves cache across site-wide cache clears (configurable)
- Option to read plain text files directly without an extraction tool
Use Cases
Document Library Search
Build a searchable document library where users can search within PDF documents, Word files, and other attachments. Create a content type with a file field, configure Search API with the File attachments processor, and create a View with an exposed fulltext search filter. Users can then find documents by searching for text contained within the files.
Media Asset Management
Enable searching within media documents attached to content. Add a media field referencing the Document media type to your content type, enable the File attachments processor on your index, and add the attachment field to the fulltext search fields. This allows searching across both content and attached media documents.
Pure Media File Index
Create a search specifically for media files without requiring parent content. Create a Search API index with Media as the data source, limit it to the Document bundle, enable the File attachments processor, and add the extracted text field. Build a View for searching directly within media file contents.
Selective Attachment Search
Allow users to choose whether to include attachment content in their search. Use the provided Views filter 'Exclude search in attachments' as an exposed filter. When checked, the search will only look in regular content fields, improving precision when attachment content is not relevant.
Large File Processing with Queue
Handle extraction of large files or unreliable extractors using the queue system. Failed extractions are automatically queued for retry. Run 'drush queue-run search_api_attachments' during cron or manually to process queued items. The system retries up to 5 times before logging an error.
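The queue run described above can be wrapped in a small script for cron. This is only a sketch: it assumes Drush is on the `PATH`, and the function name, the `DRY_RUN` switch, and the 60-second default are illustrative choices, not module settings. `--time-limit` is a standard option of Drush's queue runner.

```shell
# Queue name as documented on this page.
QUEUE="search_api_attachments"

# Process the queue for at most $1 seconds (default: 60).
# With DRY_RUN=1 the command is printed instead of executed.
run_attachment_queue() {
  limit="${1:-60}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "drush queue-run $QUEUE --time-limit=$limit"
  else
    drush queue-run "$QUEUE" --time-limit="$limit"
  fi
}
```

A time limit keeps a single run from blocking cron when the extractor is slow; items that are not reached stay in the queue for the next run.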
Tips
- Use Tika Server instead of Tika JAR for better performance in high-volume environments - the server can handle concurrent requests and avoids JVM startup overhead
- Set reasonable file size limits to prevent memory issues during extraction - files over 50 MB rarely need full-text indexing
- Enable 'Preserve cached extractions' to avoid re-extracting files after cache clears, especially useful with large document libraries
- Use the 'read text files directly' option for plain text files to skip unnecessary extractor calls
- Monitor the extraction queue during cron runs - a growing queue may indicate extractor configuration issues
- For PDF-only sites, pdftotext is faster and lighter than Tika but lacks support for other formats
- Consider using private file storage for extracted cache files to keep sensitive document content secure
- Test extraction configuration after server changes as paths to executables may change
- Use hook_search_api_attachments_indexable to exclude specific files programmatically based on custom criteria
- Add the 'Exclude search in attachments' Views filter to give users control over search scope
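For the Tika Server tip, it can help to confirm outside Drupal that the server extracts text before pointing the module at it. The sketch below assumes a Tika server is already running on its default port 9998 on the same host and that `curl` is installed; `TIKA_URL` and `tika_server_extract` are names chosen for this example.

```shell
# Base URL of the Tika server; 9998 is Tika's default port.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Upload a file with HTTP PUT and print the extracted plain text.
tika_server_extract() {
  if [ ! -f "$1" ]; then
    echo "file not found: $1" >&2
    return 1
  fi
  curl -s -T "$1" -H "Accept: text/plain" "$TIKA_URL/tika"
}
```

If `tika_server_extract sample.pdf` prints the document text, the same host and port should work in the module's Tika Server settings.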
Technical Details
Admin Pages
/admin/config/search/search_api_attachments
Configure the text extraction method and caching options for Search API Attachments. This page allows you to select and configure the extraction backend used to extract text from uploaded files for indexing.
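Permissions
Administer Search API Attachments
Grants access to the module's configuration, including system paths and external service connections. This is a restricted-access permission.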
Hooks
hook_search_api_attachments_indexable
Determines whether an attachment should be indexed. Return FALSE to prevent a specific file from being indexed.
hook_search_api_attachments_content_extracted
Allows other modules to react after content extraction for a file. Useful for triggering reindexing of related entities.
hook_text_extractor_info_alter
Alters the text extractor plugin definitions. Allows modifying or removing available extraction methods.
Drush Commands
drush queue-list
Lists all queues, including the search_api_attachments queue, so you can see pending extraction tasks.
drush queue-run search_api_attachments
Processes items in the search_api_attachments queue to retry failed extractions.
Troubleshooting
Tika JAR extraction fails
Verify that Java is installed and accessible, and check the Java path in the configuration. Ensure the Tika JAR file path is correct and the file exists. Try running the java command manually: java -jar /path/to/tika-app.jar -V
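Beyond the version check, extracting a real file by hand shows whether Tika itself works independently of Drupal. A minimal sketch, assuming the Tika App JAR and a test document exist at paths you supply; `tika_extract` is a hypothetical helper, and `-t` is Tika App's option for plain-text output.

```shell
# Usage: tika_extract /path/to/tika-app.jar /path/to/document.pdf
# Checks the inputs first so that a wrong path gives a clear message.
tika_extract() {
  if [ ! -f "$1" ]; then
    echo "Tika JAR not found: $1" >&2
    return 1
  fi
  if [ ! -f "$2" ]; then
    echo "document not found: $2" >&2
    return 1
  fi
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found on PATH" >&2
    return 1
  fi
  # -t prints the extracted plain text to standard output.
  java -jar "$1" -t "$2"
}
```

Run it as the web server user if possible, since the module invokes Java with that user's permissions and environment.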
Files are not being indexed
Ensure the File attachments processor is enabled on your index. Check that files are not excluded by extension, size, or privacy settings. Verify that the extraction method is configured and working by using the test function. Check file permissions and confirm that the files exist on disk.
Attachment content does not appear in search results
Verify the attachment field is added to your index fields. Ensure the field type is set to 'Fulltext'. Reindex after making changes. Check that the field is included in the fulltext search fields for your View.
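The reindex step can be done from the command line with Search API's own Drush commands. This is a sketch under assumptions: Drush is available, and the index ID you pass (for example `default_index`, a placeholder here) matches an index on your site. The function name and `DRY_RUN` switch are illustrative.

```shell
# Usage: reindex_attachments INDEX_ID
# Marks all items of the index for reindexing, then indexes them.
# With DRY_RUN=1 the commands are printed instead of executed.
reindex_attachments() {
  if [ -z "$1" ]; then
    echo "usage: reindex_attachments INDEX_ID" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "drush search-api:reset-tracker $1"
    echo "drush search-api:index $1"
  else
    drush search-api:reset-tracker "$1" && drush search-api:index "$1"
  fi
}
```

`drush search-api:list` shows the index IDs available on a site.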
Extraction uses too much memory or produces very large text
Configure the 'Limit size of extracted string' setting to restrict extracted content size (e.g., '1 MB'). Set 'Maximum upload size' to skip very large files. Consider using Tika Server instead of Tika JAR for better resource management.
Items remain in the extraction queue
Check the logs at /admin/reports/dblog for extraction errors. Verify the extractor is still working by using the test function. Items are retried 5 times before permanent failure. Clear the queue if items are permanently stuck.
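The same checks can be run from a terminal. The sketch below uses `queue:delete` and `watchdog:show`, which are standard Drush commands but are not listed on this page; `watchdog:show` requires the Database Logging module. The helper names and the `DRY_RUN` switch are made up for this example.

```shell
# Queue name as documented on this page.
QUEUE="search_api_attachments"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Show queue sizes and the 20 most recent log entries.
inspect_queue() {
  run drush queue-list
  run drush watchdog:show --count=20
}

# Delete all items in the attachments queue. This is destructive:
# files whose extraction failed will not be retried.
clear_queue() {
  run drush queue:delete "$QUEUE"
}
```

Running `DRY_RUN=1 clear_queue` first shows exactly what would be executed.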
Extraction cache is not cleared
If 'Preserve cached extractions' is enabled, site-wide cache clears do not clear the extraction cache. Use the module's cache service directly or disable the preserve option. When the cache backend is changed, the old cache is cleared automatically.
Security Notes
- The 'Administer Search API Attachments' permission has restricted access as it allows configuring system paths and external service connections
- When using file-based caching, prefer 'private' scheme to prevent direct access to extracted content
- Extracted text from private files may be stored in the database or in public files depending on cache configuration - review security implications
- External extractors (Tika Server, Solr) receive file content - ensure secure connections in production environments
- The module executes shell commands for several extractors - paths are validated but ensure proper server hardening