Can not extract text from Office documents (.docx extension) #16864

@dadoonet

Elasticsearch version: 2.2.0
Description of the problem including expected versus actual behavior: The mapper-attachments plugin (or ingest-attachment) extracts text from plain-text and PDF files, but not from the XML-based Office formats such as .docx.
Steps to reproduce:

  1. Install the mapper-attachments plugin.
  2. Index a Word document (.docx).
  3. Look at the logs at DEBUG level.
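
For step 2, the attachment field expects the file content as a base64-encoded string. The following is a minimal sketch of building such a request body in Java; the class name, the helper `buildBody`, and the field name `file` are assumptions for illustration and must be adapted to your own mapping.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AttachmentBodySketch {

    // Hypothetical helper: builds a JSON body with the file bytes
    // base64-encoded under the given field name. The field name is an
    // assumption and must match an attachment-typed field in the mapping.
    static String buildBody(String field, byte[] fileBytes) {
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        // Standard base64 output contains no characters that need JSON escaping.
        return "{\"" + field + "\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) {
        // Stand-in bytes; a real reproduction would read the .docx file instead.
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(buildBody("file", sample));
    }
}
```

In a real reproduction the bytes would come from the .docx file itself, for example via `Files.readAllBytes`, and the resulting body would be sent to the index as a document.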

Logs:

[2016-02-29 16:43:39,341][DEBUG][mapper.attachment ] Failed to extract [100000] characters of text for [null]: [Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51667d8a]
...
Caused by: java.lang.IllegalStateException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at org.apache.xmlbeans.XmlBeans.getContextTypeLoader(XmlBeans.java:336)
...
Caused by: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)

Analysis:

Because recent Office documents are XML-based (.docx, .xlsx...), Tika can no longer read them when running inside Elasticsearch: the OOXML parser goes through XMLBeans, whose getClassLoader call is denied by the Java security manager.
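
For context, a common pattern for this kind of denial under a Java security manager is to run the sensitive call inside a privileged block, so that only the permissions of the calling code are checked rather than those of every frame on the stack. The sketch below illustrates that pattern, assuming the denied call is a context class loader lookup. The class and method names are hypothetical, this is not necessarily how the plugin should be fixed, and the code would still need the `getClassLoader` runtime permission granted in its security policy.

```java
import java.security.AccessController;
import java.security.PrivilegedAction;

public class PrivilegedClassLoaderSketch {

    // Hypothetical helper, not part of Tika or Elasticsearch: performs the
    // class loader lookup inside a privileged block. Under a security manager,
    // the permission check then stops at this frame instead of walking the
    // whole call stack.
    @SuppressWarnings("removal") // AccessController is deprecated in recent JDKs
    static ClassLoader contextClassLoader() {
        return AccessController.doPrivileged(
            (PrivilegedAction<ClassLoader>) () ->
                Thread.currentThread().getContextClassLoader());
    }

    public static void main(String[] args) {
        System.out.println("context class loader: " + contextClassLoader());
    }
}
```

Without a security manager installed, the privileged block simply runs the action, so this sketch can be tried outside Elasticsearch.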

Reported by many users at https://discuss.elastic.co/t/no-hits-when-do-a-text-search-in-an-attachment-for-docx-file/41779

Switching to the legacy .doc format works as a workaround.
