Cannot extract text from Office documents (`.docx` extension) #16864

**Elasticsearch version**: 2.2.0
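**Description of the problem including expected versus actual behavior**: The mapper-attachments plugin (or ingest-attachment) extracts text from PDF files, but not from recent Office formats such as `.docx`. Expected: text is extracted from a `.docx` file just as it is from a PDF. Actual: extraction fails with an `AccessControlException` (see the logs below).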
**Steps to reproduce**:
1. Install the mapper-attachments plugin.
2. Index a Word (`.docx`) document.
3. Look at the logs at `DEBUG` level.
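
The indexing step can be sketched as two HTTP requests. This is only a minimal sketch: the index name `test`, the type name `doc`, the field name `file`, and the helper `build_repro_requests` are hypothetical and chosen for illustration. It assumes the plugin's `attachment` field type, which takes the file content as a base64 string. The helper only builds the requests; sending them to a node (for example at `localhost:9200`) is left to any HTTP client.

```python
import base64
import json


def build_repro_requests(docx_bytes, index="test", doc_type="doc", field="file"):
    """Build (method, path, body) tuples for the reproduction steps.

    All names (index, doc_type, field) are illustrative assumptions,
    not values taken from the original report.
    """
    # Request 1: create the index with one field of type "attachment".
    mapping = {
        "mappings": {
            doc_type: {"properties": {field: {"type": "attachment"}}}
        }
    }
    # Request 2: index the document; the attachment field carries the
    # raw file content encoded as base64.
    document = {field: base64.b64encode(docx_bytes).decode("ascii")}
    return [
        ("PUT", "/" + index, json.dumps(mapping)),
        ("PUT", "/{}/{}/1".format(index, doc_type), json.dumps(document)),
    ]
```

To use it, read a `.docx` file in binary mode, pass its bytes to `build_repro_requests`, and send the two requests in order. Repeating the same steps with a PDF or a legacy `.doc` file gives a baseline for comparison.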

**Logs**:

```
[2016-02-29 16:43:39,341][DEBUG][mapper.attachment ] Failed to extract [100000] characters of text for [null]: [Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51667d8a]
...
Caused by: java.lang.IllegalStateException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at org.apache.xmlbeans.XmlBeans.getContextTypeLoader(XmlBeans.java:336)
...
Caused by: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
```

**Analysis**:

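Recent Office documents are XML-based (`.docx`, `.xlsx`...), so Tika parses them with its `OOXMLParser`, which relies on XMLBeans. As the stack trace shows, XMLBeans calls `getClassLoader`, and that call is forbidden under the Elasticsearch security manager. As a result, Tika can no longer read these documents when it runs inside Elasticsearch.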
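
To check whether a node is hitting the same denial, the denied permission can be pulled out of a log excerpt mechanically. The helper below is a small sketch, not part of Elasticsearch or Tika. The function name `denied_permissions` is hypothetical, and the pattern assumes the `access denied ("<class>" "<target>")` wording seen in the logs above.

```python
import re

# Matches the message format seen in the stack trace, e.g.
# access denied ("java.lang.RuntimePermission" "getClassLoader")
DENIED = re.compile(r'access denied \("([^"]+)" "([^"]+)"\)')


def denied_permissions(log_text):
    """Return distinct (permission class, target) pairs, in first-seen order."""
    seen = []
    for match in DENIED.finditer(log_text):
        pair = match.groups()
        if pair not in seen:
            seen.append(pair)
    return seen
```

Run against the excerpt in this report, it should yield a single pair naming `java.lang.RuntimePermission` and `getClassLoader`, even though the message appears twice in the trace.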

Reported by many users at https://discuss.elastic.co/t/no-hits-when-do-a-text-search-in-an-attachment-for-docx-file/41779

Switching to the legacy `.doc` format works well.

