need a way how to limit the size of files processed by indexer (Bugzilla #19176)

status NEW severity _enhancement_ in component _indexer_ for _---_
Reported in version _unspecified_ on platform _ANY/Generic_
Assigned to: Trond Norbye

On 2012-02-15 13:52:01 +0000, Vladimir Kotal wrote:

> Recent reindexing with 0.11 revealed that the indexer cannot cope with larger files and just blows up (JAVA_OPTS is default, set to 2 GB):
> 
> 2012-02-15 14:30:53.572+0100 INFO t15 DefaultIndexChangedListener.fileAdd: Add: /foo.cpio (PlainAnalyzer)
> 2012-02-15 14:31:43.178+0100 SEVERE t15 IndexDatabase$1.run: Problem updating lucene index database: 
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:2882)
>   at org.opensolaris.opengrok.analysis.plain.PlainAnalyzer.analyze(PlainAnalyzer.java:77)
>   at org.opensolaris.opengrok.analysis.TextAnalyzer.analyze(TextAnalyzer.java:60)
>   at org.opensolaris.opengrok.analysis.AnalyzerGuru.getDocument(AnalyzerGuru.java:262)
>   at org.opensolaris.opengrok.index.IndexDatabase.addFile(IndexDatabase.java:584)
>   at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:814)
>   at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:787)
>   at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:787)
>   at org.opensolaris.opengrok.index.IndexDatabase.update(IndexDatabase.java:354)
>   at org.opensolaris.opengrok.index.IndexDatabase$1.run(IndexDatabase.java:158)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> 
> 2012-02-15 14:31:43.194+0100 INFO t10 Indexer.sendToConfigHost: Send configuration to: localhost:2424
> 2012-02-15 14:31:44.488+0100 INFO t10 Indexer.sendToConfigHost: Configuration update routine done, check log output for errors.
> 
> $ du -sh /foo.cpio
>  311M /foo.cpio
> 
> There should be an option which would allow us to say that files larger than xy bytes should be ignored by the indexer (similar to the -i option for filenames).

On 2012-02-15 13:54:37 +0000, Vladimir Kotal wrote:

> Maybe there should even be some sane default, like 100 MB.
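The size limit requested above could be sketched roughly as follows. This is a minimal illustration and not OpenGrok code: the class `SizeLimitFilter`, its methods, and the constant `DEFAULT_MAX_BYTES` are hypothetical names, and the 100 MB value is simply the default suggested in the comment above. The check uses the on-disk size, so an oversized file is rejected before any of its contents are read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of OpenGrok): skips files above a size threshold. */
final class SizeLimitFilter {
    /** Assumed default, taken from the 100 MB suggestion in this report. */
    static final long DEFAULT_MAX_BYTES = 100L * 1024 * 1024;

    private final long maxBytes;

    SizeLimitFilter(long maxBytes) {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0");
        }
        this.maxBytes = maxBytes;
    }

    /** Returns true when a file of the given size is small enough to index. */
    boolean accept(long sizeInBytes) {
        return sizeInBytes <= maxBytes;
    }

    /** Checks the on-disk size without reading the file's contents. */
    boolean accept(Path file) {
        try {
            return accept(Files.size(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Prints, for each path given on the command line, whether it would be indexed. */
    public static void main(String[] args) {
        SizeLimitFilter filter = new SizeLimitFilter(DEFAULT_MAX_BYTES);
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println((filter.accept(p) ? "index " : "skip  ") + p);
        }
    }
}
```

With the assumed default, a 311 MB file such as `/foo.cpio` would be skipped instead of being handed to an analyzer.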

On 2012-02-16 12:26:12 +0000, Knut Anders Hatlen wrote:

> The analyzers don't really need to read the entire file into memory; they could also operate on streams. The reason they do read the file into memory, I think, is to avoid reading every file twice (once to add it to the Lucene indexes, and once to build the xref). I'm not sure how important this optimization is (we should run some experiments to see).
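To illustrate the stream-based alternative mentioned above, here is a minimal sketch that processes input through a fixed-size buffer, so memory use does not grow with file size. It is not OpenGrok's analyzer code: `StreamingScan` and `countTokens` are hypothetical names, and counting whitespace-separated tokens merely stands in for whatever per-chunk work a real analyzer would do.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical example (not OpenGrok code): scans a Reader in bounded memory. */
final class StreamingScan {
    /**
     * Counts whitespace-separated tokens using a buffer of bufferSize chars.
     * Only one buffer is held at a time, regardless of the input's length.
     */
    static long countTokens(Reader in, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be > 0");
        }
        char[] buf = new char[bufferSize];
        long tokens = 0;
        boolean inToken = false; // carried across chunks so split tokens count once
        try {
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    boolean ws = Character.isWhitespace(buf[i]);
                    if (!ws && !inToken) {
                        tokens++; // first character of a new token
                    }
                    inToken = !ws;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

The trade-off is the one the comment describes: if both the index and the xref are built from streams, each file may have to be read twice rather than once.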


need a way how to limit the size of files processed by indexer (Bugzilla #19176) #534
