Need a way to limit the size of files processed by the indexer (Bugzilla #19176) #534

Open
@vladak

Description

status NEW severity enhancement in component indexer for ---
Reported in version unspecified on platform ANY/Generic
Assigned to: Trond Norbye

On 2012-02-15 13:52:01 +0000, Vladimir Kotal wrote:

Recent reindexing with 0.11 revealed that the indexer cannot cope with large files and fails with an OutOfMemoryError (JAVA_OPTS is at its default, set to 2 GB):

2012-02-15 14:30:53.572+0100 INFO t15 DefaultIndexChangedListener.fileAdd: Add: /foo.cpio (PlainAnalyzer)
2012-02-15 14:31:43.178+0100 SEVERE t15 IndexDatabase$1.run: Problem updating lucene index database:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at org.opensolaris.opengrok.analysis.plain.PlainAnalyzer.analyze(PlainAnalyzer.java:77)
at org.opensolaris.opengrok.analysis.TextAnalyzer.analyze(TextAnalyzer.java:60)
at org.opensolaris.opengrok.analysis.AnalyzerGuru.getDocument(AnalyzerGuru.java:262)
at org.opensolaris.opengrok.index.IndexDatabase.addFile(IndexDatabase.java:584)
at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:814)
at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:787)
at org.opensolaris.opengrok.index.IndexDatabase.indexDown(IndexDatabase.java:787)
at org.opensolaris.opengrok.index.IndexDatabase.update(IndexDatabase.java:354)
at org.opensolaris.opengrok.index.IndexDatabase$1.run(IndexDatabase.java:158)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

2012-02-15 14:31:43.194+0100 INFO t10 Indexer.sendToConfigHost: Send configuration to: localhost:2424
2012-02-15 14:31:44.488+0100 INFO t10 Indexer.sendToConfigHost: Configuration update routine done, check log output for errors.

$ du -sh /foo.cpio
311M /foo.cpio

There should be an option that tells the indexer to ignore files larger than a given number of bytes (similar to the -i option for filenames).
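A minimal sketch of what such a size check could look like, to make the request concrete. The class name FileSizeLimit and its accepts method are hypothetical and not part of OpenGrok; the threshold is whatever the user would pass on the command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of OpenGrok: decides whether a file
// is small enough to be handed to an analyzer.
final class FileSizeLimit {
    private final long maxBytes;

    FileSizeLimit(long maxBytes) {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be non-negative");
        }
        this.maxBytes = maxBytes;
    }

    /** Returns true if the file is at most maxBytes long, false if it exceeds the limit. */
    boolean accepts(Path file) throws IOException {
        // Files.size reads the length from file metadata; the content is not loaded.
        return Files.size(file) <= maxBytes;
    }
}
```

With a limit of, say, 100L * 1024 * 1024 bytes, a check like this would reject the 311M /foo.cpio before any analyzer tries to read it.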

On 2012-02-15 13:54:37 +0000, Vladimir Kotal wrote:

Maybe there should even be some sane default, like 100 MB.

On 2012-02-16 12:26:12 +0000, Knut Anders Hatlen wrote:

The analyzers don't really need to read the entire file into memory; they could operate on streams instead. I think the reason they read the file into memory is to avoid reading every file twice (once to add it to the Lucene indexes, and once to build the xref). I'm not sure how important this optimization is (we should run some experiments to find out).
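To illustrate the trade-off, here is a rough sketch of two independent streaming passes over the same file, each using a fixed-size buffer so that memory use does not grow with file size. The class TwoPassStreaming and both methods are hypothetical stand-ins (token counting for indexing, newline counting for xref generation), not the actual analyzer code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not OpenGrok code: the file is read twice,
// but never held in memory as a whole.
final class TwoPassStreaming {
    private static final int BUFFER_SIZE = 8192;

    /** First pass (stand-in for indexing): counts runs of non-whitespace bytes. */
    static long countTokens(Path file) throws IOException {
        long tokens = 0;
        boolean inToken = false;
        byte[] buf = new byte[BUFFER_SIZE];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    byte b = buf[i];
                    boolean ws = b == ' ' || b == '\n' || b == '\t' || b == '\r';
                    if (!ws && !inToken) {
                        tokens++; // a new token starts here
                    }
                    inToken = !ws;
                }
            }
        }
        return tokens;
    }

    /** Second pass (stand-in for xref generation): counts newline bytes. */
    static long countNewlines(Path file) throws IOException {
        long newlines = 0;
        byte[] buf = new byte[BUFFER_SIZE];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        newlines++;
                    }
                }
            }
        }
        return newlines;
    }
}
```

The cost is the second read of the file. Whether that is noticeable in practice is exactly what the experiments would need to show.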
