Building and Running The Framework

bcarrier edited this page Jul 14, 2011 · 6 revisions

Building the source

IMPORTANT! This document assumes that Hadoop, HDFS, and HBase are all running properly on the machine where you will run the jar file!

The pipeline code does end-to-end processing of a directory of documents (text extraction, document vectorization, cluster generation, etc.). To build it, do the following:

  1. Run maven in the pre-build folder.
  2. Run maven in the root project folder:

       jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk% mvn
       [INFO] Scanning for projects...
       [INFO] ------------------------------------------------------------------------
       [INFO] Reactor Build Order:
       [INFO]
       [INFO] Hadoop modules for The SleuthKit
       [INFO] Hadoop clustering for The SleuthKit
       [INFO] Hadoop text regex search for The SleuthKit
       [INFO] Text extraction or The SleuthKit
       [INFO] Hadoop pipeline management for The SleuthKit
       [INFO]
       [INFO] ------------------------------------------------------------------------
       [INFO] Building Hadoop modules for The SleuthKit 1-SNAPSHOT
       [INFO] ------------------------------------------------------------------------
       ...
       [INFO] ------------------------------------------------------------------------
       [INFO] BUILD SUCCESS
       [INFO] ------------------------------------------------------------------------
       [INFO] Total time: 26.904s
       [INFO] Finished at: Thu May 05 17:25:55 EDT 2011
       [INFO] Final Memory: 16M/204M
       [INFO] ------------------------------------------------------------------------

     This will create a jar file in the pipeline/target folder.
  3. Check out fsrip (git clone https://github.com/jonstewart/fsrip.git) and build it with 'scons'.
  4. Add FSRIP_ROOT/deps/lib to LD_LIBRARY_PATH and FSRIP_ROOT/build/src/ to your PATH.
  5. Set the HADOOP_HOME environment variable.
  6. Copy in the report template:

       % rm -Rf reports/data
       % hadoop fs -copyFromLocal reports /texaspete/template/reports
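
Steps 4 and 5 above can be sketched as a few lines for your shell profile. This is only a sketch: the default locations used for FSRIP_ROOT and HADOOP_HOME are assumptions for illustration, so point them at wherever you actually cloned fsrip and installed Hadoop.

```shell
# Assumed locations (not from this page); override them before sourcing.
FSRIP_ROOT="${FSRIP_ROOT:-$HOME/fsrip}"
HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"

# Step 4: make fsrip's bundled libraries and its binary visible.
# The ${VAR:+...} form avoids a stray trailing colon when LD_LIBRARY_PATH is unset.
LD_LIBRARY_PATH="$FSRIP_ROOT/deps/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
PATH="$FSRIP_ROOT/build/src:$PATH"

# Step 5: export everything so child processes (hadoop, fsrip) inherit it.
export FSRIP_ROOT HADOOP_HOME LD_LIBRARY_PATH PATH
```

Sourcing this from your shell startup file keeps the settings across logins.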

Running the Pipeline

If you want to run the grep search, you need to put a file of Java regexes on HDFS. (You will want to do this if you're running the pipeline.) A sample file with a few (uninteresting) regexes is in the match project folder:

jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk% hadoop fs -put match/src/main/resources/regexes /texaspete/regexes

You can, of course, make your own regexes. Any standard Java regex will work, with one regex per line. If you use Java globbing it should take the first glob, though this has not been tested extensively yet.
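
As a rough illustration of the one-regex-per-line format, the sketch below writes a small custom file and uploads it. The file name and the three patterns are hypothetical examples, not the ones shipped in the match project, and the upload is skipped when no hadoop client is on your PATH.

```shell
# Hypothetical local file name; any name works.
REGEX_FILE="${REGEX_FILE:-my-regexes}"

# One Java regex per line. The quoted 'EOF' keeps backslashes and $ literal.
cat > "$REGEX_FILE" <<'EOF'
[0-9]{3}-[0-9]{2}-[0-9]{4}
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
(?i)password
EOF

# Count the non-empty lines, i.e. the number of regexes.
REGEX_COUNT=$(grep -c . "$REGEX_FILE")
echo "$REGEX_COUNT regexes written to $REGEX_FILE"

# Upload to the location used above. 'hadoop fs -put' may refuse to
# overwrite an existing file, so remove the old one first if needed.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -put "$REGEX_FILE" /texaspete/regexes
else
  echo "hadoop not found; skipping upload" >&2
fi
```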

Now that all the code has been built, have a look at the output in the pipeline/target directory.

jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk% cd pipeline/target
jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk/pipeline/target% ls
archive-tmp/  maven-archiver/                    *sleuthkit-pipeline-1-SNAPSHOT-job.jar*
classes/      sleuthkit-pipeline-1-SNAPSHOT.jar  surefire/
jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk/pipeline/target%

The job jar is the one you should use with hadoop. First, run fsrip on an image to create a JSON metadata file for it. Then copy BOTH the JSON metadata file AND the image file onto HDFS (the usual directory for this is /texaspete/img, though you can put them wherever you like).
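
The copy step might look like the sketch below. The local file names are hypothetical placeholders, and the JSON file is assumed to have already been produced by fsrip. The small run helper is also an illustration: it only prints the commands when no hadoop client is installed.

```shell
# Hypothetical local files: the disk image and the JSON fsrip produced for it.
IMG_LOCAL="${IMG_LOCAL:-image.dd}"
JSON_LOCAL="${JSON_LOCAL:-image.json}"

# The usual HDFS directory according to the text; any directory works.
HDFS_IMG_DIR="/texaspete/img"
IMG_PATH="$HDFS_IMG_DIR/$(basename "$IMG_LOCAL")"
IMG_JSON_PATH="$HDFS_IMG_DIR/$(basename "$JSON_LOCAL")"

# Execute the command if hadoop is available, otherwise just print it.
run() {
  if command -v hadoop >/dev/null 2>&1; then
    "$@"
  else
    echo "would run: $*"
  fi
}

# Copy BOTH files onto HDFS.
run hadoop fs -put "$IMG_LOCAL" "$IMG_PATH"
run hadoop fs -put "$JSON_LOCAL" "$IMG_JSON_PATH"
```

The two HDFS paths computed here are the values you would later pass to the ingest step.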

You can run the entire pipeline using:

% hadoop jar sleuthkit-pipeline-1-SNAPSHOT-job.jar org.sleuthkit.hadoop.pipeline.Ingest $IMG_MD5 $IMG_PATH $IMG_JSON_PATH
% hadoop jar sleuthkit-pipeline-1-SNAPSHOT-job.jar org.sleuthkit.hadoop.pipeline.Pipeline $IMG_MD5

Output data will go into various directories under: /texaspete/data/$IMG_MD5/
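
The two commands can be wrapped in a small script so that the pipeline only starts if the ingest succeeds. This is a sketch under assumptions: the page does not spell out how $IMG_MD5 is obtained, so the value below is a placeholder (presumably you would use the MD5 hash of the image), and the HDFS paths are hypothetical.

```shell
JOB_JAR="sleuthkit-pipeline-1-SNAPSHOT-job.jar"

# Placeholder values for illustration; substitute your own.
IMG_MD5="${IMG_MD5:-d41d8cd98f00b204e9800998ecf8427e}"
IMG_PATH="${IMG_PATH:-/texaspete/img/image.dd}"
IMG_JSON_PATH="${IMG_JSON_PATH:-/texaspete/img/image.json}"

# Execute the command if hadoop is available, otherwise just print it.
run() {
  if command -v hadoop >/dev/null 2>&1; then
    "$@"
  else
    echo "would run: $*"
  fi
}

# Ingest first; run the pipeline only if the ingest succeeded.
run hadoop jar "$JOB_JAR" org.sleuthkit.hadoop.pipeline.Ingest \
    "$IMG_MD5" "$IMG_PATH" "$IMG_JSON_PATH" &&
run hadoop jar "$JOB_JAR" org.sleuthkit.hadoop.pipeline.Pipeline "$IMG_MD5" &&
echo "Output directories are under /texaspete/data/$IMG_MD5/"
```

Run it from the pipeline/target directory, or set JOB_JAR to the full path of the job jar.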

Note that if you wish to run the individual components of the pipeline separately, you should be able to do that from this jar by invoking their Java classes directly. Most have usage/help lines which may be of use.
