Thursday 6 May 2010

Apache Tika - Content and Metadata Extraction in Java

Apache Tika is a useful tool for extracting text and metadata from a number of document formats.

For example, suppose you have a document (PDF, DOC, ...) on the web and want to extract part of its content. You can fetch it and pipe it to the Tika command-line application:

curl http:urltodoc/.../document.pdf | java -jar tika-app/target/tika-app-0.7.jar --text
This prints the plain text of the document. Other options return HTML (--html), XHTML (--xml), or only the metadata of the document (--metadata).
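The same pipeline can also be driven from Java without curl. The following is only a minimal sketch: it assumes that a `java` executable is on the PATH and that you pass the document URL and the path to the tika-app jar yourself. The class name `TikaCliSketch` and its methods are made up for this illustration and are not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Arrays;
import java.util.List;

class TikaCliSketch {

    /** Builds the same command line as the shell example: java -jar <jar> <option>. */
    static List<String> command(String tikaAppJar, String outputOption) {
        return Arrays.asList("java", "-jar", tikaAppJar, outputOption);
    }

    /** Copies everything from in to out. */
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.flush();
    }

    /** Pipes the document into tika-app, copies its output to result, returns the exit code. */
    static int extract(InputStream document, OutputStream result,
                       String tikaAppJar, String outputOption)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command(tikaAppJar, outputOption));
        builder.redirectError(ProcessBuilder.Redirect.INHERIT); // show Tika's errors on our stderr
        Process process = builder.start();

        // Feed the child's stdin from a separate thread so that reading its stdout cannot deadlock.
        Thread feeder = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                copy(document, stdin);
            } catch (IOException e) {
                // The child closed its input early; its exit code reports the failure.
            }
        });
        feeder.start();

        try (InputStream stdout = process.getInputStream()) {
            copy(stdout, result);
        }
        feeder.join();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: URL of the document, args[1]: path to the tika-app jar (both supplied by you)
        int exitCode;
        try (InputStream document = URI.create(args[0]).toURL().openStream()) {
            exitCode = extract(document, System.out, args[1], "--text");
        }
        System.exit(exitCode);
    }
}
```

Passing the output option as a parameter keeps the sketch usable for the other output formats as well.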

Maven Integration

As with other Maven projects, you declare the dependency in the POM. Note, however, that depending on your needs you might want one of these artifacts (mostly quoted from this page):

  • tika-core/target/tika-core-0.7.jar: Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.
  • tika-parsers/target/tika-parsers-0.7.jar: Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.
  • tika-app/target/tika-app-0.7.jar: Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.
  • tika-bundle/target/tika-bundle-0.7.jar: Tika bundle. An OSGi bundle that includes everything you need to use all Tika functionality in an OSGi environment.

To depend on the core library only:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>0.7</version>
</dependency>
If you want to use Tika to parse documents (instead of simply detecting document types, etc.), you'll want to depend on tika-parsers instead:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>0.7</version>
</dependency>
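With tika-parsers on the classpath, parsing from Java might look like the sketch below. It relies on the `AutoDetectParser`, `BodyContentHandler`, `Metadata` and `ParseContext` classes of the Tika API. The class name `TikaParseSketch`, the method `extractText` and the file path argument are made up for this illustration, so treat it as a starting point and not as the canonical way to call Tika.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

class TikaParseSketch {

    /**
     * Parses the stream with whatever parser Tika detects for it,
     * fills the given Metadata object, and returns the body text.
     */
    static String extractText(InputStream stream, Metadata metadata) throws Exception {
        Parser parser = new AutoDetectParser();                  // picks a parser based on the detected type
        BodyContentHandler handler = new BodyContentHandler();   // collects the text of the document body
        parser.parse(stream, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a local document (hypothetical input, supplied by you)
        Metadata metadata = new Metadata();
        String text;
        try (InputStream stream = new FileInputStream(args[0])) {
            text = extractText(stream, metadata);
        }
        // Print every metadata entry the parser reported, then the extracted text.
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
        System.out.println(text);
    }
}
```

The metadata and the text come out of a single parse call, which corresponds to what the command-line options above print separately.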