The Java and Alfresco World: Apache Tika

Tika is a Java library for detecting document types, detecting the language of a document's content, and extracting content/metadata from many types of file. It builds on existing parser libraries such as JDOM, Jackson, jsoup, XMLBeans, JAXB, POI, PDFBox, etc. for content/metadata extraction. It is designed to use little memory and to extract content quickly.

It provides APIs to identify the language of content and to detect the MIME type of a file (using the file extension, among the other mechanisms described below).

It provides a generic interface for parsing content and extracting content/metadata by encapsulating all the third-party parser libraries behind a single parser interface. It can extract content from many popular file formats such as PDF, Word, Excel, CSV, text, images, XML, JSON, etc. The Tika class is the simplest and most direct way of calling Tika from Java. It follows the facade design pattern. You can find the class in the Tika API at org.apache.tika.Tika.

Search engines, content management systems and machine learning tools use Tika for content/metadata extraction, document analysis, content analysis, indexing and translation.

Prerequisites:

1- JDK 1.7 or above (the examples below use try-with-resources, which requires Java 7)
2- Download the latest Tika dependencies (1.12 is the latest version at the time of writing). See the list of dependencies given below:

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-serialization</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-app</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-bundle</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-batch</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-translate</artifactId>
        <version>1.12</version>
    </dependency>
</dependencies>

You can find the Tika dependencies at the following URL:

http://mvnrepository.com/search?q=org.apache.tika

You can also refer to the GitHub repository, if you are interested in exploring the source code.

You can refer to documentation here: http://tika.apache.org/index.html

Getting started: http://tika.apache.org/1.12/gettingstarted.html

Parser Guide: https://tika.apache.org/1.12/parser_guide.html

See examples here: https://tika.apache.org/1.8/examples.html

Find the JavaDocs here: https://tika.apache.org/1.12/api/

Let’s explore the available APIs and get some hands-on practice. First we will use the Tika GUI to see what Tika can do.

Follow the steps below to use the Tika GUI:

1- Open a command prompt.
2- Locate ‘tika-app-1.12.jar’ and copy its full path.
3- Run the command below to open the Tika GUI.

C:\Users\Abhinav\.m2\repository\org\apache\tika\tika-app\1.12>java -jar tika-app-1.12.jar -g

4- The Tika GUI window opens.

5- Go to the ‘File’ menu and select ‘Open’ to open any file of your choice, e.g. a Word document.
6- Once you select the file, the app automatically extracts the metadata and content from it. You can see all the available metadata for the selected document, e.g.:

Application-Name: Microsoft Office Word

Application-Version: 14.0000

Author: Abhinav

Character Count: 3

Character-Count-With-Spaces: 3

Content-Length: 18700

Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document

Creation-Date: 2016-02-25T12:49:00Z

Last-Author: Abhinav

Last-Modified: 2016-02-25T12:56:00Z

Last-Save-Date: 2016-02-25T12:56:00Z

Line-Count: 1

Page-Count: 1

Paragraph-Count: 1

Revision-Number: 2

7- Go to the ‘View’ menu. You will see options such as “Metadata”, “Formatted text”, “Plain text”, “Main content”, “Structured text” (in XML format) and “Recursive JSON”. By default, “Metadata” is the selected view.

8- Let’s see how “Formatted text” looks. With “Formatted text” selected, the app displays the content as it appears in the Word document. See the screen below:

9- Let’s see how “Plain text” looks. With “Plain text” selected, the app displays the content as plain text. See the screen below:

You can try the other views in the same way. This app is helpful if you want to test the capabilities of Tika and see how the extracted metadata or content will look after processing.

Type Detection in Tika:

Tika can identify the MIME type of a file. It uses the following mechanisms.

1- File extension:- The file extension is used to identify the MIME type.

2- Content-type metadata:- If the file extension is missing for some reason, Tika uses the metadata supplied with the file to detect the MIME type.

3- Magic bytes:- If you look at the raw bytes of a file (for example by opening it in Notepad or a hex viewer), you will notice byte patterns that are characteristic of each file type. Some formats begin with a special byte prefix, called magic bytes, that is included in the file for the purpose of identifying its type. E.g. a compiled Java class file starts with CA FE BA BE (hexadecimal) and a PDF file starts with %PDF (ASCII). Tika uses this information to identify the media type of a file.

4- Character encoding:- Tika identifies plain text files by examining their character encoding.

5- XML characters (<, />):- To identify the type of an XML file, Tika first parses the XML document and extracts information such as the namespace, processing instructions, etc.
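To make the magic-bytes mechanism more concrete, here is a minimal sketch in plain Java that does not use Tika at all. The class MagicByteSniffer and its sniff method are hypothetical names chosen for illustration, and it only knows the two signatures mentioned above; the media type string returned for class files is an assumption. Tika's own detection draws on far more signatures and combines them with the other mechanisms.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; this is not part of the Tika API.
final class MagicByteSniffer {

    // Leading bytes of a compiled Java class file: CA FE BA BE
    private static final byte[] CLASS_MAGIC = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};

    // Leading bytes of a PDF file: the ASCII characters %PDF
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    // Guesses a media type by looking only at the first bytes of a file.
    static String sniff(final byte[] header) {
        if (startsWith(header, CLASS_MAGIC)) {
            return "application/java-vm"; // assumed media type name for class files
        }
        if (startsWith(header, PDF_MAGIC)) {
            return "application/pdf";
        }
        // Generic fallback when no known signature matches
        return "application/octet-stream";
    }

    // Returns true when data begins with every byte of prefix, in order.
    private static boolean startsWith(final byte[] data, final byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        final byte[] pdfHeader = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println("Detected: " + sniff(pdfHeader));
    }
}
```

Because only the leading bytes are inspected, a check like this keeps working when a file has been renamed or has no extension, which is why it complements extension-based detection.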

Let’s try it with some simple Java code. Here we will use the “org.apache.tika.Tika” class as discussed above.

package com.github.abhinavmishra14.tika;

import java.io.File;
import java.util.Scanner;

import org.apache.tika.Tika;

public class FileTypeDetection {

    public static void main(String[] args) throws Exception {
        try (final Scanner scanner = new Scanner(System.in)) {
            System.out.println("Please enter a fileName/filePath: ");
            final String filePath = scanner.nextLine();
            final File fileObject = new File(filePath);
            //Instantiate TIKA facade class
            final Tika tika = new Tika();
            //Call detect method of the TIKA class
            final String filetype = tika.detect(fileObject);
            System.out.println("\nDetected fileType: " + filetype);
        }
    }
}

Output:

Please enter a fileName/filePath:

C:\Users\abhinav\Desktop\inside-marklogic-server-r7.pdf

Detected fileType: application/pdf

Please enter a fileName/filePath:

C:\Users\abhinav\Desktop\set-jre-args-share.png

Detected fileType: image/png

Content extraction in TIKA:

Tika uses multiple parsers to extract content from a file. The Tika facade class chooses a suitable parser for the given file based on file type detection. The Tika class provides several overloaded convenience methods for extracting content. The most commonly used one is “parseToString(...)”. It can extract the content of a file on the file system, and it can parse a file from a URL as well. There is also a parse(...) method, which returns an instance of java.io.Reader. It is useful if you want to write the extracted content to a file, string, stream, etc.

We will see examples of the methods discussed above.

See the complete list of useful methods using following link: https://tika.apache.org/1.12/api/org/apache/tika/Tika.html

Tika performs the following steps to extract the content:

1- Detect the type of the file using the type detection mechanisms (as mentioned above).
2- Once the type of the file is known, choose a suitable parser from its parser library.
3- The parser parses the content and extracts the text; it throws an exception for unreadable formats.
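The three steps above can be sketched without Tika as a small dispatcher: detect a type, look up a parser for that type, then run it. Everything in this sketch is hypothetical (the MiniExtractor class, its registry and its naive extension-based detector); it only illustrates the control flow and is not how Tika is implemented internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of "detect, choose parser, parse"; none of these names are Tika classes.
final class MiniExtractor {

    // Registry that maps a media type to a function turning raw bytes into text.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    MiniExtractor() {
        parsers.put("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Toy CSV "parser": replaces the separators with spaces.
        parsers.put("text/csv", bytes -> new String(bytes, StandardCharsets.UTF_8).replace(',', ' '));
    }

    // Step 1: a deliberately naive detector that looks only at the file extension.
    static String detect(final String fileName) {
        if (fileName.endsWith(".txt")) {
            return "text/plain";
        }
        if (fileName.endsWith(".csv")) {
            return "text/csv";
        }
        return "application/octet-stream";
    }

    // Steps 2 and 3: choose the parser for the detected type, then parse.
    String parseToString(final String fileName, final byte[] content) {
        final String type = detect(fileName);
        final Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            // Mirrors the idea that unreadable formats end in an exception.
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(content);
    }

    public static void main(String[] args) {
        final MiniExtractor extractor = new MiniExtractor();
        final byte[] csv = "Name,Age".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractor.parseToString("students.csv", csv));
    }
}
```

The caller only ever talks to parseToString, which is the same convenience the facade class offers: the choice of parser stays hidden behind one method.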

Let’s try it with some simple Java code. Here we will use the “org.apache.tika.Tika” class.

I have a text file “helloTika.txt” which has following content:

A "Hello, World!" program is a computer program that outputs "Hello, World!" on a display device, often standard output.

package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.IOException;
import java.util.Scanner;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ExtractionTest {

    public static void main(String[] args) throws IOException, TikaException {
        try (final Scanner scanner = new Scanner(System.in)) {
            System.out.println("Please enter a fileName/filePath: ");
            final String filePath = scanner.nextLine();
            final File fileObject = new File(filePath);
            //Instantiate TIKA facade class
            final Tika tika = new Tika();
            //Call the parseToString method to get the extracted content as string.
            final String extractedContent = tika.parseToString(fileObject);
            System.out.println("\nExtracted content: " + extractedContent);
        }
    }
}

Output:

Please enter a fileName/filePath:

C:\Users\abhinav\Desktop\helloTika.txt

Extracted content: A "Hello, World!" program is a computer program that outputs "Hello, World!" on a display device, often standard output.

Let’s try the same with an XML file that is accessed via a URL:

http://www.w3schools.com/xsl/books.xml

package com.github.abhinavmishra14.tika;

import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ExtractionTest {

    public static void main(String[] args) throws IOException, TikaException {
        try (final Scanner scanner = new Scanner(System.in)) {
            System.out.println("Please enter a fileURL: ");
            final String fileURL = scanner.nextLine();
            //Create the instance of URL
            final URL url = new URL(fileURL);
            //Instantiate TIKA facade class
            final Tika tika = new Tika();
            //Call the parseToString method to get the extracted content as string.
            final String extractedContent = tika.parseToString(url);
            System.out.println("\nExtracted content: " + extractedContent);
        }
    }
}

Output:

Please enter a fileURL:

http://www.w3schools.com/xsl/books.xml

Extracted content:

Everyday Italian

Giada De Laurentiis

2005

30.00

Harry Potter

J K. Rowling

2005

29.99

XQuery Kick Start

James McGovern

Per Bothner

Kurt Cagle

James Linn

Vaidyanathan Nagarajan

2003

49.99

Learning XML

Erik T. Ray

2003

39.95

Let’s save the extracted content to a file:

package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.net.URL;
import java.util.Scanner;

import org.apache.commons.lang.StringUtils;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ExtractAndWriteToAFile {

    public static void main(String[] args) throws IOException, TikaException {
        try (final Scanner scanner = new Scanner(System.in)) {
            System.out.println("Please enter a fileURL: ");
            final String fileURL = scanner.nextLine();
            //Create the instance of URL
            final URL url = new URL(fileURL);
            //Instantiate TIKA facade class
            final Tika tika = new Tika();
            //Get the file name from the URL.
            //Expected file name will be the name without extension
            final File fileName = new File(
                    StringUtils.substringBefore(StringUtils.substringAfterLast(fileURL, "/"), "."));
            //Call the parse method to get the extracted content as java.io.Reader object.
            //Both the reader and the writer are closed automatically.
            try (final Reader reader = tika.parse(url);
                    final FileWriter fileWriter = new FileWriter(fileName)) {
                System.out.println("Writing to file..");
                int character = -1;
                while ((character = reader.read()) != -1) {
                    fileWriter.write(character);
                }
                System.out.println("Writing to file completed!");
            }
        }
    }
}

Output: (Open the saved file to see the saved content)

Please enter a fileURL:

http://www.w3schools.com/xsl/books.xml

Writing to file..

Writing to file completed!
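The example above copies one character per read() call. A chunked copy is a common alternative when the extracted content is large. The sketch below uses only java.io; a StringReader stands in for the Reader that tika.parse(...) would return, and the ReaderCopy class with its copy method is a hypothetical helper, not something Tika provides.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper for illustration; it is not part of Tika.
final class ReaderCopy {

    // Copies all characters from reader to writer in chunks and returns how many were copied.
    static long copy(final Reader reader, final Writer writer) throws IOException {
        final char[] buffer = new char[8192];
        long total = 0;
        int read;
        // read(char[]) returns the number of characters read, or -1 at end of stream
        while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // StringReader/StringWriter stand in for the Tika reader and the FileWriter.
        try (Reader reader = new StringReader("Everyday Italian");
                Writer writer = new StringWriter()) {
            final long copied = copy(reader, writer);
            System.out.println(copied + " characters copied: " + writer);
        }
    }
}
```

In the file-writing example, copy(reader, fileWriter) could take the place of the while loop, assuming the same reader and fileWriter variables.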

So far we have used the Tika facade class and called its default methods, which do the job nicely. What if you want more control over parsing?

We can do that too. The Tika API provides a Parser interface, which you can find in the ‘org.apache.tika.parser’ package.

This package has an AbstractParser class which implements the Parser interface. The package also contains some basic parsers which are created by extending the AbstractParser class. You can also create your own parser implementation by extending AbstractParser or by implementing the Parser interface.

Let’s see how the parsers are implemented using a basic class diagram.

There are multiple parser implementation classes available in Tika, such as XMLParser, MP3Parser, PDFParser, OfficeParser, OOXMLParser, etc.

To use the parser implementations, follow the steps given below.

1- Read the file into a java.io.InputStream.

2- Create an instance of Parser. You can use any of the individual document parsers, or you can use either CompositeParser or AutoDetectParser, which use the parser classes internally and extract the contents of a document with a suitable parser.

3- Create an instance of a content handler. Many content handlers are available in the org.apache.tika.sax package, for example org.apache.tika.sax.BodyContentHandler, org.apache.tika.sax.LinkContentHandler, org.apache.tika.sax.PhoneExtractingContentHandler, org.apache.tika.sax.TeeContentHandler and org.apache.tika.sax.ElementMappingContentHandler. The standard org.xml.sax.helpers.DefaultHandler can be used as well. See the Javadocs for details on each handler.

4- Create an instance of org.apache.tika.metadata.Metadata.

5- Create an instance of org.apache.tika.parser.ParseContext.

6- Call the parse(…) method of the Parser created at step 2 and pass the InputStream, content handler, metadata and parse context to it.

Let’s try it out:

package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class XMLParserTest {

    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/books.xml"))) {
            //Create the instance of parser. Here I am using AutoDetectParser.
            //You can create the instance of
            //XMLParser,MP3Parser,PDFParser,OfficeParser,OOXMLParser based on your need
            final Parser parser = new AutoDetectParser();
            final ContentHandler contentHandler = new BodyContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            System.out.println("Extracted content: " + contentHandler.toString());
        }
    }
}

Output:

Extracted content:

Everyday Italian

Giada De Laurentiis

2005

30.00

Harry Potter

J K. Rowling

2005

29.99

XQuery Kick Start

James McGovern

Per Bothner

Kurt Cagle

James Linn

Vaidyanathan Nagarajan

2003

49.99

Learning XML

Erik T. Ray

2003

39.95

Let’s try to extract the content from a text file:

package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.ToTextContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TextParserTest {

    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/hello.txt"))) {
            // Create the instance of parser.
            final Parser parser = new AutoDetectParser();
            final ContentHandler contentHandler = new ToTextContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            System.out.println("Extracted content: " + contentHandler.toString());
        }
    }
}

Output:

Extracted content: Hello world is simple test program

Let’s try to extract the content from an Excel sheet:

package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExcelSheetExtractionTest {

    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/Students.xlsx"))) {
            //final Parser parser = new AutoDetectParser();
            //Create the instance OOXMLParser. AutoDetectParser will also work.
            final Parser parser = new OOXMLParser();
            final ContentHandler contentHandler = new BodyContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            System.out.println("Extracted content: " + contentHandler.toString());
        }
    }
}

Output:

Extracted content: Sheet1

Name Age RollNo Grade

Abhinav 27 101 12

Ashutosh 18 102 10

Abhishek 22 103 8

Similarly, you can perform content extraction for PDF, MS Word, PowerPoint, etc.

Metadata extraction in TIKA:

Metadata is the additional information supplied with a file. For example, in an audio file the artist, album, title, year, composer, etc. are metadata. Whenever we parse a file using parse(…), we pass a reference to an empty Metadata object as a parameter. The parse(…) method extracts the metadata of the given file (if there is any) and copies it into the Metadata object. So, after parsing the file using parse(…), we can read the metadata from that object.

Let’s try it with an example:

package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExcelSheetExtractionTest {

    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/Students.xlsx"))) {
            //final Parser parser = new AutoDetectParser();
            //Create the instance OOXMLParser. AutoDetectParser would also work.
            final Parser parser = new OOXMLParser();
            final ContentHandler contentHandler = new BodyContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            System.out.println("Extracted content: " + contentHandler.toString());
            System.out.println("---------Extracted metadata---------------");
            //Extract the metadata information from metadata object
            final List<String> metadataProps = Arrays.asList(metadata.names());
            for (final String metadataProp : metadataProps) {
                System.out.println(metadataProp + ": " + metadata.get(metadataProp));
            }
        }
    }
}

Output:

Extracted content: Sheet1

Name Age RollNo Grade

Abhinav 27 101 12

Ashutosh 18 102 10

Abhishek 22 103 8

---------Extracted metadata---------------

meta:last-author: abhinav

meta:creation-date: 2016-04-08T11:08:29Z

dcterms:modified: 2016-04-08T11:10:12Z

meta:save-date: 2016-04-08T11:10:12Z

Last-Author: abhinav

Application-Name: Microsoft Excel

dc:creator: abhinav

dcterms:created: 2016-04-08T11:08:29Z

Author: abhinav

Last-Modified: 2016-04-08T11:10:12Z

Application-Version: 12.0000

date: 2016-04-08T11:10:12Z

modified: 2016-04-08T11:10:12Z

creator: abhinav

extended-properties:AppVersion: 12.0000

Creation-Date: 2016-04-08T11:08:29Z

protected: false

meta:author: abhinav

extended-properties:Application: Microsoft Excel

Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Last-Save-Date: 2016-04-08T11:10:12Z

You can also add or update metadata. The Metadata class provides methods for both. Note that these methods change the in-memory Metadata object only; they do not write anything back to the file. Let’s try it with an example:

package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExcelSheetExtractionTest {

    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/Students.xlsx"))) {
            //final Parser parser = new AutoDetectParser();
            //Create the instance OOXMLParser. AutoDetectParser would also work.
            final Parser parser = new OOXMLParser();
            final ContentHandler contentHandler = new BodyContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            System.out.println("Extracted content: " + contentHandler.toString());
            System.out.println("---------Extracted metadata---------------");
            //Extract the metadata information from metadata object
            final List<String> metadataProps = Arrays.asList(metadata.names());
            for (final String metadataProp : metadataProps) {
                System.out.println(metadataProp + ": " + metadata.get(metadataProp));
            }
            System.out.println("------------------------------------------");
            //Add metadata information
            metadata.add("Usage", "Used for example");
            //Update metadata information
            metadata.set("Author", "Abhinav Mishra");
            System.out.println("\n---------Updated metadata---------------");
            final List<String> updatedMetadataProps = Arrays.asList(metadata.names());
            for (final String metadataProp : updatedMetadataProps) {
                System.out.println(metadataProp + ": " + metadata.get(metadataProp));
            }
        }
    }
}

Output:

Extracted content: Sheet1

Name Age RollNo Grade

Abhinav 27 101 12

Ashutosh 18 102 10

Abhishek 22 103 8

---------Extracted metadata---------------

meta:last-author: abhinav

meta:creation-date: 2016-04-08T11:08:29Z

dcterms:modified: 2016-04-08T11:10:12Z

meta:save-date: 2016-04-08T11:10:12Z

Last-Author: abhinav

Application-Name: Microsoft Excel

dc:creator: abhinav

dcterms:created: 2016-04-08T11:08:29Z

Author: abhinav

Last-Modified: 2016-04-08T11:10:12Z

Application-Version: 12.0000

date: 2016-04-08T11:10:12Z

modified: 2016-04-08T11:10:12Z

creator: abhinav

extended-properties:AppVersion: 12.0000

Creation-Date: 2016-04-08T11:08:29Z

protected: false

meta:author: abhinav

extended-properties:Application: Microsoft Excel

Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Last-Save-Date: 2016-04-08T11:10:12Z

------------------------------------------

---------Updated metadata---------------

meta:last-author: abhinav

meta:creation-date: 2016-04-08T11:08:29Z

dcterms:modified: 2016-04-08T11:10:12Z

meta:save-date: 2016-04-08T11:10:12Z

Last-Author: abhinav

Application-Name: Microsoft Excel

dc:creator: abhinav

dcterms:created: 2016-04-08T11:08:29Z

Author: Abhinav Mishra

Last-Modified: 2016-04-08T11:10:12Z

Application-Version: 12.0000

date: 2016-04-08T11:10:12Z

Usage: Used for example

modified: 2016-04-08T11:10:12Z

creator: abhinav

extended-properties:AppVersion: 12.0000

Creation-Date: 2016-04-08T11:08:29Z

protected: false

meta:author: abhinav

extended-properties:Application: Microsoft Excel

Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Last-Save-Date: 2016-04-08T11:10:12Z
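In the output above, "Usage" appears as a new entry while "Author" changes its value. As I understand the Metadata class, add(...) appends another value under a name, whereas set(...) replaces whatever was stored there. The sketch below models that difference with a plain map; MultiValuedProps is a hypothetical stand-in written for illustration and is not Tika's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in that mimics add/set on a multi-valued property map.
// It is not Tika's Metadata class.
final class MultiValuedProps {

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends a value, keeping any values already stored under the name.
    void add(final String name, final String value) {
        values.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
    }

    // Replaces all values stored under the name with a single value.
    void set(final String name, final String value) {
        final List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    // Returns the first value for the name, or null when the name is unknown.
    String get(final String name) {
        final List<String> stored = values.get(name);
        return stored == null ? null : stored.get(0);
    }

    // Returns every value stored under the name (empty when the name is unknown).
    List<String> getValues(final String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        final MultiValuedProps props = new MultiValuedProps();
        props.set("Author", "abhinav");
        props.add("Author", "Abhinav Mishra"); // "Author" now holds two values
        System.out.println(props.getValues("Author"));
        props.set("Author", "Abhinav Mishra"); // back to a single value
        System.out.println(props.getValues("Author"));
    }
}
```

This is why the example uses set(...) for "Author": calling add(...) on a name that already exists would keep the old value alongside the new one.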

Language Detection using TIKA:

Tika also provides a language detection tool, which is very useful if you want to separate documents by language. Tika adds the language information to the metadata while parsing the document.

Tika can detect 18 of the 184 languages standardized in ISO 639-1. Language detection in Tika is done using the getLanguage() method of the LanguageIdentifier class. This method returns the language code as a String. To detect the language of some content, pass the content to the constructor of the LanguageIdentifier class.

e.g. LanguageIdentifier langId = new LanguageIdentifier("Hello Tika");

Let’s see how the language detection tool works using an example:

package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class LanguageDetectionTest {

    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/hello.txt"))) {
            // You can create the instance of
            // XMLParser,MP3Parser,PDFParser,OfficeParser based on your need
            final Parser parser = new AutoDetectParser();
            final ContentHandler contentHandler = new BodyContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            final String extractedContent = contentHandler.toString();
            System.out.println("Extracted content: " + extractedContent);
            System.out.println("Detecting the content language..");
            final LanguageIdentifier langIdentifier = new LanguageIdentifier(extractedContent);
            System.out.println("Language of the content is: " + langIdentifier.getLanguage());
        }
    }
}

Output:

Extracted content: Hello world is simple test program

Detecting the content language..

Language of the content is: en

The Java and Alfresco World

Friday, April 8, 2016

Introduction to Apache TIKA
