Apache Tika is a Java library for detecting document types, detecting the language of a document’s content, and extracting content and metadata from many types of file. For content and metadata extraction it relies on existing parser libraries such as JDOM, Jackson, jsoup, XMLBeans, JAXB, POI, PDFBox, etc. It is designed to use little memory and to extract content quickly.
It provides APIs to identify the language of content and to detect the MIME type of a file (using the file extension, among other hints described later).
It provides a generic interface for parsing and for extracting content and metadata by encapsulating all the third-party parser libraries behind a single parser interface. It can extract content from many popular file formats such as PDF, Word, Excel, CSV, text, images, XML, JSON, etc. The Tika class is the simplest and most direct way of calling Tika from Java. It follows the facade design pattern. You can find the class in the Tika API at org.apache.tika.Tika.
Search engines, content management systems and machine learning tools use Tika for content/metadata extraction, document analysis, content analysis, indexing and translation.
Prerequisites:
1- JDK 1.7 or above (the examples below use try-with-resources, which requires Java 7)
2- Download the latest Tika dependencies (1.12 is the latest version at the time of writing). See the list of dependencies given below:
| 
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-serialization</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-app</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-bundle</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-batch</artifactId>
        <version>1.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-translate</artifactId>
        <version>1.12</version>
    </dependency>
</dependencies> | 
You can find the Tika dependencies at the URL given below:
You can also refer to the GitHub repository if you are interested in exploring the source code.
You can refer to the documentation here: http://tika.apache.org/index.html
Getting started: http://tika.apache.org/1.12/gettingstarted.html
Parser Guide: https://tika.apache.org/1.12/parser_guide.html
See examples here: https://tika.apache.org/1.8/examples.html
Find the JavaDocs here: https://tika.apache.org/1.12/api/
So let’s explore the available APIs and get some hands-on practice. First we will use the Tika GUI and see what Tika can do.
Follow the steps given below to use the Tika GUI:
1- Open a command prompt.
2- Locate ‘tika-app-1.12.jar’ and copy its full path.
3- Run the command below to open the Tika GUI.
| 
C:\Users\Abhinav\.m2\repository\org\apache\tika\tika-app\1.12>java -jar tika-app-1.12.jar -g | 
5- Go to the ‘File’ menu and select ‘Open’ to open any file of your choice, e.g. a Word document.
6- Once you select the file, the app automatically extracts the metadata and content from it. You can see all the metadata available for the selected document, e.g.
| 
Application-Name: Microsoft Office Word 
Application-Version: 14.0000 
Author: Abhinav 
Character Count: 3 
Character-Count-With-Spaces: 3 
Content-Length: 18700 
Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document 
Creation-Date: 2016-02-25T12:49:00Z 
Last-Author: Abhinav  
Last-Modified: 2016-02-25T12:56:00Z 
Last-Save-Date: 2016-02-25T12:56:00Z 
Line-Count: 1 
Page-Count: 1 
Paragraph-Count: 1 
Revision-Number: 2 | 
7- Go to the ‘View’ menu; you will see options such as “Metadata”, “Formatted text”, “Plain text”, “Main content”, “Structured text” (in XML format) and “Recursive JSON”. By default, “Metadata” is selected as the view in the app.
8- Let’s see what “Formatted text” looks like. I selected “Formatted text” as the view; it displays the content as it appears in the Word document. See the screen below:
9- Let’s see what “Plain text” looks like. I selected “Plain text” as the view; it displays the content as plain text. See the screen below:
Similarly, you can try the other views. This app is helpful if you want to test the capabilities of Tika and see how the extracted metadata or content will look after processing.
Type Detection in Tika:
Tika identifies the MIME type of a file using the following mechanisms.
1- File extension: The file extension is used to identify the MIME type.
2- Content-type metadata: If the file extension is lost for some reason, Tika uses the metadata supplied with the file to detect the MIME type.
3- Magic bytes: If you look at the raw bytes of a file (for example, by opening it in Notepad), you will notice characteristic byte patterns for each type of file. Some formats begin with special byte prefixes called magic bytes, which are included in the file for the purpose of identifying its type. For example, you will find CA FE BA BE (hexadecimal) at the start of a compiled Java class file and %PDF (ASCII) at the start of a PDF file. Tika uses this information to identify the media type of a file.
4- Character encoding: Tika identifies the MIME type of plain text files using their character encoding.
5- XML characters (<, />): To identify the type of an XML file, Tika first parses the XML document and extracts information such as the namespace, processing instructions, etc.
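To make the magic-bytes idea concrete, here is a minimal sketch in plain Java that checks a file header against the two signatures mentioned above. It is an illustration only and not how Tika implements detection; the class name MagicByteSniffer, its method names and the type labels it returns are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: this is NOT Tika's detector, just the idea behind magic-byte sniffing.
final class MagicByteSniffer {

    // Signatures mentioned in the text: CA FE BA BE (Java class file) and "%PDF" (PDF).
    private static final byte[] JAVA_CLASS = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
    private static final byte[] PDF = "%PDF".getBytes(StandardCharsets.US_ASCII);

    // Returns true when 'data' begins with every byte of 'prefix'.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Maps the first bytes of a file to a type label (labels chosen for this example).
    static String sniff(byte[] header) {
        if (startsWith(header, JAVA_CLASS)) {
            return "application/java-vm";
        }
        if (startsWith(header, PDF)) {
            return "application/pdf";
        }
        // Unknown signature: fall back to a generic binary type.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        final byte[] pdfHeader = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println("Sniffed type: " + sniff(pdfHeader));
    }
}
```

A real detector combines many more signatures with the other hints in the list (extension, metadata, encoding), which is why it is usually better to let Tika do the detection.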
Let’s try it with simple Java code. Here we will use the “org.apache.tika.Tika” class discussed above.
| 
package com.github.abhinavmishra14.tika;

import java.io.File;
import java.util.Scanner;

import org.apache.tika.Tika;

public class FileTypeDetection {
    public static void main(String[] args) throws Exception {
        try (final Scanner scanner = new Scanner(System.in)) {
            System.out.println("Please enter a fileName/filePath: ");
            final String filePath = scanner.nextLine();
            final File fileObject = new File(filePath);
            // Instantiate the Tika facade class
            final Tika tika = new Tika();
            // Call the detect method of the Tika class
            final String filetype = tika.detect(fileObject);
            System.out.println("\nDetected fileType: " + filetype);
        }
    }
} | 
Output:
| 
Please enter a fileName/filePath: 
C:\Users\abhinav\Desktop\inside-marklogic-server-r7.pdf 
Detected fileType: application/pdf 
Please enter a fileName/filePath: 
C:\Users\abhinav\Desktop\set-jre-args-share.png 
Detected fileType: image/png | 
Content extraction in TIKA:
Tika uses multiple parsers to extract content from a given file. The Tika facade class chooses a suitable parser for the file based on file type detection. Tika provides multiple overloaded methods to extract content. The most used method of the Tika class is “parseToString(...)”. It can extract the content of a file from the file system, and it can parse a file from a URL as well. There is one more method, parse(...), which returns an instance of java.io.Reader. It is useful if you want to stream the extracted content into a file, string, stream, etc.
We will see examples of the methods discussed above.
See the complete list of useful methods at the following link: https://tika.apache.org/1.12/api/org/apache/tika/Tika.html
Tika performs the following steps to extract content:
1- Detect the type of the file using the type detection mechanisms (as mentioned above).
2- Once the type of the file is known, choose a suitable parser from the parser library.
3- The parser parses the content and extracts the text; it throws an exception for unreadable formats.
Let’s try it with simple Java code. Here we will use the “org.apache.tika.Tika” class.
I have a text file “helloTika.txt” which has the following content:
| 
A "Hello, World!" program is a computer program that outputs "Hello, World!" on a display device, often standard output. | 
| 
package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.IOException;
import java.util.Scanner;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ExtractionTest {
    public static void main(String[] args) throws IOException, TikaException {
        try (final Scanner scanner = new Scanner(System.in)) {
            System.out.println("Please enter a fileName/filePath: ");
            final String filePath = scanner.nextLine();
            final File fileObject = new File(filePath);
            // Instantiate the Tika facade class
            final Tika tika = new Tika();
            // Call the parseToString method to get the extracted content as a string.
            final String extractedContent = tika.parseToString(fileObject);
            System.out.println("\nExtracted content: " + extractedContent);
        }
    }
} | 
Output:
| 
Please enter a fileName/filePath: 
C:\Users\abhinav\Desktop\helloTika.txt 
Extracted content: A "Hello, World!" program is a computer program that outputs "Hello, World!" on a display device, often standard output. | 
Let’s try the same with an XML file which is accessed via a URL:
| 
package com.github.abhinavmishra14.tika;

import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ExtractionTest {
    public static void main(String[] args) throws IOException, TikaException {
        try (final Scanner scanner = new Scanner(System.in)) {
            System.out.println("Please enter a fileURL: ");
            final String fileURL = scanner.nextLine();
            // Create the instance of URL
            final URL url = new URL(fileURL);
            // Instantiate the Tika facade class
            final Tika tika = new Tika();
            // Call the parseToString method to get the extracted content as a string.
            final String extractedContent = tika.parseToString(url);
            System.out.println("\nExtracted content: " + extractedContent);
        }
    }
} | 
Output:
| Please enter a fileURL: 
Extracted content: 
   Everyday Italian 
   Giada De Laurentiis 
   2005 
   30.00 
   Harry Potter 
   J K. Rowling 
   2005 
   29.99 
   XQuery Kick Start 
   James McGovern 
   Per Bothner 
   Kurt Cagle 
   James Linn 
   Vaidyanathan Nagarajan 
   2003 
   49.99 
   Learning XML 
   Erik T. Ray 
   2003 
   39.95 | 
Let’s save the extracted content to a file:
| 
package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.net.URL;
import java.util.Scanner;

import org.apache.commons.lang.StringUtils;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ExtractAndWriteToAFile {
    public static void main(String[] args) throws IOException, TikaException {
        try (final Scanner scanner = new Scanner(System.in)) {
            System.out.println("Please enter a fileURL: ");
            final String fileURL = scanner.nextLine();
            // Create the instance of URL
            final URL url = new URL(fileURL);
            // Instantiate the Tika facade class
            final Tika tika = new Tika();
            // Get the file name from the URL.
            // Expected file name will be the name without extension
            final File fileName = new File(
                    StringUtils.substringBefore(StringUtils.substringAfterLast(fileURL, "/"), "."));
            // Call the parse method to get the extracted content as a java.io.Reader object.
            // The reader and the writer are both closed by try-with-resources.
            try (final Reader reader = tika.parse(url);
                    final FileWriter fileWriter = new FileWriter(fileName)) {
                System.out.println("Writing to file..");
                int character;
                while ((character = reader.read()) != -1) {
                    fileWriter.write(character);
                }
                System.out.println("Writing to file completed!");
            }
        }
    }
} | 
Output: (Open the saved file to see the saved content)
| Please enter a fileURL: 
http://www.w3schools.com/xsl/books.xml 
Writing to file.. 
Writing to file completed! | 
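The example above copies one character at a time, which is fine for small documents. As a hedged alternative, the sketch below copies through a char buffer using only java.io. The ReaderCopy class and its copy method are hypothetical helpers written for this post, not part of Tika; any java.io.Reader can be passed in, including the one returned by tika.parse(…).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper, not part of Tika: copies a Reader to a Writer in chunks.
final class ReaderCopy {

    // Copies everything from 'in' to 'out' and returns the number of chars copied.
    // The caller stays responsible for closing both streams.
    static long copy(Reader in, Writer out) throws IOException {
        final char[] buffer = new char[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the Reader that tika.parse(...) would return.
        try (Reader in = new StringReader("Hello, Tika!");
                Writer out = new StringWriter()) {
            final long copied = copy(in, out);
            System.out.println("Copied " + copied + " chars: " + out);
        }
    }
}
```

To use it in the program above, you would replace the while loop with a single call such as ReaderCopy.copy(reader, fileWriter).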
So far we have used the Tika facade class and called its default methods, which do the job beautifully. What if you want more control over parsing?
Well, we can do that. The Tika API provides a Parser interface, which you can find in the ‘org.apache.tika.parser’ package. This package has an AbstractParser class which implements the Parser interface. It also contains some basic parsers which are created by extending the AbstractParser class. You can also create your own parser implementation by extending AbstractParser or by implementing the Parser interface.
There are multiple parser implementation classes available in Tika, such as XMLParser, MP3Parser, PDFParser, OfficeParser, OOXMLParser, etc.
To parse a document with the Parser interface, follow these steps:
1- Read the file into a java.io.InputStream.
2- Create an instance of Parser. You can use any of the individual document parsers, or you can use CompositeParser or AutoDetectParser, which use all the parser classes internally and extract the contents of a document using a suitable parser.
3- Create an instance of a content handler; there are many content handlers available in the org.apache.tika.sax package. Some of them are: org.apache.tika.sax.BodyContentHandler, org.apache.tika.sax.LinkContentHandler, org.apache.tika.sax.PhoneExtractingContentHandler, org.apache.tika.sax.TeeContentHandler, org.apache.tika.sax.ElementMappingContentHandler, org.xml.sax.helpers.DefaultHandler, etc. See the Javadocs for details on each handler.
4- Create an instance of org.apache.tika.metadata.Metadata.
5- Create an instance of org.apache.tika.parser.ParseContext.
6- Call the parse(…) method of the Parser created at step 2 and pass the references to the InputStream, content handler, metadata and parse context.
Let’s try it out:
| 
package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class XMLParserTest {
    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/books.xml"))) {
            // Create the instance of parser. Here I am using AutoDetectParser.
            // You can create the instance of
            // XMLParser,MP3Parser,PDFParser,OfficeParser,OOXMLParser based on your need
            final Parser parser = new AutoDetectParser();
            final ContentHandler contentHandler = new BodyContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            System.out.println("Extracted content: " + contentHandler.toString());
        }
    }
} | 
Output:
| Extracted content: 
  
  Everyday Italian 
  
  Giada De Laurentiis 
  
  2005 
  
  30.00 
  
  Harry Potter 
  
  J K. Rowling 
  
  2005 
  
  29.99 
  
  XQuery Kick Start 
  
  James McGovern 
  
  Per Bothner 
  
  Kurt Cagle 
  
  James Linn 
  
  Vaidyanathan Nagarajan 
  
  2003 
  
  49.99 
  
  Learning XML 
  
  Erik T. Ray 
  
  2003 
  
  39.95 | 
Let’s try to extract the content from a text file:
| 
package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.ToTextContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TextParserTest {
    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/hello.txt"))) {
            // Create the instance of parser.
            final Parser parser = new AutoDetectParser();
            final ContentHandler contentHandler = new ToTextContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            System.out.println("Extracted content: " + contentHandler.toString());
        }
    }
} | 
Output:
| Extracted content: Hello world is simple test program | 
Let’s try to extract the content from an Excel sheet:
| 
package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExcelSheetExtractionTest {
    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/Students.xlsx"))) {
            // final Parser parser = new AutoDetectParser();
            // Create the instance of OOXMLParser. AutoDetectParser will also work.
            final Parser parser = new OOXMLParser();
            final ContentHandler contentHandler = new BodyContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            System.out.println("Extracted content: " + contentHandler.toString());
        }
    }
} | 
Output:
| Extracted content: Sheet1 
       Name          Age    RollNo Grade 
       Abhinav       27     101    12 
       Ashutosh      18     102    10 
       Abhishek      22     103    8 | 
Similarly, you can perform content extraction for PDF, MS Word, PowerPoint, etc.
Metadata extraction in TIKA:
Metadata is the additional information supplied with a file. For example, in an audio file, the artist, album, title, year, composer, etc. are metadata. Whenever we parse a file using parse(…), we pass a reference to an empty Metadata object as a parameter. The parse(…) method extracts the metadata of the given file (if there is any) and copies it into the Metadata object. So, after parsing the file using parse(…), we can read the metadata from that object.
Let’s try it with an example:
| 
package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExcelSheetExtractionTest {
    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/Students.xlsx"))) {
            // final Parser parser = new AutoDetectParser();
            // Create the instance of OOXMLParser. AutoDetectParser would also work.
            final Parser parser = new OOXMLParser();
            final ContentHandler contentHandler = new BodyContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            System.out.println("Extracted content: " + contentHandler.toString());
            System.out.println("---------Extracted metadata---------------");
            // Read the metadata information from the metadata object
            final List<String> metadataProps = Arrays.asList(metadata.names());
            for (final String metadataProp : metadataProps) {
                System.out.println(metadataProp + ": " + metadata.get(metadataProp));
            }
        }
    }
} | 
Output:
| Extracted content: Sheet1 
       Name          Age    RollNo Grade 
       Abhinav       27     101    12 
       Ashutosh      18     102    10 
       Abhishek      22     103    8 
---------Extracted metadata--------------- 
meta:last-author: abhinav 
meta:creation-date: 2016-04-08T11:08:29Z 
dcterms:modified: 2016-04-08T11:10:12Z 
meta:save-date: 2016-04-08T11:10:12Z 
Last-Author: abhinav 
Application-Name: Microsoft Excel 
dc:creator: abhinav 
dcterms:created: 2016-04-08T11:08:29Z 
Author: abhinav 
Last-Modified: 2016-04-08T11:10:12Z 
Application-Version: 12.0000 
date: 2016-04-08T11:10:12Z 
modified: 2016-04-08T11:10:12Z 
creator: abhinav 
extended-properties:AppVersion: 12.0000 
Creation-Date: 2016-04-08T11:08:29Z 
protected: false 
meta:author: abhinav 
extended-properties:Application: Microsoft Excel 
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 
Last-Save-Date: 2016-04-08T11:10:12Z | 
You can also add or update metadata on the Metadata object. The Metadata class provides methods for both; note that they change only the in-memory object, not the file on disk. Let’s try it with an example:
| 
package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExcelSheetExtractionTest {
    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/Students.xlsx"))) {
            // final Parser parser = new AutoDetectParser();
            // Create the instance of OOXMLParser. AutoDetectParser would also work.
            final Parser parser = new OOXMLParser();
            final ContentHandler contentHandler = new BodyContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            System.out.println("Extracted content: " + contentHandler.toString());
            System.out.println("---------Extracted metadata---------------");
            // Read the metadata information from the metadata object
            final List<String> metadataProps = Arrays.asList(metadata.names());
            for (final String metadataProp : metadataProps) {
                System.out.println(metadataProp + ": " + metadata.get(metadataProp));
            }
            System.out.println("------------------------------------------");
            // Add metadata information
            metadata.add("Usage", "Used for example");
            // Update metadata information
            metadata.set("Author", "Abhinav Mishra");
            System.out.println("\n---------Updated metadata---------------");
            final List<String> updatedMetadataProps = Arrays.asList(metadata.names());
            for (final String metadataProp : updatedMetadataProps) {
                System.out.println(metadataProp + ": " + metadata.get(metadataProp));
            }
        }
    }
} | 
Output:
| Extracted content: Sheet1 
       Name          Age    RollNo Grade 
       Abhinav       27     101    12 
       Ashutosh      18     102    10 
       Abhishek      22     103    8 
---------Extracted metadata--------------- 
meta:last-author: abhinav 
meta:creation-date: 2016-04-08T11:08:29Z 
dcterms:modified: 2016-04-08T11:10:12Z 
meta:save-date: 2016-04-08T11:10:12Z 
Last-Author: abhinav 
Application-Name: Microsoft Excel 
dc:creator: abhinav 
dcterms:created: 2016-04-08T11:08:29Z 
Author: abhinav 
Last-Modified: 2016-04-08T11:10:12Z 
Application-Version: 12.0000 
date: 2016-04-08T11:10:12Z 
modified: 2016-04-08T11:10:12Z 
creator: abhinav 
extended-properties:AppVersion: 12.0000 
Creation-Date: 2016-04-08T11:08:29Z 
protected: false 
meta:author: abhinav 
extended-properties:Application: Microsoft Excel 
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 
Last-Save-Date: 2016-04-08T11:10:12Z 
------------------------------------------ 
---------Updated metadata--------------- 
meta:last-author: abhinav 
meta:creation-date: 2016-04-08T11:08:29Z 
dcterms:modified: 2016-04-08T11:10:12Z 
meta:save-date: 2016-04-08T11:10:12Z 
Last-Author: abhinav 
Application-Name: Microsoft Excel 
dc:creator: abhinav 
dcterms:created: 2016-04-08T11:08:29Z 
Author: Abhinav Mishra 
Last-Modified: 2016-04-08T11:10:12Z 
Application-Version: 12.0000 
date: 2016-04-08T11:10:12Z 
Usage: Used for example 
modified: 2016-04-08T11:10:12Z 
creator: abhinav 
extended-properties:AppVersion: 12.0000 
Creation-Date: 2016-04-08T11:08:29Z 
protected: false 
meta:author: abhinav 
extended-properties:Application: Microsoft Excel 
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 
Last-Save-Date: 2016-04-08T11:10:12Z | 
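The difference between add and set is easy to miss in the output above. The toy class below is not Tika’s Metadata class; it is a small stand-in built on plain collections (ToyMetadata and its methods are assumptions made for illustration) that mimics the behaviour as I understand it: add keeps the existing values of a property and appends one more, while set replaces them with a single value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a stand-in for a multi-valued metadata store, not Tika's Metadata class.
final class ToyMetadata {

    private final Map<String, List<String>> values = new LinkedHashMap<String, List<String>>();

    // add: keeps any existing values and appends the new one.
    void add(String name, String value) {
        List<String> list = values.get(name);
        if (list == null) {
            list = new ArrayList<String>();
            values.put(name, list);
        }
        list.add(value);
    }

    // set: discards any existing values and stores exactly one.
    void set(String name, String value) {
        final List<String> single = new ArrayList<String>();
        single.add(value);
        values.put(name, single);
    }

    // get: returns the first value of the property, or null if it is absent.
    String get(String name) {
        final List<String> list = values.get(name);
        return list == null ? null : list.get(0);
    }

    // getValues: returns all values of the property (empty list if absent).
    List<String> getValues(String name) {
        final List<String> list = values.get(name);
        return list == null ? new ArrayList<String>() : list;
    }

    public static void main(String[] args) {
        final ToyMetadata metadata = new ToyMetadata();
        metadata.add("Author", "abhinav");
        metadata.add("Author", "Abhinav Mishra"); // the property now holds two values
        System.out.println("After add: " + metadata.getValues("Author"));
        metadata.set("Author", "Abhinav Mishra"); // back to a single value
        System.out.println("After set: " + metadata.getValues("Author"));
    }
}
```

This is why the example above uses set for “Author”, which already had a value, and add for the new “Usage” property.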
Language Detection using TIKA:
Tika provides a language detection tool as well; it is very useful if you want to classify documents by language. Tika adds the language information to the metadata while parsing the document.
Tika can detect 18 of the 184 languages standardized by ISO 639-1. Language detection in Tika is done using the getLanguage() method of the LanguageIdentifier class. This method returns the language code as a String. To get the language of some content, you pass the content to the constructor of the LanguageIdentifier class.
Let’s see how the language detection tool works using an example:
| 
package com.github.abhinavmishra14.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class LanguageDetectionTest {
    public static void main(String[] args) throws IOException,
            SAXException, TikaException {
        try (final InputStream inputStream = new FileInputStream(
                new File("C:/Users/abhinav/Desktop/hello.txt"))) {
            // You can create the instance of
            // XMLParser,MP3Parser,PDFParser,OfficeParser based on your need
            final Parser parser = new AutoDetectParser();
            final ContentHandler contentHandler = new BodyContentHandler();
            final Metadata metadata = new Metadata();
            final ParseContext parseCtx = new ParseContext();
            parser.parse(inputStream, contentHandler, metadata, parseCtx);
            final String extractedContent = contentHandler.toString();
            System.out.println("Extracted content: " + extractedContent);
            System.out.println("Detecting the content language..");
            final LanguageIdentifier langIdentifier = new LanguageIdentifier(extractedContent);
            System.out.println("Language of the content is: " + langIdentifier.getLanguage());
        }
    }
} | 
Output:
| Extracted content: Hello world is simple test program 
Detecting the content language.. 
Language of the content is: en | 

