Java: Trying out Apache Tika (diff between backup No.1 and the current version)

- 1 (2012-11-07 (Wed) 21:56:30)

*Java: Trying out Apache Tika [#u0b4ecaf]

The title says Java, but let's start by running Tika from the command line.

***Download [#a087190e]

Download the Tika jar from the following URL:

-http://tika.apache.org/download.html

***Trying it out [#he3de2b0]

The Tika command-line usage is documented here:

-http://tika.apache.org/1.2/gettingstarted.html

This assumes that java is on your PATH.

First, output only the text content. The option is "-t".

 java -jar tika-app-1.2.jar -t test.doc

Run like this, the extracted text scrolls straight past on the console, so redirect it to a file or handle it however suits you.
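
Since the title promises Java, here is a minimal sketch of doing that redirect from Java instead of the shell. It assumes the jar and the input file are reachable from the working directory; the class name TikaTextToFile and its method names are made up for illustration, and only standard-library calls are used (no Tika API).

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

final class TikaTextToFile {
    // Builds the same command line as above: java -jar <jar> -t <input>
    static List<String> textCommand(String tikaJar, String input) {
        return List.of("java", "-jar", tikaJar, "-t", input);
    }

    // Runs the command and sends its standard output to outFile.
    static int extract(String tikaJar, String input, File outFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(textCommand(tikaJar, input));
        pb.redirectOutput(outFile);                         // like "> outFile" in a shell
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // keep error messages on the console
        return pb.start().waitFor();                        // exit code of the child java process
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: TikaTextToFile <tika-app.jar> <input> <output.txt>");
            return;
        }
        System.exit(extract(args[0], args[1], new File(args[2])));
    }
}
```

For example, running it with the arguments tika-app-1.2.jar test.doc test.txt should leave the extracted text in test.txt.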

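Metadata can be output like this. The option is "-j" or "-x"; according to the usage below, "-j" outputs metadata in JSON and "-x" outputs XHTML content, while "-m" outputs only metadata.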

 java -jar tika-app-1.2.jar -j test.doc

This prints quite a variety of fields.

From a quick try, the output looks reasonably accurate.
This could be handy.
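
When switching between output types, it can help to keep the flags in one place. The following is only a sketch: the flag strings are copied from the usage output below, while the TikaOutputMode class and its Mode enum are hypothetical names that are not part of Tika.

```java
import java.util.List;

final class TikaOutputMode {
    // Flags taken from the tika-app usage text; the enum itself is illustrative only.
    enum Mode {
        XHTML("-x"), HTML("-h"), TEXT("-t"), TEXT_MAIN("-T"),
        METADATA("-m"), JSON("-j"), XMP("-y"), LANGUAGE("-l"), DETECT("-d");

        final String flag;

        Mode(String flag) {
            this.flag = flag;
        }
    }

    // Builds: java -jar <jar> <flag> <input>
    static List<String> command(String tikaJar, Mode mode, String input) {
        return List.of("java", "-jar", tikaJar, mode.flag, input);
    }

    public static void main(String[] args) {
        // Prints the metadata command used above:
        // java -jar tika-app-1.2.jar -j test.doc
        System.out.println(String.join(" ", command("tika-app-1.2.jar", Mode.JSON, "test.doc")));
    }
}
```

The resulting list can be handed to a ProcessBuilder to actually run the command.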

The usage output follows.

 usage: java -jar tika-app.jar [option...] [file|port...]
 
 Options:
     -?  or --help          Print this usage message
     -v  or --verbose       Print debug level messages
     -V  or --version       Print the Apache Tika version number
 
     -g  or --gui           Start the Apache Tika GUI
     -s  or --server        Start the Apache Tika server
     -f  or --fork          Use Fork Mode for out-of-process extraction
 
     -x  or --xml           Output XHTML content (default)
     -h  or --html          Output HTML content
     -t  or --text          Output plain text content
     -T  or --text-main     Output plain text content (main content only)
     -m  or --metadata      Output only metadata
     -j  or --json          Output metadata in JSON
     -y  or --xmp           Output metadata in XMP
     -l  or --language      Output only language
     -d  or --detect        Detect document type
     -eX or --encoding=X    Use output encoding X
     -pX or --password=X    Use document password X
     -z  or --extract       Extract all attachements into current directory
     --extract-dir=<dir>    Specify target directory for -z
     -r  or --pretty-print  For XML and XHTML outputs, adds newlines and
                            whitespace, for better readability
 
     --create-profile=X
          Create NGram profile, where X is a profile name
     --list-parsers
          List the available document parsers
     --list-parser-details
          List the available document parsers, and their supported mime types
     --list-detectors
          List the available document detectors
     --list-met-models
          List the available metadata models, and their supported keys
     --list-supported-types
          List all known media types and related information
 
 Description:
     Apache Tika will parse the file(s) specified on the
     command line and output the extracted text content
     or metadata to standard output.
 
     Instead of a file name you can also specify the URL
     of a document to be parsed.
 
     If no file name or URL is specified (or the special
     name "-" is used), then the standard input stream
     is parsed. If no arguments were given and no input
     data is available, the GUI is started instead.
 
 - GUI mode
 
     Use the "--gui" (or "-g") option to start the
     Apache Tika GUI. You can drag and drop files from
     a normal file explorer to the GUI window to extract
     text content and metadata from the files.
 
 - Server mode
 
     Use the "--server" (or "-s") option to start the
     Apache Tika server. The server will listen to the
     ports you specify as one or more arguments.
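
The server mode described at the end of the usage can be tried from Java with a plain socket. This is a rough sketch resting on assumptions that the usage text does not state: that the server was started with something like java -jar tika-app-1.2.jar -t -s 9998 (the port 9998 is an arbitrary choice), that it reads the raw document bytes until the client closes its sending side, and that the reply is UTF-8 text. The class name TikaServerClient is made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class TikaServerClient {
    // Reads the whole reply as UTF-8 text (the encoding is an assumption; see the -eX option).
    static String readReply(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends the document bytes, half-closes the socket, then reads the reply.
    static String send(String host, int port, byte[] document) {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(document);
            out.flush();
            socket.shutdownOutput(); // assumed: end of stream marks the end of the document
            return readReply(socket.getInputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: TikaServerClient <port> <file>");
            return;
        }
        byte[] doc = Files.readAllBytes(Paths.get(args[1]));
        System.out.print(send("localhost", Integer.parseInt(args[0]), doc));
    }
}
```

If the server behaves differently from these assumptions, the client may hang or return nothing, so treat this as a starting point for experimenting.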
