*Java: Trying out Apache Tika [#u0b4ecaf]

The title says Java, but let's start by running it from the command line.

***Download [#a087190e]

Download the Tika jar from the following URL:

-http://tika.apache.org/download.html

***Trying it out [#he3de2b0]

The Tika command-line usage is documented here:

-http://tika.apache.org/1.2/gettingstarted.html

This assumes that java is on your PATH.

First, output only the text content. The option is "-t".

 java -jar tika-app-1.2.jar -t test.doc

Run like this, the extracted text just streams past on the terminal, so redirect it to a file or similar as needed.
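
Since the title promises Java, here is a rough sketch of doing the same redirect from Java code. It does not use the Tika API at all; it just launches the command above through the standard library's ProcessBuilder. The class name TikaTextToFile, the helper names, and the output file test.txt are made up for illustration, and it assumes tika-app-1.2.jar and test.doc sit in the working directory.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

public class TikaTextToFile {
    // Builds the same command line as above: java -jar <jar> -t <document>
    static List<String> buildCommand(String jar, String document) {
        return List.of("java", "-jar", jar, "-t", document);
    }

    // Runs Tika and sends its standard output to outFile; returns the exit code.
    static int extract(String jar, String document, File outFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(jar, document));
        pb.redirectOutput(outFile);                         // same effect as "> test.txt"
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // keep error messages on the console
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // File names are assumptions; adjust them to your environment.
        int exit = extract("tika-app-1.2.jar", "test.doc", new File("test.txt"));
        System.out.println("tika exited with " + exit);
    }
}
```

Keeping the command construction in its own method makes it easy to swap "-t" for another option from the usage message later.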

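Metadata looks like this. The option is "-j" (metadata as JSON) or "-m" (metadata only); "-x", the default, outputs the content as XHTML.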

 java -jar tika-app-1.2.jar -j test.doc

This prints quite a variety of metadata fields.
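
If the metadata is needed inside a program, the plain "-m" output may be easier to handle than JSON without an extra library. The following is only a sketch, assuming that "-m" prints one "name: value" pair per line; the class name TikaMetadataLines and the sample input are made up for illustration and are not real Tika output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TikaMetadataLines {
    // Splits each "name: value" line at the first ": " and keeps the original order.
    // Lines without ": " are skipped; a repeated name overwrites the earlier value.
    static Map<String, String> parse(String output) {
        Map<String, String> meta = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            int sep = line.indexOf(": ");
            if (sep > 0) {
                meta.put(line.substring(0, sep), line.substring(sep + 2));
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        // Made-up sample input for illustration only.
        String sample = "Content-Type: application/msword\nAuthor: someone\n";
        parse(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Splitting on the first ": " rather than the first ":" lets names that themselves contain a colon pass through intact.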

From what I've tried so far, the output seems fairly accurate.
This could be handy.

The usage message follows:

 usage: java -jar tika-app.jar [option...] [file|port...]
 
 Options:
     -?  or --help          Print this usage message
     -v  or --verbose       Print debug level messages
     -V  or --version       Print the Apache Tika version number
 
     -g  or --gui           Start the Apache Tika GUI
     -s  or --server        Start the Apache Tika server
     -f  or --fork          Use Fork Mode for out-of-process extraction
 
     -x  or --xml           Output XHTML content (default)
     -h  or --html          Output HTML content
     -t  or --text          Output plain text content
     -T  or --text-main     Output plain text content (main content only)
     -m  or --metadata      Output only metadata
     -j  or --json          Output metadata in JSON
     -y  or --xmp           Output metadata in XMP
     -l  or --language      Output only language
     -d  or --detect        Detect document type
     -eX or --encoding=X    Use output encoding X
     -pX or --password=X    Use document password X
     -z  or --extract       Extract all attachements into current directory
     --extract-dir=<dir>    Specify target directory for -z
     -r  or --pretty-print  For XML and XHTML outputs, adds newlines and
                            whitespace, for better readability
 
     --create-profile=X
          Create NGram profile, where X is a profile name
     --list-parsers
          List the available document parsers
     --list-parser-details
          List the available document parsers, and their supported mime types
     --list-detectors
          List the available document detectors
     --list-met-models
          List the available metadata models, and their supported keys
     --list-supported-types
          List all known media types and related information
 
 Description:
     Apache Tika will parse the file(s) specified on the
     command line and output the extracted text content
     or metadata to standard output.
 
     Instead of a file name you can also specify the URL
     of a document to be parsed.
 
     If no file name or URL is specified (or the special
     name "-" is used), then the standard input stream
     is parsed. If no arguments were given and no input
     data is available, the GUI is started instead.
 
 - GUI mode
 
     Use the "--gui" (or "-g") option to start the
     Apache Tika GUI. You can drag and drop files from
     a normal file explorer to the GUI window to extract
     text content and metadata from the files.
 
 - Server mode
 
     Use the "--server" (or "-s") option to start the
     Apache Tika server. The server will listen to the
     ports you specify as one or more arguments.
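
The description above says that the special name "-" makes Tika parse standard input. A rough Java sketch of that mode follows, again using only ProcessBuilder from the standard library rather than the Tika API. The class name TikaFromStdin and the file names are made up for illustration; "--encoding=UTF-8" is the "-eX" option from the usage message, added so the captured bytes can be decoded predictably.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class TikaFromStdin {
    // "-" tells Tika to read the document from standard input.
    static List<String> buildCommand(String jar) {
        return List.of("java", "-jar", jar, "-t", "--encoding=UTF-8", "-");
    }

    // Feeds the file to Tika's standard input and returns the extracted text.
    static String extract(String jar, File document)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(jar));
        pb.redirectInput(document);                         // same effect as "< test.doc"
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // keep error messages on the console
        Process p = pb.start();
        // Read everything Tika writes to standard output, then wait for it to exit.
        String text = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        p.waitFor();
        return text;
    }

    public static void main(String[] args) throws Exception {
        // File names are assumptions; adjust them to your environment.
        System.out.println(extract("tika-app-1.2.jar", new File("test.doc")));
    }
}
```

Redirecting the input from a file, instead of writing to the process from Java, avoids having to pump input and output at the same time.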