*Java: Trying out Apache Tika [#u0b4ecaf]
Despite the "Java" in the title, let's start by running it from the command line.

***Download [#a087190e]
Download the Tika jar from the following URL:
-http://tika.apache.org/download.html

***Trying it out [#he3de2b0]
The Tika command-line usage is documented here:
-http://tika.apache.org/1.2/gettingstarted.html

This assumes that java is on your PATH.

First, output only the text content. The option is "-t".
 java -jar tika-app-1.2.jar -t test.doc
This dumps the whole text to the terminal, so redirect it to a file as needed.

Metadata looks like this. The option is "-j" (metadata as JSON) or "-x" (XHTML output); according to the usage message below, "-m" outputs only the metadata.
 java -jar tika-app-1.2.jar -j test.doc
This prints quite a lot of fields.
From a quick try, the output looks fairly accurate, so this could be handy.

The usage message follows.
>usage: java -jar tika-app.jar [option...] [file|port...]
>
>Options:
>    -?  or --help          Print this usage message
>    -v  or --verbose       Print debug level messages
>    -V  or --version       Print the Apache Tika version number
>
>    -g  or --gui           Start the Apache Tika GUI
>    -s  or --server        Start the Apache Tika server
>    -f  or --fork          Use Fork Mode for out-of-process extraction
>
>    -x  or --xml           Output XHTML content (default)
>    -h  or --html          Output HTML content
>    -t  or --text          Output plain text content
>    -T  or --text-main     Output plain text content (main content only)
>    -m  or --metadata      Output only metadata
>    -j  or --json          Output metadata in JSON
>    -y  or --xmp           Output metadata in XMP
>    -l  or --language      Output only language
>    -d  or --detect        Detect document type
>    -eX or --encoding=X    Use output encoding X
>    -pX or --password=X    Use document password X
>    -z  or --extract       Extract all attachements into current directory
>    --extract-dir=<dir>    Specify target directory for -z
>    -r  or --pretty-print  For XML and XHTML outputs, adds newlines and
>                           whitespace, for better readability
>
>    --create-profile=X
>         Create NGram profile, where X is a profile name
>    --list-parsers
>         List the available document parsers
>    --list-parser-details
>         List the available document parsers, and their supported mime types
>    --list-detectors
>         List the available document detectors
>    --list-met-models
>         List the available metadata models, and their supported keys
>    --list-supported-types
>         List all known media types and related information
>
>Description:
>    Apache Tika will parse the file(s) specified on the
>    command line and output the extracted text content
>    or metadata to standard output.
>
>    Instead of a file name you can also specify the URL
>    of a document to be parsed.
>
>    If no file name or URL is specified (or the special
>    name "-" is used), then the standard input stream
>    is parsed. If no arguments were given and no input
>    data is available, the GUI is started instead.
>
>- GUI mode
>
>    Use the "--gui" (or "-g") option to start the
>    Apache Tika GUI. You can drag and drop files from
>    a normal file explorer to the GUI window to extract
>    text content and metadata from the files.
>
>- Server mode
>
>    Use the "--server" (or "-s") option to start the
>    Apache Tika server. The server will listen to the
>    ports you specify as one or more arguments.
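
Since this page is nominally about Java, here is a rough sketch of driving the same two invocations (-t for text, -j for metadata as JSON) from Java with the standard ProcessBuilder, sending each result to a file instead of letting it scroll past. This is not Tika's Java API; it only launches the tika-app jar as an external process. The class name TikaCli, the helper names, and the output file names test.txt and test.json are made up for illustration, and the jar name is assumed to match the tika-app-1.2.jar used in the commands on this page.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; it wraps the tika-app command line, not the Tika Java API.
public class TikaCli {
    // Assumed jar name, copied from the commands on this page; adjust to your download.
    static final String TIKA_JAR = "tika-app-1.2.jar";

    /** Builds the argument list: java -jar <jar> <option> <input>. */
    static List<String> buildCommand(String option, String inputPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(TIKA_JAR);
        cmd.add(option);
        cmd.add(inputPath);
        return cmd;
    }

    /** Runs tika-app, writes its standard output to outFile, and returns the exit code. */
    static int run(String option, String inputPath, File outFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(option, inputPath));
        pb.redirectOutput(outFile);                          // like "> outFile" in a shell
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);   // keep error messages on the console
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (!new File(TIKA_JAR).isFile()) {
            System.err.println(TIKA_JAR + " not found in the current directory");
            return;
        }
        String input = args.length > 0 ? args[0] : "test.doc";
        int textExit = run("-t", input, new File("test.txt"));   // plain text
        int metaExit = run("-j", input, new File("test.json"));  // metadata as JSON
        System.out.println("text exit=" + textExit + ", metadata exit=" + metaExit);
    }
}
```

Building the argument list in a separate method keeps the command easy to inspect before anything is launched, and passing the arguments as a list avoids shell quoting problems with file names that contain spaces.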