

Last-modified: 2013-08-24 (Sat) 02:14:21
Java: Trying out Apache Tika









java -jar tika-app-1.2.jar -t test.doc

Run like this, the extracted text just comes pouring out onto the terminal, so redirect it to a file (or pipe it somewhere) as needed.
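As a minimal sketch of the redirect mentioned above, the function below wraps the `-t` call and writes the text to a file. The function name `tika_text_to_file`, the `TIKA_JAR` variable, and the `.txt` output naming are assumptions made up for illustration; only `java -jar tika-app-1.2.jar -t` comes from this page.

```shell
# Assumption: tika-app-1.2.jar is in the current directory.
# Set TIKA_JAR beforehand to point somewhere else.
TIKA_JAR="${TIKA_JAR:-tika-app-1.2.jar}"

# tika_text_to_file INPUT [OUTPUT]
# Runs Tika with -t and redirects the plain text into OUTPUT.
# If OUTPUT is omitted, INPUT with ".txt" appended is used.
tika_text_to_file() {
    input="$1"
    output="${2:-$input.txt}"
    java -jar "$TIKA_JAR" -t "$input" > "$output"
}
```

For example, `tika_text_to_file test.doc` would write `test.doc.txt`, and `tika_text_to_file test.doc out.txt` would write `out.txt`. Piping into a pager (`java -jar tika-app-1.2.jar -t test.doc | less`) is another option when the text only needs a quick look.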


java -jar tika-app-1.2.jar -j test.doc


From a quick try, the output looks pretty solid. This could turn out to be handy.
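The two invocations above (`-t` and `-j`) can be combined to process a whole folder in one go. The following is only a sketch: the function name `tika_dump_dir`, the `TIKA_JAR` variable, and the `.txt` / `.json` output names are assumptions chosen for illustration.

```shell
# Assumption: tika-app-1.2.jar is in the current directory.
TIKA_JAR="${TIKA_JAR:-tika-app-1.2.jar}"

# tika_dump_dir DIR
# For every *.doc file directly under DIR, writes
#   NAME.doc.txt   plain text     (-t)
#   NAME.doc.json  JSON metadata  (-j)
# next to the original file.
tika_dump_dir() {
    for f in "$1"/*.doc; do
        [ -e "$f" ] || continue   # the glob matched nothing
        java -jar "$TIKA_JAR" -t "$f" > "$f.txt"
        java -jar "$TIKA_JAR" -j "$f" > "$f.json"
    done
}
```

Calling `tika_dump_dir .` would handle the `.doc` files in the current directory. Each file starts a fresh JVM twice, so this is slow for large folders, but it keeps the sketch close to the commands shown on this page.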


usage: java -jar tika-app.jar [option...] [file|port...]

    -?  or --help          Print this usage message
    -v  or --verbose       Print debug level messages
    -V  or --version       Print the Apache Tika version number

    -g  or --gui           Start the Apache Tika GUI
    -s  or --server        Start the Apache Tika server
    -f  or --fork          Use Fork Mode for out-of-process extraction

    -x  or --xml           Output XHTML content (default)
    -h  or --html          Output HTML content
    -t  or --text          Output plain text content
    -T  or --text-main     Output plain text content (main content only)
    -m  or --metadata      Output only metadata
    -j  or --json          Output metadata in JSON
    -y  or --xmp           Output metadata in XMP
    -l  or --language      Output only language
    -d  or --detect        Detect document type
    -eX or --encoding=X    Use output encoding X
    -pX or --password=X    Use document password X
    -z  or --extract       Extract all attachements into current directory
    --extract-dir=<dir>    Specify target directory for -z
    -r  or --pretty-print  For XML and XHTML outputs, adds newlines and
                           whitespace, for better readability

         Create NGram profile, where X is a profile name
         List the available document parsers
         List the available document parsers, and their supported mime types
         List the available document detectors
         List the available metadata models, and their supported keys
         List all known media types and related information

    Apache Tika will parse the file(s) specified on the
    command line and output the extracted text content
    or metadata to standard output.

    Instead of a file name you can also specify the URL
    of a document to be parsed.

    If no file name or URL is specified (or the special
    name "-" is used), then the standard input stream
    is parsed. If no arguments were given and no input
    data is available, the GUI is started instead.

- GUI mode

    Use the "--gui" (or "-g") option to start the
    Apache Tika GUI. You can drag and drop files from
    a normal file explorer to the GUI window to extract
    text content and metadata from the files.

- Server mode

    Use the "--server" (or "-s") option to start the
    Apache Tika server. The server will listen to the
    ports you specify as one or more arguments.
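Two things from the usage text above are easy to try from a shell: reading the document from standard input via the special name "-", and detecting the document type with `-d`. The sketch below wraps both. The function names `tika_text_stdin` and `tika_detect_all` and the `TIKA_JAR` variable are assumptions made up for illustration.

```shell
# Assumption: tika-app-1.2.jar is in the current directory.
TIKA_JAR="${TIKA_JAR:-tika-app-1.2.jar}"

# tika_text_stdin
# Extracts plain text from whatever arrives on standard input.
# "-" is the special name for stdin described in the usage text.
tika_text_stdin() {
    java -jar "$TIKA_JAR" -t -
}

# tika_detect_all FILE...
# Prints one "FILE: TYPE" line per argument, using -d.
tika_detect_all() {
    for f in "$@"; do
        printf '%s: %s\n' "$f" "$(java -jar "$TIKA_JAR" -d "$f")"
    done
}
```

For example, `tika_text_stdin < test.doc` parses the file through stdin, and `tika_detect_all test.doc` prints the detected type next to the file name. According to the usage text, a URL can also be given in place of the file name.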