Magika: Google's Next-Generation Content Type Detection Tool

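"I want to detect a file's content type accurately instead of judging by its extension."
"I'm looking for a content type detection tool with a proven track record."

If either of these applies to you, Magika is a good fit.
This article introduces Magika, a content type detection tool from Google.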

In this article

What is Magika?
Installing Magika
Trying out Magika

Let's go through these in order.

What is Magika?

Magika is a new tool based on deep learning.
In short, it is an AI-powered content type detection tool.
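For contrast, here is a minimal sketch of the extension-based approach that Magika is meant to replace. It uses the standard library's mimetypes module, which looks only at the file name and never reads the file's bytes. The helper name guess_by_extension is made up for this illustration.

```python
import mimetypes


def guess_by_extension(filename):
    """Guess a MIME type from the file name alone (hypothetical helper)."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


# Only the name matters: a shell script renamed to "script.png" would
# still be reported as an image, because the content is never inspected.
print(guess_by_extension("photo.png"))
print(guess_by_extension("script.png"))
# A file without an extension gives the module nothing to work with.
print(guess_by_extension("payload"))
```

This is the weakness the opening quote refers to: the answer depends on what the file is called, not on what it contains.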

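It identifies files accurately within milliseconds on a single CPU.
In an evaluation covering more than 1M files and over 100 content types, it achieved an accuracy of 99% or higher.

At Google, it is used to help improve user safety in the following services.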

Gmail
Google Drive
Google Safe Browsing

In other words, Magika is a tool with a proven track record.

It also appears to support more than 100 content types.

Installing Magika

At the time of writing (mid-February 2024), the latest version of Magika is 0.5.0.
Because it is written in Python, it can be installed easily with pip.

The supported Python versions are listed on Magika's PyPI page.

To install Magika, just run the following command.

pip install magika

Once the command finishes, Magika is installed.
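To confirm which version pip actually installed, one option is to query the package metadata from Python. The following is a small sketch using the standard library's importlib.metadata. The helper name installed_version is an assumption for this example, and it returns None when the package is missing.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed Magika version, or None if the install did not succeed.
print(installed_version("magika"))
```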

Trying out Magika

Magika can be used in two ways.

Command-line tool
Python API

Let's try each of them.
As the target file for content type identification, we prepare a PNG image.
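If no PNG is at hand, a test file can be generated with the standard library alone. The sketch below writes a small solid-color image by assembling the PNG signature and the IHDR, IDAT, and IEND chunks by hand. The function names and the 32x32 size are choices made for this illustration, and any real PNG works just as well. Note that it overwrites input.png in the current directory.

```python
import struct
import zlib
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def write_solid_png(path, width=32, height=32, rgb=(255, 0, 0)):
    """Write a solid-color RGB PNG and return its path (hypothetical helper)."""
    # IHDR: width, height, bit depth 8, color type 2 (RGB), then
    # compression, filter, and interlace methods, all 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline starts with filter type 0, followed by the RGB pixels.
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)
    data = (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", idat)
        + png_chunk(b"IEND", b"")
    )
    path = Path(path)
    path.write_bytes(data)
    return path


png_path = write_solid_png("input.png")
print(png_path, png_path.stat().st_size, "bytes")
```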

Command-line tool

Usage is available from the help output.

> magika --help
Usage: magika [OPTIONS] [FILE]...

  Magika - Determine type of FILEs with deep-learning.

Options:
  -r, --recursive                 When passing this option, magika scans every
                                  file within directories, instead of
                                  outputting "directory"
  --json                          Output in JSON format.
  --jsonl                         Output in JSONL format.
  -i, --mime-type                 Output the MIME type instead of a verbose
                                  content type description.
  -l, --label                     Output a simple label instead of a verbose
                                  content type description. Use --list-output-
                                  content-types for the list of supported
                                  output.
  -c, --compatibility-mode        Compatibility mode: output is as close as
                                  possible to `file` and colors are disabled.
  -s, --output-score              Output the prediction's score in addition to
                                  the content type.
  -m, --prediction-mode [best-guess|medium-confidence|high-confidence]
  --batch-size INTEGER            How many files to process in one batch.
  --no-dereference                This option causes symlinks not to be
                                  followed. By default, symlinks are
                                  dereferenced.
  --colors / --no-colors          Enable/disable use of colors.
  -v, --verbose                   Enable more verbose output.
  -vv, --debug                    Enable debug logging.
  --generate-report               Generate report useful when reporting
                                  feedback.
  --version                       Print the version and exit.
  --list-output-content-types     Show a list of supported content types.
  --model-dir DIRECTORY           Use a custom model.
  -h, --help                      Show this message and exit.

  Magika version: "0.5.0"

  Default model: "standard_v1"

  Send any feedback to magika-dev@google.com or via GitHub issues.
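The options in the help text above can also be driven from a script. The following sketch assembles a magika command line from a few of those flags and runs it only when the magika executable is found on PATH. The function names and keyword arguments are invented for this example. Only the flags themselves come from the help output.

```python
import shutil
import subprocess

# Output-format flags taken from the help text above.
OUTPUT_FLAGS = ("json", "jsonl", "mime-type", "label")


def build_magika_command(paths, *, recursive=False, output=None, with_score=False):
    """Assemble an argument list for the magika CLI (hypothetical helper)."""
    cmd = ["magika"]
    if recursive:
        cmd.append("--recursive")
    if output is not None:
        if output not in OUTPUT_FLAGS:
            raise ValueError(f"unknown output format: {output}")
        cmd.append(f"--{output}")
    if with_score:
        cmd.append("--output-score")
    cmd.extend(str(p) for p in paths)
    return cmd


def run_magika(paths, **options):
    """Run magika and return its stdout, or None if it is not installed."""
    if shutil.which("magika") is None:
        return None
    completed = subprocess.run(
        build_magika_command(paths, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


print(build_magika_command(["input.png"], output="mime-type", with_score=True))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.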

Now identify the prepared PNG.

> magika input.png  
input.png: PNG image data (image)
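If this default output needs to be consumed by another program, the line can be split into its parts. The sketch below assumes the "path: description (group)" layout seen in the single sample above and plain, uncolored text. That layout is inferred from this one line rather than from a documented format, so the --json option is probably the more robust choice for real use.

```python
import re

# Layout inferred from the sample line: "<path>: <description> (<group>)"
LINE_RE = re.compile(r"^(?P<path>.+): (?P<description>.+) \((?P<group>[^()]+)\)$")


def parse_default_line(line):
    """Split one line of default magika output into a dict, or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


print(parse_default_line("input.png: PNG image data (image)"))
```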

Python API

Having to use the Path class from the pathlib module is slightly cumbersome.
Other than that, the API is very simple.

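from pathlib import Path

from magika import Magika

# Create the detector once and reuse it for every file
magika = Magika()
# identify_path takes a pathlib.Path
result = magika.identify_path(Path('input.png'))
# ct_label is the short content type label
print(result.output.ct_label)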

Running the above produces the following output.

png
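The result object carries more than the label. The sketch below formats a few more fields for a batch of files. It assumes that in the 0.5.x series result.path and result.output expose ct_label, mime_type, and score, which is my reading of that version's result type, and newer releases may use different names. The helper names are made up, and magika is imported lazily so the formatting helper can be tried with a stand-in object even where Magika is not installed.

```python
from pathlib import Path
from types import SimpleNamespace


def summarize(result):
    """Format selected fields of a Magika result as one line (hypothetical helper)."""
    out = result.output
    return f"{result.path}: {out.ct_label} ({out.mime_type}), score={out.score:.2f}"


def identify_all(paths):
    """Identify several files with one Magika instance (assumes magika 0.5.x)."""
    from magika import Magika  # imported lazily, see the note above

    magika = Magika()
    return [summarize(magika.identify_path(Path(p))) for p in paths]


# Stand-in with the assumed attribute layout, not a real Magika result.
fake = SimpleNamespace(
    path="input.png",
    output=SimpleNamespace(ct_label="png", mime_type="image/png", score=0.99),
)
print(summarize(fake))
```

With Magika installed, identify_all(["input.png"]) would return one such line per file.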