Initial commit

Joop Schilder 2021-03-23 21:58:40 +01:00
commit 61623f5a35
30 changed files with 1936 additions and 0 deletions

2
.gitignore vendored Normal file

@ -0,0 +1,2 @@
/vendor/
/.idea/

93
README.md Normal file

@ -0,0 +1,93 @@
# PDF Finder 🗞 🖇
This is a simple command-line utility that searches any directory (recursively) for PDF documents.
I have a lot of PDF documents spread around my home directory and its subfolders, and I'm too unorganized to do something
about it. Instead of taking an hour to organize the files, I took 7 hours to write this program. It uses `pdfinfo` to
collect metadata. The same can probably be achieved with simple shell scripts (globbing combined with `grep`, `sed`
and `awk` gets you very, very far). I chose PHP because I wanted to do something more with this (a JSON API for my home
network). That part is left as an exercise for the reader.
There are two executables in `bin`: `pdf-finder.php` and `pdf-show-info.php`.
## Runtime requirements
To run it, you need [Composer](https://getcomposer.org/) and [PHP >= 7.4](https://www.php.net/), as well
as [poppler-utils](https://poppler.freedesktop.org/), which provides `pdfinfo`. Installing poppler-utils on Ubuntu is very simple:
```sh
$ sudo apt update && sudo apt install poppler-utils
```
## Finding documents: `bin/pdf-finder.php`
The first executable, `pdf-finder.php`, is used to actually find PDFs based on search terms. The first argument is the
directory to search; when it is omitted, the current working directory is used. Filters are optional.
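For comparison, the directory scan on its own can be approximated in plain shell. This is a minimal sketch, not what `pdf-finder.php` does internally: `find_pdfs` is a hypothetical helper name, and it only lists files whose name ends in `.pdf` (in any letter case) without reading any metadata.

```sh
# Hypothetical helper (not part of this repository): list every PDF below a
# directory. Defaults to the current directory when no argument is given.
find_pdfs() {
  # The bracket pattern matches the extension case-insensitively without
  # relying on the non-POSIX -iname option.
  find "${1:-.}" -type f -name '*.[pP][dD][fF]' | sort
}
```

Calling `find_pdfs ~/Documents` prints one path per line.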
### Examples
To find every PDF document with 'python' in its path, filename or any metadata field in the ~/Documents folder:
```sh
$ bin/pdf-finder.php ~/Documents python
```
... with 'python' in the title (metadata property):
```sh
$ bin/pdf-finder.php ~/Documents title=python
```
... with 'ritchie' in the author field and where the title property is set:
```sh
$ bin/pdf-finder.php ~/Documents author=ritchie title=
```
... with 'programming' and 'python' in the filename:
```sh
$ bin/pdf-finder.php ~/Documents filename=programming filename=python
```
### Available filters
Filters are based on the information supplied by the `pdfinfo`
command [(man page here)](https://www.xpdfreader.com/pdfinfo-man.html). Dates, when given, are printed in ISO-8601
format. Common fields are listed below. `filepath` (or `path`) is the path excluding the filename. `filename` (or `file`
or `name`) is the name of the file excluding the path.
| Common filters |
| :--- |
| `filepath`, `path` |
| `filename`, `file`, `name` |
| `title` |
| `subject` |
| `keywords` |
| `author` |
| `creator` |
| `producer` |
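
The filter names are the `pdfinfo` field names in lowercase, with spaces replaced by underscores (so `Page size` becomes `page_size`); fields with an empty value are treated as unset. The shell sketch below illustrates that mapping under simplifying assumptions: `pdfinfo_fields` is a hypothetical helper, it reads `pdfinfo`-style `Key: value` lines from standard input, and it skips the transliteration that the PHP code delegates to a slugify library.

```sh
# Hypothetical helper (not part of this repository): turn "Key:   value"
# lines into "key=value" lines with lowercase, underscore-separated keys.
pdfinfo_fields() {
  awk '
    {
      # Split on the first colon only, so values such as timestamps survive.
      pos = index($0, ":")
      if (pos == 0) next
      key = tolower(substr($0, 1, pos - 1))
      value = substr($0, pos + 1)
      gsub(/[^a-z0-9]+/, "_", key)
      sub(/^[ \t]+/, "", value)
      sub(/[ \t]+$/, "", value)
      # Fields without a value count as unset and are dropped.
      if (value != "") print key "=" value
    }'
}
```

With poppler-utils installed, `pdfinfo -isodates document.pdf | pdfinfo_fields` prints one `key=value` line per populated field.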
### A note on filters
Roughly half of the PDF files on my computer contain usable metadata, and it is almost never complete, although this
depends on where the files came from.
Using `path=python` yields the same results as `filepath=python`: `path` is an alias for `filepath`. The same goes
for `file` and `name`: both are aliases for `filename`.
Filters are cumulative: a document must match every filter, so adding more filters further restricts the output.
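How cumulative, case-insensitive filters behave can be sketched in shell as well. The helper below is hypothetical and simplified: `matches_all` reads `key=value` lines describing one document from standard input (lowercase keys, sample data invented for illustration), and it does not resolve the `path`, `file` and `name` aliases.

```sh
# Hypothetical helper (not part of this repository): succeed only if every
# filter given as an argument matches the "key=value" lines on stdin.
#   key=term  the value of that field must contain term
#   key=      the field only has to be present
#   term      any value must contain term
matches_all() {
  fields=$(cat)
  for filter in "$@"; do
    case $filter in
      ?*=*)
        key=${filter%%=*}   # everything before the first '='
        term=${filter#*=}   # everything after the first '='
        printf '%s\n' "$fields" | grep -i "^$key=" | cut -d= -f2- \
          | grep -qiF -- "$term" || return 1
        ;;
      *)
        printf '%s\n' "$fields" | cut -d= -f2- \
          | grep -qiF -- "$filter" || return 1
        ;;
    esac
  done
}

# Invented sample document.
doc='filename=notes.pdf
title=Learning Python
author=Jane Doe'
printf '%s\n' "$doc" | matches_all python title=learning && echo "match"
```

Every filter has to succeed before the function returns success, which is why each extra filter can only shrink the result set.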
## Listing document info: `bin/pdf-show-info.php`
The second utility is basically a fancy wrapper for `pdfinfo`. It takes one argument, the path to a PDF document, and
spits out a table with information about the document.
```sh
$ bin/pdf-show-info.php ~/path/to/document.pdf
```
## Final note
Do as you please, as that is the beauty of open source.

26
bin/pdf-finder.php Executable file

@ -0,0 +1,26 @@
#!/usr/bin/env php
<?php
use IO\ExceptionHandler;
use IO\Input\FinderArguments;
use IO\Output\DocumentListingOutput;
use PDF\Document;
require_once __DIR__ . '/../vendor/autoload.php';
ExceptionHandler::registerCallback();
$arguments = FinderArguments::createFromGlobals();
$directory = $arguments->getDirectory();
$filters = $arguments->getFilters();
printf('Scanning "%s"...%s', $directory, PHP_EOL);
$locator = new RecursiveDocumentLocator();
$documents = $locator->findDocuments($directory);
foreach ($filters as $filter) {
printf('Applying filter { %s }...%s', $filter, PHP_EOL);
$documents = $documents->filter(fn(Document $document) => $filter->allows($document));
}
DocumentListingOutput::forDocuments($documents)->render();

19
bin/pdf-show-info.php Executable file

@ -0,0 +1,19 @@
#!/usr/bin/env php
<?php
use IO\ExceptionHandler;
use IO\Input\ShowInfoArguments;
use IO\Output\DocumentOutput;
require_once __DIR__ . '/../vendor/autoload.php';
ExceptionHandler::registerCallback();
$arguments = ShowInfoArguments::createFromGlobals();
$file = $arguments->getFile();
$documentFactory = DocumentFactory::create();
$document = $documentFactory->createDocument($file);
$output = DocumentOutput::forDocument($document);
$output->render();

19
composer.json Normal file

@ -0,0 +1,19 @@
{
"name": "joopschilder/pdf-finder",
"type": "project",
"license": "MIT",
"keywords": ["pdf", "documents", "search", "metadata", "info", "portable document format"],
"description": "Utility to locate PDF files based on their metadata",
"autoload": {
"psr-0": {
"": [
"src"
]
}
},
"require": {
"symfony/console": "^5.2",
"cocur/slugify": "^4.0",
"illuminate/collections": "^8.33"
}
}

1085
composer.lock generated Normal file

File diff suppressed because it is too large

25
src/DocumentFactory.php Normal file

@ -0,0 +1,25 @@
<?php
use IO\Shell\Pdfinfo;
use PDF\Document;
class DocumentFactory
{
private Pdfinfo $pdfinfo;
public function __construct(?Pdfinfo $pdfinfo = null)
{
$this->pdfinfo = $pdfinfo ?? new Pdfinfo();
}
public static function create(): self
{
return new self();
}
public function createDocument(SplFileInfo $file): Document
{
// Pass the path explicitly rather than relying on SplFileInfo's string conversion.
$metadata = $this->pdfinfo->getMetadata($file->getPathname());
return new Document($file, $metadata);
}
}


@ -0,0 +1,12 @@
<?php
namespace Filter;
use PDF\Document;
interface DocumentFilter
{
public function allows(Document $document): bool;
public function __toString(): string;
}


@ -0,0 +1,15 @@
<?php
namespace Filter;
class FilterFactory
{
public function createFromString(string $string): DocumentFilter
{
if (preg_match('/^.+=.*$/', $string)) {
// Split on the first '=' only, so the search term itself may contain '='.
[$prop, $term] = explode('=', $string, 2);
return new SpecificFilter(trim($prop), trim($term));
}
return new GenericFilter($string);
}
}


@ -0,0 +1,34 @@
<?php
namespace Filter;
use PDF\Document;
class GenericFilter implements DocumentFilter
{
private string $term;
public function __construct(string $term)
{
$this->term = $term;
}
public function allows(Document $document): bool
{
if ($this->term === '') {
return true;
}
foreach ($document->getProperties() as $value) {
// Unset metadata fields are null; skip them instead of passing null to stripos().
if ($value !== null && stripos($value, $this->term) !== false) {
return true;
}
}
return false;
}
public function __toString(): string
{
return sprintf('[*] contains \'%s\'', $this->term);
}
}


@ -0,0 +1,42 @@
<?php
namespace Filter;
use PDF\Document;
use RuntimeException;
class SpecificFilter implements DocumentFilter
{
private string $property;
private string $term;
public function __construct(string $property, string $term)
{
$this->property = strtolower($property);
$this->term = strtolower($term);
}
public function allows(Document $document): bool
{
if ($this->property === '') {
return true;
}
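try {
$property = $document->getProperty($this->property);
} catch (RuntimeException $e) {
// No such property exists, we don't pass
return false;
}
if ($property === null || $property === '') {
// The property is not set, so nothing can match (not even "prop=").
// Without this check, stripos() with an empty needle returns 0 on PHP 8
// and "prop=" would let every document through.
return false;
}
// Filter is "prop=", which only checks if it exists.
return $this->term === '' || stripos($property, $this->term) !== false;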
}
public function __toString(): string
{
return sprintf('property \'%s\' contains \'%s\'', $this->property, $this->term);
}
}


@ -0,0 +1,11 @@
<?php
namespace IO\Exception;
class DirectoryNotFoundException extends IOException
{
public function __construct(string $directory)
{
parent::__construct(sprintf('Directory \'%s\' not found', $directory));
}
}


@ -0,0 +1,11 @@
<?php
namespace IO\Exception;
class FileNotFoundException extends IOException
{
public function __construct(string $file)
{
parent::__construct(sprintf('File \'%s\' not found', $file));
}
}


@ -0,0 +1,11 @@
<?php
namespace IO\Exception;
class FileNotReadableException extends IOException
{
public function __construct(string $file)
{
parent::__construct(sprintf('File \'%s\' is not readable', $file));
}
}


@ -0,0 +1,9 @@
<?php
namespace IO\Exception;
use RuntimeException;
abstract class IOException extends RuntimeException
{
}


@ -0,0 +1,12 @@
<?php
namespace IO\Exception;
class MissingFileArgumentException extends IOException
{
public function __construct()
{
parent::__construct('Missing file argument');
}
}


@ -0,0 +1,11 @@
<?php
namespace IO\Exception;
class NotADirectoryException extends IOException
{
public function __construct(string $directory)
{
parent::__construct(sprintf('Argument \'%s\' is not a directory', $directory));
}
}


@ -0,0 +1,24 @@
<?php
namespace IO;
use Throwable;
class ExceptionHandler
{
private static bool $registered = false;
public static function registerCallback(): void
{
if (self::$registered) {
return;
}
set_exception_handler(static function (Throwable $t) {
// Report on stderr, newline-terminated, so errors stay out of piped output.
fwrite(STDERR, $t->getMessage() . PHP_EOL);
exit(1);
});
self::$registered = true;
}
}


@ -0,0 +1,18 @@
<?php
namespace IO\Input;
trait ArgvAccess
{
protected static function getArguments(): array
{
// Get local copy of $argv
global $argv;
$arguments = $argv;
// Lose the script name
array_shift($arguments);
return $arguments;
}
}


@ -0,0 +1,55 @@
<?php
namespace IO\Input;
use Filter\DocumentFilter;
use Filter\FilterFactory;
use IO\Exception\DirectoryNotFoundException;
use IO\Exception\NotADirectoryException;
class FinderArguments
{
use ArgvAccess;
private ?string $directory;
private array $filters;
public function __construct(?string $directory, array $filters)
{
$this->directory = $directory;
// Turn the raw filter strings into DocumentFilter objects.
$factory = new FilterFactory();
$this->filters = array_map([$factory, 'createFromString'], $filters);
}
public static function createFromGlobals(): self
{
$arguments = self::getArguments();
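$dir = array_shift($arguments) ?? getcwd();
// Strip trailing separators, but keep a bare root directory ("/") intact.
$trimmed = rtrim($dir, DIRECTORY_SEPARATOR);
$dir = ($trimmed === '' && $dir !== '') ? DIRECTORY_SEPARATOR : $trimmed;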
return new self($dir, $arguments);
}
public function getDirectory(): string
{
if (!file_exists($this->directory)) {
throw new DirectoryNotFoundException($this->directory);
}
if (!is_dir($this->directory)) {
throw new NotADirectoryException($this->directory);
}
return $this->directory;
}
/**
* @return DocumentFilter[]
*/
public function getFilters(): array
{
return $this->filters;
}
}


@ -0,0 +1,41 @@
<?php
namespace IO\Input;
use IO\Exception\FileNotFoundException;
use IO\Exception\FileNotReadableException;
use IO\Exception\MissingFileArgumentException;
use SplFileInfo;
class ShowInfoArguments
{
use ArgvAccess;
private ?string $file;
public function __construct(?string $file)
{
$this->file = $file;
}
public static function createFromGlobals(): self
{
$arguments = self::getArguments();
return new self(array_shift($arguments));
}
public function getFile(): SplFileInfo
{
if (is_null($this->file)) {
throw new MissingFileArgumentException();
}
if (!file_exists($this->file)) {
throw new FileNotFoundException($this->file);
}
if (!is_readable($this->file)) {
throw new FileNotReadableException($this->file);
}
return new SplFileInfo($this->file);
}
}


@ -0,0 +1,67 @@
<?php
namespace IO\Output;
use PDF\Document;
use Symfony\Component\Console\Output\OutputInterface;
class DocumentListingOutput implements Output
{
/** @var Document[] */
private iterable $documents;
public function __construct(iterable $documents)
{
$this->documents = $documents;
}
public static function forDocuments(iterable $documents): self
{
return new self($documents);
}
public function render(?OutputInterface $output = null): void
{
if (count($this->documents) === 0) {
print('Your search yielded no results.' . PHP_EOL);
return;
}
$template = new TableTemplate([
'Filename' => [
'min_width' => 40,
'max_width' => 80,
],
'Title' => [
'min_width' => 40,
'max_width' => 80,
'null_value' => '-',
],
'Author' => [
'min_width' => 16,
'max_width' => 32,
'null_value' => '-',
],
'Path' => [
'min_width' => 16,
'max_width' => 32,
'formatter' => static function (string $path) {
$search = sprintf('/home/%s', get_current_user());
return str_replace($search, '~', $path);
},
],
]);
foreach ($this->documents as $document) {
$template->addRow([
$document->file->getBasename(),
$document->metadata->title,
$document->metadata->author,
$document->file->getPath(),
]);
}
$template->generate($output)->render();
}
}


@ -0,0 +1,42 @@
<?php
namespace IO\Output;
use PDF\Document;
use Symfony\Component\Console\Output\OutputInterface;
class DocumentOutput implements Output
{
private Document $document;
public function __construct(Document $document)
{
$this->document = $document;
}
public static function forDocument(Document $document): self
{
return new self($document);
}
public function render(?OutputInterface $output = null): void
{
$template = new TableTemplate([
'Property' => [
'min_width' => 20,
'max_width' => 20,
],
'Value' => [
'min_width' => 80,
'max_width' => 80,
'null_value' => '-',
],
]);
foreach ($this->document->getProperties() as $property => $value) {
$template->addRow([$property, $value]);
}
$template->generate($output)->render();
}
}

8
src/IO/Output/Output.php Normal file

@ -0,0 +1,8 @@
<?php
namespace IO\Output;
interface Output
{
public function render(): void;
}


@ -0,0 +1,74 @@
<?php
namespace IO\Output;
use Symfony\Component\Console\Helper\Table;
use Symfony\Component\Console\Output\ConsoleOutput;
use Symfony\Component\Console\Output\OutputInterface;
class TableTemplate
{
private array $headers;
private array $properties;
private array $rows = [];
public function __construct(array $properties)
{
$this->headers = array_keys($properties);
$this->properties = array_values($properties);
}
public function addRow(array $row): void
{
$row = array_values($row);
foreach ($row as $columnIndex => &$value) {
if (isset($this->properties[$columnIndex]['null_value'])) {
$value ??= $this->properties[$columnIndex]['null_value'];
}
if (isset($this->properties[$columnIndex]['formatter'])) {
$value = call_user_func($this->properties[$columnIndex]['formatter'], $value);
}
if (isset($this->properties[$columnIndex]['max_width'])) {
$value = $this->trim($value, $this->properties[$columnIndex]['max_width']);
}
}
unset($value);
$this->rows[] = $row;
}
public function generate(?OutputInterface $output = null): Table
{
$table = new Table($output ?? new ConsoleOutput());
$table->setStyle('box-double');
$table->setHeaders($this->headers);
foreach ($this->properties as $columnIndex => $columnProperties) {
if (isset($columnProperties['min_width'])) {
$table->setColumnWidth($columnIndex, $columnProperties['min_width']);
}
if (isset($columnProperties['max_width'])) {
$table->setColumnMaxWidth($columnIndex, $columnProperties['max_width']);
}
}
$table->setRows($this->rows);
return $table;
}
/**
* Trims a string if it's longer than $length and adds '...' to the end if trimmed.
* @param string $string
* @param int $length
* @return string
*/
private function trim(string $string, int $length): string
{
// Multibyte-aware (requires mbstring), so UTF-8 text is never cut in the middle of a character.
if (mb_strlen($string) <= $length) {
return $string;
}
return mb_substr($string, 0, $length - 3) . '...';
}
}

25
src/IO/Shell/Pdfinfo.php Normal file

@ -0,0 +1,25 @@
<?php
namespace IO\Shell;
use PDF\Metadata;
class Pdfinfo
{
use ShellCommandExecutor;
public function getMetadata(string $filepath): Metadata
{
$lines = $this->shellExec('pdfinfo', '-isodates', $filepath);
$data = [];
foreach ($lines as $line) {
$parts = explode(':', $line, 2);
if (count($parts) === 2) {
$data[trim($parts[0])] = trim($parts[1]);
}
}
return (new Metadata)->fillWith($data);
}
}


@ -0,0 +1,19 @@
<?php
namespace IO\Shell;
trait ShellCommandExecutor
{
protected function shellExec(string $command, string ...$args): array
{
$args = array_map('escapeshellarg', $args);
$output = shell_exec(sprintf(
'%s %s 2>/dev/null',
escapeshellcmd($command),
implode(' ', $args)
));
// shell_exec() returns null when the command fails or prints nothing.
return explode(PHP_EOL, (string) $output);
}
}

43
src/PDF/Document.php Normal file

@ -0,0 +1,43 @@
<?php
namespace PDF;
use RuntimeException;
use SplFileInfo;
class Document
{
public SplFileInfo $file;
public Metadata $metadata;
public function __construct(SplFileInfo $file, ?Metadata $metadata = null)
{
$this->file = $file;
$this->metadata = $metadata ?? new Metadata();
}
public function getProperty(string $prop): ?string
{
if (in_array($prop, ['path', 'filepath'])) {
return $this->file->getPath();
}
if (in_array($prop, ['file', 'name', 'filename'])) {
return $this->file->getBasename();
}
if (property_exists($this->metadata, $prop)) {
return $this->metadata->{$prop};
}
throw new RuntimeException('No such property');
}
public function getProperties(): array
{
return [
'filepath' => $this->file->getPath(),
'filename' => $this->file->getBasename(),
] + $this->metadata->toArray();
}
}

53
src/PDF/Metadata.php Normal file

@ -0,0 +1,53 @@
<?php
namespace PDF;
use Cocur\Slugify\Slugify;
class Metadata
{
public ?string $abbreviation = null;
public ?string $author = null;
public ?string $creationdate = null;
public ?string $creator = null;
public ?string $encrypted = null;
public ?string $form = null;
public ?string $javascript = null;
public ?string $keywords = null;
public ?string $linearized = null;
public ?string $moddate = null;
public ?string $optimized = null;
public ?string $page_rot = null;
public ?string $page_size = null;
public ?string $pages = null;
public ?string $pdf_subtype = null;
public ?string $pdf_version = null;
public ?string $producer = null;
public ?string $standard = null;
public ?string $subject = null;
public ?string $subtitle = null;
public ?string $suspects = null;
public ?string $tagged = null;
public ?string $title = null;
public ?string $userproperties = null;
public function fillWith(array $array): Metadata
{
$slugify = new Slugify(['separator' => '_']);
$array = array_filter($array, static fn(string $v) => trim($v) !== '');
foreach ($array as $key => $value) {
$key = $slugify->slugify($key);
if (property_exists(__CLASS__, $key)) {
$this->{$key} = trim($value);
}
}
return $this;
}
public function toArray(): array
{
return get_object_vars($this);
}
}


@ -0,0 +1,30 @@
<?php
use Illuminate\Support\Collection;
use PDF\Document;
class RecursiveDocumentLocator
{
private DocumentFactory $documentFactory;
public function __construct(?DocumentFactory $documentFactory = null)
{
$this->documentFactory = $documentFactory ?? new DocumentFactory();
}
/**
* @return Collection<Document>|Document[]
*/
public function findDocuments(string $directory): Collection
{
$iterator = new RecursiveIteratorIterator(
new RecursiveDirectoryIterator($directory),
RecursiveIteratorIterator::SELF_FIRST
);
return collect($iterator)
->filter(static fn(SplFileInfo $fileInfo) => $fileInfo->isFile())
// Escape the dot so only a literal ".pdf" extension matches (not e.g. "notapdf").
->filter(static fn(SplFileInfo $fileInfo) => preg_match('/\.pdf$/i', $fileInfo->getBasename()) === 1)
->map(fn(SplFileInfo $fileInfo) => $this->documentFactory->createDocument($fileInfo));
}
}