# pdfgrab

## What is it?

This is a reborn tool, first used back in the epoch when dinosaurs were traipsing the earth.
Basically it analyses PDF files for metadata. You can point it at a single file or at a directory of PDFs.
You can also give it the URL of a PDF, or use the integrated googlesearch class (thanks to Mario Vilas)
to search a target site for PDFs, then download and analyse them.

## What information can be gathered?

This depends on the software used to create the PDF and on whether the file has been cleaned.
However, the following fields are common:

* Producer
* Creator
* CreationDate
* ModificationDate (shown as `/ModDate` in the output)
* Author
* Title
* Subject

and some more :)
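
The date fields come back as PDF date strings such as `D:20040714015300` or `D:20181102231821+01'00'`. If you want real timestamps, a minimal sketch like the one below can convert them. `parse_pdf_date` is a hypothetical helper and not part of pdfgrab; it assumes the usual `D:YYYYMMDDHHmmSS` layout with an optional `+HH'mm'`, `-HH'mm'` or `Z` suffix.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS, everything after the year optional, then an optional offset.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<off_h>\d{2})'?(?:(?P<off_m>\d{2})'?)?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()
    # Missing month and day default to 1, missing time fields default to 0.
    moment = datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
    )
    sign = parts["sign"]
    if sign is None:
        return moment  # no offset in the string, so the result stays naive
    offset = timedelta(
        hours=int(parts["off_h"] or 0),
        minutes=int(parts["off_m"] or 0),
    )
    if sign == "-":
        offset = -offset
    return moment.replace(tzinfo=timezone(offset))
```

For example, `parse_pdf_date("D:20181102231821+01'00'").isoformat()` gives `2018-11-02T23:18:21+01:00`. A string without an offset returns a naive datetime, because the file does not say which timezone was meant.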

## How does it work?

Almost every file type more complex than .txt carries metadata, whether for convenience, for customer support, or just to advertise which software produced it.
There is a lot of information online about metadata in different kinds of files, such as pictures, documents, videos and music. This tool
focuses on PDF only.
If you are new to the term, have a look here:
https://en.wikipedia.org/wiki/Metadata
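
To get a feeling for where this metadata lives, here is a deliberately naive sketch that scans the raw bytes of a PDF for entries like `/Producer (pdfTeX-0.14h)`. This is not necessarily how pdfgrab does it; `scan_info_entries` is a hypothetical helper. It assumes the values are stored as plain literal strings, so it misses hex or UTF-16 strings, compressed object streams, encrypted files and values with nested parentheses. A real PDF parsing library handles those cases.

```python
import re

# Keys this sketch looks for in the document information dictionary.
INFO_KEYS = ("Producer", "Creator", "CreationDate", "ModDate", "Author", "Title", "Subject")

# Matches "/Key (literal string)". Backslash escapes are allowed inside the
# string, unescaped nested parentheses are not.
_ENTRY = re.compile(
    rb"/(" + "|".join(INFO_KEYS).encode("ascii") + rb")\s*\(((?:\\.|[^\\()])*)\)",
    re.DOTALL,
)


def scan_info_entries(raw):
    """Return the first literal-string value found for each known key."""
    found = {}
    for match in _ENTRY.finditer(raw):
        key = match.group(1).decode("ascii")
        # Keep only the first occurrence; escape sequences are left as-is.
        found.setdefault(key, match.group(2).decode("latin-1"))
    return found
```

You would call it as `scan_info_entries(open("some.pdf", "rb").read())`, where `some.pdf` stands for any local file.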

## Download

```
git clone https://github.com/c0decave/pdfgrab
cd pdfgrab
python3 pdfgrab.py -h
```

## Usage

These are your major options:
* grab pdf from url and analyse
* search site for pdfs via google, grab and analyse
* analyse a local pdf
* analyse a local directory with pdfs in it
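
As a rough illustration of how four modes like these can map onto command line flags, here is an argparse sketch using the `-u`, `-s`, `-f` and `-F` flags from the examples in this section. It is not the actual pdfgrab source. The `dest` names are made up, and making the modes mutually exclusive is an assumption. Run `python3 pdfgrab.py -h` for the real option list.

```python
import argparse


def build_parser():
    """Build a parser with one flag per mode; exactly one must be given."""
    parser = argparse.ArgumentParser(prog="pdfgrab.py")
    # Assumption: the modes exclude each other. The real tool may differ.
    modes = parser.add_mutually_exclusive_group(required=True)
    modes.add_argument("-u", dest="url", help="grab a PDF from a URL and analyse it")
    modes.add_argument("-s", dest="search", help="search expression, e.g. site:kernel.org")
    modes.add_argument("-f", dest="file", help="analyse a local PDF")
    modes.add_argument("-F", dest="directory", help="analyse all PDFs in a directory")
    return parser
```

With this sketch, `build_parser().parse_args(["-F", "pdfgrab/"])` leaves `directory` set to `pdfgrab/` and the other three attributes set to `None`.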

### Single Url Mode

```
# ./pdfgrab.py -u https://www.kernel.org/doc/mirror/ols2004v2.pdf
```
Result:
```
[+] Grabbing https://www.kernel.org/doc/mirror/ols2004v2.pdf
[+] Written 3893173 bytes for File: pdfgrab/ols2004v2.pdf
[+] Opening pdfgrab/ols2004v2.pdf
--------------------------------------------------------------------------------
File: pdfgrab/ols2004v2.pdf
/Producer pdfTeX-0.14h
/Creator TeX
/CreationDate D:20040714015300
--------------------------------------------------------------------------------
```
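
The output above shows the download landing in `pdfgrab/ols2004v2.pdf`, so the local name follows the last part of the URL. A minimal sketch of that step, using only the standard library, could look like this. `local_path_for` and `grab` are hypothetical names, and the `pdfgrab` default directory is taken from the output above, not from the source code.

```python
import posixpath
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlsplit


def local_path_for(url, out_dir="pdfgrab"):
    """Map a PDF URL to a file inside out_dir, named after the last path segment."""
    name = posixpath.basename(unquote(urlsplit(url).path))
    if name in ("", ".", ".."):
        raise ValueError(f"URL has no usable file name: {url!r}")
    return Path(out_dir) / name


def grab(url, out_dir="pdfgrab", timeout=30):
    """Download url into out_dir and return (path, number of bytes written)."""
    target = local_path_for(url, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    target.write_bytes(data)
    return target, len(data)
```

Note that `grab` reads the whole response into memory, which is fine for a sketch but not ideal for very large files.
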
### Single File Mode

```
# ./pdfgrab.py -f pdfgrab/ols2004v2.pdf 
```
Result:
```
[+] Parsing pdfgrab/ols2004v2.pdf
[+] Opening pdfgrab/ols2004v2.pdf
--------------------------------------------------------------------------------
File: pdfgrab/ols2004v2.pdf
/Producer pdfTeX-0.14h
/Creator TeX
/CreationDate D:20040714015300
--------------------------------------------------------------------------------
```

### Directory Mode

```
./pdfgrab.py -F pdfgrab/
```
This analyses all PDFs in that directory.
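
Collecting the files for a run like this can be sketched in a few lines. `find_pdfs` is a hypothetical helper. It only looks at the top level of the directory and matches on the `.pdf` extension; whether pdfgrab itself descends into subdirectories is not stated here.

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside directory, sorted by name."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        # Match the extension case-insensitively so "a.PDF" is picked up too.
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```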


### Google Search Mode
```
# ./pdfgrab.py -s site:kernel.org
```
Result:
```
[+] Seek and analysing site:kernel.org
http://vger.kernel.org/lpc_bpf2018_talks/bpf_global_data_and_static_keys.pdf
http://vger.kernel.org/netconf2018_files/JiriPirko_netconf2018.pdf
http://vger.kernel.org/netconf2018_files/PaoloAbeni_netconf2018.pdf
http://vger.kernel.org/lpc_net2018_talks/LPC_XDP_Shirokov_paper_v1.pdf
http://vger.kernel.org/netconf2018_files/FlorianFainelli_netconf2018.pdf
http://vger.kernel.org/lpc_net2018_talks/tc_sw_paper.pdf
https://www.kernel.org/doc/mirror/ols2009.pdf
https://www.kernel.org/doc/mirror/ols2004v2.pdf
http://vger.kernel.org/lpc_net2018_talks/ktls_bpf.pdf
http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf

[+] Written 211391 bytes for File: pdfgrab/bpf_global_data_and_static_keys.pdf
[+] Opening pdfgrab/bpf_global_data_and_static_keys.pdf
--------------------------------------------------------------------------------
File: pdfgrab/bpf_global_data_and_static_keys.pdf
/Author 
/Title 
/Subject 
/Creator LaTeX with Beamer class version 3.36
/Producer pdfTeX-1.40.17
/Keywords 
/CreationDate D:20181102231821+01'00'
/ModDate D:20181102231821+01'00'
/Trapped /False
/PTEX.Fullbanner This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
```
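
The search step boils down to sending a query and keeping the hits that point to PDFs. The sketch below shows only the string handling around that, not the search call itself. `build_query` and `keep_pdf_urls` are hypothetical helpers, and prepending `filetype:pdf` is an assumption about how the query could be narrowed, not a statement about what pdfgrab sends.

```python
from urllib.parse import urlsplit


def build_query(target):
    """Restrict a search expression such as "site:kernel.org" to PDF results."""
    target = target.strip()
    if "filetype:pdf" in target.lower().split():
        return target  # already restricted, leave it alone
    return f"filetype:pdf {target}"


def keep_pdf_urls(urls):
    """Drop duplicates and URLs whose path does not end in .pdf, keeping order."""
    seen = set()
    kept = []
    for url in urls:
        if url in seen or not urlsplit(url).path.lower().endswith(".pdf"):
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

Here `build_query("site:kernel.org")` returns `filetype:pdf site:kernel.org`.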

## TODO
* json output
* txt output
* catch connection refused errors
* set option for certificate verification, default is false
* complete analyse.txt and separated
* clean up code
* do more testing
* add random useragent for google and website pdf gathering
* ~~add decryption routine~~
* ~~catch ssl exceptions~~


## Google

* Search: filetype:pdf site:com
* Results: 264.000.000

## Disclaimer

Have fun!