Compare commits

...

10 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| c0decave | 132750867f | release version 0.4.9 | 2019-11-07 12:51:30 +01:00 |
| c0decave | a89ac93c3d | bugfix pre-release 0.4.8 | 2019-11-06 12:49:54 +01:00 |
| c0decave | 4f63e62690 | updated Readme.md | 2019-11-05 14:50:02 +01:00 |
| c0decave | e1d7c3f760 | release of version 0.4.7 added html reporting, added logging, reordered libraries, added experimental xmp meta data, fixed bug introduced due xmp meta data, added todo list | 2019-11-05 14:42:24 +01:00 |
| c0decave | fa3b925d6f | yak yak yak | 2019-10-02 18:39:23 +02:00 |
| dash | f7339cda5f | yak yak yak | 2019-10-02 18:38:31 +02:00 |
| dash | 1c6d03ef0e | spelling error | 2019-10-02 18:19:05 +02:00 |
| dash | 5f9cdf86d1 | several changes, new features, bugfixes | 2019-10-02 18:17:41 +02:00 |
| dash | 64f48eef9a | cert-check, random-user-agent, catching too many requests | 2019-09-26 18:45:02 +02:00 |
| dash | fb2dfb1527 | updated readme | 2019-09-26 17:50:10 +02:00 |
13 changed files with 987 additions and 221 deletions

README.md

@@ -1,11 +1,26 @@
# pdfgrab
* Version 0.4.9

## What is it?
This is a reborn tool, used back in the epoch when dinosaurs were traipsing the earth.
Basically it analyses PDF files for metadata. You can point it at a file or a directory of PDFs.
You can give it the URL of a PDF or use the integrated googlesearch class (thanks to Mario Vilas)
to search for PDFs at a target site, download them and analyse them.

## What is new in 0.4.9?
* exported reporting methods to libreport.py
* added optargs for disabling the different report methods
* made the html report a bit more shiny
* added a function for generating the html report after analysis
* exported requests and data storage to a new library
* code fixes and clearer error handling
* removed the mandatory site: parameter at the search flag -s
* updated readme
* -s flag now accepts several domains
* cleaner console logging

## What information can be gathered?

@@ -22,18 +37,49 @@ However, common are the following things:
and some more :)

## What is this for anyways?
Well, this can be used for a range of things. However, I will only focus on the
security part of it. Depending on your target, you will get information about:
* software used in company xyz
  * possible version numbers
  * this will help you to identify existing vulnerabilities
* sometimes PDFs are re-rendered, for instance on upload
  * now you can figure out what the rendering engine is and find bugs in it
* who the author of a document is
  * sometimes usernames are users of the OS itself
  * congrats, by analysing a PDF you just found an existing username in the domain
  * combine this information with the first part and you know which user uses which software
* passwords ... do I need to say more?

## Is it failproof?
Not at all. Please note that metadata, like any other data, is just written to the file, so it can be changed before the file is uploaded. That said, the share of companies that really change this sort of data is maybe 20%. You will also recognize when it is empty or the like.

## How does it work?
Every file type more complex than .txt or the like uses metadata for convenience, customer support or simply to advertise what it was made with.
There is a lot of information online about metadata in different sorts of files like pictures, documents, videos and music. This tool
focuses on PDF only.
If you are new to the term, have a look here:
* https://en.wikipedia.org/wiki/Metadata

Also, if you are interested in real PDF analysis, this tool will only do the basics for you. It has not been written to analyse malicious or otherwise interesting files. Its purpose is to give you an idea of what is used at target xyz; a minimal extraction sketch follows below.
If you are looking for more in-depth analysis, I recommend the tools of Didier Stevens:
* https://blog.didierstevens.com/programs/pdf-tools/
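
The sketch referenced above: a minimal, hypothetical metadata dump with PyPDF2 (the library pdfgrab itself uses); `example.pdf` stands in for any local file you want to inspect.
```
# Minimal metadata dump with PyPDF2, roughly what pdfgrab automates.
# "example.pdf" is a hypothetical local file.
from PyPDF2 import PdfFileReader

with open("example.pdf", "rb") as fh:
    reader = PdfFileReader(fh)
    info = reader.getDocumentInfo()
    if info is not None:
        for key in info.keys():
            print(key, info[key])
```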
## Download
```
git clone https://github.com/c0decave/pdfgrab
cd pdfgrab
python3 pdfgrab.py -h
```
## Usage
Those are your major options:
* grab pdf from url and analyse
* search site for pdfs via google, grab and analyse
* analyse a local pdf
@@ -73,9 +119,17 @@ File: pdfgrab/ols2004v2.pdf
--------------------------------------------------------------------------------
```
### Directory Mode
```
./pdfgrab.py -F pdfgrab/
```
This will analyse all PDFs in that directory.
### Google Search Mode
```
# ./pdfgrab.py -s kernel.org
```
Result:
```
@@ -107,10 +161,30 @@ File: pdfgrab/bpf_global_data_and_static_keys.pdf
/PTEX.Fullbanner This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
```
### Google Search Mode, several domains
```
# ./pdfgrab.py -s example.com,example.us
```
### Reporting
pdfgrab writes the gathered information in several formats. Unless disabled by one of the reporting flags (see -h), you will
find the following in the output directory (a short sketch of reading the json data follows after the list):
* html report
* text report
* text url list
* json data
* json url list
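
A hypothetical sketch of working with the json data afterwards, assuming the default output directory (`pdfgrab`) and the default outfile name (`pdfgrab_analysis`):
```
# Read pdfgrab's json report and print one metadata field per analysed file.
import json

with open("pdfgrab/pdfgrab_analysis.json") as fr:
    analysis = json.load(fr)

for path, entry in analysis.items():
    print(entry["filename"], entry["data"].get("/Producer", "n/a"))
```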
### Logging
pdfgrab creates a logfile in the running directory called "pdfgrab.log"
## Google
* Search: filetype:pdf site:com
* Results: 264.000.000

## Disclaimer

docs/Changelog (new file, 43 lines)

@@ -0,0 +1,43 @@
Changelog
=========

Version 0.4.9
-------------
* exported reporting methods to libreport.py
* added optargs for disabling the different report methods
* made the html report a bit more shiny
* added a function for generating the html report after analysis
* exported requests and data storage to a new library
* code fixes and clearer error handling
* removed the mandatory site: parameter at the search flag -s
* updated readme
* -s flag now accepts several domains
* cleaner console logging

Version 0.4.8 Bugfix Pre-Release
--------------------------------
* catching google too many requests
* catching urlopen dns not resolvable error
* fixing nasty bug in store_pdf/find_name
* fixing zero size pdf error
* extra logging

Version 0.4.7
-------------
* added html output
* added xmp meta testing

Version 0.4.6
-------------
* added help when no argument is given at the cli
* added googlesearch lib

Version 0.4.5
-------------
* exported helper functions to libs/helper.py
* added libs/liblog.py

docs/Todo (new file, 4 lines)

@@ -0,0 +1,4 @@
* add xmp meta to output files
* code reordering
* clean up parsing functions
* add report formats

libs/__init__.py (new, empty file)

libs/libgoogle.py (new file, 58 lines)

@@ -0,0 +1,58 @@
import googlesearch as gs
import urllib

from libs.libhelper import *


def get_random_agent():
    return (gs.get_random_user_agent())


def hits_google(search, args):
    ''' the function where googlesearch from mario vilas
    is called
    '''
    s = search.split(',')
    query = 'filetype:pdf'
    try:
        hits = gs.hits(query, domains=s, user_agent=gs.get_random_user_agent())
    except urllib.error.HTTPError as e:
        return False, e
    except urllib.error.URLError as e:
        return False, e
    except IndexError as e:
        return False, e
    return True, hits


def search_google(search, args):
    ''' the function where googlesearch from mario vilas
    is called
    '''
    s = search.split(',')
    search_stop = args.search_stop
    query = 'filetype:pdf'
    # query = 'site:%s filetype:pdf' % search
    # print(query)
    urls = []
    try:
        for url in gs.search(query, num=20, domains=s, stop=search_stop, user_agent=gs.get_random_user_agent()):
            # print(url)
            urls.append(url)
    except urllib.error.HTTPError as e:
        # print('Error: %s' % e)
        return False, e
    except urllib.error.URLError as e:
        return False, e
    return True, urls
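
A hypothetical usage sketch (not part of the commit): `search_google()` only reads `search_stop` from the argparse namespace, so a stand-in object is enough here. It assumes the googlesearch package is installed and that the script is run from the repository root so the `libs` package resolves.
```
from types import SimpleNamespace

from libs.libgoogle import search_google

args = SimpleNamespace(search_stop=10)          # stand-in for the argparse namespace
ok, result = search_google("kernel.org", args)  # result is a list of urls, or the caught error
if ok:
    for url in result:
        print(url)
else:
    print("search failed:", result)
```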

libs/libhelper.py (new file, 55 lines)

@@ -0,0 +1,55 @@
import os
import sys

from Crypto.Hash import SHA256


def check_file_size(filename):
    ''' simply check if byte size is bigger than 0 bytes
    '''
    fstat = os.stat(filename)
    if fstat.st_size == 0:
        return False
    return True


def make_directory(outdir):
    ''' naive mkdir function '''
    try:
        os.mkdir(outdir)
    except:
        # print("[W] mkdir, some error, directory probably exists")
        pass


def url_strip(url):
    url = url.rstrip("\n")
    url = url.rstrip("\r")
    return url


def create_sha256(hdata):
    ''' introduced to create hashes of filenames, to have a uniqid
    of course hashes of the file itself will be the next topic
    '''
    hobject = SHA256.new(data=hdata.encode())
    return (hobject.hexdigest())


def find_name(pdf):
    ''' simply parses the urlencoded name and extracts the storage name
    i would not be surprised this naive approach can lead to fuckups
    '''
    name = ''
    # find the name of the file
    name_list = pdf.split("/")
    len_list = len(name_list)
    # ugly magic ;-)
    # what happens is, that files can also be behind urls like:
    # http://host/pdf/
    # so splitting up the url and always going with the last item after the slash
    # can in that case result in an empty name, so we go one field back in the list
    # and use that as the name
    if name_list[len_list - 1] == '':
        name = name_list[len_list - 2]
    else:
        name = name_list[len_list - 1]
    return name
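
A hypothetical usage sketch (not part of the commit) for the two helpers the rest of the code leans on most; the URL is a made-up example.
```
from libs.libhelper import create_sha256, find_name

url = "http://example.com/docs/report.pdf"   # hypothetical url
print(find_name(url))       # -> "report.pdf", the last non-empty path component
print(create_sha256(url))   # hex digest, used as a unique key for the url dictionary
```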

libs/liblog.py (new file, 19 lines)

@@ -0,0 +1,19 @@
import logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
file_handler = logging.FileHandler('pdfgrab.log')
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)
file_formatter = logging.Formatter('%(asctime)s:%(name)s:%(levelname)s:%(message)s')
console_formatter = logging.Formatter('%(levelname)s:%(message)s')
file_handler.setFormatter(file_formatter)
console_handler.setFormatter(console_formatter)
logger.addHandler(file_handler)
logger.addHandler(console_handler)
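
A hypothetical usage sketch (not part of the commit): any module importing this logger gets DEBUG and above written to pdfgrab.log and WARNING and above printed to the console.
```
from libs.liblog import logger

logger.info("written to pdfgrab.log only")
logger.warning("written to pdfgrab.log and printed to the console")
```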

libs/libreport.py (new file, 173 lines)

@@ -0,0 +1,173 @@
import os
import sys
import json

from json2html import *

from libs.pdf_png import get_png_base64


def prepare_analysis_dict(ana_queue):
    '''params: ana_queue - queue with collected information
    '''
    # initiate analysis dictionary
    analysis_dict = {}
    # move the analysis dictionaries in the queue back into one dictionary
    while ana_queue.empty() == False:
        item = ana_queue.get()
        # print('item ', item)
        analysis_dict.update(item)
    # ana_q is empty now, return the newly created dictionary
    return analysis_dict


def create_txt_report(analysis_dict, outdir, out_filename):
    ''' create a txt report in the output directory
    '''
    # separator line
    sep = '-' * 80 + '\n'
    # create output filepath
    txtout = "%s/%s.txt" % (outdir, out_filename)
    # open the file and return the filedescriptor
    fwtxt = open(txtout, 'w')
    # get the keys of the dict
    for k in analysis_dict.keys():
        # write separator
        fwtxt.write(sep)
        # build entry filename of the pdf
        fname = 'File: %s\n' % (analysis_dict[k]['filename'])
        # build data entry
        ddata = analysis_dict[k]['data']
        # write the filename
        fwtxt.write(fname)
        # write the metadata
        for kdata in ddata.keys():
            metatxt = '%s:%s\n' % (kdata, ddata[kdata])
            fwtxt.write(metatxt)
        # write separator
        fwtxt.write(sep)
    # close the file
    fwtxt.close()
    return True


def create_json_report(analysis_dict, outdir, out_filename):
    ''' create a json report in the output directory
    '''
    # build json output name
    jsonout = "%s/%s.json" % (outdir, out_filename)
    # open up json output file
    fwjson = open(jsonout, 'w')
    # convert dictionary to json data
    jdata = json.dumps(analysis_dict)
    # write json data to file
    fwjson.write(jdata)
    # close file
    fwjson.close()
    return True


def create_html_report(analysis_dict, outdir, out_filename):
    ''' create a html report from json data using json2html in the output directory
    '''
    # build up path for html output file
    htmlout = "%s/%s.html" % (outdir, out_filename)
    # open htmlout filedescriptor
    fwhtml = open(htmlout, 'w')
    # some html stuff
    pdfpng = get_png_base64('supply/pdf_base64.png')
    html_style = '<style>.center { display: block; margin-left: auto;margin-right: auto;} table {border-collapse: collapse;} th, td { border: 1px solid black;text-align: left; }</style>\n'
    html_head = '<html><head><title>pdfgrab - {0} item/s</title>{1}</head>\n'.format(len(analysis_dict), html_style)
    html_pdf_png = '<p class="center"><img class="center" src="data:image/jpeg;base64,{0}"><br><center>pdfgrab - grab and analyse pdf files</center><br></p>'.format(pdfpng)
    html_body = '<body>{0}\n'.format(html_pdf_png)
    html_end = '\n<br><br><p align="center"><a href="https://github.com/c0decave/pdfgrab">pdfgrab</a> by <a href="https://twitter.com/User_to_Root">dash</a></p></body></html>\n'
    # some attributes
    attr = 'id="meta-data" class="table table-bordered table-hover", border=1, cellpadding=3 summary="Metadata"'
    # convert dictionary to json data
    # in this mode each finding gets its own table; there are other possibilities,
    # but for now i go with this
    html_out = ''
    for k in analysis_dict.keys():
        trans = analysis_dict[k]
        jdata = json.dumps(trans)
        html = json2html.convert(json=jdata, table_attributes=attr)
        html_out = html_out + html + "\n"
        # html_out = html_out + "<p>" + html + "</p>\n"
    # jdata = json.dumps(analysis_dict)
    # create html
    # html = json2html.convert(json = jdata, table_attributes=attr)
    # write html
    fwhtml.write(html_head)
    fwhtml.write(html_body)
    fwhtml.write(html_out)
    fwhtml.write(html_end)
    # close html file
    fwhtml.close()
    return True


def create_url_json(url_d, outdir, out_filename):
    ''' create a json url file in the output directory
    '''
    # create url savefile
    jsonurlout = "%s/%s_url.json" % (outdir, out_filename)
    # open up file for writing urls down
    fwjson = open(jsonurlout, 'w')
    # convert url dictionary to json
    jdata = json.dumps(url_d)
    # write json data to file
    fwjson.write(jdata)
    # close filedescriptor
    fwjson.close()
    return True


def create_url_txt(url_d, outdir, out_filename):
    ''' create a txt url file in the output directory
    '''
    # build up txt out path
    txtout = "%s/%s_url.txt" % (outdir, out_filename)
    # open up our url txtfile
    fwtxt = open(txtout, 'w')
    # iterate through the keys of the url dictionary
    for k in url_d.keys():
        # get the entry
        ddata = url_d[k]
        # create meta data for saving
        metatxt = '%s:%s\n' % (ddata['url'], ddata['filename'])
        # write metadata to file
        fwtxt.write(metatxt)
    # close fd
    fwtxt.close()
    return True
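
A hypothetical usage sketch (not part of the commit): the report functions expect a dictionary keyed by file path, each entry carrying `'filename'` and a `'data'` dict of metadata, and they assume the output directory already exists. The entry values below are made up.
```
from libs.libhelper import make_directory
from libs.libreport import create_json_report, create_txt_report

analysis_dict = {
    "pdfgrab/report.pdf": {                      # hypothetical entry
        "filename": "report.pdf",
        "data": {"/Producer": "pdfTeX-1.40.17", "/Author": "jdoe"},
    }
}
make_directory("pdfgrab")                        # default output directory
create_txt_report(analysis_dict, "pdfgrab", "pdfgrab_analysis")
create_json_report(analysis_dict, "pdfgrab", "pdfgrab_analysis")
```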

libs/librequest.py (new file, 162 lines)

@@ -0,0 +1,162 @@
import os
import sys
import json
import socket

import requests

from libs.liblog import logger
from libs.libhelper import *
from libs.libgoogle import get_random_agent


def store_file(url, data, outdir):
    ''' storing the downloaded data to a file
    params: url - used to create the filename
            data - the data of the file
            outdir - the directory to store in
    returns: dict { "code":<code>, "data":<savepath>, "error":<error> } - the status code, the savepath, the errorcode
    '''
    logger.info('Store file {0}'.format(url))
    name = find_name(url)
    # only allow the stored file a name with 50 chars
    if len(name) > 50:
        name = name[:49]
    # build up the save path
    save = "%s/%s" % (outdir, name)
    try:
        f = open(save, "wb")
    except OSError as e:
        logger.warning('store_file {0}'.format(e))
        # return ret_dict
        return {"code": False, "data": save, "error": e}
    # write the data and keep the number of written bytes
    ret = f.write(data)
    # check if bytes are zero
    if ret == 0:
        logger.warning('Written {0} bytes for file: {1}'.format(ret, save))
    else:
        # log to info that the bytes and file have been written
        logger.info('Written {0} bytes for file: {1}'.format(ret, save))
    # close file descriptor
    f.close()
    # return ret_dict
    return {"code": True, "data": save, "error": False}


def download_file(url, args, header_data):
    ''' downloading the file for later analysis
    params: url - the url
            args - argparse args namespace
            header_data - pre-defined header data
    returns: ret_dict
    '''
    # check the remote tls certificate or not?
    cert_check = args.cert_check
    # run our try catch routine
    try:
        # request the url and save the response in req
        # give header data and set verify as delivered by args.cert_check
        req = requests.get(url, headers=header_data, verify=cert_check)
    except requests.exceptions.SSLError as e:
        logger.warning('download file {0} {1}'.format(url, e))
        # no response object exists at this point
        return {"code": False, "data": False, "error": e}
    except requests.exceptions.InvalidSchema as e:
        logger.warning('download file {0} {1}'.format(url, e))
        # return retdict
        return {"code": False, "data": False, "error": e}
    except socket.gaierror as e:
        logger.warning('download file, host not known {0} {1}'.format(url, e))
        return {"code": False, "data": False, "error": e}
    except:
        logger.warning('download file, something wrong with remote server? {0}'.format(url))
        # return retdict
        return {"code": False, "data": False, "error": True}
    # finally:
    #     lets close the socket
    #     req.close()
    # return retdict
    return {"code": True, "data": req, "error": False}


def grab_run(url, args, outdir):
    ''' function keeping all the steps together for the user call of grabbing
    just one file and analysing it
    '''
    header_data = {'User-Agent': get_random_agent()}
    rd_download = download_file(url, args, header_data)
    code_down = rd_download['code']
    # if code is True the download of the file was successful
    if code_down:
        rd_evaluate = evaluate_response(rd_download)
        code_eval = rd_evaluate['code']
        # if code is True, evaluation was also successful
        if code_eval:
            # get the content from the evaluated request
            content = rd_evaluate['data'].content
            # call store file
            rd_store = store_file(url, content, outdir)
            # get the code
            code_store = rd_store['code']
            # get the savepath
            savepath = rd_store['data']
            # if code is True, storing of the file was also successful
            if code_store:
                return {"code": True, "data": savepath, "error": False}
    return {"code": False, "data": False, "error": True}


def evalute_content(ret_dict):
    pass


def evaluate_response(ret_dict):
    ''' this method usually comes after download_file,
    it evaluates what has happened and whether we even have data to process
    or not
    params: ret_dict - holds the req object of the conducted request
    returns: dict { "code":<code>, "data":<req>, "error":<error> }
    '''
    # extract data from ret_dict
    req = ret_dict['data']
    # get status code
    url = req.url
    status = req.status_code
    reason = req.reason
    # ahh everything is fine
    if status == 200:
        logger.info('download file, {0} {1} {2}'.format(url, reason, status))
        return {"code": True, "data": req, "error": False}
    # nah something is not like it should be
    else:
        logger.warning('download file, {0} {1} {2}'.format(url, reason, status))
        return {"code": False, "data": req, "error": True}
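
A hypothetical usage sketch (not part of the commit): `grab_run()` only reads `cert_check` from the argparse namespace and returns the same `{"code", "data", "error"}` dictionary used throughout this module. It needs network access and an existing output directory.
```
from types import SimpleNamespace

from libs.libhelper import make_directory
from libs.librequest import grab_run

args = SimpleNamespace(cert_check=True)          # stand-in for the argparse namespace
make_directory("pdfgrab")
result = grab_run("http://example.com/paper.pdf", args, "pdfgrab")   # hypothetical url
if result["code"]:
    print("stored at", result["data"])
```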

libs/pdf_png.py (new file, 5 lines)

@@ -0,0 +1,5 @@
def get_png_base64(filename):
    fr = open(filename, 'r')
    buf = fr.read()
    return buf
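
Despite its name, `get_png_base64()` just reads a text file that already contains base64 data; the commit ships that file as `supply/pdf_base64.png` and embeds it into the html report. A hypothetical sketch of how such a file could be regenerated from the binary `supply/pdf.png` added in the same commit:
```
import base64

# encode the binary logo once and store it as the text file the html report embeds
with open("supply/pdf.png", "rb") as fr:
    encoded = base64.b64encode(fr.read()).decode("ascii")
with open("supply/pdf_base64.png", "w") as fw:
    fw.write(encoded)
```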

pdfgrab.py (rewritten; the new version follows)

@@ -1,264 +1,436 @@
#!/usr/bin/env python3
#####################
# new features, new layout, new new :>
# by dash

import xml
import argparse
import json
import os
import queue
import sys
import urllib

from json2html import *

import PyPDF2
# googlesearch library
import googlesearch as gs
import requests
from PyPDF2 import pdf

# functions moved to extern files
from libs.liblog import logger
from libs.libhelper import *
from libs.libgoogle import *
from libs.libreport import *
from libs.librequest import grab_run

from IPython import embed

# some variables in regard of the tool itself
name = 'pdfgrab'
version = '0.4.9'
author = 'dash'
date = 'November 2019'

# queues for processing
# this queue holds the URL locations of files to download
url_q = queue.Queue()
url_d = {}
# this queue holds the paths of files to analyse
pdf_q = queue.Queue()
# this is the analysis queue, keeping the data for further processing
ana_q = queue.Queue()


def add_queue(tqueue, data):
    ''' wrapper function for easily adding data to the
    created queues. otherwise the functions would be scattered with
    endless queue commands ;)
    '''
    tqueue.put(data)
    # d = tqueue.get()
    # logging.debug(d)
    return True


def process_queue_data(filename, data, queue_type):
    ''' main function for processing gathered data
    i use this central function for it, so it is at *one* place
    and it is easy to change the data handling at a later step without
    deconstructing the whole code
    '''
    ana_dict = {}
    url_dict = {}
    if queue_type == 'doc_info':
        logger.info('Queue DocInfo Data {0}'.format(filename))
        name = find_name(filename)
        path = filename
        # create a hash over the file path
        # hm, removed for now
        # path_hash = create_sha256(path)
        # order data in dict for analyse queue
        ana_dict = {path: {'filename': name, 'data': data}}
        # print('data:', data)
        # print('ana_dict:', ana_dict)
        # add the data to queue
        add_queue(ana_q, ana_dict)
    elif queue_type == 'doc_xmp_info':
        logger.info('Queue DocXMPInfo Data {0}'.format(filename))
        logger.warning('DocXMPInfo json processing not supported {0}'.format(filename))
    elif queue_type == 'url':
        # prepare queue entry
        logger.info('Url Queue {0}'.format(data))
        url_dict = {'url': data, 'filename': filename}
        sha256 = create_sha256(data)
        url_d[sha256] = url_dict
        # add dict to queue
        add_queue(url_q, url_dict)
    else:
        print('[-] Sorry, unknown queue. DEBUG!')
        logger.critical('Unknown queue')
        return False
    return True


def get_xmp_meta_data(filename, filehandle):
    ''' get the xmp meta data
    '''
    err_dict = {}
    real_extract = {}
    xmp_dict = {}
    fh = filehandle
    try:
        xmp_meta = fh.getXmpMetadata()
    except xml.parsers.expat.ExpatError as e:
        logger.warning('get_xmp_meta_data error {0}'.format(e))
        err_dict = {'error': str(e)}
        return -1
    finally:
        process_queue_data(filename, err_dict, 'doc_xmp_info')
    if xmp_meta != None:
        try:
            print('xmp_meta: {0} {1} {2} {3} {4} {5}'.format(xmp_meta.pdf_producer, xmp_meta.pdf_pdfversion, xmp_meta.dc_contributor, xmp_meta.dc_creator, xmp_meta.dc_date, xmp_meta.dc_subject))
            # print('xmp_meta cache: {0}'.format(xmp_meta.cache))
            # print('xmp_meta custom properties: {0}'.format(xmp_meta.custom_properties))
            # embed()
        except AttributeError as e:
            logger.warning('xmp_meta print {0}'.format(e))
            return False
    return xmp_dict


def get_DocInfo(filename, filehandle):
    ''' the easy way to extract metadata

    indirectObjects...
    there is an interesting situation, some pdfs seem to have the same information stored
    in different places, or things are overwritten or whatever.
    this sometimes results in an extract output with indirect objects ... this is ugly.
    bad example:

    {'/Title': IndirectObject(111, 0), '/Producer': IndirectObject(112, 0), '/Creator': IndirectObject(113, 0), '/CreationDate': IndirectObject(114, 0), '/ModDate': IndirectObject(114, 0), '/Keywords': IndirectObject(115, 0), '/AAPL:Keywords': IndirectObject(116, 0)}

    normally getObject() is the method to use to fix this, however it was not working in this particular case.
    this thing might even bring up some more nasty things; as a (probably weak) defense and workaround
    the pdf object is not used anymore after this function, data is converted to strings...
    '''
    err_dict = {}
    real_extract = {}
    fh = filehandle
    try:
        extract = fh.documentInfo
    except pdf.utils.PdfReadError as e:
        logger.warning('get_doc_info {0}'.format(e))
        err_dict = {'error': str(e)}
        return -1
    except PyPDF2.utils.PdfReadError as e:
        logger.warning('get_doc_info {0}'.format(e))
        err_dict = {'error': str(e)}
        return -1
    finally:
        process_queue_data(filename, err_dict, 'doc_info')
    print('-' * 80)
    print('File: %s' % filename)
    # embed()
    # there are situations when documentInfo does not return anything
    # and extract is None
    if extract == None:
        err_dict = {'error': 'getDocumentInfo() returns None'}
        process_queue_data(filename, err_dict, 'doc_info')
        return -1
    try:
        for k in extract.keys():
            key = str(k)
            value = str(extract[k])
            edata = '%s %s' % (key, value)
            print(edata)
            real_extract[key] = value
        print('-' * 80)
    except PyPDF2.utils.PdfReadError as e:
        logger.warning('get_doc_info {0}'.format(e))
        err_dict = {'error': str(e)}
        process_queue_data(filename, err_dict, 'doc_info')
        return -1
    process_queue_data(filename, real_extract, 'doc_info')


def decrypt_empty_pdf(filename):
    ''' this function simply tries to decrypt the pdf with the null password.
    this works as long as no real password has been set;
    if a complex password has been set -> john
    '''
    fr = pdf.PdfFileReader(open(filename, "rb"))
    try:
        fr.decrypt('')
    except NotImplementedError as e:
        logger.warning('decrypt_empty_pdf {0}{1}'.format(filename, e))
        return -1
    return fr


def check_encryption(filename):
    ''' basic function to check if file is encrypted
    '''
    print(filename)
    try:
        fr = pdf.PdfFileReader(open(filename, "rb"))
        print(fr)
    except pdf.utils.PdfReadError as e:
        logger.warning('check encryption {0}'.format(e))
        return -1
    if fr.getIsEncrypted() == True:
        print('[i] File encrypted %s' % filename)
        nfr = decrypt_empty_pdf(filename)
        if nfr != -1:
            get_DocInfo(filename, nfr)
            get_xmp_meta_data(filename, nfr)
    else:
        get_DocInfo(filename, fr)
        get_xmp_meta_data(filename, fr)
    # fr.close()
    return True


def _parse_pdf(filename):
    ''' the real parsing function '''
    logger.warning('{0}'.format(filename))
    if check_file_size(filename):
        ret = check_encryption(filename)
        return ret
    else:
        logger.warning('Filesize is 0 bytes at file: {0}'.format(filename))
        return False


def seek_and_analyse(search, args, outdir):
    ''' function keeping all the steps of searching for pdfs and analysing
    them together
    '''
    # check how many hits we got
    # seems like the method is broken in the googlesearch library :(
    # code, hits = hits_google(search, args)
    # if code:
    #     print('Got {0} hits'.format(hits))

    # use the search function of googlesearch to get the results
    code, values = search_google(search, args)
    if not code:
        if values.code == 429:
            logger.warning('[-] Too many requests, time to change ip address or use proxychains')
        else:
            logger.warning('Google returned error {0}'.format(values))
        return -1
    for item in values:
        filename = find_name(item)
        process_queue_data(filename, item, 'url')
    # urls = search_pdf(search, args)

    # *if* we get an answer
    if url_q.empty() == False:
        # process through the list and get the pdfs
        while url_q.empty() == False:
            item = url_q.get()
            # print(item)
            url = item['url']
            rd_grabrun = grab_run(url, args, outdir)
            code = rd_grabrun['code']
            savepath = rd_grabrun['data']
            if code:
                _parse_pdf(savepath)
    return True


def run(args):
    # initialize logger
    logger.info('{0} Started'.format(name))

    # create some variables
    # outfile name
    if args.outfile:
        out_filename = args.outfile
    else:
        out_filename = 'pdfgrab_analysis'

    # specify output directory
    outdir = args.outdir
    # create output directory
    make_directory(outdir)

    # lets see what the object is
    if args.url_single:
        url = args.url_single
        logger.info('Grabbing {0}'.format(url))
        # download and store the file via libs/librequest.py, then parse it
        rd_grabrun = grab_run(url, args, outdir)
        if rd_grabrun['code']:
            _parse_pdf(rd_grabrun['data'])
    elif args.file_single:
        pdffile = args.file_single
        logger.info('Parsing {0}'.format(pdffile))
        _parse_pdf(pdffile)
    elif args.search:
        search = args.search
        logger.info('Seek and analyse {0}'.format(search))
        if not seek_and_analyse(search, args, outdir):
            return -1
    elif args.files_dir:
        directory = args.files_dir
        logger.info('Analyse pdfs in directory {0}'.format(directory))
        try:
            files = os.listdir(directory)
        except:
            logger.warning('Error in args.files_dir')
            return False
        for f in files:
            # naive filter function, later usage of filemagic possible
            if f.find('.pdf') != -1:
                fpath = '%s/%s' % (directory, f)
                _parse_pdf(fpath)
    # simply generate the html report from a json outfile
    elif args.gen_html_report:
        fr = open(args.gen_html_report, 'r')
        analysis_dict = json.loads(fr.read())
        if create_html_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created html report')
            sys.exit(0)
        else:
            sys.exit(1)
    else:
        print('[-] Dunno what to do, bro. Use help. {0} -h'.format(sys.argv[0]))
        sys.exit(1)

    # creating the analysis dictionary for reporting
    analysis_dict = prepare_analysis_dict(ana_q)
    # lets go through the different reporting types
    if args.report_txt:
        if create_txt_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created txt report')
    if args.report_json:
        if create_json_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created json report')
    if args.report_html:
        if create_html_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created html report')
    if args.report_url_txt:
        if create_url_txt(url_d, outdir, out_filename):
            logger.info('Successfully created txt url report')
    if args.report_url_json:
        if create_url_json(url_d, outdir, out_filename):
            logger.info('Successfully created json url report')
    return 42
    # This is the end my friend.


def main():
    parser_desc = "%s %s %s in %s" % (name, version, author, date)
    parser = argparse.ArgumentParser(prog=name, description=parser_desc)
    parser.add_argument('-O', '--outdir', action='store', dest='outdir', required=False,
                        help="define the outdirectory for downloaded files and analysis output", default='pdfgrab')
    parser.add_argument('-o', '--outfile', action='store', dest='outfile', required=False,
                        help="define file with analysis output, if no parameter is given it is outdir/pdfgrab_analysis; please note the outfile is *always* written to the output directory, so do not add the dir as an extra path")
    parser.add_argument('-u', '--url', action='store', dest='url_single', required=False,
                        help="grab pdf from specified url for analysis", default=None)
    # parser.add_argument('-U', '--url-list', action='store', dest='urls_many', required=False,
    #                     help="specify txt file with list of pdf urls to grab", default=None)
    #########
    parser.add_argument('-f', '--file', action='store', dest='file_single', required=False,
                        help="specify local path of pdf for analysis", default=None)
    parser.add_argument('-F', '--files-dir', action='store', dest='files_dir', required=False,
                        help="specify local path of *directory* with pdf *files* for analysis", default=None)
    parser.add_argument('-s', '--search', action='store', dest='search', required=False,
                        help="specify domain or tld to scrape for pdf-files", default=None)
    parser.add_argument('-sn', '--search-number', action='store', dest='search_stop', required=False,
                        help="specify how many files are searched", default=10, type=int)
    parser.add_argument('-z', '--disable-cert-check', action='store_false', dest='cert_check', required=False,
                        help="if the target domain(s) run with old or bad certificates", default=True)
    parser.add_argument('-ghr', '--gen-html-report', action='store', dest='gen_html_report', required=False,
                        help="If you want to generate the html report after editing the json outfile (parameter: pdfgrab_analysis.json)")
    parser.add_argument('-rtd', '--report-text-disable', action='store_false', dest='report_txt', required=False,
                        help="Disable txt report", default=True)
    parser.add_argument('-rjd', '--report-json-disable', action='store_false', dest='report_json', required=False,
                        help="Disable json report", default=True)
    parser.add_argument('-rhd', '--report-html-disable', action='store_false', dest='report_html', required=False,
                        help="Disable html report", default=True)
    parser.add_argument('-rutd', '--report-url-text-disable', action='store_false', dest='report_url_txt', required=False,
                        help="Disable url txt report", default=True)
    parser.add_argument('-rujd', '--report-url-json-disable', action='store_false', dest='report_url_json', required=False,
                        help="Disable url json report", default=True)

    if len(sys.argv) < 2:
        parser.print_help(sys.stderr)
        sys.exit()

    args = parser.parse_args()
    run(args)


if __name__ == "__main__":
    main()

supply/pdf.png (new binary file, 24 KiB; not shown)

supply/pdf_base64.png (new file, 1 line; diff suppressed because the line is too long)