Compare commits
5af03d1ebb ... master (10 commits)

Commits (SHA1):
* 132750867f
* a89ac93c3d
* 4f63e62690
* e1d7c3f760
* fa3b925d6f
* f7339cda5f
* 1c6d03ef0e
* 5f9cdf86d1
* 64f48eef9a
* fb2dfb1527
Readme.md (88 lines changed)
@@ -1,11 +1,26 @@

# pdfgrab

* Version 0.4.9

## What is it?

This is a reborn tool from the epoch when dinosaurs were traipsing the earth.
Basically it analyses PDF files for metadata. You can point it at a single file or at a directory of PDFs.
You can give it the URL of a PDF, or use the integrated googlesearch class (thanks to Mario Vilas)
to search a target site for PDFs, download them and analyse them.

## What is new in 0.4.9?

* exported reporting methods to libreport.py
* added optargs for disabling different report methods
* made the html report a bit more shiny
* added function for generating html report after analysis
* exported requests and storing data to new library
* code fixes and clearer error handling
* removed the required site: parameter at the search flag -s
* updated readme
* -s flag now accepts several domains
* cleaner console logging

## What information can be gathered?

@@ -22,18 +37,49 @@ However, common are the following things:

and some more :)

## What is this for anyways?

Well, this can be used for a range of things. However, I will only focus on the
security part of it. Depending on your target you will get information about:

* used software in company xyz
* possible version numbers
  * this will help you to identify existing vulnerabilities
* sometimes pdfs are rendered anew, for instance on upload
  * now you can figure out what the rendering engine is and find bugs in it
* who is the author of documents
  * sometimes usernames are users of the OS itself
  * congrats, by analysing a pdf you just found an existing username in the domain
  * combine that with the first part and you know which user uses which software
* passwords ... do i need to say more?

## Is it failproof?

Not at all. Please note that metadata, like any other data, is just written to that file, so it can be changed before the file is uploaded. That said, the share of companies that really change that sort of data is maybe 20%. You will also recognize when it is empty or the like.

## How does it work?

Every file type more complex than .txt or the like uses metadata for convenience, for customer support, or simply to advertise that it has been used.
There is a lot of information online about metadata in different sorts of files such as pictures, documents, videos and music. This tool
focuses on PDF only.
If you are new to the term, have a look here:

* https://en.wikipedia.org/wiki/Metadata

Also, if you are interested in real PDF analysis, this tool will only do the basics for you. It has not been written to analyse bad, malicious or otherwise interesting files. Its purpose is to give you an idea of what is used at target xyz.
If you are looking for more in-depth analysis I recommend the tools of Didier Stevens:

* https://blog.didierstevens.com/programs/pdf-tools/

## Download

```
git clone https://github.com/c0decave/pdfgrab
cd pdfgrab
python3 pdfgrab.py -h
```

## Usage

These are your major options:

* grab a pdf from a url and analyse it (see the example just below)
* search a site for pdfs via google, grab and analyse them
* analyse a local pdf
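The single-url mode from the first bullet is driven by the -u/--url flag defined in pdfgrab.py; a minimal invocation sketch (the url is only a placeholder):

```
./pdfgrab.py -u https://example.com/whitepaper.pdf
```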
@@ -73,9 +119,17 @@ File: pdfgrab/ols2004v2.pdf

--------------------------------------------------------------------------------
```

### Directory Mode

```
./pdfgrab.py -F pdfgrab/
```

Analyses all PDFs in that directory.

### Google Search Mode

```
# ./pdfgrab.py -s kernel.org
```

Result:

```
@@ -107,10 +161,30 @@ File: pdfgrab/bpf_global_data_and_static_keys.pdf
/PTEX.Fullbanner This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
```

### Google Search Mode, several domains

```
# ./pdfgrab.py -s example.com,example.us
```

### Reporting

pdfgrab outputs the information in different formats. Unless disabled by one of the reporting flags (see -h), you will
find the following in the output directory:

* html report
* text report
* text url list
* json data
* json url list
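The json data file mirrors the analysis dictionary built in libs/libreport.py: one entry per analysed file, holding the stored filename and a mapping of metadata keys. A minimal post-processing sketch, assuming the default output paths (outdir pdfgrab, outfile pdfgrab_analysis):

```python
import json

# load the report written by create_json_report()
with open('pdfgrab/pdfgrab_analysis.json') as fr:
    analysis = json.load(fr)

# print selected metadata per analysed file
for path, entry in analysis.items():
    meta = entry['data']
    print(entry['filename'], meta.get('/Producer'), meta.get('/Creator'))
```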
### Logging

pdfgrab creates a logfile called "pdfgrab.log" in the running directory.

## Google

* Search: filetype:pdf site:com
* Results: 264.000.000

## Disclaimer
docs/Changelog (new file, 43 lines)
@@ -0,0 +1,43 @@
Changelog
=========

Version 4.9
-----------

* exported reporting methods to libreport.py
* added optargs for disabling different report methods
* made the html report a bit more shiny
* added function for generating html report after analysis
* exported requests and storing data to new library
* code fixes and clearer error handling
* removed the required site: parameter at the search flag -s
* updated readme
* -s flag now accepts several domains
* cleaner console logging

Version 4.8 Bugfix-PreRelease
-----------------------------

* catching google "too many requests" errors
* catching urlopen dns-not-resolvable error
* fixing nasty bug in store_pdf/find_name
* fixing zero size pdf error
* extra logging

Version 4.7
-----------

* added html out
* added xmp meta testing

Version 4.6
-----------

* added help for non-argument given at cli
* added googlesearch lib

Version 4.5
-----------

* exported helper functions to libs/helper.py
* added libs/liblog.py
docs/Todo (new file, 4 lines)
@@ -0,0 +1,4 @@
* add xmp meta to output files
* code reordering
* clean up parsing functions
* add report formats
libs/__init__.py (new file, empty)
libs/libgoogle.py (new file, 58 lines)
@@ -0,0 +1,58 @@
import googlesearch as gs
import urllib
from libs.libhelper import *


def get_random_agent():
    return (gs.get_random_user_agent())


def hits_google(search, args):
    ''' the function where googlesearch from mario vilas
    is called
    '''
    s = search.split(',')
    query = 'filetype:pdf'

    try:
        hits = gs.hits(query, domains=s, user_agent=gs.get_random_user_agent())

    except urllib.error.HTTPError as e:
        return False, e

    except urllib.error.URLError as e:
        return False, e

    except IndexError as e:
        return False, e

    return True, hits


def search_google(search, args):
    ''' the function where googlesearch from mario vilas
    is called
    '''

    s = search.split(',')
    search_stop = args.search_stop

    query = 'filetype:pdf'
    #query = 'site:%s filetype:pdf' % search
    # print(query)
    urls = []

    try:
        for url in gs.search(query, num=20, domains=s, stop=search_stop, user_agent=gs.get_random_user_agent()):
            #print(url)
            urls.append(url)

    except urllib.error.HTTPError as e:
        #print('Error: %s' % e)
        return False, e

    except urllib.error.URLError as e:
        return False, e

    return True, urls
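A quick way to exercise search_google() outside the main script; the SimpleNamespace below stands in for the argparse namespace that pdfgrab.py normally passes in (only search_stop is read here), so treat it as an illustrative sketch:

```python
from types import SimpleNamespace
from libs.libgoogle import search_google

# mimic the argparse namespace; 10 matches the -sn/--search-number default
args = SimpleNamespace(search_stop=10)

ok, result = search_google('kernel.org,example.com', args)  # comma-separated domains
if ok:
    for url in result:                   # on success, result is the list of pdf urls
        print(url)
else:
    print('search failed:', result)      # on failure, result carries the exception
```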
libs/libhelper.py (new file, 55 lines)
@@ -0,0 +1,55 @@
import os
import sys
from Crypto.Hash import SHA256


def check_file_size(filename):
    ''' simply check if byte size is bigger than 0 bytes
    '''
    fstat = os.stat(filename)
    if fstat.st_size == 0:
        return False
    return True


def make_directory(outdir):
    ''' naive mkdir function '''
    try:
        os.mkdir(outdir)
    except:
        # print("[W] mkdir, some error, directory probably exists")
        pass


def url_strip(url):
    url = url.rstrip("\n")
    url = url.rstrip("\r")
    return url


def create_sha256(hdata):
    ''' introduced to create hashes of filenames, to have a uniqid
    of course hashes of the file itself will be the next topic
    '''
    hobject = SHA256.new(data=hdata.encode())
    return (hobject.hexdigest())


def find_name(pdf):
    ''' simply parses the urlencoded name and extracts the storage name
    i would not be surprised this naive approach can lead to fuckups
    '''

    name = ''
    # find the name of the file
    name_list = pdf.split("/")
    # number of path segments, used to index from the end of the list
    len_list = len(name_list)
    # ugly magic ;-)
    # what happens is, that files can also be behind urls like:
    # http://host/pdf/
    # so splitting up the url and always going with the last item after slash
    # can result in that case in an empty name, so we go another field in the list back
    # and use this as the name
    if name_list[len_list - 1] == '':
        name = name_list[len_list - 2]
    else:
        name = name_list[len_list - 1]

    return name
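The fallback described in the comment above can be checked directly; a small illustrative sketch (the urls are placeholders):

```python
from libs.libhelper import find_name

print(find_name('http://host/docs/report.pdf'))  # -> 'report.pdf'
print(find_name('http://host/pdf/'))             # trailing slash: falls back to 'pdf'
```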
libs/liblog.py (new file, 19 lines)
@@ -0,0 +1,19 @@
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

file_handler = logging.FileHandler('pdfgrab.log')

console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)

file_formatter = logging.Formatter('%(asctime)s:%(name)s:%(levelname)s:%(message)s')
console_formatter = logging.Formatter('%(levelname)s:%(message)s')

file_handler.setFormatter(file_formatter)
console_handler.setFormatter(console_formatter)

logger.addHandler(file_handler)
logger.addHandler(console_handler)
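The other modules simply import this pre-configured logger, as libs/librequest.py and pdfgrab.py do; a minimal usage sketch:

```python
from libs.liblog import logger

logger.info('goes to pdfgrab.log only (console handler is WARNING and above)')
logger.warning('goes to pdfgrab.log and to the console')
```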
libs/libreport.py (new file, 173 lines)
@@ -0,0 +1,173 @@
import os
import sys
import json
from json2html import *
from libs.pdf_png import get_png_base64


def prepare_analysis_dict(ana_queue):
    '''params: ana_queue - queue with collected information
    '''
    # initiate analysis dictionary
    analysis_dict = {}

    # move analysis dictionary in queue back to dictionary
    while ana_queue.empty() == False:
        item = ana_queue.get()
        # print('item ', item)
        analysis_dict.update(item)

    # ana_q is empty now, return the newly created dictionary
    return analysis_dict


def create_txt_report(analysis_dict, outdir, out_filename):
    ''' create a txt report in the output directory
    '''

    # draw seperator lines
    sep = '-' * 80 + '\n'

    # create output filepath
    txtout = "%s/%s.txt" % (outdir, out_filename)

    # open the file and return filedescriptor
    fwtxt = open(txtout, 'w')

    # get the keys of the dict
    for k in analysis_dict.keys():
        # write seperator
        fwtxt.write(sep)

        # build entry filename of the pdf
        fname = 'File: %s\n' % (analysis_dict[k]['filename'])

        # build data entry
        ddata = analysis_dict[k]['data']

        # write the filename
        fwtxt.write(fname)

        # write the metadata
        for kdata in ddata.keys():
            metatxt = '%s:%s\n' % (kdata, ddata[kdata])
            fwtxt.write(metatxt)

        # write seperator
        fwtxt.write(sep)

    # close the file
    fwtxt.close()

    return True


def create_json_report(analysis_dict, outdir, out_filename):
    ''' create a jsonfile report in the output directory
    '''

    # build json output name
    jsonout = "%s/%s.json" % (outdir, out_filename)

    # open up json output file
    fwjson = open(jsonout, 'w')

    # convert dictionary to json data
    jdata = json.dumps(analysis_dict)

    # write json data to file
    fwjson.write(jdata)

    # close file
    fwjson.close()

    return True


def create_html_report(analysis_dict, outdir, out_filename):
    ''' create a html report from json file using json2html in the output directory
    '''

    # build up path for html output file
    htmlout = "%s/%s.html" % (outdir, out_filename)

    # open htmlout filedescriptor
    fwhtml = open(htmlout,'w')

    # some html stuff
    pdfpng = get_png_base64('supply/pdf_base64.png')
    html_style = '<style>.center { display: block; margin-left: auto;margin-right: auto;} table {border-collapse: collapse;} th, td { border: 1px solid black;text-align: left; }</style>\n'
    html_head = '<html><head><title>pdfgrab - {0} item/s</title>{1}</head>\n'.format(len(analysis_dict),html_style)
    html_pdf_png = '<p class="center"><img class="center" src="data:image/jpeg;base64,{0}"><br><center>pdfgrab - grab and analyse pdf files</center><br></p>'.format(pdfpng)
    html_body = '<body>{0}\n'.format(html_pdf_png)
    html_end = '\n<br><br><p align="center"><a href="https://github.com/c0decave/pdfgrab">pdfgrab</a> by <a href="https://twitter.com/User_to_Root">dash</a></p></body></html>\n'

    # some attributes
    attr = 'id="meta-data" class="table table-bordered table-hover", border=1, cellpadding=3 summary="Metadata"'

    # convert dictionary to json data
    # in this mode each finding gets its own table; there are other possibilities
    # but for now i go with this
    html_out = ''
    for k in analysis_dict.keys():
        trans = analysis_dict[k]
        jdata = json.dumps(trans)
        html = json2html.convert(json = jdata, table_attributes=attr)
        html_out = html_out + html + "\n"
        #html_out = html_out + "<p>" + html + "</p>\n"
    #jdata = json.dumps(analysis_dict)

    # create html
    #html = json2html.convert(json = jdata, table_attributes=attr)

    # write html
    fwhtml.write(html_head)
    fwhtml.write(html_body)
    fwhtml.write(html_out)
    fwhtml.write(html_end)

    # close html file
    fwhtml.close()


def create_url_json(url_d, outdir, out_filename):
    ''' create a json url file in output directory
    '''

    # create url savefile
    jsonurlout = "%s/%s_url.json" % (outdir, out_filename)

    # open up file for writing urls down
    fwjson = open(jsonurlout, 'w')

    # convert url dictionary to json
    jdata = json.dumps(url_d)

    # write json data to file
    fwjson.write(jdata)

    # close filedescriptor
    fwjson.close()

    return True


def create_url_txt(url_d, outdir, out_filename):
    ''' create a txt url file in output directory
    '''
    # build up txt out path
    txtout = "%s/%s_url.txt" % (outdir, out_filename)

    # open up our url txtfile
    fwtxt = open(txtout, 'w')

    # iterating through the keys of the url dictionary
    for k in url_d.keys():

        # get the entry
        ddata = url_d[k]

        # create meta data for saving
        metatxt = '%s:%s\n' % (ddata['url'], ddata['filename'])

        # write metadata to file
        fwtxt.write(metatxt)

    # close fd
    fwtxt.close()

    return True
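A small driver sketch showing how these functions fit together, mirroring what run() in pdfgrab.py does with the analysis queue (the sample metadata entry is made up for illustration):

```python
import queue
from libs.libreport import prepare_analysis_dict, create_json_report, create_txt_report

ana_q = queue.Queue()
# one queue item per analysed pdf: {path: {'filename': ..., 'data': {metadata}}}
ana_q.put({'pdfgrab/a.pdf': {'filename': 'a.pdf', 'data': {'/Producer': 'pdfTeX'}}})

analysis = prepare_analysis_dict(ana_q)                      # drain the queue into one dict
create_txt_report(analysis, 'pdfgrab', 'pdfgrab_analysis')   # pdfgrab/pdfgrab_analysis.txt
create_json_report(analysis, 'pdfgrab', 'pdfgrab_analysis')  # pdfgrab/pdfgrab_analysis.json
```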
libs/librequest.py (new file, 162 lines)
@@ -0,0 +1,162 @@
import os
import sys
import json
import socket
import requests

from libs.liblog import logger
from libs.libhelper import *
from libs.libgoogle import get_random_agent


def store_file(url, data, outdir):
    ''' storing the downloaded data to a file
    params: url - is used to create the filename
            data - the data of the file
            outdir - to store in which directory
    returns: dict { "code":<code>, "data":<savepath>,"error":<error>} - the status code, the savepath, the errorcode
    '''

    logger.info('Store file {0}'.format(url))
    name = find_name(url)

    # only allow stored file a name with 50 chars
    if len(name) > 50:
        name = name[:49]

    # build up the save path
    save = "%s/%s" % (outdir, name)

    try:
        f = open(save, "wb")

    except OSError as e:
        logger.warning('store_file {0}'.format(e))
        # return ret_dict
        return {"code":False,"data":save,"error":e}

    # write the data and return the written bytes
    ret = f.write(data)

    # check if bytes are zero
    if ret == 0:
        logger.warning('Written {0} bytes for file: {1}'.format(ret,save))

    else:
        # log to info that bytes and file has been written
        logger.info('Written {0} bytes for file: {1}'.format(ret,save))

    # close file descriptor
    f.close()

    # return ret_dict
    return {"code":True,"data":save,"error":False}


def download_file(url, args, header_data):
    ''' downloading the file for later analysis
    params: url - the url
            args - argparse args namespace
            header_data - pre-defined header data
    returns: ret_dict
    '''

    # check the remote tls certificate or not?
    cert_check = args.cert_check

    # run our try catch routine
    try:
        # request the url and save the response in req
        # give header data and set verify as delivered by args.cert_check
        req = requests.get(url, headers=header_data, verify=cert_check)

    except requests.exceptions.SSLError as e:
        logger.warning('download file {0}{1}'.format(url,e))

        # the request itself failed, so there is no response object to hand back
        return {"code":False,"data":False,"error":e}

    except requests.exceptions.InvalidSchema as e:
        logger.warning('download file {0}{1}'.format(url,e))

        # return retdict
        return {"code":False,"data":False,"error":e}

    except socket.gaierror as e:
        logger.warning('download file, host not known {0} {1}'.format(url,e))
        return {"code":False,"data":False,"error":e}

    except:
        logger.warning('download file, something wrong with remote server? {0}'.format(url))
        # req stays unbound if requests.get() itself raised, so check the name, not the value
        if 'req' not in locals():
            req = False

        return {"code":False,"data":req,"error":True}

    #finally:
        # lets close the socket
        #req.close()

    # return retdict
    return {"code":True,"data":req,"error":False}


def grab_run(url, args, outdir):
    ''' function keeping all the steps for the user call of grabbing
    just one and analysing it
    '''
    header_data = {'User-Agent': get_random_agent()}
    rd_download = download_file(url, args, header_data)
    code_down = rd_download['code']

    # if code is True, download of file was successful
    if code_down:
        rd_evaluate = evaluate_response(rd_download)
        code_eval = rd_evaluate['code']
        # if code is True, evaluation was also successful
        if code_eval:
            # get the content from the evaluate dictionary request
            content = rd_evaluate['data'].content

            # call store file
            rd_store = store_file(url, content, outdir)

            # get the code
            code_store = rd_store['code']

            # get the savepath
            savepath = rd_store['data']

            # if code is True, storing of file was also successful
            if code_store:
                return {"code":True,"data":savepath,"error":False}

    return {"code":False,"data":False,"error":True}


def evalute_content(ret_dict):
    pass


def evaluate_response(ret_dict):
    ''' this method comes usually after download_file,
    it will evaluate what has happened and if we even have some data to process
    or not
    params: data - is the req object from the conducted request
    returns: dict { "code":<code>, "data":<savepath>,"error":<error>} - the status code, the savepath, the errorcode
    '''
    # extract data from ret_dict
    req = ret_dict['data']

    # get status code
    url = req.url
    status = req.status_code
    reason = req.reason

    # ahh everything is fine
    if status == 200:
        logger.info('download file, {0} {1} {2}'.format(url,reason,status))
        return {"code":True,"data":req,"error":False}

    # nah something is not like it should be
    else:
        logger.warning('download file, {0} {1} {2}'.format(url,reason,status))
        return {"code":False,"data":req,"error":True}
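grab_run() is the piece pdfgrab.py calls once per url; a hypothetical stand-alone call (the url is a placeholder, and the namespace only needs the cert_check attribute read by download_file()):

```python
from types import SimpleNamespace
from libs.librequest import grab_run
from libs.libhelper import make_directory

args = SimpleNamespace(cert_check=True)   # -z/--disable-cert-check would set this to False
make_directory('pdfgrab')                 # default output directory of the tool

result = grab_run('https://example.com/paper.pdf', args, 'pdfgrab')
print(result['code'], result['data'])     # True and the local save path on success
```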
libs/pdf_png.py (new file, 5 lines)
@@ -0,0 +1,5 @@

def get_png_base64(filename):
    fr = open(filename,'r')
    buf = fr.read()
    return buf
pdfgrab.py (600 lines changed)
@@ -1,264 +1,436 @@
#!/usr/bin/env python3
#####################
# yay - old tool adjusted for python3, using googlesearch now
# and not some self crafted f00
#
# new features, new layout, new new :>
# dash in end of September 2019
#
#
# TODO
# * json output
# * txt output
# * catch conn refused connections
# * set option for certificate verification, default is false
# * complete analyse.txt and seperated
# * clean up code
# * do more testing
# * add random useragent for google and website pdf gathering
#
# Done
# * add decryption routine
# * catch ssl exceptions
# by dash

import os
import sys
import xml
import argparse
import json
import os
import queue
import urllib
from json2html import *

import PyPDF2

# googlesearch library
import googlesearch as gs
import requests
from PyPDF2 import pdf

# functions to extern files
from libs.liblog import logger
from libs.libhelper import *
from libs.libgoogle import *
from libs.libreport import *
from libs.librequest import grab_run

from IPython import embed

from PyPDF2 import pdf
import googlesearch as gs

# some variables in regard of the tool itself
name = 'pdfgrab'
version = '0.4.9'
author = 'dash'
date = 'November 2019'

_name_ = 'pdfgrab'
_version_ = '0.3'
_author_ = 'dash'
_date_ = '2019'

# queues for processing
# this queue holds the URL locations of files to download
url_q = queue.Queue()
url_d = {}

def url_strip(url):
    url = url.rstrip("\n")
    url = url.rstrip("\r")
    return url

# this queue holds the paths of files to analyse
pdf_q = queue.Queue()

# this is the analysis queue, keeping the data for further processing
ana_q = queue.Queue()

def add_queue(tqueue, data):
    ''' wrapper function for adding easy data to
    created queues. otherwise the functions will be scattered with
    endless queue commands ;)
    '''

    tqueue.put(data)
    # d=tqueue.get()
    #logging.debug(d)
    return True

def process_queue_data(filename, data, queue_type):
    ''' main function for processing gathered data
    i use this central function for it, so it is at *one* place
    and it is easy to change the data handling at a later step without
    deconstructing the whole code
    '''
    ana_dict = {}
    url_dict = {}

    if queue_type == 'doc_info':
        logger.info('Queue DocInfo Data {0}'.format(filename))
        name = find_name(filename)
        path = filename

        # create a hash over the file path
        # hm, removed for now
        # path_hash = create_sha256(path)

        # order data in dict for analyse queue
        ana_dict = {path: {'filename': name, 'data': data}}
        #print('data:',data)
        #print('ana_dict:',ana_dict)

        # add the data to queue
        add_queue(ana_q, ana_dict)

    elif queue_type == 'doc_xmp_info':
        logger.info('Queue DocXMPInfo Data {0}'.format(filename))
        logger.warning('DocXMPInfo json processing not supported {0}'.format(filename))

    elif queue_type == 'url':
        # prepare queue entry
        logger.info('Url Queue {0}'.format(data))
        url_dict = {'url': data, 'filename': filename}
        sha256 = create_sha256(data)
        url_d[sha256] = url_dict

        # add dict to queue
        add_queue(url_q, url_dict)

    else:
        print('[-] Sorry, unknown queue. DEBUG!')
        logger.critical('Unknown queue')
        return False

    return True
def get_xmp_meta_data(filename, filehandle):
    ''' get the xmp meta data
    '''

    err_dict = {}
    real_extract = {}
    xmp_dict = {}

    fh = filehandle

    try:
        xmp_meta = fh.getXmpMetadata()

    except xml.parsers.expat.ExpatError as e:
        logger.warning('get_xmp_meta_data error {0}'.format(e))
        err_dict = {'error': str(e)}
        return -1

    finally:
        process_queue_data(filename, err_dict, 'doc_xmp_info')

    if xmp_meta != None:
        try:
            print('xmp_meta: {0} {1} {2} {3} {4} {5}'.format(xmp_meta.pdf_producer,xmp_meta.pdf_pdfversion,xmp_meta.dc_contributor,xmp_meta.dc_creator,xmp_meta.dc_date,xmp_meta.dc_subject))
            #print('xmp_meta cache: {0}'.format(xmp_meta.cache))
            #print('xmp_meta custom properties: {0}'.format(xmp_meta.custom_properties))
            #embed()
        except AttributeError as e:
            logger.warning('xmp_meta print {0}'.format(e))
            return False

    return xmp_dict

def get_DocInfo(filename, filehandle):
    ''' the easy way to extract metadata

    indirectObjects...
    there is an interesting situation, some pdfs seem to have the same information stored
    in different places, or things are overwritten or whatever
    this sometimes results in an extract output with indirect objects ... this is ugly

    fh = filehandle
    try:
        extract = fh.documentInfo
    except pdf.utils.PdfReadError as e:
        print('Error: %s' % e)
        return -1
    {'/Title': IndirectObject(111, 0), '/Producer': IndirectObject(112, 0), '/Creator': IndirectObject(113, 0), '/CreationDate': IndirectObject(114, 0), '/ModDate': IndirectObject(114, 0), '/Keywords': IndirectObject(115, 0), '/AAPL:Keywords': IndirectObject(116, 0)}

    print('-'*80)
    print('File: %s' % filename)
    for k in extract.keys():
        edata = '%s %s' % (k,extract[k])
        print(edata)
        print
    print('-'*80)
    normally getObject() is the method to use, to fix this, however this was not working in the particular case.
    this thing might even bring up some more nasty things, as a (probably weak) defense and workaround
    the pdfobject is not used anymore after this function, data is converted to strings...
    bad example:
    '''

    err_dict = {}
    real_extract = {}

    fh = filehandle

    try:
        extract = fh.documentInfo

    except pdf.utils.PdfReadError as e:
        logger.warning('get_doc_info {0}'.format(e))
        err_dict = {'error': str(e)}
        return -1

    except PyPDF2.utils.PdfReadError as e:
        logger.warning('get_doc_info {0}'.format(e))
        err_dict = {'error': str(e)}
        return -1

    finally:
        process_queue_data(filename, err_dict, 'doc_info')

    print('-' * 80)
    print('File: %s' % filename)
    # embed()
    # there are situations when documentinfo does not return anything
    # and extract is None
    if extract == None:
        err_dict = {'error': 'getDocumentInfo() returns None'}
        process_queue_data(filename, err_dict, 'doc_info')
        return -1

    try:
        for k in extract.keys():
            key = str(k)
            value = str(extract[k])
            edata = '%s %s' % (key, value)
            print(edata)
            print
            real_extract[key] = value
        print('-' * 80)

    except PyPDF2.utils.PdfReadError as e:
        logger.warning('get_doc_info {0}'.format(e))
        err_dict = {'error': str(e)}
        process_queue_data(filename, err_dict, 'doc_info')
        return -1

    process_queue_data(filename, real_extract, 'doc_info')


def decrypt_empty_pdf(filename):
    ''' this function simply tries to decrypt the pdf with the null password
    this does work, as long as no real password has been set
    if a complex password has been set -> john
    '''

    fr = pdf.PdfFileReader(open(filename, "rb"))
    try:
        fr.decrypt('')

    except NotImplementedError as e:
        logger.warning('decrypt_empty_pdf {0}{1}'.format(filename,e))
        return -1
    return fr

    fr = pdf.PdfFileReader(open(filename,"rb"))
    try:
        fr.decrypt('')
    except NotImplementedError as e:
        print('Error: %s' % (e))
        print('Only algorithm code 1 and 2 are supported')
        return -1
    return fr
def check_encryption(filename):
    ''' basic function to check if file is encrypted
    '''

    print(filename)
    try:
        fr = pdf.PdfFileReader(open(filename,"rb"))
    except pdf.utils.PdfReadError as e:
        print('Error: %s' % e)
        return -1
    print(filename)
    try:
        fr = pdf.PdfFileReader(open(filename, "rb"))
        print(fr)
    except pdf.utils.PdfReadError as e:
        logger.warning('check encryption {0}'.format(e))
        return -1

    if fr.getIsEncrypted()==True:
        print('[i] File encrypted %s' % filename)
        nfr = decrypt_empty_pdf(filename)
        if nfr != -1:
            get_DocInfo(filename,nfr)
    if fr.getIsEncrypted() == True:
        print('[i] File encrypted %s' % filename)
        nfr = decrypt_empty_pdf(filename)
        if nfr != -1:
            get_DocInfo(filename, nfr)
            get_xmp_meta_data(filename,nfr)

    else:
        get_DocInfo(filename,fr)
    else:
        get_DocInfo(filename, fr)
        get_xmp_meta_data(filename,fr)

    #fr.close()
    # fr.close()

    return True

def find_name(pdf):
    ''' simply parses the urlencoded name and extracts the storage name
    i would not be surprised this naive approach can lead to fuckups
    '''

    #find the name of the file
    name = pdf.split("/")
    a = len(name)
    name = name[a-1]
    #print(name)

    return name

def make_directory(outdir):
    ''' naive mkdir function '''
    try:
        os.mkdir(outdir)
    except:
        #print("[W] mkdir, some error, directory probably exists")
        pass

def download_pdf(url, header_data):
    ''' downloading the pdfile for later analysis '''
    try:
        req = requests.get(url,headers=header_data,verify=True)
        #req = requests.get(url,headers=header_data,verify=False)
        data = req.content
    except requests.exceptions.SSLError as e:
        print('Error: %s' % e)
        return -1
    except:
        print('Error: Probably something wrong with remote server')
        return -1

    #print(len(data))
    return data

def store_pdf(url,data,outdir):
    ''' storing the downloaded pdf data '''
    name = find_name(url)
    save = "%s/%s" % (outdir,name)
    try:
        f = open(save,"wb")
    except OSError as e:
        print('Error: %s' % (e))
        return -1

    ret=f.write(data)
    print('[+] Written %d bytes for File: %s' % (ret,save))
    f.close()

    # return the savepath
    return save
    return True

def _parse_pdf(filename):
    ''' the real parsing function '''

    check_encryption(filename)
    return True
    logger.warning('{0}'.format(filename))
    if check_file_size(filename):
        ret = check_encryption(filename)
        return ret
    else:
        logger.warning('Filesize is 0 bytes at file: {0}'.format(filename))
        return False

    print('[+] Opening %s' % filename)
    pdfile = open(filename,'rb')

def seek_and_analyse(search, args, outdir):
    ''' function for keeping all the steps of searching for pdfs and analysing
    them together
    '''
    # check how many hits we got
    # seems like the method is broken in googlesearch library :(
    #code, hits = hits_google(search,args)
    #if code:
    #    print('Got {0} hits'.format(hits))

    try:
        h = pdf.PdfFileReader(pdfile)
    except pdf.utils.PdfReadError as e:
        print('[-] Error: %s' % (e))
        return

    return pdfile
    # use the search function of googlesearch to get the results
    code, values = search_google(search, args)
    if not code:
        if values.code == 429:
            logger.warning('[-] Too many requests, time to change ip address or use proxychains')
        else:
            logger.warning('Google returned error {0}'.format(values))

        return -1

    for item in values:
        filename = find_name(item)
        process_queue_data(filename, item, 'url')

def parse_single_pdf(filename):
    ''' single parse function '''
    return 123
    # urls = search_pdf(search,args)

def grab_url(url, outdir):
    ''' function keeping all the steps for the user call of grabbing
    just one pdf and analysing it
    '''
    data = download_pdf(url,None)
    if data != -1:
        savepath = store_pdf(url, data, outdir)
        _parse_pdf(savepath)
    # *if* we get an answer
    if url_q.empty() == False:
        # if urls != -1:
        # process through the list and get the pdfs
        while url_q.empty() == False:
            item = url_q.get()
            # print(item)
            url = item['url']
            rd_grabrun = grab_run(url, args, outdir)
            code = rd_grabrun['code']
            savepath = rd_grabrun['data']
            if code:
                _parse_pdf(savepath)

    return
    return True

def seek_and_analyse(search,sargs,outdir):
    ''' function for keeping all the steps of searching for pdfs and analysing
    them together
    '''
    urls = search_pdf(search,sargs)
    for url in urls:
        grab_url(url,outdir)

def search_pdf(search, sargs):
    ''' the function where googlesearch from mario vilas
    is called
    '''

    query = '%s filetype:pdf' % search
    #print(query)
    urls = []
    for url in gs.search(query,num=20,stop=sargs):
        print(url)
        urls.append(url)

    return urls
def run(args):

    # specify output directory
    outdir = args.outdir
    # initialize logger
    logger.info('{0} Started'.format(name))

    # create output directory
    make_directory(outdir)

    # lets see what the object is
    if args.url_single:
        url = args.url_single
        print('[+] Grabbing %s' % (url))
        grab_url(url, outdir)

    elif args.file_single:
        pdffile = args.file_single
        print('[+] Parsing %s' % (pdffile))
        _parse_pdf(pdffile)

    elif args.search:
        search = args.search
        sargs = args.search_stop
        #print(args)
        print('[+] Seek and de...erm...analysing %s' % (search))
        seek_and_analyse(search,sargs,outdir)

    elif args.files_dir:
        directory = args.files_dir
        print('[+] Analyse pdfs in directory %s' % (directory))
        files = os.listdir(directory)
        for f in files:
            fpath = '%s/%s' % (directory,f)
            _parse_pdf(fpath)
    # create some variables

    # outfile name
    if args.outfile:
        out_filename = args.outfile
    else:
        out_filename = 'pdfgrab_analysis'

    else:
        print('[-] Dunno what to do, bro.')
    #logfile = "%s/%s.txt" % (out,out)
    #flog = open(logfile,"w")
    # specify output directory
    outdir = args.outdir

    # create output directory
    make_directory(outdir)

    # lets see what the object is
    if args.url_single:
        url = args.url_single
        logger.info('Grabbing {0}'.format(url))
        logger.write_to_log('Grabbing %s' % (url))
        grab_url(url, args, outdir)

    elif args.file_single:
        pdffile = args.file_single
        logger.info('Parsing {0}'.format(pdffile))
        _parse_pdf(pdffile)

    elif args.search:
        search = args.search
        logger.info('Seek and analyse {0}'.format(search))
        if not seek_and_analyse(search, args, outdir):
            return -1

    elif args.files_dir:
        directory = args.files_dir
        logger.info('Analyse pdfs in directory {0}'.format(directory))
        try:
            files = os.listdir(directory)
        except:
            logger.warning('Error in args.files_dir')
            return False

        for f in files:
            # naive filter function, later usage of filemagic possible
            if f.find('.pdf') != -1:
                fpath = '%s/%s' % (directory, f)
                _parse_pdf(fpath)

    # simply generate html report from json outfile
    elif args.gen_html_report:
        fr = open(args.gen_html_report,'r')
        analysis_dict = json.loads(fr.read())
        if create_html_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created html report')
            sys.exit(0)
        else:
            sys.exit(1)

    else:
        print('[-] Dunno what to do, bro. Use help. {0} -h'.format(sys.argv[0]))
        sys.exit(1)

    # creating the analysis dictionary for reporting
    analysis_dict = prepare_analysis_dict(ana_q)

    # lets go through the different reporting types
    if args.report_txt:
        if create_txt_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created txt report')

    if args.report_json:
        if create_json_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created json report')

    if args.report_html:
        if create_html_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created html report')

    if args.report_url_txt:
        if create_url_txt(url_d, outdir, out_filename):
            logger.info('Successfully created txt url report')

    if args.report_url_json:
        if create_url_json(url_d, outdir, out_filename):
            logger.info('Successfully created json url report')

    return 42


# This is the end my friend.

def main():
    parser_desc = "%s %s %s" % (_name_,_version_,_author_)
    parser = argparse.ArgumentParser(prog = __name__, description=parser_desc)
    parser.add_argument('-o','--outdir',action='store',dest='outdir',required=False,help="define the outdirectory for downloaded files and analysis output",default='pdfgrab')
    parser.add_argument('-u','--url',action='store',dest='url_single',required=False,help="grab pdf from specified url for analysis",default=None)
    parser.add_argument('-f','--file',action='store',dest='file_single',required=False,help="specify local path of pdf for analysis",default=None)
    parser.add_argument('-F','--files-dir',action='store',dest='files_dir',required=False,help="specify local path of *directory* with pdf *files* for analysis",default=None)
    parser.add_argument('-s','--search',action='store',dest='search',required=False,help="specify domain or tld to scrape for pdf-files",default=None)
    parser.add_argument('-sn','--search-number',action='store',dest='search_stop',required=False,help="specify how many files are searched",default=10,type=int)
    parser_desc = "%s %s %s in %s" % (name, version, author, date)
    parser = argparse.ArgumentParser(prog=name, description=parser_desc)
    parser.add_argument('-O', '--outdir', action='store', dest='outdir', required=False,
                        help="define the outdirectory for downloaded files and analysis output", default='pdfgrab')
    parser.add_argument('-o', '--outfile', action='store', dest='outfile', required=False,
                        help="define file with analysis output, if no parameter given it is outdir/pdfgrab_analysis, please note outfile is *always* written to output directory so do not add the dir as extra path")
    parser.add_argument('-u', '--url', action='store', dest='url_single', required=False,
                        help="grab pdf from specified url for analysis", default=None)
    # parser.add_argument('-U','--url-list',action='store',dest='urls_many',required=False,help="specify txt file with list of pdf urls to grab",default=None)
    #########
    parser.add_argument('-f', '--file', action='store', dest='file_single', required=False,
                        help="specify local path of pdf for analysis", default=None)
    parser.add_argument('-F', '--files-dir', action='store', dest='files_dir', required=False,
                        help="specify local path of *directory* with pdf *files* for analysis", default=None)
    parser.add_argument('-s', '--search', action='store', dest='search', required=False,
                        help="specify domain or tld to scrape for pdf-files", default=None)
    parser.add_argument('-sn', '--search-number', action='store', dest='search_stop', required=False,
                        help="specify how many files are searched", default=10, type=int)
    parser.add_argument('-z', '--disable-cert-check', action='store_false', dest='cert_check', required=False,help="if the target domain(s) run with old or bad certificates", default=True)

    parser.add_argument('-ghr', '--gen-html-report', action='store', dest='gen_html_report', required=False,help="If you want to generate the html report after editing the json outfile (parameter: pdfgrab_analysis.json)")
    parser.add_argument('-rtd', '--report-text-disable', action='store_false', dest='report_txt', required=False,help="Disable txt report",default=True)
    parser.add_argument('-rjd', '--report-json-disable', action='store_false', dest='report_json', required=False,help="Disable json report",default=True)
    parser.add_argument('-rhd', '--report-html-disable', action='store_false', dest='report_html', required=False,help="Disable html report",default=True)
    parser.add_argument('-rutd', '--report-url-text-disable', action='store_false', dest='report_url_txt', required=False,help="Disable url txt report",default=True)
    parser.add_argument('-rujd', '--report-url-json-disable', action='store_false', dest='report_url_json', required=False,help="Disable url json report",default=True)

    if len(sys.argv)<2:
        parser.print_help(sys.stderr)
        sys.exit()

    args = parser.parse_args()
    run(args)

    args = parser.parse_args()
    run(args)

if __name__ == "__main__":
    main()
    main()
supply/pdf.png (new binary file, 24 KiB; binary content not shown)
supply/pdf_base64.png (new file, 1 line; diff suppressed because the line is too long)