Compare commits

...

10 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| c0decave | 132750867f | release version 0.4.9 | 2019-11-07 12:51:30 +01:00 |
| c0decave | a89ac93c3d | bugfix pre-release 0.4.8 | 2019-11-06 12:49:54 +01:00 |
| c0decave | 4f63e62690 | updated Readme.md | 2019-11-05 14:50:02 +01:00 |
| c0decave | e1d7c3f760 | release of version 0.4.7 added html reporting, added logging, reordered libraries, added experimental xmp meta data, fixed bug introduced due xmp meta data, added todo list | 2019-11-05 14:42:24 +01:00 |
| c0decave | fa3b925d6f | yak yak yak | 2019-10-02 18:39:23 +02:00 |
| dash | f7339cda5f | yak yak yak | 2019-10-02 18:38:31 +02:00 |
| dash | 1c6d03ef0e | spelling error | 2019-10-02 18:19:05 +02:00 |
| dash | 5f9cdf86d1 | several changes, new features, bugfixes | 2019-10-02 18:17:41 +02:00 |
| dash | 64f48eef9a | cert-check, random-user-agent, catching too many requests | 2019-09-26 18:45:02 +02:00 |
| dash | fb2dfb1527 | updated readme | 2019-09-26 17:50:10 +02:00 |
13 changed files with 987 additions and 221 deletions

README.md

@@ -1,11 +1,26 @@
# pdfgrab
* Version 0.4.9

## What is it?
This is a reborn tool, used back in the epoch when dinosaurs were traipsing the earth.
Basically it analyses PDF files for metadata. You can point it at a file or a directory of PDFs.
You can give it the URL of a PDF or use the integrated googlesearch class (thanks to Mario Vilas)
to search for PDFs at a target site, download them and analyse them.

## What is new in 0.4.9?
* exported reporting methods to libreport.py
* added optargs for disabling the different report methods
* made the html report a bit more shiny
* added a function for generating the html report after analysis
* exported requests and data storage to a new library
* code fixes and clearer error handling
* removed the mandatory site: parameter at the search flag -s
* updated readme
* -s flag now accepts several domains
* cleaner console logging

## What information can be gathered?

@@ -22,18 +37,49 @@ However, common are the following things:
and some more :)

## What is this for anyways?
Well, this can be used for a range of things. However, I will only focus on the
security part of it. Depending on your target, you will get information about:
* software used in company xyz
  * possible version numbers
  * this will help you to identify existing vulnerabilities
* sometimes PDFs are re-rendered, for instance on upload
  * now you can figure out what the rendering engine is and find bugs in it
* who the author of a document is
  * sometimes usernames are users of the OS itself
  * congrats, by analysing a PDF you just found an existing username in the domain
  * combine this information with the first part and you know which user uses which software
* passwords ... do I need to say more?

## Is it failproof?
Not at all. Please note that metadata, like any other data, is just written to the file, so it can be changed before the file is uploaded. That said, the share of companies that really change this sort of data is maybe 20%. You will also recognize when it is empty or the like.

## How does it work?
Every file type more complex than .txt or the like uses metadata for convenience, customer support or simply to advertise what it was made with.
There is a lot of information online about metadata in different sorts of files like pictures, documents, videos and music. This tool
focuses on PDF only.
If you are new to the term, have a look here:
* https://en.wikipedia.org/wiki/Metadata

Also, if you are interested in real PDF analysis, this tool will only do the basics for you. It has not been written to analyse malicious or otherwise interesting files. Its purpose is to give you an idea of what is used at target xyz; a minimal extraction sketch follows below.
If you are looking for more in-depth analysis, I recommend the tools of Didier Stevens:
* https://blog.didierstevens.com/programs/pdf-tools/
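
The sketch referenced above: a minimal, hypothetical metadata dump with PyPDF2 (the library pdfgrab itself uses); `example.pdf` stands in for any local file you want to inspect.
```
# Minimal metadata dump with PyPDF2, roughly what pdfgrab automates.
# "example.pdf" is a hypothetical local file.
from PyPDF2 import PdfFileReader

with open("example.pdf", "rb") as fh:
    reader = PdfFileReader(fh)
    info = reader.getDocumentInfo()
    if info is not None:
        for key in info.keys():
            print(key, info[key])
```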
## Download
```
git clone https://github.com/c0decave/pdfgrab
cd pdfgrab
python3 pdfgrab.py -h
```
## Usage
Those are your major options:
* grab pdf from url and analyse
* search site for pdfs via google, grab and analyse
* analyse a local pdf
@@ -73,9 +119,17 @@ File: pdfgrab/ols2004v2.pdf
--------------------------------------------------------------------------------
```
### Directory Mode
```
./pdfgrab.py -F pdfgrab/
```
This will analyse all PDFs in that directory.
### Google Search Mode
```
# ./pdfgrab.py -s kernel.org
```
Result:
```
@@ -107,10 +161,30 @@ File: pdfgrab/bpf_global_data_and_static_keys.pdf
/PTEX.Fullbanner This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
```
### Google Search Mode, several domains
```
# ./pdfgrab.py -s example.com,example.us
```
### Reporting
pdfgrab writes the gathered information in several formats. Unless disabled by one of the reporting flags (see -h), you will
find the following in the output directory (a short sketch of reading the json data follows after the list):
* html report
* text report
* text url list
* json data
* json url list
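
A hypothetical sketch of working with the json data afterwards, assuming the default output directory (`pdfgrab`) and the default outfile name (`pdfgrab_analysis`):
```
# Read pdfgrab's json report and print one metadata field per analysed file.
import json

with open("pdfgrab/pdfgrab_analysis.json") as fr:
    analysis = json.load(fr)

for path, entry in analysis.items():
    print(entry["filename"], entry["data"].get("/Producer", "n/a"))
```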
### Logging
pdfgrab creates a logfile in the running directory called "pdfgrab.log"
## Google
* Search: filetype:pdf site:com
* Results: 264.000.000

## Disclaimer

docs/Changelog (new file, 43 lines)

@@ -0,0 +1,43 @@
Changelog
=========

Version 0.4.9
-------------
* exported reporting methods to libreport.py
* added optargs for disabling the different report methods
* made the html report a bit more shiny
* added a function for generating the html report after analysis
* exported requests and data storage to a new library
* code fixes and clearer error handling
* removed the mandatory site: parameter at the search flag -s
* updated readme
* -s flag now accepts several domains
* cleaner console logging

Version 0.4.8 Bugfix Pre-Release
--------------------------------
* catching google too many requests
* catching urlopen dns not resolvable error
* fixing nasty bug in store_pdf/find_name
* fixing zero size pdf error
* extra logging

Version 0.4.7
-------------
* added html output
* added xmp meta testing

Version 0.4.6
-------------
* added help when no argument is given at the cli
* added googlesearch lib

Version 0.4.5
-------------
* exported helper functions to libs/helper.py
* added libs/liblog.py

docs/Todo (new file, 4 lines)

@@ -0,0 +1,4 @@
* add xmp meta to output files
* code reordering
* clean up parsing functions
* add report formats

libs/__init__.py (new, empty file)

libs/libgoogle.py (new file, 58 lines)

@@ -0,0 +1,58 @@
import googlesearch as gs
import urllib

from libs.libhelper import *


def get_random_agent():
    return (gs.get_random_user_agent())


def hits_google(search, args):
    ''' the function where googlesearch from mario vilas
    is called
    '''
    s = search.split(',')
    query = 'filetype:pdf'
    try:
        hits = gs.hits(query, domains=s, user_agent=gs.get_random_user_agent())
    except urllib.error.HTTPError as e:
        return False, e
    except urllib.error.URLError as e:
        return False, e
    except IndexError as e:
        return False, e
    return True, hits


def search_google(search, args):
    ''' the function where googlesearch from mario vilas
    is called
    '''
    s = search.split(',')
    search_stop = args.search_stop
    query = 'filetype:pdf'
    # query = 'site:%s filetype:pdf' % search
    # print(query)
    urls = []
    try:
        for url in gs.search(query, num=20, domains=s, stop=search_stop, user_agent=gs.get_random_user_agent()):
            # print(url)
            urls.append(url)
    except urllib.error.HTTPError as e:
        # print('Error: %s' % e)
        return False, e
    except urllib.error.URLError as e:
        return False, e
    return True, urls
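
A hypothetical usage sketch (not part of the commit): `search_google()` only reads `search_stop` from the argparse namespace, so a stand-in object is enough here. It assumes the googlesearch package is installed and that the script is run from the repository root so the `libs` package resolves.
```
from types import SimpleNamespace

from libs.libgoogle import search_google

args = SimpleNamespace(search_stop=10)          # stand-in for the argparse namespace
ok, result = search_google("kernel.org", args)  # result is a list of urls, or the caught error
if ok:
    for url in result:
        print(url)
else:
    print("search failed:", result)
```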

libs/libhelper.py (new file, 55 lines)

@@ -0,0 +1,55 @@
import os
import sys

from Crypto.Hash import SHA256


def check_file_size(filename):
    ''' simply check if byte size is bigger than 0 bytes
    '''
    fstat = os.stat(filename)
    if fstat.st_size == 0:
        return False
    return True


def make_directory(outdir):
    ''' naive mkdir function '''
    try:
        os.mkdir(outdir)
    except:
        # print("[W] mkdir, some error, directory probably exists")
        pass


def url_strip(url):
    url = url.rstrip("\n")
    url = url.rstrip("\r")
    return url


def create_sha256(hdata):
    ''' introduced to create hashes of filenames, to have a uniqid
    of course hashes of the file itself will be the next topic
    '''
    hobject = SHA256.new(data=hdata.encode())
    return (hobject.hexdigest())


def find_name(pdf):
    ''' simply parses the urlencoded name and extracts the storage name
    i would not be surprised this naive approach can lead to fuckups
    '''
    name = ''
    # find the name of the file
    name_list = pdf.split("/")
    len_list = len(name_list)
    # ugly magic ;-)
    # what happens is, that files can also be behind urls like:
    # http://host/pdf/
    # so splitting up the url and always going with the last item after the slash
    # can in that case result in an empty name, so we go one field back in the list
    # and use that as the name
    if name_list[len_list - 1] == '':
        name = name_list[len_list - 2]
    else:
        name = name_list[len_list - 1]
    return name
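
A hypothetical usage sketch (not part of the commit) for the two helpers the rest of the code leans on most; the URL is a made-up example.
```
from libs.libhelper import create_sha256, find_name

url = "http://example.com/docs/report.pdf"   # hypothetical url
print(find_name(url))       # -> "report.pdf", the last non-empty path component
print(create_sha256(url))   # hex digest, used as a unique key for the url dictionary
```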

libs/liblog.py (new file, 19 lines)

@@ -0,0 +1,19 @@
import logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
file_handler = logging.FileHandler('pdfgrab.log')
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)
file_formatter = logging.Formatter('%(asctime)s:%(name)s:%(levelname)s:%(message)s')
console_formatter = logging.Formatter('%(levelname)s:%(message)s')
file_handler.setFormatter(file_formatter)
console_handler.setFormatter(console_formatter)
logger.addHandler(file_handler)
logger.addHandler(console_handler)
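
A hypothetical usage sketch (not part of the commit): any module importing this logger gets DEBUG and above written to pdfgrab.log and WARNING and above printed to the console.
```
from libs.liblog import logger

logger.info("written to pdfgrab.log only")
logger.warning("written to pdfgrab.log and printed to the console")
```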

libs/libreport.py (new file, 173 lines)

@@ -0,0 +1,173 @@
import os
import sys
import json

from json2html import *

from libs.pdf_png import get_png_base64


def prepare_analysis_dict(ana_queue):
    '''params: ana_queue - queue with collected information
    '''
    # initiate analysis dictionary
    analysis_dict = {}
    # move the analysis dictionaries in the queue back into one dictionary
    while ana_queue.empty() == False:
        item = ana_queue.get()
        # print('item ', item)
        analysis_dict.update(item)
    # ana_q is empty now, return the newly created dictionary
    return analysis_dict


def create_txt_report(analysis_dict, outdir, out_filename):
    ''' create a txt report in the output directory
    '''
    # separator line
    sep = '-' * 80 + '\n'
    # create output filepath
    txtout = "%s/%s.txt" % (outdir, out_filename)
    # open the file and return the filedescriptor
    fwtxt = open(txtout, 'w')
    # get the keys of the dict
    for k in analysis_dict.keys():
        # write separator
        fwtxt.write(sep)
        # build entry filename of the pdf
        fname = 'File: %s\n' % (analysis_dict[k]['filename'])
        # build data entry
        ddata = analysis_dict[k]['data']
        # write the filename
        fwtxt.write(fname)
        # write the metadata
        for kdata in ddata.keys():
            metatxt = '%s:%s\n' % (kdata, ddata[kdata])
            fwtxt.write(metatxt)
        # write separator
        fwtxt.write(sep)
    # close the file
    fwtxt.close()
    return True


def create_json_report(analysis_dict, outdir, out_filename):
    ''' create a json report in the output directory
    '''
    # build json output name
    jsonout = "%s/%s.json" % (outdir, out_filename)
    # open up json output file
    fwjson = open(jsonout, 'w')
    # convert dictionary to json data
    jdata = json.dumps(analysis_dict)
    # write json data to file
    fwjson.write(jdata)
    # close file
    fwjson.close()
    return True


def create_html_report(analysis_dict, outdir, out_filename):
    ''' create a html report from json data using json2html in the output directory
    '''
    # build up path for html output file
    htmlout = "%s/%s.html" % (outdir, out_filename)
    # open htmlout filedescriptor
    fwhtml = open(htmlout, 'w')
    # some html stuff
    pdfpng = get_png_base64('supply/pdf_base64.png')
    html_style = '<style>.center { display: block; margin-left: auto;margin-right: auto;} table {border-collapse: collapse;} th, td { border: 1px solid black;text-align: left; }</style>\n'
    html_head = '<html><head><title>pdfgrab - {0} item/s</title>{1}</head>\n'.format(len(analysis_dict), html_style)
    html_pdf_png = '<p class="center"><img class="center" src="data:image/jpeg;base64,{0}"><br><center>pdfgrab - grab and analyse pdf files</center><br></p>'.format(pdfpng)
    html_body = '<body>{0}\n'.format(html_pdf_png)
    html_end = '\n<br><br><p align="center"><a href="https://github.com/c0decave/pdfgrab">pdfgrab</a> by <a href="https://twitter.com/User_to_Root">dash</a></p></body></html>\n'
    # some attributes
    attr = 'id="meta-data" class="table table-bordered table-hover", border=1, cellpadding=3 summary="Metadata"'
    # convert dictionary to json data
    # in this mode each finding gets its own table; there are other possibilities,
    # but for now i go with this
    html_out = ''
    for k in analysis_dict.keys():
        trans = analysis_dict[k]
        jdata = json.dumps(trans)
        html = json2html.convert(json=jdata, table_attributes=attr)
        html_out = html_out + html + "\n"
        # html_out = html_out + "<p>" + html + "</p>\n"
    # jdata = json.dumps(analysis_dict)
    # create html
    # html = json2html.convert(json = jdata, table_attributes=attr)
    # write html
    fwhtml.write(html_head)
    fwhtml.write(html_body)
    fwhtml.write(html_out)
    fwhtml.write(html_end)
    # close html file
    fwhtml.close()
    return True


def create_url_json(url_d, outdir, out_filename):
    ''' create a json url file in the output directory
    '''
    # create url savefile
    jsonurlout = "%s/%s_url.json" % (outdir, out_filename)
    # open up file for writing urls down
    fwjson = open(jsonurlout, 'w')
    # convert url dictionary to json
    jdata = json.dumps(url_d)
    # write json data to file
    fwjson.write(jdata)
    # close filedescriptor
    fwjson.close()
    return True


def create_url_txt(url_d, outdir, out_filename):
    ''' create a txt url file in the output directory
    '''
    # build up txt out path
    txtout = "%s/%s_url.txt" % (outdir, out_filename)
    # open up our url txtfile
    fwtxt = open(txtout, 'w')
    # iterate through the keys of the url dictionary
    for k in url_d.keys():
        # get the entry
        ddata = url_d[k]
        # create meta data for saving
        metatxt = '%s:%s\n' % (ddata['url'], ddata['filename'])
        # write metadata to file
        fwtxt.write(metatxt)
    # close fd
    fwtxt.close()
    return True
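
A hypothetical usage sketch (not part of the commit): the report functions expect a dictionary keyed by file path, each entry carrying `'filename'` and a `'data'` dict of metadata, and they assume the output directory already exists. The entry values below are made up.
```
from libs.libhelper import make_directory
from libs.libreport import create_json_report, create_txt_report

analysis_dict = {
    "pdfgrab/report.pdf": {                      # hypothetical entry
        "filename": "report.pdf",
        "data": {"/Producer": "pdfTeX-1.40.17", "/Author": "jdoe"},
    }
}
make_directory("pdfgrab")                        # default output directory
create_txt_report(analysis_dict, "pdfgrab", "pdfgrab_analysis")
create_json_report(analysis_dict, "pdfgrab", "pdfgrab_analysis")
```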

libs/librequest.py (new file, 162 lines)

@@ -0,0 +1,162 @@
import os
import sys
import json
import socket

import requests

from libs.liblog import logger
from libs.libhelper import *
from libs.libgoogle import get_random_agent


def store_file(url, data, outdir):
    ''' storing the downloaded data to a file
    params: url - used to create the filename
            data - the data of the file
            outdir - the directory to store in
    returns: dict { "code":<code>, "data":<savepath>, "error":<error> } - the status code, the savepath, the errorcode
    '''
    logger.info('Store file {0}'.format(url))
    name = find_name(url)
    # only allow the stored file a name with 50 chars
    if len(name) > 50:
        name = name[:49]
    # build up the save path
    save = "%s/%s" % (outdir, name)
    try:
        f = open(save, "wb")
    except OSError as e:
        logger.warning('store_file {0}'.format(e))
        # return ret_dict
        return {"code": False, "data": save, "error": e}
    # write the data and keep the number of written bytes
    ret = f.write(data)
    # check if bytes are zero
    if ret == 0:
        logger.warning('Written {0} bytes for file: {1}'.format(ret, save))
    else:
        # log to info that the bytes and file have been written
        logger.info('Written {0} bytes for file: {1}'.format(ret, save))
    # close file descriptor
    f.close()
    # return ret_dict
    return {"code": True, "data": save, "error": False}


def download_file(url, args, header_data):
    ''' downloading the file for later analysis
    params: url - the url
            args - argparse args namespace
            header_data - pre-defined header data
    returns: ret_dict
    '''
    # check the remote tls certificate or not?
    cert_check = args.cert_check
    # run our try catch routine
    try:
        # request the url and save the response in req
        # give header data and set verify as delivered by args.cert_check
        req = requests.get(url, headers=header_data, verify=cert_check)
    except requests.exceptions.SSLError as e:
        logger.warning('download file {0} {1}'.format(url, e))
        # no response object exists at this point
        return {"code": False, "data": False, "error": e}
    except requests.exceptions.InvalidSchema as e:
        logger.warning('download file {0} {1}'.format(url, e))
        # return retdict
        return {"code": False, "data": False, "error": e}
    except socket.gaierror as e:
        logger.warning('download file, host not known {0} {1}'.format(url, e))
        return {"code": False, "data": False, "error": e}
    except:
        logger.warning('download file, something wrong with remote server? {0}'.format(url))
        # return retdict
        return {"code": False, "data": False, "error": True}
    # finally:
    #     lets close the socket
    #     req.close()
    # return retdict
    return {"code": True, "data": req, "error": False}


def grab_run(url, args, outdir):
    ''' function keeping all the steps together for the user call of grabbing
    just one file and analysing it
    '''
    header_data = {'User-Agent': get_random_agent()}
    rd_download = download_file(url, args, header_data)
    code_down = rd_download['code']
    # if code is True the download of the file was successful
    if code_down:
        rd_evaluate = evaluate_response(rd_download)
        code_eval = rd_evaluate['code']
        # if code is True, evaluation was also successful
        if code_eval:
            # get the content from the evaluated request
            content = rd_evaluate['data'].content
            # call store file
            rd_store = store_file(url, content, outdir)
            # get the code
            code_store = rd_store['code']
            # get the savepath
            savepath = rd_store['data']
            # if code is True, storing of the file was also successful
            if code_store:
                return {"code": True, "data": savepath, "error": False}
    return {"code": False, "data": False, "error": True}


def evalute_content(ret_dict):
    pass


def evaluate_response(ret_dict):
    ''' this method usually comes after download_file,
    it evaluates what has happened and whether we even have data to process
    or not
    params: ret_dict - holds the req object of the conducted request
    returns: dict { "code":<code>, "data":<req>, "error":<error> }
    '''
    # extract data from ret_dict
    req = ret_dict['data']
    # get status code
    url = req.url
    status = req.status_code
    reason = req.reason
    # ahh everything is fine
    if status == 200:
        logger.info('download file, {0} {1} {2}'.format(url, reason, status))
        return {"code": True, "data": req, "error": False}
    # nah something is not like it should be
    else:
        logger.warning('download file, {0} {1} {2}'.format(url, reason, status))
        return {"code": False, "data": req, "error": True}
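
A hypothetical usage sketch (not part of the commit): `grab_run()` only reads `cert_check` from the argparse namespace and returns the same `{"code", "data", "error"}` dictionary used throughout this module. It needs network access and an existing output directory.
```
from types import SimpleNamespace

from libs.libhelper import make_directory
from libs.librequest import grab_run

args = SimpleNamespace(cert_check=True)          # stand-in for the argparse namespace
make_directory("pdfgrab")
result = grab_run("http://example.com/paper.pdf", args, "pdfgrab")   # hypothetical url
if result["code"]:
    print("stored at", result["data"])
```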

libs/pdf_png.py (new file, 5 lines)

@@ -0,0 +1,5 @@
def get_png_base64(filename):
    fr = open(filename, 'r')
    buf = fr.read()
    return buf
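
Despite its name, `get_png_base64()` just reads a text file that already contains base64 data; the commit ships that file as `supply/pdf_base64.png` and embeds it into the html report. A hypothetical sketch of how such a file could be regenerated from the binary `supply/pdf.png` added in the same commit:
```
import base64

# encode the binary logo once and store it as the text file the html report embeds
with open("supply/pdf.png", "rb") as fr:
    encoded = base64.b64encode(fr.read()).decode("ascii")
with open("supply/pdf_base64.png", "w") as fw:
    fw.write(encoded)
```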

pdfgrab.py (rewritten; the new version follows)

@@ -1,264 +1,436 @@
#!/usr/bin/env python3
#####################
# new features, new layout, new new :>
# by dash

import xml
import argparse
import json
import os
import queue
import sys
import urllib

from json2html import *

import PyPDF2
# googlesearch library
import googlesearch as gs
import requests
from PyPDF2 import pdf

# functions moved to extern files
from libs.liblog import logger
from libs.libhelper import *
from libs.libgoogle import *
from libs.libreport import *
from libs.librequest import grab_run

from IPython import embed

# some variables in regard of the tool itself
name = 'pdfgrab'
version = '0.4.9'
author = 'dash'
date = 'November 2019'

# queues for processing
# this queue holds the URL locations of files to download
url_q = queue.Queue()
url_d = {}
# this queue holds the paths of files to analyse
pdf_q = queue.Queue()
# this is the analysis queue, keeping the data for further processing
ana_q = queue.Queue()


def add_queue(tqueue, data):
    ''' wrapper function for easily adding data to the
    created queues. otherwise the functions would be scattered with
    endless queue commands ;)
    '''
    tqueue.put(data)
    # d = tqueue.get()
    # logging.debug(d)
    return True


def process_queue_data(filename, data, queue_type):
    ''' main function for processing gathered data
    i use this central function for it, so it is at *one* place
    and it is easy to change the data handling at a later step without
    deconstructing the whole code
    '''
    ana_dict = {}
    url_dict = {}
    if queue_type == 'doc_info':
        logger.info('Queue DocInfo Data {0}'.format(filename))
        name = find_name(filename)
        path = filename
        # create a hash over the file path
        # hm, removed for now
        # path_hash = create_sha256(path)
        # order data in dict for analyse queue
        ana_dict = {path: {'filename': name, 'data': data}}
        # print('data:', data)
        # print('ana_dict:', ana_dict)
        # add the data to queue
        add_queue(ana_q, ana_dict)
    elif queue_type == 'doc_xmp_info':
        logger.info('Queue DocXMPInfo Data {0}'.format(filename))
        logger.warning('DocXMPInfo json processing not supported {0}'.format(filename))
    elif queue_type == 'url':
        # prepare queue entry
        logger.info('Url Queue {0}'.format(data))
        url_dict = {'url': data, 'filename': filename}
        sha256 = create_sha256(data)
        url_d[sha256] = url_dict
        # add dict to queue
        add_queue(url_q, url_dict)
    else:
        print('[-] Sorry, unknown queue. DEBUG!')
        logger.critical('Unknown queue')
        return False
    return True


def get_xmp_meta_data(filename, filehandle):
    ''' get the xmp meta data
    '''
    err_dict = {}
    real_extract = {}
    xmp_dict = {}
    fh = filehandle
    try:
        xmp_meta = fh.getXmpMetadata()
    except xml.parsers.expat.ExpatError as e:
        logger.warning('get_xmp_meta_data error {0}'.format(e))
        err_dict = {'error': str(e)}
        return -1
    finally:
        process_queue_data(filename, err_dict, 'doc_xmp_info')
    if xmp_meta != None:
        try:
            print('xmp_meta: {0} {1} {2} {3} {4} {5}'.format(xmp_meta.pdf_producer, xmp_meta.pdf_pdfversion, xmp_meta.dc_contributor, xmp_meta.dc_creator, xmp_meta.dc_date, xmp_meta.dc_subject))
            # print('xmp_meta cache: {0}'.format(xmp_meta.cache))
            # print('xmp_meta custom properties: {0}'.format(xmp_meta.custom_properties))
            # embed()
        except AttributeError as e:
            logger.warning('xmp_meta print {0}'.format(e))
            return False
    return xmp_dict


def get_DocInfo(filename, filehandle):
    ''' the easy way to extract metadata

    indirectObjects...
    there is an interesting situation, some pdfs seem to have the same information stored
    in different places, or things are overwritten or whatever.
    this sometimes results in an extract output with indirect objects ... this is ugly.
    bad example:

    {'/Title': IndirectObject(111, 0), '/Producer': IndirectObject(112, 0), '/Creator': IndirectObject(113, 0), '/CreationDate': IndirectObject(114, 0), '/ModDate': IndirectObject(114, 0), '/Keywords': IndirectObject(115, 0), '/AAPL:Keywords': IndirectObject(116, 0)}

    normally getObject() is the method to use to fix this, however it was not working in this particular case.
    this thing might even bring up some more nasty things; as a (probably weak) defense and workaround
    the pdf object is not used anymore after this function, data is converted to strings...
    '''
    err_dict = {}
    real_extract = {}
    fh = filehandle
    try:
        extract = fh.documentInfo
    except pdf.utils.PdfReadError as e:
        logger.warning('get_doc_info {0}'.format(e))
        err_dict = {'error': str(e)}
        return -1
    except PyPDF2.utils.PdfReadError as e:
        logger.warning('get_doc_info {0}'.format(e))
        err_dict = {'error': str(e)}
        return -1
    finally:
        process_queue_data(filename, err_dict, 'doc_info')
    print('-' * 80)
    print('File: %s' % filename)
    # embed()
    # there are situations when documentInfo does not return anything
    # and extract is None
    if extract == None:
        err_dict = {'error': 'getDocumentInfo() returns None'}
        process_queue_data(filename, err_dict, 'doc_info')
        return -1
    try:
        for k in extract.keys():
            key = str(k)
            value = str(extract[k])
            edata = '%s %s' % (key, value)
            print(edata)
            real_extract[key] = value
        print('-' * 80)
    except PyPDF2.utils.PdfReadError as e:
        logger.warning('get_doc_info {0}'.format(e))
        err_dict = {'error': str(e)}
        process_queue_data(filename, err_dict, 'doc_info')
        return -1
    process_queue_data(filename, real_extract, 'doc_info')


def decrypt_empty_pdf(filename):
    ''' this function simply tries to decrypt the pdf with the null password.
    this works as long as no real password has been set;
    if a complex password has been set -> john
    '''
    fr = pdf.PdfFileReader(open(filename, "rb"))
    try:
        fr.decrypt('')
    except NotImplementedError as e:
        logger.warning('decrypt_empty_pdf {0}{1}'.format(filename, e))
        return -1
    return fr


def check_encryption(filename):
    ''' basic function to check if file is encrypted
    '''
    print(filename)
    try:
        fr = pdf.PdfFileReader(open(filename, "rb"))
        print(fr)
    except pdf.utils.PdfReadError as e:
        logger.warning('check encryption {0}'.format(e))
        return -1
    if fr.getIsEncrypted() == True:
        print('[i] File encrypted %s' % filename)
        nfr = decrypt_empty_pdf(filename)
        if nfr != -1:
            get_DocInfo(filename, nfr)
            get_xmp_meta_data(filename, nfr)
    else:
        get_DocInfo(filename, fr)
        get_xmp_meta_data(filename, fr)
    # fr.close()
    return True


def _parse_pdf(filename):
    ''' the real parsing function '''
    logger.warning('{0}'.format(filename))
    if check_file_size(filename):
        ret = check_encryption(filename)
        return ret
    else:
        logger.warning('Filesize is 0 bytes at file: {0}'.format(filename))
        return False


def seek_and_analyse(search, args, outdir):
    ''' function keeping all the steps of searching for pdfs and analysing
    them together
    '''
    # check how many hits we got
    # seems like the method is broken in the googlesearch library :(
    # code, hits = hits_google(search, args)
    # if code:
    #     print('Got {0} hits'.format(hits))

    # use the search function of googlesearch to get the results
    code, values = search_google(search, args)
    if not code:
        if values.code == 429:
            logger.warning('[-] Too many requests, time to change ip address or use proxychains')
        else:
            logger.warning('Google returned error {0}'.format(values))
        return -1
    for item in values:
        filename = find_name(item)
        process_queue_data(filename, item, 'url')
    # urls = search_pdf(search, args)

    # *if* we get an answer
    if url_q.empty() == False:
        # process through the list and get the pdfs
        while url_q.empty() == False:
            item = url_q.get()
            # print(item)
            url = item['url']
            rd_grabrun = grab_run(url, args, outdir)
            code = rd_grabrun['code']
            savepath = rd_grabrun['data']
            if code:
                _parse_pdf(savepath)
    return True


def run(args):
    # initialize logger
    logger.info('{0} Started'.format(name))

    # create some variables
    # outfile name
    if args.outfile:
        out_filename = args.outfile
    else:
        out_filename = 'pdfgrab_analysis'

    # specify output directory
    outdir = args.outdir
    # create output directory
    make_directory(outdir)

    # lets see what the object is
    if args.url_single:
        url = args.url_single
        logger.info('Grabbing {0}'.format(url))
        # download and store the file via libs/librequest.py, then parse it
        rd_grabrun = grab_run(url, args, outdir)
        if rd_grabrun['code']:
            _parse_pdf(rd_grabrun['data'])
    elif args.file_single:
        pdffile = args.file_single
        logger.info('Parsing {0}'.format(pdffile))
        _parse_pdf(pdffile)
    elif args.search:
        search = args.search
        logger.info('Seek and analyse {0}'.format(search))
        if not seek_and_analyse(search, args, outdir):
            return -1
    elif args.files_dir:
        directory = args.files_dir
        logger.info('Analyse pdfs in directory {0}'.format(directory))
        try:
            files = os.listdir(directory)
        except:
            logger.warning('Error in args.files_dir')
            return False
        for f in files:
            # naive filter function, later usage of filemagic possible
            if f.find('.pdf') != -1:
                fpath = '%s/%s' % (directory, f)
                _parse_pdf(fpath)
    # simply generate the html report from a json outfile
    elif args.gen_html_report:
        fr = open(args.gen_html_report, 'r')
        analysis_dict = json.loads(fr.read())
        if create_html_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created html report')
            sys.exit(0)
        else:
            sys.exit(1)
    else:
        print('[-] Dunno what to do, bro. Use help. {0} -h'.format(sys.argv[0]))
        sys.exit(1)

    # creating the analysis dictionary for reporting
    analysis_dict = prepare_analysis_dict(ana_q)
    # lets go through the different reporting types
    if args.report_txt:
        if create_txt_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created txt report')
    if args.report_json:
        if create_json_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created json report')
    if args.report_html:
        if create_html_report(analysis_dict, outdir, out_filename):
            logger.info('Successfully created html report')
    if args.report_url_txt:
        if create_url_txt(url_d, outdir, out_filename):
            logger.info('Successfully created txt url report')
    if args.report_url_json:
        if create_url_json(url_d, outdir, out_filename):
            logger.info('Successfully created json url report')
    return 42
    # This is the end my friend.


def main():
    parser_desc = "%s %s %s in %s" % (name, version, author, date)
    parser = argparse.ArgumentParser(prog=name, description=parser_desc)
    parser.add_argument('-O', '--outdir', action='store', dest='outdir', required=False,
                        help="define the outdirectory for downloaded files and analysis output", default='pdfgrab')
    parser.add_argument('-o', '--outfile', action='store', dest='outfile', required=False,
                        help="define file with analysis output, if no parameter is given it is outdir/pdfgrab_analysis; please note the outfile is *always* written to the output directory, so do not add the dir as an extra path")
    parser.add_argument('-u', '--url', action='store', dest='url_single', required=False,
                        help="grab pdf from specified url for analysis", default=None)
    # parser.add_argument('-U', '--url-list', action='store', dest='urls_many', required=False,
    #                     help="specify txt file with list of pdf urls to grab", default=None)
    #########
    parser.add_argument('-f', '--file', action='store', dest='file_single', required=False,
                        help="specify local path of pdf for analysis", default=None)
    parser.add_argument('-F', '--files-dir', action='store', dest='files_dir', required=False,
                        help="specify local path of *directory* with pdf *files* for analysis", default=None)
    parser.add_argument('-s', '--search', action='store', dest='search', required=False,
                        help="specify domain or tld to scrape for pdf-files", default=None)
    parser.add_argument('-sn', '--search-number', action='store', dest='search_stop', required=False,
                        help="specify how many files are searched", default=10, type=int)
    parser.add_argument('-z', '--disable-cert-check', action='store_false', dest='cert_check', required=False,
                        help="if the target domain(s) run with old or bad certificates", default=True)
    parser.add_argument('-ghr', '--gen-html-report', action='store', dest='gen_html_report', required=False,
                        help="If you want to generate the html report after editing the json outfile (parameter: pdfgrab_analysis.json)")
    parser.add_argument('-rtd', '--report-text-disable', action='store_false', dest='report_txt', required=False,
                        help="Disable txt report", default=True)
    parser.add_argument('-rjd', '--report-json-disable', action='store_false', dest='report_json', required=False,
                        help="Disable json report", default=True)
    parser.add_argument('-rhd', '--report-html-disable', action='store_false', dest='report_html', required=False,
                        help="Disable html report", default=True)
    parser.add_argument('-rutd', '--report-url-text-disable', action='store_false', dest='report_url_txt', required=False,
                        help="Disable url txt report", default=True)
    parser.add_argument('-rujd', '--report-url-json-disable', action='store_false', dest='report_url_json', required=False,
                        help="Disable url json report", default=True)

    if len(sys.argv) < 2:
        parser.print_help(sys.stderr)
        sys.exit()

    args = parser.parse_args()
    run(args)


if __name__ == "__main__":
    main()

supply/pdf.png (new binary file, 24 KiB; not shown)

supply/pdf_base64.png (new file, 1 line; diff suppressed because the line is too long)