The following is a short extract from what will become the GPEN Study Guide.
As organizations create documents, the software they use embeds a large amount of information in the resulting files beyond the visible content. Much of this metadata governs the formatting and display of the other data in the file. Beyond this formatting metadata, many file creation and editing tools add further metadata entries that can be very useful to penetration testers during the reconnaissance phase, such as:
· User names: Penetration testers often need user names for exploitation and password-guessing attacks
· File system paths: The full path of the original file at the time it was created can reveal details about the target organization, such as user names, server or share names, and directory naming conventions
· E-mail addresses: This data can be useful if the penetration test scope includes spear phishing tests
· Client-side software in use: Given that client-side exploitation is such a common attack vector, it can be helpful to penetration testers to know which client-side programs are in use
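As a rough illustration of where such entries can live, the sketch below reads author and application properties from an Office Open XML file (docx, xlsx, or pptx). These files are ZIP archives, and document properties are normally stored in the parts docProps/core.xml and docProps/app.xml. The function name office_properties and the set of properties collected are assumptions made for this example, not part of any particular tool, and a real engagement would more likely rely on a dedicated metadata extractor.

```python
import zipfile
import xml.etree.ElementTree as ET

# Package parts that usually carry document properties in Office Open XML files.
PROPERTY_PARTS = ("docProps/core.xml", "docProps/app.xml")

# Property names of interest (local names, with the XML namespace stripped).
# This selection is an assumption for the example; other properties exist.
INTERESTING = {"creator", "lastModifiedBy", "Application", "AppVersion", "Company"}


def office_properties(source):
    """Return {property: value} from a docx/xlsx/pptx path or file-like object."""
    found = {}
    with zipfile.ZipFile(source) as package:
        names = set(package.namelist())
        for part in PROPERTY_PARTS:
            if part not in names:
                continue  # the properties parts are optional
            root = ET.fromstring(package.read(part))
            for element in root.iter():
                # Tags look like "{namespace}creator"; keep only the local name.
                local_name = element.tag.rsplit("}", 1)[-1]
                if local_name in INTERESTING and element.text:
                    found[local_name] = element.text.strip()
    return found
```

Calling office_properties("report.docx") on a hypothetical file returns a dictionary in which creator and lastModifiedBy often hold user names, while Application and AppVersion point to the client-side software that produced the file.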
Almost every document type has some form of metadata, but some are richer in metadata than others. The following types of documents, generated and used by most enterprises, are of particular interest to penetration testers:
· pdf files: These files are associated with Adobe Acrobat and a variety of other PDF creation and editing tools.
· doc/docx, xls/xlsx, and ppt/pptx files: These files are associated with the Microsoft Office suite, but are also created and edited by several other related tools.
· jpg and jpeg: These image files often contain a significant amount of metadata, including data about the camera used to take the picture, file system paths from the machine where the image was edited, and details about the image-editing software.
· html and htm: These file types contain web pages, and may at first seem uninteresting. However, their comments and hidden form elements could contain metadata that is very useful to a penetration tester. Additionally, scripts embedded in the HTML may reveal sensitive information or undocumented features of a web application.
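The HTML case can be sketched with Python's standard html.parser module. The example below collects comments, hidden form fields, and external script references from a page that has already been retrieved. The class name ReconParser and the helper scan_html are hypothetical names chosen for this illustration, and the sketch assumes the page source is available as a string.

```python
from html.parser import HTMLParser


class ReconParser(HTMLParser):
    """Collect comments, hidden inputs, and script sources from HTML markup."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.hidden_fields = []   # (name, value) pairs
        self.script_sources = []  # values of <script src="...">

    def handle_comment(self, data):
        text = data.strip()
        if text:
            self.comments.append(text)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            self.hidden_fields.append((attributes.get("name"), attributes.get("value")))
        elif tag == "script" and attributes.get("src"):
            self.script_sources.append(attributes["src"])


def scan_html(markup):
    """Parse a string of HTML and return the populated parser."""
    parser = ReconParser()
    parser.feed(markup)
    parser.close()
    return parser
```

After result = scan_html(page_source), the lists result.comments, result.hidden_fields, and result.script_sources can be reviewed by hand. Inline script bodies are not captured here, so they would still need to be read separately.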