In general, every organization divides its information into two categories: public and private. Public information can be exposed outside the organization without harm. Private information, on the other hand, is deemed sensitive, and only certain individuals are authorized to access it. It must be stored in a system that is both physically and digitally secure. However, due to insufficient security mechanisms, lack of auditing, or even employee negligence, this sensitive information may get exposed to the public. In this post, I will explain how to find such leaked information on the Internet using search engines or dedicated tools.
Search engines constantly scan the Internet and index the web pages and files they find. Each file is accessible through its URL, which in many cases is publicized by the web application when the file is created. Thus, search engines can hold searchable indexes of files that were never intended to be public. It is only a matter of querying the search engines skillfully, particularly Google, to make them reveal those files to us. We can either write the search queries manually or use an automated tool that does the job on our behalf.
Querying Google for Document Files
The Google search engine is very flexible when it comes to customizing search queries. It has many built-in operators that narrow the search criteria and surface results an ordinary query would not. The operators relevant to our discussion here are site, which restricts results to a given domain, and filetype, which restricts results to a given file extension.
The syntax for using any operator is as follows: operator:search_term
We are interested in finding MS Office files, OpenOffice files, PDF files, and TXT files, since these are the file types most often used to store sensitive information. Thus, the file extensions that we will look for are:
doc, docx, xls, xlsx, ppt, pptx, pps, ppsx, odt, ods, odp, pdf, txt, rtf
If we assume that our target domain is example.com, we need to issue one of the following queries to find one particular file type:
- To find PDF files, we issue: site:example.com filetype:pdf
- To find DOC files, we issue: site:example.com filetype:doc
- To find DOCX files, we issue: site:example.com filetype:docx
However, if we would like to combine multiple file types in one search query, we need to use the OR operator (please note that it is case-sensitive and must be written in uppercase). Thus, to find all the file types mentioned above, we can issue the following search query:
site:example.com filetype:pdf OR filetype:doc OR filetype:docx OR filetype:xls OR filetype:xlsx OR filetype:ppt OR filetype:pptx OR filetype:pps OR filetype:ppsx OR filetype:odt OR filetype:ods OR filetype:odp OR filetype:txt OR filetype:rtf
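Typing such a long query by hand is error-prone, so here is a minimal Python sketch that assembles it from a list of extensions. The function name build_query and the EXTENSIONS list are my own illustrative choices; the script only builds the query string and does not contact Google.

```python
# File extensions discussed above, in the same order as the manual query.
EXTENSIONS = [
    "pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx",
    "pps", "ppsx", "odt", "ods", "odp", "txt", "rtf",
]


def build_query(domain, extensions):
    """Return a query limited to `domain` and to any of the given file types."""
    if not extensions:
        # Without file types, only the site restriction remains.
        return f"site:{domain}"
    # OR must be uppercase for Google to treat it as an operator.
    filetypes = " OR ".join(f"filetype:{ext}" for ext in extensions)
    return f"site:{domain} {filetypes}"


if __name__ == "__main__":
    print(build_query("example.com", EXTENSIONS))
```

Shortening the list lets you reuse the same function for a narrower search, for example only spreadsheets.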
Automatic Document Retrieval and Analysis with “Metagoofil”
Metagoofil is an open source tool that searches the Internet for certain file types at a given domain, downloads these files to the local system, and then extracts and analyzes the metadata inside them. Metadata includes things like usernames, email addresses, and creation dates, which can help in profiling the target organization. Metagoofil comes installed on Kali Linux by default, and using it is straightforward. For example, to download a maximum of 50 files of different types (pdf, doc, docx, xls, xlsx, and txt) from the domain example.com and save them to the “mydirectory” folder, we issue the following command:
# metagoofil -d example.com -t pdf,doc,docx,xls,xlsx,txt -n 50 -o mydirectory
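To give an idea of what kind of metadata sits inside such files, here is a small Python sketch that reads the author fields from a DOCX, XLSX, or PPTX file. It is not Metagoofil's own code, only an illustration: it relies on the fact that these formats are ZIP archives that normally carry a docProps/core.xml part, and the function name read_ooxml_metadata is my own.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of Office Open XML files.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> element name inside docProps/core.xml.
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
}


def read_ooxml_metadata(path):
    """Return selected core properties; a missing field is reported as None."""
    with zipfile.ZipFile(path) as archive:
        # Raises KeyError if the archive has no core properties part.
        root = ET.fromstring(archive.read("docProps/core.xml"))
    result = {}
    for key, tag in FIELDS.items():
        element = root.find(tag, NS)
        result[key] = element.text if element is not None else None
    return result
```

Calling read_ooxml_metadata on one of the downloaded .docx files returns a dictionary with the creator, the last editor, and the creation timestamp. Older binary formats such as .doc and .xls are not ZIP archives, so this sketch does not apply to them.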
Just like with the search results above, the downloaded files can be of different sensitivity levels; some of them might be public files, while others could be private files that were not kept secure. You can now analyze the contents of those files further, for example by searching them for passwords, usernames, and email addresses.
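As a starting point for that kind of analysis, here is a rough Python sketch that scans the plain-text files in the download folder for email addresses and for lines mentioning passwords or usernames. The regular expressions and the function name scan_text_files are my own simplifications; the sketch only reads .txt files, and the other formats would first need to be converted to text.

```python
import re
from pathlib import Path

# Deliberately simple patterns; expect both false positives and misses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
KEYWORD_RE = re.compile(r"\b(password|passwd|username|login)\b", re.IGNORECASE)


def scan_text_files(directory):
    """Map each .txt file with findings to its emails and keyword lines."""
    findings = {}
    for path in sorted(Path(directory).rglob("*.txt")):
        # Replace undecodable bytes instead of failing on odd encodings.
        text = path.read_text(encoding="utf-8", errors="replace")
        emails = sorted(set(EMAIL_RE.findall(text)))
        keyword_lines = [
            line.strip() for line in text.splitlines() if KEYWORD_RE.search(line)
        ]
        if emails or keyword_lines:
            findings[str(path)] = {
                "emails": emails,
                "keyword_lines": keyword_lines,
            }
    return findings


if __name__ == "__main__":
    # "mydirectory" is the output folder used in the Metagoofil command.
    for name, hits in scan_text_files("mydirectory").items():
        print(name, hits)
```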