OSINT technologies that enable corporations to gather information on their competitors are already available. These technologies will also have a significant impact on the business environment, forcing corporations to be more open and transparent about their activities and products. The opportunities to leverage OSINT are therefore already there.

In the future, OSINT technologies will become more advanced and will be capable of gathering more data on individuals and corporations. The opportunities for businesses will increase and those that miss out on the opportunities will lose out on valuable information. As we look forward to the future, we need to ask ourselves what open sources can be used for data gathering. The answer is, almost anything. With new technologies coming online daily, there is no telling where we will see OSINT in 2021 and beyond.

Definition and concepts

To define “open-source intelligence” (OSINT), it is first necessary to define “intelligence.” Military intelligence is the collection, processing, and use of information to provide guidance and direction to assist commanders in their decisions. Business intelligence is essentially the same thing, with the commanders being the business leaders.

The U.S. Department of Defense (DoD) defines OSINT as “an intelligence that is produced from publicly available information and is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement.” Open-source intelligence differs from other types of intelligence because its sources are publicly available, which means they are accessible to the public without breaching any copyright or privacy laws. As a result, businesses can use open-source tools to learn about their competitors.

There are four categories of open source information and intelligence:

  • Open-source data (OSD): generic data from a primary source, such as satellite images, telephone call data and metadata, datasets, survey data, photographs, and audio and video recordings

  • Open-source information (OSINF): generic data that has been screened based on a specific criterion, such as books about a specific subject, articles, dissertations, artworks, and interviews

  • Open-source intelligence (OSINT): all the information that has been discovered and filtered to meet a specific need or purpose. OSINT can be used directly in any intelligence context – it is the output from open-source information material processing

  • Validated open-source intelligence (OSINT-V): OSINT that can be relied upon with a high degree of certainty. This is essential because some “adversaries” may spread inaccurate OSINT to mislead OSINT analysis.

It is also important to distinguish between data, information, and knowledge. Data is unvarnished facts: “the price of silver is $28 per ounce.” Information is data interpreted within a specific context: “the price of silver per ounce has risen from $25 to $28 in one week.” Lastly, knowledge is the insight learned or inferred from information and experience.

Uses of OSINT

Corporations use OSINT tools mainly to discover opportunities for future growth and monitor competitors’ activities. Given the ubiquity of the Internet, companies with a limited budget can benefit from using open-source intelligence tools in their business strategy.

Companies use OSINT tools for other reasons, including the following:

  • To fight against data leakage, avoid the exposure of confidential information, and guard against cyberthreats that prey on the security vulnerabilities of their networks
  • To create threat intelligence strategies and develop policies related to cyber-risk management for protecting their finances, business reputation, and customer base

OSINT is especially useful for companies working in the defense industry, as such companies need to be fully aware of their customers’ circumstances in order to develop appropriate equipment and target those customers with it.

Penetration testers are often hired by companies to break into internal networks to show where weaknesses lie and how to keep outsiders out. Black hat hackers exploit these vulnerabilities to gain unauthorized access. Both hackers and penetration testers commonly use open-source information tools to learn about the targets they intend to exploit. OSINT tools are also effective for conducting social engineering attacks.

Private corporate security services employ OSINT tools as well. They conduct background checks on their own employees and top management, and on the executive officers and shareholders of their contractors. Questions asked include: “Is this an offshore company or not? Who is the real owner? Has there been any dark business? What is the source of funds?” Knowing the answers to those questions is crucial for legal compliance and for performing due diligence before the execution of any major deal.

Many organizations are now using OSINT tools to investigate insurance fraud. OSINT helps in investigating insurance fraud by providing the investigator with background on the policyholder, such as whether the insured has made any claims previously. By gathering information on all the parties involved in an insurance claim, the investigator can determine whether or not the claims are valid.

Law enforcement agencies (LEAs) use open-source intelligence to investigate, prosecute and, perhaps most significantly, predict and prevent crime and social unrest. Increasingly, LEAs use publicly available social media communications, or social media intelligence (SOCMINT), in those efforts. SOCMINT embraces a vast amount of material, including Facebook posts, tweets, videos hosted on YouTube, and comments on online public newspaper/TV news sites.

The use of SOCMINT by LEAs raises one of the main legal risks faced by users of OSINT tools, namely potential violation of privacy statutes. Much of OSINT will contain “personal data” as understood in the EU General Data Protection Regulation (GDPR) and will relate to “private life” within the meaning of the European Convention on Human Rights (ECHR). Those laws restrict the collection, dissemination, and use of personal data and require respect for an individual’s private life.

The general legal view is that SOCMINT should have no protection under these laws because someone who has chosen to voluntarily disclose their personal and private life to the public gaze has no reasonable expectation of privacy. The law is evolving, however, and privacy advocates argue that what can be gathered from SOCMINT is not just the obvious part (text, pictures, videos, and links) but also the network of friends, or “social graph,” that can be extracted from a social media profile. One U.S. legal commentator has expressed the view that social media surveillance is a covert device to evade the generally strong U.S. Fourth Amendment protections against the warrantless search of private material.

Challenges Faced by the OSINT User

All intelligence gathering methodologies have some limitations. OSINT faces the following challenges:

Data overload: Open-source information produces a massive amount of valuable data to analyze. Automated tools exist for this purpose, and many organizations, including governments, have developed AI tools for filtering and processing acquired data. Nevertheless, the sheer volume of data will continue to be a challenge.

Reliability of sources: Open-source intelligence sources require thorough verification against classified sources before they can be trusted. Governments may distribute inaccurate information that misleads the OSINT-gathering process.

Human efforts: Humans must vet the output generated by automated tools, including AI, to determine whether the open-source information is trustworthy. This process consumes time and human resources, given that the information must be compared to classified data (in cases with military and commercial information).

OSINT Techniques and Methods

It is important to identify the different classes of data because data collection (often called “data wrangling”) is the first step in the open-source intelligence process. Structured data are highly organized, such as data held in typical relational databases with an underlying data model that describes each table, field, and the relationships between them. Unstructured data have no data model defined upfront and no prerequisite organizational structure; this includes the content of web pages, books, audio, video, and other files not easily read or interpreted by machines. Analyzing unstructured data relies heavily on natural language processing as well as image processing.

Between structured and unstructured data are semi-structured data, also sometimes called “self-describing data.” This type of data is particularly representative of the type of information accessible through the web, such as the type of data available through RESTful APIs (e.g., Twitter).
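
As a minimal sketch of what “self-describing” means in practice, the Python snippet below parses a made-up JSON record of the kind a RESTful API might return and flattens it into a single table-like row. The field names (user, text, entities) are hypothetical and do not correspond to any real API.

```python
import json

# Hypothetical API response; the field names are illustrative only.
raw = """
{
  "user": {"name": "example_user", "followers": 120},
  "text": "Visiting the new plant in Rotterdam",
  "entities": {"places": ["Rotterdam"]}
}
"""

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys so the record fits one table row."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat

record = json.loads(raw)   # the keys travel with the data, so no schema is needed
row = flatten(record)      # e.g. row["user.name"], row["entities.places"]
```

The record carries its own field names, which is what separates it from free text, yet nothing forces every record to have the same fields, which is what separates it from a relational table.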

Open-source data is obtained by:

  • Manual searches (often time-consuming)
  • Web crawlers and spiders (the automation of manual searches; teaming up a web crawler with a processor that tests pages for relevance reduces the result set)
  • Web metadata (the HTML of a webpage contains tags that describe the page)
  • APIs (for example, the Bing search API provides automated access to results from a specific query)
  • Open data (not open-source data but a subset thereof; published in a machine-readable format to enhance transparency)
  • Social media
  • Traditional media
  • RSS (Really Simple Syndication, a machine-readable method, based on an XML format, of publishing information about which new articles, posts, etc., have been added to a website)
  • Grey literature (articles, reports, white papers, and other literature that falls neither into the category of normal open sources nor into consented data, but may still contain useful information for open-source intelligence investigations)
  • Paid data and consented data
  • Data on the deep and dark webs (the deep web is all content on the Internet that is not indexable by Google or other search engines; the dark web is a specific part of the deep web that can only be accessed through specific browsers such as Tor or even specific operating systems such as Tails)
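
Two of the sources listed above, web metadata and RSS, can be read with nothing more than the Python standard library. The sketch below is illustrative only: the HTML page and the feed are made-up samples, and the class and function names (MetaTagParser, parse_rss_items) are hypothetical helpers rather than part of any OSINT tool.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class MetaTagParser(HTMLParser):
    """Collect <meta name=... content=...> pairs and the page <title>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "content" in attrs:
            key = attrs.get("name") or attrs.get("property")
            if key:
                self.meta[key] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def parse_rss_items(xml_text):
    """Return (title, link) pairs for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

# Made-up sample inputs.
html = """<html><head><title>Acme Q3 results</title>
<meta name="description" content="Quarterly earnings summary">
<meta name="author" content="Acme IR"></head><body></body></html>"""

rss = """<rss version="2.0"><channel><title>Acme News</title>
<item><title>New plant opens</title><link>https://example.com/a</link></item>
<item><title>CEO interview</title><link>https://example.com/b</link></item>
</channel></rss>"""

parser = MetaTagParser()
parser.feed(html)
items = parse_rss_items(rss)
```

In a real collection pipeline the sample strings would be replaced by fetched pages and feeds, and feeds from untrusted sources would call for a hardened XML parser.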

Information Extraction and Text Mining

If the data extracted is unstructured, it must be converted into a structured form; this process is called information extraction. The most common example of this process is the parsing of natural language text and the extraction of specific entities and events or the categorization of the text. A number of libraries and APIs exist to assist in natural language processing (NLP), such as Python’s NLTK, GATE, and AlchemyAPI.

Main body extraction is the process of taking a web page’s HTML structure and extracting from it only the text that makes up the article, not the surrounding images and links that you would see if you viewed the web page in a browser. Tools used include Flipboard, Evernote’s WebClipper, Goose, AlchemyAPI, and Aylien. Entity extraction is the process of identifying the entities found in data. Entities are real objects such as people, organizations, and places mentioned in text. They can also include objects such as dates and times, telephone numbers, email addresses, URLs, products, and even credit card numbers. Entity extraction, also called named entity recognition, can be performed using linguistic, pattern-based, or statistical machine learning methods. Tools used include AlchemyAPI, Aylien, and Rosette.
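
The pattern-based approach is the easiest of the three to illustrate. The following Python sketch uses deliberately simplified regular expressions to pull email addresses, URLs, and ISO-style dates out of a made-up sentence. The patterns and the extract_entities helper are assumptions made for illustration; production extractors need far more robust patterns, and names of people or organizations generally require linguistic or statistical methods instead.

```python
import re

# Simplified patterns for illustration; real-world data needs stricter ones.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s<>]+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract_entities(text):
    """Return every match for each pattern, keyed by entity type."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

sample = ("Contact jane.doe@example.com before 2021-03-15. "
          "Details: https://example.com/report")
entities = extract_entities(sample)
```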

Data Analysis

Also important is the analysis of the context in which the data is placed. Entity relation modeling uses the idea that natural language follows a specific structure: subject, predicate, object. The subject is the entity that carries out the action, the predicate is the action itself, and the object is the person, thing, or place that the action is directed at. Entity relation modeling allows one to identify not only the entity but also the action that it is associated with. This information is far more valuable than simple entity extraction, as it immediately gives information about the context the entity appears in and provides more options for the subsequent analysis.
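
A toy illustration of the subject-predicate-object idea is sketched below in Python. It is not how real entity relation modeling works: it assumes a tiny hand-written verb list (PREDICATES) and simple declarative sentences, whereas a real system would rely on a syntactic parser. The Triple class and extract_triple function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str

# Tiny illustrative verb list; a real system would use a parser instead.
PREDICATES = ("acquired", "hired", "opened")

def extract_triple(sentence):
    """Split a simple declarative sentence at the first known predicate."""
    words = sentence.rstrip(".").split()
    for i, word in enumerate(words):
        # Require at least one word on each side of the predicate.
        if word.lower() in PREDICATES and 0 < i < len(words) - 1:
            return Triple(" ".join(words[:i]), word.lower(), " ".join(words[i + 1:]))
    return None

triple = extract_triple("Acme Corp acquired Widget Ltd.")
# triple.subject is "Acme Corp", triple.predicate is "acquired",
# and triple.obj is "Widget Ltd"
```

Storing results as triples rather than bare entity names is what lets later analysis ask who did what to whom.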

Once the data is extracted and structured, it must be validated and analyzed. It has been said that “the major difference between basic and excellent OSINT ‘operations’ lies in the analytical process.” The following are the main types of textual analysis or NLP:

  • Text processing (the “bag of words” method, concordance, collocations, and the vector space model)
  • Word sense disambiguation (the problem of identifying the true meaning of a word when it has multiple definitions, often resolved by machine learning techniques)
  • Sentiment analysis (analyzes language to determine the underlying emotion; London’s Metropolitan Police began using it after the 2011 London Riots)
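
To make the first item in the list above concrete, here is a minimal Python sketch of the “bag of words” method combined with the vector space model. Each document becomes a vector of word counts, and cosine similarity compares two such vectors. The three sample sentences are made up, and the tokenizer is deliberately naive (lowercase letters and apostrophes only).

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring case, order, and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = bag_of_words("Silver price rises as silver demand grows")
doc2 = bag_of_words("Silver demand grows in Asia")
doc3 = bag_of_words("Police monitor social media")

sim_related = cosine_similarity(doc1, doc2)    # shares "silver", "demand", "grows"
sim_unrelated = cosine_similarity(doc1, doc3)  # no shared words, so 0.0
```

Raw counts give common words too much weight; practical systems usually remove stop words or apply a weighting scheme such as TF-IDF before comparing documents.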

Aggregation and Other Analytical Concepts

Documents can also be analyzed among themselves in a process known as aggregation. Aggregation techniques include document clustering, which uses the mathematical tools of dimension reduction, singular value decomposition, and multi-dimensional scaling. The relatedness of the data is the focus of network analysis and social network analysis. A number of statistics and measures associated with network analysis help explain how the positions of different entities in a network affect how it works. The simplest is degree centrality, which counts each entity’s connections to other entities; the entity with the highest count is, by this measure, the most highly connected in the network.
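
Degree centrality is simple enough to compute by hand. The Python sketch below does so for a small, made-up undirected network, dividing each entity’s number of connections by the maximum possible (n - 1) so that scores fall between 0 and 1. The names in the edge list are invented for illustration.

```python
from collections import defaultdict

# Made-up undirected network: each pair is one connection.
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("bob", "carol"),
]

def degree_centrality(edge_list):
    """Map each node to (number of distinct neighbours) / (n - 1)."""
    neighbours = defaultdict(set)
    for a, b in edge_list:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

centrality = degree_centrality(edges)
most_connected = max(centrality, key=centrality.get)  # "alice" in this network
```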

Other analytical concepts include:

  • Co-occurrence networks
  • Location resolution
  • Geocoding and reverse geocoding
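
Of the concepts listed above, a co-occurrence network is the most direct to sketch in code: two entities are linked whenever they appear in the same document, and the link is weighted by how often that happens. The Python example below assumes entity extraction has already been run and uses made-up entity sets; cooccurrence_counts is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# Made-up output of entity extraction: one set of entities per document.
documents = [
    {"Acme Corp", "Jane Doe", "Rotterdam"},
    {"Acme Corp", "Jane Doe"},
    {"Widget Ltd", "Rotterdam"},
]

def cooccurrence_counts(docs):
    """Count how many documents mention each unordered pair of entities."""
    counts = Counter()
    for entities in docs:
        # Sorting makes ("A", "B") and ("B", "A") the same key.
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts

cooccurrence = cooccurrence_counts(documents)
strongest_pair, weight = cooccurrence.most_common(1)[0]
```

The resulting weighted pairs can be fed straight into network measures such as degree centrality.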

Finally, a complete analysis of OSINT requires the validation of open-source information. NATO’s open-source information handbook suggests that one should assess:

  • The authority of the source
  • The accuracy of the source (by validating it against other sources)
  • The objectivity of the source (which is where sentiment analysis may be able to assist)
  • The currency (the provision of a timestamp for publication and the presence of an author)
  • The coverage (the degree of relevancy)

Common OSINT Tools

Here is a list of some of the most popular OSINT tools, with a brief description of each:

  • Maltego is an open-source intelligence application used to discover, classify, and link together information from different sources, including social media. The program is designed to help security professionals quickly correlate data from disparate sources and provides a graph-based visualization to show networks of relationships and associations between people, organizations, and other subjects. Maltego can also be used together with the Social Links Pro product, which adds more than 1,000 research methods and resources

  • SpiderFoot is an OSINT platform for security assessments. The tool tracks down information related to IP addresses, domain names, and email addresses. That platform is designed to allow for quick and easy collection of information on a target organization, host, or individual

  • Gamayun is an OSINT tool that makes it easy to conduct internet investigations through data collection from public websites. The tool allows for the discovery of individuals based on specific parameters, such as emails or photos, and finds relevant social network profiles

  • Spyse search engine scans the Internet for technical information and is common among hackers in cyber reconnaissance. The user can retrieve information relevant to domains, such as IPs and DNS records. The database contains information on over 1.2 billion domains

  • theHarvester is a simple reconnaissance tool designed for social engineers and security professionals, and it serves as an effective first step before penetration testing. The tool retrieves information from search engines, including emails, names, sub-domains, and open ports

  • Creepy is an OSINT tool for information on geolocation. The information is collected from various social networking platforms. The tool presents the search results on a map. The user can enter a Twitter or Flickr username, and the tool analyzes posts to determine the location and the time

  • Recon-ng is a web reconnaissance framework that allows users to search through a variety of social media sites to collect various forms of publicly available data (e.g., usernames, domains, phone numbers, email addresses, and IP addresses). The framework can be used for social engineering, competitive intelligence, and general information gathering

  • Searchcode is a search engine that scans API documentation, code snippets, and open source repositories. The database contains over 20 billion lines of indexed open-source code. Users can search for terms included in lines of code, such as usernames, security flaws, and special characters used for launching code injection attacks

  • Shodan is a search engine for devices and other assets that are connected to the Internet. Shodan, which is also available on SL Pro, can be used by security professionals to find companies that are collecting and using large amounts of sensitive data, or to find insecure home devices

  • Metagoofil is a tool that has been used to collect metadata from hundreds of websites related to defense, technology, information warfare, terrorism, and others. It can find and download metadata from sites that have it available, and also retrieve the full HTML of sites that have been truncated

Notable Investigations Made with OSINT Tools

The United States Department of Justice (DOJ) regularly uses OSINT methods to identify criminals. The DOJ used OSINT techniques in the 2015 “Operation Pacifier” campaign to obtain information on a child pornography website and identify the owners.

In 2015, a dark web investigation was conducted into an extremist online community called Iron March. The website hosted discussions on topics such as terrorism, political instability, and torture. OSINT methods were used to determine the average profile of an Iron March user and how effective the platform was at connecting extremists.

Another investigation that used OSINT tools took place in 2016, when the UK National Crime Agency (NCA) took down the Avalanche network, which was responsible for infecting over 4 million computers, most of them running Microsoft Windows operating systems, and for generating roughly $5 million per month in fraudulent advertising revenue.

The Future of OSINT

The global market for open-source intelligence tools is expected to grow at a compound annual growth rate of 17%, reaching nearly $12 billion U.S. by 2026. The rise in open-source information coupled with increasing security threats seems likely to drive demand for OSINT. Ever more sophisticated and powerful AI should allow OSINT to become ever more insightful and valuable. Nevertheless, open-source intelligence tools, and particularly SOCMINT, face significant challenges, not the least of which is the growing pressure to protect privacy rights and the increase in bad and manipulated data strewn by bad actors throughout the Internet.