In order for search engines to work, computer programs (or `robots`) regularly fetch data (referred to as [crawling](https://en.wikipedia.org/wiki/Web_crawler)) from billions of pages on the web. These programs find web content and functionality by following links from other pages, or by looking at sitemaps. If a site uses a special file called `robots.txt` to list pages that it does not want search engines to fetch, then compliant robots will ignore the pages listed there. This is a basic overview; Google offers a more in-depth explanation of [how a search engine works](https://support.google.com/webmasters/answer/70897?hl=en).

Testers can use search engines to perform reconnaissance on sites and web applications. There are direct and indirect elements to search engine discovery and reconnaissance: direct methods relate to searching the indices and the associated content from caches, while indirect methods relate to learning sensitive design and configuration information by searching forums, newsgroups, and tendering sites.

Once a search engine robot has completed crawling, it commences indexing the web content based on tags and associated attributes, such as `<TITLE>`, in order to return relevant search results. If the `robots.txt` file is not updated during the lifetime of the site, and in-line HTML meta tags that instruct robots not to index content have not been used, then it is possible for indices to contain web content not intended to be included by the owners. Site owners may use the previously mentioned `robots.txt`, HTML meta tags, authentication, and tools provided by search engines to remove such content.
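As a minimal illustration of the crawler controls mentioned above (the path shown is a placeholder, not a recommendation), a `robots.txt` entry that asks crawlers to skip a directory looks like:

```text
# robots.txt - ask compliant crawlers not to fetch a directory
User-agent: *
Disallow: /internal/
```

The equivalent per-page control is an in-line meta tag such as `<meta name="robots" content="noindex">` in the page's `<head>`. Note that both mechanisms are advisory: they do not stop a robot that chooses to ignore them, and a `robots.txt` listing can itself reveal paths of interest to a tester.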
## Test Objectives
```text
cache:owasp.org
```
\
*Figure 4.1.1-2: Google Cache Operation Search Result Example*

#### Internet Archive Wayback Machine

The [Internet Archive Wayback Machine](https://archive.org/web/) is the most comprehensive tool for viewing historical snapshots of web pages, with an archive dating back to 1996.

To view archived versions of a site, visit `https://web.archive.org/web/*/` followed by the target URL:

```text
https://web.archive.org/web/*/owasp.org
```

This will display a calendar view showing all available snapshots of the site over time.
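These URL patterns can also be constructed programmatically when scripting reconnaissance. The sketch below is a minimal illustration (the helper names are our own, and `owasp.org` is only a placeholder target); it builds the calendar-view URL shown above, and the `https://web.archive.org/web/<timestamp>/<url>` form that addresses a single capture:

```python
# Minimal sketch: construct Wayback Machine lookup URLs for a target site.
# Helper names are illustrative, not part of any official API.

def calendar_url(target: str) -> str:
    """Calendar view listing every archived snapshot of `target`."""
    return f"https://web.archive.org/web/*/{target}"

def snapshot_url(target: str, timestamp: str) -> str:
    """A single capture; `timestamp` uses the YYYYMMDDhhmmss form."""
    return f"https://web.archive.org/web/{timestamp}/{target}"

if __name__ == "__main__":
    print(calendar_url("owasp.org"))
    print(snapshot_url("owasp.org", "20210101000000"))
```

Opening either URL in a browser then shows the archived content directly.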
#### Bing Cache

Bing still provides cached versions of web pages. To view cached content, use the `cache:` operator, or click the arrow next to a search result in Bing and select "Cached" from the dropdown menu.
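The operator takes the same form as in Google; for example, using `owasp.org` as the illustrative target:

```text
cache:owasp.org
```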
#### Other Cached Content Services

Additional services for viewing cached or archived web pages include:

- [archive.ph](https://archive.ph) (also known as archive.md) - On-demand archiving service that creates permanent snapshots
- [CachedView](https://cachedview.com/) - Aggregates cached pages from multiple sources, including Google Cache historical data, the Wayback Machine, and others
### Google Hacking or Dorking

Searching with operators can be a very effective discovery technique when combined with the creativity of the tester. Operators can be chained to effectively discover specific kinds of sensitive files and information. This technique, called [Google hacking](https://en.wikipedia.org/wiki/Google_hacking) or Dorking, is also possible using other search engines, as long as the search operators are supported.
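For example, chained operators can narrow results to potentially sensitive material; the queries below are illustrative sketches only, with `owasp.org` as a placeholder target:

```text
site:owasp.org filetype:pdf
site:owasp.org inurl:admin
intitle:"index of" site:owasp.org
```

The [Google Hacking Database](https://www.exploit-db.com/google-hacking-database) collects many such ready-made queries.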