Commit 0847e37

Address Google cache no longer available (#1255)
* Update cached content section with Google cache alternatives (since Google cache is no longer available) --------- Co-authored-by: Rick M <[email protected]>
1 parent 4a4a3f6 commit 0847e37

1 file changed: +27 -2 lines changed

document/4-Web_Application_Security_Testing/01-Information_Gathering/01-Conduct_Search_Engine_Discovery_Reconnaissance_for_Information_Leakage.md

Lines changed: 27 additions & 2 deletions
@@ -8,9 +8,9 @@
 
 In order for search engines to work, computer programs (or `robots`) regularly fetch data (referred to as [crawling](https://en.wikipedia.org/wiki/Web_crawler)) from billions of pages on the web. These programs find web content and functionality by following links from other pages, or by looking at sitemaps. If a site uses a special file called `robots.txt` to list pages that it does not want search engines to fetch, then the pages listed there will be ignored. This is a basic overview - Google offers a more in-depth explanation of [how a search engine works](https://support.google.com/webmasters/answer/70897?hl=en).
 
-Testers can use search engines to perform reconnaissance on sites and web applications. There are direct and indirect elements to search engine discovery and reconnaissance: direct methods relate to searching the indexes and the associated content from caches, while indirect methods relate to learning sensitive design and configuration information by searching forums, newsgroups, and tendering sites.
+Testers can use search engines to perform reconnaissance on sites and web applications. There are direct and indirect elements to search engine discovery and reconnaissance: direct methods relate to searching the indices and the associated content from caches, while indirect methods relate to learning sensitive design and configuration information by searching forums, newsgroups, and tendering sites.
 
-Once a search engine robot has completed crawling, it commences indexing the web content based on tags and associated attributes, such as `<TITLE>`, in order to return relevant search results. If the `robots.txt` file is not updated during the lifetime of the site, and in-line HTML meta tags that instruct robots not to index content have not been used, then it is possible for indexes to contain web content not intended to be included by the owners. Site owners may use the previously mentioned `robots.txt`, HTML meta tags, authentication, and tools provided by search engines to remove such content.
+Once a search engine robot has completed crawling, it commences indexing the web content based on tags and associated attributes, such as `<TITLE>`, in order to return relevant search results. If the `robots.txt` file is not updated during the lifetime of the site, and in-line HTML meta tags that instruct robots not to index content have not been used, then it is possible for indices to contain web content not intended to be included by the owners. Site owners may use the previously mentioned `robots.txt`, HTML meta tags, authentication, and tools provided by search engines to remove such content.
 
 ## Test Objectives
 
@@ -73,6 +73,31 @@ cache:owasp.org
 ![Google Cache Operation Search Result Example](images/Google_cache_Operator_Search_Results_Example_20200406.png)\
 *Figure 4.1.1-2: Google Cache Operation Search Result Example*
 
+#### Internet Archive Wayback Machine
+
+The [Internet Archive Wayback Machine](https://archive.org/web/) is the most comprehensive tool for viewing historical snapshots of web pages. It maintains an extensive archive of web pages dating back to 1996.
+
+To view archived versions of a site, visit `https://web.archive.org/web/*/`
+followed by the target URL:
+
+```text
+https://web.archive.org/web/*/owasp.org
+```
+
+This will display a calendar view showing all available snapshots of the site over time.
+
+#### Bing Cache
+
+Bing still provides cached versions of web pages. To view cached content, use the `cache:` operator:
+Alternatively, click the arrow next to search results in Bing and select "Cached" from the dropdown menu.
+
+#### Other Cached Content Services
+
+Additional services for viewing cached or archived web pages include:
+
+- [archive.ph](https://archive.ph) (also known as archive.md) - On-demand archiving service that creates permanent snapshots
+- [CachedView](https://cachedview.com/) - Aggregates cached pages from multiple sources including Google Cache historical data, Wayback Machine, and others
+
 ### Google Hacking or Dorking
 
 Searching with operators can be a very effective discovery technique when combined with the creativity of the tester. Operators can be chained to effectively discover specific kinds of sensitive files and information. This technique, called [Google hacking](https://en.wikipedia.org/wiki/Google_hacking) or Dorking, is also possible using other search engines, as long as the search operators are supported.
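
The operator chaining described in the last context line can be sketched roughly as follows. This is only an illustration, not part of the commit: the helper names `build_dork_query` and `search_url` are hypothetical, and `site:` and `filetype:` are used as example operators that do not appear in the changed lines. The base search URL is a parameter because operator support differs between search engines.

```python
from urllib.parse import urlencode


def build_dork_query(operators, keywords=""):
    """Chain (operator, value) pairs into one search query string.

    `operators` is a list of tuples such as [("site", "owasp.org")].
    Which operators are honoured depends on the search engine.
    """
    parts = [f"{name}:{value}" for name, value in operators]
    if keywords:
        parts.append(keywords)
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Build a search URL with the query percent-encoded in `q`."""
    return f"{base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    # Example operators only; adjust to the engine being tested.
    query = build_dork_query(
        [("site", "owasp.org"), ("filetype", "pdf")],
        keywords="testing guide",
    )
    print(query)
    print(search_url(query))
```

Keeping the operators as an ordered list of pairs allows the same operator to be repeated and makes it easy to swap in whatever operators a given engine supports.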

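
The Wayback Machine lookup added in this commit (the `https://web.archive.org/web/*/` prefix followed by the target URL) could be scripted along these lines. This is a minimal sketch under stated assumptions: the function name `wayback_calendar_url` is hypothetical, and the choice of which characters to leave unescaped is an assumption made for illustration rather than documented Wayback Machine behaviour.

```python
from urllib.parse import quote

# Prefix quoted in the added documentation text.
WAYBACK_CALENDAR_PREFIX = "https://web.archive.org/web/*/"


def wayback_calendar_url(target):
    """Return the Wayback Machine calendar-view URL for a target URL or host."""
    target = target.strip()
    if not target:
        raise ValueError("target must not be empty")
    # Assumption: keep common URL structure characters as-is and
    # percent-encode anything else (for example, spaces).
    return WAYBACK_CALENDAR_PREFIX + quote(target, safe=":/?&=#%")


if __name__ == "__main__":
    print(wayback_calendar_url("owasp.org"))
```

For the `owasp.org` example this produces the same URL shown in the added `text` block of the diff.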