5 Ways to Crawl a Website

From Wikipedia

A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.

A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit. If the crawler is performing archiving of websites, it copies and saves the information as it goes. The archive is known as the repository and is designed to store and manage the collection of web pages. A repository is similar to any other system that stores data, like a modern-day database.
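As a rough command-line illustration of the seed-and-follow behaviour described above, a recursive wget run starts from a single seed URL, follows the hyperlinks it discovers, and saves every fetched page into a local directory that acts as the repository (the URL, depth and output folder below are placeholders, not part of the tools covered later):

wget -r -l 2 -P /root/Desktop/repository http://www.example.com/

Here -r enables recursive retrieval, -l 2 limits the crawl to two link levels from the seed, and -P sets the directory where the copied pages are stored.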

Let’s Begin!!

Metasploit

This auxiliary module is a modular web crawler, to be used in conjunction with wmap (someday) or standalone.

use auxiliary/crawler/msfcrawler

msf auxiliary(msfcrawler) > set rhosts www.example.com

msf auxiliary(msfcrawler) > exploit

From the screenshot you can see that the crawler has been loaded in order to extract hidden files from the website, for example about.php, the jQuery contact form, HTML files and so on, which are not easy to discover manually by browsing the website. We can use it for information gathering on any website.
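If you want to review what the module will do before launching it, you can list its settings with show options and adjust them in the same way as rhosts; the commands below are only a sketch, where rport 80 is an assumed default HTTP port that you should change if your target listens elsewhere:

msf auxiliary(msfcrawler) > show options

msf auxiliary(msfcrawler) > set rport 80

msf auxiliary(msfcrawler) > run

For auxiliary modules run behaves the same as exploit, so either command starts the crawl.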

HTTRACK

HTTrack is a free and open-source Web crawler and offline browser, developed by Xavier Roche.

It allows you to download a World Wide Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server onto your computer. HTTrack arranges the original site's relative link-structure.

Type the following command in the terminal:

httrack http://tptl.in -O /root/Desktop/file

It will save the output inside the given directory, /root/Desktop/file.

From the given screenshot you can observe that it has dumped the website's information into that directory, which consists of HTML files as well as JavaScript and jQuery.
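If you only want a shallow copy instead of the whole site, HTTrack also accepts a mirror-depth switch; the command below is just a sketch in which -r2 is assumed to limit the mirror to two link levels while still grabbing the top-level HTML, JavaScript and jQuery files:

httrack http://tptl.in -O /root/Desktop/file -r2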

Black Widow

This Web spider utility detects and displays detailed information for a user-selected Web page, and it offers other Web page tools.

BlackWidow’s clean, logically tabbed interface is simple enough for intermediate users to follow but offers just enough under the hood to satisfy advanced users. Simply enter your URL of choice and press Go. BlackWidow uses multithreading to quickly download all files and test the links. The operation takes only a few minutes for small Web sites.

You can download it from here.

Enter your URL http://tptl.in in the Address field and press Go.

Click on the Start button on the left side to begin URL scanning, and select a folder to save the output file.

From the screenshot you can observe that I browsed to C:\Users\RAJ\Desktop\tptl in order to store the output file inside it.

When you open the target folder tptl you will find the entire data of the website: images, content, HTML files, PHP files and JavaScript are all saved in it.

Website Ripper Copier

Website Ripper Copier (WRC) is an all-purpose, high-speed website downloader software to save website data. WRC can download website files to a local drive for offline browsing, extract website files of a certain size and type, such as images, video, pictures, movies and music, retrieve a large number of files as a download manager with resumption support, and mirror sites. WRC is also a site link validator, explorer, and tabbed anti-pop-up Web / offline browser.

Website Ripper Copier is the only website downloader tool that can resume broken downloads from HTTP, HTTPS and FTP connections, access password-protected sites, support Web cookies, analyze scripts, update retrieved sites or files, and launch more than fifty retrieval threads.

You can download it from here.

Choose the “web sites for offline browsing” option.

Enter the website URL as http://tptl.in and click on Next.

Specify the directory path to save the output result and click Run Now.

When you open the selected folder tp you will find the fetched CSS, PHP, HTML and JS files inside it.

Burp Suite Spider

Burp Spider is a tool for automatically crawling web applications. While it is generally preferable to map applications manually, you can use Burp Spider to partially automate this process for very large applications, or when you are short of time.

For more detail read our previous articles from here.

From the given screenshot you can observe that I have fetched the HTTP request of http://tptl.in; now send it to the Spider with the help of the Action tab.

The targeted website has been added to the site map under the Target tab as a new scope for web crawling. From the screenshot you can see it has started web crawling of the target website, where it has collected the website's information in the form of PHP, HTML and JS files.

Author: Aarti Singh is a Researcher and Technical Writer at Hacking Articles, an Information Security Consultant, Social Media Lover and Gadgets enthusiast. Contact here

How to Spider Web Applications using Burpsuite

Hello friends! Today we are doing web penetration testing using Burp Suite Spider, which very rapidly crawls an entire web application and dumps the information of the targeted website.

Burp Spider is a tool for automatically crawling web applications. While it is generally preferable to map applications manually, you can use Burp Spider to partially automate this process for very large applications, or when you are short of time.

Source: https://portswigger.net/burp/help/spider.html

 Let’s begin!!

First the attacker needs to configure the browser and the Burp proxy to work properly; www.tetphp.vulnweb.com will be my targeted website for enumeration.

From the screenshot given below you can see that currently there is no targeted website inside the site map of Burp Suite. To add your targeted website to it, you need to fetch the HTTP request sent by the browser to the web application server, using the Intercept option of the Proxy tab.

Click on the Proxy tab and turn on Intercept in order to catch the HTTP request.
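If you prefer the command line to a browser, you can also route a single request through Burp's proxy listener, which listens on 127.0.0.1:8080 by default (adjust the address if you have changed the listener); the request will then appear in the Intercept tab just like one sent from the browser:

curl -x http://127.0.0.1:8080 http://www.tetphp.vulnweb.com/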

Here you can observe that I have fetched the HTTP request of www.tetphp.vulnweb.com; now send it to the Spider with the help of the Action tab.

Confirm your action by clicking on YES; Burp will alter the existing target scope to include the preferred item, and all sub-items contained in the site map tree.

Now choose the Spider tab for the next step; here you will find two sub-categories, the Control tab and the Options tab.

Burp Spider – Control Tab

This tab is used to start and stop Burp Spider, monitor its progress, and define the spidering scope.

 Spider Status

Use these settings to monitor and control Burp Spider:

  • Spider is paused / running – This toggle button is used to start and stop the Spider. While the Spider is stopped it will not make any requests of its own, although it will continue to process responses generated via Burp Proxy (if passive spidering is enabled), and any newly-discovered items that are within the spidering scope will be queued to be requested if the Spider is restarted.
  • Clear queues – If you want to reprioritize your work, you can completely clear the currently queued items, so that other items can be added to the queue. Note that the cleared items may be re-queued if they remain in-scope and the Spider's parser encounters new links to the items.

 Spider Scope

This panel lets you define exactly what is in the scope for the Spider to request.

The best way to handle spidering scope is normally using the suite-wide target scope, and by default the Spider will use that scope.

Burp Spider Options

This tab contains options for the basic crawler settings, passive spidering, form submission, application login, the Spider engine, and HTTP request headers.

You can monitor the status of the Spider when running, via the Control tab. Any newly discovered content will be added to the Target site map.

When spidering a selected branch of the site map, Burp will carry out the following actions (depending on your settings):

  • Request any unrequested URLs already present within the branch.
  • Submit any discovered forms whose action URLs lie within the branch.
  • Re-request any items in the branch that previously returned 304 status codes, to retrieve fresh (uncached) copies of the application’s responses.
  • Parse all content retrieved to identify new URLs and forms.
  • Recursively repeat these steps as new content is discovered.
  • Continue spidering all in-scope areas until no new content is discovered.

Hence you can see the targeted website has been added to the site map as a new scope for web crawling. Choose the “Spider this host” option by right-clicking on the selected URL, which automatically starts web crawling.

When you click on the preferred target in the site map, further content which has been discovered by the Spider will be added inside it, as shown in the image given below.

From the screenshot you can see it dumps all items of the website, even showing the request and response for the host.

Author: Aarti Singh is a Researcher and Technical Writer at Hacking Articles, an Information Security Consultant, Social Media Lover and Gadgets enthusiast. Contact here
