N.Y.A.W.C does not provide a CLI entry point, so you need to create one yourself. Save the code below as example.py; it prints all request URLs found by the crawler.
# example.py

from nyawc.Options import Options
from nyawc.Crawler import Crawler
from nyawc.QueueItem import QueueItem
from nyawc.CrawlerActions import CrawlerActions
from nyawc.http.Request import Request

def cb_crawler_before_start():
    print("Crawler started.")

def cb_crawler_after_finish(queue):
    print("Crawler finished.")
    print("Found " + str(len(queue.get_all(QueueItem.STATUS_FINISHED))) + " requests.")

def cb_request_before_start(queue, queue_item):
    print("Starting: {}".format(queue_item.request.url))
    return CrawlerActions.DO_CONTINUE_CRAWLING

def cb_request_after_finish(queue, queue_item, new_queue_items):
    print("Finished: {}".format(queue_item.request.url))
    return CrawlerActions.DO_CONTINUE_CRAWLING

options = Options()

options.callbacks.crawler_before_start = cb_crawler_before_start # Called before the crawler starts crawling. Default is a null route.
options.callbacks.crawler_after_finish = cb_crawler_after_finish # Called after the crawler finishes crawling. Default is a null route.
options.callbacks.request_before_start = cb_request_before_start # Called before the crawler starts a new request. Default is a null route.
options.callbacks.request_after_finish = cb_request_after_finish # Called after the crawler finishes a request. Default is a null route.

crawler = Crawler(options)
crawler.start_with(Request("https://finnwea.com/"))
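The return value of the request callbacks steers the crawl: besides CrawlerActions.DO_CONTINUE_CRAWLING, the library also defines CrawlerActions.DO_SKIP_TO_NEXT and CrawlerActions.DO_STOP_CRAWLING. As a minimal sketch (the limit of 100 finished requests is an arbitrary assumption, not part of the example above), a request_after_finish callback could stop the crawl early:

def cb_request_after_finish(queue, queue_item, new_queue_items):
    # Count the requests that have already finished (same call as in the example above).
    finished = len(queue.get_all(QueueItem.STATUS_FINISHED))
    if finished >= 100:
        # Stop the entire crawl once the (arbitrary) limit is reached.
        return CrawlerActions.DO_STOP_CRAWLING
    return CrawlerActions.DO_CONTINUE_CRAWLING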
To write all output to a file (the -u flag disables Python's output buffering, so the log is updated in real time), run:
$ python -u example.py > output.log
Append & to the command to run the process in the background.
You can set scope options to, for example, restrict the crawl to certain subdomains or to certain request methods. See this page for all the available scope options.
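As an illustration, here is a sketch of a few scope settings. The attribute names (subdomain_must_match, request_methods, max_depth) and the Request method constants are taken from the library's documentation as I recall it; treat them as assumptions and verify them against the scope options page.

options = Options()

# Only follow URLs whose subdomain matches that of the start request (assumed attribute name).
options.scope.subdomain_must_match = True

# Only issue GET and POST requests (assumed attribute name and constants).
options.scope.request_methods = [Request.METHOD_GET, Request.METHOD_POST]

# Limit how many links deep the crawler may go from the start request (assumed attribute name).
options.scope.max_depth = 3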
The kitchen sink is an example that uses every feature and option of N.Y.A.W.C, and it is available for copy-paste. Check it out!