N.Y.A.W.C does not have a CLI entry point, so you need to create one yourself. Save the code below as `example.py`. The example prints all request URLs found by the crawler.
```python
# example.py
from nyawc.Options import Options
from nyawc.Crawler import Crawler
from nyawc.QueueItem import QueueItem
from nyawc.CrawlerActions import CrawlerActions
from nyawc.http.Request import Request

def cb_crawler_before_start():
    print("Crawler started.")

def cb_crawler_after_finish(queue):
    print("Crawler finished.")
    print("Found " + str(len(queue.get_all(QueueItem.STATUS_FINISHED))) + " requests.")

def cb_request_before_start(queue, queue_item):
    print("Starting: {}".format(queue_item.request.url))
    return CrawlerActions.DO_CONTINUE_CRAWLING

def cb_request_after_finish(queue, queue_item, new_queue_items):
    print("Finished: {}".format(queue_item.request.url))
    return CrawlerActions.DO_CONTINUE_CRAWLING

options = Options()

options.callbacks.crawler_before_start = cb_crawler_before_start  # Called before the crawler starts crawling. Default is a null route.
options.callbacks.crawler_after_finish = cb_crawler_after_finish  # Called after the crawler finished crawling. Default is a null route.
options.callbacks.request_before_start = cb_request_before_start  # Called before the crawler starts a new request. Default is a null route.
options.callbacks.request_after_finish = cb_request_after_finish  # Called after the crawler finishes a request. Default is a null route.

crawler = Crawler(options)
crawler.start_with(Request("https://finnwea.com/"))
```
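The request callbacks can also end the crawl early by returning `CrawlerActions.DO_STOP_CRAWLING` instead of `DO_CONTINUE_CRAWLING`. As a minimal sketch, the drop-in replacement for `cb_request_after_finish` below caps the crawl at 100 finished requests (the cap is an arbitrary example value, not something the library requires):

```python
# Sketch: stop crawling after an arbitrary number of finished requests.
# Drop-in replacement for cb_request_after_finish in example.py above;
# it reuses QueueItem and CrawlerActions from the imports already shown.
def cb_request_after_finish(queue, queue_item, new_queue_items):
    print("Finished: {}".format(queue_item.request.url))
    if len(queue.get_all(QueueItem.STATUS_FINISHED)) >= 100:
        return CrawlerActions.DO_STOP_CRAWLING  # End the crawl gracefully.
    return CrawlerActions.DO_CONTINUE_CRAWLING
```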
To write all output to a file and run the process in the background:

```
$ python -u example.py > output.log &
```
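While the crawler runs in the background, you can follow its progress with a standard `tail`:

```
$ tail -f output.log
```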
The kitchen sink is an example that implements all the features and options of N.Y.A.W.C, and it is available for copy-pasting. Check it out!
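As a small taste of what the kitchen sink covers, the sketch below tunes a couple of option groups. The attribute names (`scope.max_depth`, `performance.max_threads`) are assumptions based on the library's documented `Options` groups; verify them against the kitchen sink for your version before relying on them.

```python
# Sketch: tuning a few options (attribute names assumed from the
# N.Y.A.W.C docs; confirm against the kitchen sink for your version).
from nyawc.Options import Options
from nyawc.Crawler import Crawler
from nyawc.http.Request import Request

options = Options()
options.scope.max_depth = 3          # Assumed option: limit the crawl depth.
options.performance.max_threads = 8  # Assumed option: number of crawler threads.

crawler = Crawler(options)
crawler.start_with(Request("https://finnwea.com/"))
```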