Callbacks¶
How to use callbacks¶
# callbacks_example.py

from nyawc.Options import Options
from nyawc.Crawler import Crawler
from nyawc.CrawlerActions import CrawlerActions
from nyawc.http.Request import Request

def cb_crawler_before_start():
    print("Started crawling")

def cb_crawler_after_finish(queue):
    print("Finished crawling")

def cb_request_before_start(queue, queue_item):
    print("Making request: {}".format(queue_item.request.url))

    return CrawlerActions.DO_CONTINUE_CRAWLING

def cb_request_after_finish(queue, queue_item, new_queue_items):
    print("Finished request: {}".format(queue_item.request.url))

    return CrawlerActions.DO_CONTINUE_CRAWLING

def cb_form_before_autofill(queue_item, elements, form_data):
    return CrawlerActions.DO_AUTOFILL_FORM

def cb_form_after_autofill(queue_item, elements, form_data):
    pass

options = Options()

options.callbacks.crawler_before_start = cb_crawler_before_start
options.callbacks.crawler_after_finish = cb_crawler_after_finish
options.callbacks.request_before_start = cb_request_before_start
options.callbacks.request_after_finish = cb_request_after_finish
options.callbacks.form_before_autofill = cb_form_before_autofill
options.callbacks.form_after_autofill = cb_form_after_autofill

crawler = Crawler(options)
crawler.start_with(Request("https://finnwea.com/"))
Available callbacks¶
Before crawler start¶
Can be used to run some code before the crawler starts crawling. It does not receive any arguments.
...

def cb_crawler_before_start():
    print("Started crawling")

options.callbacks.crawler_before_start = cb_crawler_before_start

...
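The before-start callback is also a convenient place to set up state that later callbacks can use. The sketch below is illustrative and not part of the nyawc docs; started_at is a hypothetical module-level variable that an after-finish callback could read to report the total crawl duration.

...

import time

started_at = None

def cb_crawler_before_start():
    # Remember when the crawl began (illustrative module-level state).
    global started_at
    started_at = time.time()

    print("Started crawling")

options.callbacks.crawler_before_start = cb_crawler_before_start

...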
After crawler finish¶
Can be used to run some code after the crawler has finished crawling. It receives one argument, nyawc.Queue. queue.get_all()[0].request contains a nyawc.http.Request, and queue.get_all()[0].response contains a nyawc.http.Response.
...

def cb_crawler_after_finish(queue):
    # Print the number of request/response pairs that were found.
    print("Crawler finished, found " + str(queue.get_count()) + " requests.")

    # Iterate over all request/response pairs that were found.
    for queue_item in queue.get_all():
        print("Request method {}".format(queue_item.request.method))
        print("Request URL {}".format(queue_item.request.url))
        print("Request POST data {}".format(queue_item.request.data))
        # print("Response body {}".format(queue_item.response.text))

options.callbacks.crawler_after_finish = cb_crawler_after_finish

...
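If you want more than console output, the finished queue can also be exported. The following sketch is illustrative and not part of the nyawc docs; it writes the method, URL and POST data of every pair to a JSON file, and passes default=str to json.dump in case the POST data is not directly JSON-serializable.

...

import json

def cb_crawler_after_finish(queue):
    results = []

    # Collect the attributes shown in the example above.
    for queue_item in queue.get_all():
        results.append({
            "method": queue_item.request.method,
            "url": queue_item.request.url,
            "data": queue_item.request.data
        })

    # `default=str` guards against values that json cannot serialize.
    with open("crawl_results.json", "w") as handle:
        json.dump(results, handle, indent=4, default=str)

options.callbacks.crawler_after_finish = cb_crawler_after_finish

...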
Before request start¶
Can be used to run some code before a request starts executing. It receives two arguments: nyawc.Queue, which contains all the items currently in the queue (including finished items), and nyawc.QueueItem, which is the item (request/response pair) in the queue that will now be executed.

- By returning CrawlerActions.DO_SKIP_TO_NEXT, this queue_item (request/response pair) will be skipped.
- By returning CrawlerActions.DO_STOP_CRAWLING, the crawler will stop crawling entirely.
- By returning CrawlerActions.DO_CONTINUE_CRAWLING, the crawler will continue as normal.
...

def cb_request_before_start(queue, queue_item):
    # return CrawlerActions.DO_SKIP_TO_NEXT
    # return CrawlerActions.DO_STOP_CRAWLING

    return CrawlerActions.DO_CONTINUE_CRAWLING

options.callbacks.request_before_start = cb_request_before_start

...
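A common use of DO_SKIP_TO_NEXT is scope control. The sketch below is illustrative and not from the official docs: it skips every request that points at another host. The hostname is a placeholder, and Python 3's urllib.parse is assumed.

...

from urllib.parse import urlparse

def cb_request_before_start(queue, queue_item):
    # Skip any request that leaves the target host (placeholder hostname).
    if urlparse(queue_item.request.url).netloc != "finnwea.com":
        return CrawlerActions.DO_SKIP_TO_NEXT

    return CrawlerActions.DO_CONTINUE_CRAWLING

options.callbacks.request_before_start = cb_request_before_start

...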
After request finish¶
Can be used to run some code after a request has finished executing. It receives three arguments: nyawc.Queue, which contains all the items currently in the queue (including finished items); nyawc.QueueItem, which is the item (request/response pair) in the queue that was just executed; and new_queue_items (a list of nyawc.QueueItem), which contains the request/response pairs that were found during this request.

- By returning CrawlerActions.DO_STOP_CRAWLING, the crawler will stop crawling entirely.
- By returning CrawlerActions.DO_CONTINUE_CRAWLING, the crawler will continue as normal.
...

def cb_request_after_finish(queue, queue_item, new_queue_items):
    percentage = str(int(queue.get_progress()))
    total_requests = str(queue.get_count())

    print("At " + percentage + "% of " + total_requests + " requests ([" + str(queue_item.response.status_code) + "] " + queue_item.request.url + ").")

    # return CrawlerActions.DO_STOP_CRAWLING
    return CrawlerActions.DO_CONTINUE_CRAWLING

options.callbacks.request_after_finish = cb_request_after_finish

...
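DO_STOP_CRAWLING makes it easy to cap a crawl. The sketch below is illustrative and not from the official docs: it stops the crawler once the total number of request/response pairs found, as reported by queue.get_count() in the example above, reaches a limit. MAX_REQUESTS is a hypothetical constant, not a nyawc option.

...

MAX_REQUESTS = 100  # Hypothetical cap, not a nyawc option.

def cb_request_after_finish(queue, queue_item, new_queue_items):
    # Stop the whole crawl once enough pairs have been found.
    if queue.get_count() >= MAX_REQUESTS:
        return CrawlerActions.DO_STOP_CRAWLING

    return CrawlerActions.DO_CONTINUE_CRAWLING

options.callbacks.request_after_finish = cb_request_after_finish

...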
Before form autofill¶
Can be used to run some code before the crawler automatically fills in a form. It receives three arguments: nyawc.QueueItem, which is the item (request/response pair) the form was found in (note that, as in the example above, this callback receives the queue item rather than the queue); elements, which is a list of BeautifulSoup4 input elements found in the form; and form_data, which is the (editable) form data that will be used in the request.

- By returning CrawlerActions.DO_AUTOFILL_FORM, the form will be filled with random data.
- By returning CrawlerActions.DO_NOT_AUTOFILL_FORM, only default input values will be used.
...

def cb_form_before_autofill(queue_item, elements, form_data):
    # return CrawlerActions.DO_NOT_AUTOFILL_FORM

    return CrawlerActions.DO_AUTOFILL_FORM

options.callbacks.form_before_autofill = cb_form_before_autofill

...
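One practical use is keeping the crawler away from login forms. The sketch below is illustrative and not from the official docs: it inspects the BeautifulSoup4 input elements for a password field, using the standard bs4 Tag.get() attribute lookup, and disables autofill for such forms.

...

def cb_form_before_autofill(queue_item, elements, form_data):
    # Leave login-like forms alone: look for a password input.
    for element in elements:
        if element.get("type") == "password":
            return CrawlerActions.DO_NOT_AUTOFILL_FORM

    return CrawlerActions.DO_AUTOFILL_FORM

options.callbacks.form_before_autofill = cb_form_before_autofill

...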
After form autofill¶
Can be used to run some code after the crawler automatically filled in a form. It receives three arguments: nyawc.QueueItem, which is the item (request/response pair) the form was found in; elements, which is a list of BeautifulSoup4 input elements found in the form; and form_data, which is the (editable) form data that will be used in the request.

Please note that this callback will not be called if CrawlerActions.DO_NOT_AUTOFILL_FORM was returned in the before callback.
...

def cb_form_after_autofill(queue_item, elements, form_data):
    pass

options.callbacks.form_after_autofill = cb_form_after_autofill

...
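For debugging, the after-autofill callback can log what the crawler filled in. The sketch below assumes form_data behaves like a dict mapping field names to values, as the "editable form data" description above suggests; that assumption is not confirmed by this page.

...

def cb_form_after_autofill(queue_item, elements, form_data):
    # Log the autofilled values (assumes form_data is dict-like).
    print("Autofilled form on {}".format(queue_item.request.url))

    for name, value in form_data.items():
        print("  {} = {}".format(name, value))

options.callbacks.form_after_autofill = cb_form_after_autofill

...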