nyawc package

Submodules

nyawc.Crawler module

class nyawc.Crawler.Crawler(options)[source]

Bases: object

The main Crawler class which handles the crawling recursion, queue and processes.

queue[source]

nyawc.Queue – The request/response pair queue containing everything to crawl.

__options[source]

nyawc.Options – The options to use for the current crawling runtime.

__stopping[source]

bool – If the crawler is stopping the crawling process.

__stopped[source]

bool – If the crawler finished stopping the crawling process.

__lock[source]

obj – The callback lock to prevent race conditions.

_Crawler__crawler_finish()[source]

Called when the crawler is finished because there are no queued requests left or it was stopped.

_Crawler__crawler_start()[source]

Spawn the first X queued requests, where X is the max threads option.

Note

The main thread will sleep until the crawler is finished. This enables quitting the application using SIGINT (see http://stackoverflow.com/a/11816038/2491049).
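
The pattern in this note can be illustrated with a minimal sketch. It is not nyawc's implementation: `SketchCrawler`, `work` and `wait_until_stopped` are hypothetical names, and the worker only pretends to crawl. The point is that the main thread sleeps in short intervals, which keeps it responsive to Ctrl+C while worker threads run.

```python
import threading
import time


class SketchCrawler:
    """Hypothetical stand-in; only models the stopping flag."""

    def __init__(self):
        self.stopped = False

    def work(self):
        # Pretend to crawl for a moment, then mark the crawl as stopped.
        time.sleep(0.05)
        self.stopped = True

    def wait_until_stopped(self, interval=0.01):
        # Sleeping in short intervals keeps the main thread interruptible,
        # so Ctrl+C (SIGINT) raises KeyboardInterrupt here.
        try:
            while not self.stopped:
                time.sleep(interval)
        except KeyboardInterrupt:
            self.stopped = True
            raise


crawler = SketchCrawler()
worker = threading.Thread(target=crawler.work, daemon=True)
worker.start()
crawler.wait_until_stopped()
worker.join()
```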

_Crawler__crawler_stop()[source]

Mark the crawler as stopped.

Note

If __stopped is True, the main thread will be stopped. Any code that gets executed after __stopped is set to True could cause thread exceptions and/or race conditions.

_Crawler__request_finish(queue_item, new_requests, new_queue_item_status=None)[source]

Called when the crawler finished the given queued item.

Parameters:
  • queue_item (nyawc.QueueItem) – The request/response pair that finished.
  • new_requests (list) – All the requests that were found during this request.
  • new_queue_item_status (str) – The new status of the queue item (if it needs to be moved).
_Crawler__request_start(queue_item)[source]

Execute the request in given queue item.

Parameters:queue_item (nyawc.QueueItem) – The request/response pair to scrape.
_Crawler__spawn_new_request()[source]

Spawn the first queued request if there is one available.

Returns:If a new request was spawned.
Return type:bool
_Crawler__spawn_new_requests()[source]

Spawn new requests until the max threads option value is reached.

Note

If no new requests were spawned and there are no requests in progress the crawler will stop crawling.
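
The spawning rule and its stop condition can be sketched as follows. This is an illustrative model only: the function name, the `deque`/list arguments and the boolean return value are assumptions, not the library's internals.

```python
from collections import deque


def spawn_new_requests(queued, in_progress, max_threads):
    """Move items from `queued` to `in_progress` until the limit is hit.

    Returns True if the crawl should go on, False if nothing was spawned
    and nothing is in progress (i.e. the crawler should stop).
    """
    spawned = False
    while queued and len(in_progress) < max_threads:
        in_progress.append(queued.popleft())
        spawned = True
    return spawned or bool(in_progress)


queued = deque(["/a", "/b", "/c"])
in_progress = []
keep_going = spawn_new_requests(queued, in_progress, max_threads=2)
```

With three queued items and a limit of two, two items move to `in_progress` and one stays queued until a running request finishes.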

__init__(options)[source]

Constructs a Crawler instance.

Parameters:options (nyawc.Options) – The options to use for the current crawling runtime.
start_with(request)[source]

Start the crawler using the given request.

Parameters:request (nyawc.http.Request) – The startpoint for the crawler.

nyawc.CrawlerActions module

class nyawc.CrawlerActions.CrawlerActions[source]

Bases: object

The actions that crawler callbacks can return.

DO_CONTINUE_CRAWLING[source]

int – Continue by crawling the request.

DO_SKIP_TO_NEXT[source]

int – Skip the current request and continue with the next one in line.

DO_STOP_CRAWLING[source]

int – Stop crawling and quit ongoing requests.

DO_AUTOFILL_FORM[source]

int – Autofill this form with random values.

DO_NOT_AUTOFILL_FORM[source]

int – Do not autofill this form with random values.

DO_AUTOFILL_FORM = 4[source]
DO_CONTINUE_CRAWLING = 1[source]
DO_NOT_AUTOFILL_FORM = 5[source]
DO_SKIP_TO_NEXT = 2[source]
DO_STOP_CRAWLING = 3[source]
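
A callback picks one of these actions as its return value. The sketch below uses local stand-in constants that mirror the values listed above, and a plain dict in place of a queue item; both are assumptions made so the example runs without the library installed.

```python
# Stand-ins mirroring the values documented above; with the library
# installed you would reference nyawc.CrawlerActions.CrawlerActions instead.
DO_CONTINUE_CRAWLING = 1
DO_SKIP_TO_NEXT = 2
DO_STOP_CRAWLING = 3


def request_before_start(queue, queue_item):
    """Hypothetical callback: skip logout links, stop on an admin page."""
    url = queue_item["url"]  # assumption: a plain dict instead of a QueueItem
    if "logout" in url:
        return DO_SKIP_TO_NEXT
    if url.endswith("/admin"):
        return DO_STOP_CRAWLING
    return DO_CONTINUE_CRAWLING


actions = [
    request_before_start(None, {"url": url})
    for url in (
        "https://example.com/",
        "https://example.com/logout",
        "https://example.com/admin",
    )
]
```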

nyawc.CrawlerThread module

class nyawc.CrawlerThread.CrawlerThread(callback, callback_lock, options, queue_item)[source]

Bases: threading.Thread

The crawler thread executes the HTTP request using the HTTP handler.

__callback[source]

obj – The method to call when finished.

__callback_lock[source]

bool – The callback lock that prevents race conditions.

__options[source]

nyawc.Options – The settings/options object.

__queue_item[source]

nyawc.QueueItem – The queue item containing a request to execute.

__init__(callback, callback_lock, options, queue_item)[source]

Constructs a crawler thread instance.

Parameters:
  • callback (obj) – The method to call when finished.
  • callback_lock (bool) – The callback lock that prevents race conditions.
  • options (nyawc.Options) – The settings/options object.
  • queue_item (nyawc.QueueItem) – The queue item containing a request to execute.
run()[source]

Executes the HTTP call.

Note

If this and the parent handler raised an error, the queue item status will be set to errored instead of finished. This is to prevent e.g. 404 recursion.
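
A minimal sketch of this thread pattern, under stated assumptions: `SketchThread` is a hypothetical class, `handler` stands in for the HTTP handler, and the callback signature `(queue_item, new_requests, status)` is modelled on the request-finish method documented earlier. An item that raises is reported as errored with no new requests, and the callback runs under the shared lock.

```python
import threading


class SketchThread(threading.Thread):
    """Hypothetical stand-in for a crawler thread."""

    def __init__(self, callback, callback_lock, handler, queue_item):
        super().__init__()
        self._callback = callback
        self._callback_lock = callback_lock
        self._handler = handler  # assumption: a callable that does the HTTP work
        self._queue_item = queue_item

    def run(self):
        try:
            new_requests = self._handler(self._queue_item)
            status = "finished"
        except Exception:
            # An errored item yields no new requests, so a failing page
            # cannot keep feeding the queue.
            new_requests = []
            status = "errored"
        # Serialize callbacks so shared state is updated by one thread at a time.
        with self._callback_lock:
            self._callback(self._queue_item, new_requests, status)


results = []
lock = threading.Lock()


def record(*args):
    results.append(args)


def failing_handler(item):
    raise IOError("simulated request failure")


threads = [
    SketchThread(record, lock, lambda item: [item + "/next"], "/ok"),
    SketchThread(record, lock, failing_handler, "/broken"),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```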

nyawc.Options module

class nyawc.Options.Options[source]

Bases: object

The Options class contains all the crawling options.

scope[source]

nyawc.Options.OptionsScope – Can be used to define the crawling scope.

callbacks[source]

nyawc.Options.OptionsCallbacks – Can be used to define crawling callbacks.

performance[source]

nyawc.Options.OptionsPerformance – Can be used to define performance options.

identity[source]

nyawc.Options.OptionsIdentity – Can be used to define the identity/footprint options.

misc[source]

nyawc.Options.OptionsMisc – Can be used to define the other options.

__init__()[source]

Constructs an Options instance.

class nyawc.Options.OptionsCallbacks[source]

Bases: object

The OptionsCallbacks class contains all the callback methods.

crawler_before_start[source]

obj – called before the crawler starts crawling. Default is a null route to __null_route_crawler_before_start.

crawler_after_finish[source]

obj – called after the crawler finished crawling. Default is a null route to __null_route_crawler_after_finish.

request_before_start[source]

obj – called before the crawler starts a new request. Default is a null route to __null_route_request_before_start.

request_after_finish[source]

obj – called after the crawler finishes a request. Default is a null route to __null_route_request_after_finish.

request_in_thread_before_start[source]

obj – called in the crawling thread (when it started). Default is a null route to __null_route_request_in_thread_before_start.

request_in_thread_after_finish[source]

obj – called in the crawling thread (when it finished). Default is a null route to __null_route_request_in_thread_after_finish.

request_on_error[source]

obj – called if a request failed. Default is a null route to __null_route_request_on_error.

form_before_autofill[source]

obj – called before the crawler starts autofilling a form. Default is a null route to __null_route_form_before_autofill.

form_after_autofill[source]

obj – called after the crawler finishes autofilling a form. Default is a null route to __null_route_form_after_autofill.
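
The "null route" defaults mean every hook can be called safely even when nothing was configured. The sketch below shows that pattern with a hypothetical `SketchCallbacks` class covering two of the hooks; it is a model of the idea, not the library's class.

```python
class SketchCallbacks:
    """Hypothetical stand-in showing the null-route default pattern."""

    def __init__(self):
        # Every hook starts out pointing at a do-nothing method.
        self.request_on_error = self._null_route_request_on_error
        self.crawler_before_start = self._null_route_crawler_before_start

    def _null_route_request_on_error(self, queue_item, message):
        pass

    def _null_route_crawler_before_start(self):
        pass


errors = []
callbacks = SketchCallbacks()

# The defaults are safe to call even when nothing was configured.
callbacks.crawler_before_start()
callbacks.request_on_error("item-1", "ignored")

# Overriding a hook is a plain attribute assignment.
callbacks.request_on_error = lambda queue_item, message: errors.append((queue_item, message))
callbacks.request_on_error("item-2", "timeout")
```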

_OptionsCallbacks__null_route_crawler_after_finish(queue)[source]

A null route for the ‘crawler after finish’ callback.

Parameters:queue (obj) – The current crawling queue.
_OptionsCallbacks__null_route_crawler_before_start()[source]

A null route for the ‘crawler before start’ callback.

_OptionsCallbacks__null_route_form_after_autofill(queue_item, elements, form_data)[source]

A null route for the ‘form after autofill’ callback.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • elements (list) – The soup elements found in the form.
  • form_data (obj) – The {key: value} form fields.
_OptionsCallbacks__null_route_form_before_autofill(queue_item, elements, form_data)[source]

A null route for the ‘form before autofill’ callback.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • elements (list) – The soup elements found in the form.
  • form_data (obj) – The {key: value} form fields to be autofilled.
Returns:

A crawler action (either DO_AUTOFILL_FORM or DO_NOT_AUTOFILL_FORM).

Return type:

int

_OptionsCallbacks__null_route_request_after_finish(queue, queue_item, new_queue_items)[source]

A null route for the ‘request after finish’ callback.

Parameters:
  • queue (nyawc.Queue) – The current crawling queue.
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • new_queue_items (list) – The new queue items that were found in the one that finished.
Returns:

A crawler action (either DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING).

Return type:

int

_OptionsCallbacks__null_route_request_before_start(queue, queue_item)[source]

A null route for the ‘request before start’ callback.

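Parameters:
  • queue (nyawc.Queue) – The current crawling queue.
  • queue_item (nyawc.QueueItem) – The queue item that is about to start.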
Returns:

A crawler action (either DO_SKIP_TO_NEXT, DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING).

Return type:

int

_OptionsCallbacks__null_route_request_in_thread_after_finish(queue_item)[source]

A null route for the ‘request in thread after finish’ callback.

Parameters:queue_item (nyawc.QueueItem) – The queue item that was finished.

Note

This method gets called in the crawling thread and is therefore not thread safe.

_OptionsCallbacks__null_route_request_in_thread_before_start(queue_item)[source]

A null route for the ‘request in thread before start’ callback.

Parameters:queue_item (nyawc.QueueItem) – The queue item that is about to start.

Note

This method gets called in the crawling thread and is therefore not thread safe.

_OptionsCallbacks__null_route_request_on_error(queue_item, message)[source]

A null route for the ‘request on error’ callback.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • message (str) – The error message.
__init__()[source]

Constructs an OptionsCallbacks instance.

class nyawc.Options.OptionsIdentity[source]

Bases: object

The OptionsIdentity class contains the identity/footprint options.

auth[source]

obj – The (requests module) authentication class to use when making a request. For more information check http://docs.python-requests.org/en/master/user/authentication/.

cookies[source]

obj – The (requests module) cookie jar to use when making a request. For more information check http://docs.python-requests.org/en/master/user/quickstart/#cookies.

headers[source]

obj – The headers {key: value} to use when making a request.

proxies[source]

obj – The proxies {key: value} to use when making a request. For more information check http://docs.python-requests.org/en/master/user/advanced/#proxies.

__init__()[source]

Constructs an OptionsIdentity instance.

class nyawc.Options.OptionsMisc[source]

Bases: object

The OptionsMisc class contains all kinds of miscellaneous options.

debug[source]

bool – If debug is enabled extra information will be logged to the console. Default is False.

__init__()[source]

Constructs an OptionsMisc instance.

class nyawc.Options.OptionsPerformance[source]

Bases: object

The OptionsPerformance class contains the performance options.

max_threads[source]

obj – the maximum number of simultaneous threads to use for crawling.

__init__()[source]

Constructs an OptionsPerformance instance.

class nyawc.Options.OptionsScope[source]

Bases: object

The OptionsScope class contains the scope options.

protocol_must_match[source]

bool – only crawl pages with the same protocol as the startpoint (e.g. only https).

subdomain_must_match[source]

bool – only crawl pages with the same subdomain as the startpoint. If the startpoint is not a subdomain, no subdomains will be crawled.

hostname_must_match[source]

bool – only crawl pages with the same hostname as the startpoint (e.g. only finnwea).

tld_must_match[source]

bool – only crawl pages with the same TLD as the startpoint (e.g. only .com).

max_depth[source]

obj – the maximum search depth. For example, 2 would be the startpoint and all the pages found on it. Default is None (unlimited).

__init__()[source]

Constructs an OptionsScope instance.
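
How these scope options might combine can be sketched as below. Everything here is an assumption for illustration: the function names, the default values, the depth convention (the startpoint counts as depth 1, matching the max_depth example above), and in particular the naive host splitting, which treats the TLD as a single label.

```python
from urllib.parse import urlparse


def split_host(hostname):
    """Naively split 'a.b.example.com' into (subdomain, hostname, tld).

    Assumption: the TLD is a single label, so 'example.co.uk' is not
    handled correctly. A real implementation needs a public suffix list.
    """
    labels = hostname.split(".")
    if len(labels) < 2:
        return "", hostname, ""
    return ".".join(labels[:-2]), labels[-2], labels[-1]


def in_scope(start_url, url, depth, *, protocol_must_match=False,
             subdomain_must_match=True, hostname_must_match=True,
             tld_must_match=True, max_depth=None):
    """Return True if `url` found at `depth` may be crawled."""
    start, found = urlparse(start_url), urlparse(url)
    start_sub, start_host, start_tld = split_host(start.hostname or "")
    found_sub, found_host, found_tld = split_host(found.hostname or "")

    if max_depth is not None and depth > max_depth:
        return False
    if protocol_must_match and start.scheme != found.scheme:
        return False
    if subdomain_must_match and start_sub != found_sub:
        return False
    if hostname_must_match and start_host != found_host:
        return False
    if tld_must_match and start_tld != found_tld:
        return False
    return True
```

For example, with these assumed defaults a link from `https://example.com/` to `https://blog.example.com/` is rejected unless `subdomain_must_match` is turned off.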

nyawc.Queue module

class nyawc.Queue.Queue(options)[source]

Bases: object

A ‘hash’ queue containing all the requests of the crawler.

Note

This queue uses a certain hash (from __get_hash()) to prevent duplicate entries and improve the time complexity by checking if the hash exists instead of iterating over all items.

__options[source]

nyawc.Options – The options to use (used when generating queue item hashes).

count_total[source]

int – The total count of requests in the queue.

count_queued[source]

int – The number of queued items in the queue.

count_in_progress[source]

int – The number of in progress items in the queue.

count_finished[source]

int – The number of finished items in the queue.

count_cancelled[source]

int – The number of cancelled items in the queue.

count_errored[source]

int – The number of errored items in the queue.

items_queued

list(nyawc.QueueItem) – The queued items (yet to be executed).

items_in_progress

list(nyawc.QueueItem) – The items currently being executed.

items_finished

list(nyawc.QueueItem) – The finished items.

items_cancelled

list(nyawc.QueueItem) – Items that were cancelled.

items_errored

list(nyawc.QueueItem) – Items that generated an error.

_Queue__get_hash(queue_item)[source]

Generate and return the dict index hash of the given queue item.

Note

Cookies should not be included in the hash calculation because otherwise requests are crawled multiple times with e.g. different session keys, causing infinite crawling recursion.

Note

At this moment the keys do not actually get hashed, since the queue works without hashing and since hashing the keys would require us to build hash collision management.

Parameters:queue_item (nyawc.QueueItem) – The queue item to get the hash from.
Returns:The hash of the given queue item.
Return type:str
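
The two notes above can be illustrated with a small dict-keyed queue. This is a sketch under assumptions: `get_key` and `SketchQueue` are hypothetical, requests are reduced to a method, URL and form data, and `add_request` returns a bool here rather than a queue item. Cookies are accepted but left out of the key, so the same URL with a different session cookie is treated as a duplicate.

```python
def get_key(method, url, data=None, cookies=None):
    """Build a dict key for a request; cookies are deliberately ignored."""
    items = tuple(sorted((data or {}).items()))
    # The key is an unhashed string, in the spirit of the second note above.
    return repr((method.upper(), url, items))


class SketchQueue:
    """Hypothetical stand-in for a duplicate-free request queue."""

    def __init__(self):
        self.items = {}

    def has_request(self, method, url, data=None, cookies=None):
        # Dict lookup: O(1) on average instead of scanning every item.
        return get_key(method, url, data, cookies) in self.items

    def add_request(self, method, url, data=None, cookies=None):
        key = get_key(method, url, data, cookies)
        if key in self.items:
            return False
        self.items[key] = (method, url, data)
        return True


queue = SketchQueue()
first = queue.add_request("GET", "https://example.com/", cookies={"session": "abc"})
second = queue.add_request("get", "https://example.com/", cookies={"session": "xyz"})
```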
_Queue__get_var(name)[source]

Get an instance/class var by name.

Parameters:name (str) – The name of the variable.
Returns:Its value.
Return type:obj
_Queue__set_var(name, value)[source]

Set an instance/class var by name.

Parameters:
  • name (str) – The name of the variable.
  • value (obj) – Its new value.
__init__(options)[source]

Constructs a Queue instance.

Parameters:options (nyawc.Options) – The options to use.
add(queue_item)[source]

Add a request/response pair to the queue.

Parameters:queue_item (nyawc.QueueItem) – The queue item to add.
add_request(request)[source]

Add a request to the queue.

Parameters:request (nyawc.http.Request) – The request to add.
Returns:The created queue item.
Return type:nyawc.QueueItem
get_all(status)[source]

Get all the items in the queue that have the given status.

Parameters:status (str) – return the items with this status.
Returns:All the queue items with the given status.
Return type:list(nyawc.QueueItem)
get_first(status)[source]

Get the first item in the queue that has the given status.

Parameters:status (str) – return the first item with this status.
Returns:The first queue item with the given status.
Return type:nyawc.QueueItem
get_progress()[source]

Get the progress of the queue as a percentage (float).

Returns:The ‘finished’ progress in percentage.
Return type:float
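
One plausible way to compute such a percentage is sketched below. The formula is an assumption, not confirmed by this reference: it treats every item that is no longer queued or in progress (finished, cancelled or errored) as done, and reports 0.0 for an empty queue.

```python
def get_progress(count_total, count_queued, count_in_progress):
    """Return the share of items that are no longer pending, in percent."""
    if count_total == 0:
        return 0.0  # assumption: an empty queue reports no progress
    remaining = count_queued + count_in_progress
    return 100.0 * (count_total - remaining) / count_total


progress = get_progress(count_total=8, count_queued=1, count_in_progress=1)
```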
has_request(request)[source]

Check if the given request already exists in the queue.

Parameters:request (nyawc.http.Request) – The request to check.
Returns:True if already exists, False otherwise.
Return type:bool
move(queue_item, status)[source]

Move a request/response pair to another status.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item to move.
  • status (str) – The new status of the queue item.

nyawc.QueueItem module

class nyawc.QueueItem.QueueItem(request, response)[source]

Bases: object

The QueueItem class keeps track of the request and response and the crawling status.

STATUS_QUEUED[source]

str – Status for when the crawler did not yet start the request.

STATUS_IN_PROGRESS[source]

str – Status for when the crawler is currently crawling the request.

STATUS_FINISHED[source]

str – Status for when the crawler has finished crawling the request.

STATUS_CANCELLED[source]

str – Status for when the crawler has cancelled the request.

STATUS_ERRORED[source]

str – Status for when the crawler could not execute the request.

STATUSES[source]

list – All statuses.

status[source]

str – The current crawling status.

request[source]

nyawc.http.Request – The Request object.

response[source]

nyawc.http.Response – The Response object.

response_soup[source]

obj – The BeautifulSoup container for the response text.

STATUSES = ['queued', 'in_progress', 'finished', 'cancelled', 'errored'][source]
STATUS_CANCELLED = 'cancelled'[source]
STATUS_ERRORED = 'errored'[source]
STATUS_FINISHED = 'finished'[source]
STATUS_IN_PROGRESS = 'in_progress'[source]
STATUS_QUEUED = 'queued'[source]
__init__(request, response)[source]

Constructs a QueueItem instance.

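Parameters:
  • request (nyawc.http.Request) – The Request object.
  • response (nyawc.http.Response) – The Response object.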
get_soup_response()[source]

Get the response as a cached BeautifulSoup container.

Returns:The BeautifulSoup container.
Return type:obj
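
The "cached" part of this method can be sketched as lazy parsing: parse on the first call, then reuse the result. The class below is hypothetical, and the `parser` argument is a stand-in for a BeautifulSoup constructor so the example runs without third-party packages.

```python
class SketchItem:
    """Hypothetical stand-in for a queue item with a cached parse result."""

    def __init__(self, response_text, parser):
        self.response_text = response_text
        self._parser = parser  # assumption: any callable that parses text
        self._soup = None

    def get_soup_response(self):
        # Parse lazily, and only once; later calls reuse the cached result.
        if self._soup is None:
            self._soup = self._parser(self.response_text)
        return self._soup


parse_calls = []


def counting_parser(text):
    # Stand-in parser that records how often it is invoked.
    parse_calls.append(text)
    return {"text": text}


item = SketchItem("<html><body>hello</body></html>", counting_parser)
first_soup = item.get_soup_response()
second_soup = item.get_soup_response()
```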