nyawc package

Submodules

nyawc.Crawler module

class nyawc.Crawler.Crawler(options)[source]

Bases: object

The main Crawler class which handles the crawling recursion, queue and processes.

queue[source]

nyawc.Queue – The request/response pair queue containing everything to crawl.

routing[source]

nyawc.Routing – A class that identifies requests based on routes from the options.

__options[source]

nyawc.Options – The options to use for the current crawling runtime.

__should_spawn_new_requests[source]

bool – If the crawler should start spawning new requests.

__should_stop[source]

bool – If the crawler should stop the crawling process.

__stopping[source]

bool – If the crawler is stopping the crawling process.

__stopped[source]

bool – If the crawler finished stopping the crawling process.

__threads[source]

obj – All currently running threads, as queue item hash => nyawc.CrawlerThread.

__lock[source]

obj – The callback lock to prevent race conditions.

_Crawler__add_scraped_requests_to_queue(queue_item, scraped_requests)[source]

Convert the scraped requests to queue items, return them and also add them to the queue.

Parameters:
  • queue_item (nyawc.QueueItem) – The request/response pair that finished.
  • scraped_requests (list) – All the requests that were found during this request.
Returns:

The new queue items.

Return type:

list(nyawc.QueueItem)

_Crawler__crawler_finish()[source]

Called when the crawler is finished because there are no queued requests left or it was stopped.

_Crawler__crawler_start()[source]

Spawn the first X queued requests, where X is the max threads option.

Note

The main thread will sleep until the crawler is finished. This enables quitting the application using SIGINT (see http://stackoverflow.com/a/11816038/2491049).

Note

__crawler_stop() and __spawn_new_requests() are called here on the main thread to prevent thread recursion and deadlocks.
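The sleeping main thread described in the notes above can be sketched as follows. This is a minimal illustration using only the standard library, not the crawler's actual code: `finished`, `worker` and `run_until_finished` are hypothetical names, and a `threading.Event` stands in for the crawler's stopped flag.

```python
import threading
import time

# Hypothetical stand-in for the crawler's "stopped" flag.
finished = threading.Event()


def worker():
    """Simulate crawling work that happens off the main thread."""
    time.sleep(0.05)
    finished.set()


def run_until_finished(poll_interval=0.01):
    """Keep the main thread in short sleeps until the work is done.

    Because the main thread only sleeps briefly, a SIGINT (CTRL+C) is
    delivered to it as KeyboardInterrupt between sleeps.
    """
    thread = threading.Thread(target=worker)
    thread.start()
    try:
        while not finished.is_set():
            time.sleep(poll_interval)
    except KeyboardInterrupt:
        # Ask the worker side to stop; here we only mark the flag.
        finished.set()
    thread.join()
    return finished.is_set()


result = run_until_finished()
```

All stop and spawn decisions stay on the main thread in this sketch, which mirrors the reason given in the second note.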

_Crawler__crawler_stop()[source]

Mark the crawler as stopped.

Note

If __stopped is True, the main thread will be stopped. Every piece of code that gets executed after __stopped is True could cause thread exceptions and/or race conditions.

_Crawler__request_finish(queue_item, new_requests, request_failed=False)[source]

Called when the crawler finished the given queue item.

Parameters:
  • queue_item (nyawc.QueueItem) – The request/response pair that finished.
  • new_requests (list) – All the requests that were found during this request.
  • request_failed (bool) – True if the request failed (if it needs to be moved to errored).
_Crawler__request_start(queue_item)[source]

Execute the request in given queue item.

Parameters:queue_item (nyawc.QueueItem) – The request/response pair to scrape.
_Crawler__signal_handler(signum, frame)[source]

On sigint (e.g. CTRL+C) stop the crawler.

Parameters:
  • signum (int) – The signal number.
  • frame (obj) – The current stack frame.
_Crawler__spawn_new_request()[source]

Spawn the first queued request if there is one available.

Returns:True if a new request was spawned, False otherwise.
Return type:bool
_Crawler__spawn_new_requests()[source]

Spawn new requests until the max threads option value is reached.

Note

If no new requests were spawned and there are no requests in progress the crawler will stop crawling.
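As a rough sketch of the spawning rule described above (fill up to the max threads value, stop when nothing was spawned and nothing is running), assuming plain Python containers in place of the real queue and threads. The function name and the returned strings are illustrative only.

```python
from collections import deque


def spawn_new_requests(queued, in_progress, max_threads):
    """Move items from `queued` to `in_progress` until `max_threads` is reached.

    Returns "stop" when nothing was spawned and nothing is in progress,
    otherwise "running".
    """
    spawned = 0
    while queued and len(in_progress) < max_threads:
        in_progress.append(queued.popleft())
        spawned += 1

    if spawned == 0 and not in_progress:
        return "stop"
    return "running"


queued = deque(["/a", "/b", "/c"])
in_progress = []
state = spawn_new_requests(queued, in_progress, max_threads=2)
```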

_Crawler__wait_for_current_threads()[source]

Wait until all the current threads are finished.

__init__(options)[source]

Constructs a Crawler instance.

Parameters:options (nyawc.Options) – The options to use for the current crawling runtime.
start_with(request)[source]

Start the crawler using the given request.

Parameters:request (nyawc.http.Request) – The startpoint for the crawler.

nyawc.CrawlerActions module

class nyawc.CrawlerActions.CrawlerActions[source]

Bases: object

The actions that crawler callbacks can return.

DO_CONTINUE_CRAWLING[source]

int – Continue by crawling the request.

DO_SKIP_TO_NEXT[source]

int – Skip the current request and continue with the next one in line.

DO_STOP_CRAWLING[source]

int – Stop crawling and quit ongoing requests.

DO_AUTOFILL_FORM[source]

int – Autofill this form with random values.

DO_NOT_AUTOFILL_FORM[source]

int – Do not autofill this form with random values.

DO_AUTOFILL_FORM = 4[source]
DO_CONTINUE_CRAWLING = 1[source]
DO_NOT_AUTOFILL_FORM = 5[source]
DO_SKIP_TO_NEXT = 2[source]
DO_STOP_CRAWLING = 3[source]
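A callback could choose between these actions roughly as shown below. The `Actions` class is a local stand-in that copies the integer values listed above so the sketch runs without the library, and the callback receives a plain URL string instead of a nyawc.QueueItem; both are assumptions made for illustration.

```python
class Actions:
    """Local stand-in mirroring the CrawlerActions values listed above."""
    DO_CONTINUE_CRAWLING = 1
    DO_SKIP_TO_NEXT = 2
    DO_STOP_CRAWLING = 3


def request_before_start(queue, url):
    """Hypothetical 'request before start' callback.

    Skips logout links, stops on an admin area, continues otherwise.
    """
    if "/logout" in url:
        return Actions.DO_SKIP_TO_NEXT
    if "/admin" in url:
        return Actions.DO_STOP_CRAWLING
    return Actions.DO_CONTINUE_CRAWLING


action = request_before_start(None, "https://example.com/logout")
```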

nyawc.CrawlerThread module

class nyawc.CrawlerThread.CrawlerThread(callback, callback_lock, options, queue_item)[source]

Bases: threading.Thread

The crawler thread executes the HTTP request using the HTTP handler.

__callback[source]

obj – The method to call when finished

__callback_lock[source]

obj – The callback lock that prevents race conditions.

__options[source]

nyawc.Options – The settings/options object.

__queue_item[source]

nyawc.QueueItem – The queue item containing a request to execute.

__init__(callback, callback_lock, options, queue_item)[source]

Constructs a crawler thread instance.

Parameters:
  • callback (obj) – The method to call when finished
  • callback_lock (obj) – The callback lock that prevents race conditions.
  • options (nyawc.Options) – The settings/options object.
  • queue_item (nyawc.QueueItem) – The queue item containing a request to execute.
run()[source]

Executes the HTTP call.

Note

If this and the parent handler raised an error, the queue item status will be set to errored instead of finished. This is to prevent e.g. 404 recursion.
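The pattern of a worker thread that reports back through a lock-protected callback might look like the sketch below. `WorkerThread`, `on_finish` and the two job functions are hypothetical, a plain callable stands in for the HTTP request, and the lock is assumed to be a `threading.Lock`.

```python
import threading


class WorkerThread(threading.Thread):
    """Run one job and report the outcome through a shared callback."""

    def __init__(self, callback, callback_lock, job):
        super().__init__()
        self._callback = callback
        self._callback_lock = callback_lock
        self._job = job

    def run(self):
        failed = False
        try:
            result = self._job()  # stand-in for executing the HTTP request
        except Exception:
            result = None
            failed = True  # the caller can mark the item as errored

        # Only one thread at a time may run the callback.
        with self._callback_lock:
            self._callback(result, failed)


results = []
lock = threading.Lock()


def on_finish(result, failed):
    results.append((result, failed))


def ok_job():
    return "ok"


def failing_job():
    raise RuntimeError("request failed")


threads = [WorkerThread(on_finish, lock, job) for job in (ok_job, failing_job)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```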

nyawc.Options module

class nyawc.Options.Options[source]

Bases: object

The Options class contains all the crawling options.

scope[source]

nyawc.Options.OptionsScope – Can be used to define the crawling scope.

callbacks[source]

nyawc.Options.OptionsCallbacks – Can be used to define crawling callbacks.

performance[source]

nyawc.Options.OptionsPerformance – Can be used to define performance options.

identity[source]

nyawc.Options.OptionsIdentity – Can be used to define the identity/footprint options.

routing[source]

nyawc.Options.OptionsRouting – Can be used to define routes to ignore similar requests.

misc[source]

nyawc.Options.OptionsMisc – Can be used to define the other options.

__init__()[source]

Constructs an Options instance.

class nyawc.Options.OptionsCallbacks[source]

Bases: object

The OptionsCallbacks class contains all the callback methods.

crawler_before_start[source]

obj – called before the crawler starts crawling. Default is a null route to __null_route_crawler_before_start.

crawler_after_finish[source]

obj – called after the crawler finished crawling. Default is a null route to __null_route_crawler_after_finish.

request_before_start[source]

obj – called before the crawler starts a new request. Default is a null route to __null_route_request_before_start.

request_after_finish[source]

obj – called after the crawler finishes a request. Default is a null route to __null_route_request_after_finish.

request_in_thread_before_start[source]

obj – called in the crawling thread (when it started). Default is a null route to __null_route_request_in_thread_before_start.

request_in_thread_after_finish[source]

obj – called in the crawling thread (when it finished). Default is a null route to __null_route_request_in_thread_after_finish.

request_on_error[source]

obj – called if a request failed. Default is a null route to __null_route_request_on_error.

form_before_autofill[source]

obj – called before the crawler starts autofilling a form. Default is a null route to __null_route_form_before_autofill.

form_after_autofill[source]

obj – called after the crawler finishes autofilling a form. Default is a null route to __null_route_form_after_autofill.

_OptionsCallbacks__null_route_crawler_after_finish(queue)[source]

A null route for the ‘crawler after finish’ callback.

Parameters:queue (obj) – The current crawling queue.
_OptionsCallbacks__null_route_crawler_before_start()[source]

A null route for the ‘crawler before start’ callback.

_OptionsCallbacks__null_route_form_after_autofill(queue_item, elements, form_data)[source]

A null route for the ‘form after autofill’ callback.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • elements (list) – The soup elements found in the form.
  • form_data (obj) – The {key: value} form fields.
_OptionsCallbacks__null_route_form_before_autofill(queue_item, elements, form_data)[source]

A null route for the ‘form before autofill’ callback.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • elements (list) – The soup elements found in the form.
  • form_data (obj) – The {key: value} form fields to be autofilled.
Returns:

A crawler action (either DO_AUTOFILL_FORM or DO_NOT_AUTOFILL_FORM).

Return type:

int

_OptionsCallbacks__null_route_request_after_finish(queue, queue_item, new_queue_items)[source]

A null route for the ‘request after finish’ callback.

Parameters:
  • queue (nyawc.Queue) – The current crawling queue.
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • new_queue_items (list) – The new queue items that were found in the one that finished.
Returns:

A crawler action (either DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING).

Return type:

int

_OptionsCallbacks__null_route_request_before_start(queue, queue_item)[source]

A null route for the ‘request before start’ callback.

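Parameters:
  • queue (nyawc.Queue) – The current crawling queue.
  • queue_item (nyawc.QueueItem) – The queue item that is about to start.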
Returns:

A crawler action (either DO_SKIP_TO_NEXT, DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING).

Return type:

int

_OptionsCallbacks__null_route_request_in_thread_after_finish(queue_item)[source]

A null route for the ‘request in thread after finish’ callback.

Parameters:queue_item (nyawc.QueueItem) – The queue item that was finished.

Note

This method gets called in the crawling thread and is therefore not thread safe.

_OptionsCallbacks__null_route_request_in_thread_before_start(queue_item)[source]

A null route for the ‘request in thread before start’ callback.

Parameters:queue_item (nyawc.QueueItem) – The queue item that is about to start.

Note

This method gets called in the crawling thread and is therefore not thread safe.

_OptionsCallbacks__null_route_request_on_error(queue_item, message)[source]

A null route for the ‘request on error’ callback.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • message (str) – The error message.
__init__()[source]

Constructs an OptionsCallbacks instance.
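The "null route" idea, where every callback attribute starts out pointing at a do-nothing default that can be reassigned, can be sketched like this. The `Callbacks` class is a simplified stand-in with only two callbacks, and it assumes the default 'request after finish' route returns the continue action (value 1); neither detail is confirmed by this page.

```python
class Callbacks:
    """Simplified stand-in showing default (null route) callbacks."""

    CONTINUE = 1  # assumed to mirror DO_CONTINUE_CRAWLING

    def __init__(self):
        # Every callback starts as a null route and may be reassigned.
        self.request_after_finish = self._null_route_request_after_finish
        self.request_on_error = self._null_route_request_on_error

    def _null_route_request_after_finish(self, queue, queue_item, new_queue_items):
        return self.CONTINUE

    def _null_route_request_on_error(self, queue_item, message):
        pass


callbacks = Callbacks()

# Override only the callback that is of interest.
errors = []
callbacks.request_on_error = lambda queue_item, message: errors.append(message)

default_action = callbacks.request_after_finish(None, None, [])
callbacks.request_on_error(None, "timeout")
```

Because the defaults always exist, calling code never has to check whether a callback was set.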

class nyawc.Options.OptionsIdentity[source]

Bases: object

The OptionsIdentity class contains the identity/footprint options.

auth[source]

obj – The (requests module) authentication class to use when making a request. For more information check http://docs.python-requests.org/en/master/user/authentication/.

cookies[source]

obj – The (requests module) cookie jar to use when making a request. For more information check http://docs.python-requests.org/en/master/user/quickstart/#cookies.

headers[source]

obj – The headers {key: value} to use when making a request.

proxies[source]

obj – The proxies {key: value} to use when making a request. For more information check http://docs.python-requests.org/en/master/user/advanced/#proxies.

__init__()[source]

Constructs an OptionsIdentity instance.

class nyawc.Options.OptionsMisc[source]

Bases: object

The OptionsMisc class contains all kinds of miscellaneous options.

debug[source]

bool – If debug is enabled extra information will be logged to the console. Default is False.

verify_ssl_certificates[source]

bool – If verification is enabled all SSL certificates will be checked for validity. Default is True.

trusted_certificates[source]

str – You can pass the path to a CA_BUNDLE file or directory with certificates of trusted CAs. Default is None.

__init__()[source]

Constructs an OptionsMisc instance.

class nyawc.Options.OptionsPerformance[source]

Bases: object

The OptionsPerformance class contains the performance options.

max_threads[source]

int – The maximum number of simultaneous threads to use for crawling.

request_timeout[source]

int – the request timeout in seconds (throws an exception if exceeded).

__init__()[source]

Constructs an OptionsPerformance instance.

class nyawc.Options.OptionsRouting[source]

Bases: object

The OptionsRouting class can contain routes that prevent the crawler from crawling similar pages multiple times.

minimum_threshold[source]

int – The minimum number of requests to crawl (matching a certain route) before ignoring the rest. Default is 20.

routes[source]

arr – The regular expressions that represent routes that should not be crawled more times than the minimum threshold. Default is an empty array.

Note

An example would be a news site with URLs like /news/3443, /news/2132, /news/9475, etc. You can add a regular expression that matches this route so that only X requests matching the regular expression will be crawled (where X is the minimum threshold).

Note

The crawler will only stop crawling requests of certain routes at exactly the minimum threshold if the maximum threads option is set to 1. If the maximum threads option is set to a value higher than 1, the threshold will get a bit higher depending on the number of threads used.
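A route for the news example could be written as a regular expression like the one below. This page does not say whether routes are matched against the full URL or only the path, so the sketch assumes the full URL; `news_route` and the example.com URLs are illustrative.

```python
import re

# Assumed to be matched against the full URL, anchored at both ends.
news_route = re.compile(r"^https?://[^/]+/news/[0-9]+/?$")

urls = [
    "https://example.com/news/3443",
    "https://example.com/news/2132",
    "https://example.com/news/archive",
    "https://example.com/about",
]

# Only the numbered news pages belong to the route.
matching = [url for url in urls if news_route.match(url)]
```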

__init__()[source]

Constructs an OptionsRouting instance.

class nyawc.Options.OptionsScope[source]

Bases: object

The OptionsScope class contains the scope options.

protocol_must_match[source]

bool – only crawl pages with the same protocol as the startpoint (e.g. only https).

subdomain_must_match[source]

bool – only crawl pages with the same subdomain as the startpoint. If the startpoint is not a subdomain, no subdomains will be crawled.

hostname_must_match[source]

bool – only crawl pages with the same hostname as the startpoint (e.g. only finnwea).

tld_must_match[source]

bool – only crawl pages with the same TLD as the startpoint (e.g. only .com).

max_depth[source]

obj – the maximum search depth. For example, 2 would be the startpoint and all the pages found on it. Default is None (unlimited).

request_methods

list(str) – Only crawl these request methods. If empty or None all request methods will be crawled. Default is all.

__init__()[source]

Constructs an OptionsScope instance.
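To make the four "must match" flags concrete, here is a naive scope check. It is a sketch under stated assumptions, not the library's logic: the host is split on dots, so multi-part TLDs such as .co.uk are handled incorrectly, and all flags default to True here although this page does not list the defaults. `split_host` and `in_scope` are hypothetical helpers.

```python
from urllib.parse import urlparse


def split_host(url):
    """Naively split a URL's host into (subdomain, hostname, tld)."""
    labels = (urlparse(url).hostname or "").split(".")
    if len(labels) < 2:
        return "", labels[0], ""
    return ".".join(labels[:-2]), labels[-2], labels[-1]


def in_scope(start_url, candidate_url, protocol_must_match=True,
             subdomain_must_match=True, hostname_must_match=True,
             tld_must_match=True):
    """Return True if the candidate passes every enabled check."""
    start_sub, start_host, start_tld = split_host(start_url)
    cand_sub, cand_host, cand_tld = split_host(candidate_url)
    same_scheme = urlparse(start_url).scheme == urlparse(candidate_url).scheme

    checks = [
        (protocol_must_match, same_scheme),
        (subdomain_must_match, start_sub == cand_sub),
        (hostname_must_match, start_host == cand_host),
        (tld_must_match, start_tld == cand_tld),
    ]
    # Disabled checks are ignored entirely.
    return all(matches for required, matches in checks if required)


start = "https://finnwea.com/"
same_site = in_scope(start, "https://finnwea.com/about")
other_subdomain = in_scope(start, "https://blog.finnwea.com/")
```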

nyawc.Queue module

class nyawc.Queue.Queue(options)[source]

Bases: object

A ‘hash’ queue containing all the requests of the crawler.

Note

This queue uses a certain hash to prevent duplicate entries and improve the time complexity by checking if the hash exists instead of iterating over all items.
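The benefit described in the note can be illustrated with a dictionary keyed by a per-request string. `HashQueue` and `key_for` are hypothetical, and the key format (method plus URL) is an assumption chosen for the example.

```python
class HashQueue:
    """Toy queue that uses a dict key to detect duplicates."""

    def __init__(self):
        self._items = {}

    @staticmethod
    def key_for(method, url):
        # Assumed key format; the real hash may include more fields.
        return method.upper() + " " + url

    def has(self, method, url):
        # Average O(1) dictionary lookup instead of scanning every item.
        return self.key_for(method, url) in self._items

    def add(self, method, url):
        """Add the request unless an identical one exists; return True if added."""
        if self.has(method, url):
            return False
        self._items[self.key_for(method, url)] = (method, url)
        return True

    def __len__(self):
        return len(self._items)


queue = HashQueue()
first = queue.add("get", "https://example.com/")
duplicate = queue.add("GET", "https://example.com/")
```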

__options[source]

nyawc.Options – The options to use (used when generating queue item hashes).

count_total[source]

int – The total count of requests in the queue.

items_queued

list(nyawc.QueueItem) – The queued items (yet to be executed).

items_in_progress

list(nyawc.QueueItem) – The items currently being executed.

items_finished

list(nyawc.QueueItem) – The finished items.

items_cancelled

list(nyawc.QueueItem) – Items that were cancelled.

items_errored

list(nyawc.QueueItem) – Items that generated an error.

_Queue__get_var(name)[source]

Get an instance/class var by name.

Parameters:name (str) – The name of the variable.
Returns:Its value.
Return type:obj
_Queue__set_var(name, value)[source]

Set an instance/class var by name.

Parameters:
  • name (str) – The name of the variable.
  • value (obj) – Its new value.
__init__(options)[source]

Constructs a Queue instance.

Parameters:options (nyawc.Options) – The options to use.
add(queue_item)[source]

Add a request/response pair to the queue.

Parameters:queue_item (nyawc.QueueItem) – The queue item to add.
add_request(request)[source]

Add a request to the queue.

Parameters:request (nyawc.http.Request) – The request to add.
Returns:The created queue item.
Return type:nyawc.QueueItem
get_all(status)[source]

Get all the items in the queue that have the given status.

Parameters:status (str) – Return the items with this status.
Returns:All the queue items with the given status.
Return type:list(nyawc.QueueItem)
get_first(status)[source]

Get the first item in the queue that has the given status.

Parameters:status (str) – Return the first item with this status.
Returns:The first queue item with the given status.
Return type:nyawc.QueueItem
get_progress()[source]

Get the progress of the queue as a percentage (float).

Returns:The ‘finished’ progress in percentage.
Return type:float
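One plausible way to compute such a percentage is shown below. The formula is an assumption: it treats every item that is neither queued nor in progress (finished, cancelled or errored) as done, which this page does not confirm. The function is a free-standing illustration, not the method itself.

```python
def get_progress(count_queued, count_in_progress, count_total):
    """Return the share of items that are no longer pending, in percent."""
    if count_total == 0:
        return 0.0  # avoid dividing by zero on an empty queue
    remaining = count_queued + count_in_progress
    return 100.0 - (100.0 * remaining / count_total)


progress = get_progress(count_queued=2, count_in_progress=1, count_total=12)
```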
has_request(request)[source]

Check if the given request already exists in the queue.

Parameters:request (nyawc.http.Request) – The request to check.
Returns:True if already exists, False otherwise.
Return type:bool
move(queue_item, status)[source]

Move a request/response pair to another status.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item to move.
  • status (str) – The new status of the queue item.
move_bulk(from_statuses, to_status)[source]

Move a bulk of request/response pairs to another status.

Parameters:
  • from_statuses (list) – The statuses to move from.
  • to_status (str) – The status to move to.
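Moving items between statuses can be pictured as moving keys between per-status dictionaries. In this sketch `StatusQueue` is a hypothetical stand-in, items are plain strings, and `move` takes the current status explicitly (the documented method takes only the item and the new status).

```python
STATUSES = ["queued", "in_progress", "finished", "cancelled", "errored"]


class StatusQueue:
    """Toy queue that keeps one dict of items per status."""

    def __init__(self):
        self.items = {status: {} for status in STATUSES}

    def add(self, key, status="queued"):
        self.items[status][key] = key

    def move(self, key, from_status, to_status):
        # Remove from the old status dict and insert into the new one.
        self.items[to_status][key] = self.items[from_status].pop(key)

    def move_bulk(self, from_statuses, to_status):
        for from_status in from_statuses:
            # Copy the keys first: the dict changes while we move items.
            for key in list(self.items[from_status]):
                self.move(key, from_status, to_status)


queue = StatusQueue()
for url in ("/a", "/b", "/c"):
    queue.add(url)

queue.move("/a", "queued", "in_progress")
# E.g. when stopping: cancel everything that has not finished yet.
queue.move_bulk(["queued", "in_progress"], "cancelled")
```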

nyawc.QueueItem module

class nyawc.QueueItem.QueueItem(request, response)[source]

Bases: object

The QueueItem class keeps track of the request and response and the crawling status.

STATUS_QUEUED[source]

str – Status for when the crawler did not yet start the request.

STATUS_IN_PROGRESS[source]

str – Status for when the crawler is currently crawling the request.

STATUS_FINISHED[source]

str – Status for when the crawler has finished crawling the request.

STATUS_CANCELLED[source]

str – Status for when the crawler has cancelled the request.

STATUS_ERRORED[source]

str – Status for when the crawler could not execute the request.

STATUSES[source]

arr – All statuses.

status[source]

str – The current crawling status.

decomposed[source]

bool – If this queue item is decomposed.

request[source]

nyawc.http.Request – The Request object.

response[source]

nyawc.http.Response – The Response object.

__response_soup[source]

obj – The BeautifulSoup container for the response text.

__index_hash[source]

str – The index of the queue (if cached), otherwise None.

Note

A queue item will be decomposed (cached objects are deleted to free up memory) when it is not likely to be used again. After decomposition, variables will not be cached anymore.

STATUSES = ['queued', 'in_progress', 'finished', 'cancelled', 'errored'][source]
STATUS_CANCELLED = 'cancelled'[source]
STATUS_ERRORED = 'errored'[source]
STATUS_FINISHED = 'finished'[source]
STATUS_IN_PROGRESS = 'in_progress'[source]
STATUS_QUEUED = 'queued'[source]
__init__(request, response)[source]

Constructs a QueueItem instance.

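Parameters:
  • request (nyawc.http.Request) – The Request object.
  • response (nyawc.http.Response) – The Response object.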
decompose()[source]

Decompose this queue item (set cached variables to None) to free up memory.

Note

When cached variables are set to None, the memory is released once the garbage collector has run.
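The cache-then-decompose behaviour can be sketched with a small class. `CachedItem` is hypothetical, and splitting the text into words stands in for the expensive BeautifulSoup parse so the example needs no third-party packages.

```python
class CachedItem:
    """Toy item with one cached value that can be dropped to free memory."""

    def __init__(self, text):
        self.text = text
        self.decomposed = False
        self._parsed = None  # cache for the expensive parse result

    def _parse(self):
        return self.text.split()  # stand-in for an expensive parse

    def get_parsed(self):
        if self.decomposed:
            # After decomposition nothing is cached anymore.
            return self._parse()
        if self._parsed is None:
            self._parsed = self._parse()
        return self._parsed

    def decompose(self):
        """Drop the cached value so the garbage collector can reclaim it."""
        self._parsed = None
        self.decomposed = True


item = CachedItem("hello crawler world")
cached_twice = item.get_parsed() is item.get_parsed()
item.decompose()
after = item.get_parsed()
```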

get_hash()[source]

Generate and return the dict index hash of the given queue item.

Note

Cookies should not be included in the hash calculation because otherwise requests are crawled multiple times with e.g. different session keys, causing infinite crawling recursion.

Note

At this moment the keys do not actually get hashed, since it works perfectly without hashing and since hashing the keys would require us to build hash collision management.

Returns:The hash of the given queue item.
Return type:str
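The two notes above suggest a key built from request properties other than cookies and used directly as a dictionary index. The sketch below assumes the key consists of method, URL and sorted form data; the actual fields are not listed on this page, and `build_index_key` is a hypothetical helper.

```python
def build_index_key(method, url, data=None, cookies=None):
    """Build a plain string key for a request.

    `cookies` is accepted but deliberately ignored, so the same page
    fetched with a different session cookie maps to the same key.
    """
    pairs = sorted((data or {}).items())
    data_part = "&".join(f"{name}={value}" for name, value in pairs)
    # The key is not hashed; the string itself is the dict index.
    return f"{method.upper()}|{url}|{data_part}"


key_a = build_index_key("get", "https://example.com/", cookies={"session": "abc"})
key_b = build_index_key("get", "https://example.com/", cookies={"session": "xyz"})
key_c = build_index_key("post", "https://example.com/", data={"q": "1"})
```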
get_soup_response()[source]

Get the response as a cached BeautifulSoup container.

Returns:The BeautifulSoup container.
Return type:obj

nyawc.Routing module

class nyawc.Routing.Routing(options)[source]

Bases: object

The Routing class counts requests that match certain routes.

__routing_options[source]

nyawc.Options.OptionsRouting – The options containing routing information.

__routing_count[source]

obj – The {key: value} dict that contains the amount of requests for certain routes.

__init__(options)[source]

Constructs a Routing instance.

Parameters:options (nyawc.Options) – The options to use for the current crawling runtime.
increase_route_count(crawled_request)[source]

Increase the count that determines how many times a URL of a certain route has been crawled.

Parameters:crawled_request (nyawc.http.Request) – The request that possibly matches a route.
is_treshold_reached(scraped_request)[source]

Check if requests similar to the given request have already been crawled X times, where X is the minimum threshold from the options.

Parameters:scraped_request (nyawc.http.Request) – The request that possibly reached the minimum threshold.
Returns:True if threshold reached, False otherwise.
Return type:bool
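The interplay of the two methods (count when a request was crawled, check when a new request is scraped) can be sketched as follows. `RouteCounter` is a hypothetical stand-in that works on URL strings, uses `re.search`, and names its check `is_threshold_reached`; the real matching rules may differ.

```python
import re


class RouteCounter:
    """Toy counter for requests that match configured route patterns."""

    def __init__(self, routes, minimum_threshold=20):
        self.routes = [re.compile(route) for route in routes]
        self.minimum_threshold = minimum_threshold
        self.counts = {}  # {route pattern: number of crawled requests}

    def _matching_patterns(self, url):
        return [route.pattern for route in self.routes if route.search(url)]

    def increase_route_count(self, url):
        """Record that a URL was crawled for every route it matches."""
        for pattern in self._matching_patterns(url):
            self.counts[pattern] = self.counts.get(pattern, 0) + 1

    def is_threshold_reached(self, url):
        """Return True if any route matching the URL reached the threshold."""
        return any(
            self.counts.get(pattern, 0) >= self.minimum_threshold
            for pattern in self._matching_patterns(url)
        )


counter = RouteCounter([r"/news/[0-9]+"], minimum_threshold=2)
before = counter.is_threshold_reached("https://example.com/news/1")
counter.increase_route_count("https://example.com/news/1")
counter.increase_route_count("https://example.com/news/2")
after = counter.is_threshold_reached("https://example.com/news/3")
unrelated = counter.is_threshold_reached("https://example.com/about")
```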