nyawc package

Submodules

nyawc.Crawler module

class nyawc.Crawler.Crawler(options)[source]

Bases: object

The main Crawler class which handles the crawling recursion, queue and processes.

queue[source]

The request/response pair queue containing everything to crawl.

Type:nyawc.Queue
routing[source]

A class that identifies requests based on routes from the options.

Type:nyawc.Routing
__options[source]

The options to use for the current crawling runtime.

Type:nyawc.Options
__should_spawn_new_requests[source]

If the crawler should start spawning new requests.

Type:bool
__should_stop[source]

If the crawler should stop the crawling process.

Type:bool
__stopping[source]

If the crawler is stopping the crawling process.

Type:bool
__stopped[source]

If the crawler has finished stopping the crawling process.

Type:bool
__threads[source]

All currently running threads, as queue item hash => nyawc.CrawlerThread.

Type:obj
__lock[source]

The callback lock to prevent race conditions.

Type:obj
_Crawler__add_scraped_requests_to_queue(queue_item, scraped_requests)[source]

Convert the scraped requests to queue items, add them to the queue and return them.

Parameters:
  • queue_item (nyawc.QueueItem) – The request/response pair that finished.
  • scraped_requests (list) – All the requests that were found during this request.
Returns:

The new queue items.

Return type:

list(nyawc.QueueItem)

_Crawler__crawler_finish()[source]

Called when the crawler is finished because there are no queued requests left or it was stopped.

_Crawler__crawler_start()[source]

Spawn the first X queued requests, where X is the max threads option.

Note

The main thread will sleep until the crawler is finished. This enables quitting the application using SIGINT (see http://stackoverflow.com/a/11816038/2491049).

Note

__crawler_stop() and __spawn_new_requests() are called here on the main thread to prevent thread recursion and deadlocks.

_Crawler__crawler_stop()[source]

Mark the crawler as stopped.

Note

If __stopped is True, the main thread will be stopped. Every piece of code that gets executed after __stopped is True could cause thread exceptions and/or race conditions.

_Crawler__request_finish(queue_item, new_requests, request_failed=False)[source]

Called when the crawler finished the given queue item.

Parameters:
  • queue_item (nyawc.QueueItem) – The request/response pair that finished.
  • new_requests (list) – All the requests that were found during this request.
  • request_failed (bool) – True if the request failed (and needs to be moved to errored).
_Crawler__request_start(queue_item)[source]

Execute the request in the given queue item.

Parameters:queue_item (nyawc.QueueItem) – The request/response pair to scrape.
_Crawler__signal_handler(signum, frame)[source]

On SIGINT (e.g. CTRL+C), stop the crawler.

Parameters:
  • signum (int) – The signal number.
  • frame (obj) – The current stack frame.
_Crawler__spawn_new_request()[source]

Spawn the first queued request if there is one available.

Returns:True if a new request was spawned, False otherwise.
Return type:bool
_Crawler__spawn_new_requests()[source]

Spawn new requests until the max threads option value is reached.

Note

If no new requests were spawned and there are no requests in progress, the crawler will stop crawling.

_Crawler__wait_for_current_threads()[source]

Wait until all the current threads are finished.

__init__(options)[source]

Constructs a Crawler instance.

Parameters:options (nyawc.Options) – The options to use for the current crawling runtime.
start_with(request)[source]

Start the crawler using the given request.

Parameters:request (nyawc.http.Request) – The startpoint for the crawler.
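
A minimal usage sketch based on the methods documented above; the exact Request constructor signature is an assumption (it is assumed to accept a URL string):

    from nyawc.Options import Options
    from nyawc.Crawler import Crawler
    from nyawc.http.Request import Request

    # Default options; see nyawc.Options below for the available settings.
    options = Options()
    crawler = Crawler(options)

    # Blocks until the queue is empty or the crawl is stopped (e.g. via CTRL+C).
    crawler.start_with(Request("https://example.com/"))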

nyawc.CrawlerActions module

class nyawc.CrawlerActions.CrawlerActions[source]

Bases: object

The actions that crawler callbacks can return.

DO_CONTINUE_CRAWLING[source]

Continue by crawling the request.

Type:int
DO_SKIP_TO_NEXT[source]

Skip the current request and continue with the next one in line.

Type:int
DO_STOP_CRAWLING[source]

Stop crawling and quit ongoing requests.

Type:int
DO_AUTOFILL_FORM[source]

Autofill this form with random values.

Type:int
DO_NOT_AUTOFILL_FORM[source]

Do not autofill this form with random values.

Type:int
DO_AUTOFILL_FORM = 4[source]
DO_CONTINUE_CRAWLING = 1[source]
DO_NOT_AUTOFILL_FORM = 5[source]
DO_SKIP_TO_NEXT = 2[source]
DO_STOP_CRAWLING = 3[source]
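
A hedged sketch of how these constants are returned from a callback. The callback signature follows the 'request before start' null route documented in nyawc.Options.OptionsCallbacks below; the request.url attribute is an assumption:

    from nyawc.CrawlerActions import CrawlerActions

    def cb_request_before_start(queue, queue_item):
        # Example policy: skip logout links and stop after 500 finished requests.
        if "logout" in queue_item.request.url:  # request.url is assumed
            return CrawlerActions.DO_SKIP_TO_NEXT
        if len(queue.items_finished) >= 500:
            return CrawlerActions.DO_STOP_CRAWLING
        return CrawlerActions.DO_CONTINUE_CRAWLING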

nyawc.CrawlerThread module

class nyawc.CrawlerThread.CrawlerThread(callback, callback_lock, options, queue_item)[source]

Bases: threading.Thread

The crawler thread executes the HTTP request using the HTTP handler.

__callback[source]

The method to call when finished.

Type:obj
__callback_lock[source]

The callback lock that prevents race conditions.

Type:bool
__options[source]

The settings/options object.

Type:nyawc.Options
__queue_item[source]

The queue item containing a request to execute.

Type:nyawc.QueueItem
__init__(callback, callback_lock, options, queue_item)[source]

Constructs a CrawlerThread instance.

Parameters:
  • callback (obj) – The method to call when finished.
  • callback_lock (bool) – The callback lock that prevents race conditions.
  • options (nyawc.Options) – The settings/options object.
  • queue_item (nyawc.QueueItem) – The queue item containing a request to execute.
run()[source]

Executes the HTTP call.

Note

If this and the parent handler raised an error, the queue item status will be set to errored instead of finished. This is to prevent e.g. 404 recursion.

nyawc.Options module

class nyawc.Options.Options[source]

Bases: object

The Options class contains all the crawling options.

scope[source]

Can be used to define the crawling scope.

Type:nyawc.Options.OptionsScope
callbacks[source]

Can be used to define crawling callbacks.

Type:nyawc.Options.OptionsCallbacks
performance[source]

Can be used to define performance options.

Type:nyawc.Options.OptionsPerformance
identity[source]

Can be used to define the identity/footprint options.

Type:nyawc.Options.OptionsIdentity
routing[source]

Can be used to define routes to ignore similar requests.

Type:nyawc.Options.OptionsRouting
misc[source]

Can be used to define the other options.

Type:nyawc.Options.OptionsMisc
__init__()[source]

Constructs an Options instance.
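
A short sketch showing how the nested option groups are accessed on an Options instance (attribute names as documented above):

    from nyawc.Options import Options

    options = Options()

    # Every group of settings lives on its own sub-object.
    options.scope.max_depth = 3
    options.performance.max_threads = 10
    options.misc.debug = True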

class nyawc.Options.OptionsCallbacks[source]

Bases: object

The OptionsCallbacks class contains all the callback methods.

crawler_before_start[source]

called before the crawler starts crawling. Default is a null route to __null_route_crawler_before_start.

Type:obj
crawler_after_finish[source]

called after the crawler finished crawling. Default is a null route to __null_route_crawler_after_finish.

Type:obj
request_before_start[source]

called before the crawler starts a new request. Default is a null route to __null_route_request_before_start.

Type:obj
request_after_finish[source]

called after the crawler finishes a request. Default is a null route to __null_route_request_after_finish.

Type:obj
request_in_thread_before_start[source]

called in the crawling thread (when it started). Default is a null route to __null_route_request_in_thread_before_start.

Type:obj
request_in_thread_after_finish[source]

called in the crawling thread (when it finished). Default is a null route to __null_route_request_in_thread_after_finish.

Type:obj
request_on_error[source]

called if a request failed. Default is a null route to __null_route_request_on_error.

Type:obj
form_before_autofill[source]

called before the crawler starts autofilling a form. Default is a null route to __null_route_form_before_autofill.

Type:obj
form_after_autofill[source]

called after the crawler finishes autofilling a form. Default is a null route to __null_route_form_after_autofill.

Type:obj
_OptionsCallbacks__null_route_crawler_after_finish(queue)[source]

A null route for the ‘crawler after finish’ callback.

Parameters:queue (obj) – The current crawling queue.
_OptionsCallbacks__null_route_crawler_before_start()[source]

A null route for the ‘crawler before start’ callback.

_OptionsCallbacks__null_route_form_after_autofill(queue_item, elements, form_data)[source]

A null route for the ‘form after autofill’ callback.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • elements (list) – The soup elements found in the form.
  • form_data (obj) – The {key: value} form fields.
_OptionsCallbacks__null_route_form_before_autofill(queue_item, elements, form_data)[source]

A null route for the ‘form before autofill’ callback.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • elements (list) – The soup elements found in the form.
  • form_data (obj) – The {key: value} form fields to be autofilled.
Returns:

A crawler action (either DO_AUTOFILL_FORM or DO_NOT_AUTOFILL_FORM).

Return type:

int

_OptionsCallbacks__null_route_request_after_finish(queue, queue_item, new_queue_items)[source]

A null route for the ‘request after finish’ callback.

Parameters:
  • queue (nyawc.Queue) – The current crawling queue.
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • new_queue_items (list) – The new queue items that were found in the one that finished.
Returns:

A crawler action (either DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING).

Return type:

int

_OptionsCallbacks__null_route_request_before_start(queue, queue_item)[source]

A null route for the ‘request before start’ callback.

Parameters:
  • queue (nyawc.Queue) – The current crawling queue.
  • queue_item (nyawc.QueueItem) – The queue item that is about to be started.
Returns:

A crawler action (either DO_SKIP_TO_NEXT, DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING).

Return type:

int

_OptionsCallbacks__null_route_request_in_thread_after_finish(queue_item)[source]

A null route for the ‘request in thread after finish’ callback.

Parameters:queue_item (nyawc.QueueItem) – The queue item that was finished.

Note

This method gets called in the crawling thread and is therefore not thread safe.

_OptionsCallbacks__null_route_request_in_thread_before_start(queue_item)[source]

A null route for the ‘request in thread before start’ callback.

Parameters:queue_item (nyawc.QueueItem) – The queue item that is about to be started.

Note

This method gets called in the crawling thread and is therefore not thread safe.

_OptionsCallbacks__null_route_request_on_error(queue_item, message)[source]

A null route for the ‘request on error’ callback.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item that was finished.
  • message (str) – The error message.
__init__()[source]

Constructs an OptionsCallbacks instance.
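
A sketch of wiring custom callbacks, using the same signatures and return values as the null routes above (the print statements are purely illustrative):

    from nyawc.Options import Options
    from nyawc.CrawlerActions import CrawlerActions

    def cb_crawler_after_finish(queue):
        print("Crawl finished; %d requests in total." % queue.count_total)

    def cb_request_after_finish(queue, queue_item, new_queue_items):
        print("Progress: %.1f%%" % queue.get_progress())
        return CrawlerActions.DO_CONTINUE_CRAWLING

    def cb_form_before_autofill(queue_item, elements, form_data):
        return CrawlerActions.DO_AUTOFILL_FORM

    options = Options()
    options.callbacks.crawler_after_finish = cb_crawler_after_finish
    options.callbacks.request_after_finish = cb_request_after_finish
    options.callbacks.form_before_autofill = cb_form_before_autofill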

class nyawc.Options.OptionsIdentity[source]

Bases: object

The OptionsIdentity class contains the identity/footprint options.

auth[source]

The (requests module) authentication class to use when making a request. For more information check http://docs.python-requests.org/en/master/user/authentication/.

Type:obj
cookies[source]

The (requests module) cookie jar to use when making a request. For more information check http://docs.python-requests.org/en/master/user/quickstart/#cookies.

Type:obj
headers[source]

The headers {key: value} to use when making a request.

Type:obj
proxies[source]

The proxies {key: value} to use when making a request. For more information check http://docs.python-requests.org/en/master/user/advanced/#proxies.

Type:obj
__init__()[source]

Constructs an OptionsIdentity instance.
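
A sketch of configuring the identity options with objects from the requests library (see the links above); all values are illustrative:

    import requests
    from requests.auth import HTTPBasicAuth
    from nyawc.Options import Options

    options = Options()

    options.identity.auth = HTTPBasicAuth("user", "passphrase")

    cookies = requests.cookies.RequestsCookieJar()
    cookies.set("session", "0123456789", domain="example.com")
    options.identity.cookies = cookies

    # Plain {key: value} dicts, as documented above. Note that assigning
    # headers directly replaces any defaults the crawler would otherwise use.
    options.identity.headers = {"User-Agent": "MyCrawler/1.0"}
    options.identity.proxies = {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    }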

class nyawc.Options.OptionsMisc[source]

Bases: object

The OptionsMisc class contains all kind of misc options.

debug[source]

If debug is enabled, extra information will be logged to the console. Default is False.

Type:bool
verify_ssl_certificates[source]

If verification is enabled, all SSL certificates will be checked for validity. Default is True.

Type:bool
trusted_certificates[source]

You can pass the path to a CA_BUNDLE file or directory with certificates of trusted CAs. Default is None.

Type:str
__init__()[source]

Constructs an OptionsMisc instance.
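
A short configuration sketch for the misc options (the certificate bundle path is hypothetical):

    from nyawc.Options import Options

    options = Options()
    options.misc.debug = True                    # log extra information to the console
    options.misc.verify_ssl_certificates = True  # keep SSL verification enabled
    options.misc.trusted_certificates = "/etc/ssl/certs/ca-bundle.crt"  # hypothetical CA_BUNDLE path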

class nyawc.Options.OptionsPerformance[source]

Bases: object

The OptionsPerformance class contains the performance options.

max_threads[source]

the maximum number of simultaneous threads to use for crawling.

Type:obj
request_timeout[source]

the request timeout in seconds (throws an exception if exceeded).

Type:int
__init__()[source]

Constructs an OptionsPerformance instance.
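
A short configuration sketch for the performance options:

    from nyawc.Options import Options

    options = Options()
    options.performance.max_threads = 20      # up to 20 simultaneous crawling threads
    options.performance.request_timeout = 30  # give up on a request after 30 seconds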

class nyawc.Options.OptionsRouting[source]

Bases: object

The OptionsRouting class can contain routes that prevent the crawler from crawling similar pages multiple times.

minimum_threshold[source]

The minimum number of requests to crawl (matching a certain route) before ignoring the rest. Default is 20.

Type:int
routes[source]

The regular expressions that represent routes that should not be crawled more often than the minimum threshold. Default is an empty array.

Type:arr

Note

For example, if you have a news site with URLs like /news/3443, /news/2132 and /news/9475, you can add a regular expression that matches this route so that only X requests matching the expression will be crawled (where X is the minimum threshold).

Note

The crawler will only stop crawling requests of a certain route at exactly the minimum threshold if the maximum threads option is set to 1. If the maximum threads option is higher than 1, the effective threshold will be slightly higher, depending on the number of threads used.

__init__()[source]

Constructs an OptionsRouting instance.
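
A sketch matching the news-site example from the notes above; the domain and regular expression are illustrative:

    from nyawc.Options import Options

    options = Options()

    # Crawl at most (roughly) 20 requests whose URL matches the /news/<id> route.
    options.routing.minimum_threshold = 20
    options.routing.routes = [
        r"^https?://(www\.)?example\.com/news/[0-9]+$",
    ]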

class nyawc.Options.OptionsScope[source]

Bases: object

The OptionsScope class contains the scope options.

protocol_must_match[source]

only crawl pages with the same protocol as the startpoint (e.g. only https).

Type:bool
subdomain_must_match[source]

only crawl pages with the same subdomain as the startpoint; if the startpoint is not on a subdomain, no subdomains will be crawled.

Type:bool
hostname_must_match[source]

only crawl pages with the same hostname as the startpoint (e.g. only finnwea).

Type:bool
tld_must_match[source]

only crawl pages with the same tld as the startpoint (e.g. only .com).

Type:bool
max_depth[source]

the maximum search depth. For example, 2 would be the startpoint and all the pages found on it. Default is None (unlimited).

Type:obj
request_methods[source]

only crawl these request methods. If empty or None, all request methods will be crawled. Default is all.

Type:list(str)
__init__()[source]

Constructs an OptionsScope instance.
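
A configuration sketch for the scope options. The representation of request_methods entries (plain method name strings) is an assumption, since it is not shown in this reference:

    from nyawc.Options import Options

    options = Options()
    options.scope.protocol_must_match = False        # allow both http and https
    options.scope.subdomain_must_match = True
    options.scope.hostname_must_match = True
    options.scope.tld_must_match = True
    options.scope.max_depth = 3                      # None means unlimited depth
    options.scope.request_methods = ["get", "post"]  # assumed representation; empty/None crawls all methods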

nyawc.Queue module

class nyawc.Queue.Queue(options)[source]

Bases: object

A ‘hash’ queue containing all the requests of the crawler.

Note

This queue uses a certain hash to prevent duplicate entries and improve the time complexity by checking if the hash exists instead of iterating over all items.

__options[source]

The options to use (used when generating queue item hashes).

Type:nyawc.Options
count_total[source]

The total count of requests in the queue.

Type:int
items_queued[source]

The queued items (yet to be executed).

Type:list(nyawc.QueueItem)
items_in_progress[source]

The items currently being executed.

Type:list(nyawc.QueueItem)
items_finished[source]

The finished items.

Type:list(nyawc.QueueItem)
items_cancelled[source]

Items that were cancelled.

Type:list(nyawc.QueueItem)
items_errored[source]

Items that generated an error.

Type:list(nyawc.QueueItem)
_Queue__get_var(name)[source]

Get an instance/class var by name.

Parameters:name (str) – The name of the variable.
Returns:Its value.
Return type:obj
_Queue__set_var(name, value)[source]

Set an instance/class var by name.

Parameters:
  • name (str) – The name of the variable.
  • value (obj) – Its new value.
__init__(options)[source]

Constructs a Queue instance.

Parameters:options (nyawc.Options) – The options to use.
add(queue_item)[source]

Add a request/response pair to the queue.

Parameters:queue_item (nyawc.QueueItem) – The queue item to add.
add_request(request)[source]

Add a request to the queue.

Parameters:request (nyawc.http.Request) – The request to add.
Returns:The created queue item.
Return type:nyawc.QueueItem
get_all(status)[source]

Get all the items in the queue that have the given status.

Parameters:status (str) – return the items with this status.
Returns:All the queue items with the given status.
Return type:list(nyawc.QueueItem)
get_first(status)[source]

Get the first item in the queue that has the given status.

Parameters:status (str) – return the first item with this status.
Returns:The first queue item with the given status.
Return type:nyawc.QueueItem
get_progress()[source]

Get the progress of the queue as a percentage (float).

Returns:The ‘finished’ progress as a percentage.
Return type:float
has_request(request)[source]

Check if the given request already exists in the queue.

Parameters:request (nyawc.http.Request) – The request to check.
Returns:True if already exists, False otherwise.
Return type:bool
move(queue_item, status)[source]

Move a request/response pair to another status.

Parameters:
  • queue_item (nyawc.QueueItem) – The queue item to move.
  • status (str) – The new status of the queue item.
move_bulk(from_statuses, to_status)[source]

Move a bulk of request/response pairs to another status.

Parameters:
  • from_statuses (list) – The statuses to move from.
  • to_status (str) – The status to move to.
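
A sketch of driving the queue directly with the methods above. Normally the Crawler manages the queue for you; the Request constructor signature is an assumption:

    from nyawc.Options import Options
    from nyawc.Queue import Queue
    from nyawc.QueueItem import QueueItem
    from nyawc.http.Request import Request

    options = Options()
    queue = Queue(options)

    # Adding a request wraps it in a QueueItem with the 'queued' status.
    item = queue.add_request(Request("https://example.com/"))

    # The hash-based duplicate check recognises an identical request.
    print(queue.has_request(Request("https://example.com/")))  # expected: True

    queue.move(item, QueueItem.STATUS_FINISHED)
    print("%.1f%% finished" % queue.get_progress())
    print(queue.get_all(QueueItem.STATUS_FINISHED))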

nyawc.QueueItem module

class nyawc.QueueItem.QueueItem(request, response)[source]

Bases: object

The QueueItem class keeps track of the request and response and the crawling status.

STATUS_QUEUED[source]

Status for when the crawler did not yet start the request.

Type:str
STATUS_IN_PROGRESS[source]

Status for when the crawler is currently crawling the request.

Type:str
STATUS_FINISHED[source]

Status for when the crawler has finished crawling the request.

Type:str
STATUS_CANCELLED[source]

Status for when the crawler has cancelled the request.

Type:str
STATUS_ERRORED[source]

Status for when the crawler could not execute the request.

Type:str
STATUSES[source]

All statuses.

Type:arr
status[source]

The current crawling status.

Type:str
decomposed[source]

If this queue item is decomposed.

Type:bool
request[source]

The Request object.

Type:nyawc.http.Request
response[source]

The Response object.

Type:nyawc.http.Response
__response_soup[source]

The BeautifulSoup container for the response text.

Type:obj
__index_hash[source]

The index hash of this queue item (if cached), otherwise None.

Type:str

Note

A queue item will be decomposed (cached objects are deleted to free up memory) when it is not likely to be used again. After decomposition, variables will no longer be cached.

STATUSES = ['queued', 'in_progress', 'finished', 'cancelled', 'errored'][source]
STATUS_CANCELLED = 'cancelled'[source]
STATUS_ERRORED = 'errored'[source]
STATUS_FINISHED = 'finished'[source]
STATUS_IN_PROGRESS = 'in_progress'[source]
STATUS_QUEUED = 'queued'[source]
__init__(request, response)[source]

Constructs a QueueItem instance.

Parameters:
  • request (nyawc.http.Request) – The Request object.
  • response (nyawc.http.Response) – The Response object.
decompose()[source]

Decompose this queue item (set cached variables to None) to free up memory.

Note

When cached variables are set to None, memory will be released after the garbage collector has run.

get_hash()[source]

Generate and return the dict index hash of the given queue item.

Note

Cookies should not be included in the hash calculation because otherwise requests are crawled multiple times with e.g. different session keys, causing infinite crawling recursion.

Note

At this moment the keys do not actually get hashed since it works perfectly fine without hashing, and hashing the keys would require us to build hash collision management.

Returns:The hash of the given queue item.
Return type:str
get_soup_response()[source]

Get the response as a cached BeautifulSoup container.

Returns:The BeautifulSoup container.
Return type:obj
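
A sketch of the most relevant QueueItem members. The item is obtained via Queue.add_request to avoid assuming the Response constructor signature:

    from nyawc.Options import Options
    from nyawc.Queue import Queue
    from nyawc.http.Request import Request

    queue = Queue(Options())
    item = queue.add_request(Request("https://example.com/"))

    print(item.status)      # expected: 'queued'
    print(item.get_hash())  # the dict index hash used by the queue

    # Once a response is available, the parsed HTML can be read as a cached
    # BeautifulSoup container:
    # soup = item.get_soup_response()

    # Free the cached objects when the item is no longer needed.
    item.decompose()
    print(item.decomposed)  # expected: True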

nyawc.Routing module

class nyawc.Routing.Routing(options)[source]

Bases: object

The Routing class counts requests that match certain routes.

__routing_options[source]

The options containing routing information.

Type:nyawc.OptionsRouting
__routing_count[source]

The {key: value} dict that contains the amount of requests for certain routes.

Type:obj
__init__(options)[source]

Constructs a Routing instance.

Parameters:options (nyawc.Options) – The options to use for the current crawling runtime.
increase_route_count(crawled_request)[source]

Increase the count that determines how many times a URL of a certain route has been crawled.

Parameters:crawled_request (nyawc.http.Request) – The request that possibly matches a route.
is_treshold_reached(scraped_request)[source]

Check if requests similar to the given request have already been crawled X times, where X is the minimum threshold from the options.

Parameters:scraped_request (nyawc.http.Request) – The request that possibly reached the minimum threshold.
Returns:True if the threshold is reached, False otherwise.
Return type:bool
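
A sketch of the routing bookkeeping the crawler performs internally; the route regular expression and the request.url attribute are assumptions:

    from nyawc.Options import Options
    from nyawc.Routing import Routing
    from nyawc.http.Request import Request

    options = Options()
    options.routing.minimum_threshold = 2
    options.routing.routes = [r"^https?://example\.com/news/[0-9]+$"]

    routing = Routing(options)

    for index in range(3):
        request = Request("https://example.com/news/%d" % index)
        if routing.is_treshold_reached(request):
            # The third similar request exceeds the minimum threshold of 2.
            print("Skipping %s" % request.url)  # request.url is assumed
            continue
        routing.increase_route_count(request)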