nyawc package¶

Submodules¶

nyawc.Crawler module¶

class nyawc.Crawler.Crawler(options)[source]¶

Bases: object

The main Crawler class which handles the crawling recursion, queue and processes.

queue[source]¶

The request/response pair queue containing everything to crawl.

Type:	`nyawc.Queue`

routing[source]¶

A class that identifies requests based on routes from the options.

Type:	`nyawc.Routing`

__options[source]¶

The options to use for the current crawling runtime.

Type:	`nyawc.Options`

__should_spawn_new_requests[source]¶

If the crawler should start spwaning new requests.

Type:	bool

__should_stop[source]¶

If the crawler should stop the crawling process.

Type:	bool

__stopping[source]¶

If the crawler is stopping the crawling process.

Type:	bool

__stopped[source]¶

If the crawler finished stopping the crawler process.

Type:	bool

__threads[source]¶

All currently running threads, as queue item hash => nyawc.CrawlerThread.

Type:	obj

__lock[source]¶

The callback lock to prevent race conditions.

Type:	obj

_Crawler__add_scraped_requests_to_queue(queue_item, scraped_requests)[source]¶

Convert the scraped requests to queue items, return them and also add them to the queue.

Parameters:	queue_item (`nyawc.QueueItem`) – The request/response pair that finished. list (new_requests) – All the requests that were found during this request.
Returns:	The new queue items.
Return type:	list(`nyawc.QueueItem`)

_Crawler__crawler_finish()[source]¶: Called when the crawler is finished because there are no queued requests left or it was stopped.

_Crawler__crawler_start()[source]¶: Spawn the first X queued request, where X is the max threads option.

Note

The main thread will sleep until the crawler is finished. This enables quiting the application using sigints (see http://stackoverflow.com/a/11816038/2491049).

Note

__crawler_stop() and __spawn_new_requests() are called here on the main thread to prevent thread recursion and deadlocks.

_Crawler__crawler_stop()[source]¶: Mark the crawler as stopped.

Note

If __stopped is True, the main thread will be stopped. Every piece of code that gets executed after __stopped is True could cause Thread exceptions and or race conditions.

_Crawler__request_finish(queue_item, new_requests, request_failed=False)[source]¶

Called when the crawler finished the given queue item.

Parameters:	queue_item (`nyawc.QueueItem`) – The request/response pair that finished. list (new_requests) – All the requests that were found during this request. request_failed (bool) – True if the request failed (if needs to be moved to errored).

_Crawler__request_start(queue_item)[source]¶

Execute the request in given queue item.

Parameters:	queue_item (`nyawc.QueueItem`) – The request/response pair to scrape.

_Crawler__signal_handler(signum, frame)[source]¶

On sigint (e.g. CTRL+C) stop the crawler.

Parameters:	signum (int) – The signal number. frame (obj) – The current stack frame.

_Crawler__spawn_new_request()[source]¶

Spawn the first queued request if there is one available.

Returns:	True if a new request was spawned, false otherwise.
Return type:	bool

_Crawler__spawn_new_requests()[source]¶: Spawn new requests until the max threads option value is reached.

Note

If no new requests were spawned and there are no requests in progress the crawler will stop crawling.

_Crawler__wait_for_current_threads()[source]¶: Wait until all the current threads are finished.

__init__(options)[source]¶

Constructs a Crawler instance.

Parameters:	options (`nyawc.Options`) – The options to use for the current crawling runtime.

start_with(request)[source]¶

Start the crawler using the given request.

Parameters:	request (`nyawc.http.Request`) – The startpoint for the crawler.

nyawc.CrawlerActions module¶

class nyawc.CrawlerActions.CrawlerActions[source]¶

Bases: object

The actions that crawler callbacks can return.

DO_CONTINUE_CRAWLING[source]¶

Continue by crawling the request.

Type:	int

DO_SKIP_TO_NEXT[source]¶

Skip the current request and continue with the next one in line.

Type:	int

DO_STOP_CRAWLING[source]¶

Stop crawling and quit ongoing requests.

Type:	int

DO_AUTOFILL_FORM[source]¶

Autofill this form with random values.

Type:	int

DO_NOT_AUTOFILL_FORM[source]¶

Do not autofill this form with random values.

Type:	int

DO_AUTOFILL_FORM = 4[source]

DO_CONTINUE_CRAWLING = 1[source]

DO_NOT_AUTOFILL_FORM = 5[source]

DO_SKIP_TO_NEXT = 2[source]

DO_STOP_CRAWLING = 3[source]

nyawc.CrawlerThread module¶

class nyawc.CrawlerThread.CrawlerThread(callback, callback_lock, options, queue_item)[source]¶

Bases: threading.Thread

The crawler thread executes the HTTP request using the HTTP handler.

__callback[source]¶

The method to call when finished

Type:	obj

__callback_lock[source]¶

The callback lock that prevents race conditions.

Type:	bool

__options[source]¶

The settins/options object.

Type:	`nyawc.Options`

__queue_item[source]¶

The queue item containing a request to execute.

Type:	`nyawc.QueueItem`

__init__(callback, callback_lock, options, queue_item)[source]¶

Constructs a crawler thread instance

Parameters:	callback (obj) – The method to call when finished callback_lock (bool) – The callback lock that prevents race conditions. options (`nyawc.Options`) – The settins/options object. queue_item (`nyawc.QueueItem`) – The queue item containing a request to execute.

run()[source]¶: Executes the HTTP call.

Note

If this and the parent handler raised an error, the queue item status will be set to errored instead of finished. This is to prevent e.g. 404 recursion.

nyawc.Options module¶

class nyawc.Options.Options[source]¶

Bases: object

The Options class contains all the crawling options.

scope[source]¶

Can be used to define the crawling scope.

Type:	`nyawc.Options.OptionsScope`

callbacks[source]¶

Can be used to define crawling callbacks.

Type:	`nyawc.Options.OptionsCallbacks`

performance[source]¶

Can be used to define performance options.

Type:	`nyawc.Options.OptionsPerformance`

identity[source]¶

Can be used to define the identity/footprint options.

Type:	`nyawc.Options.OptionsIdentity`

routing[source]¶

Can be used to define routes to ignore similar requests.

Type:	`nyawc.Options.OptionsRouting`

misc[source]¶

Can be used to define the other options.

Type:	`nyawc.Options.OptionsMisc`

__init__()[source]¶: Constructs an Options instance.

class nyawc.Options.OptionsCallbacks[source]¶

Bases: object

The OptionsCallbacks class contains all the callback methods.

crawler_before_start[source]¶

called before the crawler starts crawling. Default is a null route to __null_route_crawler_before_start.

Type:	obj

crawler_after_finish[source]¶

called after the crawler finished crawling. Default is a null route to __null_route_crawler_after_finish.

Type:	obj

request_before_start[source]¶

called before the crawler starts a new request. Default is a null route to __null_route_request_before_start.

Type:	obj

request_after_finish[source]¶

called after the crawler finishes a request. Default is a null route to __null_route_request_after_finish.

Type:	obj

request_in_thread_before_start[source]¶

called in the crawling thread (when it started). Default is a null route to __null_route_request_in_thread_before_start.

Type:	obj

request_in_thread_after_finish[source]¶

called in the crawling thread (when it finished). Default is a null route to __null_route_request_in_thread_after_finish.

Type:	obj

request_on_error[source]¶

called if a request failed. Default is a null route to __null_route_request_on_error.

Type:	obj

form_before_autofill[source]¶

called before the crawler starts autofilling a form. Default is a null route to __null_route_form_before_autofill.

Type:	obj

form_after_autofill[source]¶

called after the crawler finishes autofilling a form. Default is a null route to __null_route_form_after_autofill.

Type:	obj

_OptionsCallbacks__null_route_crawler_after_finish(queue)[source]¶

A null route for the ‘crawler after finish’ callback.

Parameters:	queue (obj) – The current crawling queue.

_OptionsCallbacks__null_route_crawler_before_start()[source]¶: A null route for the ‘crawler before start’ callback.

_OptionsCallbacks__null_route_form_after_autofill(queue_item, elements, form_data)[source]¶

A null route for the ‘form after autofill’ callback.

Parameters:	queue_item (`nyawc.QueueItem`) – The queue item that was finished. list (elements) – The soup elements found in the form. form_data (obj) – The {key: value} form fields.

_OptionsCallbacks__null_route_form_before_autofill(queue_item, elements, form_data)[source]¶

A null route for the ‘form before autofill’ callback.

Parameters:	queue_item (`nyawc.QueueItem`) – The queue item that was finished. list (elements) – The soup elements found in the form. form_data (obj) – The {key: value} form fields to be autofilled.
Returns:	A crawler action (either DO_AUTOFILL_FORM or DO_NOT_AUTOFILL_FORM).
Return type:	str

_OptionsCallbacks__null_route_request_after_finish(queue, queue_item, new_queue_items)[source]¶

A null route for the ‘request after finish’ callback.

Parameters:	queue (`nyawc.Queue`) – The current crawling queue. queue_item (`nyawc.QueueItem`) – The queue item that was finished. list (new_queue_items) – The new queue items that were found in the one that finished.
Returns:	A crawler action (either DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING).
Return type:	str

_OptionsCallbacks__null_route_request_before_start(queue, queue_item)[source]¶

A null route for the ‘request before start’ callback.

Parameters:	queue (`nyawc.Queue`) – The current crawling queue. queue_item (`nyawc.QueueItem`) – The queue item that’s about to start.
Returns:	A crawler action (either DO_SKIP_TO_NEXT, DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING).
Return type:	str

_OptionsCallbacks__null_route_request_in_thread_after_finish(queue_item)[source]¶

A null route for the ‘request in thread after finish’ callback.

Parameters:	queue_item (`nyawc.QueueItem`) – The queue item that was finished.

Note

This method gets called in the crawling thread and is therefore not thread safe.

_OptionsCallbacks__null_route_request_in_thread_before_start(queue_item)[source]¶

A null route for the ‘request in thread before start’ callback.

Parameters:	queue_item (`nyawc.QueueItem`) – The queue item that was finished.

Note

This method gets called in the crawling thread and is therefore not thread safe.

_OptionsCallbacks__null_route_request_on_error(queue_item, message)[source]¶

A null route for the ‘request on error’ callback.

Parameters:	queue_item (`nyawc.QueueItem`) – The queue item that was finished. str (message) – The error message.

__init__()[source]¶: Constructs an OptionsCallbacks instance.

class nyawc.Options.OptionsIdentity[source]¶

Bases: object

The OptionsIdentity class contains the identity/footprint options.

auth[source]¶

The (requests module) authentication class to use when making a request. For more information check http://docs.python-requests.org/en/master/user/authentication/.

Type:	obj

cookies[source]¶

The (requests module) cookie jar to use when making a request. For more information check http://docs.python-requests.org/en/master/user/quickstart/#cookies.

Type:	obj

headers[source]¶

The headers {key: value} to use when making a request.

Type:	obj

proxies[source]¶

The proxies {key: value} to use when making a request. For more information check http://docs.python-requests.org/en/master/user/advanced/#proxies.

Type:	obj

__init__()[source]¶: Constructs an OptionsIdentity instance.

class nyawc.Options.OptionsMisc[source]¶

Bases: object

The OptionsMisc class contains all kind of misc options.

debug[source]¶

If debug is enabled extra information will be logged to the console. Default is False.

Type:	bool

verify_ssl_certificates[source]¶

If verification is enabled all SSL certificates will be checked for validity. Default is True.

Type:	bool

trusted_certificates[source]¶

You can pass the path to a CA_BUNDLE file or directory with certificates of trusted CAs. Default is None.

Type:	str

__init__()[source]¶: Constructs an OptionsMisc instance.

class nyawc.Options.OptionsPerformance[source]¶

Bases: object

The OptionsPerformance class contains the performance options.

max_threads[source]¶

the maximum amount of simultaneous threads to use for crawling.

Type:	obj

request_timeout[source]¶

the request timeout in seconds (throws an exception if exceeded).

Type:	int

__init__()[source]¶: Constructs an OptionsPerformance instance.

class nyawc.Options.OptionsRouting[source]¶

Bases: object

The OptionsRouting class can contain routes that prevent the crawler from crawling similar pages multiple times.

minimum_threshold[source]¶

The minimum amount of requests to crawl (matching a certain route) before ignoring the rest. Default is 20.

Type:	int

routes[source]¶

The regular expressions that represent routes that should not be cralwed more times than the minimum treshold. Default is an empty array.

Type:	arr

Note

An example would be if you have a news site with URLs like (/news/3443, news/2132, news/9475, etc). You can add a regular expression that matches this route so only X requests that match regular expression will be crawled (where X is the minimum treshold).

Note

The crawler will only stop crawling requests of certain routes at exactly the minimum treshold if the maximum threads option is set to 1. If the maximum threads option is set to a value higher than 1 the threshold will get a bit higher depending on the amount of threads used.

__init__()[source]¶: Constructs an OptionsRouting instance.

class nyawc.Options.OptionsScope[source]¶

Bases: object

The OptionsScope class contains the scope options.

protocol_must_match[source]¶

only crawl pages with the same protocol as the startpoint (e.g. only https).

Type:	bool

subdomain_must_match[source]¶

only crawl pages with the same subdomain as the startpoint, if the startpoint is not a subdomain, no subdomains will be crawled.

Type:	bool

hostname_must_match[source]¶

only crawl pages with the same hostname as the startpoint (e.g. only finnwea).

Type:	bool

tld_must_match[source]¶

only crawl pages with the same tld as the startpoint (e.g. only .com)

Type:	bool

max_depth[source]¶

the maximum search depth. For example, 2 would be the startpoint and all the pages found on it. Default is None (unlimited).

Type:	obj

request_methods list

only crawl these request methods. If empty or None all request methods will be crawled. Default is all.

Type:	str

__init__()[source]¶: Constructs an OptionsScope instance.

nyawc.Queue module¶

class nyawc.Queue.Queue(options)[source]¶

Bases: object

A ‘hash’ queue containing all the requests of the crawler.

Note

This queue uses a certain hash to prevent duplicate entries and improve the time complexity by checking if the hash exists instead of iterating over all items.

__options[source]¶

The options to use (used when generating queue item hashes).

Type:	`nyawc.Options`

count_total[source]¶

The total count of requests in the queue.

Type:	int

items_queued list

The queued items (yet to be executed).

Type:	`nyawc.QueueItem`

items_in_progress list

The items currently being executed.

Type:	`nyawc.QueueItem`

items_finished list

The finished items.

Type:	`nyawc.QueueItem`

items_cancelled list

Items that were cancelled.

Type:	`nyawc.QueueItem`

items_errored list

Items that generated an error.

Type:	`nyawc.QueueItem`

_Queue__get_var(name)[source]¶

Get an instance/class var by name.

Parameters:	name (str) – The name of the variable.
Returns:	I’ts value.
Return type:	obj

_Queue__set_var(name, value)[source]¶

Set an instance/class var by name.

Parameters:	name (str) – The name of the variable. value (obj) – I’ts new value.

__init__(options)[source]¶

Constructs a Queue instance.

Parameters:	options (`nyawc.Options`) – The options to use.

add(queue_item)[source]¶

Add a request/response pair to the queue.

Parameters:	queue_item (`nyawc.QueueItem`) – The queue item to add.

add_request(request)[source]¶

Add a request to the queue.

Parameters:	request (`nyawc.http.Request`) – The request to add.
Returns:	The created queue item.
Return type:	`nyawc.QueueItem`

get_all(status)[source]¶

Get all the items in the queue that have the given status.

Parameters:	status (str) – return the items with this status.
Returns:	All the queue items with the given status.
Return type:	list(`nyawc.QueueItem`)

get_first(status)[source]¶

Get the first item in the queue that has the given status.

Parameters:	status (str) – return the first item with this status.
Returns:	The first queue item with the given status.
Return type:	`nyawc.QueueItem`

get_progress()[source]¶

Get the progress of the queue in percentage (float).

Returns:	The ‘finished’ progress in percentage.
Return type:	float

has_request(request)[source]¶

Check if the given request already exists in the queue.

Parameters:	request (`nyawc.http.Request`) – The request to check.
Returns:	True if already exists, False otherwise.
Return type:	bool

move(queue_item, status)[source]¶

Move a request/response pair to another status.

Parameters:	queue_item (`nyawc.QueueItem`) – The queue item to move status (str) – The new status of the queue item.

move_bulk(from_statuses, to_status)[source]¶

Move a bulk of request/response pairs to another status

Parameters:	list (from_statuses) – The statuses to move from to_status (str) – The status to move to

nyawc.QueueItem module¶

class nyawc.QueueItem.QueueItem(request, response)[source]¶

Bases: object

The QueueItem class keeps track of the request and response and the crawling status.

STATUS_QUEUED[source]¶

Status for when the crawler did not yet start the request.

Type:	str

STATUS_IN_PROGRESS[source]¶

Status for when the crawler is currently crawling the request.

Type:	str

STATUS_FINISHED[source]¶

Status for when the crawler has finished crawling the request.

Type:	str

STATUS_CANCELLED[source]¶

Status for when the crawler has cancelled the request.

Type:	str

STATUS_ERRORED[source]¶

Status for when the crawler could not execute the request.

Type:	str

STATUSES[source]¶

All statuses.

Type:	arr

status[source]¶

The current crawling status.

Type:	str

decomposed[source]¶

If the this queue item is decomposed.

Type:	bool

request[source]¶

The Request object.

Type:	`nyawc.http.Request`

response[source]¶

The Response object.

Type:	`nyawc.http.Response`

__response_soup[source]¶

The BeautifulSoup container for the response text.

Type:	obj

__index_hash[source]¶

The index of the queue (if cached), otherwise None.

Type:	str

Note

A queue item will be decomposed (cached objects are deleted to free up memory) when it is not likeley to be used again. After decompisition variables will not be cached anymore.

STATUSES = ['queued', 'in_progress', 'finished', 'cancelled', 'errored'][source]

STATUS_CANCELLED = 'cancelled'[source]

STATUS_ERRORED = 'errored'[source]

STATUS_FINISHED = 'finished'[source]

STATUS_IN_PROGRESS = 'in_progress'[source]

STATUS_QUEUED = 'queued'[source]

__init__(request, response)[source]¶

Constructs a QueueItem instance.

Parameters:	request (`nyawc.http.Request`) – The Request object. response (`nyawc.http.Response`) – The Response object (empty object when initialized).

decompose()[source]¶: Decompose this queue item (set cached variables to None) to free up memory.

Note

When setting cached variables to None memory will be released after the garbage collector ran.

get_hash()[source]¶

Generate and return the dict index hash of the given queue item.

Note

Cookies should not be included in the hash calculation because otherwise requests are crawled multiple times with e.g. different session keys, causing infinite crawling recursion.

Note

At this moment the keys do not actually get hashed since it works perfectly without and since hashing the keys requires us to built hash collision management.

Returns:	The hash of the given queue item.
Return type:	str

get_soup_response()[source]¶

Get the response as a cached BeautifulSoup container.

Returns:	The BeautifulSoup container.
Return type:	obj

nyawc.Routing module¶

class nyawc.Routing.Routing(options)[source]¶

Bases: object

The Routing class counts requests that match certain routes.

__routing_options[source]¶

The options containing routing information.

Type:	`nyawc.OptionsRouting`

__routing_count[source]¶

The {key: value} dict that contains the amount of requests for certain routes.

Type:	obj

__init__(options)[source]¶

Constructs a Crawler instance.

Parameters:	options (`nyawc.Options`) – The options to use for the current crawling runtime.

increase_route_count(crawled_request)[source]¶

Increase the count that determines how many times a URL of a certain route has been crawled.

Parameters:	crawled_request (`nyawc.http.Request`) – The request that possibly matches a route.

is_treshold_reached(scraped_request)[source]¶

Check if similar requests to the given requests have already been crawled X times. Where X is the minimum treshold amount from the options.

Parameters:	scraped_request (`nyawc.http.Request`) – The request that possibly reached the minimum treshold.
Returns:	True if treshold reached, false otherwise.
Return type:	bool

nyawc package

nyawc package¶

Subpackages¶

Submodules¶

nyawc.Crawler module¶

nyawc.CrawlerActions module¶

nyawc.CrawlerThread module¶

nyawc.Options module¶

nyawc.Queue module¶

nyawc.QueueItem module¶

nyawc.Routing module¶