nyawc package¶
Submodules¶
nyawc.Crawler module¶
-
class
nyawc.Crawler.
Crawler
(options)[source]¶ Bases:
object
The main Crawler class which handles the crawling recursion, queue and processes.
-
queue
[source]¶ nyawc.Queue
– The request/response pair queue containing everything to crawl.
-
__options
[source]¶ nyawc.Options
– The options to use for the current crawling runtime.
-
_Crawler__crawler_finish
()[source]¶ Called when the crawler is finished because there are no queued requests left or it was stopped.
-
_Crawler__crawler_start
()[source]¶ Spawn the first X queued requests, where X is the max threads option.
Note
The main thread will sleep until the crawler is finished. This makes it possible to quit the application using SIGINTs (see http://stackoverflow.com/a/11816038/2491049).
-
_Crawler__request_finish
(queue_item, new_requests, new_queue_item_status=None)[source]¶ Called when the crawler finished the given queued item.
Parameters: - queue_item (nyawc.QueueItem) – The request/response pair that finished.
- new_requests (list) – All the requests that were found during this request.
- new_queue_item_status (str) – The new status of the queue item (if it needs to be moved).
-
_Crawler__request_start
(queue_item)[source]¶ Execute the request in given queue item.
Parameters: queue_item (nyawc.QueueItem) – The request/response pair to scrape.
-
_Crawler__spawn_new_request
()[source]¶ Spawn the first queued request if there is one available.
Returns: True if a new request was spawned, False otherwise. Return type: bool
-
_Crawler__spawn_new_requests
()[source]¶ Spawn new requests until the max processes option value is reached.
Note
If no new requests were spawned and there are no requests in progress the crawler will stop crawling.
-
__init__
(options)[source]¶ Constructs a Crawler instance.
Parameters: options (nyawc.Options) – The options to use for the current crawling runtime.
-
start_with
(request)[source]¶ Start the crawler using the given request.
Parameters: request (nyawc.http.Request) – The startpoint for the crawler.
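The spawning behaviour described above (start queued items until the max threads value is reached, and stop once nothing was spawned and nothing is in progress) can be sketched without nyawc. This is a minimal, single-threaded illustration only; `spawn_new_requests`, `max_threads` and the plain lists are hypothetical stand-ins, not nyawc's implementation.

```python
def spawn_new_requests(queued, in_progress, max_threads):
    """Move items from `queued` to `in_progress` until `max_threads` is reached.

    Returns True if the crawl should continue, False if it should stop.
    (Hypothetical helper for illustration; not part of nyawc.)
    """
    spawned = False
    while queued and len(in_progress) < max_threads:
        in_progress.append(queued.pop(0))
        spawned = True
    # Stop only when nothing was spawned and nothing is still running.
    return spawned or bool(in_progress)


queued = ["/a", "/b", "/c"]
in_progress = []
keep_going = spawn_new_requests(queued, in_progress, max_threads=2)
```

In this sketch the third item stays queued until one of the running items finishes and the function is called again.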
-
nyawc.CrawlerActions module¶
nyawc.CrawlerThread module¶
-
class
nyawc.CrawlerThread.
CrawlerThread
(callback, callback_lock, options, queue_item)[source]¶ Bases:
threading.Thread
The crawler thread executes the HTTP request using the HTTP handler.
-
__options
[source]¶ nyawc.Options
– The settings/options object.
-
__queue_item
[source]¶ nyawc.QueueItem
– The queue item containing a request to execute.
-
__init__
(callback, callback_lock, options, queue_item)[source]¶ Constructs a crawler thread instance
Parameters: - callback (obj) – The method to call when finished.
- callback_lock (bool) – The callback lock that prevents race conditions.
- options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing a request to execute.
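The pattern of a worker thread that reports back through a callback guarded by a lock can be sketched with the standard library alone. The class and names below (`WorkerThread`, `on_finished`) are hypothetical, the lock is assumed to be a `threading.Lock`, and the "request" is a placeholder string operation rather than an HTTP call.

```python
import threading


class WorkerThread(threading.Thread):
    """Hypothetical stand-in for a crawler thread; not nyawc's class."""

    def __init__(self, callback, callback_lock, item):
        super().__init__()
        self._callback = callback
        self._callback_lock = callback_lock
        self._item = item

    def run(self):
        result = self._item.upper()  # placeholder for executing the request
        # Serialize callbacks so shared state is updated by one thread at a time.
        with self._callback_lock:
            self._callback(self._item, result)


finished = []


def on_finished(item, result):
    finished.append((item, result))


lock = threading.Lock()
threads = [WorkerThread(on_finished, lock, item) for item in ("a", "b", "c")]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Holding the lock only around the callback keeps the (simulated) request work concurrent while the bookkeeping stays sequential.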
-
nyawc.Options module¶
-
class
nyawc.Options.
Options
[source]¶ Bases:
object
The Options class contains all the crawling options.
-
scope
[source]¶ nyawc.Options.OptionsScope
– Can be used to define the crawling scope.
-
callbacks
[source]¶ nyawc.Options.OptionsCallbacks
– Can be used to define crawling callbacks.
-
performance
[source]¶ nyawc.Options.OptionsPerformance
– Can be used to define performance options.
-
identity
[source]¶ nyawc.Options.OptionsIdentity
– Can be used to define the identity/footprint options.
-
misc
[source]¶ nyawc.Options.OptionsMisc
– Can be used to define the other options.
-
-
class
nyawc.Options.
OptionsCallbacks
[source]¶ Bases:
object
The OptionsCallbacks class contains all the callback methods.
-
crawler_before_start
[source]¶ obj – called before the crawler starts crawling. Default is a null route to
__null_route_crawler_before_start
.
-
crawler_after_finish
[source]¶ obj – called after the crawler finished crawling. Default is a null route to
__null_route_crawler_after_finish
.
-
request_before_start
[source]¶ obj – called before the crawler starts a new request. Default is a null route to
__null_route_request_before_start
.
-
request_after_finish
[source]¶ obj – called after the crawler finishes a request. Default is a null route to
__null_route_request_after_finish
.
-
request_on_error
[source]¶ obj – called if a request failed. Default is a null route to
__null_route_request_on_error
.
-
form_before_autofill
[source]¶ obj – called before the crawler starts autofilling a form. Default is a null route to
__null_route_form_before_autofill
.
-
form_after_autofill
[source]¶ obj – called after the crawler finishes autofilling a form. Default is a null route to
__null_route_form_after_autofill
.
-
_OptionsCallbacks__null_route_crawler_after_finish
(queue)[source]¶ A null route for the ‘crawler after finish’ callback.
Parameters: queue (obj) – The current crawling queue.
-
_OptionsCallbacks__null_route_crawler_before_start
()[source]¶ A null route for the ‘crawler before start’ callback.
-
_OptionsCallbacks__null_route_form_after_autofill
(queue_item, elements, form_data)[source]¶ A null route for the ‘form after autofill’ callback.
Parameters: - queue_item (nyawc.QueueItem) – The queue item that was finished.
- elements (list) – The soup elements found in the form.
- form_data (obj) – The {key: value} form fields.
-
_OptionsCallbacks__null_route_form_before_autofill
(queue_item, elements, form_data)[source]¶ A null route for the ‘form before autofill’ callback.
Parameters: - queue_item (nyawc.QueueItem) – The queue item that was finished.
- elements (list) – The soup elements found in the form.
- form_data (obj) – The {key: value} form fields to be autofilled.
Returns: A crawler action (either DO_AUTOFILL_FORM or DO_NOT_AUTOFILL_FORM).
Return type: str
-
_OptionsCallbacks__null_route_request_after_finish
(queue, queue_item, new_queue_items)[source]¶ A null route for the ‘request after finish’ callback.
Parameters: - queue (nyawc.Queue) – The current crawling queue.
- queue_item (nyawc.QueueItem) – The queue item that was finished.
- new_queue_items (list) – The new queue items that were found in the one that finished.
Returns: A crawler action (either DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING).
Return type: str
-
_OptionsCallbacks__null_route_request_before_start
(queue, queue_item)[source]¶ A null route for the ‘request before start’ callback.
Parameters: - queue (nyawc.Queue) – The current crawling queue.
- queue_item (nyawc.QueueItem) – The queue item that’s about to start.
Returns: A crawler action (either DO_SKIP_TO_NEXT, DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING).
Return type: str
-
_OptionsCallbacks__null_route_request_on_error
(queue_item, message)[source]¶ A null route for the ‘request on error’ callback.
Parameters: - queue_item (nyawc.QueueItem) – The queue item that was finished.
- message (str) – The error message.
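The "null route" idea above (every callback defaults to a do-nothing function that can be replaced by the user) can be illustrated with a small self-contained sketch. The class name `Callbacks`, the action string values, and the assumption that the default route returns the "continue" action are all hypothetical; consult nyawc.CrawlerActions for the real constants.

```python
# Hypothetical action values for illustration only.
DO_CONTINUE_CRAWLING = "continue"
DO_STOP_CRAWLING = "stop"


class Callbacks:
    """Holds callbacks that default to no-op 'null routes' (illustrative only)."""

    def __init__(self):
        self.request_after_finish = self._null_route_request_after_finish

    @staticmethod
    def _null_route_request_after_finish(queue, queue_item, new_queue_items):
        # Assumed default: do nothing and let the crawl continue.
        return DO_CONTINUE_CRAWLING


def stop_after_first(queue, queue_item, new_queue_items):
    """User-supplied callback with the same signature as the null route."""
    return DO_STOP_CRAWLING


callbacks = Callbacks()
default_action = callbacks.request_after_finish(None, None, [])

callbacks.request_after_finish = stop_after_first
custom_action = callbacks.request_after_finish(None, None, [])
```

Because the default has the same signature as a user callback, the caller never needs to check whether a callback was set.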
-
-
class
nyawc.Options.
OptionsIdentity
[source]¶ Bases:
object
The OptionsIdentity class contains the identity/footprint options.
-
auth
[source]¶ obj – The (requests module) authentication class to use when making a request. For more information check http://docs.python-requests.org/en/master/user/authentication/.
obj – The (requests module) cookie jar to use when making a request. For more information check http://docs.python-requests.org/en/master/user/quickstart/#cookies.
-
proxies
[source]¶ obj – The proxies {key: value} to use when making a request. For more information check http://docs.python-requests.org/en/master/user/advanced/#proxies.
-
-
class
nyawc.Options.
OptionsMisc
[source]¶ Bases:
object
The OptionsMisc class contains all kinds of miscellaneous options.
-
class
nyawc.Options.
OptionsPerformance
[source]¶ Bases:
object
The OptionsPerformance class contains the performance options.
-
class
nyawc.Options.
OptionsScope
[source]¶ Bases:
object
The OptionsScope class contains the scope options.
-
protocol_must_match
[source]¶ bool – only crawl pages with the same protocol as the startpoint (e.g. only https).
-
subdomain_must_match
[source]¶ bool – only crawl pages with the same subdomain as the startpoint. If the startpoint is not a subdomain, no subdomains will be crawled.
-
hostname_must_match
[source]¶ bool – only crawl pages with the same hostname as the startpoint (e.g. only finnwea).
-
tld_must_match
[source]¶ bool – only crawl pages with the same TLD as the startpoint (e.g. only .com).
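To make the four scope flags concrete, the sketch below compares a candidate URL against a startpoint using only `urllib.parse`. It is a hedged illustration: `split_host` and `in_scope` are hypothetical helpers, the flag defaults are arbitrary, and the host is split naively (last label is the TLD, the label before it is the hostname), which is wrong for multi-part suffixes such as `.co.uk`.

```python
from urllib.parse import urlparse


def split_host(url):
    """Naively split a URL's host into (subdomain, hostname, tld)."""
    labels = (urlparse(url).hostname or "").split(".")
    if len(labels) == 1:
        return "", labels[0], ""
    return ".".join(labels[:-2]), labels[-2], labels[-1]


def in_scope(start_url, candidate_url, protocol_must_match=False,
             subdomain_must_match=False, hostname_must_match=False,
             tld_must_match=False):
    """Return True if `candidate_url` satisfies every enabled scope flag."""
    start_sub, start_host, start_tld = split_host(start_url)
    cand_sub, cand_host, cand_tld = split_host(candidate_url)

    if protocol_must_match and urlparse(start_url).scheme != urlparse(candidate_url).scheme:
        return False
    if subdomain_must_match and start_sub != cand_sub:
        return False
    if hostname_must_match and start_host != cand_host:
        return False
    if tld_must_match and start_tld != cand_tld:
        return False
    return True
```

With a startpoint of `https://finnwea.com/`, enabling only `hostname_must_match` would still accept `https://blog.finnwea.com/`, while also enabling `subdomain_must_match` would reject it.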
-
nyawc.Queue module¶
-
class
nyawc.Queue.
Queue
(options)[source]¶ Bases:
object
A ‘hash’ queue containing all the requests of the crawler.
Note
This queue uses a hash (from __get_hash()) as the key for each item. This prevents duplicate entries and improves time complexity, because checking whether a hash exists replaces iterating over all items.
-
__options
[source]¶ nyawc.Options
– The options to use (used when generating queue item hashes).
-
items_queued list
nyawc.QueueItem
– The queued items (yet to be executed).
-
items_in_progress list
nyawc.QueueItem
– The items currently being executed.
-
items_finished list
nyawc.QueueItem
– The finished items.
-
items_cancelled list
nyawc.QueueItem
– Items that were cancelled.
-
items_errored list
nyawc.QueueItem
– Items that generated an error.
-
_Queue__get_hash
(queue_item)[source]¶ Generate and return the dict index hash of the given queue item.
Note
Cookies should not be included in the hash calculation because otherwise requests are crawled multiple times with e.g. different session keys, causing infinite crawling recursion.
Note
At this moment the keys are not actually hashed: the queue works without hashing, and hashing the keys would require building hash collision management.
Parameters: queue_item (nyawc.QueueItem) – The queue item to get the hash from. Returns: The hash of the given queue item. Return type: str
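The two notes above can be illustrated with a keyed queue: a string key is built from the parts of a request that identify it, cookies are deliberately left out, and the key is used directly as a dictionary index without further hashing. Everything below (`get_key`, `KeyedQueue`, the choice of method, URL and body as key fields) is an assumption made for illustration, not nyawc's actual key format.

```python
from collections import OrderedDict


def get_key(method, url, data=None):
    """Build a dedupe key from method, URL and body; cookies are left out on purpose."""
    body = tuple(sorted((data or {}).items()))
    return repr((method.upper(), url, body))


class KeyedQueue:
    """Illustrative queue that rejects requests whose key is already present."""

    def __init__(self):
        self.items = OrderedDict()

    def add(self, method, url, data=None, cookies=None):
        key = get_key(method, url, data)
        if key in self.items:  # dict lookup instead of scanning every item
            return False
        self.items[key] = {"method": method, "url": url, "data": data, "cookies": cookies}
        return True


queue = KeyedQueue()
first = queue.add("get", "https://example.com/", cookies={"session": "abc"})
second = queue.add("GET", "https://example.com/", cookies={"session": "xyz"})
```

The second call is rejected even though its session cookie differs, which is the behaviour that keeps a crawl from revisiting the same page under every new session key.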
-
_Queue__get_var
(name)[source]¶ Get an instance/class var by name.
Parameters: name (str) – The name of the variable. Returns: Its value. Return type: obj
-
_Queue__set_var
(name, value)[source]¶ Set an instance/class var by name.
Parameters: - name (str) – The name of the variable.
- value (obj) – Its new value.
-
__init__
(options)[source]¶ Constructs a Queue instance.
Parameters: options (nyawc.Options) – The options to use.
-
add
(queue_item)[source]¶ Add a request/response pair to the queue.
Parameters: queue_item (nyawc.QueueItem) – The queue item to add.
-
add_request
(request)[source]¶ Add a request to the queue.
Parameters: request (nyawc.http.Request) – The request to add. Returns: The created queue item. Return type: nyawc.QueueItem
-
get_all
(status)[source]¶ Get all the items in the queue that have the given status.
Parameters: status (str) – Return the items with this status. Returns: All the queue items with the given status. Return type: list(nyawc.QueueItem)
-
get_first
(status)[source]¶ Get the first item in the queue that has the given status.
Parameters: status (str) – return the first item with this status. Returns: The first queue item with the given status. Return type: nyawc.QueueItem
-
get_progress
()[source]¶ Get the progress of the queue in percentage (float).
Returns: The ‘finished’ progress in percentage. Return type: float
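One plausible way to compute such a percentage is sketched below. The formula is an assumption, not taken from nyawc's source: every item that is neither queued nor in progress is counted as done, and an empty queue is reported as 0.0.

```python
def get_progress(queued, in_progress, finished, cancelled, errored):
    """Return the share of items that are done, as a percentage (float).

    Hypothetical formula: done = finished + cancelled + errored.
    """
    total = queued + in_progress + finished + cancelled + errored
    if total == 0:
        return 0.0
    done = finished + cancelled + errored
    return done / total * 100


halfway = get_progress(queued=1, in_progress=1, finished=2, cancelled=0, errored=0)
```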
-
has_request
(request)[source]¶ Check if the given request already exists in the queue.
Parameters: request (nyawc.http.Request) – The request to check. Returns: True if already exists, False otherwise. Return type: bool
-
move
(queue_item, status)[source]¶ Move a request/response pair to another status.
Parameters: - queue_item (nyawc.QueueItem) – The queue item to move.
- status (str) – The new status of the queue item.
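Moving an item between statuses can be pictured as moving it between per-status collections. The sketch below uses the five status strings listed for QueueItem.STATUSES, but `StatusBuckets` and its explicit `key` and `current_status` arguments are hypothetical simplifications (nyawc's `move` takes the queue item and the new status only).

```python
STATUSES = ["queued", "in_progress", "finished", "cancelled", "errored"]


class StatusBuckets:
    """Illustrative container with one dict of items per status."""

    def __init__(self):
        self.buckets = {status: {} for status in STATUSES}

    def add(self, key, item, status="queued"):
        self.buckets[status][key] = item

    def move(self, key, current_status, new_status):
        if new_status not in STATUSES:
            raise ValueError("unknown status: %r" % (new_status,))
        # Remove the item from its current bucket and place it in the new one.
        self.buckets[new_status][key] = self.buckets[current_status].pop(key)


buckets = StatusBuckets()
buckets.add("GET https://example.com/", "item-1")
buckets.move("GET https://example.com/", "queued", "in_progress")
buckets.move("GET https://example.com/", "in_progress", "finished")
```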
-
nyawc.QueueItem module¶
-
class
nyawc.QueueItem.
QueueItem
(request, response)[source]¶ Bases:
object
The QueueItem class keeps track of the request and response and the crawling status.
-
request
[source]¶ nyawc.http.Request
– The Request object.
-
response
[source]¶ nyawc.http.Response
– The Response object.
-
STATUSES
= ['queued', 'in_progress', 'finished', 'cancelled', 'errored'][source]
-
STATUS_CANCELLED
= 'cancelled'[source]
-
STATUS_ERRORED
= 'errored'[source]
-
STATUS_FINISHED
= 'finished'[source]
-
STATUS_IN_PROGRESS
= 'in_progress'[source]
-
STATUS_QUEUED
= 'queued'[source]
-
__init__
(request, response)[source]¶ Constructs a QueueItem instance.
Parameters: - request (nyawc.http.Request) – The Request object.
- response (nyawc.http.Response) – The Response object (empty object when initialized).
-