nyawc.Crawler.Crawler(options)
Bases: object
The main Crawler class which handles the crawling recursion, queue and processes.
queue
nyawc.Queue – The request/response pair queue containing everything to crawl.
routing
nyawc.Routing – A class that identifies requests based on routes from the options.
__options
nyawc.Options – The options to use for the current crawling runtime.
__threads
obj – All currently running threads, as queue item hash => nyawc.CrawlerThread.
_Crawler__add_scraped_requests_to_queue(queue_item, scraped_requests)
Convert the scraped requests to queue items, add them to the queue and return them.
Parameters: |
|
---|---|
Returns: | The new queue items. |
Return type: | list( |
_Crawler__crawler_finish()
Called when the crawler is finished because there are no queued requests left or it was stopped.
_Crawler__crawler_start()
Spawn the first X queued requests, where X is the max threads option.
Note
The main thread sleeps until the crawler is finished. This makes it possible to quit the application with SIGINT (see http://stackoverflow.com/a/11816038/2491049).
Note
__crawler_stop() and __spawn_new_requests() are called here on the main thread to prevent thread recursion and deadlocks.
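The first note can be illustrated with a minimal, standalone sketch. It does not use nyawc itself: `run_until_finished` and `poll_interval` are hypothetical names, and the sketch relies on Python's default behaviour of raising KeyboardInterrupt in the main thread on SIGINT rather than on a custom signal handler.

```python
import threading
import time


def run_until_finished(work, poll_interval=0.01):
    """Run `work` in a background thread while the main thread sleeps.

    Returns True if the work finished by itself and False if it was
    interrupted (e.g. CTRL+C). Hypothetical helper, not part of nyawc.
    """
    finished = threading.Event()

    def worker():
        try:
            work()
        finally:
            finished.set()

    threading.Thread(target=worker, daemon=True).start()

    try:
        # Sleeping in short steps keeps the main thread responsive to SIGINT,
        # which a blocking join() without a timeout would not guarantee.
        while not finished.is_set():
            time.sleep(poll_interval)
    except KeyboardInterrupt:
        return False
    return True
```

Calling `run_until_finished(lambda: time.sleep(0.05))` returns True once the background work is done.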
_Crawler__request_finish(queue_item, new_requests, request_failed=False)
Called when the crawler finished the given queue item.
Parameters: |
|
---|
_Crawler__request_start(queue_item)
Execute the request in the given queue item.
Parameters: | queue_item (nyawc.QueueItem) – The request/response pair to scrape. |
---|
_Crawler__signal_handler(signum, frame)
On SIGINT (e.g. CTRL+C), stop the crawler.
Parameters: |
|
---|
_Crawler__spawn_new_request()
Spawn the first queued request if there is one available.
Returns: | True if a new request was spawned, False otherwise. |
---|---|
Return type: | bool |
_Crawler__spawn_new_requests()
Spawn new requests until the max threads option value is reached.
Note
If no new requests were spawned and there are no requests in progress, the crawler will stop crawling.
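The spawning rule and its stop condition can be sketched without any threading. This is an illustration under stated assumptions, not the library's code: `spawn_new_requests` here is a free function, and `queued`, `in_progress` and `max_threads` are hypothetical stand-ins for the queue state and the max threads option.

```python
from collections import deque


def spawn_new_requests(queued, in_progress, max_threads):
    """Move items from `queued` to `in_progress` until `max_threads` is reached.

    Returns a tuple (spawned, should_stop). `should_stop` is True when
    nothing was spawned and nothing is in progress anymore.
    """
    spawned = 0
    while queued and len(in_progress) < max_threads:
        in_progress.append(queued.popleft())
        spawned += 1

    should_stop = spawned == 0 and not in_progress
    return spawned, should_stop


queued = deque(["a", "b", "c"])
in_progress = []
first = spawn_new_requests(queued, in_progress, max_threads=2)
```

With three queued items and a limit of two, the first call spawns two items and leaves one queued; a call with an empty queue and nothing in progress reports that crawling should stop.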
__init__(options)
Constructs a Crawler instance.
Parameters: | options (nyawc.Options) – The options to use for the current crawling runtime. |
---|
start_with(request)
Start the crawler using the given request.
Parameters: | request (nyawc.http.Request) – The startpoint for the crawler. |
---|
nyawc.CrawlerThread.CrawlerThread(callback, callback_lock, options, queue_item)
Bases: threading.Thread
The crawler thread executes the HTTP request using the HTTP handler.
__options
nyawc.Options – The settings/options object.
__queue_item
nyawc.QueueItem – The queue item containing a request to execute.
__init__(callback, callback_lock, options, queue_item)
Constructs a crawler thread instance.
Parameters: |
|
---|
nyawc.Options.Options
Bases: object
The Options class contains all the crawling options.
scope
nyawc.Options.OptionsScope – Can be used to define the crawling scope.
callbacks
nyawc.Options.OptionsCallbacks – Can be used to define crawling callbacks.
performance
nyawc.Options.OptionsPerformance – Can be used to define performance options.
identity
nyawc.Options.OptionsIdentity – Can be used to define the identity/footprint options.
routing
nyawc.Options.OptionsRouting – Can be used to define routes to ignore similar requests.
misc
nyawc.Options.OptionsMisc – Can be used to define the other options.
nyawc.Options.OptionsCallbacks
Bases: object
The OptionsCallbacks class contains all the callback methods.
crawler_before_start
obj – Called before the crawler starts crawling. Default is a null route to __null_route_crawler_before_start.
crawler_after_finish
obj – Called after the crawler finished crawling. Default is a null route to __null_route_crawler_after_finish.
request_before_start
obj – Called before the crawler starts a new request. Default is a null route to __null_route_request_before_start.
request_after_finish
obj – Called after the crawler finishes a request. Default is a null route to __null_route_request_after_finish.
request_in_thread_before_start
obj – Called in the crawling thread (when it started). Default is a null route to __null_route_request_in_thread_before_start.
request_in_thread_after_finish
obj – Called in the crawling thread (when it finished). Default is a null route to __null_route_request_in_thread_after_finish.
request_on_error
obj – Called if a request failed. Default is a null route to __null_route_request_on_error.
form_before_autofill
obj – Called before the crawler starts autofilling a form. Default is a null route to __null_route_form_before_autofill.
form_after_autofill
obj – Called after the crawler finishes autofilling a form. Default is a null route to __null_route_form_after_autofill.
_OptionsCallbacks__null_route_crawler_after_finish(queue)
A null route for the ‘crawler after finish’ callback.
Parameters: | queue (obj) – The current crawling queue. |
---|
_OptionsCallbacks__null_route_crawler_before_start()
A null route for the ‘crawler before start’ callback.
_OptionsCallbacks__null_route_form_after_autofill(queue_item, elements, form_data)
A null route for the ‘form after autofill’ callback.
Parameters: |
|
---|
_OptionsCallbacks__null_route_form_before_autofill(queue_item, elements, form_data)
A null route for the ‘form before autofill’ callback.
Parameters: |
|
---|---|
Returns: | A crawler action (either DO_AUTOFILL_FORM or DO_NOT_AUTOFILL_FORM). |
Return type: | str |
_OptionsCallbacks__null_route_request_after_finish(queue, queue_item, new_queue_items)
A null route for the ‘request after finish’ callback.
Parameters: |
|
---|---|
Returns: | A crawler action (either DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING). |
Return type: | str |
_OptionsCallbacks__null_route_request_before_start(queue, queue_item)
A null route for the ‘request before start’ callback.
Parameters: |
|
---|---|
Returns: | A crawler action (either DO_SKIP_TO_NEXT, DO_STOP_CRAWLING or DO_CONTINUE_CRAWLING). |
Return type: | str |
_OptionsCallbacks__null_route_request_in_thread_after_finish(queue_item)
A null route for the ‘request in thread after finish’ callback.
Parameters: | queue_item (nyawc.QueueItem) – The queue item that was finished. |
---|
Note
This method gets called in the crawling thread and is therefore not thread safe.
_OptionsCallbacks__null_route_request_in_thread_before_start(queue_item)
A null route for the ‘request in thread before start’ callback.
Parameters: | queue_item (nyawc.QueueItem) – The queue item that is about to start. |
---|
Note
This method gets called in the crawling thread and is therefore not thread safe.
_OptionsCallbacks__null_route_request_on_error(queue_item, message)
A null route for the ‘request on error’ callback.
Parameters: |
|
---|
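The null route pattern used by the callbacks above can be sketched in isolation: every callback attribute defaults to a do-nothing function, and user code overrides only the ones it needs. Everything below is a hypothetical stand-in rather than nyawc's code. The string values of the action constants, the `Callbacks` class, the choice of DO_CONTINUE_CRAWLING as the default return value and the dict-shaped queue item are all assumptions made for the example.

```python
# Hypothetical stand-ins for the action constants named above.
# The string values are assumptions, not taken from nyawc.
DO_CONTINUE_CRAWLING = "continue_crawling"
DO_SKIP_TO_NEXT = "skip_to_next"
DO_STOP_CRAWLING = "stop_crawling"


def null_route_request_before_start(queue, queue_item):
    """Default callback: do nothing and let the crawler continue."""
    return DO_CONTINUE_CRAWLING


class Callbacks:
    """Holds callbacks; each attribute defaults to a null route."""

    def __init__(self):
        self.request_before_start = null_route_request_before_start


def skip_images(queue, queue_item):
    """User-defined callback: skip image URLs, crawl everything else."""
    if queue_item["url"].endswith((".png", ".jpg")):
        return DO_SKIP_TO_NEXT
    return DO_CONTINUE_CRAWLING


callbacks = Callbacks()
default_action = callbacks.request_before_start([], {"url": "http://example.com/a.png"})

# Overriding a single attribute replaces only that callback.
callbacks.request_before_start = skip_images
custom_action = callbacks.request_before_start([], {"url": "http://example.com/a.png"})
```

Because every attribute always holds a callable, the calling code never has to check whether a callback was set.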
nyawc.Options.OptionsIdentity
Bases: object
The OptionsIdentity class contains the identity/footprint options.
auth
obj – The (requests module) authentication class to use when making a request. For more information check http://docs.python-requests.org/en/master/user/authentication/.
obj – The (requests module) cookie jar to use when making a request. For more information check http://docs.python-requests.org/en/master/user/quickstart/#cookies.
proxies
obj – The proxies {key: value} to use when making a request. For more information check http://docs.python-requests.org/en/master/user/advanced/#proxies.
nyawc.Options.OptionsMisc
Bases: object
The OptionsMisc class contains miscellaneous options.
debug
bool – If debug is enabled, extra information will be logged to the console. Default is False.
verify_ssl_certificates
bool – If verification is enabled, all SSL certificates will be checked for validity. Default is True.
nyawc.Options.OptionsPerformance
Bases: object
The OptionsPerformance class contains the performance options.
nyawc.Options.OptionsRouting
Bases: object
The OptionsRouting class can contain routes that prevent the crawler from crawling similar pages multiple times.
minimum_threshold
int – The minimum number of requests to crawl (matching a certain route) before ignoring the rest. Default is 20.
routes
arr – The regular expressions that represent routes that should not be crawled more times than the minimum threshold. Default is an empty array.
Note
For example, a news site may have URLs like /news/3443, /news/2132, /news/9475, etc. You can add a regular expression that matches this route so that only X requests matching the regular expression are crawled (where X is the minimum threshold).
Note
The crawler only stops crawling requests of a certain route at exactly the minimum threshold if the maximum threads option is set to 1. If the maximum threads option is higher than 1, the effective threshold ends up slightly higher, depending on the number of threads used.
nyawc.Options.OptionsScope
Bases: object
The OptionsScope class contains the scope options.
protocol_must_match
bool – Only crawl pages with the same protocol as the startpoint (e.g. only https).
subdomain_must_match
bool – Only crawl pages with the same subdomain as the startpoint. If the startpoint is not a subdomain, no subdomains will be crawled.
hostname_must_match
bool – Only crawl pages with the same hostname as the startpoint (e.g. only finnwea).
tld_must_match
bool – Only crawl pages with the same TLD as the startpoint (e.g. only .com).
max_depth
obj – The maximum search depth. For example, 2 would be the startpoint and all the pages found on it. Default is None (unlimited).
request_methods
list(str) – Only crawl these request methods. If empty or None, all request methods will be crawled. Default is all.
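A simplified, standalone sketch of how such scope options could be applied to a candidate URL follows. It is not nyawc's implementation: `in_scope` and `host_must_match` are hypothetical names, the sketch compares the full host name instead of splitting it into subdomain, hostname and TLD, and it assumes the startpoint has depth 1, in line with the max_depth example above.

```python
from urllib.parse import urlsplit


def in_scope(start_url, url, depth, method="GET", *,
             protocol_must_match=False, host_must_match=True,
             max_depth=None, request_methods=None):
    """Return True if `url` may be crawled given the scope settings.

    `depth` is assumed to be 1 for the startpoint, 2 for pages found on it.
    """
    start, target = urlsplit(start_url), urlsplit(url)

    if protocol_must_match and start.scheme != target.scheme:
        return False
    # Simplification: compares the whole host instead of its separate parts.
    if host_must_match and start.hostname != target.hostname:
        return False
    if max_depth is not None and depth > max_depth:
        return False
    # An empty list or None means that all request methods are allowed.
    if request_methods and method.upper() not in request_methods:
        return False
    return True


start = "https://example.com/"
same_site = in_scope(start, "https://example.com/a", depth=2)
other_site = in_scope(start, "https://other.org/a", depth=2)
```

Each option acts as an independent filter, so a URL is crawled only if it passes all enabled checks.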
nyawc.Queue.Queue(options)
Bases: object
A ‘hash’ queue containing all the requests of the crawler.
Note
This queue uses a hash to prevent duplicate entries and to improve time complexity: it checks whether the hash exists instead of iterating over all items.
__options
nyawc.Options – The options to use (used when generating queue item hashes).
items_queued
list(nyawc.QueueItem) – The queued items (yet to be executed).
items_in_progress
list(nyawc.QueueItem) – The items currently being executed.
items_finished
list(nyawc.QueueItem) – The finished items.
items_cancelled
list(nyawc.QueueItem) – Items that were cancelled.
items_errored
list(nyawc.QueueItem) – Items that generated an error.
_Queue__get_var(name)
Get an instance/class var by name.
Parameters: | name (str) – The name of the variable. |
---|---|
Returns: | Its value. |
Return type: | obj |
_Queue__set_var(name, value)
Set an instance/class var by name.
Parameters: |
|
---|
__init__(options)
Constructs a Queue instance.
Parameters: | options (nyawc.Options) – The options to use. |
---|
add(queue_item)
Add a request/response pair to the queue.
Parameters: | queue_item (nyawc.QueueItem) – The queue item to add. |
---|
add_request(request)
Add a request to the queue.
Parameters: | request (nyawc.http.Request) – The request to add. |
---|---|
Returns: | The created queue item. |
Return type: | nyawc.QueueItem |
get_all(status)
Get all the items in the queue that have the given status.
Parameters: | status (str) – Return the items with this status. |
---|---|
Returns: | All the queue items with the given status. |
Return type: | list(nyawc.QueueItem) |
get_first(status)
Get the first item in the queue that has the given status.
Parameters: | status (str) – Return the first item with this status. |
---|---|
Returns: | The first queue item with the given status. |
Return type: | nyawc.QueueItem |
get_progress()
Get the progress of the queue as a percentage (float).
Returns: | The ‘finished’ progress as a percentage. |
---|---|
Return type: | float |
has_request(request)
Check if the given request already exists in the queue.
Parameters: | request (nyawc.http.Request) – The request to check. |
---|---|
Returns: | True if the request already exists, False otherwise. |
Return type: | bool |
move(queue_item, status)
Move a request/response pair to another status.
Parameters: |
|
---|
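The idea of a ‘hash’ queue with one bucket per status can be sketched as follows. This is a simplified illustration, not the nyawc.Queue implementation: `HashQueue` is a hypothetical class, items are stored under caller-supplied string keys, `move` takes the source status explicitly, and the progress calculation assumes that cancelled and errored items count as done.

```python
from collections import OrderedDict

STATUSES = ["queued", "in_progress", "finished", "cancelled", "errored"]


class HashQueue:
    """Keeps one insertion-ordered dict per status, keyed by item hash."""

    def __init__(self):
        self.items = {status: OrderedDict() for status in STATUSES}

    def has(self, key):
        # A dict lookup per status instead of iterating over all items.
        return any(key in bucket for bucket in self.items.values())

    def add(self, key, item, status="queued"):
        if self.has(key):
            return False  # duplicate, not added
        self.items[status][key] = item
        return True

    def move(self, key, from_status, to_status):
        self.items[to_status][key] = self.items[from_status].pop(key)

    def get_first(self, status):
        return next(iter(self.items[status].values()), None)

    def get_progress(self):
        total = sum(len(bucket) for bucket in self.items.values())
        if total == 0:
            return 0.0
        # Assumption: everything that is no longer pending counts as done.
        pending = len(self.items["queued"]) + len(self.items["in_progress"])
        return 100.0 * (total - pending) / total


queue = HashQueue()
queue.add("GET /a", "A")
queue.add("GET /b", "B")
duplicate_added = queue.add("GET /a", "A again")
queue.move("GET /a", "queued", "finished")
```

After these calls the duplicate is rejected, one of the two items is finished, and `get_progress()` reports 50.0.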
nyawc.QueueItem.QueueItem(request, response)
Bases: object
The QueueItem class keeps track of the request and response and the crawling status.
request
nyawc.http.Request – The Request object.
response
nyawc.http.Response – The Response object.
Note
A queue item will be decomposed (cached objects are deleted to free up memory) when it is not likely to be used again. After decomposition, variables will not be cached anymore.
STATUSES = ['queued', 'in_progress', 'finished', 'cancelled', 'errored']
STATUS_CANCELLED = 'cancelled'
STATUS_ERRORED = 'errored'
STATUS_FINISHED = 'finished'
STATUS_IN_PROGRESS = 'in_progress'
STATUS_QUEUED = 'queued'
__init__(request, response)
Constructs a QueueItem instance.
Parameters: |
|
---|
decompose()
Decompose this queue item (set cached variables to None) to free up memory.
Note
When cached variables are set to None, memory is released once the garbage collector has run.
get_hash()
Generate and return the dict index hash of the given queue item.
Note
Cookies should not be included in the hash calculation, because otherwise requests would be crawled multiple times with e.g. different session keys, causing infinite crawling recursion.
Note
At this moment the keys do not actually get hashed, since it works without hashing and since hashing the keys would require us to build hash collision management.
Returns: | The hash of the given queue item. |
---|---|
Return type: | str |
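The two notes above suggest a plain string key built from the request, with cookies left out. A minimal sketch under assumptions: `request_key` is a hypothetical function, and the choice of fields (method, URL and body data) and the key layout are illustrative, not the format nyawc actually uses.

```python
import json


def request_key(method, url, data=None, cookies=None):
    """Build a dict index key for a request.

    `cookies` is accepted but deliberately ignored, so that the same request
    with a different session cookie maps to the same key.
    """
    # sort_keys makes the key independent of the order of the data fields.
    payload = json.dumps(data or {}, sort_keys=True)
    # The key is a readable string and is not hashed any further.
    return f"{method.upper()}|{url}|{payload}"


first = request_key("get", "http://example.com/", {"b": 1, "a": 2},
                    cookies={"session": "x"})
second = request_key("GET", "http://example.com/", {"a": 2, "b": 1},
                     cookies={"session": "y"})
```

Both calls produce the same key, so the second request would be recognised as a duplicate even though its session cookie differs.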
nyawc.Routing.Routing(options)
Bases: object
The Routing class counts requests that match certain routes.
__routing_count
obj – The {key: value} dict that contains the number of requests for certain routes.
__init__(options)
Constructs a Routing instance.
Parameters: | options (nyawc.Options) – The options to use for the current crawling runtime. |
---|
increase_route_count(crawled_request)
Increase the count that determines how many times a URL of a certain route has been crawled.
Parameters: | crawled_request (nyawc.http.Request) – The request that possibly matches a route. |
---|
is_treshold_reached(scraped_request)
Check if requests similar to the given request have already been crawled X times, where X is the minimum threshold from the options.
Parameters: | scraped_request (nyawc.http.Request) – The request that possibly reached the minimum threshold. |
---|---|
Returns: | True if the threshold is reached, False otherwise. |
Return type: | bool |
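The counting behaviour described for this class can be sketched with plain URLs and regular expressions. This is an illustration under assumptions, not the nyawc.Routing code: `RouteCounter` is a hypothetical class that takes URL strings instead of request objects, keys its counts by the route pattern, and spells its check `is_threshold_reached`.

```python
import re


class RouteCounter:
    """Counts crawled URLs per route and reports when a route is exhausted."""

    def __init__(self, routes, minimum_threshold=20):
        self.routes = [re.compile(route) for route in routes]
        self.minimum_threshold = minimum_threshold
        self.counts = {}

    def _match(self, url):
        # Return the pattern of the first route that matches, if any.
        for route in self.routes:
            if route.search(url):
                return route.pattern
        return None

    def increase_route_count(self, url):
        key = self._match(url)
        if key is not None:
            self.counts[key] = self.counts.get(key, 0) + 1

    def is_threshold_reached(self, url):
        key = self._match(url)
        return key is not None and self.counts.get(key, 0) >= self.minimum_threshold


counter = RouteCounter([r"/news/\d+$"], minimum_threshold=2)
counter.increase_route_count("http://example.com/news/3443")
counter.increase_route_count("http://example.com/news/2132")
reached = counter.is_threshold_reached("http://example.com/news/9475")
```

With a threshold of 2, the third news URL is reported as over the threshold, while URLs that match no route are never limited.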