nyawc.scrapers package
Submodules
nyawc.scrapers.CSSRegexLinkScraper module
-
class
nyawc.scrapers.CSSRegexLinkScraper.
CSSRegexLinkScraper
(options, queue_item)[source]¶ Bases:
object
The CSSRegexLinkScraper finds absolute and relative URLs in Cascading Style Sheets.
-
content_types list
str – The supported content types.
-
__expressions list
obj – The regular expressions to execute.
-
__options
[source]¶ nyawc.Options
– The settings/options object.
-
__queue_item
[source]¶ nyawc.QueueItem
– The queue item containing the response to scrape.
-
__init__
(options, queue_item)[source]¶ Construct the CSSRegexLinkScraper instance.
Parameters: - options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.
-
get_requests
()[source]¶ Get all the new requests that were found in the response.
Returns: A list of new requests.
Return type: list(nyawc.http.Request)
-
get_requests_from_content
(host, content)[source]¶ Find new requests from the given content.
Parameters: - host (str) – The parent request URL.
- content (obj) – The CSS content.
Returns: A list of new requests that were found.
Return type: list(nyawc.http.Request)
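As a rough illustration of regex-based scraping, the sketch below pulls url(...) references out of CSS text and resolves them against the parent URL. The pattern and the name find_css_urls are assumptions made for this example; the expressions actually used by CSSRegexLinkScraper are not listed on this page, and the real method returns nyawc.http.Request objects rather than plain strings.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: matches url(...) with optional single or double quotes.
CSS_URL_PATTERN = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""", re.IGNORECASE)


def find_css_urls(host, content):
    """Return absolute URLs for every url(...) reference in the CSS content."""
    found = []
    for match in CSS_URL_PATTERN.finditer(content):
        # Resolve relative references against the URL of the parent request.
        absolute = urljoin(host, match.group(1))
        if absolute not in found:
            found.append(absolute)
    return found
```

Resolving against the parent URL is what lets a single pattern cover both absolute and relative references.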
-
nyawc.scrapers.HTMLSoupFormScraper module
-
class
nyawc.scrapers.HTMLSoupFormScraper.
HTMLSoupFormScraper
(options, queue_item)[source]¶ Bases:
object
The HTMLSoupFormScraper finds requests from forms in HTML using BeautifulSoup.
-
content_types list
str – The supported content types.
-
__options
[source]¶ nyawc.Options
– The settings/options object.
-
__queue_item
[source]¶ nyawc.QueueItem
– The queue item containing the response to scrape.
-
_HTMLSoupFormScraper__autofill_form_data
(form_data, elements)[source]¶ Autofill empty form data with random data.
Parameters: - form_data (obj) – The {key: value} form data.
- elements (list) – Soup elements.
Returns: The autofilled {key: value} form data.
Return type: obj
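The autofill step can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the real method receives soup elements, whereas this version takes a hypothetical input_types mapping of field name to input type, and the choice of random values per type is invented for illustration.

```python
import random
import string


def autofill_form_data(form_data, input_types):
    """Return a copy of form_data with empty values replaced by random data.

    input_types is a hypothetical {name: type} mapping standing in for the
    soup elements that the real method inspects.
    """
    filled = dict(form_data)
    for name, value in form_data.items():
        if value:
            continue  # Keep values that are already set.
        kind = input_types.get(name, "text")
        token = "".join(random.choice(string.ascii_lowercase) for _ in range(8))
        if kind == "email":
            filled[name] = token + "@example.com"
        elif kind == "number":
            filled[name] = str(random.randint(0, 9999))
        else:
            filled[name] = token
    return filled
```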
-
_HTMLSoupFormScraper__get_default_form_data_input
(elements)[source]¶ Get the default form data {key: value} for the given elements.
Parameters: elements (list) – Soup elements. Returns: The {key: value} form data. Return type: obj
-
_HTMLSoupFormScraper__get_default_value_from_element
(element)[source]¶ Get the default value of a form element.
Parameters: element (obj) – The soup element. Returns: The default value. Return type: str
-
_HTMLSoupFormScraper__get_form_data
(soup)[source]¶ Build a form data dict from the given form.
Parameters: soup (obj) – The BeautifulSoup form. Returns: The form data (key/value). Return type: obj
-
_HTMLSoupFormScraper__get_request
(host, soup)[source]¶ Build a request from the given soup form.
Parameters: - host (str) – The URL of the current queue item.
- soup (obj) – The BeautifulSoup form.
Returns: The new Request.
Return type: nyawc.http.Request
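Building a request from a form mostly comes down to resolving the form's action against the current URL and normalising its method. The sketch below shows that idea with plain values; build_form_request and the returned dict are assumptions for this example and stand in for the nyawc.http.Request that the real method creates.

```python
from urllib.parse import urljoin


def build_form_request(host, action, method, form_data):
    """Describe the request a form would submit (illustrative only)."""
    # A missing or empty action means the form submits to the current URL.
    url = urljoin(host, action or "")
    # HTML forms default to GET when no method attribute is given.
    method = (method or "get").lower()
    return {"method": method, "url": url, "data": dict(form_data)}
```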
-
_HTMLSoupFormScraper__get_valid_form_data_elements
(soup)[source]¶ Get all valid form input elements.
Note
An element is valid when the value can be updated client-side and the element has a name attribute.
Parameters: soup (obj) – The BeautifulSoup form. Returns: Soup elements. Return type: list(obj)
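To make the name-attribute rule concrete, the sketch below collects named form fields and their default values. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and it treats input, textarea and select as the updatable elements, which is an assumption; the page does not spell out the exact criteria the scraper applies.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect (name, default value) pairs for named form fields."""

    # Assumed set of elements whose value can be changed client-side.
    FIELD_TAGS = {"input", "textarea", "select"}

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.FIELD_TAGS:
            return
        attributes = dict(attrs)
        # Elements without a name attribute are never submitted, so skip them.
        if not attributes.get("name"):
            return
        self.fields.append((attributes["name"], attributes.get("value") or ""))


def collect_named_fields(html):
    """Return the {name: default value} form data found in the HTML."""
    collector = FormFieldCollector()
    collector.feed(html)
    return dict(collector.fields)
```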
-
_HTMLSoupFormScraper__trim_grave_accent
(href)[source]¶ Trim grave accents manually (because BeautifulSoup doesn’t support it).
Parameters: href (str) – The BeautifulSoup href value. Returns: The BeautifulSoup href value without grave accents. Return type: str
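Some markup wraps attribute values in grave accents (backticks) instead of quotes, in which case the accents end up inside the parsed value. A minimal sketch of the trimming step, assuming it only needs to remove accents at both ends of the value (the function name is chosen for this example):

```python
def trim_grave_accent(href):
    """Remove grave accents wrapped around an attribute value."""
    # str.strip with an argument removes only the listed characters,
    # and only from the two ends of the string.
    return href.strip("`")
```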
-
__init__
(options, queue_item)[source]¶ Construct the HTMLSoupFormScraper instance.
Parameters: - options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.
-
get_requests
()[source]¶ Get all the new requests that were found in the response.
Returns: A list of new requests.
Return type: list(nyawc.http.Request)
-
get_requests_from_content
(host, content)[source]¶ Find new requests from the given content.
Parameters: - host (str) – The parent request URL.
- content (obj) – The HTML content.
Returns: A list of new requests that were found.
Return type: list(nyawc.http.Request)
-
nyawc.scrapers.HTMLSoupLinkScraper module
-
class
nyawc.scrapers.HTMLSoupLinkScraper.
HTMLSoupLinkScraper
(options, queue_item)[source]¶ Bases:
object
The HTMLSoupLinkScraper finds URLs from href attributes in HTML using BeautifulSoup.
-
content_types list
str – The supported content types.
-
__options
[source]¶ nyawc.Options
– The settings/options object.
-
__queue_item
[source]¶ nyawc.QueueItem
– The queue item containing the response to scrape.
-
_HTMLSoupLinkScraper__trim_grave_accent
(href)[source]¶ Trim grave accents manually (because BeautifulSoup doesn’t support it).
Parameters: href (str) – The BeautifulSoup href value. Returns: The BeautifulSoup href value without grave accents. Return type: str
-
__init__
(options, queue_item)[source]¶ Construct the HTMLSoupLinkScraper instance.
Parameters: - options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.
-
get_requests
()[source]¶ Get all the new requests that were found in the response.
Returns: A list of new requests.
Return type: list(nyawc.http.Request)
-
get_requests_from_content
(host, content)[source]¶ Find new requests from the given content.
Parameters: - host (str) – The parent request URL.
- content (obj) – The HTML content.
Returns: A list of new requests that were found.
Return type: list(nyawc.http.Request)
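The href-based approach can be sketched without BeautifulSoup by using the standard library's html.parser, which is the stand-in used here so the example has no third-party dependencies. HrefCollector and find_html_links are names invented for this sketch, and it returns URL strings where the real method returns nyawc.http.Request objects.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the value of every non-empty href attribute."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def find_html_links(host, content):
    """Return absolute URLs for all href attributes in the HTML content."""
    collector = HrefCollector()
    collector.feed(content)
    # Relative hrefs are resolved against the parent request URL.
    return [urljoin(host, href) for href in collector.hrefs]
```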
-
nyawc.scrapers.JSONRegexLinkScraper module
-
class
nyawc.scrapers.JSONRegexLinkScraper.
JSONRegexLinkScraper
(options, queue_item)[source]¶ Bases:
object
The JSONRegexLinkScraper finds absolute and relative URLs in JSON keys and values.
-
content_types list
str – The supported content types.
-
__expressions list
obj – The regular expressions to execute.
-
__options
[source]¶ nyawc.Options
– The settings/options object.
-
__queue_item
[source]¶ nyawc.QueueItem
– The queue item containing the response to scrape.
-
__init__
(options, queue_item)[source]¶ Construct the JSONRegexLinkScraper instance.
Parameters: - options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.
-
get_requests
()[source]¶ Get all the new requests that were found in the response.
Returns: A list of new requests.
Return type: list(nyawc.http.Request)
-
get_requests_from_content
(host, content)[source]¶ Find new requests from the given content.
Parameters: - host (str) – The parent request URL.
- content (obj) – The JSON content.
Returns: A list of new requests that were found.
Return type: list(nyawc.http.Request)
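Because the scraper works on raw text rather than parsed JSON, keys and values are treated alike: any quoted string that looks like a URL is a candidate. The sketch below shows one way this could work. The pattern is an assumption (it accepts scheme-prefixed, protocol-relative and root-relative strings, and ignores escaped slashes), not the expression list used by JSONRegexLinkScraper.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: a quoted string that is either "scheme://...",
# "//host/..." or a root-relative path such as "/api/users".
JSON_URL_PATTERN = re.compile(
    r'"((?:[a-zA-Z][a-zA-Z0-9+.-]*:)?//[^"\s]+|/[^"\s]*)"'
)


def find_json_urls(host, content):
    """Return absolute URLs for URL-like strings in JSON keys and values."""
    return [urljoin(host, match.group(1)) for match in JSON_URL_PATTERN.finditer(content)]
```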
-
nyawc.scrapers.XMLRegexLinkScraper module
-
class
nyawc.scrapers.XMLRegexLinkScraper.
XMLRegexLinkScraper
(options, queue_item)[source]¶ Bases:
object
The XMLRegexLinkScraper finds absolute and relative URLs in XML values.
-
content_types list
str – The supported content types.
-
__expressions list
obj – The regular expressions to execute.
-
__options
[source]¶ nyawc.Options
– The settings/options object.
-
__queue_item
[source]¶ nyawc.QueueItem
– The queue item containing the response to scrape.
-
__init__
(options, queue_item)[source]¶ Construct the XMLRegexLinkScraper instance.
Parameters: - options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.
-
get_requests
()[source]¶ Get all the new requests that were found in the response.
Returns: A list of new requests.
Return type: list(nyawc.http.Request)
-
get_requests_from_content
(host, content)[source]¶ Find new requests from the given content.
Parameters: - host (str) – The parent request URL.
- content (obj) – The XML content.
Returns: A list of new requests that were found.
Return type: list(nyawc.http.Request)
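For XML, a comparable sketch looks at the text between tags, as in a sitemap's loc elements. The pattern below is an assumption for illustration: it only considers element text (not attribute values) and only URL-like text that starts with a scheme, "//" or "/". The expressions actually used by XMLRegexLinkScraper are not shown on this page.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: URL-like text sitting directly between two tags.
XML_URL_PATTERN = re.compile(
    r">\s*((?:[a-zA-Z][a-zA-Z0-9+.-]*:)?//[^<\s]+|/[^<\s]*)\s*<"
)


def find_xml_urls(host, content):
    """Return absolute URLs for URL-like element text in the XML content."""
    return [urljoin(host, match.group(1)) for match in XML_URL_PATTERN.finditer(content)]
```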
-