nyawc.scrapers package
Submodules
nyawc.scrapers.CSSRegexLinkScraper module

class nyawc.scrapers.CSSRegexLinkScraper.CSSRegexLinkScraper(options, queue_item)
Bases: object
The CSSRegexLinkScraper finds absolute and relative URLs in Cascading Style Sheets.

content_types: list(str) – The supported content types.

__expressions: list(obj) – The regular expressions to execute.

__options: nyawc.Options – The settings/options object.

__queue_item: nyawc.QueueItem – The queue item containing the response to scrape.

__init__(options, queue_item)
Construct the CSSRegexLinkScraper instance.
Parameters: - options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.

get_requests()
Get all the new requests that were found in the response.
Parameters: - host (str) – The parent request URL.
- content (obj) – The HTML content.
Returns: A list of new requests that were found.
Return type: list(nyawc.http.Request)

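The regex-based approach described for this scraper can be sketched roughly as follows. This is a minimal illustration, not nyawc's actual implementation: the pattern, the function name `find_css_urls`, and the example URLs are assumptions, and the result is a list of plain URL strings rather than `nyawc.http.Request` objects.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: matches url(...) with optional single or double quotes.
CSS_URL_PATTERN = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""", re.IGNORECASE)


def find_css_urls(host, content):
    """Return absolute URLs for every url(...) reference found in the CSS text."""
    found = []
    for match in CSS_URL_PATTERN.finditer(content):
        # Relative references are resolved against the parent request URL.
        absolute = urljoin(host, match.group(1))
        if absolute not in found:
            found.append(absolute)
    return found


css = "body { background: url('/img/bg.png'); } @font-face { src: url(https://cdn.example.com/f.woff); }"
urls = find_css_urls("https://example.com/static/site.css", css)
```

Resolving every match with `urljoin` lets absolute and relative references share one code path.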
nyawc.scrapers.HTMLSoupFormScraper module

class nyawc.scrapers.HTMLSoupFormScraper.HTMLSoupFormScraper(options, queue_item)
Bases: object
The HTMLSoupFormScraper finds requests from forms in HTML using BeautifulSoup.

content_types: list(str) – The supported content types.

__options: nyawc.Options – The settings/options object.

__queue_item: nyawc.QueueItem – The queue item containing the response to scrape.

_HTMLSoupFormScraper__autofill_form_data(form_data, elements)
Autofill empty form data with random data.
Parameters: - form_data (obj) – The {key: value} form data.
- elements (list) – Soup elements.
Returns: The {key: value} form data.
Return type: obj

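A minimal sketch of the autofill idea is shown below. It assumes that "empty" means `None` or an empty string and that any random lowercase string is an acceptable filler. The function name `autofill_form_data` and the `length` parameter are hypothetical, and the sketch ignores the `elements` argument that the documented method also receives.

```python
import random
import string


def autofill_form_data(form_data, length=8):
    """Return a copy of form_data where empty values are replaced by random text."""
    filled = {}
    for key, value in form_data.items():
        if value in (None, ""):
            # Assumption: a random lowercase string is a good enough placeholder.
            value = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
        filled[key] = value
    return filled


data = autofill_form_data({"username": "", "token": "abc123"})
```

Values that are already set, such as `token` here, pass through unchanged.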
_HTMLSoupFormScraper__get_default_form_data_input(elements)
Get the default form data {key: value} for the given elements.
Parameters: elements (list) – Soup elements.
Returns: The {key: value} form data.
Return type: obj

_HTMLSoupFormScraper__get_default_value_from_element(element)
Get the default value of a form element.
Parameters: element (obj) – The soup element.
Returns: The default value.
Return type: str

_HTMLSoupFormScraper__get_form_data(soup)
Build a form data dict from the given form.
Parameters: soup (obj) – The BeautifulSoup form.
Returns: The form data (key/value).
Return type: obj

_HTMLSoupFormScraper__get_request(host, soup)
Build a request from the given soup form.
Parameters: - host (str) – The URL of the current queue item.
- soup (obj) – The BeautifulSoup form.
Returns: The new Request.
Return type:

_HTMLSoupFormScraper__get_valid_form_data_elements(soup)
Get all valid form input elements.
Note
An element is valid when the value can be updated client-side and the element has a name attribute.
Parameters: soup (obj) – The BeautifulSoup form.
Returns: Soup elements.
Return type: list(obj)

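The validity rule in the note above can be illustrated with a small sketch. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra dependencies. The class name `FormFieldCollector` and the set of tags treated as client-editable are assumptions, not nyawc's actual rules.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect {name: default value} pairs for named form controls."""

    # Assumption: these are the tags whose value a client can change.
    FIELD_TAGS = {"input", "textarea", "select", "button"}

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("name")
        # Elements without a name attribute are skipped.
        if tag in self.FIELD_TAGS and name:
            # A missing value attribute becomes an empty string.
            self.fields[name] = attributes.get("value") or ""


html = '<form><input name="q" value="nyawc"><input value="ignored"><textarea name="note"></textarea></form>'
collector = FormFieldCollector()
collector.feed(html)
fields = collector.fields
```

The second `<input>` is dropped because it has no `name`, which mirrors the condition stated in the note.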
_HTMLSoupFormScraper__trim_grave_accent(href)
Trim grave accents manually (because BeautifulSoup doesn’t support it).
Parameters: href (str) – The BeautifulSoup href value.
Returns: The BeautifulSoup href value without grave accents.
Return type: str

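As a hedged sketch, trimming could be as simple as stripping the accents from both ends of the value. Whether nyawc removes only wrapping accents or handles other positions as well is not stated here, so this behaviour and the function name `trim_grave_accent` are assumptions.

```python
def trim_grave_accent(href):
    """Strip leading and trailing grave accents from an href value."""
    # Assumption: only accents that wrap the value need to be removed.
    return href.strip("`")


cleaned = trim_grave_accent("`/contact`")
```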
__init__(options, queue_item)
Construct the HTMLSoupFormScraper instance.
Parameters: - options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.

get_requests()
Get all the new requests that were found in the response.
Returns: A list of new requests that were found.
Return type: list(nyawc.http.Request)

nyawc.scrapers.HTMLSoupLinkScraper module

class nyawc.scrapers.HTMLSoupLinkScraper.HTMLSoupLinkScraper(options, queue_item)
Bases: object
The HTMLSoupLinkScraper finds URLs from href attributes in HTML using BeautifulSoup.

content_types: list(str) – The supported content types.

__options: nyawc.Options – The settings/options object.

__queue_item: nyawc.QueueItem – The queue item containing the response to scrape.

_HTMLSoupLinkScraper__trim_grave_accent(href)
Trim grave accents manually (because BeautifulSoup doesn’t support it).
Parameters: href (str) – The BeautifulSoup href value.
Returns: The BeautifulSoup href value without grave accents.
Return type: str

__init__(options, queue_item)
Construct the HTMLSoupLinkScraper instance.
Parameters: - options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.

get_requests()
Get all the new requests that were found in the response.
Returns: A list of new requests that were found.
Return type: list(nyawc.http.Request)

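The href-based scraping described for this class might look roughly like the sketch below. It uses the standard library's `html.parser` in place of BeautifulSoup, and it returns URL strings instead of `nyawc.http.Request` objects. The class name `HrefCollector` and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect absolute URLs from every href attribute in an HTML document."""

    def __init__(self, host):
        super().__init__()
        self.host = host
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                # Strip wrapping grave accents, then resolve against the host URL.
                self.urls.append(urljoin(self.host, value.strip("`")))


collector = HrefCollector("https://example.com/blog/")
collector.feed('<a href="post-1.html">One</a><link href="/style.css"><a>No link</a>')
links = collector.urls
```

Any tag with an `href` is considered, so `<link>` elements are picked up alongside anchors.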
nyawc.scrapers.JSONRegexLinkScraper module

class nyawc.scrapers.JSONRegexLinkScraper.JSONRegexLinkScraper(options, queue_item)
Bases: object
The JSONRegexLinkScraper finds absolute and relative URLs in JSON keys and values.

content_types: list(str) – The supported content types.

__expressions: list(obj) – The regular expressions to execute.

__options: nyawc.Options – The settings/options object.

__queue_item: nyawc.QueueItem – The queue item containing the response to scrape.

__init__(options, queue_item)
Construct the JSONRegexLinkScraper instance.
Parameters: - options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.

get_requests()
Get all the new requests that were found in the response.
Returns: A list of new requests that were found.
Return type: list(nyawc.http.Request)

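One way to picture URL discovery in JSON keys and values is the sketch below, which scans the raw JSON text for quoted strings that begin with a scheme or a slash. The pattern and the function name `find_json_urls` are assumptions for illustration. The expressions nyawc actually uses are not listed on this page.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: a quoted JSON string that starts with http(s):// or "/".
JSON_URL_PATTERN = re.compile(r'"((?:https?://|/)[^"\s]*)"')


def find_json_urls(host, content):
    """Return absolute URLs for URL-like strings found in JSON keys or values."""
    return [urljoin(host, match.group(1)) for match in JSON_URL_PATTERN.finditer(content)]


body = '{"next": "/api/items?page=2", "https://example.org/docs": "see docs", "count": 10}'
urls = find_json_urls("https://example.com/api/items", body)
```

Because the pattern works on the raw text, it finds the URL used as a key as well as the one used as a value.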
nyawc.scrapers.XMLRegexLinkScraper module

class nyawc.scrapers.XMLRegexLinkScraper.XMLRegexLinkScraper(options, queue_item)
Bases: object
The XMLRegexLinkScraper finds absolute and relative URLs in XML values.

content_types: list(str) – The supported content types.

__expressions: list(obj) – The regular expressions to execute.

__options: nyawc.Options – The settings/options object.

__queue_item: nyawc.QueueItem – The queue item containing the response to scrape.

__init__(options, queue_item)
Construct the XMLRegexLinkScraper instance.
Parameters: - options (nyawc.Options) – The settings/options object.
- queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.

get_requests()
Get all the new requests that were found in the response.
Returns: A list of new requests that were found.
Return type: list(nyawc.http.Request)

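For XML values, a comparable sketch matches element text that starts with a scheme or a slash. As with the other examples, the pattern, the function name `find_xml_urls`, and the sample sitemap are assumptions, not nyawc's own expressions.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: element text that starts with http(s):// or "/".
XML_URL_PATTERN = re.compile(r">\s*((?:https?://|/)[^<\s]+)\s*<")


def find_xml_urls(host, content):
    """Return absolute URLs for URL-like text found between XML tags."""
    return [urljoin(host, match.group(1)) for match in XML_URL_PATTERN.finditer(content)]


xml = "<urlset><url><loc>https://example.com/a</loc></url><url><loc>/b</loc></url></urlset>"
urls = find_xml_urls("https://example.com/sitemap.xml", xml)
```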