nyawc.scrapers package

Submodules

nyawc.scrapers.CSSRegexLinkScraper module

class nyawc.scrapers.CSSRegexLinkScraper.CSSRegexLinkScraper(options, queue_item)[source]

Bases: object

The CSSRegexLinkScraper finds absolute and relative URLs in Cascading Style Sheets.

content_types list

str – The supported content types.

__expressions list

obj – The regular expressions to execute.

__options[source]

nyawc.Options – The settings/options object.

__queue_item[source]

nyawc.QueueItem – The queue item containing the response to scrape.

__init__(options, queue_item)[source]

Construct the CSSRegexLinkScraper instance.

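Parameters:
  • options (nyawc.Options) – The settings/options object.
  • queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.
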
content_types = ['text/css'][source]
get_requests()[source]

Get all the new requests that were found in the response.

Parameters:
  • host (str) – The parent request URL.
  • content (obj) – The CSS content.
Returns:

A list of new requests that were found.

Return type:

list(nyawc.http.Request)
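
As an illustration of regex-based URL discovery in a stylesheet, here is a minimal sketch. The pattern and the function name find_css_urls are assumptions made for this example; they are not the expressions nyawc actually ships, and the sketch returns plain URL strings rather than nyawc.http.Request objects.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: matches url(...) with optional single or double quotes.
CSS_URL_PATTERN = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""", re.IGNORECASE)


def find_css_urls(host, content):
    """Return the absolute form of every url(...) reference in the CSS text."""
    found = []
    for match in CSS_URL_PATTERN.finditer(content):
        # Relative references are resolved against the parent request URL.
        absolute = urljoin(host, match.group(1))
        if absolute not in found:
            found.append(absolute)
    return found
```

Resolving each match with urljoin handles absolute and relative references through the same code path.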

nyawc.scrapers.HTMLSoupFormScraper module

class nyawc.scrapers.HTMLSoupFormScraper.HTMLSoupFormScraper(options, queue_item)[source]

Bases: object

The HTMLSoupFormScraper finds requests from forms in HTML using BeautifulSoup.

content_types list

str – The supported content types.

__options[source]

nyawc.Options – The settings/options object.

__queue_item[source]

nyawc.QueueItem – The queue item containing the response to scrape.

_HTMLSoupFormScraper__autofill_form_data(form_data, elements)[source]

Autofill empty form data with random data.

Parameters:
  • form_data (obj) – The {key: value} form data.
  • elements (list) – Soup elements.
Returns:

The {key: value} form data with empty values filled in.

Return type:

obj
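
The autofill step could look roughly like the sketch below. It assumes each element is a plain dict with "name" and "type" keys as a stand-in for a soup element, and the per-type rules (email, number, text) are invented for illustration; the real method may choose values differently.

```python
import random
import string


def random_text(length):
    """Return a random lowercase string of the given length."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))


def autofill_form_data(form_data, elements):
    """Fill empty values in form_data with random data; keep existing values."""
    filled = dict(form_data)
    for element in elements:
        name = element["name"]
        if filled.get(name):
            continue  # A default value is already present, leave it alone.
        kind = element.get("type", "text")
        if kind == "email":
            filled[name] = random_text(8) + "@example.com"
        elif kind == "number":
            filled[name] = str(random.randint(0, 9999))
        else:
            filled[name] = random_text(8)
    return filled
```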

_HTMLSoupFormScraper__get_default_form_data_input(elements)[source]

Get the default form data {key: value} for the given elements.

Parameters:elements (list) – Soup elements.
Returns:The {key: value} form data
Return type:obj
_HTMLSoupFormScraper__get_default_value_from_element(element)[source]

Get the default value of a form element.

Parameters:element (obj) – The soup element.
Returns:The default value
Return type:str
_HTMLSoupFormScraper__get_form_data(soup)[source]

Build a form data dict from the given form.

Parameters:soup (obj) – The BeautifulSoup form.
Returns:The form data (key/value).
Return type:obj
_HTMLSoupFormScraper__get_request(host, soup)[source]

Build a request from the given soup form.

Parameters:
  • host (str) – The URL of the current queue item.
  • soup (obj) – The BeautifulSoup form.
Returns:

The new Request.

Return type:

nyawc.http.Request
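
To make the form-to-request step concrete, the following sketch resolves a form's action against the host URL and normalises the method. FormRequest is a hypothetical stand-in for nyawc.http.Request, and build_form_request takes the action, method, and data as plain values instead of a soup form, so it only approximates what the method does.

```python
from dataclasses import dataclass, field
from urllib.parse import urljoin


@dataclass
class FormRequest:
    """Hypothetical stand-in for nyawc.http.Request."""
    url: str
    method: str = "get"
    data: dict = field(default_factory=dict)


def build_form_request(host, action, method, form_data):
    """Build a request from a form's action, method, and data."""
    # An empty or missing action submits to the current URL.
    url = urljoin(host, action) if action else host
    # Forms without a method attribute default to GET.
    return FormRequest(url=url, method=(method or "get").lower(), data=dict(form_data))
```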

_HTMLSoupFormScraper__get_valid_form_data_elements(soup)[source]

Get all valid form input elements.

Note

An element is valid when the value can be updated client-side and the element has a name attribute.

Parameters:soup (obj) – The BeautifulSoup form.
Returns:Soup elements.
Return type:list(obj)
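
The validity rule in the note can be sketched as follows. To stay dependency-free, this example uses the standard library's html.parser rather than BeautifulSoup, and it reads "can be updated client-side" as "not disabled", which is an assumption; the tag set and all names here are illustrative only.

```python
from html.parser import HTMLParser

# Assumed set of tags that can carry form data.
FORM_TAGS = {"input", "select", "textarea", "button"}


class FormElementCollector(HTMLParser):
    """Collect form elements that have a name and are not disabled."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag not in FORM_TAGS:
            return
        attributes = dict(attrs)
        # Assumed reading of "valid": named, and not marked as disabled.
        if attributes.get("name") and "disabled" not in attributes:
            self.elements.append((tag, attributes))


def get_valid_form_elements(html):
    """Return (tag, attributes) pairs for the valid form elements in html."""
    collector = FormElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements
```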
_HTMLSoupFormScraper__trim_grave_accent(href)[source]

Trim grave accents manually (because BeautifulSoup doesn't support it).

Parameters:href (str) – The BeautifulSoup href value.
Returns:The BeautifulSoup href value without grave accents.
Return type:str
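
A minimal sketch of such a trim, assuming only a single leading and a single trailing grave accent need to be removed (the actual method may behave differently):

```python
def trim_grave_accent(href):
    """Remove one leading and one trailing grave accent from href, if present."""
    if href.startswith("`"):
        href = href[1:]
    if href.endswith("`"):
        href = href[:-1]
    return href
```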
__init__(options, queue_item)[source]

Construct the HTMLSoupFormScraper instance.

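Parameters:
  • options (nyawc.Options) – The settings/options object.
  • queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.
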
content_types = ['text/html', 'application/xhtml+xml'][source]
get_requests()[source]

Get all the new requests that were found in the response.

Returns:A list of new requests that were found.
Return type:list(nyawc.http.Request)

nyawc.scrapers.HTMLSoupLinkScraper module

class nyawc.scrapers.HTMLSoupLinkScraper.HTMLSoupLinkScraper(options, queue_item)[source]

Bases: object

The HTMLSoupLinkScraper finds URLs from href attributes in HTML using BeautifulSoup.

content_types list

str – The supported content types.

__options[source]

nyawc.Options – The settings/options object.

__queue_item[source]

nyawc.QueueItem – The queue item containing the response to scrape.

_HTMLSoupLinkScraper__trim_grave_accent(href)[source]

Trim grave accents manually (because BeautifulSoup doesn’t support it).

Parameters:href (str) – The BeautifulSoup href value.
Returns:The BeautifulSoup href value without grave accents.
Return type:str
__init__(options, queue_item)[source]

Construct the HTMLSoupLinkScraper instance.

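Parameters:
  • options (nyawc.Options) – The settings/options object.
  • queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.
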
content_types = ['text/html', 'application/xhtml+xml'][source]
get_requests()[source]

Get all the new requests that were found in the response.

Returns:A list of new requests that were found.
Return type:list(nyawc.http.Request)
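
The idea of collecting href attributes can be sketched as below. This example swaps BeautifulSoup for the standard library's html.parser so that it runs without extra dependencies, and it returns plain URL strings instead of nyawc.http.Request objects. HrefCollector and find_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the absolute form of every href attribute in a document."""

    def __init__(self, host):
        super().__init__()
        self.host = host
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                # Strip stray grave accents, then resolve against the host URL.
                self.urls.append(urljoin(self.host, value.strip("`")))


def find_links(host, html):
    """Return all href targets found in html, resolved against host."""
    collector = HrefCollector(host)
    collector.feed(html)
    collector.close()
    return collector.urls
```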

nyawc.scrapers.JSONRegexLinkScraper module

class nyawc.scrapers.JSONRegexLinkScraper.JSONRegexLinkScraper(options, queue_item)[source]

Bases: object

The JSONRegexLinkScraper finds absolute and relative URLs in JSON keys and values.

content_types list

str – The supported content types.

__expressions list

obj – The regular expressions to execute.

__options[source]

nyawc.Options – The settings/options object.

__queue_item[source]

nyawc.QueueItem – The queue item containing the response to scrape.

__init__(options, queue_item)[source]

Construct the JSONRegexLinkScraper instance.

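Parameters:
  • options (nyawc.Options) – The settings/options object.
  • queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.
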
content_types = ['application/json'][source]
get_requests()[source]

Get all the new requests that were found in the response.

Returns:A list of new requests that were found.
Return type:list(nyawc.http.Request)
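
One way to find URLs in both JSON keys and values is to scan the raw text for quoted strings that look like URLs, as in this sketch. The pattern is an assumption for illustration, not one of nyawc's own expressions, and it only recognises strings that start with http://, https://, or a slash.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: a quoted string starting with a scheme or a slash.
JSON_URL_PATTERN = re.compile(r'"((?:https?://|/)[^"\s\\]+)"')


def find_json_urls(host, content):
    """Return absolute URLs for URL-like strings in the raw JSON text."""
    return [urljoin(host, match.group(1)) for match in JSON_URL_PATTERN.finditer(content)]
```

Scanning the raw text instead of parsing it first treats keys and values alike, at the cost of missing URLs that contain escaped characters.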

nyawc.scrapers.XMLRegexLinkScraper module

class nyawc.scrapers.XMLRegexLinkScraper.XMLRegexLinkScraper(options, queue_item)[source]

Bases: object

The XMLRegexLinkScraper finds absolute and relative URLs in XML values.

content_types list

str – The supported content types.

__expressions list

obj – The regular expressions to execute.

__options[source]

nyawc.Options – The settings/options object.

__queue_item[source]

nyawc.QueueItem – The queue item containing the response to scrape.

__init__(options, queue_item)[source]

Construct the XMLRegexLinkScraper instance.

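Parameters:
  • options (nyawc.Options) – The settings/options object.
  • queue_item (nyawc.QueueItem) – The queue item containing the response to scrape.
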
content_types = ['text/xml', 'application/xml', 'image/svg+xml'][source]
get_requests()[source]

Get all the new requests that were found in the response.

Returns:A list of new requests that were found.
Return type:list(nyawc.http.Request)
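
For XML, a comparable sketch looks for URL-like text at the start of element content or attribute values. The pattern below is an assumption chosen for this example and is not taken from nyawc; it recognises values that begin with http://, https://, or a slash.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: a URL-like value directly after ">" (element text)
# or after a double quote (attribute value).
XML_URL_PATTERN = re.compile(r'(?:>|")\s*((?:https?://|/)[^\s"<>]+)')


def find_xml_urls(host, content):
    """Return absolute URLs for URL-like values in the raw XML text."""
    return [urljoin(host, match.group(1)) for match in XML_URL_PATTERN.finditer(content)]
```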