nyawc.scrapers package

Submodules

nyawc.scrapers.CSSRegexLinkScraper module

class nyawc.scrapers.CSSRegexLinkScraper.CSSRegexLinkScraper(options, queue_item)

Bases: object

The CSSRegexLinkScraper finds absolute and relative URLs in Cascading Style Sheets.

content_types: list(str) – The supported content types.

__expressions: list(obj) – The regular expressions to execute.

__options: nyawc.Options – The settings/options object.

__queue_item: nyawc.QueueItem – The queue item containing the response to scrape.

__init__(options, queue_item)

Construct the CSSRegexLinkScraper instance.

Parameters:	options (`nyawc.Options`) – The settings/options object. queue_item (`nyawc.QueueItem`) – The queue item containing the response to scrape.

content_types = ['text/css']

get_requests()

Get all the new requests that were found in the response.

Returns:	A list of new requests that were found.
Return type:	list(`nyawc.http.Request`)
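
The general idea of regex-based link extraction from a stylesheet can be sketched as follows. This is a minimal illustration, not the scraper's actual implementation: the two patterns, the `find_css_urls` helper, and the example URLs are assumptions made for this sketch, and it returns plain URL strings rather than `nyawc.http.Request` objects.

```python
import re
from urllib.parse import urljoin

# Hypothetical expressions (not the ones nyawc ships): url(...) values
# and quoted @import targets.
CSS_EXPRESSIONS = [
    re.compile(r"url\(\s*['\"]?([^'\")\s]+)['\"]?\s*\)", re.IGNORECASE),
    re.compile(r"@import\s+['\"]([^'\"]+)['\"]", re.IGNORECASE),
]


def find_css_urls(base_url, css_text):
    """Return absolute URLs for the links found in css_text."""
    found = []
    for expression in CSS_EXPRESSIONS:
        for match in expression.finditer(css_text):
            raw = match.group(1)
            # Inline data URIs carry no crawlable location.
            if raw.lower().startswith("data:"):
                continue
            # Resolve relative links against the stylesheet's own URL.
            found.append(urljoin(base_url, raw))
    return found


css = (
    'body { background: url("/img/bg.png"); } '
    '@import "theme.css"; '
    ".icon { background: url(data:image/png;base64,AAAA); }"
)
urls = find_css_urls("https://example.com/css/site.css", css)
```

Relative links are resolved against the stylesheet URL, so `theme.css` in the example becomes `https://example.com/css/theme.css`.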

nyawc.scrapers.HTMLSoupFormScraper module

class nyawc.scrapers.HTMLSoupFormScraper.HTMLSoupFormScraper(options, queue_item)

Bases: object

The HTMLSoupFormScraper finds requests from forms in HTML using BeautifulSoup.

content_types: list(str) – The supported content types.

__options: nyawc.Options – The settings/options object.

__queue_item: nyawc.QueueItem – The queue item containing the response to scrape.

_HTMLSoupFormScraper__autofill_form_data(form_data, elements)

Autofill empty form data with random data.

Parameters:	form_data (obj) – The {key: value} form data. elements (list) – Soup elements.
Returns:	The {key: value} form data with empty values filled in.
Return type:	obj
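
Autofilling can be pictured as replacing every empty value with generated data that suits the field. The sketch below is an assumption-laden illustration, not the library's code: it takes a hypothetical `field_types` dict (field name to input type) instead of soup elements, and the `random_text` helper and the generated formats are invented for the example.

```python
import random
import string


def random_text(length):
    """Return a random lowercase string of the given length."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))


def autofill_form_data(form_data, field_types):
    """Return a copy of form_data with empty values replaced by random data.

    field_types is a hypothetical {name: input type} mapping used only in
    this sketch; unknown fields are treated as plain text.
    """
    filled = dict(form_data)
    for name, value in form_data.items():
        if value:
            continue  # keep values the form already provides
        kind = field_types.get(name, "text")
        if kind == "email":
            filled[name] = random_text(8) + "@example.com"
        elif kind == "number":
            filled[name] = str(random.randint(0, 9999))
        else:
            filled[name] = random_text(8)
    return filled


result = autofill_form_data(
    {"user": "bob", "email": "", "age": None, "note": ""},
    {"email": "email", "age": "number"},
)
```

Values that are already present (here `user`) are left untouched, so default values from the page survive the autofill step.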

_HTMLSoupFormScraper__get_default_form_data_input(elements)

Get the default form data {key: value} for the given elements.

Parameters:	elements (list) – Soup elements.
Returns:	The {key: value} form data
Return type:	obj

_HTMLSoupFormScraper__get_default_value_from_element(element)

Get the default value of a form element.

Parameters:	element (obj) – The soup element.
Returns:	The default value
Return type:	str

_HTMLSoupFormScraper__get_form_data(soup)

Build a form data dict from the given form.

Parameters:	soup (obj) – The BeautifulSoup form.
Returns:	The form data (key/value).
Return type:	obj

_HTMLSoupFormScraper__get_request(host, soup)

Build a request from the given soup form.

Parameters:	host (str) – The URL of the current queue item. soup (obj) – The BeautifulSoup form.
Returns:	The new Request.
Return type:	`nyawc.http.Request`

_HTMLSoupFormScraper__get_valid_form_data_elements(soup)

Get all valid form input elements.

Note

An element is valid when the value can be updated client-side and the element has a name attribute.

Parameters:	soup (obj) – The BeautifulSoup form.
Returns:	Soup elements.
Return type:	list(obj)
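
One way to read the validity rule in the note above is "an input, select, or textarea element that carries a non-empty name". The sketch below implements that reading only. It swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra dependencies, and the tag set and the `get_valid_form_fields` helper are assumptions, not part of nyawc.

```python
from html.parser import HTMLParser

# Assumed set of elements whose value a user can change client-side.
EDITABLE_TAGS = {"input", "select", "textarea"}


class FormFieldCollector(HTMLParser):
    """Collect (tag, attributes) pairs for named, editable form elements."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag not in EDITABLE_TAGS:
            return
        attributes = dict(attrs)
        # Elements without a name are never submitted with the form.
        if attributes.get("name"):
            self.fields.append((tag, attributes))


def get_valid_form_fields(html):
    """Return the named, editable form elements found in html."""
    collector = FormFieldCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


form = (
    '<form><input name="q" value="x"><input type="submit">'
    '<textarea name="body"></textarea><button name="go">Go</button></form>'
)
fields = get_valid_form_fields(form)
names = [attributes["name"] for _, attributes in fields]
```

The unnamed submit input and the button are skipped, leaving only `q` and `body`.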

_HTMLSoupFormScraper__trim_grave_accent(href)

Trim grave accents manually (because BeautifulSoup doesn’t support it).

Parameters:	href (str) – The BeautifulSoup href value.
Returns:	The BeautifulSoup href value without grave accents.
Return type:	str
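
Trimming amounts to dropping a grave accent (backtick) that wraps the attribute value. A minimal sketch, assuming at most one leading and one trailing accent is removed; the standalone function name is hypothetical and the library's method may behave differently in edge cases.

```python
def trim_grave_accent(href):
    """Remove one leading and one trailing grave accent from href, if present."""
    if href.startswith("`"):
        href = href[1:]
    if href.endswith("`"):
        href = href[:-1]
    return href


wrapped = trim_grave_accent("`/page`")
untouched = trim_grave_accent("/a`b")
```

Accents in the middle of the value are left alone, since only the wrapping characters are treated as quoting.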

__init__(options, queue_item)

Construct the HTMLSoupFormScraper instance.

Parameters:	options (`nyawc.Options`) – The settings/options object. queue_item (`nyawc.QueueItem`) – The queue item containing the response to scrape.

content_types = ['text/html', 'application/xhtml+xml']

get_requests()

Get all the new requests that were found in the response.

Returns:	A list of new requests that were found.
Return type:	list(`nyawc.http.Request`)

nyawc.scrapers.HTMLSoupLinkScraper module

class nyawc.scrapers.HTMLSoupLinkScraper.HTMLSoupLinkScraper(options, queue_item)

Bases: object

The HTMLSoupLinkScraper finds URLs from href attributes in HTML using BeautifulSoup.

content_types: list(str) – The supported content types.

__options: nyawc.Options – The settings/options object.

__queue_item: nyawc.QueueItem – The queue item containing the response to scrape.

_HTMLSoupLinkScraper__trim_grave_accent(href)

Trim grave accents manually (because BeautifulSoup doesn’t support it).

Parameters:	href (str) – The BeautifulSoup href value.
Returns:	The BeautifulSoup href value without grave accents.
Return type:	str

__init__(options, queue_item)

Construct the HTMLSoupLinkScraper instance.

Parameters:	options (`nyawc.Options`) – The settings/options object. queue_item (`nyawc.QueueItem`) – The queue item containing the response to scrape.

content_types = ['text/html', 'application/xhtml+xml']

get_requests()

Get all the new requests that were found in the response.

Returns:	A list of new requests that were found.
Return type:	list(`nyawc.http.Request`)
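
Collecting href attributes and resolving them against the page URL can be sketched as below. This is an illustration under stated assumptions, not the scraper's implementation: it uses the standard library's `html.parser` in place of BeautifulSoup, the `HrefCollector` class and the skipped prefixes are invented for the example, and it yields URL strings instead of `nyawc.http.Request` objects.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed prefixes that do not lead to a new page to crawl.
SKIPPED_PREFIXES = ("#", "javascript:", "mailto:")


class HrefCollector(HTMLParser):
    """Collect absolute URLs from every href attribute in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "href" or not value:
                continue
            href = value.strip()
            if href.lower().startswith(SKIPPED_PREFIXES):
                continue
            # Resolve relative links against the URL of the scraped page.
            self.urls.append(urljoin(self.base_url, href))


def find_html_links(base_url, html):
    """Return the absolute URLs of all usable href attributes in html."""
    collector = HrefCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.urls


page = '<a href="/about">A</a><link href="style.css"><a href="#top">T</a><a>x</a>'
links = find_html_links("https://example.com/docs/index.html", page)
```

Any element with an href is considered, which is why the `link` tag contributes a URL alongside the anchor.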

nyawc.scrapers.JSONRegexLinkScraper module

class nyawc.scrapers.JSONRegexLinkScraper.JSONRegexLinkScraper(options, queue_item)

Bases: object

The JSONRegexLinkScraper finds absolute and relative URLs in JSON keys and values.

content_types: list(str) – The supported content types.

__expressions: list(obj) – The regular expressions to execute.

__options: nyawc.Options – The settings/options object.

__queue_item: nyawc.QueueItem – The queue item containing the response to scrape.

__init__(options, queue_item)

Construct the JSONRegexLinkScraper instance.

Parameters:	options (`nyawc.Options`) – The settings/options object. queue_item (`nyawc.QueueItem`) – The queue item containing the response to scrape.

content_types = ['application/json']

get_requests()

Get all the new requests that were found in the response.

Returns:	A list of new requests that were found.
Return type:	list(`nyawc.http.Request`)
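
Because both keys and values are quoted strings in JSON, a regex pass over the raw text can pick up URL-like strings in either position. The sketch below shows one such pass. The pattern and the `find_json_urls` helper are assumptions made for illustration (the scraper's real expressions are not documented here), and escaped slashes such as `\/` are not handled.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: a quoted string that starts like an absolute,
# protocol-relative, or relative URL and contains no whitespace.
JSON_URL_PATTERN = re.compile(r'"((?:https?:)?//[^"\s]+|\.{0,2}/[^"\s]*)"')


def find_json_urls(base_url, json_text):
    """Return absolute URLs for URL-like keys and values in json_text."""
    return [
        urljoin(base_url, match.group(1))
        for match in JSON_URL_PATTERN.finditer(json_text)
    ]


body = (
    '{"next": "/api/items?page=2", "home": "https://example.org/", '
    '"/files/a.txt": 1, "title": "hello"}'
)
json_urls = find_json_urls("https://example.com/api/index.json", body)
```

Working on the raw text rather than the parsed document keeps the approach tolerant of malformed JSON, at the cost of occasional false positives.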

nyawc.scrapers.XMLRegexLinkScraper module

class nyawc.scrapers.XMLRegexLinkScraper.XMLRegexLinkScraper(options, queue_item)

Bases: object

The XMLRegexLinkScraper finds absolute and relative URLs in XML values.

content_types: list(str) – The supported content types.

__expressions: list(obj) – The regular expressions to execute.

__options: nyawc.Options – The settings/options object.

__queue_item: nyawc.QueueItem – The queue item containing the response to scrape.

__init__(options, queue_item)

Construct the XMLRegexLinkScraper instance.

Parameters:	options (`nyawc.Options`) – The settings/options object. queue_item (`nyawc.QueueItem`) – The queue item containing the response to scrape.

content_types = ['text/xml', 'application/xml', 'image/svg+xml']

get_requests()

Get all the new requests that were found in the response.

Returns:	A list of new requests that were found.
Return type:	list(`nyawc.http.Request`)
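
For XML, "values" can be read as attribute values and element text. The sketch below runs one expression for each and resolves the matches against the document URL. Both patterns and the `find_xml_urls` helper are assumptions for illustration only, and namespace URIs in `xmlns` attributes would also be matched by this simple version.

```python
import re
from urllib.parse import urljoin

# Hypothetical expressions, not the ones nyawc ships.
XML_EXPRESSIONS = [
    # Quoted attribute values, e.g. href="/img/b.svg".
    re.compile(r"=\s*[\"']((?:https?:)?//[^\"'\s<>]+|\.{0,2}/[^\"'\s<>]*)[\"']"),
    # Element text, e.g. <loc>https://example.com/a</loc>.
    re.compile(r">\s*((?:https?:)?//[^\s<>]+|\.{0,2}/[^\s<>]*)\s*<"),
]


def find_xml_urls(base_url, xml_text):
    """Return unique absolute URLs found in attribute values and element text."""
    found = []
    for expression in XML_EXPRESSIONS:
        for match in expression.finditer(xml_text):
            url = urljoin(base_url, match.group(1))
            if url not in found:  # keep first occurrence, preserve order
                found.append(url)
    return found


document = (
    "<urlset><url><loc>https://example.com/a</loc></url>"
    '<image href="/img/b.svg"/><text>hello</text></urlset>'
)
xml_urls = find_xml_urls("https://example.com/sitemap.xml", document)
```

Results are grouped per expression, so attribute matches come before element-text matches regardless of their position in the document.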
