nyawc.scrapers.BaseScraper.BaseScraper(options, queue_item)
Bases: object
The BaseScraper can be used to create other scrapers.
__options nyawc.Options – The settings/options object.
__queue_item nyawc.QueueItem – The queue item containing the response to scrape.
__init__(options, queue_item)
Construct the BaseScraper instance.
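| Parameters: | options (nyawc.Options) – The settings/options object. |
|---|---|
| | queue_item (nyawc.QueueItem) – The queue item containing the response to scrape. |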
get_requests()
Get all the new requests that were found in the response.
| Returns: | A list of new requests that were found. |
|---|---|
| Return type: | list(nyawc.http.Request) |
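Every scraper listed after this class exposes derived_get_requests(), which suggests that get_requests() on the base class delegates to that hook. The following is a minimal sketch of that pattern under that assumption. SketchRequest, SketchBaseScraper and StaticScraper are hypothetical stand-ins written for illustration, not nyawc classes.

```python
class SketchRequest:
    """Hypothetical stand-in for nyawc.http.Request; holds only a URL."""

    def __init__(self, url):
        self.url = url


class SketchBaseScraper:
    """Hypothetical stand-in for BaseScraper (assumed delegation pattern)."""

    # Content types a scraper supports; subclasses override this.
    content_types = []

    def __init__(self, options, queue_item):
        self.options = options
        self.queue_item = queue_item

    def get_requests(self):
        # Assumption: the public method delegates to the subclass hook.
        return self.derived_get_requests()

    def derived_get_requests(self):
        raise NotImplementedError("Subclasses must implement this hook.")


class StaticScraper(SketchBaseScraper):
    """Toy scraper that returns a fixed URL instead of parsing a response."""

    content_types = ["text/plain"]

    def derived_get_requests(self):
        return [SketchRequest("https://example.com/found")]


scraper = StaticScraper(options=None, queue_item=None)
found_urls = [request.url for request in scraper.get_requests()]
```

A real scraper would read the response from the queue item instead of returning a constant.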
nyawc.scrapers.CSSRegexLinkScraper.CSSRegexLinkScraper(options, queue_item)
Bases: nyawc.scrapers.BaseScraper.BaseScraper
The CSSRegexLinkScraper finds absolute and relative URLs in Cascading Style Sheets.
content_types list(str) – The supported content types.
__expressions list(obj) – The regular expressions to execute.
derived_get_requests()
Get all the new requests that were found in the response.
| Returns: | A list of new requests that were found. |
|---|---|
| Return type: | list(nyawc.http.Request) |
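To illustrate how absolute and relative URLs can be found in a stylesheet with a regular expression, here is a minimal sketch. The pattern and the find_css_urls helper are assumptions made for this example and are not the expressions the scraper actually uses. Relative references are resolved against the stylesheet URL with urllib.parse.urljoin.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: matches url(...) with optional single or double quotes.
CSS_URL_PATTERN = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""", re.IGNORECASE)


def find_css_urls(base_url, css_text):
    """Return an absolute URL for every url(...) reference in css_text."""
    return [urljoin(base_url, match) for match in CSS_URL_PATTERN.findall(css_text)]


css = """
body { background: url("/img/bg.png"); }
@font-face { src: url(fonts/a.woff); }
.logo { background: url('https://cdn.example.com/logo.svg'); }
"""
css_urls = find_css_urls("https://example.com/static/site.css", css)
```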
nyawc.scrapers.HTMLSoupFormScraper.HTMLSoupFormScraper(options, queue_item)
Bases: nyawc.scrapers.BaseScraper.BaseScraper
The HTMLSoupFormScraper finds requests from forms in HTML using BeautifulSoup.
content_types list(str) – The supported content types.
_HTMLSoupFormScraper__autofill_form_data(form_data, elements)
Autofill empty form data with random data.
| Parameters: |
|
|---|---|
| Returns: | The {key: value} form data |
| Return type: | obj |
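Autofilling can be sketched as follows. This example assumes the form data is a plain dict and that "empty" means an empty string or None. The elements argument of the real method, which could be used to pick values appropriate to each input type, is left out, and autofill_form_data is a hypothetical helper rather than the library method.

```python
import random
import string


def autofill_form_data(form_data, length=8, rng=random):
    """Return a copy of form_data with empty values replaced by random text."""
    filled = {}
    for key, value in form_data.items():
        if value in (None, ""):
            # Assumption: any random lowercase string is an acceptable filler.
            value = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        filled[key] = value
    return filled


form = {"username": "", "remember": "1", "comment": None}
filled_form = autofill_form_data(form)
```

Values that are already set, such as remember above, are passed through unchanged, and the input dict is not modified.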
_HTMLSoupFormScraper__get_default_form_data_input(elements)
Get the default form data {key: value} for the given elements.
| Parameters: | elements (list) – Soup elements. |
|---|---|
| Returns: | The {key: value} form data |
| Return type: | obj |
_HTMLSoupFormScraper__get_default_value_from_element(element)
Get the default value of a form element.
| Parameters: | element (obj) – The soup element. |
|---|---|
| Returns: | The default value |
| Return type: | str |
_HTMLSoupFormScraper__get_form_data(soup)
Build a form data dict from the given form.
| Parameters: | soup (obj) – The BeautifulSoup form. |
|---|---|
| Returns: | The form data (key/value). |
| Return type: | obj |
_HTMLSoupFormScraper__get_request(host, soup)
Build a request from the given soup form.
| Parameters: |
|
|---|---|
| Returns: | The new Request. |
| Return type: | nyawc.http.Request |
_HTMLSoupFormScraper__get_valid_form_data_elements(soup)
Get all valid form input elements.
Note
An element is valid when the value can be updated client-side and the element has a name attribute.
| Parameters: | soup (obj) – The BeautifulSoup form. |
|---|---|
| Returns: | Soup elements. |
| Return type: | list(obj) |
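The validity rule in the note above can be sketched with the standard library's html.parser standing in for BeautifulSoup. FormFieldCollector is a hypothetical class, and treating the disabled attribute as "cannot be updated client-side" is an assumption of this example, since the exact criteria are not spelled out here.

```python
from html.parser import HTMLParser

# Tags treated as form fields in this sketch.
FORM_TAGS = {"input", "select", "textarea"}


class FormFieldCollector(HTMLParser):
    """Collect names of form fields that have a name and are not disabled."""

    def __init__(self):
        super().__init__()
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        if tag not in FORM_TAGS:
            return
        attributes = dict(attrs)
        # Assumption: "disabled" marks a field whose value cannot be updated.
        if "disabled" in attributes:
            return
        # Fields without a name attribute are skipped.
        if attributes.get("name"):
            self.field_names.append(attributes["name"])


collector = FormFieldCollector()
collector.feed(
    '<form><input name="q"><input type="text">'
    '<input name="locked" disabled><textarea name="body"></textarea></form>'
)
valid_names = collector.field_names
```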
_HTMLSoupFormScraper__trim_grave_accent(href)
Trim grave accents manually (because BeautifulSoup doesn't support it).
| Parameters: | href (str) – The BeautifulSoup href value. |
|---|---|
| Returns: | The BeautifulSoup href value without grave accents. |
| Return type: | str |
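A minimal sketch of grave accent trimming, assuming only a single leading and a single trailing accent need to be removed. trim_grave_accent is a standalone illustration, not the private method itself.

```python
def trim_grave_accent(href):
    """Strip one leading and one trailing grave accent (`) from href."""
    if href.startswith("`"):
        href = href[1:]
    if href.endswith("`"):
        href = href[:-1]
    return href


trimmed = trim_grave_accent("`/about`")
```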
derived_get_requests()
Get all the new requests that were found in the response.
| Returns: | A list of new requests that were found. |
|---|---|
| Return type: | list(nyawc.http.Request) |
nyawc.scrapers.HTMLSoupLinkScraper.HTMLSoupLinkScraper(options, queue_item)
Bases: nyawc.scrapers.BaseScraper.BaseScraper
The HTMLSoupLinkScraper finds URLs from href attributes in HTML using BeautifulSoup.
content_types list(str) – The supported content types.
_HTMLSoupLinkScraper__trim_grave_accent(href)
Trim grave accents manually (because BeautifulSoup doesn’t support it).
| Parameters: | href (str) – The BeautifulSoup href value. |
|---|---|
| Returns: | The BeautifulSoup href value without grave accents. |
| Return type: | str |
derived_get_requests()
Get all the new requests that were found in the response.
| Returns: | A list of new requests that were found. |
|---|---|
| Return type: | list(nyawc.http.Request) |
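Collecting URLs from href attributes can be sketched as below. The standard library's html.parser is used here in place of BeautifulSoup so the example has no third-party dependency. HrefCollector is a hypothetical class, and collecting href from every tag (not only anchors) is an assumption of this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect absolute URLs from every href attribute in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Skip tags without an href and empty href values.
            if name == "href" and value:
                self.urls.append(urljoin(self.base_url, value))


href_collector = HrefCollector("https://example.com/docs/index.html")
href_collector.feed(
    '<a href="page2.html">Next</a><link href="/style.css"><a>No href</a>'
)
html_urls = href_collector.urls
```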
nyawc.scrapers.JSONRegexLinkScraper.JSONRegexLinkScraper(options, queue_item)
Bases: nyawc.scrapers.BaseScraper.BaseScraper
The JSONRegexLinkScraper finds absolute and relative URLs in JSON keys and values.
content_types list(str) – The supported content types.
__expressions list(obj) – The regular expressions to execute.
derived_get_requests()
Get all the new requests that were found in the response.
| Returns: | A list of new requests that were found. |
|---|---|
| Return type: | list(nyawc.http.Request) |
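Because the scraper works on the raw text with regular expressions rather than on parsed JSON, URLs that appear as keys are found the same way as URLs that appear as values. A minimal sketch follows, with an assumed pattern that accepts quoted strings starting with http://, https:// or a slash. Neither the pattern nor find_json_urls is taken from the library.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: a quoted string that starts with a scheme or a slash.
JSON_URL_PATTERN = re.compile(r'"((?:https?://|/)[^"\s\\]+)"')


def find_json_urls(base_url, json_text):
    """Return absolute URLs for URL-like strings in keys and values."""
    return [urljoin(base_url, match) for match in JSON_URL_PATTERN.findall(json_text)]


payload = '{"next": "/api/items?page=2", "https://example.com/docs": "docs", "count": 3}'
json_urls = find_json_urls("https://example.com/api/items", payload)
```

This sketch skips strings containing backslashes, so URLs written with escaped slashes are not matched.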
nyawc.scrapers.XMLRegexLinkScraper.XMLRegexLinkScraper(options, queue_item)
Bases: nyawc.scrapers.BaseScraper.BaseScraper
The XMLRegexLinkScraper finds absolute and relative URLs in XML values.
content_types list(str) – The supported content types.
__expressions list(obj) – The regular expressions to execute.
derived_get_requests()
Get all the new requests that were found in the response.
| Returns: | A list of new requests that were found. |
|---|---|
| Return type: | list(nyawc.http.Request) |
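For XML, a comparable sketch looks for URL-like text between a closing > and the next <, which covers element values such as the loc entries of a sitemap. The pattern is an assumption made for this example, and attribute values are not handled.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: element text that starts with a scheme or a slash.
XML_VALUE_PATTERN = re.compile(r">\s*((?:https?://|/)[^<\s]+)\s*<")

sitemap = (
    "<urlset><url><loc>https://example.com/a</loc></url>"
    "<url><loc> /b </loc></url></urlset>"
)
xml_urls = [
    urljoin("https://example.com/sitemap.xml", value)
    for value in XML_VALUE_PATTERN.findall(sitemap)
]
```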