nyawc.scrapers.BaseScraper.BaseScraper(options, queue_item)
Bases: object
The BaseScraper can be used to create other scrapers.
__options
nyawc.Options – The settings/options object.
__queue_item
nyawc.QueueItem – The queue item containing the response to scrape.
__init__(options, queue_item)
Construct the BaseScraper instance.
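Parameters: | options (nyawc.Options) – The settings/options object; queue_item (nyawc.QueueItem) – The queue item containing the response to scrape. |
---|---|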
get_requests()
Get all the new requests that were found in the response.
Returns: | A list of new requests that were found. |
---|---|
Return type: | list(nyawc.http.Request) |
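The relationship between `get_requests` and the `derived_get_requests` methods documented for the subclasses can be sketched as follows. This is a minimal illustration and not nyawc's implementation: `SketchBaseScraper` and `SketchLinkScraper` are hypothetical stand-ins, the delegation from `get_requests` to `derived_get_requests` is an assumption, and plain strings take the place of `nyawc.http.Request` objects. The sketch also shows why double-underscore members appear on this page with a `_ClassName` prefix: Python mangles such names.

```python
class SketchBaseScraper:
    """Hypothetical stand-in for a scraper base class."""

    def __init__(self, options, queue_item):
        # Double-underscore attributes are name-mangled by Python to
        # _SketchBaseScraper__options and _SketchBaseScraper__queue_item.
        self.__options = options
        self.__queue_item = queue_item

    def get_requests(self):
        # Assumption: the base class delegates the actual scraping
        # to a method that each subclass provides.
        return list(self.derived_get_requests())

    def derived_get_requests(self):
        raise NotImplementedError("Subclasses must implement this.")


class SketchLinkScraper(SketchBaseScraper):
    """Hypothetical subclass that 'finds' a fixed list of URLs."""

    content_types = ["text/html"]

    def derived_get_requests(self):
        return ["https://example.com/a", "https://example.com/b"]


scraper = SketchLinkScraper(options={}, queue_item=None)
found = scraper.get_requests()

# The private attribute is reachable only through its mangled name.
mangled = scraper._SketchBaseScraper__options
```

Callers would only use `get_requests`; the subclass hook stays an implementation detail.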
nyawc.scrapers.CSSRegexLinkScraper.CSSRegexLinkScraper(options, queue_item)
Bases: nyawc.scrapers.BaseScraper.BaseScraper
The CSSRegexLinkScraper finds absolute and relative URLs in Cascading Style Sheets.
content_types
list(str) – The supported content types.
__expressions
list(obj) – The regular expressions to execute.
derived_get_requests()
Get all the new requests that were found in the response.
Returns: | A list of new requests that were found. |
---|---|
Return type: | list(nyawc.http.Request) |
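As a rough illustration of regex-based link discovery in a stylesheet, the sketch below extracts `url(...)` references and resolves them against the stylesheet's own URL. The pattern `CSS_URL` and the helper `find_css_urls` are hypothetical; the expressions nyawc actually executes are not listed on this page and may differ.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: url(...) with optional single or double quotes.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def find_css_urls(base_url, css_text):
    """Return absolute URLs for every url(...) reference in css_text."""
    found = []
    for match in CSS_URL.finditer(css_text):
        target = match.group(2).strip()
        # Skip empty references and inline data URIs.
        if target and not target.startswith("data:"):
            found.append(urljoin(base_url, target))
    return found


css = """
body { background: url("/img/bg.png"); }
.logo { background: url(logo.svg); }
@import url('https://cdn.example.com/base.css');
"""
css_urls = find_css_urls("https://example.com/static/site.css", css)
```

Relative references resolve against the stylesheet location, so `logo.svg` ends up under `/static/`.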
nyawc.scrapers.HTMLSoupFormScraper.HTMLSoupFormScraper(options, queue_item)
Bases: nyawc.scrapers.BaseScraper.BaseScraper
The HTMLSoupFormScraper finds requests from forms in HTML using BeautifulSoup.
content_types
list(str) – The supported content types.
_HTMLSoupFormScraper__autofill_form_data(form_data, elements)
Autofill empty form data with random data.
Parameters: |
|
---|---|
Returns: | The {key: value} form data. |
Return type: | obj |
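The idea of autofilling can be sketched as below. This is only an assumption about the behaviour: the hypothetical `autofill_form_data` ignores the `elements` argument of the real method, and the choice of eight random lowercase letters as filler is invented for illustration.

```python
import random
import string


def autofill_form_data(form_data, rng=None):
    """Return a copy of form_data with empty values replaced by random text."""
    rng = rng or random.Random()
    filled = {}
    for key, value in form_data.items():
        if value:
            # Keep values that the form already provides.
            filled[key] = value
        else:
            # Assumption: eight random lowercase letters are acceptable filler.
            filled[key] = "".join(
                rng.choice(string.ascii_lowercase) for _ in range(8)
            )
    return filled


filled = autofill_form_data({"username": "", "token": "abc123"}, random.Random(0))
```

Passing a seeded `random.Random` keeps the filler reproducible, which helps when comparing crawl runs.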
_HTMLSoupFormScraper__get_default_form_data_input(elements)
Get the default form data {key: value} for the given elements.
Parameters: | elements (list) – Soup elements. |
---|---|
Returns: | The {key: value} form data |
Return type: | obj |
_HTMLSoupFormScraper__get_default_value_from_element(element)
Get the default value of a form element.
Parameters: | element (obj) – The soup element. |
---|---|
Returns: | The default value |
Return type: | str |
_HTMLSoupFormScraper__get_form_data(soup)
Build a form data dict from the given form.
Parameters: | soup (obj) – The BeautifulSoup form. |
---|---|
Returns: | The form data (key/value). |
Return type: | obj |
_HTMLSoupFormScraper__get_request(host, soup)
Build a request from the given soup form.
Parameters: |
|
---|---|
Returns: | The new Request. |
Return type: |
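Turning a form into a request mostly means resolving the form's action against the page URL and choosing the HTTP method. The sketch below illustrates that under stated assumptions: `SketchRequest` and `build_form_request` are hypothetical names, the action and method are passed in as plain strings instead of being read from a BeautifulSoup form, and nothing here reflects the fields of the real request class.

```python
from dataclasses import dataclass, field
from urllib.parse import urljoin


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a request object."""
    url: str
    method: str = "get"
    data: dict = field(default_factory=dict)


def build_form_request(host, action, method, form_data):
    """Build a request for a form found on the page at `host`."""
    # An empty or missing action submits to the current page.
    url = urljoin(host, action or "")
    # HTML forms without a method attribute default to GET.
    return SketchRequest(
        url=url,
        method=(method or "get").lower(),
        data=dict(form_data),
    )


form_request = build_form_request(
    "https://example.com/login", "/session", "POST", {"user": "a"}
)
fallback_request = build_form_request("https://example.com/login", None, None, {})
```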
_HTMLSoupFormScraper__get_valid_form_data_elements(soup)
Get all valid form input elements.
Note
An element is valid when the value can be updated client-side and the element has a name attribute.
Parameters: | soup (obj) – The BeautifulSoup form. |
---|---|
Returns: | Soup elements. |
Return type: | list(obj) |
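The validity rule in the note above can be illustrated with the standard library's `html.parser`, used here in place of BeautifulSoup so the sketch has no third-party dependency. `FormFieldCollector` is a hypothetical helper, and treating `input`, `textarea` and `select` as the client-updatable tags is an assumption.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect {name: default value} for named form fields."""

    # Assumption: these tags hold values a client can update.
    FIELD_TAGS = {"input", "textarea", "select"}

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag not in self.FIELD_TAGS:
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # Elements without a name are never submitted, so skip them.
        if name:
            # Assumption: the value attribute (or "") is the default value.
            self.fields[name] = attributes.get("value") or ""


collector = FormFieldCollector()
collector.feed(
    '<form><input name="q" value="nyawc"><input value="x">'
    '<textarea name="bio"></textarea></form>'
)
form_fields = collector.fields
```

The unnamed `<input value="x">` is dropped, which mirrors the requirement that a valid element has a name attribute.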
_HTMLSoupFormScraper__trim_grave_accent(href)
Trim grave accents manually (because BeautifulSoup doesn't support it).
Parameters: | href (str) – The BeautifulSoup href value. |
---|---|
Returns: | The BeautifulSoup href value without grave accents. |
Return type: | str |
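A minimal sketch of such trimming, assuming only one leading and one trailing grave accent need to be removed (the hypothetical `trim_grave_accent` below is not the library's code):

```python
def trim_grave_accent(href):
    """Remove one leading and one trailing grave accent, if present."""
    if href.startswith("`"):
        href = href[1:]
    if href.endswith("`"):
        href = href[:-1]
    return href


trimmed = trim_grave_accent("`/search?q=1`")
```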
derived_get_requests()
Get all the new requests that were found in the response.
Returns: | A list of new requests that were found. |
---|---|
Return type: | list(nyawc.http.Request) |
nyawc.scrapers.HTMLSoupLinkScraper.HTMLSoupLinkScraper(options, queue_item)
Bases: nyawc.scrapers.BaseScraper.BaseScraper
The HTMLSoupLinkScraper finds URLs from href attributes in HTML using BeautifulSoup.
content_types
list(str) – The supported content types.
_HTMLSoupLinkScraper__trim_grave_accent(href)
Trim grave accents manually (because BeautifulSoup doesn’t support it).
Parameters: | href (str) – The BeautifulSoup href value. |
---|---|
Returns: | The BeautifulSoup href value without grave accents. |
Return type: | str |
derived_get_requests()
Get all the new requests that were found in the response.
Returns: | A list of new requests that were found. |
---|---|
Return type: | list(nyawc.http.Request) |
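Collecting URLs from href attributes can be sketched as follows. The standard library's `html.parser` is swapped in for BeautifulSoup to keep the example dependency-free, and `HrefCollector` is a hypothetical name; the real scraper returns request objects, not strings.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect absolute URLs from every href attribute in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Any tag may carry an href (a, link, area, ...).
            if name == "href" and value:
                self.urls.append(urljoin(self.base_url, value))


href_collector = HrefCollector("https://example.com/docs/index.html")
href_collector.feed(
    '<a href="/about">About</a><link href="style.css"><a name="top">Top</a>'
)
href_urls = href_collector.urls
```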
nyawc.scrapers.JSONRegexLinkScraper.JSONRegexLinkScraper(options, queue_item)
Bases: nyawc.scrapers.BaseScraper.BaseScraper
The JSONRegexLinkScraper finds absolute and relative URLs in JSON keys and values.
content_types
list(str) – The supported content types.
__expressions
list(obj) – The regular expressions to execute.
derived_get_requests()
Get all the new requests that were found in the response.
Returns: | A list of new requests that were found. |
---|---|
Return type: | list(nyawc.http.Request) |
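Because the scraper works on the raw text with regular expressions, URLs in keys are found just like URLs in values. The sketch below shows the idea with a hypothetical pattern `JSON_URL` that accepts quoted strings starting with `http://`, `https://` or `/`; the library's own expressions may be broader or stricter.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: a quoted string that starts like a URL or an absolute path.
JSON_URL = re.compile(r'"((?:https?://|/)[^"\s\\]*)"')


def find_json_urls(base_url, json_text):
    """Return absolute URLs for URL-like strings in keys and values."""
    return [urljoin(base_url, m.group(1)) for m in JSON_URL.finditer(json_text)]


body = (
    '{"next": "/api/items?page=2", '
    '"/api/self": {"home": "https://example.com/", "count": "3"}}'
)
json_urls = find_json_urls("https://example.com/api/items", body)
```

This simple pattern does not handle escaped slashes (`\/`), which some JSON encoders emit.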
nyawc.scrapers.XMLRegexLinkScraper.XMLRegexLinkScraper(options, queue_item)
Bases: nyawc.scrapers.BaseScraper.BaseScraper
The XMLRegexLinkScraper finds absolute and relative URLs in XML values.
content_types
list(str) – The supported content types.
__expressions
list(obj) – The regular expressions to execute.
derived_get_requests()
Get all the new requests that were found in the response.
Returns: | A list of new requests that were found. |
---|---|
Return type: | list(nyawc.http.Request) |
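For XML, "values" can be read as the text between tags. A minimal sketch under that assumption, using a hypothetical pattern `XML_URL` and helper `find_xml_urls` (attribute values are deliberately not covered here):

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: element text that starts like a URL or an absolute path.
XML_URL = re.compile(r">\s*((?:https?://|/)[^<\s]*)\s*<")


def find_xml_urls(base_url, xml_text):
    """Return absolute URLs for URL-like element text in xml_text."""
    return [urljoin(base_url, m.group(1)) for m in XML_URL.finditer(xml_text)]


sitemap = (
    "<urlset><url><loc>https://example.com/a</loc></url>"
    "<url><loc> /b </loc></url><n>5</n></urlset>"
)
xml_urls = find_xml_urls("https://example.com/sitemap.xml", sitemap)
```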