Crawling scope¶
How to use scope options¶
# scope_example.py
from nyawc.Options import Options
from nyawc.Crawler import Crawler
from nyawc.http.Request import Request

options = Options()

# Scope settings (shown here with their default values).
options.scope.protocol_must_match = False  # Allow other protocols than the startpoint's (e.g. both http and https).
options.scope.subdomain_must_match = True  # Stay on the startpoint's subdomain.
options.scope.hostname_must_match = True   # Stay on the startpoint's hostname.
options.scope.tld_must_match = True        # Stay on the startpoint's TLD.
options.scope.max_depth = None             # Crawl without a depth limit.

crawler = Crawler(options)
crawler.start_with(Request("https://finnwea.com/"))
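To see which URLs actually end up inside the configured scope, you can combine the scope options with nyawc's crawling callbacks. Below is a minimal sketch that assumes the request_after_finish callback option and simply prints every finished request:

# scope_callbacks_example.py (sketch; assumes nyawc's callback options)
from nyawc.Options import Options
from nyawc.Crawler import Crawler
from nyawc.CrawlerActions import CrawlerActions
from nyawc.http.Request import Request

def cb_request_after_finish(queue, queue_item, new_queue_items):
    # Print every request that finished crawling, so the effect of the scope options is visible.
    print("Crawled: " + queue_item.request.url)
    return CrawlerActions.DO_CONTINUE_CRAWLING

options = Options()
options.scope.subdomain_must_match = True
options.callbacks.request_after_finish = cb_request_after_finish

crawler = Crawler(options)
crawler.start_with(Request("https://finnwea.com/"))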
Available scope options¶
Protocol must match
Only crawl pages with the same protocol as the startpoint (e.g. only https) if True. Default is False.
options.scope.protocol_must_match = False
Subdomain must match
Only crawl pages with the same subdomain as the startpoint if True. If the startpoint is not a subdomain, no subdomains will be crawled. Default is True.
options.scope.subdomain_must_match = True
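For example, a sketch with hypothetical URLs showing how the startpoint determines the subdomain scope:

options.scope.subdomain_must_match = True

# Hypothetical startpoints to illustrate the behaviour:
# https://blog.example.com/ -> only pages on blog.example.com stay in scope.
# https://example.com/      -> pages on example.com stay in scope; blog.example.com is skipped.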
Hostname must match
Only crawl pages with the same hostname as the startpoint (e.g. only finnwea) if True. Default is True.
Please note that if you set this to False, the crawler will most likely never stop, since it will follow links to any host it encounters.
options.scope.hostname_must_match = True
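If you do need to crawl across hostnames, one way to keep the crawl bounded is to combine it with a depth limit (see "Maximum crawling depth" below). A sketch, continuing from the example above:

# Sketch: crawling across hostnames, bounded by a depth limit so it terminates.
options.scope.hostname_must_match = False
options.scope.max_depth = 2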
TLD must match
Only crawl pages with the same TLD as the startpoint (e.g. only .com) if True. Default is True.
options.scope.tld_must_match = True
Maximum crawling depth
The maximum crawling depth. Default is None (unlimited).
- 0 will only crawl the start request.
- 1 will also crawl all requests found on the start request.
- 2 will go one level deeper.
- And so on...
options.scope.max_depth = None
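For example, a sketch of a depth-limited crawl that only visits the startpoint and the requests found on it:

# depth_limited_example.py (sketch)
from nyawc.Options import Options
from nyawc.Crawler import Crawler
from nyawc.http.Request import Request

options = Options()
options.scope.max_depth = 1  # Crawl the start request plus the requests found on it.

crawler = Crawler(options)
crawler.start_with(Request("https://finnwea.com/"))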