# misc_example.py
from nyawc.Options import Options
from nyawc.Crawler import Crawler
from nyawc.http.Request import Request
options = Options()
options.routing.minimum_threshold = 4
options.routing.routes = [
"^(https?:\/\/)?(www\.)?finnwea\.com\/blog\/[^\n \/]+\/$"
]
crawler = Crawler(options)
crawler.start_with(Request("https://finnwea.com/"))
The minimum amount of requests to crawl (matching a certain route) before ignoring the rest. Default is 20.
For example, lets say we have these rquests;
https://finnwea.com/blog/1
https://finnwea.com/blog/2
https://finnwea.com/blog/3
...
https://finnwea.com/blog/54
It will only crawl the first 20 requests. After that it ignores the rest of the blog posts.
Please note that it will probably crawl a bit more than the minimum threshold depending on the maximum amount of threads to use.
options.routing.minimum_threshold = 20
The regular expressions that represent routes that should not be cralwed more times than the minimum treshold. Default is an empty array.
For example the route below represents http://finnwea.com/blog/{a-variable-blog-alias}/
.
options.routing.routes = ["^(https?:\/\/)?(www\.)?finnwea\.com\/blog\/[^\n \/]+\/$"]