I need a web crawler that finds and lists every link on a page, including links hidden behind JavaScript onclick events.
It must:
1) Log the status code of the given URL and of every URL it redirects through. For example, if the given URL redirects to another URL with a 301 status code, I need both the 301 and the 200 of the URL it redirects to.
2) List the URLs in the redirect chain, if there is one.
3) Get all the links on the given page, including ones hidden in onclick divs or by other methods.
4) List the rel attribute, anchor text and image URL for each link, where they exist.
5) Follow redirects triggered by meta redirects or [login to view URL], and list the URLs in the redirect.
6) Run from the command line on a Linux machine. I don't mind which language it is written in, but we need to be able to call it from PHP. Previously we ran HtmlUnit through shell_exec in PHP and captured what was echoed to the command line. Continuing like this is fine.
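To illustrate requirements 1 and 2, here is a minimal sketch in Python that walks a redirect chain one hop at a time and records the status code of each URL. It uses only the standard library. The names `trace_redirects`, `http_fetch` and `max_hops` are made up for this example, and the hop limit of 10 is an assumption, not part of the brief. It handles HTTP-level redirects only; meta and script redirects would need separate handling.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stops urllib from following redirects so each hop can be logged."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response.
        return None


def http_fetch(url):
    """Fetch one URL without following redirects; return (status, Location)."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=15) as resp:
            return resp.status, None
    except urllib.error.HTTPError as err:
        # 3xx, 4xx and 5xx responses all arrive here.
        return err.code, err.headers.get("Location")


def trace_redirects(url, fetch=http_fetch, max_hops=10):
    """Return a list of {"url", "status"} dicts, one per hop in the chain."""
    chain = []
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append({"url": url, "status": status})
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return chain
```

For requirement 6, a small wrapper script (hypothetically `crawl.py`) could print `json.dumps(trace_redirects(sys.argv[1]))`, and the PHP side could read it with `json_decode(shell_exec('python3 crawl.py ' . escapeshellarg($url)), true)`, which keeps the existing shell_exec workflow.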
We had some luck with HtmlUnit, but we do not have enough experience with it to meet all of these requirements.
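For requirements 3 to 5, the sketch below shows one way to pull links, rel values, anchor text, image URLs and a meta refresh target out of static HTML with Python's standard `html.parser`. The names `LinkCollector` and `extract_links` are invented for this example. It only catches onclick URLs that appear as quoted string literals in the attribute, which is an assumption about how the links are hidden. Links built at runtime by scripts would still need a real browser engine such as HtmlUnit or a headless browser.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# A quoted absolute or root-relative URL inside an onclick handler.
ONCLICK_URL = re.compile(r"""['"]((?:https?://|/)[^'"]+)['"]""")
# The url=... part of a meta refresh content attribute.
META_URL = re.compile(r"url\s*=\s*['\"]?([^'\";]+)", re.I)


class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.meta_refresh = None
        self._open = None  # the <a> element currently being read, if any

    def _add(self, url, rel, source):
        link = {"url": urljoin(self.base_url, url), "rel": rel,
                "text": "", "image": None, "source": source}
        self.links.append(link)
        return link

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            self._open = self._add(a["href"], a.get("rel"), "href")
        elif tag == "img" and self._open is not None and a.get("src"):
            self._open["image"] = urljoin(self.base_url, a["src"])
        elif tag == "meta" and (a.get("http-equiv") or "").lower() == "refresh":
            m = META_URL.search(a.get("content") or "")
            if m:
                self.meta_refresh = urljoin(self.base_url, m.group(1).strip())
        # Any element, not just <div>, may carry an onclick handler.
        for m in ONCLICK_URL.finditer(a.get("onclick") or ""):
            self._add(m.group(1), None, "onclick")

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self._open["text"] = self._open["text"].strip()
            self._open = None


def extract_links(html, base_url):
    """Return the links and any meta refresh target found in static HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return {"links": parser.links, "meta_refresh": parser.meta_refresh}
```

Each link records where it was found (`"href"` or `"onclick"`), so the two kinds can be listed separately in the output.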
I have extensive experience writing web automation software; please see the PMB for examples of my previous web automation projects. I am available to start immediately and can finish as soon as possible.
Best Regards,
Zeke
£500 GBP in 10 days
4.9 (20 reviews)
8 freelancers are bidding on average £390 GBP for this job