A set of crawlers, based on the Scrapy framework, that download and synchronize all product firmware (including all versions) from the web pages of a predefined list of vendors and store the firmware metadata in an SQLite database. The mandatory metadata fields are Manufacturer, Model, Version, Type, Name, Release Date (if available), Download link, and the calculated SHA2 hash of the file, e.g. (Cisco, Video Surveillance 6030 IP Camera, 2.7.0, IP Camera, [login to view URL], 21/08/2015, "link", "SHA2"). There is also a non-mandatory boolean field indicating whether the device is discontinued, depending on the availability of that information on the vendor's website. The firmware files themselves will be stored in the file system and referenced from SQLite. The developer is required to follow the DB schema and code templates provided by us. It is also the developer's responsibility to test each crawler and ensure completeness of the solution in terms of full coverage of the firmware files and product pages.
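To make the storage requirement concrete, here is a minimal sketch of how a crawler could compute the SHA2 (SHA-256) hash of a downloaded firmware file and record one metadata row. The table layout below is only illustrative; the actual schema and code templates are provided by us and take precedence.

```python
import hashlib
import sqlite3

# Illustrative schema only -- the real schema comes from the provided templates.
SCHEMA = """
CREATE TABLE IF NOT EXISTS firmware (
    id INTEGER PRIMARY KEY,
    manufacturer TEXT NOT NULL,
    model TEXT NOT NULL,
    version TEXT NOT NULL,
    type TEXT NOT NULL,
    name TEXT NOT NULL,
    release_date TEXT,          -- may be unavailable on some vendor pages
    download_link TEXT NOT NULL,
    sha2 TEXT NOT NULL UNIQUE,  -- SHA-256 of the downloaded file
    discontinued INTEGER,       -- optional 0/1 flag, NULL if unknown
    file_path TEXT NOT NULL     -- firmware binary lives on the file system
)
"""

def sha2_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large firmware images do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def store_metadata(conn: sqlite3.Connection, row: dict) -> None:
    """Insert one firmware metadata record; `row` keys mirror the columns."""
    conn.execute(
        "INSERT INTO firmware (manufacturer, model, version, type, name, "
        "release_date, download_link, sha2, discontinued, file_path) "
        "VALUES (:manufacturer, :model, :version, :type, :name, "
        ":release_date, :download_link, :sha2, :discontinued, :file_path)",
        row,
    )
    conn.commit()
```

The UNIQUE constraint on the hash column is one simple way to guarantee the same binary is never stored twice, even if two product pages link to it.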
There are no GUI components on the server that runs the crawlers, so headless browsing mode must be used.
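One way to get headless rendering inside Scrapy is the scrapy-playwright plugin; the settings fragment below is a sketch under that assumption (scrapy-selenium with a headless Chrome driver is an equivalent choice).

```python
# settings.py sketch -- assumes the scrapy-playwright plugin is installed.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}  # no GUI on the server
```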
1. Crawlers will be written per vendor. This is required because each vendor website will have its own implementation of the firmware download page.
2. The user should be able to pause and resume crawling jobs.
3. Crawlers should detect previously downloaded files and download only new and updated content and firmware files. On the first execution of each crawler, all available firmware files will be downloaded; subsequent runs will download only firmware files added since the last crawl. This is achieved by analysing the data in SQLite and skipping files that have already been downloaded and processed.
4. The developer is required to manually analyze each provided vendor site before writing its crawler, to identify the following required information:
a. URLs for the firmware download page including all of the firmware versions for each product
b. URLs/files for each product containing the information to be scraped: "Manufacturer", "Model", "Version", "Type", "Release Date", "whether the product is discontinued"
c. Credential Requirements (Simple Signups, Specific Signups, No Signups)
d. Any Captcha on the page
e. Any honeypot traps
5. If a vendor site requires credentials for firmware download, the developer is required to sign up for an account using a Gmail address dedicated to this project.
6. The script will try to imitate human-like behaviour (to a limit) while scraping, and will use Tor if required, so that any scraper/crawler detection logic on the vendor site can be bypassed. This can be achieved by adding random delays and random view times, and by avoiding honeypot traps identified during the manual analysis.
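The pause/resume and incremental-download requirements above map naturally onto Scrapy's persistent job queue plus a look-up against the SQLite metadata before each download. A minimal sketch, where the `firmware` table and its `download_link` column are assumed names from the provided schema:

```python
import sqlite3

# Pause/resume: run each vendor crawler with a persistent job directory, e.g.
#   scrapy crawl vendor_spider -s JOBDIR=crawls/vendor_spider
# (vendor_spider is a placeholder name). Interrupting the process pauses the
# job; re-running the same command resumes the pending request queue.

def already_downloaded(conn: sqlite3.Connection, download_link: str) -> bool:
    """Skip firmware whose download link is already recorded in SQLite."""
    cur = conn.execute(
        "SELECT 1 FROM firmware WHERE download_link = ? LIMIT 1",
        (download_link,),
    )
    return cur.fetchone() is not None
```

Inside a spider, each candidate firmware URL would be guarded with `if not already_downloaded(conn, url): yield scrapy.Request(url, ...)`, so the first run downloads everything and later runs fetch only new entries.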
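For the human-like pacing, Scrapy's built-in throttling settings already provide randomized delays; the fragment below sketches a reasonable starting point (the delay values and the local Tor proxy port are assumptions, not project requirements).

```python
import random

# settings.py sketch: RANDOMIZE_DOWNLOAD_DELAY makes Scrapy wait between
# 0.5x and 1.5x DOWNLOAD_DELAY before each request.
DOWNLOAD_DELAY = 5                   # assumed base delay in seconds
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True          # adapt pacing to server response times
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per vendor

def randomized_delay(base: float = DOWNLOAD_DELAY) -> float:
    """Same 0.5x-1.5x jitter Scrapy applies; usable for extra 'view time'."""
    return base * random.uniform(0.5, 1.5)

# Routing through Tor (when required) is typically done via an HTTP-to-SOCKS
# bridge; the port below assumes a local privoxy instance and is illustrative.
# HTTP_PROXY = "http://127.0.0.1:8118"
```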
*The crawler set is expected to cover 100 vendors (each vendor may differ substantially from the others). Milestones are defined per vendor; each milestone is a maximum of €50, paid after we verify the completeness of that crawler and see no errors. The developer MUST test the completeness of each crawler before delivery and present test-completion evidence in the form of a populated SQLite database for that vendor.
*The NDA must be signed before the beginning of the project.
*Please apply only if you have fully read and understood the project and agree to the conditions.