The project is to create a piece of software that saves the full contents of web pages (including FRAMES) for any given list of URLs. In short, the software should do the following:
• After execution, it should ask the user to paste a list of URLs from Excel
• For each URL, it should save the full contents (including content of all FRAMES) of the page located at that URL into a separate folder on the hard drive
Now the full details:
• The software has to be Windows-based
• It can be written using any programming language
• For this reason, it might be easier (or might not be – we don’t know the best way, and this is just an option to consider) to create this software as a Google Chrome extension or a Mozilla Firefox add-on, because both Chrome and Firefox can save the full contents of pages as they display them – with frames, images, etc. (Chrome’s default “Save As” does this, while Firefox relies on a separate add-on, “Mozilla Archive Format”, to save pages “faithfully”). However, we are not sure whether Chrome and Firefox extensions have any disk-write APIs, so this might not work. For your own testing purposes, it might be a good idea to compare your results with the way Chrome saves pages.
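We have not settled on a saving mechanism; purely to illustrate what “including FRAMES” involves outside a browser, here is a minimal Python sketch (all names are our own, not part of any requirement) that fetches a page and the documents its <frame>/<iframe> tags point to. It deliberately ignores images, CSS, and scripts, which a browser’s “Save As” would also capture:

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class FrameFinder(HTMLParser):
    """Collect the src URLs of <frame> and <iframe> tags in a page."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def save_with_frames(url, folder):
    """Save a page plus the documents its frames point to.
    NOTE: a rough sketch only -- unlike a browser's "Save As",
    it does not capture images, stylesheets, or scripts."""
    os.makedirs(folder, exist_ok=True)
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    with open(os.path.join(folder, "index.html"), "w", encoding="utf-8") as f:
        f.write(html)
    finder = FrameFinder()
    finder.feed(html)
    for i, src in enumerate(finder.sources):
        # Frame src attributes may be relative, so resolve against the page URL.
        frame_data = urllib.request.urlopen(urljoin(url, src)).read()
        with open(os.path.join(folder, f"frame{i}.html"), "wb") as f:
            f.write(frame_data)
```

Comparing the output of something like this against Chrome’s “Save As” for the same URL would make the gap (and the appeal of the extension route) concrete.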
• The software must have the following adjustable parameters:
- Minimum pause between processing one URL and the next (in seconds) – MIN_WAIT
- Maximum pause between processing one URL and the next (in seconds) – MAX_WAIT
- Download folder (folder on the hard drive)
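These three parameters could be kept in a small, persisted settings file so the previous download-folder choice is remembered between runs (as required below). A sketch in Python, where the file name and key names are our own illustrative assumptions:

```python
import json
import os

# Illustrative location and defaults; none of these names are mandated by the spec.
CONFIG_FILE = os.path.join(os.path.expanduser("~"), ".url_saver_config.json")

DEFAULTS = {
    "min_wait": 2,       # MIN_WAIT, in seconds
    "max_wait": 10,      # MAX_WAIT, in seconds
    "download_folder": os.path.join(os.path.expanduser("~"), "Downloads"),
}

def load_config():
    """Load saved settings, falling back to defaults for anything missing,
    so the previously chosen download folder is remembered across runs."""
    try:
        with open(CONFIG_FILE, encoding="utf-8") as f:
            saved = json.load(f)
    except (OSError, ValueError):
        saved = {}
    return {**DEFAULTS, **saved}

def save_config(cfg):
    """Persist the current settings for the next run."""
    with open(CONFIG_FILE, "w", encoding="utf-8") as f:
        json.dump(cfg, f)
```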
• This is how the software should work:
- User starts the software
- The software asks for a list of URLs
- It should be capable of accepting lists of up to 10,000 URLs
- We need to make input easy. We produce links in Excel, so we should simply select a range of cells with URLs (in one column), copy them and paste them into the software.
- Then we should be able to set the two pause parameters – MIN_WAIT and MAX_WAIT – the minimum and maximum pause between finishing one URL and moving on to the next. For example, MIN_WAIT = 2 sec, MAX_WAIT = 10 sec. For each URL it is about to load, the software should wait a random number of seconds between MIN_WAIT and MAX_WAIT before attempting to open and save it.
- Then we should be able to select the download folder. The software should remember the previous choice as the default.
- Then we should hit a “start” button and for each URL the software should do the following:
a) Create a new folder for the contents of this URL within the download folder. The individual folder’s name should follow the format “YYYY-MM-DD-HH-MM-SS” – i.e., the time of its creation.
b) Save all contents of this URL into this individual folder.
c) Add a line to the program log (see below).
d) Generate a random number of seconds between MIN_WAIT and MAX_WAIT and wait that number of seconds before moving on to the next URL.
e) Logging. The software should maintain a log file (a text file) of all URLs that have been processed. For each URL it should append one line of text in the following format: “YYYY-MM-DD-HH-MM-SS: URL” – the timestamp should be the same as the timestamp in the folder name for that URL.
- The software must be able to work “quietly” – either in the system tray or (if part of a browser) in the taskbar. Basically, it shouldn’t pop up a window for each URL or anything like that – the user should be able to use the PC for other tasks while the software is running.
- Finally, the software should display a progress line, for example: “120 of 1500 URLs processed”.
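The per-URL steps (a–e), the random pause, and the progress line could be sketched as a single loop. In this Python sketch, `save_page` is a placeholder for whatever saving mechanism is ultimately chosen, and all other names are our own assumptions:

```python
import os
import random
import time
from datetime import datetime

def process_urls(urls, download_folder, min_wait, max_wait,
                 save_page, report_progress=print):
    """Process each URL per steps (a)-(e): timestamped folder, save,
    log line, random pause. `save_page(url, folder)` stands in for the
    actual page-saving mechanism; `report_progress` updates the UI."""
    log_path = os.path.join(download_folder, "log.txt")
    total = len(urls)
    for i, url in enumerate(urls, start=1):
        stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
        folder = os.path.join(download_folder, stamp)         # step (a)
        os.makedirs(folder, exist_ok=True)
        save_page(url, folder)                                # step (b)
        with open(log_path, "a", encoding="utf-8") as log:    # steps (c), (e)
            log.write(f"{stamp}: {url}\n")                    # same stamp as folder
        report_progress(f"{i} of {total} URLs processed")
        if i < total:                                         # step (d)
            time.sleep(random.uniform(min_wait, max_wait))
```

One detail this sketch surfaces: with a one-second timestamp resolution, two URLs processed within the same second would map to the same folder name, so the final implementation may need to disambiguate (e.g., by appending a counter).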
Please see more details in the attached file.