Basic Image Crawler Tool

$500-5000 USD

Cancelled

Posted over 13 years ago

Paid on delivery

**Background**: We aggregate info about events, venues and artists. As part of this info we have images. Before using any image, we must verify that it was taken from an official site. The image we have will be called a “hint”, and our goal is to find the original. We need a command line tool that would crawl a set of URLs up to a certain depth and download all of the images. As it downloads the images, it would compare them against the “hint” image we have to see if we have a match. The comparison part is taken care of by a separate command line tool, and is not part of this project.

**Input**: The tool should take the following arguments as input:
- path of the hint image
- table name (event, venue or performer)
- record id: the id of the event, venue or performer
- maximum depth to crawl for the images
- expire time: how long ago we must have crawled a url in order to crawl it again
- list of URLs on where to search for the image (via STDIN)

**Crawling:** The tool will receive as input a list of possible URLs on where to look for the image, as well as a maximum depth to search. The tool should crawl the URLs breadth-first. This means it should first search all of the URLs provided, then search the next depth for all of the URLs provided, then the next, and so on. The tool should only follow links within the same domain name as the starting URL, i.e., never follow external links to other websites. The logic should be:
1) Start with the provided list of URLs (all of these are considered depth 0). If a url is already in the image_crawler_urls table, only crawl it again if its last crawl is older than the expire time.
2) Load and parse the first page and add it to the image_crawler_urls table (if not there already).
3) For every image found in the page, download the image, add it to the image_crawler_images table (if not there already) and compare it against the hint. If we have a match there is no need to go further.
4) If the current depth is less than the maximum crawl depth, extract all the links from the page and append them to the URL list.
5) Continue to the next URL in the list.

## Deliverables

**DB Structure:** The system will need to maintain a database of all the visited URLs and images so that it will not crawl the same sites and images multiple times. The following tables need to be maintained:

Table Name: image_crawler_urls
- url: the actual url crawled; this will be the primary key
- start_url: the url which we started crawling from; can be left blank if this is the start url
- depth: the depth we went to get to this url, 0 if it is the start url
- first_crawled: date/time the url was first crawled
- last_crawled: date/time the url was most recently crawled

Table Name: image_crawler_images
- image_id: unique ID of the image and primary key
- url: the full url of the image
- parent_url: the url where the image was found (from the image_crawler_urls table)
- image_path: the local directory path of the image file

**Directory Structure:** All of the hints will come from “/image_data/hints/XX/XX/image_name” and the full path will be provided to the crawler tool. When crawling, the images should all be downloaded to “/image_data/download/XX/XX/[login to view URL]”. Do not use the original image name, as it may contain illegal characters or spaces and may not be unique. The “XX/XX” should be the last four digits of the image id; this makes sure that there aren't too many files in each directory and that images are distributed evenly. For example, if the image ID is 418901 and the image was a jpg, the image path would look like: “/image_data/download/89/01/[login to view URL]”. Make sure to always keep the extension of the originally downloaded image; it should be jpg, gif or png.
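As an illustration of the directory rule above, here is a minimal Python sketch that derives the download path from an image id. The file name is not visible in the posting, so naming the file `<image_id>.<extension>` is an assumption, as are the function name `download_path` and the zero-padding of ids shorter than four digits.

```python
from pathlib import PurePosixPath

DOWNLOAD_ROOT = PurePosixPath("/image_data/download")
ALLOWED_EXTENSIONS = {"jpg", "gif", "png"}  # the three types named in the brief


def download_path(image_id: int, extension: str) -> str:
    """Build the sharded download path for an image id (hypothetical helper)."""
    if image_id < 0:
        raise ValueError("image_id must be non-negative")
    ext = extension.lower().lstrip(".")
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {extension!r}")
    # Last four digits of the id; ids below 1000 are zero-padded (assumption).
    digits = f"{image_id:04d}"[-4:]
    # Assumption: the file itself is named after the image id.
    file_name = f"{image_id}.{ext}"
    return str(DOWNLOAD_ROOT / digits[:2] / digits[2:] / file_name)


print(download_path(418901, "jpg"))
```

For the id 418901 from the example, the directory part comes out as `/image_data/download/89/01/`, matching the brief.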
**Image Comparison:** For image comparison you will be using a simple command line tool. As input it takes the name of the hint and the name of the image to compare it to, such as:

`compareimage /image_data/hints/XX/XX/[login to view URL] /image_data/download/XX/XX/[login to view URL]`

If the images do not match, the command will return false. If they do match, it will return true, generate a new cropped version of the image to be used, and output the file name and specs.

**Output:** This tool should either return false if no matching results were found or return true if a valid image was found.
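The breadth-first order and same-domain rule described under Crawling can be sketched roughly as follows. This is only an outline under stated assumptions: `fetch_page` and `image_matches` are hypothetical callables standing in for page download/parsing and for the download-plus-`compareimage` step, "same domain" is simplified to an exact host comparison, and the database tables and expire-time check are left out.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple
from urllib.parse import urljoin, urlparse

# Hypothetical hooks supplied by the caller (not part of the brief):
FetchPage = Callable[[str], Tuple[List[str], List[str]]]  # url -> (links, image urls)
ImageMatches = Callable[[str], bool]                      # image url -> matched hint?


def crawl(start_urls: Iterable[str], max_depth: int,
          fetch_page: FetchPage, image_matches: ImageMatches) -> bool:
    """Breadth-first crawl; returns True as soon as an image matches the hint."""
    queue = deque()
    seen = set()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            # Every start URL is depth 0 and pins its own domain.
            queue.append((url, 0, urlparse(url).netloc))

    while queue:
        # FIFO order means all depth-0 pages are handled before any depth-1 page.
        url, depth, domain = queue.popleft()
        links, images = fetch_page(url)
        for image_url in images:
            if image_matches(urljoin(url, image_url)):
                return True  # match found, no need to go further
        if depth < max_depth:
            for link in links:
                absolute = urljoin(url, link)
                # Simplification: "same domain" means the same host[:port].
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1, domain))
    return False
```

Passing the page fetcher and the matcher in as functions keeps the traversal logic separate from networking and from the external comparison tool, so the ordering can be checked against a small in-memory site before any real crawling is wired in.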
