The project's goal is to develop a focused web crawler that produces categorizable web links to be structured within an open source database
€1500-3000 EUR
Closed
Posted over 10 years ago
Paid on delivery
Project Summary:
The project's goal is to develop a focused web crawler that produces categorizable web links to be structured within an open source database. The crawler should be language-independent and give users a high degree of flexibility in choosing both the sources and the keyword combinations to be crawled each day.
From an IT perspective, this could mean programming a "focused web crawler" that searches specific domains (mostly news and industry-specific sources), indexes the content of the resulting pages, and filters that content with an intelligent text-search algorithm that takes a given selection of keyword combinations into account. We are open to discussing other ways of realizing the project if the freelancer can convincingly argue for a better, easier, or more cost-efficient methodology.
A typical daily crawl for one language could involve about 500 sources and about 200 keyword combinations. We would expect the crawler to find about 5-50 new results (links) in each such daily crawl. The resulting links and metadata (such as frequency of keywords found, date, source, and MIME type) should then be stored in a database for further analysis.
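To make the filtering step above concrete, here is a minimal Python sketch of one possible reading of it. It assumes that a "keyword combination" means every keyword in the combination must occur in the page text, and that sources are restricted by domain name. The function names (`tokenize`, `in_allowed_sources`, `match_combinations`) are hypothetical and only illustrate the idea; they are not part of the posting, and a Lucene/Solr-based solution would replace this logic with index queries.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def in_allowed_sources(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def match_combinations(text, combinations):
    """Return {combination: {keyword: count}} for every combination
    whose keywords ALL occur in the text (assumed AND semantics)."""
    counts = Counter(tokenize(text))
    hits = {}
    for combo in combinations:
        freqs = {kw: counts[kw.lower()] for kw in combo}
        if all(freqs.values()):  # a zero count means a keyword is missing
            hits[combo] = freqs
    return hits


# Example data, invented for illustration only.
page_text = "Solar energy prices fall as solar capacity grows"
combos = [("solar", "energy"), ("wind", "energy")]
matches = match_combinations(page_text, combos)
```

This sketch only handles single-word keywords and languages that separate words with whitespace; multi-word phrases, stemming, and languages such as Chinese or Japanese would need a proper analyzer.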
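As an illustration of the storage step, the following Python sketch records each result link with its metadata and skips links that were already stored, so that only new results are counted per crawl. It uses SQLite purely as an example of an open source database. The table layout, column names, and the helpers `open_store` and `store_result` are assumptions made for this sketch, not requirements from the posting.

```python
import json
import sqlite3

# Assumed schema: one row per link, keyword frequencies stored as JSON text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    url                 TEXT PRIMARY KEY,
    source              TEXT NOT NULL,
    crawl_date          TEXT NOT NULL,
    mime_type           TEXT,
    keyword_frequencies TEXT NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the result database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_result(conn, url, source, crawl_date, mime_type, keyword_frequencies):
    """Insert a result; return True if the link is new, False if already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO results VALUES (?, ?, ?, ?, ?)",
        (url, source, crawl_date, mime_type, json.dumps(keyword_frequencies)),
    )
    conn.commit()
    return cur.rowcount == 1


# Example data, invented for illustration only.
conn = open_store()
first = store_result(conn, "https://news.example.com/a", "news.example.com",
                     "2014-01-15", "text/html", {"solar": 2, "energy": 1})
second = store_result(conn, "https://news.example.com/a", "news.example.com",
                      "2014-01-16", "text/html", {"solar": 2, "energy": 1})
```

Using the URL as the primary key is one simple way to detect duplicates; a real deployment might normalize URLs first or key on a content hash instead.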
Required capabilities:
• Experience in Python as the preferred programming language, alternatively Java
• Experience in Lucene/Solr/Nutch as the preferred frameworks and technologies. Alternative search technologies may potentially be considered.
• Experience with the necessary open source databases for the input data (keyword combinations, web sources) and the output data (links, metadata)
Contracting:
• The project's time frame is estimated at around 4 weeks: 12 days for developing the application and 8 days for testing and modifications.
• The proposed fee ranges from €1,500 to €3,000, depending on the candidate's experience. Part of the fee will also depend on the quality and completeness of the resulting links.
• The IP rights and the entire code of the final product will stay with the customer. During the testing phase, the customer should have full access to the final test version, without any limitations.
If interested, please write an e-mail to: [REMOVED BY FREELANCER.COM ADMIN] with your comments and conditions.
Thank you
Hello. We have a senior software developer who can develop the web crawler you need. Please check the PM to see our developer's CV and the reasons to choose us. Looking forward to your reply.
I'm very interested in this project. I'm a telecommunications engineer currently doing a PhD in complex systems sciences. I'm very proficient in maths, algorithms, data mining and statistics. I'm a good programmer in Java and Python, and also in Matlab/Octave and R. I have a lot of experience with database projects. Check my profile to see the work I've done.