HTML Scraping Tool

Completed Posted Mar 16, 2009 Paid on delivery

Looking for a developer who can create an HTML scraping tool to collect keyword research data. The application does not require any sort of GUI. I envision only needing to have a list of keyword phrases in a simple text file (for example, a .txt file). The tool can be a Windows desktop application (run from the Windows command prompt or run using a GUI), or it can run as a script (PHP or other) on a third-party hosted Apache server. I am open to either type of solution, and am most interested in whichever solution is most reliable.

Please include the following information along with your bid:

1. Estimated time needed to complete project (just looking for a rough estimate, to ensure that my expectations aren’t unrealistic).

2. Coding language to be used.

3. General explanation of how the application will be structured and will function (nothing too detailed here, just trying to get an idea of how the app will work).

4. Any potential issues that could be problematic with the successful development of this [url removed, login to view] you for your time and response to this RFP.

## Deliverables

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.

2) Deliverables must be in ready-to-run condition (depending on the nature of the deliverables).

3) Functional Requirements: The application will submit a list of keyword phrases (at least 100 keyword phrases at a time) to the Google and Yahoo! search engines and perform the following:

3a) Scrape pertinent keyword data (see detailed requirements below)

3b) Output all data for each keyword phrase into a spreadsheet format (csv or xls format -- output format shown below).

4) Google Data Requirements: The application will perform a broad match search in Google and then scrape (and calculate) the following information from the top 10 search results on the Google SERP:

--- The URLs of the top ten listings on the SERP

--- Number of Sponsored Links (top)

--- Number of Sponsored Links (side)

--- Minimum PageRank among the top ten results

--- Maximum PageRank among the top ten results

--- Average PageRank of the top ten results

--- Strict match: (number of times the exact keyword search phrase appears in the top ten result titles) x (number of words in the keyword search phrase) / (total number of words in all titles for the top ten results)

--- Loose match: (number of times any part of the keyword phrase appears in the top ten result titles) / (total number of words in all titles for the top ten results)
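As a rough illustration of the strict and loose match formulas in item 4, here is a minimal Python sketch (Python is used only for illustration; the RFP leaves the language open). It assumes titles are compared case-insensitively as alphanumeric word tokens, and it reads "any part of the keyword phrase" as "any single word of the phrase". Both are interpretations, not confirmed requirements, and the function names are hypothetical.

```python
import re


def _words(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_scores(phrase, titles):
    """Return (strict, loose) match ratios for one phrase over SERP titles.

    strict = exact phrase occurrences * words in phrase / total title words
    loose  = title words matching any phrase word / total title words
    (the "any single word" reading of loose match is an assumption)
    """
    phrase_words = _words(phrase)
    n = len(phrase_words)
    if n == 0:
        return 0.0, 0.0

    vocab = set(phrase_words)
    total_words = 0
    exact_hits = 0
    loose_hits = 0
    for title in titles:
        title_words = _words(title)
        total_words += len(title_words)
        # Count in-order occurrences of the whole phrase in this title.
        for i in range(len(title_words) - n + 1):
            if title_words[i:i + n] == phrase_words:
                exact_hits += 1
        # Count title words that equal any word of the phrase.
        loose_hits += sum(1 for w in title_words if w in vocab)

    if total_words == 0:
        return 0.0, 0.0
    return exact_hits * n / total_words, loose_hits / total_words
```

For example, for the phrase "red shoes" and the titles "Cheap Red Shoes Online", "Red Shoes for Women" and "Buy Shoes" (10 title words in total), the exact phrase appears twice, so strict match is 2 x 2 / 10, and five title words match a phrase word, so loose match is 5 / 10.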

5) The tool will need to perform two additional search queries in Google for each keyword search phrase and retrieve the total number of results from Google. For example, Google displays “Results 1-10 of about 2,340,000 for ‘keyword phrase’”. The tool must scrape only the actual number: 2,340,000.

5a) First Query - Phrase competition. Google query operator & format: allinanchor:"kw search phrase"

5b) Second Query - Keyword competition. Google query operator & format: allinanchor:kw search phrase

Yahoo! Data Requirements

6) Using the SERP information from the Google broad match queries, the tool will need to perform two queries in Yahoo for each of the top ten URLs found in the Google SERP (from the initial broad match query).

6a) Yahoo! Domain Links (Y! links)

6a1) Yahoo query operator & format: linkdomain:http://URL

6a2) Scrape the ‘inlinks’ number using the following parameters:

--- Show links: ‘From All Pages’

--- to: ‘Entire Site’

6b) Yahoo! PageLinks (Y! page links)

6b1) Yahoo query operator & format: link:http://page URL

6b2) Scrape the ‘inlinks’ number using the following parameters:

--- Show links: ‘From All Pages’

--- to: ‘Only This URL’
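The query formats in items 4 through 6, and the result-count scraping in item 5, could be sketched roughly as follows. This is only an illustration in Python, not a proposed implementation: it builds query strings and parses text, and makes no network requests. The function names are hypothetical, the count pattern assumes the exact "of about N" wording quoted in item 5 (the real page markup may differ and can change), and reducing a result URL to scheme plus host for the linkdomain: query is an assumption about what "http://URL" means in 6a1. The "Show links" and "to" options in 6a2 and 6b2 are not modelled here.

```python
import re
from urllib.parse import urlsplit


def google_queries(keyword):
    """Build the three Google query strings needed per keyword phrase."""
    return {
        "broad": keyword,                                   # item 4
        "phrase_competition": f'allinanchor:"{keyword}"',   # item 5a
        "keyword_competition": f"allinanchor:{keyword}",    # item 5b
    }


def yahoo_queries(page_url):
    """Build the two Yahoo! query strings for one Google result URL."""
    parts = urlsplit(page_url)
    # Assumption: the domain-level query uses only scheme and host.
    site = f"{parts.scheme}://{parts.netloc}"
    return {
        "domain_links": f"linkdomain:{site}",   # item 6a1
        "page_links": f"link:{page_url}",       # item 6b1
    }


# Assumed wording, taken from the example in item 5.
_COUNT_RE = re.compile(r"of about\s+([\d,]+)")


def parse_result_count(text):
    """Extract the total result count as an int, or None if not found."""
    match = _COUNT_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

Keeping the patterns and query templates in one place like this is one way to make later changes easier if the SERPs change how they display results.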

7) The min, max and average need to be calculated and be included in the output for each set of 10 Yahoo queries. The output data will include the following:

7a) Average Yahoo! Page Links

7b) Minimum Yahoo! Page Links

7c) Maximum Yahoo! Page Links

7d) Average Yahoo! Domain Links

7e) Minimum Yahoo! Domain Links

7f) Maximum Yahoo! Domain Links

Data Output and Format

8) The output from the tool needs to be in spreadsheet format. An example of the output is attached.
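To make the min, max and average calculations and the spreadsheet output a little more concrete, here is a small Python sketch that aggregates the per-result values for one keyword and writes CSV text. The attached example output is not reproduced in this posting, so the column names and their order below are purely hypothetical placeholders, as are the function names.

```python
import csv
import io

# Hypothetical column names; the real layout is defined by the attachment.
COLUMNS = [
    "keyword",
    "pr_min", "pr_max", "pr_avg",
    "y_page_links_avg", "y_page_links_min", "y_page_links_max",
    "y_domain_links_avg", "y_domain_links_min", "y_domain_links_max",
]


def summarize(values):
    """Return (minimum, maximum, average) for a non-empty list of numbers."""
    return min(values), max(values), sum(values) / len(values)


def build_row(keyword, pageranks, page_links, domain_links):
    """Aggregate the per-result values for one keyword into an output row."""
    pr_min, pr_max, pr_avg = summarize(pageranks)
    pl_min, pl_max, pl_avg = summarize(page_links)
    dl_min, dl_max, dl_avg = summarize(domain_links)
    return {
        "keyword": keyword,
        "pr_min": pr_min, "pr_max": pr_max, "pr_avg": pr_avg,
        "y_page_links_avg": pl_avg,
        "y_page_links_min": pl_min,
        "y_page_links_max": pl_max,
        "y_domain_links_avg": dl_avg,
        "y_domain_links_min": dl_min,
        "y_domain_links_max": dl_max,
    }


def to_csv(rows):
    """Render a list of row dicts as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The remaining fields from item 4 (result URLs, sponsored link counts, strict and loose match) would be added as further columns in the same way.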

All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).

* * *

This broadcast message was sent to all bidders on Tuesday Mar 17, 2009 1:55:38 PM:

Thank you to everyone so far who has submitted a bid and/or comment. Below is a little more information regarding my expectations of the tool. FYI - I've made a few changes to the original RFP to be more clear about the expectations, and make the requirements easier to read. Please note that I have NOT changed any of the functional or data requirements of the tool. Please feel free to re-adjust your bid if necessary.

I don't expect that I will exceed Google's limit of 1000 queries per day (per IP address). By this standard, and given that three Google queries will need to occur for each keyword, I would think that the max number of keywords I can run through the tool per day will be approximately 333 (3 queries x 333 keywords = 999 queries). I don't expect that this tool will be used more than once or twice per week. I am hoping that the tool can run for 200-400 keywords during each use. I do understand that applications can get caught in loops when going through the same loop(s) many times, which is why I listed a 100 keyword per batch requirement. If a 100 keyword batch is too much (or a larger batch can be accommodated), I am open to reasonable suggestions.

There have been a number of comments suggesting that PHP may not be the appropriate scripting language for this type of application. I am somewhat familiar with PHP, which is why I suggested using this scripting language. However, I am open to any language (scripting or other) that will produce a reliable tool and be flexible enough in the code to make future changes if/when the SERPs change how they display results.

I am also flexible regarding platform. I am open to the tool being a Windows desktop app, or an executable file run from the Windows command prompt, OR I am also open to the tool being run as a script on a third-party Apache server (i.e., a web hosting company's server). Again, I am most interested in whichever solution is more reliable and will be easier to make changes to the code if/when the SERP displays change.

To be more clear about the format of the output, I've attached an actual example to the bid request of what I would like to see from the tool. Lastly, I am hoping that you all could provide me with the estimated time needed to complete the project. I am aware that software development can run into unexpected issues, but I'm just trying to get a feel of how long something like this app would take to develop.

Thanks again for your interest in my project and your insightful questions/comments. I would like to get this project started within the next week or two.
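The query-budget arithmetic and the 100-keyword batching described in this message can be sketched as below. This is only an illustration in Python: the 1000-queries-per-day limit and the three Google queries per keyword are the figures stated above, while the function names and the idea of reading the keyword file as a list of lines are assumptions.

```python
GOOGLE_DAILY_LIMIT = 1000         # per IP address, as stated in the message
GOOGLE_QUERIES_PER_KEYWORD = 3    # broad match + two allinanchor queries


def max_keywords_per_day(limit=GOOGLE_DAILY_LIMIT,
                         per_keyword=GOOGLE_QUERIES_PER_KEYWORD):
    """Largest keyword count whose Google queries stay within the limit."""
    return limit // per_keyword


def load_keywords(lines):
    """Strip whitespace and drop blank lines from the keyword file's lines."""
    return [line.strip() for line in lines if line.strip()]


def batches(keywords, size=100):
    """Yield consecutive slices of at most `size` keywords."""
    for start in range(0, len(keywords), size):
        yield keywords[start:start + size]
```

With these figures, `max_keywords_per_day()` gives 333, and a run of 250 keywords would be processed as batches of 100, 100 and 50. In practice the lines would come from something like `open("keywords.txt")`, where the file name is a placeholder.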

## Platform

The application does not require any sort of GUI. I envision only needing to have a list of keyword phrases in a .txt file, and then running an executable file via the Windows command prompt, or possibly running the application from an Apache server with a barebones web interface. I am fairly open to the development language for the application; however, an application developed in PHP would be preferred (if possible). Operating System: Windows or Linux (the app would be hosted on a third-party Apache server)

Amazon Web Services, Engineering, Internet Marketing, MySQL, PHP, Software Architecture, Software Testing, Web Hosting, Website Management, Website Testing

Project ID: #3731295

About the project

4 proposals Remote project Active Mar 18, 2009

Awarded to:

finayev

See private message.

$170 USD in 14 days
(31 Reviews)
8.6

4 freelancers are bidding on average $114 for this job

ayozerb

See private message.

$144.5 USD in 14 days
(4 Reviews)
4.7
smartboredc

See private message.

$127.5 USD in 14 days
(3 Reviews)
2.2
thepatient

See private message.

$12.75 USD in 14 days
(2 Reviews)
1.1