I would like a program/script written in PHP and/or Java (or ANY other language) that will help an author locate instances on the Internet of copyright infringement of documents that he or she has written.
This project would be divided into THREE PHASES. In your formal bid, please bid *ONLY* for Phases ONE and TWO. If you can ALSO complete Phase THREE for me, please privately indicate your price for Phase THREE in the PMB.
In PHASE ONE of this project, you would develop the core program code. The program would do the following:
1. Search through a specified folder (on a web server's hard drive) for any Microsoft Word, RTF, PDF, or plain text documents that may exist in that folder (and, optionally, any sub-folders). The software would then take a user-defined number of random samples of contiguous text (i.e., a user-defined number of consecutive words) for EACH document in that folder/sub-folder.
a. The user should have the ability to specify the amount of "distance" (measured in words) between the samples that are taken, i.e. the user can specify that the samples are to be taken every 250 words, 500 words, 750 words, etc.
b. The user should be able to select the document types that are searched for (Microsoft Word, RTF, PDF, or plain ASCII text documents).
c. If you have a BETTER IDEA as to HOW to take a more accurate, more reliable "footprint" of the documents, then PLEASE POST A PMB message describing your alternative idea/methodology. I am not certain if the method I describe in #1 above is the best way to take a "footprint" of the documents, and I am open to other creative ways to achieve my goal.
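As a rough illustration of the file search and sampling described in #1 (including options a and b), here is a minimal sketch in Python. Python is used purely for illustration; any language is acceptable. The names `DOC_TYPES`, `find_documents`, and `take_samples` are hypothetical, and the sketch assumes the text has already been extracted from each document (extracting text from Word, RTF, or PDF files would need an additional library and is not shown). It takes samples at a fixed word distance rather than at random positions.

```python
import os

# Hypothetical mapping of user-selectable document types to file extensions.
DOC_TYPES = {
    "word": (".doc", ".docx"),
    "rtf": (".rtf",),
    "pdf": (".pdf",),
    "text": (".txt",),
}


def find_documents(folder, doc_types, include_subfolders=True):
    """Return sorted paths of files under `folder` matching the selected types."""
    extensions = tuple(ext for t in doc_types for ext in DOC_TYPES[t])
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.lower().endswith(extensions):
                found.append(os.path.join(root, name))
        if not include_subfolders:
            break  # os.walk visits the top-level folder first; stop there
    return sorted(found)


def take_samples(text, sample_length, distance, max_samples=None):
    """Take runs of `sample_length` consecutive words, one every `distance` words."""
    if sample_length < 1 or distance < 1:
        raise ValueError("sample_length and distance must be at least 1")
    words = text.split()
    samples = []
    # Only start a sample where a full run of `sample_length` words remains.
    for start in range(0, len(words) - sample_length + 1, distance):
        samples.append(" ".join(words[start:start + sample_length]))
        if max_samples is not None and len(samples) >= max_samples:
            break
    return samples
```

For example, `take_samples(text, 8, 250, max_samples=5)` would return up to five 8-word samples, one starting every 250 words.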
2. The software would then take those samples (strings of text) and query Google and/or Docstoc to identify and find any matches (instances of copyright infringement). (Docstoc is a web site that hosts documents.)
3. For any matches that are found, the software would then log:
a. the URL of the page on which the match is found
b. the TITLE of the html page on which the match is found
c. the file name of the source document from the user's local hard drive
d. the date and time of the query in which the match was found on the offending web site
e. whether the match was found on Google or Docstoc
f. the exact string of text that was discovered on Google or Docstoc.
g. If the infringement was found on Docstoc, the software should also log the username of the Docstoc user who posted the infringing document, the total number of Views and Downloads of that document, and the date the document was posted on Docstoc.
4. The log generated in #3 above should then be exportable as a comma delimited text file (CSV file). It should also be displayed on screen. The user will select whether he wants to view the report on screen or export it to CSV.
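To make #2 through #4 more concrete, here is a minimal Python sketch (again, only an illustration under stated assumptions, not a required design). It shows one way to turn a sample into an exact-phrase query string, build a log entry with the fields listed in #3, and export the log as CSV. The field names in `LOG_FIELDS` and the functions `exact_phrase_query`, `make_log_entry`, and `export_csv` are hypothetical. How the query is actually sent to Google or Docstoc is deliberately left out, since that depends on whatever search interface the developer chooses.

```python
import csv
from datetime import datetime
from urllib.parse import quote_plus

# Hypothetical column names covering items 3a-3g.
LOG_FIELDS = [
    "url", "page_title", "source_file", "found_at", "found_on", "matched_text",
    "docstoc_username", "docstoc_views", "docstoc_downloads", "docstoc_posted",
]


def exact_phrase_query(sample):
    """Quote the sample so it is searched as an exact phrase, then URL-encode it."""
    phrase = sample.replace('"', "")  # drop embedded quotes that would end the phrase
    return quote_plus('"' + phrase + '"')


def make_log_entry(url, page_title, source_file, found_on, matched_text,
                   found_at=None, docstoc=None):
    """Build one log row; `docstoc` may supply the extra Docstoc-only fields."""
    entry = dict.fromkeys(LOG_FIELDS, "")
    entry.update(
        url=url,
        page_title=page_title,
        source_file=source_file,
        found_at=(found_at or datetime.now()).isoformat(timespec="seconds"),
        found_on=found_on,
        matched_text=matched_text,
    )
    entry.update(docstoc or {})  # keys must be names from LOG_FIELDS
    return entry


def export_csv(entries, stream):
    """Write the log entries to `stream` as comma-delimited text with a header row."""
    writer = csv.DictWriter(stream, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(entries)
```

The same list of entries could be printed for the on-screen report or passed to `export_csv` with an open file, for example `export_csv(entries, open("report.csv", "w", newline=""))`.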
*** REQUESTED PMB COMMENTS ***
-- I do not really care what programming language you use, although I tend to prefer PHP and Java. If you think another programming language would be better than PHP, please specify in the PMB what language you would use and WHY that would be better than PHP.
-- Please state in the PMB how many DAYS it would take you to complete PHASE ONE of this project.
-- Please also state if you think the methodology for taking a "footprint" of the documents (described in #1 above) is the best way, or please present your idea if you want to suggest a better way.
In the SECOND PHASE of the project, I would like you to extend the web script so that it can read the documents on the user's local hard drive, take the samples (footprints) on the user's local computer, and then relay those samples/footprints back to the web server for processing and querying on the Internet. The CSV file and screen report would then be accessible to the user from the web server.
OPTIONAL: (Please indicate your price for Phase 3 in the PMB, if you are willing and able to do it.) In the THIRD PHASE of this project, I would like you to develop a Windows compatible stand-alone application that the user could install/execute on his computer that will do the same thing as 1-4 above, except that the initial queries and sampling of the documents would take place ONLY on the user's LOCAL hard drive. No text would be relayed back to the web server. The program would still search for the matches on the Internet in the same way and create a CSV report and screen report.
NOTE: I am very price sensitive for this project, so please bid accordingly. If someone bids the right amount and can commit to finishing the job quickly, I will likely end the bidding process early and select you. Please write the word "Excel" in your bid comments so that I know you have read these specs carefully and that you understand English. Thank you for your interest, and I look forward to working with you!