We are building a desktop search engine and need a file system crawler for it. The crawler should be designed so that additional filter modules can be added to support indexing of new document types. It should also support recursive crawling: for example, it should be able to index a Word document attached to an email inside a zipped Outlook PST file, provided all the necessary filter modules are installed. It should also support multiple languages.
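To illustrate the kind of design we have in mind, here is a minimal sketch of pluggable filters with recursive crawling. It is only an illustration, not a required implementation: the names `FILTERS`, `register`, `crawl` and `max_depth` are hypothetical, and only plain text and ZIP are handled, using the Python standard library.

```python
import io
import os
import zipfile
from typing import Callable, Dict, List, Tuple

# A filter takes raw bytes and returns (extracted_text, embedded_documents).
# Each embedded document is a (name, bytes) pair that is crawled in turn.
Embedded = Tuple[str, bytes]
Filter = Callable[[bytes], Tuple[str, List[Embedded]]]

FILTERS: Dict[str, Filter] = {}  # hypothetical registry keyed by file extension


def register(ext: str):
    """Decorator that installs a filter module for one file extension."""
    def decorator(func: Filter) -> Filter:
        FILTERS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def text_filter(data: bytes) -> Tuple[str, List[Embedded]]:
    # Plain text has no embedded documents.
    return data.decode("utf-8", errors="replace"), []


@register(".zip")
def zip_filter(data: bytes) -> Tuple[str, List[Embedded]]:
    # An archive has no text of its own; its members are handed back for crawling.
    children: List[Embedded] = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                children.append((info.filename, archive.read(info)))
    return "", children


def crawl(name: str, data: bytes, max_depth: int = 8, _depth: int = 0) -> List[Tuple[str, str]]:
    """Return (path, text) pairs for a document and everything nested inside it."""
    handler = FILTERS.get(os.path.splitext(name)[1].lower())
    if handler is None or _depth > max_depth:
        return []  # no filter installed for this type, or nesting too deep
    text, children = handler(data)
    results = [(name, text)] if text else []
    for child_name, child_data in children:
        results.extend(crawl(f"{name}/{child_name}", child_data, max_depth, _depth + 1))
    return results
```

In this sketch, supporting a new document type means registering one more function, and a text file inside a ZIP inside another ZIP is reached without the ZIP filter knowing anything about text files.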
Interested bidders should describe their experience in this field and provide a proposed project plan covering the crawler design (diagrams preferred), timeline and bid price. We also prefer bidders who provide ideas on integrating the crawler with existing open source search engines such as SWISH++. The use of 3rd party filter libraries is encouraged to speed up development.
The following is the list of formats the crawler should support:
Adobe Acrobat Reader (.pdf)
Adobe PageMaker 4.0, 5.0, 6.0, 6.5 (.pm4, .pm5, .pm6, .p65, .pmd)
Compressed HTML (.chm)
Hyper Text Markup Language (.html)
Help files (.hlp)
Microsoft Excel 2, 3, 4, 5, 95, 97, 2000, XP (.xls)
Microsoft Power Point (.ppt)
Microsoft Word (.doc)
Microsoft Word for Macintosh (.mcw)
Microsoft Word Templates (.dot)
Microsoft Write (.wri)
Plain Text (.txt)
PROMPT translator (.std)
Rich Text Format (.rtf)
Word and Deed (.w&d)
Word Perfect (.wpd)
Works for Windows (.wps)
XML Extensible Markup Language (.xml)
Microsoft Exchange 95/97/98/2000/2001/2002/2003
.MSG, .EML messages
.MBX Unix mailboxes
Searching in e-mail message attachments
PKARC by PKWARE (.arc)
ARJ by ARJ Software (.arj)
Cabinet by Microsoft (.cab)
RAR (.rar)
ZIP PKZIP by PKWARE (.zip)
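As a rough illustration of what searching inside e-mail attachments could involve for the .EML format listed above, the sketch below splits a message into its body text and its attachments using Python's standard `email` package. The function name `split_eml` and the fallback attachment names are hypothetical; each returned attachment would then be passed to whichever filter handles its type.

```python
from email import message_from_bytes, policy
from typing import List, Tuple


def split_eml(data: bytes) -> Tuple[str, List[Tuple[str, bytes]]]:
    """Return (body_text, [(attachment_name, attachment_bytes), ...]) for one .eml file."""
    msg = message_from_bytes(data, policy=policy.default)

    # Prefer the plain-text body; a message may have none.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    attachments: List[Tuple[str, bytes]] = []
    for index, part in enumerate(msg.iter_attachments()):
        # Unnamed attachments get a placeholder name (an assumption of this sketch).
        name = part.get_filename() or f"attachment-{index}"
        # decode=True undoes base64 / quoted-printable transfer encoding;
        # it returns None for multipart attachments, which are skipped as empty here.
        payload = part.get_payload(decode=True) or b""
        attachments.append((name, payload))
    return body, attachments
```

Other mail formats in the list (.MSG, .MBX, Exchange stores) would need their own filters; this sketch covers .EML only.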
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).