I will try to make it customizable and general. The input would be a file upload field in a form. I'm fine with the MS Office 2007 and 2010 formats. I can manipulate DOCX and XLSX files, but PPTX has some restrictions. The required deadline is rather tight, but I'll try to stick to it. I'm comfortable with the PHP 5.4+ file system functions and the other functions needed. I have developed a lot of word processors, so I expect extracting text will be fine.
As I understand the bonus items, you want only the text. I see them as requirements rather than bonuses. Regarding point 3: I'm not sure I can do anything with OLE; it is rather difficult. I can extract unambiguous text, by which I mean paragraphs, tables, shapes, SmartArt, WordArt, headers, footers, hyperlinks, and other objects that store text in a <w:t> tag. Not covered: ActiveX objects, embedded objects, watermarks (possible only for Word, not PowerPoint), review items, and footnotes.
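To make "text stored in a <w:t> tag" concrete, here is a minimal sketch of the idea. It is written in Python purely for illustration; the same steps should map onto PHP's ZipArchive and DOMDocument classes. The function name `extract_docx_paragraphs` is hypothetical, and the sketch assumes the main body lives in `word/document.xml`, which is the usual location but not something I have checked against your files. Headers and footers sit in separate parts and are not read here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in DOCX files.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_paragraphs(docx_bytes):
    """Return the text of each non-empty <w:p>, joined from its <w:t> runs.

    Hypothetical helper for illustration. Assumes the body is stored in
    word/document.xml. Paragraphs nested inside another paragraph (for
    example text boxes) would be reported twice by this simple version.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph is split into runs; each run keeps its text in <w:t>.
        text = "".join(t.text or "" for t in paragraph.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

With an uploaded file this would be called as `extract_docx_paragraphs(uploaded_file_bytes)`, where `uploaded_file_bytes` stands for whatever the upload form hands over.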
There will be a different library for each file extension. That is, I won't use a single standard OOXML library, simply because one doesn't exist for PHP. I also need to understand the input/output mechanism of YarakuZen, so I invite you, sir, to contact me to discuss further details.
I plan to start with Word. I already have multiple ways to manipulate Word documents programmatically. Then I would move on to Excel. It may be easy, though I would extract only cell content (no graphs or other objects in the current plan). For the PowerPoint phase, I have only one or two approaches, so I'm restricted to some conditions.
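For the Excel phase, "cell content only" could look roughly like the sketch below. Again this is Python for illustration only, not the PHP I would deliver. The function name `extract_xlsx_cells` and the default sheet path `xl/worksheets/sheet1.xml` are assumptions; a real workbook lists its sheets in its own workbook part, which this sketch does not read. It also assumes every cell carries an `r` reference attribute such as `A1`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by sheet and shared-string parts.
S_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def extract_xlsx_cells(xlsx_bytes, sheet_path="xl/worksheets/sheet1.xml"):
    """Return {cell reference: text} for one sheet (hypothetical helper).

    The sheet path is an assumed default, not looked up from the workbook.
    """
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        shared = []
        if "xl/sharedStrings.xml" in archive.namelist():
            table = ET.fromstring(archive.read("xl/sharedStrings.xml"))
            for item in table.iter(S_NS + "si"):
                # A shared string may be split into several rich-text runs.
                shared.append("".join(t.text or "" for t in item.iter(S_NS + "t")))
        sheet = ET.fromstring(archive.read(sheet_path))

    cells = {}
    for cell in sheet.iter(S_NS + "c"):
        ref = cell.get("r")
        kind = cell.get("t")
        value = cell.find(S_NS + "v")
        if kind == "s" and value is not None and value.text is not None:
            # t="s": <v> holds an index into the shared string table.
            cells[ref] = shared[int(value.text)]
        elif kind == "inlineStr":
            # Inline strings keep their text inside the cell itself.
            cells[ref] = "".join(t.text or "" for t in cell.iter(S_NS + "t"))
        elif value is not None and value.text is not None:
            # Numbers and other plain values are returned as stored.
            cells[ref] = value.text
    return cells
```

Charts and other drawing objects live in separate parts of the package, so this approach skips them naturally, which matches the current plan.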