Here is a feature list (compiled after some study of other text mining products' web sites, since I did not know some of the commonly used terms).

Document driven

Not really meant for unstructured text, blogs, etc. The primary focus is the "document" as a whole, preferably one written in "officialese".

Data categorization

Yes, into any hierarchy of named silos (tables and columns and/or rows), for later piping into an RDBMS structure. The final XML presentation of the data from a hierarchical structure does need some visual improvement :-), as you can see.
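To make the "hierarchy of named silos" idea concrete, here is a minimal sketch of turning a nested structure into XML. It is not the tool's actual code: the function name `silo_to_xml`, the `row` element name and the invoice sample data are all made up for illustration.

```python
import xml.etree.ElementTree as ET

def silo_to_xml(name, content):
    """Convert a nested dict/list of named silos into an XML element (illustrative only)."""
    node = ET.Element(name)
    if isinstance(content, dict):
        # A dict is a silo with named children (e.g. a table with columns).
        for child_name, child_content in content.items():
            node.append(silo_to_xml(child_name, child_content))
    elif isinstance(content, list):
        # A list is a set of rows; "row" is an assumed element name.
        for item in content:
            node.append(silo_to_xml("row", item))
    else:
        # A leaf holds the extracted text itself.
        node.text = str(content)
    return node

# Hypothetical extraction result, not taken from any real document.
extracted = {
    "invoice": {
        "header": {"number": "INV-001", "date": "2010-03-01"},
        "items": [
            {"description": "Widget", "amount": "10.00"},
            {"description": "Gadget", "amount": "25.50"},
        ],
    }
}

root = silo_to_xml("document", extracted)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Each leaf path (such as invoice/header/number) would then map naturally to a table and column when piped into an RDBMS.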

Minimal configuration (and easy on biz user after a front-end is built)

Well, you can create a front-end with connectors etc. if you wish. All that a business user has to visualize is "how is this part the same, or different". In the end it all boils down to a couple of XML files. The biz user mostly needs to think up choices and text, if anything at all. The front-end could (theoretically) be made machine driven, e.g. read these 500 sample documents and provide useful config hints.

Entity extraction

Interesting to find that some text mining products recognize its complexities. Yes, it exists and forms a fairly important part of the whole mining process. It is difficult to generalize, so some entity types may need some custom code, besides configs.

Sentiment extraction

If this has to do with extracting or validating based on pattern keywords and assigning weightages, then yes, it exists. But for proper sentiment extraction (not the crude statistical evaluation of positive-negative modifiers that is usually done, but a real "predicate proofed" way), you will have to wait for my NLP attachment, which seems to be taking a bit more time than I thought :-)
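As a rough illustration of the keyword-and-weightage approach (the crude kind, not the "predicate proofed" one), here is a minimal sketch. The keyword table, the weights and the function name `keyword_score` are assumptions for the example, not values from the actual config.

```python
import re

# Hypothetical keyword weightages; a real config would define its own.
KEYWORD_WEIGHTS = {
    "overdue": -2.0,
    "penalty": -1.5,
    "approved": 2.0,
    "waived": 1.0,
}

def keyword_score(text, weights=KEYWORD_WEIGHTS):
    """Sum the weights of all known keywords found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(weights.get(word, 0.0) for word in words)

print(keyword_score("Payment overdue, penalty waived"))
```

A score like this can validate or rank a text fragment, but it knows nothing about negation or sentence structure, which is exactly why it is not real sentiment extraction.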

Frequency of words

The frequency (count) is useful during initial configuration but is not a necessity during actual extraction, i.e. it has very low weightage.
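A minimal sketch of how word counts over sample documents could feed initial configuration. The function name `word_frequencies` and the sample strings are illustrative assumptions.

```python
from collections import Counter
import re

def word_frequencies(documents, top_n=5):
    """Count lowercase words across all documents and return the most common ones."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts.most_common(top_n)

# Hypothetical sample documents.
samples = ["Total amount due", "Amount payable", "Amount"]
print(word_frequencies(samples, top_n=3))
```

The most frequent terms are candidates for labels or keywords in the config; during extraction itself the counts play almost no role.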

Headers, segments

This is really trivial, but I saw it listed as a "feature" on some text mining products' web sites.

Clustering, proximity

A rudimentary k-means is used as a last resort in some structures, and only when I am sure outliers will be under control. Not sure about proximity; all sorts of home-grown methods are used to arrive at a good general-purpose text extraction. I FOUND MATH AND STAT METHODS RATHER UNRELIABLE.
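For reference, a rudimentary k-means can be as small as the sketch below. It works on one-dimensional numbers (say, horizontal positions of text fragments, which is my assumed use case for the example) and is not the code used in the tool. The function name `kmeans_1d` and the deterministic seeding are illustrative choices.

```python
def kmeans_1d(values, k, iterations=20):
    """Very small 1-D k-means: returns (centroids, clusters)."""
    ordered = sorted(values)
    # Deterministic seeding: pick k evenly spaced values from the sorted input.
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical positions forming two obvious groups.
centroids, clusters = kmeans_1d([10, 12, 11, 200, 205, 198], k=2)
print(centroids, clusters)
```

Note how a single stray value (an outlier) would drag a centroid toward it, which is why this only makes sense when outliers are known to be under control.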

Math tools and statistical techniques in general

Hardly any have been used, mostly because I don't know most of them; I am not a math whiz. Any technique can be plugged in (as a class, say); typically the process sees it as "yet another mining option to reach a goal".
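The "plug in as a class" idea could look roughly like the sketch below, where each technique is one more option tried in turn until the goal is reached. All class and function names here (`MiningOption`, `RegexDateOption`, `ValueAfterLabelOption`, `reach_goal`) are hypothetical and not taken from the actual codebase.

```python
from abc import ABC, abstractmethod
import re

class MiningOption(ABC):
    """One pluggable way of trying to extract a value (hypothetical interface)."""

    @abstractmethod
    def try_extract(self, text):
        """Return the extracted value, or None if this option does not apply."""

class RegexDateOption(MiningOption):
    def try_extract(self, text):
        match = re.search(r"\b\d{4}-\d{2}-\d{2}\b", text)
        return match.group(0) if match else None

class ValueAfterLabelOption(MiningOption):
    def __init__(self, label):
        self.label = label

    def try_extract(self, text):
        match = re.search(re.escape(self.label) + r"\s*:\s*(\S+)", text)
        return match.group(1) if match else None

def reach_goal(text, options):
    """Try each option in order; return (value, name of the option that succeeded)."""
    for option in options:
        value = option.try_extract(text)
        if value is not None:
            return value, type(option).__name__
    return None, None

options = [RegexDateOption(), ValueAfterLabelOption("Dated")]
print(reach_goal("Dated: March", options))
```

A statistical or math-based technique would simply be another subclass added to the list of options.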

Data cleaning

Only in embedded structures; otherwise it is left to domain-specific customized code and is not for general purpose.

Drilldown to source

Where did this data piece come from? What does it link to downstream? Yes, this exists nicely, although the demo shows only a simple form of it. You could in theory drill up several levels.
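One simple way to picture drilling up several levels is to have every extracted piece keep a link to the thing it came from. The sketch below assumes a hypothetical `Extracted` record with a `parent` reference; the labels and values are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Extracted:
    """An extracted piece that remembers where it came from (illustrative only)."""
    value: str
    label: str
    parent: Optional["Extracted"] = None

    def trail(self):
        """Walk up the parent links and return the labels, nearest source first."""
        node, path = self, []
        while node is not None:
            path.append(node.label)
            node = node.parent
        return path

# Hypothetical chain: document -> table -> cell.
document = Extracted("(full text)", "document")
table = Extracted("(table text)", "table 2", parent=document)
cell = Extracted("25.50", "row 3, column amount", parent=table)
print(cell.trail())
```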

Signalling degradation of extraction confidence

Yes, it exists in the demo at least for some things, e.g. unsureness about the actual table it is seeing, multiple possible final data texts, etc. It gives up gracefully and informs the user. More could be done; it is not exactly great as yet.
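A minimal sketch of "give up nicely and inform the user" for the case of multiple possible final texts. The scoring scheme, the `min_margin` threshold and the function name `resolve` are assumptions made for illustration; the demo's actual logic is not shown here.

```python
def resolve(candidates, min_margin=0.2):
    """Pick a final text from (text, score) candidates, or report why it could not.

    min_margin is an assumed threshold: the best candidate must beat the
    runner-up by at least this much, otherwise the result is flagged as unsure.
    """
    if not candidates:
        return {"status": "gave_up", "reason": "no candidates", "value": None}
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return {
            "status": "unsure",
            "reason": "multiple possible final texts",
            "value": None,
            "alternatives": [text for text, _ in ranked],
        }
    return {"status": "ok", "value": ranked[0][0]}

# Hypothetical candidates with close scores: the user gets told, not a silent guess.
print(resolve([("1,200", 0.9), ("1.200", 0.85)]))
```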

Use of formal RDF/OWL

No. And I don't foresee this being used, because the config captures the essence of them.

Use of XSLT or HTML tags

No (well, HTML tags are used a little, to identify tables and other embedded structures). No XSLT at all.

Final mapping to domain terms

Not in the demo, but obviously it will be needed once this is made domain specific. Easy to insert as "skins", say.
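A "skin" could be as simple as a lookup from generic silo names to domain terms, applied at the very end. This is only a sketch under that assumption; the function name `apply_skin` and the sample names are hypothetical.

```python
def apply_skin(record, skin):
    """Rename generic keys to domain terms; keys without a mapping stay as they are."""
    return {skin.get(key, key): value for key, value in record.items()}

# Hypothetical generic record and domain skin.
generic = {"col_1": "INV-001", "col_2": "10.00"}
invoice_skin = {"col_1": "invoice_number"}
print(apply_skin(generic, invoice_skin))
```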

Filtering, final data validations etc.

Not interested; it kills the whole idea of "lazy" extraction. Better added on as a domain skin later. Easily done of course, just by a config change, say.

Text analysis and metrics

The emphasis is on maximizing capture as a set of objects, as domain agnostic as possible. All of this can be added as last-mile add-ons.

Extraction of isolated data pieces

Not really meant for this; the emphasis is always on seeing the whole document as a data tree.

Using language structures to derive meaning

Sadly, this does not exist in the current code; it is in my head though. I stopped work on this particular codebase at this point and am now continuing the language structure stuff in Java, and it seems promising :-)