...You are invited to try out a demo tool for practical text mining. Please note: just plain text on this site, because mostly it is all about text and data...

    In many application-building strategies, I have often seen a text mining or web extraction process somewhere early in the chain. I have a feeling that as the world progresses towards 2.0, this need will increase. And how cumbersome those bespoke homegrown efforts are! Here is a tool, a methodology and an approach that captures any kind of semi-structured web document - using just 1 or 2 configuration files.
     
    This extraction tool takes very little effort to configure and use: it handles almost any kind of formal, officialese document, achieves high extraction rates, and needs just 2 config files.

    Consider the questions below:
    • Is DATA your most important raw material? Are semi-structured docs the "source" of your data?
    • Is your data available in documents like DEF 14A proxy statements, SEC 10-K filings, USPTO patent abstracts, LinkedIn - or any other semi-structured public document available over the web? Any document with the slightest "structuring" qualifies.
    • Do you often buy data from data vendors? Where do they get it from? And how?
    • Are your users happy with the quality and quantity of data?
    • Do you often say 'my data is the best' without really feeling too sure about it?

           Download the demo  >>> unzip the exe and some samples (click link at left to download)

           More samples >> here. It is plain text mining - but made well organized and easy!!!

          Think!!! With a text mining solution working for you, your data vendor will have to really smarten up or lose out...
          In recessionary times, this also enables a do-it-yourself option that may save you millions of dollars.

    • !!! Right now, everyone in the world needs text extraction and text mining !!! - NEW !!! - visit www.sentimap.com/DemoSent to try out a Sentence Analyzer, a major enhancement to text extraction and retrieval. Also read up on how the combination of structure and language improves text extraction efficiency (see here).
    • Does your org deal with zillions of documents, unique yet similar, that humans must read even though a machine could scan them? Evaluate this demo to see how easy and practicable it is, if done nicely. Then get in touch at kinshuk_in@yahoo do.t com (replace "do.t" with "."; it is written this way for spam-prevention reasons)

    >> get in touch at kinshuk_in@yahoo do.t com (pls reconstruct it)
    >> the LinkedIn is here: http://in.linkedin.com/in/kinshuka
    >> the technical gobbledegook is below (monologue/thought/lecture dump actually)
    >> list of usages
    >> for business/management folks, high level story is here
    >> a text mining features/comparison list ...and a must-read faq

    The DataHarvester tool/utility is an implementation of a powerful document data extraction methodology, applicable to any semi-structured document. Common documents like DEF 14A proxy statements, 10-K filings and a few others have been configured as a demonstration. Try it out; you will be surprised at what text mining done well can achieve.
     
    Note # 1: Many good friends have asked (see faq): so this is just a web scraper, or at best a content extractor, right? Maybe an intelligent one, but so what? Aren't there hundreds of them out there?

    The short answer is this: first try the demo and think it through. The tool and its underlying approach do not make any claim to the science of it; they merely claim to be a practical and successful solution to a class of problems. It is a general-purpose whatever-you-may-like-to-call-it, with roughly 200 working parts wired in a way that makes it (a) capable of encapsulating specialist business users' ideas on content capture, (b) easy to extend with more working parts, (c) able to tackle many diverse document types, and (d) fairly intelligent, but with lots of scope for architects/designers to do their stuff.
     
    In short, a good methodology. If you have tackled this as a general-purpose text mining problem earlier, on semi-structured documents of business value, then you already have an idea how much coding/design effort it can take to get everything working together - to suit a wide range of document types and make biz users comfortable.
     
    Two simple-to-create config files, and you have a wide range of documents under a high-extraction scanner. That is its usefulness.

    Note # 2: OK, there are now too many similar questions along the above lines, so here is something really short about this demo.
    • Firstly, the huge mass of text on this site can be safely ignored; it is just a distraction.
    • The methodology expressed by the tool is more about "discovery" than about "extraction".
    • But you cannot discover unless you are sure that you can extract, sort of like the tail wagging the dog.
    • That is it, friends, for now!

    The idea is to extract data straight off any semi-structured web document, and present it as hierarchical XML. Or Excel cells. Or class objects. It is very friendly to business users as well as ETL developers - to "see" the data in its meaningful context. And then do whatever is needed with it.
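
    To make "hierarchical XML" concrete, here is a minimal Python sketch. It is not the DataHarvester code: the function name to_xml, the tag names and the sample values are all assumptions invented for illustration. It only shows how already-extracted, nested data could be serialised so that each value stays inside its meaningful context.

```python
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Turn nested dicts, lists and scalars into an ElementTree element."""
    node = ET.Element(tag)
    if isinstance(value, dict):
        # Each key becomes a child element, preserving the hierarchy.
        for key, child in value.items():
            node.append(to_xml(key, child))
    elif isinstance(value, list):
        # List entries are wrapped in generic <item> elements.
        for item in value:
            node.append(to_xml("item", item))
    else:
        node.text = str(value)
    return node


# Hypothetical extraction result; field names and values are made up.
extracted = {
    "company": "Example Corp",
    "directors": [
        {"name": "A. Smith", "age": 61},
        {"name": "B. Jones", "age": 54},
    ],
}

xml_text = ET.tostring(to_xml("proxy_statement", extracted), encoding="unicode")
print(xml_text)
```

    The same nested structure could just as well be written out as Excel cells or class objects; XML is used here only because it shows the hierarchy most directly.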

    Good extraction lies in the business user's ability to discern a structure in the source document - everything else the engine can provide simply via a set of configurations.

    It is general-purpose, and the same methodology works for any document: extractions of SEC DEF 14A proxy statements, 10-K annual filings, USPTO patent abstracts, LinkedIn profiles and pharma-drug descriptions have all been done by one single engine working behind the scenes - the only difference being a couple of configuration files.
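
    The "one engine, different configuration files" idea can be sketched roughly as follows. This is a deliberately tiny illustration, assuming each configuration is just a mapping from a field name to a regular expression; the tool's real config format is not described on this page, and the names extract, PROXY_CONFIG and PATENT_CONFIG, as well as the sample texts, are hypothetical.

```python
import re


def extract(text, config):
    """Run every named pattern in `config` against `text`.

    Returns a dict of field name -> first captured group, or None
    when the pattern does not match.
    """
    result = {}
    for field, pattern in config.items():
        match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        result[field] = match.group(1).strip() if match else None
    return result


# Hypothetical configurations; in practice these would live in config files.
PROXY_CONFIG = {
    "company": r"^Company:\s*(.+)$",
    "meeting_date": r"^Annual Meeting Date:\s*(.+)$",
}
PATENT_CONFIG = {
    "title": r"^Title:\s*(.+)$",
    "inventor": r"^Inventor:\s*(.+)$",
}

# Made-up sample documents, one per document type.
proxy_text = "Company: Example Corp\nAnnual Meeting Date: May 12, 2009\n"
patent_text = "Title: Widget fastener\nInventor: J. Doe\n"

# The engine (extract) is identical; only the configuration changes.
proxy_data = extract(proxy_text, PROXY_CONFIG)
patent_data = extract(patent_text, PATENT_CONFIG)
print(proxy_data)
print(patent_data)
```

    The point of the sketch is the division of labour: the business user describes the structure in a configuration, and the engine code stays untouched when a new document type arrives.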

    If your company has data as the central raw material for all its products, you may find this useful. Most data suppliers do nothing better or more, and in most cases they cannot reach the extraction capabilities demonstrated here.

    The additional work needed to get this data into your own system is trivial. This implementation can be put to work right away, with possibly a web-crawling mechanism and a last-mile data capture/validation as the only needed pieces.

    It takes 10-15 days to configure and test a new document type, using existing (simple) structural elements. New elements can be added to the code if needed, since most of it follows the factory pattern.
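
    Since the factory pattern is mentioned above, here is a minimal sketch of what it could look like, assuming structural elements are small classes looked up by name from a configuration entry. Everything named here (ELEMENT_REGISTRY, register, make_element, LabelledValue, BulletList) is hypothetical and not taken from the tool.

```python
import re

# Maps an element type name (as it might appear in a config) to its class.
ELEMENT_REGISTRY = {}


def register(name):
    """Class decorator that adds a structural element to the registry."""
    def decorator(cls):
        ELEMENT_REGISTRY[name] = cls
        return cls
    return decorator


@register("labelled_value")
class LabelledValue:
    """Captures the text after 'Label:' on a single line."""

    def __init__(self, label):
        self.label = label

    def extract(self, text):
        pattern = rf"^{re.escape(self.label)}:\s*(.+)$"
        match = re.search(pattern, text, re.MULTILINE)
        return match.group(1).strip() if match else None


@register("bullet_list")
class BulletList:
    """Captures every line that starts with the given marker."""

    def __init__(self, marker="*"):
        self.marker = marker

    def extract(self, text):
        return [
            line[len(self.marker):].strip()
            for line in text.splitlines()
            if line.startswith(self.marker)
        ]


def make_element(spec):
    """Factory: build an element from a config entry such as {'type': ...}."""
    spec = dict(spec)          # copy so the caller's config is not modified
    kind = spec.pop("type")
    return ELEMENT_REGISTRY[kind](**spec)


sample = "Company: Example Corp\n* first item\n* second item\n"
company = make_element({"type": "labelled_value", "label": "Company"}).extract(sample)
items = make_element({"type": "bullet_list", "marker": "*"}).extract(sample)
print(company, items)
```

    With this kind of arrangement, adding a new structural element means writing one more registered class; the factory function and the existing configurations do not need to change.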

    Please email kinshuk_in@yahoo do.t com (pls reconstruct this id)

    ==disclaimer about this implementation==

    Disclaimer: Please do not even THINK of using this present tool/demo/utility as-is; this is just the demonstration of an idea. Of course the engine is highly functional (minus the bells and whistles and some possible automating enhancements), so a simple dev project taking 2-3 months should make it production ready.