Two XML files per document. For simple documents they take 2-3 days to write; for complicated ones (like SEC filings) they take 10-15 days, plus testing on a large enough set of files.
It solves a business problem (getting data from the web) in a general-purpose, configurable and extensible way; any label can be attached to the problem. First, the demo should probably be tried out by those who know the problem at its practical business level. The demo currently has about 200-300 working functionalities needed for good text extraction and intermediate presentation, organized in a certain way. There is every scope for adding more functionalities, tackling different document structures, encapsulating the business users' ideas of content capture in a friendly way, refining the extraction width and depth, tackling document variants - and for architects/designers to do things that the demo at this point does not even foresee.
Given the above, and esp. for those who have spent development time on the problem, it saves time and money and reduces the chance of misdirected effort. It works, so it lowers risk.
HTML is, strictly speaking, not necessary, except for delineating some structures like tables. Plain text input could be fine too, if this is taken care of. For PDF, Word etc. I think it is better to use PDF->HTML and Word->HTML converters. Although the HTML is completely stripped off at some point in the process, inputting HTML somehow seems to convey "structure". I think it is better to leave this to the final implementors; maybe they can provide alternative options.
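To illustrate the point about HTML being needed mainly to delineate tables, here is a minimal Python sketch. It is not the demo's code: the class `TableAwareStripper`, the function `strip_html` and the ` | ` cell separator are all hypothetical names and choices made for this example. It throws away every tag but keeps each table row as one line with its cells separated.

```python
from html.parser import HTMLParser


class TableAwareStripper(HTMLParser):
    """Hypothetical helper: drop all markup, keep table rows and cells."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished output lines
        self.cells = []     # cells of the table row being read
        self.buf = []       # text fragments of the current cell or paragraph
        self.in_row = False

    def _flush_text(self):
        # Join the buffered fragments and collapse runs of whitespace.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        return text

    def _end_paragraph(self):
        text = self._flush_text()
        if text:
            self.lines.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_paragraph()
            self.in_row = True
            self.cells = []
        elif tag in ("td", "th") and self.in_row:
            self.buf = []
        elif tag in ("p", "br", "div") and not self.in_row:
            self._end_paragraph()

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_row:
            self.cells.append(self._flush_text())
        elif tag == "tr" and self.in_row:
            self.lines.append(" | ".join(self.cells))
            self.in_row = False
            self.buf = []
        elif tag in ("p", "div") and not self.in_row:
            self._end_paragraph()

    def handle_data(self, data):
        self.buf.append(data)

    def close(self):
        super().close()
        self._end_paragraph()


def strip_html(html):
    """Return the document as plain-text lines, one line per table row."""
    parser = TableAwareStripper()
    parser.feed(html)
    parser.close()
    return parser.lines


sample = (
    "<p>Revenue by <b>year</b></p>"
    "<table><tr><th>Year</th><th>Revenue</th></tr>"
    "<tr><td>2005</td><td>10</td></tr></table>"
)
print(strip_html(sample))
# ['Revenue by year', 'Year | Revenue', '2005 | 10']
```

Plain text input would work with the same downstream steps as long as it carried an equivalent row and cell marker.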
I haven't decided anything about the "selling" of it. I would like it to be taken to its next usable step by a company that feels technically interested and involved, and is capable of carrying its various nuances (not observable in the demo) to the next practical step. For example, there already exists (but not shown in the demo) a "learning" type of front-end. There is also a "failure and recovery" option for when a given config does not quite deliver, and some internal metrics of success.
Far more can be done in a good development set-up. For example, the true difference between simple extraction and genuine "mining" is the ability to create arbitrary relations between pieces of data. Can you see this ability in the demo? Most likely you can, since the final output is hierarchical XML, however badly formed. Nothing further has been attempted, since relations are a business domain affair.
But yes, it would be nice to be rewarded before I discuss the nuts and bolts. My own association over its life cycle may also be needed for some time. I have seen many similar things on the web, and I think every company has a text mining or web scraping tool, but they don't seem to be making it general purpose. I would like someone to make use of it.
Certainly I don't want to be configuring piecemeal EXEs for the rest of my life. I am already onto a different kind of NLP problem. But I do hope a good patron takes over; otherwise, if there is no movement, I'll probably just make it open-source :-)
Not that there is much in the source. To repeat myself, it is all about organizing already available concepts in a manner that makes it work, so that it is practicable enough to suit the business. It could be expressed in code, or in a longish Word document; it hardly makes a difference. The organization of the pieces is abstract enough for a wide range of extensibility. In fact, a good architecting effort, making all its "likelihoods" practicable, is still a missing step. Class design comes way later.
No, XSLT is used nowhere at all. In fact the entire HTML is stripped off very early in the process. The XML is generated (just as the Excel or the CSV is) from an object structure that represents the document.
Also, at this point, it does not deal with input content at the level of nouns, verbs and general grammar, although that is a possible future direction.
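The idea of generating XML and CSV from one object structure, with no XSLT in between, can be sketched as follows. This is only an illustration in Python, not the demo's actual design: the `Node` class, its fields and the `path,value` CSV layout are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical document node: a label, optional text, child nodes."""
    label: str
    text: str = ""
    children: list = field(default_factory=list)


def to_xml(node):
    """Build an XML element tree directly from the object tree."""
    elem = ET.Element(node.label)  # labels must be valid XML names
    if node.text:
        elem.text = node.text
    for child in node.children:
        elem.append(to_xml(child))
    return elem


def to_rows(node, path=()):
    """Flatten the same tree into (path, value) rows for CSV or Excel."""
    here = path + (node.label,)
    if node.text:
        yield ("/".join(here), node.text)
    for child in node.children:
        yield from to_rows(child, here)


def to_csv(node):
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["path", "value"])
    writer.writerows(to_rows(node))
    return out.getvalue()


doc = Node("filing", children=[
    Node("company", "Example Corp"),
    Node("income", children=[Node("revenue", "100"), Node("net", "12")]),
])

xml_text = ET.tostring(to_xml(doc), encoding="unicode")
csv_text = to_csv(doc)
print(xml_text)
print(csv_text)
```

Both outputs are written from the same tree, so adding another output format means adding one more walker over `Node` and leaves the extraction step untouched.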
In the exe demo, probably quite a few.
- Since I have used only about 100 SEC files to test it, it is very likely that some weird things in some files will break the demo code, cause uncaught exceptions to be thrown etc.
- One of the commonest is that the entire process hangs (or seems to). 9 times out of 10 I can trace it to a k-means clustering block of code; very rarely it could be in the parsing. However, I have tried 7 MB SEC files without much trouble: the time taken depends not on the size of the file but on the number of recursions.
- The demo code is not very sophisticated: there are some (maybe 5 or 6) magic numbers that ought to go into a config file, some less-than-defensive code, and some pieces which I ought to be ashamed of. But then I am not really a developer, so I am sure a smart developer would tackle all of these issues with ease.
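One common way to keep a k-means step from hanging is to cap the number of rounds and to pass the limits in as parameters instead of hard-coding them. The sketch below is a generic Lloyd-style k-means in Python, not the demo's clustering code; the names `kmeans`, `max_iter` and `tol` and their default values are assumptions for illustration, and in practice they would be read from a config file.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(cluster):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Return (centers, rounds_used, converged).

    Stops when no center moves by more than tol (squared distance),
    or after max_iter rounds, so the loop always terminates.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for rounds in range(1, max_iter + 1):
        # Assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute centers; an empty cluster keeps its old center.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        shift = max(dist2(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            return centers, rounds, True
    return centers, max_iter, False


points = [(0.0,), (0.2,), (0.1,), (10.0,), (10.2,), (10.1,)]
centers, rounds, converged = kmeans(points, k=2)
print(sorted(centers), rounds, converged)
```

When `converged` comes back `False`, the caller can log the file and move on instead of waiting indefinitely, which also gives a simple hook for auditing.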
In the layout/architecture/design.
- I am not happy with its ability to keep an audit of its parsing success.
- Climbing up and down the tree: I see scope for improvement.
- Handling of alternative options: it could be brought forward, made more configurable etc.
- Abstract class design: scope for improvement.
- Many other what-if scenarios, and widening of its business functionalities and extensibility.
If you get errors that look like "mscorlib not installed", you probably need to install Microsoft .NET SDK 2.x or above; unless you know how, better talk to your PC admin.