Tuesday, February 18, 2003

Combine harvester
Looks like a well-featured Perl crawling/indexing solution that might also be persuaded to do the task I have in mind.

HTML::Index on CPAN.
This is a set of Perl modules for creating, storing and searching indexes of HTML files, and it looks like a handy starting point for my HTML indexer. It seems I might be able to subclass it to use my own parser, so that it stores the code (the markup) and throws out the content. Then I could search for things like: which pages are still using font tags? Which call such-and-such stylesheet or JavaScript library?
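To make the "store the code, throw out the content" idea concrete, here is a rough sketch. It is written in Python with the standard library's html.parser, purely as an illustration: it does not use HTML::Index, and the names MarkupCollector and build_index, along with the sample pages, are made up for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MarkupCollector(HTMLParser):
    """Record tags and linked resources on a page; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.tags = set()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        self.tags.add(tag)
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel and attrs.get("href"):
            self.resources.add(attrs["href"])
        elif tag == "script" and attrs.get("src"):
            self.resources.add(attrs["src"])

    # handle_data is deliberately not overridden, so page text is discarded.


def build_index(pages):
    """pages: dict of page name -> HTML source (hypothetical input shape)."""
    tag_index = defaultdict(set)       # tag name -> pages using it
    resource_index = defaultdict(set)  # stylesheet/script URL -> pages using it
    for name, html in pages.items():
        collector = MarkupCollector()
        collector.feed(html)
        collector.close()
        for tag in collector.tags:
            tag_index[tag].add(name)
        for res in collector.resources:
            resource_index[res].add(name)
    return tag_index, resource_index


# Made-up sample pages for illustration.
pages = {
    "old.html": '<html><body><font size="2">Hi</font></body></html>',
    "new.html": '<html><head><link rel="stylesheet" href="site.css">'
                '<script src="lib.js"></script></head><body><p>Hi</p></body></html>',
}
tag_index, resource_index = build_index(pages)
print(sorted(tag_index["font"]))          # ['old.html']
print(sorted(resource_index["site.css"])) # ['new.html']
```

The index here is just a pair of in-memory dictionaries; persisting it is a separate problem that this sketch does not attempt.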
The real trick is going to be getting useful search results for tag combinations.
And don't forget I want to offer a download of the results in CSV or XLS format!
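One simple reading of "tag combinations" is "pages that use all of these tags", which is a set intersection over a tag-to-pages index. The sketch below, again in Python and with a made-up index and hypothetical function names, does that and writes the hits out as CSV. It only covers co-occurrence on the same page, not nesting or ordering of tags, and it leaves XLS aside because the standard library has no writer for it.

```python
import csv
import io


def pages_with_all(tag_index, tags):
    """Return the pages that contain every tag in `tags`."""
    sets = [tag_index.get(t, set()) for t in tags]
    return set.intersection(*sets) if sets else set()


def results_to_csv(query, pages):
    """Render the results as CSV text, one row per matching page."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["query", "page"])
    for page in sorted(pages):
        writer.writerow([query, page])
    return buf.getvalue()


# Hypothetical index: tag name -> set of pages using it.
tag_index = {
    "font": {"a.html", "b.html"},
    "table": {"b.html", "c.html"},
    "center": {"b.html"},
}
hits = pages_with_all(tag_index, ["font", "table"])
csv_text = results_to_csv("font+table", hits)
print(csv_text)
# query,page
# font+table,b.html
```

For a real download the same text could be served with a text/csv content type; how that plugs into the indexer is left open here.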