Automated classification: in response to a question about Teragram Categorizer, which combines automatic classification with rule-based classification, Seth Earley wrote the following on the SIGIA-L list. The list doesn’t seem to be archived anymore, so I’m reposting it here:
“IBM spent a lot of time and energy developing Discovery Server which was supposed to do clustering, automatic categorization and taxonomy generation. The terms were machine generated and needed intervention by human indexers. The algorithms were supposed to learn from changes to categories and manual reindexing but this process tended to poison the algorithms. Training sets needed to be very large and have good data. I co authored a book about the technology (with Wendi Pohs of IBM). The technology was largely abandoned but some of the DNA is now part of IBM’s Omni Search.”
(I couldn’t find the book referenced.)
The common wisdom among information architects has always been that automated classifiers can be useful, but only if your data is fairly clean and structured (news articles are, an intranet usually isn’t), or if you put in a lot of work developing rules. Does this still hold, or has the technology evolved?
I have 6 Gmail invites. Leave a comment with your email to get one.
Amazon is trying yet another approach to the problem of scaling tabs. If you roll over the “see all 31 product categories” tab, you get a dropdown box listing all of them, plus a bunch of other links. It’s almost like having a sitemap at the bottom of the page, but at the top of every page.
I wrote before about different approaches to this problem; here’s another approach:
Looking for a PHP auth class. I have tried quite a few, but for some reason none of them really fulfills my needs.
– MySQL, and just a few files to include. No dependency on ADOdb or a similar abstraction class that gives me access to lots of databases that I’ll never switch to. I don’t like including hundreds of KB of code in every page for something I really don’t need, so classes that require lots of other classes to be included are out.
– easy to use: easy to log a user in and out, and to check for a login. I can point it to an existing user table (I don’t want it to use its own tables), …
– “remember me” checkbox function and “forgot password” function.
– stable and mature.
A decent auth class in PHP is like sortable tables in HTML – it seems like someone should have done a decent version after all these years, but maybe they haven’t.
An animated GIF showing the tsunami waves (via Nick Denton). Wikipedia has a great page on today’s tsunami.
Address bar knows all: “Go up to the address bar in your browser and put up each letter in the alphabet.”
A is for http://amazon.com/.
B is for http://bloglines.com/myblogs.
C is for http://crule.typepad.com/, a videoblog.
D is for a dev site that’s not publicly accessible.
E is for another non-public dev site.
F is for http://www.flickr.com/.
G is for http://groups.yahoo.com/group/videoblogging/messages.
H is for http://hotmail.com/.
I is for http://india.poorbuthappy.com/.
J is for https://joker.com/index.joker.
K is for http://knowspam.net/, an anti-spam service for email that I really like.
L is for http://localhost/.
M is for another non-public dev site.
N is for http://www.nymosaico.com/
O is for (PDF!) http://www.oclc.org/dewey/news/newsletter/ddcnews200401.pdf
P is for my non-public weblog posting url.
Q is for (Quicktime movie!) http://qtss.usip.org/video/2003/colombia/open.mov
R is for http://ryanedit.blogspot.com/, another videoblogger.
S is for yet another non-public dev site (a wiki).
T is for https://www.typepad.com/
U is for http://uzipp.com/, my host for this site.
V is for http://video.poorbuthappy.com/watch/
W is for http://wordpress.org/
X is for http://x.nnon.tv/, another videoblog.
Y is for http://www.ysearchblog.com/
Z is for http://www.zend.com/php5/contest/contest.php
Which suggests that I spend most of my time reading blogs, using online services, or developing my own.
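For the curious, the address bar game can be roughly approximated in code. This is only a sketch under loud assumptions: real browsers rank autocomplete suggestions in their own undocumented ways, so here the "winner" for each letter is simply the most visited URL whose host (ignoring a leading "www.") starts with that letter. The function names and the sample history are made up for illustration.

```python
from collections import Counter
from string import ascii_lowercase
from urllib.parse import urlsplit


def completion_key(url: str) -> str:
    """Return the lowercased host without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def first_suggestions(history: list[str]) -> dict[str, str]:
    """Map each letter to the most visited URL whose host starts with it.

    Assumption: visit count is the only ranking signal. Letters with no
    matching URL are left out of the result.
    """
    visits = Counter(history)
    suggestions = {}
    for letter in ascii_lowercase:
        candidates = [u for u in visits if completion_key(u).startswith(letter)]
        if candidates:
            # Most visits wins; ties go to the URL seen first in the history.
            suggestions[letter] = max(candidates, key=lambda u: visits[u])
    return suggestions


if __name__ == "__main__":
    # Hypothetical history built from a few URLs in the list above.
    sample = [
        "http://amazon.com/",
        "http://www.flickr.com/",
        "http://localhost/",
        "http://www.flickr.com/",
    ]
    for letter, url in first_suggestions(sample).items():
        print(f"{letter.upper()} is for {url}")
```

Feeding it an exported browser history (one URL per visit) would produce a list in the same "A is for ..." shape, though likely not identical to what the address bar itself suggests.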