Summit Pre-Conference Program – ASIS&T 2005 Information Architecture Summit. The early registration deadline for the Information Architecture Institute’s Leadership Seminar is January 28th, so sign up now! I’ll be talking with Liv and Jorge about the state of global IA.

Liz Lawley: Many-to-Many: social consequences of social tagging. Excellent post, focusing on the social aspects of tagging, and introducing a kinda nasty meme, “lowest-common-denominator classification”, that I wish she hadn’t introduced. I don’t agree that folksonomies necessarily represent a lowest-common-denominator classification – on the contrary. There are many experts out there who are not librarians and who can give good tag. But Liz does identify some of the feedback mechanisms that encourage the lowest-common-denominator behaviour. Let’s just change those feedback mechanisms.

Vloggercon 2005 is over. It was brilliant, and historic. Vloggercon 2006 will have 500 people instead of the roughly 80 we got this year. But the amazing thing was: the 80 people were all the right people.

Remember, this was an unconference, pulled together in a few weeks by some people. No money involved. The wifi worked. There was coffee and bagels. Nobody seems to have gotten hurt. The discussions were interesting. The vendors didn’t take over. The audience got to speak – and really, it is silly to speak of “audience” because everyone was a presenter. Congratulations to everyone involved, and especially Jay, who really pulled this one together with good vibes and hard work.

Flickr: Photos tagged with vloggercon

I have a problem with backing up files to my external harddrive (on Win XP).

When I drag a large folder (a few gigs) to the harddrive, it starts copying, but after a while (5 minutes), the harddrive “disappears” – it no longer shows up in my list of drives, and the copy gives an error. I then have to unplug the harddrive and plug it in again to make it show up in my list of drives again.

I have this problem with 2 computers, and with 2 external harddrives, so I don’t think it’s a problem with either the computer or the drive. Small folders are ok, it’s just the larger folders that show this problem.

Is it impossible to back up large folders by just dragging them? Should I use a backup program? If so, which one is good (and cheap or free)? PC Magazine recommends the free SyncBack so I’m trying that out.

Later: damn! I tried it, but the same problem happened: after copying for 5 minutes, the external drive just “disappeared” from my computer, and the copy program reported an error. So it’s not just drag and drop; any copying to those drives (remember I have 2) makes them “disappear” from Windows. Any ideas?

Later: I ran Windows update and all, tried again, still the same problem.

Later: This article seems to describe the problem, in short: Windows XP can have problems with harddrives larger than 138Gigs, upgrade to Service Pack 2 (something I’ve been holding off on) to fix it.

Later: I successfully installed Service Pack 2, after it crashed on me once. I tried again, STILL the same problem. Arg.

Later: Alright, I am downloading more updates from the Windows Update site. The words “critical update”, “strongly recommend” and “security” were plastered all over the page so I figured I’d better say yes.

Later: I installed all latest Windows updates, doublechecked again (no updates), restarted, tried again, and fuck: still the same problem.

I am running out of ideas here…

Steve Arnold pointed out to me that “Languages that are synthetic do not fare well in automatic systems unless the source documents are highly technical.”

Which makes sense, once you understand what a synthetic language is. A synthetic language combines bits into really long words. For example, in Mohawk: Washakotya’tawitsherahetkvhta’se = “He ruined her dress” (strictly, “He made the thing that one puts on one’s body ugly for her”). One word is used for something that other languages need multiple words or a whole sentence for. You can see how that can mess with automated systems.

Languages are not simply synthetic or isolating; they fall on a spectrum, and some languages are just more synthetic than others. Examples of common synthetic languages are German, Russian, Turkish, Finnish, Japanese, Korean, and many more.

I came across this other interesting paper on multi-lingual search.

Internet Searching and Browsing in a Multilingual World: An Experiment on the Chinese Business Intelligence Portal (CBizPort) (PDF), Journal of the American Society for Information Science and Technology, July 2004.

It gives some good practical background on a project to create a business search for Chinese. It also describes an approach to automatically summarizing and categorizing search results. Here is its description of the Chinese search landscape, illustrating some interesting language-locale subtleties.

“Chinese is the primary language used by people in mainland China, Taiwan, and Hong Kong. Language encoding, vocabularies, economies, and societies of the three regions differ significantly. Regional search engines, therefore, have been developed to provide Internet searching.

In mainland China, the major search engines include Sina and Baidu. Baidu currently powers over 80% of Internet search services in China, including ChinaRen, 163.net, etc. The database of Baidu stores over 60 million Web pages collected from mainland China, Hong Kong, Taiwan, and Singapore, and grows at a speed of several hundreds of thousands of Web pages per day. Sina is an Internet portal providing comprehensive services such as Web searching, e-mail, news, business directory, entertainment, weather forecast, etc. From our review of search engines in mainland China, we found that Baidu has better search capabilities than the others, as shown by its content coverage. Sina has a wider scope of functions than Baidu.

In Taiwan, the two major Internet search portals are Openfind and Yam. Openfind, established in 1998, is one of the largest portals in Taiwan. In addition to basic searching, Openfind suggests terms that are highly associated with users’ queries to help them refine their search. It also allows users to find more related items from each search result and highlights the query terms in the results. Established in 1995, Yam provides comprehensive online services. Its four major focuses are content, communication, community, and commerce (4C). Yam’s search engine allows users to search various media: Web sites, Web pages, news, Internet forum messages, and activities (in 18 Taiwan cities or regions). We found that Openfind has better functionality and content coverage, but Yam was better established in the local market (e.g., it powers the search function of the Taiwan government’s Web sites).

In Hong Kong, due to its bilingual culture, people rely on both English and Chinese when accessing and searching the Internet. Major search portals include Yahoo Hong Kong and Timway. Of these, Yahoo Hong Kong is one of the most popular. Yahoo Hong Kong’s search engine returns results in different categories, Web sites, Web pages, and news. Headquartered in Hong Kong, Timway provides services such as Web searching, Web directory, e-mail, news, forums, etc. Its database stores over 30,000 Hong Kong Web sites and over 10 million Web pages.”

In other words, just because Chinese is one language doesn’t mean that one Chinese search engine will be enough for all the users who want to search in Chinese. There are different groups of people who search in Chinese, with different local requirements, and these local requirements have given rise to a number of different search engines for Chinese.

A Framework for Multilingual Searching and Meta-information Extraction: what is “term isolation“? “Term isolations means extracting the individual terms from the text. This is necessary for languages such as Chinese and Japanese, that do not contain white space between individual terms. Term isolation is not a trivial task, and requires the software to understand the language’s grammar and have a complete dictionary. Term isolation is clearly a language-specific task (with different software modules for different languages).”
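To make the idea concrete, here is a minimal sketch of term isolation using greedy forward maximum matching, one simple dictionary-based technique. It is not the method from the paper quoted above. The function name `isolate_terms`, the `max_len` parameter and the four-word toy dictionary are all made up for illustration; a real system needs a complete dictionary and some grammar knowledge, as the quote says.

```python
def isolate_terms(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary entry that starts there.
    If nothing matches, fall back to a single character.
    """
    terms = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        # Try the longest candidate first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        terms.append(match)
        i += len(match)
    return terms


# Toy dictionary (an assumption for this example): China, business, search, engine.
TOY_DICTIONARY = {"中国", "商业", "搜索", "引擎"}

print(isolate_terms("中国商业搜索引擎", TOY_DICTIONARY))
```

With this toy dictionary the unspaced string comes out as four separate terms. Greedy matching picks the wrong split whenever a longer dictionary word happens to overlap the right one, which is one reason the task is not trivial.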

Multi-lingual search

How do you know what language the query a user entered is in, and how do you search languages like Japanese, that don’t use spaces between the words? And how do you identify languages used in an unstructured, multi-lingual document? Multilingual search is a hard problem.
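As a rough first step, you can at least look at which writing systems a query uses. The sketch below counts scripts by taking the first word of each character's Unicode name. That shortcut, and the function name `script_counts`, are my own assumptions for illustration, not how any search vendor does it. A script is also not a language: Latin letters could be English, Dutch or Spanish, and Japanese mixes several scripts in one sentence.

```python
import unicodedata
from collections import Counter


def script_counts(query):
    """Count the letters in a query by a rough script label.

    The label is the first word of the character's Unicode name,
    e.g. LATIN, CJK, HIRAGANA, KATAKANA, HANGUL, ARABIC, CYRILLIC.
    """
    counts = Counter()
    for ch in query:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts


# A mixed query: "Tokyo" in kanji, a hiragana particle, and an English word.
print(script_counts("東京のhotel"))
```

A query that contains hiragana is very likely Japanese, while one that only contains CJK ideographs could be Chinese or Japanese, so this only narrows things down.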

Most of the search players use the same basic technology, provided by Basistech. Check out their customer list: Google, Amazon, MSN, Yahoo, Endeca, Peoplesoft, Overture, you name it, they have them.

The technology, called the Rosette Linguistic Platform, “helps your applications unlock the meaning of unstructured text by determining the language and encoding of a given document, converting the text to Unicode so that it can be processed, identifying the basic linguistic features and structure, and locating key concepts like the names of people and places.”

In other words, it deals with Asian and Arabic language search problems, and does entity extraction (extracting names of people, places and companies and such). It identifies individual words for languages such as Japanese that do not use spaces between words, breaks compound words into their individual components, and identifies parts-of-speech such as verb, adjective, etc.

Once it has done its job making sense of the languages it finds, the search technology of the vendor takes over.

They have a pretty cool demo that explains what the technology does – here’s a screenshot:

[Screenshot: the Rosette demo]

Google hits comment spammers hard?

Simon thinks Google will soon announce that they won’t be calculating PageRank for links with a rel=”nofollow” attribute. And he’s probably right, it makes complete sense for Google to do this. Dave Winer has a “mysterious” announcement coming up, and if you view source you see he has the rel=”nofollow” implemented.

There are two ways rel=”nofollow” could work: either Google simply doesn’t follow these links, or they don’t attach Pagerank to these links. Unlike Simon, I think it is probably the first. It makes more sense semantically. It’s probably easier for them to implement as well. Also, Pagerank has become less and less important in Google’s algorithms over the years.

This means that you just add this attribute to all the links that are added by users, like links in comments. You don’t add it to the links in your blog posts. People can still follow ALL links, but search engines (only Google for now) would only follow the links in your blog posts. The stuff *you* link to in your blog posts still gets yummy Pagerank goodness – your blog doesn’t lose its Google power.
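Here is a minimal sketch of what a blogging tool might do to comment HTML before publishing it, assuming the attribute ends up working the way I guess above. The function name `add_nofollow` is made up, and the regular expressions are a deliberate simplification: they can be fooled by odd markup (a `>` inside an attribute value, for example), so real software should use a proper HTML parser.

```python
import re

# Matches an opening <a ...> tag and captures its attributes.
A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches an existing rel attribute (double-quoted, single-quoted or bare).
REL_ATTR = re.compile(r"\srel\s*=\s*(\"[^\"]*\"|'[^']*'|[^\s>]+)", re.IGNORECASE)


def add_nofollow(comment_html):
    """Return the comment HTML with rel="nofollow" on every link."""
    def rewrite(match):
        # Drop whatever rel the commenter supplied, then add our own.
        attrs = REL_ATTR.sub("", match.group(1))
        return "<a" + attrs + ' rel="nofollow">'
    return A_TAG.sub(rewrite, comment_html)


comment = 'Nice post! <a href="http://example.com/">my site</a>'
print(add_nofollow(comment))
```

You would run this over comments and trackbacks only, and leave the HTML of your own posts alone.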

And more importantly, Google doesn’t lose its blog power – it can still take advantage of the meaning embedded in the links on blogs, just without much of the pollution. I can even see them implementing a little Pagerank boost for outgoing links on a domain that does have some rel=”nofollow” links implemented, since it means that the links that don’t have that attribute are probably somewhat more meaningful.

When you do this, the incentive for spammers to spam you (increased Pagerank) is pretty much taken away. Comment spammers don’t do it in the hope that some human will follow that link. They’re in it for the Pagerank.

The amount of spam you get won’t diminish immediately – spammers use automated tools and don’t really care whether it works on a particular blog. But if the majority of blogs implement this, then it will become less and less attractive for comment spammers to spend time comment spamming.

This is where hosted services like Blogger (owned by Google) or Typepad really shine. I expect them to support this from the moment of announcement on, making the majority of blogs protected against spam. It wouldn’t make sense for Google to implement this and not let Six Apart (owners of Typepad) know about it – that would be abusing their search engine power a bit.

The most popular blogging packages would support this as well, and as people slowly upgrade to new versions, within a year or so 80 to 90% (I’m making these numbers up) of blogs will be protected. OK, who makes the condom-like logo that says “my blog is spam protected”?

Not everyone is optimistic though.

An open question: as Google loses some of its dominance in the search world, will other search engines start supporting this? If not, the measure may not be as effective as we hope.

Will this stop comment spamming? Not right now. Will it stop the growth of comment spamming? Hopefully.

Then again, this may all be wrong, since Dave supposedly left a comment saying “Pssst. Good work. You’re getting warmer. ;->”

Follow this story on Technorati.

Smart Mobs: Cameraphone as Conversational Medium: “Daisuke Okabe has just published Emergent Social Practices, Situations and Relations through Everyday Camera Phone Use, her report on the research she conducted with Mizuko Ito. The continuous sharing of image-streams with social networks seems to be developing as a hybrid of technological artifact and mediated discourse – friends self-surveill and share what they are seeing as they move through the world, through their day.”

From the comments: How and why people use cameraphones (PDF).

We’re seeing some of this happening in the videoblogging group as well.

Nick Finch (?) writes about content and locales and the problems with continents, but I can’t seem to contact him.

I linked to another website with this post, that I found via Technorati, and I now suspect that that other website was a fake blog meant to create Pagerank, populated automatically from various feeds. Evil.

I fucking HATE that Blogger requires you to open a fucking account just to leave a fucking comment. Sorry Google, but that’s being worse than Microsoft.

Emergent i18n effects in folksonomies

My series of posts on international information architecture:

  1. Translating taxonomies and categories
  2. Translating categories, translating terms
  3. Translating the Dewey Decimal Classification system
  4. Designing the relationship between content and locales
  5. Emergent i18n effects in folksonomies (this post)
  6. The Maori versus Dewey, and why limiting access can be culturally appropriate.

Folksonomies are taxonomies created by users who add tags to things. Folksonomies are messy and have a lot of problems, but their great merit is that they’re scalable and that, by definition, they use the users’ own terminology. Failing to do so is a serious problem with more classic taxonomies, which are created by information architects or librarians.

There is a lot of innovation happening around folksonomies. One interesting area is internationalization. Users enter tags in many languages, but generally, the system does not know what language a tag is in. Have a look at a screenshot of this page from Technorati, showing popular tags:

[Screenshot: popular tags on Technorati]

I drew big red lines around them, so you’ll probably notice some tags in other languages than English. “Algemeen – Algemein – Algemeines” means “general” in Dutch and German. “Entertainment Entretenimento” is English-Spanish for entertainment. Notice that a misspelling of “Entretenimiento” is also used. But we are talking about languages today, not spelling. “Music – Musique – Música” are English, French and Spanish.

So what’s going on here? People are tagging things in many languages. Right now, Technorati displays the various languages mixed together on one page. That’s pretty interesting, especially if you’re interested in languages like me. But it might also be cool to see only tags in your language. Especially if you don’t speak English.

How can we do this? One way is to use a dictionary lookup to figure out what language a certain tag is in. It won’t be perfect, but this approach could be used to display a page of popular tags in mostly German, mostly English, mostly Hindi and so on. This will reduce the number of tags shown to users, but make those tags more relevant to them (because they are in their language). Again, if you speak English, seeing a few non-English tags won’t bother you, but this is not for speakers of English, the dominant language. For all the people who don’t speak English, seeing tags in their language will be invaluable. If you do the dictionary lookup only for popular tags, it shouldn’t be too resource-expensive – a tag only has to be checked against the dictionary once, and assigned a probable language.
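A minimal sketch of the dictionary lookup, just to show how little is needed to get started. The word lists are tiny and invented for this example, and the names `WORD_LISTS`, `probable_languages` and `tags_for` are hypothetical; a real version would load full dictionaries per language.

```python
# Toy word lists, made up for this example.
WORD_LISTS = {
    "en": {"music", "entertainment", "general", "cat", "chat"},
    "fr": {"musique", "chat"},
    "es": {"música", "entretenimiento"},
    "nl": {"algemeen", "leuk", "konijn"},
}


def probable_languages(tag, word_lists=WORD_LISTS):
    """Return the sorted language codes whose word list contains the tag."""
    word = tag.strip().lower()
    return sorted(lang for lang, words in word_lists.items() if word in words)


def tags_for(lang, popular_tags, word_lists=WORD_LISTS):
    """Check each popular tag once, then keep the ones that may be in `lang`."""
    checked = {tag: probable_languages(tag, word_lists) for tag in popular_tags}
    return [tag for tag in popular_tags if lang in checked[tag]]


popular = ["Music", "Musique", "Música", "Algemeen", "chat", "vloggercon"]
print(tags_for("fr", popular))
```

Note that “chat” comes back as both English and French, so it would show up on both pages, and a made-up word like “vloggercon” gets no language at all. What to do with tags like that is a design decision; showing them everywhere is one option.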

Another way is to look at the language of the source (RSS feed, …) and assume the tag is in the same language. Tricky – I’m not sure this will work. There might be other algorithms as well – if I do a Google search for “Música”, it knows this isn’t English, because it asks me if I want to “Search for English results only”, so I assume there is some algorithm going on there (unless they also use the dictionary approach).

Later: I realized something else. Displaying tags in mostly some language, as opposed to exclusively in that language, is not necessarily a bug, it might be a feature. Many user populations around the world incorporate words in multiple languages in their vocabulary. The language namespaces I am talking about might not map perfectly to a specific language, but include words in other languages, and slang and such, and in this way be a much better representation of the real language of a certain user population than if we were to just use one language. So it’s not so much about language namespaces, it’s more about user population namespaces. Language is just a starting point and might be an easy way to group user populations.

As an aside, I think the real innovation with folksonomies will come from creating algorithms. It’s all about scalability. Just as Google’s superior search algorithm made them the number one search engine, someone will invent superior algorithms for tagging, and this may make them the number one tagging engine.

Back to languages. The most interesting aspect of the screenshot above isn’t that there are tags in other languages; it’s that the same tags show up in several languages. The tags in French and Spanish and such have their English translations right on the same page. This suggests that people tag things similarly in different languages. Is there a way to create algorithms that take advantage of this fact? Also take into account that different people can tag the same things (pictures, bookmarks, …) in different languages.

An interesting language effect with tags was pointed out by Tanya, on this Flickr page for the tag chat. “Chat” means cat in French. Here’s a screenshot:

[Screenshot: the Flickr page for the tag “chat”]

So different people have used “chat” and “cat” to tag similar items, and as a result Flickr knows that “chat” and “cat” are related tags. I’m not sure what that means for internationalization of folksonomies, but as an emergent i18n effect I think that’s pretty amazing.
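I don’t know how Flickr actually computes related tags, but simple co-occurrence counting already produces this kind of effect. Here is a minimal sketch under that assumption; the function name `related_tags` and the sample photos are invented.

```python
from collections import Counter


def related_tags(tagged_items, tag, top=3):
    """Return the tags that most often appear on the same items as `tag`."""
    counts = Counter()
    for tags in tagged_items:
        unique = {t.lower() for t in tags}
        if tag in unique:
            # Count every other tag that shares an item with `tag`.
            counts.update(unique - {tag})
    return [t for t, _ in counts.most_common(top)]


# Invented sample data: each inner list is the tag set of one photo.
photos = [
    ["chat", "cat", "kitten"],
    ["chat", "cat"],
    ["chat", "paris"],
    ["dog", "park"],
]

print(related_tags(photos, "chat", top=1))
```

Nobody told the system that “chat” is French for “cat”. The link falls out of the data because some people tagged the same photos with both words.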

A third and similar i18n effect I found when playing around with this in Flickr is that of language namespaces. If you start following related tags in Flickr in a certain language, you will see many tags in that language. Here’s a screenshot of the tag “leuk“, which means “funny” in Dutch.

[Screenshot: the tag “leuk” on Flickr]

The related tags are in English, but in the see also tags we see a whole bunch of Dutch words: “ik, konijn, middelharnis, mooi, oudetonge, overflakkee, plankje”. And if you follow those you’ll see more Dutch words in the related and see-also tags, creating a kind of Dutch namespace almost. Again, I’m not sure how to use this exactly, but it’s pretty amazing to me that, in this early stage, there are already interesting i18n effects happening in the tagging space.

Comments welcome!


Technorati has started to aggregate and combine tags from feeds (it uses categories), Flickr and Del.icio.us. Here’s an example page: Technorati: Tag: humor. They support the rel=tag attribute. It’s brilliant.

Innovation around folksonomies is going fast, a lot of it driven I think by David Weinberger, who is writing a book about tagging and taxonomies and such, and is blogging about this stuff all the time, and happens to be on the board of Technorati. I’m excited – this is great stuff. I am also starting to experiment with tags on me-tv, a videoblog aggregator project of mine.