I’m helping out a bit on the IA of ourmedia.org, and I’m trying to come up with a list of common languages. These will be expanded later. It’s tricky – the list I finally came up with is just a list – WARNING: please don’t use this list for your project if you need common languages – it is fairly arbitrary.

It’s pretty hard to come up with a list of languages for a project like this. We want to allow anyone to upload stuff in any language, but we also wanted to have a short-ish list of languages (20 to 40) instead of just using the really long one (a decision that’s not necessarily the right one, but it was taken).

We didn’t have time to take into account which languages are producing more media that’s likely to be uploaded to ourmedia.org, or any factors like that. So this list is mostly based on the number of speakers these languages have, with some preference for European languages (the target audience, in terms of computer access right now).

Top 10 Common Languages by number of native speakers
– The UN uses these main languages on their website: English, Arabic, Chinese, French, Russian, Spanish.

This is the list I came up with, alphabetically. Users can use this to indicate the language of their media, and other things, or choose another language.


Again, don’t just use this list if you need common languages. It’s very arbitrary and it took me less than an hour to put this together. I share it here as a starting point only.

Please leave comments!

I am in the middle of grading papers for the XTech 2005 Conference, and I am thinking about grading, categorizing and comparing. The important thing with grading is that you get a consistent approach – if I only give A’s to a few people, and someone else gives A’s to almost everyone, it’s not consistent. I think the XTech people average it out by manually comparing all the grades and comments.

When you grade or categorize, it is important to have something to compare to. The first thing I did when I started was to have a look at all the papers. In my head, I averaged them out, and then started grading. Similar effects happen with categorizing: it may be important to see how others have categorized this item, or what other items have been categorized using the categories you assigned to this item.

Excuse my rambling, but this reminds me of a story I read in the New York Times a long time ago, about grading student papers and how the graders got trained. The challenge is to get a consistent grade, and the article made it sound as if they did have a good system to get that. Lots of training was involved, with example papers and grades, and regular refreshing of the training, to make sure graders were still following the standard. The article might have been called “Grading This Article? First, Take Time to Learn the Rules”, but I’m not sure because the NYT doesn’t let you access old articles without paying up.

Then again, while searching for this article I found another one: “Grading Mistakes Caused More Than 4,000 Would-Be Teachers to Fail a Licensing Exam” (the NYT pisses me off with this closed linking policy by the way).

Librarians tend to say categorizing can only be done after a lot of training. And the stats show that, even with trained indexers, indexing terms differ something like 60% between indexers (Bella, correct me on the numbers here). Of course, indexing isn’t the same as categorizing, but that’s a lot of inconsistency for trained people.
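To make “differ something like 60%” a bit more concrete, here is a minimal sketch of one way to put a number on it: count the terms two indexers share and divide by all the distinct terms either of them used. This is an assumption on my part, not necessarily the measure the indexing studies use, and the function name and the example terms are made up for illustration.

```python
def indexing_overlap(terms_a, terms_b):
    """Share of distinct terms that both indexers assigned (0.0 to 1.0)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        # Two empty term lists agree trivially.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical terms assigned to the same document by two indexers.
indexer_1 = ["taxonomy", "translation", "localization", "classification"]
indexer_2 = ["taxonomy", "translation", "internationalization", "libraries"]

# 2 shared terms out of 6 distinct terms overall.
overlap = indexing_overlap(indexer_1, indexer_2)
print(f"overlap: {overlap:.0%}, difference: {1 - overlap:.0%}")
```

A number like this is exactly the kind of feedback a system could show while someone is categorizing.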

Anyway, my point is, I think you can build in the right kinds of feedback to make some kinds of categorizing pretty efficient. And we haven’t explored these kinds of feedback very much yet – they’re specific to a computing environment, in other words, we didn’t have these possibilities 20 years ago. I have more thoughts about this, will report back later!

Designing the relationship between content and locales

My series of posts on international information architecture:

  1. Translating taxonomies and categories
  2. Translating categories, translating terms
  3. Translating the Dewey Decimal Classification system
  4. Designing the relationship between content and locales (this post)
  5. Emergent i18n effects in folksonomies
  6. The Maori versus Dewey, and why limiting access can be culturally appropriate.

I am trying to gather thoughts on the various structures and patterns that occur when creating multiple versions of a website in multiple locales. I hope to create a model, a way of thinking about these structures that can help when making decisions about things like:

  • What content should be translated?
  • From which website to which website?

We can come up with many models (models schmodels), the question is: is this a useful one? Does it aid your thinking? This is early work, so any feedback (shoot it down!) is greatly appreciated.

Let’s get started.

When you are creating versions of a website in multiple locales, the simplest case is when you just translate everything on the website. One on one translation. You start with a master locale, and just translate it into another locale. This is rare though – very few translation projects are this simple.

One on one translation.

The second, more common case is selective translation: you have a master locale (often English), and other locales are partially translated.

Selective translation.

There are various types of selective translation: you can do a summation, where multiple pages or whole sections of the master website are replaced by just one page in the translated version. Or you can just not translate parts of the website: removal. Most projects do a bit of both.

A third case is when you have a master locale, but also original content in the translated locale: original content.

Original content

The original content in the translated language can be used as a master for translation into yet another language.

For example, your master locale is English-US, the translated locales are English-Canadian and French-Canadian. (French is an official language in Canada, and there are certain legal requirements to provide information in both official languages.) You might have a partial original content translation from English-US to English-Canadian, in other words, you take parts of the English-US content, and create parts of the content for English-Canadian from scratch. Then you might do a one-on-one translation from English-Canadian to French-Canadian.

This example can be described as a grouping of locales. Many countries or regions have legal requirements (and human needs) to provide content in various official languages. If you have an intranet in Canada, you must provide content in French even if you only have 1 employee in Montreal who speaks French. In Belgium, you should provide content in French and Dutch, since half the country speaks French and the other half Dutch. If your locale is South America, you had better provide content in both Spanish and Portuguese, and maybe a few other languages as well.

When you are creating your content-locale structure, grouping locales often makes sense, in that a certain locale can become the master of all local languages, like in our Canadian example. The content needed for this group of locales is the same or very close.

Finally, sometimes almost no content is directly shared. In this case, we’re just talking about separate websites. It is a valid option, but I won’t discuss it in much depth here.

So, to recap, we have 4 simple ways to connect locales:

  • One-on-one translation
  • Selective translation (summation and removal)
  • Original content
  • Grouping of locales.

These structures can help us design the relationship between content and locales for our websites.
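As a rough illustration, the four connection types could be written down as a small data model. This is only a sketch under my own assumptions: the `Relation` and `LocaleLink` names, the `masters_of` helper and the locale codes (`en-US`, `en-CA`, `fr-CA`) are hypothetical and not taken from any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """The four simple ways to connect locales from the recap above."""
    ONE_ON_ONE = "one-on-one translation"
    SELECTIVE = "selective translation (summation and removal)"
    ORIGINAL_CONTENT = "original content"
    GROUPING = "grouping of locales"


@dataclass(frozen=True)
class LocaleLink:
    source: str         # locale that acts as the master for this link
    target: str         # locale that receives the content
    relation: Relation  # how content travels from source to target


# Hypothetical encoding of the Canadian example.
links = [
    LocaleLink("en-US", "en-CA", Relation.ORIGINAL_CONTENT),
    LocaleLink("en-CA", "fr-CA", Relation.ONE_ON_ONE),
]


def masters_of(locale, links):
    """Walk back from a locale through the locales it is derived from."""
    chain = []
    current = locale
    while True:
        parent = next((l.source for l in links if l.target == current), None)
        if parent is None or parent in chain:  # stop at the root or on a cycle
            return chain
        chain.append(parent)
        current = parent


print(masters_of("fr-CA", links))
```

Writing the links down like this makes it easy to ask where the content for a given locale ultimately comes from, which is one of the decisions the model is supposed to help with.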

When you start designing the relationship between content and locales, you’ll often find that the structure you come up with is different for different types of content. Technical support information may be translatable directly in a one-on-one translation. You might not sell exactly the same products or services in all markets, so marketing content might require a partial original content translation. And so on.

So here is an example of how you might design the relationship between content and locales for a public product website. We are distinguishing a few different content types, and for each content type we provide different relationships.


Will the new $500 Apple mini work for my mom?

Tim Bray asks if the new $500 Apple Mini is good for his mom. I asked myself exactly the same question today.

My mom reads email, and she looks at a website or two every now and then. And writes something in Word and prints it out. She also likes pictures we send her by email. Will the mini work for her?

Read emails: no problems there.

Write things: OpenOffice has a Mac version, and will do just fine for her. The printer won’t fit, I don’t see a serial port there, so we’ll have to buy a new one. But it seems that the cheap kind of inkjet printer (where they make money off the cartridges) will work on Macs as well, so that’s fine.

Hardware: I have a USB keyboard lying around, so that’s good. I’ll have to buy a mouse. I’m not sure the old monitor will work, it’s a few years old. If I have to buy a new one that would suck. It’s got that standard blue PC monitor connection, but I don’t see that on the Mac. Will that work?

The learning curve.
Tim was right: however easy the OS is, for my mom it’s just another learning curve. Then again, the biggest problem with Windows is those error dialogs popping up, confusing her. Maybe the Mac really *is* easier and will make her life less stressful. The “it just works” approach is perfect for my mom, if it really does just work. What do you think? Should I get her one?

Translating the Dewey Decimal Classification system

My series of posts on international information architecture:

  1. Translating taxonomies and categories
  2. Translating categories, translating terms
  3. Translating the Dewey Decimal Classification system (this post)
  4. Designing the relationship between content and locales
  5. Emergent i18n effects in folksonomies
  6. The Maori versus Dewey, and why limiting access can be culturally appropriate.

The Dewey Decimal Classification system (used in libraries throughout the world to classify books) has been translated into many languages, so I figured I’d ask their Editor in Chief, Joan S. Mitchell, about their experience. I was particularly interested in their approach to categories as “concepts”, and their approach to developing a taxonomy of concepts, regardless of translatability.

Joan S. Mitchell: “Many DDC translations contain more detailed developments in selected areas than found in the English-language standard edition. For example, the province of Rovigo has one number (–4533) in Table 2 in the English-language full edition; there are nine subdivisions listed under this number (representing parts of Rovigo province) in the Italian translation of the full edition. ”

In other words, local translations have different requirements – the English version may be content with providing geographical categories for Italy up to the level of a province, but people in Italy may be interested in having categories for subdivisions of provinces.

Here’s a case study describing this as well: Beall, J. 2003. Approaches to expansions: case studies from the German and Vietnamese translations. Presented at the Classification and Indexing – Workshop, World Library and Information Congress: 69th IFLA General Conference and Council. 1-9 August 2003, Berlin. http://www.ifla.org/IV/ifla69/papers/123e-Beall.pdf

I also asked her about terms or categories that are not easily translatable.

Joan S. Mitchell: “In nearly every translation, we have encountered terms that are not easily translated from one language to another. There are a number of strategies we employ to address these issues.

In some cases, the term is left in English, e.g., “land-grant colleges” appears in English in the German translation because it is part of the name of a category and there isn’t a suitable equivalent in German.

Sometimes, we adjust the definition of category itself to accommodate translation issues. For example, the popular German form of bowling is skittles (ninepins); in the US, the most popular form is tenpin bowling. After an inquiry from the German translators, we added a note to indicate that the category of bowling covers both types.

In the case of examples that illustrate categories, we routinely encourage translators to substitute examples that have meaning in the specific language group for those used in the English-language edition. We also encourage the inclusion of index terms that are meaningful to the language group and compatible with the definition of the category. The index to a translation can include different synthesized numbers and accompanying index terms from those found in the English-language standard edition.”

I was curious about their approach to the taxonomy as a “concept tree”. This assumes that all concepts can be expressed in all languages, although (as in the bowling example above), sometimes categories are adjusted with feedback from translators.

In my previous post, “Translating categories, translating terms”, I discussed the problems with translating categories. My conclusion was that sometimes, categories are just not translatable. People think too differently. The DDC has a different philosophy: they assume ALL concepts are translatable.

Joan S. Mitchell: “Translatability is not a criteria for adding new concepts; however, there has to be a certain threshold of published material (“literary warrant”) for the inclusion of a term or the expansion of a category. The literary warrant we use to develop the English-language standard edition is based primarily on the contents of OCLC’s WorldCat database, the world’s largest bibliographic database (over 57 million records). ”

For non library folks: “literary warrant” just means that, if you have a lot of books about a subdivision of a province in Italy, you should probably have that subdivision as a category, so people can find those books. The concept is the same with products you sell, or services, or whatever. Most companies have different products or services in different parts of the world, or have a different amount of information available (maybe not all technical manuals have been translated), therefore, the categories should be of different granularity.
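Here is a minimal sketch of the literary warrant idea: a category is only kept if enough items fall under it, so the same rule gives different granularity in different locales. The threshold value, the function name and the item counts are all made up for illustration; the DDC editors do not state a number.

```python
from collections import Counter


def warranted_categories(item_categories, threshold):
    """Return the categories that have at least `threshold` items."""
    counts = Counter(item_categories)
    return {category for category, n in counts.items() if n >= threshold}


# Hypothetical catalogues: one category label per item, per locale.
english_items = ["Rovigo"] * 30 + ["Rovigo / subdivision A"] * 2
italian_items = ["Rovigo"] * 30 + ["Rovigo / subdivision A"] * 25

THRESHOLD = 20  # assumed cut-off, purely illustrative

print(sorted(warranted_categories(english_items, THRESHOLD)))
print(sorted(warranted_categories(italian_items, THRESHOLD)))
```

With these made-up numbers the English catalogue keeps only the province, while the Italian one also earns the subdivision.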



The videobloggers have been quietly working in the slipstream of the podcasters, fairly unnoticed, and that has been a great advantage. Now, vloggercon, really just a meeting of a bunch of videoblogging geeks, is coming up and you can hear the spotlights (they squeak) slowly turning towards us. I am confident the videobloggers will survive them, and when the storm is over we’ll continue to work on what we believe is tv 2.0. It’s about the long tail. It’s about conversations. We’re not entirely sure yet what it’s about, exactly, but it’s not about replacing tv. TV is not really relevant to this. Videoblogging is new. It’s also surprisingly different from text blogging, which is why we need quiet time to figure things out, things like language, discussion and voice.

solitude.dk | Defining (Video)Blogging: “Definitions are crucial to research, not because you can get recognition, but because definitions are the prerequisite for talking about a concept. If I can’t define what a blog is, I can’t discuss it”

Not. You can easily discuss love, or god, or loads of terms without explicitly defining them. Right now, defining videoblogging risks narrowing the imagination, and we need the opposite. Not that I’m against any attempts – go ahead :) And if you do it in a good way it might serve to open up the imagination.

Gates taking a seat in your den | Newsmakers | CNET News.com

It turns out that Bill Gates is a fucking moron, at least on some levels: (News.com interview)
“Q: In recent years, there’s been a lot of people clamoring to reform and restrict intellectual-property rights. It started out with just a few people, but now there are a bunch of advocates saying, “We’ve got to look at patents, we’ve got to look at copyrights.” What’s driving this, and do you think intellectual-property laws need to be reformed?”

“Bill Gates: No, I’d say that of the world’s economies, there’s more that believe in intellectual property today than ever. There are fewer communists in the world today than there were. There are some new modern-day sort of communists who want to get rid of the incentive for musicians and moviemakers and software makers under various guises. They don’t think that those incentives should exist.”

You *have* to be kidding.

ffmpeg-php: “ffmpeg-php is an extension for PHP that adds an easy to use, object-oriented API for accessing and retrieving information from movies and audio files. It has methods for returning frames from movie files as images that can be manipulated using PHP’s image functions. This works well for automatically creating thumbnail images from movie files, and it’s fast enough to extract thumbnails on the fly so that thumbnail images don’t need to be stored. “

I’m Mozilla based (Firefox, Thunderbird) on Win XP – any recommendations for a good calendar application that also lets you pop up reminders (of meetings and such)? Mozilla’s Sunbird is in v0.2 and recommended for testing only, so it doesn’t seem like a good bet right now.