One of the questions that’s been occupying my mind lately is this: if you have about 175,000 sentences made up of about 2.3 million words total (i.e. the titles and descriptions for all the pictures in the main tattoo gallery on BME), how do you quickly generate a list of the most popular phrases between 1 and 6 words long? As a programmer, these are the types of problems I greatly enjoy: on one hand it seems like a difficult thing to do because there’s so much data, but on the other hand it almost always has a fast and elegant solution… I need this function to do automatic keywording, so that, for example, once a subject becomes popular enough in the tattoo galleries, it automatically gets its own gallery.
It’s actually very easy to do this type of analysis… In my case I use a quick parsing function (written in assembly so it’s very efficient) to create a giant array of every phrase of one to six words in every entry (about ten million phrases in all), which is then sorted with a shell sort function, also written in assembly. Sorting puts identical phrases next to each other, so at that point it’s a simple matter of counting the repeats to get the answer to the question. The whole process takes seconds. Incidentally, in the tattoo gallery right now there are about 2,000 useful keywords, and about 12,000 that aren’t useful. I drove myself bonkers sorting them manually, but luckily it only has to be done once. Later today I think I’ll post some themed entries here from that data.
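For anyone who wants to try the idea without touching assembly, here is a minimal Python sketch of the same sort-and-count approach. It is not the code described above: the function names `extract_phrases` and `count_phrases` are made up for illustration, lowercasing and splitting on whitespace are assumptions about how the text gets tokenized, and Python’s built-in sort stands in for the shell sort.

```python
from itertools import groupby


def extract_phrases(sentences, max_len=6):
    """Return every run of 1..max_len consecutive words in each sentence."""
    phrases = []
    for sentence in sentences:
        # Assumption: lowercase and split on whitespace; no punctuation handling.
        words = sentence.lower().split()
        for start in range(len(words)):
            for length in range(1, max_len + 1):
                if start + length > len(words):
                    break  # phrase would run past the end of the sentence
                phrases.append(" ".join(words[start:start + length]))
    return phrases


def count_phrases(sentences, max_len=6):
    """Sort all phrases, then count runs of identical neighbours."""
    phrases = extract_phrases(sentences, max_len)
    phrases.sort()  # built-in sort here, not a shell sort
    # After sorting, equal phrases are adjacent, so groupby yields one group each.
    counts = [(phrase, sum(1 for _ in group)) for phrase, group in groupby(phrases)]
    # Most frequent first; ties broken alphabetically.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts


if __name__ == "__main__":
    sample = ["black rose tattoo", "rose tattoo on arm"]
    for phrase, count in count_phrases(sample)[:3]:
        print(count, phrase)
```

The part that matters is the middle of `count_phrases`: once the list is sorted, a single pass over it is enough to count the repeats, which is why the whole thing stays fast even with millions of phrases.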
The other question that’s been occupying my mind lately, as I read the ModBlog comment forums, is why y’all gotta be such haters? For that purpose, I offer up this sacrificial lamb.