Semantic UMW

October, 2008:

10,000 Posts!

The mission of semantifying the University of Mary Washington is coming along nicely — the store of information now knows about more than 10,000 posts from over 1,500 different blogs!

There’s also a lot more information that isn’t yet being exposed in the various exhibits and galleries. It’s all a big pile of facts about the blogs and bloggers, just waiting to be put together in a new and interesting way. How big is the pile of facts? First, let me get at what I mean by a ‘fact’.

Here, a fact is what you have whenever you put a relation between two different things. So “This post was created on October 30” is a fact — it relates the post to the date it was created. “This post was created by Patrick” is another fact, relating this post and the person who created it.

So when I say “a big pile of facts”, I mean that the pile contains approximately 140,000 distinct facts of that sort, and it is growing by between 600 and 900 new ones each week!
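To make the idea of a ‘fact’ concrete, here is a minimal Python sketch. It is only an illustration, not the code behind this project: the identifiers and relation names (`post:10000-posts`, `created_on`, `created_by`) are made up for the example.

```python
from collections import namedtuple

# A fact relates a subject to an object through a named relation.
Fact = namedtuple("Fact", ["subject", "relation", "obj"])

# Hypothetical identifiers and relation names, for illustration only.
facts = {
    Fact("post:10000-posts", "created_on", "2008-10-30"),
    Fact("post:10000-posts", "created_by", "Patrick"),
}


def facts_about(subject, store):
    """Return every fact whose subject matches, sorted for stable output."""
    return sorted(f for f in store if f.subject == subject)


for fact in facts_about("post:10000-posts", facts):
    print(fact.subject, fact.relation, fact.obj)
```

Keeping the facts in a set means that recording the same fact twice does not inflate the count, which is what “distinct facts” suggests.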

Top 10 Lists

Always looking for new and interesting ways to see what’s going on in the world of UMW blogging, I’ve made a few quick Top Ten lists:

People with the most audio in their blogs.
People with the most images in their blogs.
People with the most video in their blogs.
People with the most links in their blogs.
People with the most distinct tags/categories in their blogs.

I haven’t done a thing to make them visually pretty. It’s just the data, so if you think data is pretty, you’ll like it. If you need the visual appeal, well, that’s coming along soon.

A Timeline of Posts

I just put up a new prototype of an Exhibit/Timeline: Two Week Timeline. This gets a list of all the posts from the past two weeks and puts them into a timeline. I was surprised to find that this one interests me a little more than the others, because the timeline representation might end up revealing some patterns in the blogging activity. Even allowing for outliers and anomalies like the recent fall break and the pile-up from LabLogs (not sure, but there might be a bug here I need to chase!), it looks like prominent blogging times show up.

This might start us toward some quantitative study of blogging behavior here at UMW.
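One simple way to start on that kind of quantitative look is to count posts by hour of day. The sketch below assumes the post dates are available as ISO-style timestamp strings; the timestamps themselves are invented for the example and are not real data from the timeline.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps standing in for scraped post dates.
post_times = [
    "2008-10-27T21:15:00",
    "2008-10-27T21:40:00",
    "2008-10-28T09:05:00",
    "2008-10-29T21:55:00",
]


def posts_per_hour(timestamps):
    """Count posts by hour of day (0-23)."""
    return Counter(datetime.fromisoformat(t).hour for t in timestamps)


counts = posts_per_hour(post_times)
busiest_hour, n = counts.most_common(1)[0]
print(f"Busiest hour: {busiest_hour}:00 with {n} posts")
```

The same counting could be done by weekday instead of hour, which might make something like a fall break stand out as a gap.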

In related news, the Exhibit for individual blogs now also contains a complete timeline. Use one of the searches to look up your blog and follow the link to its Exhibit, then click the “Timeline” tab at the top of the page.

Back up! (mostly)

I’ll have to consider the database meltdown as something of a blessing in disguise. First, in my digging around to find out what happened I discovered some previously unknown bugs. With those now fixed, I’ll be scraping in data much more reliably.

Second, in working to reconstruct as much data as I could, I built some new code to back-fill a lot more data, so now we’ve got a lot more info than ever before!

Third, because we’ve got so much more data, I’m finding that I need to learn and discover more efficient techniques for searching through the database, which is bringing much improved performance as I rework the older code with the new techniques.

Last, Danny Ayers (wikipedia) at Talis (wikipedia), whose work and writing I’ve always much admired and which have guided this project in many ways, popped right in to nudge me to sign up for an account with Talis. I’m on my way! (And notice, that’s another virtue of linking!)

Brief setback

So last night the database completely crashed. 🙁 I think I’ll be able to reconstruct at least the vast majority of information so that this experiment will be back up and running soon. But it looks like I’ll be bumping up the timetable for moving the code to a more reliable server.

I’ll also be exploring the Talis Platform to see if that will be a good solution.

Links: They’re Not Just For Breakfast and Google Anymore!

I mentioned in my previous post (now also here as a page) that working with links in posts is a big interest of mine. I’d like to give a quick update on the Link Friends Exhibit and expand on why links are so important and useful.

I’ve tweaked the pager for the link friends so that URLs with the most posts linking to them show up first. Unfortunately, the home page of UMWBlogs is excluded from the list because the “Hello World” post of new blogs links there. That makes the size of the result set simply too large to deal with, at least right now. Thanks to Eighteenth Century Audio, Librivox is at the top of the list with 70 — er, now 71 — posts that link to it.

In the Exhibits for individual blogs and posts, the list of links will now also direct you to an Exhibit of all the posts, blogs, and bloggers that share that link. Visit this blog’s Exhibit and click on something in the ‘All Links’ column to see it in action.

The majority of the posts that link to the same place are pairs created by auto-aggregation. Many course blogs are aggregating posts from the various members of the course, and so the same content — and links — appears in two different places. That makes a great number of pairs that link to the same place, which turns out to be a bit misleading since it’s really two instances of the same text, just from different contexts.
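Here is a rough sketch of how posts could be grouped by shared link while collapsing those aggregated pairs. It is one possible approach, not what the Exhibit currently does: it assumes the post text is available and treats two posts with identical text as the same content. The post ids and texts are hypothetical.

```python
from collections import defaultdict

# Hypothetical scraped posts: (post id, body text, links found in the body).
posts = [
    ("blogA/1", "Listening to Librivox recordings", ["http://librivox.org/"]),
    # An aggregated copy of the post above, as it appears on a course blog.
    ("course/7", "Listening to Librivox recordings", ["http://librivox.org/"]),
    ("blogB/3", "Free audiobooks for class", ["http://librivox.org/"]),
    ("blogC/2", "Trying out Zotero", ["http://www.zotero.org/"]),
]


def link_friends(posts):
    """Map each URL to the posts linking to it, one post per distinct text."""
    by_url = defaultdict(dict)
    for post_id, text, links in posts:
        for url in links:
            # setdefault keeps the first post seen with this text,
            # so later aggregated copies are dropped.
            by_url[url].setdefault(text, post_id)
    return {url: sorted(seen.values()) for url, seen in by_url.items()}


friends = link_friends(posts)
# Most-linked URLs first, like the pager described above.
for url, ids in sorted(friends.items(), key=lambda item: -len(item[1])):
    print(url, len(ids), ids)
```

Matching on exact text is crude; an aggregator that trims or reformats the post would slip through, so this is only a starting point.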

I’d like someday for things to get more interesting, with the Exhibit revealing completely different posts that happen to link to the same place. The technical mechanism will do that; the rest comes down to encouraging people to get into the habit of taking the time to include links to relevant sites when they can.

What’s a relevant site? Blog home pages and particular posts are good candidates when your post is responding to someone else. Admittedly, this seems like a bit of overkill, since trackbacks are meant to handle similar cross-referencing. Alas, because I’m scraping all the data from feeds, the trackbacks don’t show up.

Another good candidate is any site that is being discussed in class, serves as a reference, or is a useful tool for the class. Jeff McClurken’s post noting delicious (wikipedia link) as a useful tool is a good example of this. Mentioning (wikipedia) in your post? Make it a link. Using Zotero? (wikipedia) Make it a link. Omeka? Make it a link. Etc., etc., etc.

That’s more than good practice for the readers of your blog, making it easy for them to check out a site that they might not yet be familiar with. It’s much, much more. Many people are familiar with how Google (wikipedia) uses links. Through a mysterious algorithm that only they, and possibly Gandalf, know, search results are ranked by the links pointing to each site. This is the “Google-juice” that useless sites chase to get more traffic: they create a bunch of links to their site, hoping that’ll boost it to the top of the Google results page. It’s also why Wikipedia articles show up so frequently in Google searches: lots of people have linked to a relevant article.

But what I’m talking about is more, and more useful in some ways, than that. I’m talking about exactly the reverse of what Google does. Google uses the link to get information about the target of the link. I’m using the link to get information about the source of the link: your post. That’s a huge difference. That way, your link to stuff relevant to your post becomes data about your post. (That’s part of the idea of a Web of Data along with a Web of Documents that I mention in the About page.)

What difference does it make? Teaching and learning is all about discovering unexpected, maybe even serendipitous, connections. Two completely different people, studying completely different things, might very well be writing about the same site, tool, or topic. Including a link makes it easier to discover the unexpected commonalities across very different contexts.

But wait, as they say, there’s more. One especially useful variety of link is a link to a relevant Wikipedia article. My previous post mentioned that linking to a Wikipedia article serves as a tool for disambiguation, distinguishing between Paris France, Paris Texas, Paris Hilton, and Paris the Trojan hero. There’s more to it. A LOT more.

Through the extraordinary service DBpedia (wikipedia article here), I will eventually be able to offer guides to finding similar posts even if they do not link to precisely the same Wikipedia article. DBpedia has been doing basically the same thing with Wikipedia that I’m doing with online content from UMW. Indeed, they are very much the inspiration for this project. They’re scraping data out of Wikipedia pages and making it available on the Semantic Web as Linked Data.

As I always used to encourage my students to ask, “So what?” Almost all Wikipedia articles are in several different categories. DBpedia easily exposes those categories, which means it will be possible to find posts that link to Wikipedia articles in the same categories. DBpedia also plays nicely with (wikipedia link), a huge body of geographical data. That will make it possible to find posts and/or blogs discussing things in the same geographical region. So if one post links to the Wikipedia article for Paris France, another links to the article for France, and another links to the article for the Eiffel Tower, it should be possible to pull all those together into a list of posts about stuff in France.
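The France example could look something like the sketch below. Everything in it is a stand-in: the category names are invented rather than real Wikipedia categories, and the hard-coded table takes the place of whatever would actually be looked up from DBpedia.

```python
# Hypothetical article -> categories table, standing in for DBpedia data.
article_categories = {
    "Paris": {"Capitals in Europe", "Places in France"},
    "Eiffel Tower": {"Towers", "Places in France"},
    "France": {"Countries in Europe", "Places in France"},
    "Paris, Texas": {"Cities in Texas"},
}

# Hypothetical posts and the Wikipedia article each one links to.
post_links = {
    "post/1": "Paris",
    "post/2": "Eiffel Tower",
    "post/3": "France",
    "post/4": "Paris, Texas",
}


def posts_in_category(category):
    """Return posts whose linked article belongs to the given category."""
    return sorted(
        post
        for post, article in post_links.items()
        if category in article_categories.get(article, set())
    )


print(posts_in_category("Places in France"))
```

None of the three French posts link to the same article, yet they end up in one list because the articles share a category. That is the payoff of linking to Wikipedia rather than relying on matching URLs alone.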

Did I mention that there’s more? DBpedia plays nicely with many other data sets. There’s YAGO, which offers standardized terms and relationships between them. There’s also a newer initiative from Zitgist called UMBEL. These projects are aiming toward subtle and precise categorization of material, which will make it easier to discover people with shared interests and thoughts.

We’ve moved into the future directions and possibilities for these technologies, and there’s still plenty of work to do to stitch it all together. That’s fine and as it should be. But the important thing we can all do now is to get into the habit of linking heavily. It’s a simple, easy-to-do technique that will pay off bigger and bigger dividends as time and technology progress.

Questions I’d Like To Help Answer / Scenarios

I said in my first post in this blog that I’m working on solutions to problems that most people don’t yet realize are problems. And I’ve hit on some possibilities sporadically as I’ve talked about new developments. It might be time to bring some of that thinking together into a few scenarios for which I think these approaches will be useful. Here are a few of the things I have in mind.

“Just go to my blog at….”

Having an online presence that corresponds with a real-life presence is interesting. There are plenty of anecdotal stories of people making connections in and out of the online environment. We need to foster mutual interaction between the two.

I want people to be able to talk about their blogs and say “Just go to…” without having to find a cocktail napkin or matchbook to write it on. We should be able to find things later with only a hazy memory of how to get there, not the full URL. That’s what the “Contains” and “Starts With” searches aim at. You have just a few bits of where you want to get to, and those searches help you get to it from limited information.

So you know a blog is at something-or-other-“marching”. Or the blog has “marching” somewhere in the title. Or the post has “marching” in the title. Or the person you are talking to has a display name with “marching” in it. Type “marching” into those searches and check out what comes up from there.
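In spirit, the two searches amount to case-insensitive substring and prefix matching. This sketch shows the idea over a handful of made-up records; the real searches run against the database, and the record layout here is purely an assumption for illustration.

```python
# Hypothetical records: blog addresses, post titles, and display names.
records = [
    {"kind": "blog", "value": "marchingband.umwblogs.org"},
    {"kind": "post title", "value": "Marching season begins"},
    {"kind": "display name", "value": "The Marching Eagle"},
    {"kind": "blog", "value": "history101.umwblogs.org"},
]


def contains(term, records):
    """Records whose value has the term anywhere in it, ignoring case."""
    term = term.lower()
    return [r for r in records if term in r["value"].lower()]


def starts_with(term, records):
    """Records whose value begins with the term, ignoring case."""
    term = term.lower()
    return [r for r in records if r["value"].lower().startswith(term)]


for match in contains("marching", records):
    print(match["kind"], "->", match["value"])
```

“Contains” is the forgiving one for hazy memories: it finds “The Marching Eagle”, which “Starts With” would miss.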

“Who’s got good pics?”

Start with the Image Gallery. It’ll show the images that’ve been included in online content from UMW (at least what I’ve been able to incorporate so far). Each image will guide you to the post and/or the blog that it comes from.

“What’s the history of this blog?”

A lot of wonderful material gets buried in the reverse-chronological structure of blogs. Once something slips into the ‘archives’ (which only means that it’s not recent), it’s really hard to get back to it. But if you know the blog you want, do a “Contains” or “Starts With” search to dig up the blog, then go to the Exhibit for that blog and you’ll get a better overview. (Improvements on that mechanism are coming soon.)

“What interests do I have in common with others?”

This is a big-big-biggie for me. I have a suspicion that links are more reliable indicators of interests than tags, and so the “Link Friends” Exhibit is working toward helping people find others who have linked to the same places.

In my happy world, people will get into the habit of including lots of links, especially to Wikipedia. That’ll semi-tacitly provide information about you and your post, and make it easier to use these techniques to find possible common interests. For example, say you are writing a response to something in “Frankenstein, or the Modern Prometheus” for your chemistry class (yes, thanks to Leanna Giancarlo, that could happen!). And someone else is writing about the same work for a literature class. If you both include a link to the Wikipedia article on it, that’ll provide a great way to identify common interests and topics.

“Why not use tags?” you ask. Follow me to a more nuanced example. Instead of ‘Frankenstein’, you are writing about ‘Paris’. “‘Paris’, France?” No, the other ‘Paris’. “‘Paris’, Texas?” No no, the other ‘Paris’. “‘Paris’ Hilton?” No no no, the other ‘Paris’. “‘Paris’ the Trojan prince?” Yes! That’s the one!

The tag ‘Paris’ does nothing to disambiguate those possibilities, even though it is likely to be natural enough within the context of that blog. But connecting with others requires bridging contexts. In addition to the tag, a link to the relevant Wikipedia article will help to disambiguate and classify the blog post.

“What has so-and-so person written?”

Browse through the “Bloggers” Exhibit to find the display name you are looking for. That’ll guide you toward lots more info about what they’ve written and where.

These questions seem like they come naturally, but there isn’t yet a handy mechanism for answering them. Providing that mechanism is at the core of what this blog and project are all about.

Got other questions you think would be useful? Let me know!