Introducing a Semantic Web Server

Introducing triki, an open source semantic web server. Imagine taking your own social media graph from Facebook, your reviews from TripAdvisor, your blogs from Blogspot and your photos from Flickr and running your own site where you can still share with friends but also control your own content. Welcome to the triki.

Posted: Wed 13 Apr, 2016, 11:07

In my previous blog I proposed a model for how we publish, own and control our own web content. In order to move away from our current unhealthy dependency on the mega content sharing platforms to a more decentralised, secure and individual-owned system, I first wanted to suggest how it might work.

triki is an attempt to deliver a system based on the model described in Reclaiming the Web. Triki is best described as a semantic web server. It is more than this, however: it also attempts to pull in feeds of metadata from other sources, so a semantic web exchange may be a more accurate description. I use triki to run this site. It is an ongoing, working prototype that meets most of my requirements - and maybe yours!

Let me check through each of the requirements from the last blog and show how it attempts to address each one.

Content Metadata - Semantic Web

The ability to easily define metadata about our content is fundamental. Semantic web technologies are at the forefront of this effort. For many years the W3C has been building out specifications on this topic. Semantic web technologies were the next big thing not too long ago, and in the eyes of many people they never quite happened. This is certainly not true, however. Today, at my current workplace (as just one example), a system is being built (not by me, incidentally) that massively leverages a triple store in a novel and very powerful way. Semantic web technologies are alive and well and an integral part of our software landscape today.

Before I describe how triki defines metadata using semantic web approaches, let me just give a two-minute overview and jargon buster. Fundamentally, all URLs in a semantic web world are resources and those resources have properties (metadata). Those property values can be literals (strings, dates) or references to other resources. Thus a web of related URLs is created, also known as linked data. A resource is defined using a series of triples as follows:

                subject     predicate object
        e.g.    albumUrl    type      albumType
                albumUrl    created   "23 September 2017"
                albumUrl    title     "Summer Holiday Album 2017"
                albumUrl    contains  photoUrl1
                albumUrl    contains  photoUrl2

And so forth. Unlike a traditional table-based relational database, there does not need to be a rigid schema that holds the metadata, e.g. Album, Photo etc. This would be very restrictive. Instead triples are defined in a flexible and extensible way using any properties available - these properties are called predicates. Any type of resource can be defined, not just blogs, albums, photos etc. If a site is required to serve out (say) Albums, Songs, Reviews and Gigs, this is entirely possible.
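To make the idea concrete, here is a minimal Python sketch of the album example held as a plain set of triples. It is an illustration only, not how triki stores data (triki uses Apache Jena); the `Triple` type, the `objects` helper and the `gigUrl` resource are assumptions made up for this example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


# The album example from above, held as a plain set of triples.
TRIPLES = {
    Triple("albumUrl", "type", "albumType"),
    Triple("albumUrl", "created", "23 September 2017"),
    Triple("albumUrl", "title", "Summer Holiday Album 2017"),
    Triple("albumUrl", "contains", "photoUrl1"),
    Triple("albumUrl", "contains", "photoUrl2"),
}


def objects(triples, subject, predicate):
    """Return, sorted, every object recorded for a subject/predicate pair."""
    return sorted(
        t.obj for t in triples
        if t.subject == subject and t.predicate == predicate
    )


# A new kind of resource needs no schema change, only more triples.
# "gigUrl" and "gigType" are invented names for illustration.
TRIPLES.add(Triple("gigUrl", "type", "gigType"))
```

Calling `objects(TRIPLES, "albumUrl", "contains")` lists the two photo URLs. Adding the gig did not require altering any table definition, which is the flexibility described above.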

Semantic web is a massive topic. However, storing our web data in this form offers potential future benefits. For example, predicates are defined using common vocabularies (ontologies) which we can all share as much as possible. Because we are all sharing the same "language", this opens up many possibilities, e.g. deductive reasoning and inference. We are not even thinking about this right now, but it is good to be aware of.

Metadata is captured in a file. Triples support many different file formats but the one that triki uses is the Turtle format.

Quick

So how much effort is it to create metadata for a new blog? Here is an example.

resource:introducing-a-semantic-web-server a resource:blog ;
               dc:creator resource:donaldmcintosh ;
               dc:created "2016-04-13T11:07:00"^^xsd:dateTime ;
               dc:title "Introducing a semantic web Server" ;
               triki:include content:introducing-a-semantic-web-server.md ;
               triki:restricted triki:public .

It does not take long to create. The triples defined will be used to link the blog to other resources in your site (and beyond). In the earlier album example, a Photo Album is defined and many photos are then associated with it. These photos will all have their own metadata, shown below:

resource:DSC_0954_web.JPG
            a                      foaf:Image ;
            dc:created             "2016-02-06T13:50:00Z"^^xsd:dateTime ;
            dc:title               "DSC_0954.JPG" ;
            triki:content          "/image/DSC_0954_web.JPG" ;
            triki:restricted       triki:public ;
            triki:thumbimg         resource:DSC_0954_thumb.JPG ;
            exif:dateTimeOriginal  "Sat Feb 06 13:50:00 GMT 2016" ;
            exif:height            "4000" ;
            exif:width             "6000" ;
            time:month             resource:2016February .

Creating this metadata for every photo would be very laborious. Triki therefore has a photo importer that will scan a directory of photos, create thumbnails, generate the triples (from EXIF data) and write out a file in Turtle format. Adding a Photo is a trivial task and this approach could be extended to any file format e.g. MP4s.

One very useful feature comes in here. Whilst photos are imported, they are also linked automatically to auto-created resources for the year and month they were created (see above). This means that it is possible to explore a site by time. This is a powerful concept and is something we are familiar with. A site can be browsed by the default page structure offered on the home page, by tags or by time. All this is supported.

resource:2016February
            a                 time:Instant ;
            dc:description    "February 2016" ;
            dc:title          "February 2016" ;
            triki:restricted  resource:public ;
            time:month        "February" ;
            time:year         resource:2016 .
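As a rough sketch of how such a month resource could be derived from a photo's creation date, the Python below builds the resource name and its triples. This is not triki's importer; the function names are hypothetical and the naming pattern is inferred only from the `resource:2016February` example above.

```python
from datetime import datetime

# Spelled out explicitly so the result does not depend on the system locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def month_resource(created: datetime) -> str:
    """Name of the month resource a photo would link to (assumed pattern)."""
    return f"resource:{created.year}{MONTHS[created.month - 1]}"


def month_triples(created: datetime) -> list:
    """Triples describing the month resource, mirroring the example above."""
    name = MONTHS[created.month - 1]
    res = month_resource(created)
    label = f"{name} {created.year}"
    return [
        (res, "a", "time:Instant"),
        (res, "dc:description", label),
        (res, "dc:title", label),
        (res, "time:month", name),
        (res, "time:year", f"resource:{created.year}"),
    ]
```

Every photo taken in the same month resolves to the same resource name, so the month page gathers its photos without any manual linking.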

Users are free to pull in any vocabularies they wish. Turtle (and all formats) allows a vocabulary to be imported at the start of the file and then referenced later.

Queryable

Semantic web provides SPARQL for querying. Not SQL, but it achieves a similar end. SPARQL is a set-based query language that allows the entire set of triples to be reduced to those matching required criteria. Joins are simple, as are optional joins and virtually anything else required. It is a very mature, well thought through and supported query language. Triki uses the venerable Apache Jena extensively to supply both the triple store and SPARQL support.

Triki uses SPARQL to generate, say, the top 8 most recent blogs. This takes care of any lingering thoughts of adding a blog and then (so tedious...) adding it to the index page(s). If this was required for every resource, then obviously it would be impractical. However, when talking about a self-hosted site (not one provided by an expensive content management system or Wiki) it is these basic tasks that quickly become problematic.

SPARQL for this query is as shown below:

select ?obj ?created
            WHERE  { ?obj a resource:blog;
                      triki:restricted triki:public;
                      dc:created ?created }
            ORDER BY desc(?created)
            LIMIT 8
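For readers unfamiliar with SPARQL, the same filter, sort and limit steps can be sketched in plain Python over a list of dictionaries. This is only an analogy for what the query expresses, not how triki or Jena evaluate it; the dictionary keys and the `latest_public_blogs` name are assumptions for illustration.

```python
from datetime import datetime


def latest_public_blogs(resources, limit=8):
    """Return (id, created) for the newest public blogs, newest first."""
    # WHERE: keep only resources typed as blogs and marked public.
    matches = [
        r for r in resources
        if r["type"] == "resource:blog" and r["restricted"] == "triki:public"
    ]
    # ORDER BY desc(?created): parse the ISO timestamps and sort newest first.
    matches.sort(key=lambda r: datetime.fromisoformat(r["created"]), reverse=True)
    # LIMIT: keep at most `limit` rows, projecting ?obj and ?created.
    return [(r["id"], r["created"]) for r in matches[:limit]]
```

Because the index is computed from the metadata each time, a newly added blog appears in the list without any page being edited by hand.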

By being backed by a triple store, Triki also supports the other goal of providing a basis for people to query our metadata to pull out interesting content. To date, I have not had a chance to put in the SPARQL endpoint but everything is in place to support it. It may be better, though, to support publication of an RSS feed first and then move on to dynamic queries as required. The point is, however, that the groundwork has been done, and it will be a straightforward task now.

Agents

Triki currently supports periodic querying of two types of metadata - Atom and RSS. It has a built-in Quartz2 scheduler that triggers some Apache Camel routes that will pull back the latest feed, parse out the news items and then temporarily add details to the in-memory triplestore. At this point, the triplestore contains "static" details pulled in from our triples file and metadata about interesting external resources. When the scheduler runs again it just removes the existing links and adds in the updates. For this site (as an example) the News stories update every 10 minutes and the Feed stories update a couple of times per day.
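The remove-then-add refresh described above can be sketched as follows. This is a simplified Python illustration, not triki's Camel routes; it assumes each statement is tagged with the source it came from, and the `refresh_feed` name and the "static" tag are made up for the example.

```python
def refresh_feed(store, feed_id, items):
    """Replace everything previously loaded from one feed with its latest items.

    `store` is a set of (subject, predicate, object, source) tuples; the
    fourth element is an assumed tag recording where the statement came from.
    """
    # Remove the links added by the previous run of this feed only.
    store.difference_update({q for q in store if q[3] == feed_id})
    # Add the freshly parsed items; static content is never touched.
    for url, title in items:
        store.add((url, "dc:title", title, feed_id))
```

Tagging statements by source keeps the "static" triples from the Turtle file separate from the temporary feed entries, so a refresh cannot remove site content by accident.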

Crucially the configuration that drives the agent is all held in your triple store. Thus the store becomes a single source of information that you wish to share and also a reference to external metadata sources that interest you. It was a design goal to retain everything in the triple store.

Authentication (and Authorisation)

Currently available authentication mechanisms are OAuth and Client SSL certificates. OAuth is favourable in many ways but it builds a dependency on the very mega-sharing platforms we are trying to extricate ourselves from. Client SSL certificates are an excellent solution - from a technical perspective - but they are not exactly user friendly. That may change before too long.

For this reason, triki currently uses something tactical - username & password. My hope is that there is a nice solution out there that is not overly technical (i.e. a degree in cryptography is not required to use it) but provides something similar to Client SSL. There is an interesting chat unfolding on lxer.com that may also offer a serious solution with GPG keys. Also, telegram.org manages something with mobile phone numbers, so this is another potential solution.

This is a work in progress and is a prime example of the gaps mentioned in my first blog in this series. We have to start somewhere though.

While we are on the subject, this only covers validation of a person/agent. Once validated, we need to be able to control what the user can see i.e. authorisation. Triki uses a fairly standard model of providing groups. Users are associated with one or more groups (friends, family, cycleclub etc.) and documents are associated with the same groups. Thus it is simple to restrict access to metadata at a very granular level. Unless specified, the default is private and nothing is visible. And where is all this stored? Yes, in the triple store.
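The group model can be sketched in a few lines of Python. This is an assumed simplification for illustration, not triki's actual check; in particular, treating `public` as a special group visible to everyone is inferred from the `triki:restricted triki:public` triples shown earlier, and `can_view` is a hypothetical name.

```python
PUBLIC = "public"  # assumed marker for content anyone may see


def can_view(user_groups, resource_groups):
    """Decide whether a user may see a resource.

    A resource with no groups is private (the default). Otherwise access is
    granted if the resource is public or shares a group with the user.
    """
    if not resource_groups:
        return False
    if PUBLIC in resource_groups:
        return True
    return bool(set(user_groups) & set(resource_groups))
```

For example, `can_view({"family"}, {"family", "cycleclub"})` grants access, while a resource with no groups at all stays hidden from everyone.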

A solution

triki provides a framework to leverage existing software and protocols to publish our own content from a self-hosted site. Crucially, the generation of our own personal metadata weaves together the content with people and controls sharing. Content is quick to add and triki users are not investing in triki - they are investing in the semantic web & other open formats - and therefore could transfer their data easily. All using open formats, software and interfaces.

Before I digress onto a couple of other topics I want to close out my mini blog series about our content on the web. The end of the story is that there are solutions; we just need to choose our direction of travel, to envisage the Web we want and work towards it. Everyone who writes open source software has been doing that for many years. The goal of triki is to fill a very small gap that, to me, existed between the existing software.

That is all triki is. Thanks for reading.

Appendix 1 - Never write HTML

So after much talk about sharing content there is the slight issue of creating the HTML around the content. Hand crafting HTML quickly becomes a rather tedious task. But we need to capture our review or blog or story or whatever in a form that is browser readable. So to solve this triki leverages the same technology that Wikis and many other content servers use - markdown. Markdown allows for all the basic HTML constructs like paragraphs, headings, bold, tables, embedded images and lists - just by adding a few simple non-HTML embellishments to plain text. It is quick, powerful and simple.

Within triki, a directory location is provided where all your .md files will reside. These are edited in an editor of your choice and triki reads them and converts to HTML before sending back to the browser.

Appendix 2 - Rendering

Lastly, we need a renderer. We need something that can take our requested resource, get the metadata, get the markdown and render it. Triki uses a templating engine and for this task there are literally hundreds of libraries to choose from. Having used a few in the past (JSP, JSF and FreeMarker) I had some very specific requirements that I knew I wanted.

I wanted a template engine that was powerful but still restricted. It is very easy for code and logic to start appearing in what are essentially templates and this causes many problems. Plus I do not want triki users to feel they are coding when writing a template. All the same, I wanted some code-like features. I certainly wanted no XML. And lastly I wanted templates containing HTML, not HTML containing templates.

So I ended up using StringTemplate. It is awesome. Just the basic features are brilliant but it also has a ModelAdaptor feature that allows templates to recursively descend into a key/value structure. Sound familiar? Our metadata is one enormous key/value structure so StringTemplate fits the bill.
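The idea of a template descending into a key/value structure can be illustrated with a tiny Python sketch. This is not StringTemplate and does not use its API; the `$name.path$` placeholder syntax, the `resolve` and `render` functions and the sample model are all assumptions chosen to show the principle.

```python
import re


def resolve(model, path):
    """Walk a dotted path such as 'blog.creator.name' through nested dicts."""
    value = model
    for key in path.split("."):
        value = value[key]
    return value


def render(template, model):
    """Replace each $dotted.path$ placeholder with the value it resolves to."""
    return re.sub(
        r"\$([\w.]+)\$",
        lambda match: str(resolve(model, match.group(1))),
        template,
    )


# Hypothetical metadata for one resource, nested the way linked data nests.
MODEL = {
    "blog": {
        "title": "Introducing a semantic web Server",
        "creator": {"name": "Example Author"},
    }
}
```

With this model, `render("<h1>$blog.title$</h1>", MODEL)` fills in the blog title. The template stays plain HTML with placeholders and contains no logic of its own, which matches the restriction wanted above.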