\documentclass[conference,letterpaper]{IEEEtran}
\usepackage{graphicx}
%\usepackage{multirow}
%\usepackage{ragged2e}
\usepackage{algpseudocode}
\usepackage[cmex10]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{fancyvrb}
\usepackage{pstricks,pst-node}
\usepackage[pdftitle={On-demand RDF to Relational Query Translation in Samizdat RDF Store},
pdfauthor={Dmitry Borodaenko},
pdfkeywords={Semantic Web, RDF, relational databases, query language, Samizdat},
pdfborder={0 0 0}]{hyperref}
%\urlstyle{rm}
\emergencystretch=8pt
\interdisplaylinepenalty=2500
%
\begin{document}
%
\title{On-demand RDF to Relational Query Translation in Samizdat RDF Store}
%
\author{\IEEEauthorblockN{Dmitry Borodaenko}
\IEEEauthorblockA{Belarusian State University of Informatics and Radioelectronics\\
6 Brovki st., Minsk, Belarus\\
Email: angdraug@debian.org}}

\maketitle

\begin{abstract}
This paper presents an algorithm for on-demand translation of RDF queries
that allows any relational data structure to be mapped to the RDF model,
and queries to be performed over a combination of mapped relational data
and arbitrary RDF triples with performance comparable to that of
relational systems. Query capabilities implemented by the algorithm
include optional and negative graph patterns, nested sub-patterns, and
limited RDFS and OWL inference backed by database triggers.
\end{abstract}

\section{Introduction}
\label{introduction}

% motivation for the proposed solution
A wide range of solutions that map relational data to the RDF data model
has accumulated to date~\cite{triplify}. Several factors make the
integration of RDF and relational data important for the adoption of the
Semantic Web. The first, shared with RDF stores based on a triples
table, is the wide availability of mature relational database
implementations which have seen decades of improvements in reliability,
scalability, and performance. The second is the fact that most of the
structured data available online is backed by relational databases. This
data is not likely to be replaced by pure RDF stores in the near future,
so it has to be mapped in one way or another to become available to RDF
agents. Finally, a properly normalized and indexed application-specific
relational database schema allows a DBMS to optimize complex queries in
ways that are not possible for a tree of joins over a single triples
table~\cite{sp2b}.

% what is unique about the proposed solution
In the Samizdat open publishing engine, most of the data fits into the
relational model, with the exception of reified RDF statements which are
used in the collaborative decision making
process~\cite{samizdat-collreif} and require a more generic triple
store. The need for a generic RDF store with performance on par with a
relational database is the primary motivation behind the design of the
Samizdat RDF storage module, which differs from both triples table based
RDF stores and relational to RDF mapping systems. Unlike the former,
Samizdat can run optimized SQL queries over application-specific tables;
unlike the latter, it is not limited by the relational database schema
and can fall back, within the same query, to a triples table for RDF
predicates that are not mapped to the relational model.
% structure of the paper
The following sections of this paper describe: the targeted relational
data, the database triggers required for RDFS and OWL inference, the
query translation algorithm, the update request execution algorithm,
details of the algorithm's implementation in Samizdat, analysis of its
performance, comparison with related work, and an outline of future
work.

\section{Relational Data}
\label{relational-data}

% formal definition of data targeted for storage
The Samizdat RDF storage module does not impose restrictions on the
underlying relational database schema beyond the requirements of the SQL
standard. Any legacy database may be adapted for RDF access while
retaining backwards compatibility with existing SQL queries. The
adaptation process involves adding attributes, foreign keys, tables, and
triggers to the database to enable RDF query translation and to support
optional features of the Samizdat RDF store, such as statement
reification and inference for {\em rdfs:sub\-Class\-Of}, {\em
rdfs:sub\-Property\-Of}, and {\em owl:Transitive\-Property\/} rules.

The following database schema changes are required in all cases:
\begin{itemize}
\item create an {\em rdfs:Resource\/} superclass table with an
autogenerated primary key;
\item replace primary keys of mapped subclass tables with foreign keys
referencing the {\em rdfs:Resource\/} table (existing foreign keys may
need to be updated to reflect this change);
\item register {\em rdfs:subClassOf\/} inference database triggers to
update the Resource table and maintain foreign key integrity on all
changes in mapped subclass tables.
\end{itemize}

The following changes may be necessary to support optional RDF mapping
features:
\begin{itemize}
\item register database triggers for other cases of {\em
rdfs:sub\-Class\-Of\/} entailment;
\item create a triples table (required to represent non-relational RDF
data and RDF statement reification);
\item add subproperty qualifier attributes referencing the property
URIref entry in the {\em rdfs:Resource\/} table for each attribute
mapped to a superproperty;
\item create transitive closure tables, register {\em
owl:TransitivePro\-perty\/} inference triggers.
\end{itemize}

\section{Inference and Database Triggers}
\label{inference-triggers}

The Samizdat RDF storage module implements entailment rules for the
following RDFS predicates and OWL classes: {\em rdfs:sub\-Class\-Of},
{\em rdfs:sub\-Property\-Of}, {\em owl:Transitive\-Property}. Database
triggers are used to minimize the impact of RDFS and OWL inference on
query performance.

{\em rdfs:subClassOf\/} inference triggers are invoked on every insert
into and delete from a subclass table. When a tuple without a primary
key is inserted,\footnote{Insertion into a subclass table with an
explicit primary key is used in two-step resource insertion during
execution of an RDF update command (described in
section~\ref{update-execution}).} a template tuple is inserted into the
superclass table and the generated primary key is added to the new
subclass tuple. Delete operations are cascaded to all subclass and
superclass tables.

{\em rdfs:subPropertyOf\/} inference is performed during query
translation, with the help of a stored procedure that returns the
attribute value when the subproperty qualifier attribute is set, and
NULL otherwise.

{\em owl:TransitiveProperty\/} inference uses a separate transitive
closure table for each relational attribute mapped to a transitive
property. Transitive closure tables are maintained by triggers invoked
on each insert, update, and delete operation involving such an
attribute.
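For illustration, the closure maintenance performed by these triggers
can be sketched in Ruby (the implementation language of the module).
This is an illustrative rendition of the update algorithm formalized in
\figurename~\ref{transitive-closure}; it assumes at most one $\tau$-arc
per subject, as is the case for a single-valued relational attribute,
whereas the production version is implemented in procedural SQL inside
the triggers themselves:

\begin{Verbatim}[fontsize=\scriptsize]
require 'set'

# closure is a Set of [s, o] pairs representing the transitive
# closure of G-tau; op is :insert, :update, or :delete; s and
# o_new are the subject and the new object of the affected arc.
def update_closure(closure, op, s, o_new)
  # refuse to create a cycle in G-tau
  return if o_new == s or closure.include?([o_new, s])
  if op != :insert  # remove arcs obsoleted by the old arc of s
    sources = [s] + closure.select { |x, y| y == s }.map(&:first)
    targets = closure.select { |x, y| x == s }.map(&:last)
    sources.product(targets).each { |arc| closure.delete(arc) }
  end
  if op != :delete  # add arcs implied by the new arc [s, o_new]
    sources = [s] + closure.select { |x, y| y == s }.map(&:first)
    targets = [o_new] + closure.select { |x, y| x == o_new }.map(&:last)
    sources.product(targets).each { |arc| closure.add(arc) }
  end
end

c = Set.new
update_closure(c, :insert, :a, :b)  # c: {[:a,:b]}
update_closure(c, :insert, :b, :c)  # c: {[:a,:b], [:b,:c], [:a,:c]}
\end{Verbatim}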
Formally, the transitive closure update algorithm is presented in
\figurename~\ref{transitive-closure}. The input to the algorithm is:
\begin{itemize}
\item a directed labeled graph $G = \langle N, A \rangle$ where $N$ is a
set of nodes representing RDF resources and $A$ is a set of arcs
$a = \langle s, p, o \rangle$ representing RDF triples;
\item a transitive property $\tau$;
\item a subgraph $G_\tau \subseteq G$ such that:
\begin{equation}
a_\tau = \langle s, p, o \rangle \in G_\tau \iff a_\tau \in G \, \wedge \, p = \tau \, ;
\end{equation}
\item a graph $G_\tau^+$ containing the transitive closure of $G_\tau$;
\item an update operation $\omega \in \{insert, update, delete\}$ and
its parameters $a_{old} = \langle s_\omega, \tau, o_{old} \rangle$,
$a_{new} = \langle s_\omega, \tau, o_{new} \rangle$ such that:
\begin{equation}
G_\tau' = (G_\tau \setminus \{ a_{old} \}) \cup \{ a_{new} \} \, .
\end{equation}
\end{itemize}
The algorithm transforms $G_\tau^+$ into the transitive closure of
$G_\tau'$. The algorithm assumes that $G_\tau$ is, and should remain,
acyclic.

\begin{figure}
\begin{algorithmic}[1]
\If {$o_{new} = s_\omega$ or $\langle o_{new}, \tau, s_\omega \rangle \in G_\tau^+$}
\State stop \Comment refuse to create a cycle in $G_\tau$
\EndIf
\State $G_\tau \gets G_\tau'$ \Comment apply $\omega$
\If {$\omega \in \{update, delete\}$}
\State $G_\tau^+ \gets G_\tau^+ \setminus \{ \langle s, \tau, o \rangle \mid (s = s_\omega \, \vee \, \langle s, \tau, s_\omega \rangle \in G_\tau^+) \, \wedge \, \langle s_\omega, \tau, o \rangle \in G_\tau^+ \}$
\Comment remove obsolete arcs from $G_\tau^+$
\EndIf
\If {$\omega \in \{insert, update\}$} \Comment add new arcs to $G_\tau^+$
\State $G_\tau^+ \gets G_\tau^+ \cup \{ \langle s_\omega, \tau, o \rangle \mid o = o_{new} \, \vee \, \langle o_{new}, \tau, o \rangle \in G_\tau^+ \}$
\State $G_\tau^+ \gets G_\tau^+ \cup \{ \langle s, \tau, o \rangle \mid \langle s, \tau, s_\omega \rangle \in G_\tau^+ \, \wedge \, \langle s_\omega, \tau, o \rangle \in G_\tau^+ \}$
\EndIf
\end{algorithmic}
\caption{Update transitive closure}
\label{transitive-closure}
\end{figure}

\section{Query Pattern Translation}
\label{query-translation}

The class structure of the Samizdat RDF storage module is as follows.
The external API is provided by the {\tt RDF} class. The RDF storage
configuration described in section~\ref{relational-data} is encapsulated
in the {\tt RDFConfig} class. The concrete syntax of
Squish~\cite{samizdat-rel-rdf,squish} and SQL is abstracted into
{\tt SquishQuery} and its subclasses. The query pattern translation
algorithm is implemented by the {\tt SqlMapper} class.
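A schematic Ruby outline of this class structure (only the names
mentioned in this paper are shown; method bodies, further {\tt
SquishQuery} subclasses, and auxiliary classes are elided):

\begin{Verbatim}[fontsize=\scriptsize]
class RDFConfig; end      # relational mapping configuration

class RDF                 # external API of the storage module
  def initialize(config)  # config is an RDFConfig
    @config = config
  end
end

class SquishQuery; end    # Squish and SQL concrete syntax
class SquishSelect < SquishQuery; end

class SqlMapper           # query pattern translation
  private
  def label_pattern_components; end
  def map_predicates; end
  def define_relation_aliases; end
  def transform; end
  def generate_tables_and_conditions; end
end
\end{Verbatim}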
% prerequisites
The input to the algorithm is as follows:
\begin{itemize}
\item mappings $M = \langle M_{rel}, M_{attr}, M_{sub}, M_{trans} \rangle$
where $M_{rel}: P \to R$, $M_{attr}: P \to \Phi$, $M_{sub}: P \to S$,
$M_{trans}: P \to T$; $P$ is a set of mapped RDF properties, $R$ is a
set of relations, $\Phi$ is a set of relation attributes, $S \subset P$
is a subset of RDF properties that have configured subproperties,
$T \subset R$ is a set of transitive closures (as described in
sections~\ref{relational-data} and \ref{inference-triggers});
\item a graph pattern $\Psi = \langle \Psi_{nodes}, \Psi_{arcs} \rangle
= \Pi \cup N \cup \Omega$, where $\Pi$, $N$, and $\Omega$ are the main
(``must bind''), negative (``must not bind''), and optional (``may
bind'') graph patterns respectively, such that $\Pi$, $N$, and $\Omega$
share no arcs, and $\Pi$, $\Pi \cup N$, and $\Pi \cup \Omega$ are
connected graphs;\footnote{Arcs with the same subject, object, and
predicate but different bind mode are treated as distinct.}
\item a global filter condition $F_g \in \mathbb{F}$ and local filter
conditions $F_c: \Psi_{arcs} \to \mathbb{F}$, where $\mathbb{F}$ is the
set of all literal conditions expressible in the query language syntax.
\end{itemize}

For example, consider the following Squish query and its graph pattern
$\Psi$ presented in \figurename~\ref{graph-pattern}.

\begin{Verbatim}[fontsize=\scriptsize]
SELECT ?msg
WHERE (rdf::predicate ?stmt dc::relation)
      (rdf::subject ?stmt ?msg)
      (rdf::object ?stmt ?tag)
      (dc::date ?stmt ?date)
      (s::rating ?stmt ?rating FILTER ?rating >= :threshold)
EXCEPT (dct::isPartOf ?msg ?parent)
OPTIONAL (dc::language ?msg ?original_lang)
         (s::isTranslationOf ?msg ?translation)
         (dc::language ?translation ?translation_lang)
LITERAL ?original_lang = :lang OR ?translation_lang = :lang
GROUP BY ?msg
ORDER BY max(?date) DESC
\end{Verbatim}

\begin{figure}
\centering
\psset{unit=3.8mm,labelsep=0.2pt}
\begin{pspicture}[showgrid=false](0,0)(23,12)
\footnotesize
\rput(2.5,5.5){\ovalnode{msg}{\sl ?msg}}
\rput(10,8){\ovalnode{stmt}{\sl ?stmt}}
\rput(2.5,8){\ovalnode{rel}{\it dc:relation}}
\rput(5,10.5){\ovalnode{tag}{\sl ?tag}}
\rput(14,10.5){\ovalnode{date}{\sl ?date}}
\rput(17,8){\ovalnode{rating}{\sl ?rating}}
\rput(14,5.5){\ovalnode{parent}{\sl ?parent}}
\rput(8,1){\ovalnode{origlang}{\sl ?original\_lang}}
\rput(11.2,3.3){\ovalnode{trans}{\sl ?translation}}
\rput(19.2,1){\ovalnode{translang}{\sl ?translation\_lang}}
\ncline{<-}{msg}{stmt} \aput{:U}(0.4){\it rdf:subject}
\ncline{<-}{rel}{stmt} \aput{:U}{\it rdf:predicate}
\ncline{<-}{tag}{stmt} \aput{:U}{\it rdf:object}
\ncline{->}{stmt}{date} \aput{:U}{\it dc:date}
\ncline{->}{stmt}{rating} \aput{:U}{\it s:rating}
\ncline{->}{msg}{parent} \aput{:U}(0.6){\it dct:isPartOf}
\ncline{->}{msg}{origlang} \aput{:U}(0.6){\it dc:language}
\ncline{<-}{msg}{trans} \aput{:U}(0.65){\it s:isTranslationOf}
\ncline{->}{trans}{translang} \aput{:U}(0.6){\it dc:language}
\psccurve[curvature=0.75 0.1 0,linestyle=dashed,showpoints=false]%
(0.3,5)(0.3,10)(3,11.3)(20,9.5)(20,7)(8.5,7)(2.5,4.5)
\rput(18.8,10){$\Pi$}
\rput(16.5,5.5){$N$}
\rput(12.5,1.5){$\Omega$}
\end{pspicture}
\caption{Graph pattern $\Psi$ for the example query}
\label{graph-pattern}
\end{figure}

The output of the algorithm is a join expression $F$ and a condition
$W$, ready for composition into the {\tt FROM} and {\tt WHERE} clauses
of an SQL {\tt SELECT} statement.
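Before turning to the algorithm's steps, the three subpatterns of the
example query can be written down as plain Ruby data (a hypothetical
encoding with one {\tt [predicate, subject, object]} entry per arc,
mirroring the Squish triple syntax, filter conditions omitted; this is
not the module's internal representation):

\begin{Verbatim}[fontsize=\scriptsize]
PATTERN = {
  :main => [      # must bind
    %w{rdf::predicate ?stmt dc::relation},
    %w{rdf::subject   ?stmt ?msg},
    %w{rdf::object    ?stmt ?tag},
    %w{dc::date       ?stmt ?date},
    %w{s::rating      ?stmt ?rating}],
  :negative => [  # must not bind
    %w{dct::isPartOf ?msg ?parent}],
  :optional => [  # may bind
    %w{dc::language ?msg ?original_lang},
    %w{s::isTranslationOf ?msg ?translation},
    %w{dc::language ?translation ?translation_lang}]
}
\end{Verbatim}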
In the algorithm description below, $\mathrm{id}(r)$ denotes the primary
key of relation $r \in R$, and $\rho(n)$ denotes the value of
$\mathrm{id}(Resource)$ for a non-variable node $n \in \Psi_{nodes}$
where such a value is known during query translation.\footnote{E.g.,
Samizdat uses the {\em site-ns/resource-id\/} notation for internal
resource URIrefs.}

% the algorithm
Key steps of the query pattern translation algorithm correspond to the
following private methods of {\tt SqlMapper}:

{\tt label\_pattern\_components}: Label every connected component of
$\Pi$, $N$, and $\Omega$ with different colors $K$ such that
$K_\Pi: \Pi_{nodes} \to \mathbb{K}$, $K_N: N_{nodes} \to \mathbb{K}$,
$K_\Omega: \Omega_{nodes} \to \mathbb{K}$,
$K(n) = K_\Pi(n) \cup K_N(n) \cup K_\Omega(n)$. The two-pass connected
component labeling algorithm~\cite{shapiro} is used, with a special case
that excludes nodes present in $\Pi$ from neighbour lists while labeling
$N$ and $\Omega$. The special case ensures that parts of $N$ and
$\Omega$ which are only connected through a node in $\Pi$ are labeled
with different colors; a sketch in Ruby is given below.

{\tt map\_predicates}: Map each arc $c = \langle s, p, o \rangle \in
\Psi_{arcs}$ to the relational data model according to $M$: define a
mapping $M_{attr}^{pos}: \Psi_{arcs} \times \Psi_{nodes} \to \Phi$ such
that $M_{attr}^{pos}(c, s) = \mathrm{id}( M_{rel}(p) )$ and
$M_{attr}^{pos}(c, o) = M_{attr}(p)$; replace each unmapped arc with its
reification and map the resulting arcs in the same manner;\footnote{$M$
is expected to map reification properties to the triples table.} for
each arc labeled with a subproperty predicate, add an arc mapped to the
subproperty qualifier attribute.

For each node $n \in \Psi_{nodes}$, find its adjacent arcs
$\Psi_{arcs}^n = \{\langle s, p, o \rangle \in \Psi_{arcs} \mid n \in
\{s, o\}\}$ and determine its bind mode $\beta_{node}: \Psi_{nodes} \to
\{ \Pi, N, \Omega \}$ such that
$\beta_{node}(n) = \max\{\beta_{arc}(c) \mid c \in \Psi_{arcs}^n\}$,
where $\beta_{arc}(c)$ reflects which of the graph patterns
$\{ \Pi, N, \Omega \}$ contains arc $c$, and the order of precedence
used by $\max$ is $\Pi > N > \Omega$.

{\tt define\_relation\_aliases}: Map each node in $\Psi$ to one or more
relation aliases $a \in \mathbb{A}$ according to the algorithm described
in \figurename~\ref{define-relation-aliases}. The algorithm produces a
mapping $C_a: \Psi_{arcs} \to \mathbb{A}$, which links every arc in
$\Psi$ to an alias, and mappings $A = \langle A_{rel}, A_{node},
A_\beta, A_{filter} \rangle$ where $A_{rel}: \mathbb{A} \to R$,
$A_{node}: \mathbb{A} \to \Psi_{nodes}$,
$A_\beta: \mathbb{A} \to \{ \Pi, N, \Omega \}$,
$A_{filter}: \mathbb{A} \to \mathbb{F}$, which record the relation,
node, bind mode, and filter condition for each alias.
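Returning to the labeling step, {\tt label\_pattern\_components} can be
sketched in Ruby as follows (an illustrative re-implementation that
substitutes union-find for the two-pass equivalence table
of~\cite{shapiro}; the {\tt excluded} argument implements the special
case described above):

\begin{Verbatim}[fontsize=\scriptsize]
# Label one subpattern; nodes in `excluded' (those of the main
# pattern, when labeling N or Omega) are dropped from neighbour
# lists, so parts of N or Omega connected to each other only
# through a main pattern node receive different colors.
def label_components(nodes, arcs, excluded = [])
  parent = {}    # union-find over provisional colors
  find = lambda { |k|
    parent[k] == k ? k : (parent[k] = find.call(parent[k])) }
  label = {}
  nodes.each do |n|
    seen = arcs.select { |s, _, o| [s, o].include?(n) }
               .flat_map { |s, _, o| [s, o] }
               .reject { |m| m == n or excluded.include?(m) }
               .map { |m| label[m] }.compact
               .map { |k| find.call(k) }.uniq
    if seen.empty?    # start a new color
      label[n] = parent.size
      parent[label[n]] = label[n]
    else              # merge equivalent colors
      label[n] = seen.min
      seen.each { |k| parent[k] = label[n] }
    end
  end
  nodes.map { |n| [n, find.call(label[n])] }.to_h
end
\end{Verbatim}

For the example pattern, labeling $\Omega$ with the nodes of $\Pi$
excluded yields two colors: one for {\sl ?original\_lang\/} and another
for {\sl ?translation\/} and {\sl ?translation\_lang\/} (compare the
separate subquery in the translated SQL below).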
\begin{figure}
\begin{algorithmic}[1]
\ForAll {$n \in \Psi_{nodes}$}
\ForAll {$c = \langle s, p, o \rangle \in \Psi_{arcs} \mid s = n \, \wedge \, C_a(c) = \emptyset$}
\If {$\exists c' = \langle s', p', o' \rangle \mid n \in \{s', o'\} \, \wedge \, C_a(c') \not= \emptyset \, \wedge \, M_{rel}(p') = M_{rel}(p)$}
\State $C_a(c) \gets C_a(c')$
\Comment Reuse the alias assigned to an arc adjacent to $n$ and mapped to the same relation
\Else \Comment Create a new alias
\State $a \gets \max(\mathbb{A}) + 1$; $\mathbb{A} \gets \mathbb{A} \cup \{ a \}$; $C_a(c) \gets a$
\State $A_{node}(a) \gets n$, $A_{filter}(a) \gets \emptyset$
\If {$M_{trans}(p) = \emptyset$} \Comment Use the base relation
\State $A_{rel}(a) \gets M_{rel}(p)$
\State $A_\beta(a) \gets \beta_{node}(n)$
\Else \Comment Use the transitive closure
\State $A_{rel}(a) \gets M_{trans}(p)$
\State $A_\beta(a) \gets \beta_{arc}(c)$
\State \Comment Use the arc's bind mode instead of the node's
\EndIf
\EndIf
\EndFor
\EndFor
\ForAll {$c \in \Psi_{arcs}$}
\State $A_{filter}( C_a(c) ) \gets A_{filter}( C_a(c) ) \cup F_c(c)$
\State \Comment Add the arc's filter to the linked alias's filters
\EndFor
\end{algorithmic}
\caption{Define relation aliases}
\label{define-relation-aliases}
\end{figure}

{\tt transform}: Define bindings $B: \Psi_{nodes} \to \mathbb{B}$, where
$\mathbb{B} = \{\{ \langle a, f \rangle \mid a \in \mathbb{A}, f \in \Phi \}\}$,
of graph pattern nodes to sets of pairs of relation aliases and
attributes, such that
\begin{equation}
\begin{split}
\langle a, f \rangle \in B(n) \iff &\exists c \in \Psi_{arcs}^n : \\
&C_a(c) = a \,\wedge\, M_{attr}^{pos}(c, n) = f \, .
\end{split}
\end{equation}
Transform the graph pattern $\Psi$ into a relational query graph
$Q = \langle \mathbb{A}, J \rangle$ where the nodes $\mathbb{A}$ are the
relation aliases defined earlier and the edges
$J = \{ \langle b_1, b_2, n \rangle \mid b_1 = \langle a_1, f_1 \rangle
\in B(n), b_2 = \langle a_2, f_2 \rangle \in B(n), a_1 \not= a_2 \}$ are
join conditions, as sketched below.

Ground non-variable nodes according to the algorithm defined in
\figurename~\ref{ground-non-variable-nodes}. Record the set of grounded
nodes $G \subseteq \Psi_{nodes}$ such that
\begin{equation}
\begin{split}
n \in G \iff &n \in F_g \,\vee\, \exists \langle b_1, b_2, n \rangle \in J \\
&\vee\, \exists b \in B(n) \, \exists a \in \mathbb{A} : b \in A_{filter}(a) \, .
\end{split}
\end{equation}

\begin{figure}
\begin{algorithmic}[1]
\State pick any $b = \langle a, f \rangle \in B(n)$
\If {$n$ is an internal resource and $\rho(n) = i$}
\State $A_{filter}(a) \gets A_{filter}(a) \cup (b = i)$
\ElsIf {$n$ is a query parameter or a literal}
\State $A_{filter}(a) \gets A_{filter}(a) \cup (b = n)$
\ElsIf {$n$ is a URIref}
\Comment Add a join to a URIref tuple in the Resource relation
\State $\mathbb{A} \gets \mathbb{A} \cup \{ a_r \}$; $A_{node}(a_r) \gets n$; $A_{rel}(a_r) \gets Resource$; $A_\beta(a_r) \gets \beta_{node}(n)$
\State $B(n) \gets B(n) \cup \langle a_r, \mathrm{id}(Resource) \rangle$; $J \gets J \cup \{ \langle b, \langle a_r, \mathrm{id}(Resource) \rangle, n \rangle \}$
\State $A_{filter}(a_r) \gets A_{filter}(a_r) \cup ( \langle a_r, literal \rangle = \mathit{false} \wedge \langle a_r, uriref \rangle = \mathit{true} \wedge \langle a_r, label \rangle = n )$
\EndIf
\end{algorithmic}
\caption{Ground non-variable nodes}
\label{ground-non-variable-nodes}
\end{figure}

Transformation of the example query presented above results in the
relational query graph shown in \figurename~\ref{join-graph}.
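The derivation of the join conditions $J$ from the bindings $B$ can be
sketched in Ruby as follows (illustrative; every pair of bindings of the
same node with distinct aliases yields one equality edge):

\begin{Verbatim}[fontsize=\scriptsize]
# bindings maps each node to its [alias, attribute] pairs
def join_conditions(bindings)
  bindings.flat_map do |node, pairs|
    pairs.combination(2).
      reject { |(a1, _), (a2, _)| a1 == a2 }.
      map { |b1, b2| [b1, b2, node] }
  end
end

# the bindings of ?msg in the example query:
j = join_conditions('?msg' => [[:a, :subject], [:c, :id], [:d, :id]])
# yields a.subject = c.id, a.subject = d.id, and c.id = d.id;
# the last one is redundant, and the minimum tree cover built in
# the next step drops such edges.
\end{Verbatim}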
\begin{figure}
\centering
\psset{unit=3.8mm,labelsep=0.2pt}
\begin{pspicture}[showgrid=false](0,0)(23,13)
\footnotesize
\rput(1,6){\circlenode{b}{\vphantom{Ij}b}}
\rput(6.7,6){\circlenode{a}{\vphantom{Ij}a}}
\rput(12.8,6){\circlenode{c}{\vphantom{Ij}c}}
\rput(2,11){\circlenode{d}{\vphantom{Ij}d}}
\rput(1,1){\circlenode{g}{\vphantom{Ij}g}}
\rput(22,11){\circlenode{f}{\vphantom{Ij}f}}
\rput(20,1){\circlenode{e}{\vphantom{Ij}e}}
\ncline{-}{b}{a} \aput{:U}(0.4){a.id = b.id} \bput{:U}(0.35){?stmt}
\ncline{-}{a}{c} \aput{:U}{a.subject = c.id} \bput{:U}{?msg}
\ncline{-}{d}{a} \aput{:U}{a.subject = d.id} \bput{:U}(0.4){?msg}
\ncline{-}{g}{a} \aput{:U}(0.43){a.predicate = g.id} \bput{:U}{\it dc:relation}
\ncline{-}{c}{f} \aput{:U}{c.part\_of\_subproperty = f.id} \bput{:U}{\it s:isTranslationOf}
\ncline{-}{c}{e} \aput{:U}{c.part\_of = e.id} \bput{:U}{?translation}
\pspolygon[linestyle=dashed,linearc=0.8](0.1,0.1)(0.1,11.9)(14.5,11.9)(14.5,0.1)
\rput(13.8,1){$P_1$}
\end{pspicture}
\caption{Relational query graph $Q$ for the example query}
\label{join-graph}
\end{figure}

{\tt generate\_tables\_and\_conditions}: Produce an ordered connected
minimum edge-disjoint tree cover $P$ of the relational query graph $Q$
such that $\forall P_i \in P$ \,
$\forall j = \langle b_{j1}, b_{j2}, n_j \rangle \in P_i$ \,
$\forall k = \langle b_{k1}, b_{k2}, n_k \rangle \in P_i$:
\begin{gather}
K(n_j) \cap K(n_k) \not= \emptyset \, , \\
\beta_{node}(n_j) = \beta_{node}(n_k) = \beta_{tree}(P_i) \, ,
\end{gather}
starting with $P_1$ such that $\beta_{tree}(P_1) = \Pi$ (it follows from
the definitions of $\Psi$ and {\tt transform} that $P_1$ is the only
such tree and covers all join conditions
$\langle b_1, b_2, n \rangle \in J$ such that $\beta_{node}(n) = \Pi$).

Encode $P_1$ as the root inner join. Encode all other trees with at
least one edge as subqueries. Left-join the subqueries, along with the
aliases representing roots of zero-length trees, into the join
expression $F$.

For each $P_i$ such that $\beta_{tree}(P_i) = N$, find a binding
$b = \langle a, f \rangle \in P_i$ such that $a \in P_1 \cap P_i$ and
add a ($b$ {\tt IS NULL}) condition to $W$. For each non-grounded node
$n \not\in G$ such that $b = \langle a, f \rangle \in B(n) \, \wedge \,
a \in P_1$, add a ($b$ {\tt IS NOT NULL}) condition to $W$ if
$\beta_{node}(n) = \Pi$, or a ($b$ {\tt IS NULL}) condition if
$\beta_{node}(n) = N$. Add $F_g$ to $W$.

Translation of the example query presented earlier results in the
following SQL:

\begin{Verbatim}[fontsize=\scriptsize]
SELECT DISTINCT a.subject, max(b.published_date)
FROM Statement AS a
INNER JOIN Resource AS b ON (a.id = b.id)
INNER JOIN Resource AS c ON (a.subject = c.id)
INNER JOIN Message AS d ON (a.subject = d.id)
INNER JOIN Resource AS g ON (a.predicate = g.id)
  AND (g.literal = 'false' AND g.uriref = 'true'
  AND g.label = 'http://purl.org/dc/elements/1.1/relation')
LEFT JOIN (
  SELECT e.language AS _field_b, c.id AS _field_a
  FROM Message AS e
  INNER JOIN Resource AS f
    ON (f.literal = 'false' AND f.uriref = 'true' AND f.label =
    'http://www.nongnu.org/samizdat/rdf/schema#isTranslationOf')
  INNER JOIN Resource AS c ON (c.part_of_subproperty = f.id)
    AND (c.part_of = e.id)
) AS _subquery_a ON (c.id = _subquery_a._field_a)
WHERE (b.published_date IS NOT NULL)
  AND (a.object IS NOT NULL)
  AND (a.rating IS NOT NULL)
  AND (c.part_of IS NULL)
  AND (a.rating >= ?)
  AND (d.language = ? OR _subquery_a._field_b = ?)
GROUP BY a.subject
ORDER BY max(b.published_date) DESC
\end{Verbatim}

\section{Update Command Execution}
\label{update-execution}

An update command uses the same graph pattern structure as a query, and
additionally defines a set $\Delta \subset \Psi_{nodes}$ of variables
representing new RDF resources and a mapping
$U: \Psi_{nodes} \to \mathbb{L}$ of variables to literal values.

Execution of an update command starts with query pattern translation
using the algorithm described in section~\ref{query-translation}. The
variables $\Psi$, $A$, $Q$, etc.\ produced by pattern translation are
used in the subsequent stages as described below:
\begin{enumerate}
% node values
\item Construct a node values mapping $V: \Psi_{nodes} \to \mathbb{L}$
using the algorithm defined in \figurename~\ref{node-values}. Record
the resources inserted into the database during this stage in
$\Delta_{new} \subseteq \Psi_{nodes}$ (it follows from the algorithm
definition that $\Delta \subseteq \Delta_{new}$).

\begin{figure}
\begin{algorithmic}[1]
\ForAll {$n \in \Psi_{nodes}$}
\If {$n$ is an internal resource and $\rho(n) = i$}
\State $V(n) \gets i$
\ElsIf {$n$ is a query parameter or a literal}
\State $V(n) \gets n$
\ElsIf {$n$ is a variable}
\If {$\nexists c = \langle n, p, o \rangle \in \Psi_{arcs}$}
\State \Comment If found only in object position
\State $V(n) \gets U(n)$
\Else
\If {$n \not\in \Delta$}
\State $V(n) \gets \mathrm{SquishSelect}(n, \Psi^{n*})$
\EndIf
\If {$V(n) = \emptyset$}
\State Insert $n$ into the $Resource$ relation
\State $V(n) \gets \rho(n)$
\State $\Delta_{new} \gets \Delta_{new} \cup \{n\}$
\EndIf
\EndIf
\ElsIf {$n$ is a URIref}
\State Select $n$ from the $Resource$ relation, insert if missing
\State $V(n) \gets \rho(n)$
\EndIf
\EndFor
\end{algorithmic}
\caption{Determine node values. $\Psi^{n*}$ is the subgraph of $\Psi$
reachable from $n$. $\mathrm{SquishSelect}(n, \Psi)$ finds a mapping of
variable $n$ that satisfies pattern $\Psi$.}
\label{node-values}
\end{figure}

% data assignment
\item For each alias $a \in \mathbb{A}$, find the subset of the graph
pattern $\Psi_{arcs}^a \subseteq \Psi_{arcs}$ such that
$c \in \Psi_{arcs}^a \iff C_a(c) = a$, select a key node $k$ such that
$\exists c = \langle k, p, o \rangle \in \Psi_{arcs}^a$, and collect a
map $D_a: \Phi \to \mathbb{L}$ of fields to values such that
$\forall c = \langle s, p, o \rangle \in \Psi_{arcs}^a :
D_a(M_{attr}^{pos}(c, o)) = V(o)$. If $k \in \Delta_{new}$ and
$A_{rel}(a) \not= Resource$, transform $D_a$ into an SQL {\tt INSERT}
into $A_{rel}(a)$ with an explicit primary key assignment
$\mathrm{id}(A_{rel}(a)) \gets V(k)$. Otherwise, transform $D_a$ into an
{\tt UPDATE} statement on the tuple in $A_{rel}(a)$ for which
$\mathrm{id}(A_{rel}(a)) = V(k)$.

% iterative assertions
\item Execute the SQL statements produced in the previous stage inside
a single transaction, in an order that resolves their mutual references.
\end{enumerate}

\section{Implementation}

The algorithms described in the previous sections are implemented by the
Samizdat RDF storage module, which is used as the primary means of data
access in the Samizdat open publishing system. The module is written in
the Ruby programming language, supported by several triggers written in
procedural SQL. The module and the whole Samizdat engine are available
under the GNU General Public License.

Samizdat exposes all RDF resources underpinning the structure and
content of the site. An HTTP request with the URL of any internal
resource yields a page with detailed information about the resource and
its relations with other resources.
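A hypothetical usage sketch of the module's external API (constructor
and method names are illustrative rather than the module's exact
interface):

\begin{Verbatim}[fontsize=\scriptsize]
store = RDF.new(RDFConfig.new('rdf.yaml'))

# run a Squish query; returns rows of variable bindings
rows = store.select(%{
  SELECT ?msg
  WHERE (dc::date ?msg ?date)
  ORDER BY ?date DESC})
\end{Verbatim}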
Furthermore, Samizdat provides a graphical interface that allows users
to compose arbitrary Squish queries.\footnote{The complexity of user
queries is limited to a configurable maximum number of triples in the
graph pattern to prevent abuse.} Queries may be published so that other
users can modify and reuse them; the results of a query may be accessed
either as plain HTML or as an RSS feed.

\section{Evaluation of Results}
\label{evaluation}
%\enlargethispage{-1ex}

Samizdat performance was measured using the Berlin SPARQL Benchmark
(BSBM)~\cite{bsbm}, with the following variations: a functional
equivalent of the BSBM test driver was implemented in Ruby and Squish
(instead of Java and SPARQL); the test platform included an Intel Core 2
Duo (instead of Quad) clocked at the same frequency, and 2GB of memory
(instead of 8GB). In this environment, Samizdat was able to process
25287 complete query mixes per hour (QMpH) on a dataset with 1M triples,
and achieved 18735 QMpH with 25M triples, in both cases exceeding the
figures for all RDF stores reported in~\cite{bsbm}.

In production, Samizdat was able to serve, without congestion, peak
loads of up to 5K hits per hour for a site with a dataset of 100K
triples in a shared VPS environment. Regeneration of the site frontpage
on the same dataset executes 997 Squish queries and completes in 7.7s,
which is comparable to RDBMS-backed content management systems.

\section{Comparison with Related Work}
\label{related-work}

As mentioned in section~\ref{introduction}, there exists a wide range of
solutions for relational to RDF mapping. Besides Samizdat, the approach
based on automatic on-demand translation of RDF queries into SQL is also
implemented by Federate~\cite{federate}, D2RQ~\cite{d2rq}, and
Virtuoso~\cite{virtuoso}.

While being one of the first solutions to provide on-demand relational
to RDF mapping, Samizdat remains one of the most advanced in terms of
query capabilities. Its single largest drawback is the lack of SPARQL
compatibility; at the same time, in some regards it exceeds the
capabilities of other solutions.

The alternative that is closest to Samizdat in terms of query
capabilities is Virtuoso RDF Views: it is the only other
relational-to-RDF mapping solution that provides partial RDFS and OWL
inference, aggregation, and an update language. Still, there are
substantial differences between these two projects. First of all, the
Samizdat RDF store is a small module (1000 lines of Ruby and 200 lines
of SQL) that can be used with a variety of RDBMSes, while Virtuoso RDF
Views is tied to its own RDBMS. Virtuoso does not support implicit
statement reification, although its design is compatible with this
feature. Finally, Virtuoso relies on SQL unions for queries with
unspecified predicates and for RDFS and OWL inference. While allowing
for greater flexibility than the database triggers described in
section~\ref{inference-triggers}, the iterative union operation has a
considerable impact on query performance.

\section{Future Work}
\label{future-work}

Since the SPARQL Recommendation has been published by W3C~\cite{sparql},
SPARQL support has been at the top of the Samizdat RDF store to-do list.
SPARQL syntax is considerably more expressive than Squish and will
require some effort to implement in Samizdat, but, since the design of
the implementation separates the syntactic layer from the query
translation logic, the algorithms described in this paper can be used to
translate SPARQL patterns to SQL with minimal changes.
The most substantial changes are expected to be required for the
explicit grouping of optional graph patterns and the associated filter
scope issues~\cite{cyganiak}.

The Samizdat RDF store should be made more adaptable to a wider variety
of problem domains. The query translation algorithm should be augmented
to translate an ambiguously mapped query (including queries with
unspecified predicates) into a union of alternative interpretations. The
mapping of the relational schema should be generalized, including
support for multi-part keys and more generic stored procedures for
reification and inference. A standard RDB2RDF mapping should be
implemented when W3C publishes a specification to that end.

\section{Conclusions}

The on-demand RDF to relational query translation algorithm described in
this paper utilizes existing relational databases to their full
potential, including indexing, transactions, and procedural SQL, to
provide efficient access to RDF data. The implementation of this
algorithm in the Samizdat RDF storage module has been tried in a
production environment and has demonstrated how Semantic Web
technologies can be introduced into an application serving thousands of
users without imposing additional requirements on hardware resources.

\vspace{1ex}

% ---- Bibliography ----
%
\begin{thebibliography}{19}

\bibitem {triplify}
Auer, S., Dietzold, S., Lehmann, J., Hellmann, S., Aumueller, D.:
Triplify -- Light-Weight Linked Data Publication from Relational
Databases. WWW 2009, Madrid, Spain (2009)\\
\url{http://www.informatik.uni-leipzig.de/~auer/publication/triplify.pdf}

\bibitem {bsbm}
Bizer, C., Schultz, A.:
The Berlin SPARQL Benchmark. International Journal On Semantic Web and
Information Systems (IJSWIS), Volume 5, Issue 2 (2009)\\
\url{http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/}

\bibitem {d2rq}
Bizer, C., Seaborne, A.:
D2RQ -- Treating non-RDF databases as virtual RDF graphs. In: ISWC 2004
(posters)\\
\url{http://www.wiwiss.fu-berlin.de/bizer/D2RQ/spec/}
\bibitem {samizdat-rel-rdf}
Borodaenko, D.:
Accessing Relational Data with RDF Queries and Assertions (April 2004)\\
\url{http://samizdat.nongnu.org/papers/rel-rdf.pdf}

\bibitem {samizdat-collreif}
Borodaenko, D.:
Model for Collaborative Decision Making Based on RDF Reification (April
2004)\\
\url{http://samizdat.nongnu.org/papers/collreif.pdf}

\bibitem {cyganiak}
Cyganiak, R.:
A relational algebra for SPARQL. Technical Report HPL-2005-170, HP Labs
(2005)\\
\url{http://www.hpl.hp.com/techreports/2005/HPL-2005-170.html}

\bibitem {virtuoso}
Erling, O., Mikhailov, I.:
RDF support in the Virtuoso DBMS. In: Proceedings of the 1st Conference
on Social Semantic Web, volume P-113 of GI-Edition -- Lecture Notes in
Informatics (LNI), ISSN 1617-5468. Bonner K\"{o}llen Verlag (2007)\\
\url{http://virtuoso.openlinksw.com/dav/wiki/Main/VOSArticleRDF}

\bibitem {squish}
Miller, L., Seaborne, A., Reggiori, A.:
Three Implementations of SquishQL, a Simple RDF Query Language. In:
Horrocks, I., Hendler, J. (Eds.) ISWC 2002. LNCS, vol. 2342, pp.
423-435. Springer, Heidelberg (2002)\\
\url{http://ilrt.org/discovery/2001/02/squish/}

\bibitem {federate}
Prud'hommeaux, E.:
RDF Access to Relational Databases (2003)\\
\url{http://www.w3.org/2003/01/21-RDF-RDB-access/}

\bibitem {sparql}
Prud'hommeaux, E., Seaborne, A.:
SPARQL Query Language for RDF. W3C Recommendation (January 2008)\\
\url{http://www.w3.org/TR/rdf-sparql-query/}

\bibitem {shapiro}
Shapiro, L., Stockman, G.:
Computer Vision, pp. 69-73. Prentice-Hall (2002)\\
\url{http://www.cse.msu.edu/~stockman/Book/2002/Chapters/ch3.pdf}

\bibitem {sp2b}
Schmidt, M., Hornung, T., K\"{u}chlin, N., Lausen, G., Pinkel, C.:
An Experimental Comparison of RDF Data Management Approaches in a SPARQL
Benchmark Scenario. In: A. Sheth et al. (Eds.) ISWC 2008. LNCS, vol.
5318, pp. 82-97. Springer, Heidelberg (2008)\\
\url{http://www.informatik.uni-freiburg.de/~mschmidt/docs/sp2b\_exp.pdf}

\end{thebibliography}

\end{document}