Matches in ESWC 2020 for { <https://metadata.2020.eswc-conferences.org/rdf/submissions/Paper.155_Review.0> ?p ?o. }
- Paper.155_Review.0 type ReviewVersion.
- Paper.155_Review.0 issued "2001-01-27T15:08:00.000Z".
- Paper.155_Review.0 creator Paper.155_Review.0_Reviewer.
- Paper.155_Review.0 hasRating ReviewRating.1.
- Paper.155_Review.0 hasReviewerConfidence ReviewerConfidence.4.
- Paper.155_Review.0 reviews Paper.155.
- Paper.155_Review.0 issuedAt easychair.org.
- Paper.155_Review.0 issuedFor Conference.
- Paper.155_Review.0 releasedBy Conference.
- Paper.155_Review.0 hasContent "== Paper summary ==

This paper proposes a method to compute aggregate queries over SPARQL endpoints that support web preemption. Web preemption was proposed in previous work by the authors as a way for a server to pause a long-running query, requiring the client to make a new request to continue executing it; this technique allows servers to control the cost of executing individual requests without having to terminate a query after a timeout, allowing both fairer scheduling of client requests and the ability to run more expensive queries across multiple requests. While previous work looked at features such as BGPs with union, where the client can simply take the (bag) union of results over different requests, this work focuses specifically on the case of aggregation, where the final results must be grouped and the aggregation function applied. However, rather than gathering all results and applying aggregation at the end, the authors propose a technique whereby the server returns the results of an intermediate aggregation over batches of partial results (computed within a single request), where these intermediate aggregations are themselves aggregated on the client across requests to compute the final result. For example, an average is computed by having the server compute a sum and a count over the partial results of each request and return these to the client, which subsequently divides the total sum by the total count over all such pairs to obtain the final result; in the context of web preemption, such a technique can reduce the data transferred from the server to the client. These aggregation techniques under preemption are described in the paper for SUM, COUNT, MAX, MIN, AVERAGE, SUM DISTINCT, COUNT DISTINCT and AVERAGE DISTINCT, with the latter four aggregate functions being the most challenging (not being associative/commutative).
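To make the decomposition concrete, the AVERAGE case sketched above can be illustrated roughly as follows (a minimal sketch of my own; the function names and batch representation are illustrative, not the paper's actual operators):

```python
# Sketch of the decomposable-average idea described in the summary.
# Each server "quantum" returns a partial (sum, count) pair; the client
# merges the pairs and only divides at the very end. Names are illustrative.

def partial_avg(batch):
    """Server side: aggregate one batch of bindings into (sum, count)."""
    return (sum(batch), len(batch))

def merge_avg(p1, p2):
    """Client side: merging partials is just component-wise addition."""
    return (p1[0] + p2[0], p1[1] + p2[1])

def final_avg(partial):
    """Client side: apply the final division once all partials are merged."""
    s, c = partial
    return s / c if c else None

batches = [[2, 4], [6], [8, 10]]   # results of three quanta
total = (0, 0)
for b in batches:
    total = merge_avg(total, partial_avg(b))
print(final_avg(total))            # 6.0
```

Only the (sum, count) pairs cross the network, which is what reduces data transfer relative to shipping all intermediate mappings.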
Experiments are provided over BSBM (the Berlin SPARQL Benchmark) and DBpedia using an existing set of SPARQL aggregate queries (proposed in another work to compute VoID descriptions over endpoints). Comparisons are made between the authors' web preemption framework without the proposed optimisations (SaGe), with the proposed optimisations (SaGe-AGG), as well as a TPF client and Virtuoso. The results confirm some expected trends: Virtuoso is faster overall and transfers less data (but does not offer preemption), TPF performs the worst (not being able to do joins server-side), while SaGe-AGG performs better than SaGe, reducing the amount of data to be transferred and processed (particularly in the simpler non-distinct cases). The paper is highly relevant to the ESWC conference, and though it could have been submitted to a number of tracks, the choice of "Distribution and Decentralization" seems a good one. In terms of strengths and weaknesses …

== Strengths ==

S1: I like the idea of web preemption. It is a relatively simple but practical idea that raises some interesting technical questions. Though the idea was originally proposed in another paper, the optimisation of aggregate queries proposed in the current submission seems a good topic for a paper at ESWC, and a step towards making preemption more applicable in practice. In summary, I could really see this line of work finding applications in SPARQL endpoints in practice.

S2: The informal discussion in the paper is quite easy to follow. The approach is well motivated, existing literature is discussed adequately (to the best of my knowledge), etc.

S3: The proposed approach, though quite straightforward, makes a lot of sense, and does lead to significant speed-ups.
S4: Though the experiments could be more comprehensive (and some more details could be added, discussed later), in general I think the experiments presented are sufficient to get a good overall idea of the performance of the system in practice, with respect to an existing set of aggregation queries, two different datasets, and four different strategies. The discussion of results and subsequent conclusions also seem appropriate, address limitations of the current work, and are easy to follow.

== Weaknesses ==

W1: While in S2 I mentioned that the informal discussion in the paper is quite easy to follow, the main weakness I detect in this paper is with respect to the formal definitions, which are (unfortunately) sloppy throughout the paper. This sloppiness makes the formal parts of the paper quite frustrating to read and understand. In general, I think I would not have understood what the authors were trying to define were it not for the fact that the technique is relatively straightforward and I knew a priori (from frameworks such as Spark) how such aggregations can be decomposed in batches. I will try to be specific so these aspects can be improved. (I hope this is not interpreted as nitpicking, but each issue listed resulted in real confusion for me and, taken together, they affect the technical quality of the paper.)

- General: Throughout the paper, the authors add unnecessary quantifications on free variables in set-builder notation. For example, they write { (x,y) | \forall x some condition holds and \exists y for which some other condition holds }. This is incorrect and really quite confusing. The variables on the left of the "such that" | are already universally quantified! The right-hand side should rather define a boolean condition (predicate) on the variables appearing on the left-hand side, e.g., { (x,y) | some condition holds on x and y }.
(Also, the notation often becomes a lot simpler by defining the domain of the variable(s) on the left; e.g., rather than { x | x \in X and ... }, one can more simply write { x \in X | ... }.)

- Definition 1: Compatibility was defined for a single variable, while it is used here for a set. Only knowing beforehand what I know can I guess that all variables in the set (rather than any) need to be compatible. This should be defined beforehand.

- Definition 1: The projection operator is not introduced as part of the query language previously; I also dislike that while P is defined using syntax ("AND", "OPT", "UNION"), the authors here mix syntax and relational operators (like \pi).

- Definition 2: While \mu is defined earlier as a partial mapping from variables to terms, here it is unioned with a single term. Being more specific, the authors do not associate the results of the aggregation function with a variable, which is inconsistent with the definition of \mu and incompatible with later definitions.

- "For mapping-at-a-time operators, ... is bounded by O(|Q| x log_b(|D|))": What is D? (I guess the RDF graph?) What is b? (Actually, b can just be removed, since changing base only introduces a constant factor.) What does it mean to suspend and resume a query, precisely? What happens to the iterators and the programme state? Is the time considered for resumption that taken to return the first result? What is the number of operators required to evaluate Q? Is this the same as the number of operators in Q? (These were not defined precisely.) Such a specific bound seems to require more details to make any sense; without such details, I don't see the point of stating it.

- Can a "full mapping operator" also be a "mapping-at-a-time operator"? I think the intent is that these are disjoint sets of operators, but the authors define that a full mapping operator must serialise all results, which seems to be the case for other operators, like JOIN.
I guess they mean that the operator must be applied after all intermediate results are materialised in order to compute the final results? Perhaps this just requires clearer phrasing in the text.

- "However, the operator used to evaluate SPARQL aggregation is a full-mapping operator ... hence it cannot be suspended and resumed in constant time." Why does this implication hold? Why is constant time important here? (Even mapping-at-a-time operators are not suspended and resumed in constant time.) How is constant time defined (in data complexity or combined complexity)?

- I have the same doubts about the core problem statement. I don't understand what precisely the suspend-and-resume task consists of, or what precisely is meant by "constant time".

- Footnote 6: I don't follow. Does this assume no unbound (group) variables? I assume so, but then how can the number of group keys be greater than the number of intermediate solutions? How does having all variables in the group condition affect this property? Perhaps I misunderstood something.

- Definition 3: Again, the quantification in the text is strange, mixing existential (if for *some* grouping variables ...) and universal (and for *all* non-empty multisets ...) in a way that I cannot wrap my head around. Is it not necessary to define that \Omega_1 and \Omega_2 have the same set of variables, and that V is a (non-empty) subset of that set?

- Definition 3: I found the k \mapsto v notation very confusing, as I don't know whether k is the key or an aggregation variable. Previous definitions used <k,\Omega> for grouped keys (not \mapsto), and Definition 2 dropped variables for the results of aggregation functions; only from the examples later can I guess that k is a variable here. (Also, why state k = k' rather than just use k in both instances?)

- Definition 3: This does not seem to work, in my understanding, if a key is only in \Omega_1 or only in \Omega_2, in which case the key is dropped.
- "CT(X) = { x | x \in X }" In other words, CT(X) = X? - "is decomposed as ..." This seems incorrect. Won't this compute the distinct values for V in \Omega_1 summed with the distinct values for V in \Omega_2? Take V = { ?a }, \gamma(V,CT(?o),\Omega_1) = { (:A,1) } and \gamma(V,CT(?o),\Omega_2) = { (:A,3) }. The union gives { (:A,1) , (:A,3) }. The count gives 2 rather than { (:A,3) } or { (:A,4) } as might be expected. - Definition 4: "and \omega_i \in [[P]]_G such that [[P]]_G = \bigcup_{i=1}^{i=n} \omega_i" This seems incorrect. If \omega_i in [[P]]_G, then it must be an element of [[P]]_G; if [[P]]_G = \bigcup_{i=1}^{i=n} \omega_i, then omega_i must be a subset of [[P]]_G. The only way this might be consistent is if [[P]]_G contains subsets of itself, which is probably not the intention. - Algorithm 1: what happens in Merge if Y contains a grouping key not seen previously in X? It seems that such a key should be added as an initial value, or combined with a zero value. Without this process, does this algorithm ever add anything to Z? - "on a single set of mappings, which can be done in constant time" Should this not be "a single mapping"? Again the question arises: "constant time" with respect to what? It does not seem constant in the number of variables, and merge (called from the non-interruptible section) reads through all solutions in X, so it's not constant in the intermediate results/data either. - "the preemptable SPARQL aggregation iterator is fully stateless". Again I don't really understand what this means. The iterator I_p needs a state to remember where it is (unless the server forgets this and the client passes this information somehow in the request?). W2: Algorithm 1 looks naive performance-wise: a nested loop is used to check through all solutions in X for one compatible in Y. 
Rather, a data structure could trivially be used to look up such solutions in X in O(1) or O(log |X|) time, as it seems X can grow reasonably large (depending on the quantum defined). I was assuming this was just for simplicity of presentation, but the paper does not actually state that any optimisations are applied beyond the stated algorithm.

W3: In general, the methods proposed are quite straightforward; the techniques for computing aggregations in this batch-wise streaming fashion are (as the authors allude to) known from frameworks such as MapReduce (combine/reduce) and Spark (aggregateByKey). Their application in this setting is indeed novel, but overall, this novelty is fairly straightforward.

== Other comments ==

As a side observation, I think the authors should include some idea of which SPARQL queries use which aggregation operators (an analysis of the effect of different aggregation operators on performance -- aside from the case of DISTINCT -- is missing). Also, perhaps experiments could be done with aggregate queries taken from logs for Wikidata, DBpedia, etc. (it might also be interesting to know what percentage of queries in these logs use aggregation).

== Verdict ==

All in all, I would be willing to accept the paper were W3 the only concern, as in general I think such contributions should be accepted if they have the potential for practical impact (which this work has). Primarily I am most worried by W1, and thereafter by W2. With respect to W1, in particular, I think the extensiveness of the issue goes beyond the scope of a camera-ready revision, in that after corrections are made, the paper should be reviewed again to ensure technical correctness. Hence I lean towards a reject.
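To illustrate W2, the kind of constant-time lookup I have in mind is simply keying the partial aggregation results by their grouping key (an illustrative sketch of my own, not the paper's Algorithm 1; the SUM combine function stands in for whichever per-function combination the algorithm actually uses):

```python
# Sketch of a hash-based Merge: partial aggregation results are keyed by
# grouping key, so merging Y into X needs no nested loop. The SUM combine
# function is a stand-in for the per-aggregate combination operator.

def merge(X, Y, combine=lambda a, b: a + b):
    """Merge partial aggregates Y into X in O(|X| + |Y|) expected time."""
    Z = dict(X)                               # copy X's key -> partial value entries
    for key, value in Y.items():
        if key in Z:
            Z[key] = combine(Z[key], value)   # O(1) expected-time lookup
        else:
            Z[key] = value                    # unseen key: added as initial value
    return Z

X = {(":A",): 2, (":B",): 5}
Y = {(":A",): 3, (":C",): 1}
print(merge(X, Y))   # {(':A',): 5, (':B',): 5, (':C',): 1}
```

Note that this also makes explicit what should happen for a key in Y that was not previously seen in X (cf. my question on Algorithm 1 above): it is simply added as an initial value.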
I do believe this work can have practical impact and hope to see an improved version of this paper published soon, revising the aforementioned issues in the formal presentation, as well as using data structures in the server-side merge; it would also strengthen the paper to include experiments for other sources of aggregate queries (perhaps taken from SPARQL logs) and to provide results comparing the same queries with different operators.

== Minor comments ==

- The plural of "quantum" is "quanta". Please revise throughout.
- "After [a] quantum of time"
- "to finally compute[] groups"
- "which [allows for performing] the decomposition and [recombination]"
- "that process[es] queries"
- "as the TPF server[] only processes"
- "An RDF triple $... \times T[]$" (stray parenthesis)
- What about SAMPLE or GROUP_CONCAT?
- "HAVING BY" -> "HAVING". Also, support for HAVING is never discussed?
- "[]Preemption adds an overhead"
- "To [address] these challenges"
- "a query $Q$ is bound[ed by]"
- "where $|Q[|]$ is the number"
- "is usually [much] bigger"
- "Reducing data transfer [requires] reduc[ing] ..."
- "on [the] server side and recombine[] partial"
- "[where] $n$ is the number of quant[a]"
- "as a mapping-at-a-time[] operator"
- "under the same condi[]tions"
- "duration of the quantum [seriously impacts]"
- "[A l]arge quantum"
- "However, [a] large quantum"
- "and [W]ikidata"
- Algorithm 1: It would be good to add "Server-Side" to the caption for clarity.
- "over all aggregation results in [X] ... to merge them with their equivalent in [Y] using the diff[e]rent"
- "and then suspend[s]"
- "and the [DISTINCT modifier]"
- "SAGE server $S$[],"
- "following [] the Web preemption model"
- "SPARQL aggregation[] results"
- Table 2: Better to right-align numeric columns. Also, I do not like the scientific notation used for a single number (better to write it in full).
- "It run[s] with [the] same configuration" - "between the start[] of the query and the production of [the final] results" - "it supports [] projection and joins [on the] server side" - "[significantly improves] data transfer" - "increasing the quantum [significantly improves the] execution times" - "quantum of 30s [] compared with Virtuoso" - "As ex[pe]cted" - "better performance[] in data transfer" - "on [the] server" - "aggregate[] push down" - "does not [support] intra-query" =================================================================== Post-rebuttal: I am quite satisfied by the authors' comments in the rebuttal. W1 was my main concern, and while I think there are potentially many (minor-ish) fixes to make for the camera ready version, I trust the authors to be able to clean up the notation (as they have already begun to do). I appreciate the clarification regarding W2. I agree that W3 is not a weakness per se in that well-executed, straightforward contributions are important and can make for very nice papers. I think though that in such cases a higher standard should apply to the paper and the work, which contradicts W1 in this case. Web pre-emption is an interesting approach and this is a useful contribution in that direction. With the clarifications of the rebuttal, I will improve my score to Weak Accept. I do urge the authors though to carefully correct the paper for the camera-ready version."".