ESWC 2020

Paper.80_Review.0 hasContent "----------- Strong Points -----------
- Domain-independent scoring function and threshold heuristic
- Evaluation using textual, structured, and “dirty” datasets
- Implementation and data available online

----------- Weak Points -----------
- Valley and elbow thresholding yield rather similar results

Summary: The paper addresses an important shortcoming of active learning for entity resolution, namely the cold start problem. The proposed method deals with the cold start problem by introducing unsupervised matching, based on a novel domain-independent threshold heuristic, to bootstrap active learning. The unsupervised matching uses datatype-specific similarity metrics to assign a similarity score to every record pair. The threshold “t” is then set to the elbow point of the cumulative similarity score distribution of all record pairs. The distance between the threshold and the aggregated similarity score of each pair serves as a confidence weight, which is used to provide the active learning with the most suitable pairs at every iteration; that is, noisier pairs are supposed to affect the warm start less than more confident pairs. The method is evaluated on three different types of data (structured, textual, and dirty) and shows promising results. The evaluation experiments are well designed to measure the influence of the proposed thresholding heuristic, the bootstrapping, and the warm start of the active learning.

Introduction and Related Work: The introduction, like the paper as a whole, is well written and introduces the problem and its specifics. Related work is nicely structured along the three main points of the presented methodology: feature engineering, unsupervised matching, and active learning.

Proposed Active Learning Methodology: The authors propose a two-step methodology. The first step is unsupervised matching, which labels pairs and assigns confidence weights using the elbow point threshold method. In the second step, the pairs labeled and weighted by the unsupervised step are used in the warm start pool to bootstrap the training of the active learning random forest classifier and of a heterogeneous committee of five different classifiers, which includes the random forest classifier. In every iteration of the active learning, the committee selects a pair from the noisy pool to be added to the labeled set after manual labeling. The labeled set is used to incrementally train new trees of the random forest classifier. This procedure allows the initial model learned in the warm start phase to “fade away”.

Experiments and Evaluation: I appreciate the evaluation procedure, which aims to highlight the specifics of the proposed threshold heuristic. Nonetheless, I missed a comparison to other existing entity resolution approaches. Such a comparison would have put the results in a different light. It would also have been interesting to see an evaluation of the effects of blocking non-matches, which might justify the selected threshold of 0.2. My main point of concern, however, is the very similar results of the valley and elbow threshold methods. For example, if we look at the deltas to the supervised F1-scores, we have three wins for the elbow, two wins for the valley, and one for the static threshold. For the unsupervised setting we see similar results: for two datasets the two methods differ only in the third decimal place. The results are presented in a clear and insightful way. The authors may, however, consider using a color other than yellow for the “no_boot” results, as the standard deviations are rather difficult to see in Figures 5-7.

From my point of view, the authors elegantly combine a set of existing methodologies and techniques in an interesting and innovative way to solve an important problem. The key point of the paper is the elbow point threshold heuristic. Overall, I think this paper presents a sound and valuable contribution to the ESWC community and should be accepted.

=============== After Rebuttal ===============
I keep my original score.".
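
The elbow-point thresholding and confidence weighting summarized in the review can be illustrated with a small sketch. This is not the paper's implementation: the review does not say how the elbow is located or how distances are scaled, so the code below assumes a common heuristic (the point of the sorted score curve that lies farthest from the straight line joining its endpoints) and uses the raw absolute distance to the threshold as the weight. The function names `elbow_threshold` and `label_with_confidence` and the example scores are hypothetical.

```python
import numpy as np


def elbow_threshold(scores):
    """Return the score at the elbow of the sorted similarity-score curve.

    Assumption (not taken from the paper): the elbow is the point with the
    largest vertical gap to the chord joining the first and last points of
    the curve, after scaling both axes to [0, 1].
    """
    s = np.sort(np.asarray(scores, dtype=float))
    if s.size < 3:
        raise ValueError("need at least three scores to locate an elbow")
    span = s[-1] - s[0]
    if span == 0.0:
        return float(s[0])  # all scores identical: no elbow to find
    x = np.linspace(0.0, 1.0, s.size)  # normalized rank of each pair
    y = (s - s[0]) / span              # normalized similarity score
    # On the scaled axes the chord is y = x, so the gap is |y - x|.
    idx = int(np.argmax(np.abs(y - x)))
    return float(s[idx])


def label_with_confidence(scores):
    """Label pairs as match / non-match and weight them by distance to t."""
    scores = np.asarray(scores, dtype=float)
    t = elbow_threshold(scores)
    labels = scores > t            # True = presumed match
    weights = np.abs(scores - t)   # farther from t = more confident label
    return t, labels, weights


# Made-up aggregated similarity scores: many low-scoring pairs, two high ones.
example_scores = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.85, 0.95]
t, labels, weights = label_with_confidence(example_scores)
print("threshold:", t)
print("labels:", labels.tolist())
print("weights:", np.round(weights, 2).tolist())
```

Pairs close to the threshold receive weights near zero, which matches the reviewer's description that noisier pairs should influence the warm start less than confident ones. In a scikit-learn setting such weights could plausibly be passed as `sample_weight` when fitting the initial classifier, though whether the paper does so is not stated in the review.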