ESWC 2020

Given that some Wikipedia pages (tagged as list pages) contain large lists of entities organized into categories or tables, this paper proposes a method to extract entities from these pages and identify their types. The motivation for this work is to extend general knowledge bases such as DBpedia or YAGO with new entities linked to their classes. The result of the process is a shared RDF knowledge graph called CaLiGraph and the addition of 700K entities, 7.5M type statements and 3.8M additional facts to DBpedia.

Two kinds of list pages are exploited: pages with vertical enumerations and pages with tables. The authors propose a machine learning process in two stages: they first generate training data that provides positive examples to a distant supervision algorithm, and then represent this data as features to train a classifier that learns to identify list items and their types (called subject entities in the paper).

The strength of the approach is the way the training data is collected: a taxonomy of concepts is built by combining Wikipedia categories, DBpedia types and the Wikipedia list graph (list categories). This process, called Cat2Tax, has been presented in a previous paper. Given this taxonomy, the goal is to link each entity identified on a list page to the right subject type, or to decide that this entity is itself a subject type. The paper explains very clearly the complementarity of the three sources of taxonomic relations, and the way they are cleaned and combined to obtain a high-quality taxonomy. The lexical structure of the nodes in this graph is used to decide whether nodes and hypernym relations are meaningful or should be eliminated from the taxonomy. This resource is then used to label entity mentions in Wikipedia list pages.
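To make the distant-supervision idea concrete, here is a minimal sketch (not the authors' code; the taxonomy, entity names and function names are hypothetical) of how a type found for one known list item, together with its ancestor types, can be propagated to every item of the same list:

```python
# Hypothetical taxonomy as child -> parent edges (roots map to None).
TAXONOMY = {
    "GrandPrix": "SportsEvent",
    "SportsEvent": "Event",
    "Event": None,
}

def ancestors(node, taxonomy):
    """Return the node and all of its ancestor types, bottom-up."""
    chain = []
    while node is not None:
        chain.append(node)
        node = taxonomy.get(node)
    return chain

def label_list_items(items, known_types, taxonomy):
    """Distant supervision: labels found for any known item are
    shared by all items appearing on the same list page."""
    labels = set()
    for item in items:
        if item in known_types:
            labels.update(ancestors(known_types[item], taxonomy))
    return {item: set(labels) for item in items}

# One item is already typed in the knowledge base; its labels
# propagate to the untyped sibling item on the same list.
known = {"1950 Monaco Grand Prix": "GrandPrix"}
page_items = ["1950 Monaco Grand Prix", "1950 Swiss Grand Prix"]
labeled = label_list_items(page_items, known, TAXONOMY)
```

This is only the propagation step; the paper additionally filters these noisy labels by training a classifier on features of the list items.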
If an entity in a list is found in the taxonomy, its type and all its ancestor nodes in the graph are used to label this entity, but also all the entities at the same level in the list page. A balanced set of positive and negative examples is built to train the classifier. Each example is represented with a set of features, some of which are specific to lists and others to tables. After generating the features, seven classification algorithms are compared, with the highest scores obtained by random forest and XGBoost; XGBoost is selected as it reaches higher precision.

Results are very promising, and are analysed in terms of the distribution of entities added to DBpedia (a majority of places and species), the role of the features (page features are the most influential), and the number of type statements added per type.

This work is very clearly presented. The process of crossing various sources to build a taxonomy and then learning to identify and type entities in tables and lists yields results of high quality. The authors evaluated the correctness and precision of their results, discuss them with acute analyses, and identify possible improvements such as taking layout features into account and including an entity disambiguation stage when linking entities to their mentions on the pages.

Assertions to be clarified:
- In step 4, why is the DBpedia taxonomy considered as the reference? Is it better than YAGO's taxonomy, which is said to be checked more closely than DBpedia's? Is it because disjointness axioms are available with DBpedia and not with YAGO?
- About the entity facts identified in section 5: for a reader not familiar with Cat2Tax, it is not clear that this algorithm generates relation axioms when building the taxonomy. Maybe you should add an example of a relation axiom to your example in step 1, when explaining the role of Cat2Tax.

____ after reading the authors' answer to our comments _____

I appreciate the answers to the reviewers' questions and requests.
I wish all of this to be included in the final version.
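As an aside, the model-selection criterion mentioned above (keeping the classifier with the highest precision) can be sketched as follows; the confusion counts here are made-up illustrative numbers, not the paper's results:

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 when the model predicts no positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical (true positive, false positive) counts for two of the
# compared classifiers; XGBoost trades a little recall for fewer
# false positives, so it wins on precision.
scores = {
    "random_forest": (900, 120),
    "xgboost": (880, 80),
}

best = max(scores, key=lambda model: precision(*scores[model]))
```

Optimizing for precision is a sensible choice here, since wrong type statements are more harmful to a knowledge base than missed ones.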
