International Semantic Web Summer School in Bertinoro

This summer I had a very nice opportunity to attend the International Semantic Web Summer School (ISWS). The school lasted one week and was mainly organized for PhD students, young professionals and graduates who are interested in topics like the Semantic Web and surrounding technologies (e.g. Natural Language Processing, Machine Learning, Ontology Design, Blockchain). My master thesis topic was 'Usage of Linked Open Data in Content-Based Recommender Systems for Real World E-Commerce', which I wrote under the supervision of Prof. Harald Sack (Karlsruhe University), one of the directors of the summer school. Thus, I decided to take the chance to refresh my knowledge of semantic topics, to network with researchers from the field, and at the same time to represent Springer Nature to this audience. The outcome was a very special experience, a lot of lessons learned, great socializing and two awards: for the best research paper and for the best presentation.



Prehistory


This April there was an open Semantic Web SN Hack Day with the topic “Analytics and metrics to measure the impact of science”, which was kindly organized by our SciGraph team (Arian Celina, Markus Kaindl, Sebastian Bock and Michele Pasin). I had a chance to attend this event and learned more about SN SciGraph and our semantic tools and infrastructure. The project idea of my team was to build a Google Chrome plugin which would enrich a SpringerLink article page with semantic context:
  • the article description is analyzed and semantic entities are highlighted; on mouseover, a Wikipedia abstract appears
  • enrich the article with SciGraph information (a link to the SciGraph connections graph)
  • enrich the article with Unsilo information about related articles
  • enrich the article with Dimensions information about related categories




Our team was lucky to win the prize for the most innovative project idea.

Since I'm responsible for the springernature.com website, I used some of our 10% time hack days to implement similar functionality for the www.springernature.com blog pages. The resulting prototype is able to extract semantic entities and to retrieve semantic categories for them. As a next step, I need to identify the most relevant categories, which can then be used for automatic tagging.
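As an illustration of the category-retrieval step, here is a minimal sketch that looks up the Wikipedia categories of one extracted entity via the public DBpedia SPARQL endpoint; the example entity and the SPARQLWrapper-based client are assumptions for the sketch, not the prototype's actual code.

```python
# Sketch: retrieve DBpedia categories for an extracted entity, which could
# then be ranked and reused as automatic tags for a blog post.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?category WHERE {
        <http://dbpedia.org/resource/Machine_learning> dct:subject ?category .
    }
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["category"]["value"])
# e.g. http://dbpedia.org/resource/Category:Machine_learning
```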


Preparations for the Summer School


Every participant had to prepare an A1 poster about their research or work topic. My poster was based on my master thesis and was titled 'Semantic Recommender System for Scientific Publishing'; in it I described my ideas for an approach to semantic content-based recommendations for Springer Nature publications and books. Many thanks to Markus Kaindl and Michele Pasin for the review and very valuable improvements.
To my knowledge, we currently use collaborative filtering (the user who bought item A also bought item B) for our recommendations. But in the scientific domain it is not always feasible to recommend to a researcher the items purchased by another researcher, since every researcher has their own unique area of interest. My idea was to use semantic connections between products in order to calculate how similar or related the products are to each other and to give recommendations based on that. As data sources I suggested using DBpedia (the semantic, machine-interpretable version of Wikipedia), SN SciGraph and third-party data sources (e.g. Dimensions, Unsilo). DBpedia already had a project with the SN SciGraph team to interlink these datasets (see the interlinking of Springer Nature’s SciGraph and DBpedia datasets).
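To make the idea a bit more concrete, here is a minimal sketch of how the relatedness of two publications could be scored from the DBpedia entities linked to them; the entity sets and the plain Jaccard overlap are illustrative assumptions, not the concrete method proposed in the thesis.

```python
# Sketch: content-based relatedness of two publications from the overlap of
# the DBpedia entities linked to them (e.g. via SN SciGraph annotations).

def jaccard_similarity(entities_a, entities_b):
    a, b = set(entities_a), set(entities_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

book_a = {"dbr:Machine_learning", "dbr:Recommender_system", "dbr:Linked_data"}
book_b = {"dbr:Linked_data", "dbr:Ontology_(information_science)", "dbr:Recommender_system"}

# 0.5 -> book_b is a reasonable recommendation for readers of book_a
print(jaccard_similarity(book_a, book_b))
```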
This approach has several advantages compared to collaborative filtering:
  • the cold-start problem can be avoided: which items should we recommend to a new user who hasn't purchased anything yet? How can a new item be recommended if nobody has purchased it?
  • the discoverability of our products can be improved, since even rarely purchased items will be recommended if they are related to the topic
  • the user experience could be improved, since the provided recommendations are better in terms of novelty and serendipity (the occurrence and development of events by chance in a happy or beneficial way)
  • the reasons for a particular recommendation can be provided



International Semantic Web Summer School


The school was held from 1 to 7 July at the University Residential Centre in Bertinoro, a small Italian town. The residential centre is a beautiful, 1000-year-old castle, which is used for conferences and training courses. I was one of 60 selected participants and had the honor to represent Springer Nature. Some statistics about the participants can be seen in the image below (37% were female, 73% PhD students, mostly from European universities (France, Italy, Germany)). But you could meet people with backgrounds from any corner of the world.







The teaching was done in the form of tutorials, keynotes and so-called in-depth pills. Furthermore, there was a bigger group project, in which each team worked on a research topic with a paper and a presentation as outcomes. Overall, the summer school aimed to model a researcher’s life in one week, including teamwork, socializing and deadlines!


Talks


Keynotes were given by Enrico Motta about Data Analytics, by Marta Sabou about Rigour and Relevance with Design Science, and by Sebastian Rudolph about a Logician’s View of the Semantic Web. The keynotes covered higher-level concepts and lessons learned; they were very inspiring and gave a good perspective on a researcher’s life.

The tutorials by Maria-Esther Vidal and Sebastian Rudolph covered the basics of Reasoning and SPARQL query execution, Claudia d’Amato and Michael Cochez talked about Machine Learning, and John Domingue talked about Blockchain and decentralization.

As a third kind of talk, the summer school offered so-called in-depth pills: Aldo Gangemi talked about ontology design patterns, Marieke van Erp about Natural Language Processing, and Marta Sabou about crowdsourcing. Overall, each talk gave a lot of insights, and the numerous questions from the audience were answered.



[Thanks to Sven Lieber for the nice summary in his blog.]

Research Task Forces


All students had to work on the same overarching topic, "Validity in Linked Open Data". The participants were divided into teams, so-called task forces, of 6 students and a tutor. Every team got a specific topic about data validation and a cool team name. The supporting tutor was in most cases a researcher or a professor teaching the courses in that particular area. The outcome of each task force was to be a research paper and a presentation. After the school, the organizers plan to combine all papers and publish one big paper about "Linked Data validity" with 75 co-authors.

As I mentioned above, each team got a cool team name. To assign the names, the organizers used a sorting hat like in the Harry Potter movies; they used the Harry Potter metaphor quite often, since it fit the environment very well. They attached a loudspeaker to the hat and used speech synthesis with funny jokes to assign a name to each team. In the picture below you can see the tutors with the sorting hat on their heads.



The presented topics were:
  1. Linked Data Validity in a Decentralized Disintermediated World using Blockchains (Tutor: Prof. John Domingue) (Team: Hufflepuff, my task force)
  2. What is a definition of validity? Does it apply to a single statement (e.g. a triple) in LOD or to a collection of statements? Would it be a general definition, a context/domain-dependent definition, or both? (Tutor: Prof. Claudia d’Amato) (Team: Mordor)
  3. Can we find out whether data is valid or invalid by only looking at the graph itself, e.g. using machine learning to detect anomalies? Anomalies in subgraphs (Tutor: Dr. Michael Cochez)
  4. When you go from text to structured data, how do you assess the validity of a piece of information?
    How do you cope with imperfect systems that extract information from text into structured formats?
    How do you deal with contradicting or incomplete information? (Tutor: Dr. Marieke van Erp) (Team: The 42's)
  5. What are exemplary use cases for LOD validity?
    How do we establish validity metrics that are sensitive both to structure (internal) and to tasks, existing knowledge and sustainability (external)? What are the patterns to check for validity? (Tutor: Dr. Aldo Gangemi) (Team: Gryffindor)
  6. Can LOD validity be established using common sense? What is common sense in terms of linked open data? (Tutor: Dr. Valentina Presutti) (Team: The Delorians)
  7. How can logical validity be defined using mathematical calculus? Can there be different degrees of logical validity? (Tutor: Prof. Sebastian Rudolph) (Team: Dragons)
  8. Context of validity. Will the validity stay the same? Will it evolve over time? (Tutor: Prof. Harald Sack) (Team: Ravenclaw)
  9. How to express the degree of validity of a dataset? (Tutor: Prof. Ruben Verborgh) (Team: Hobbits)
  10. Completeness of linked data in case of federated queries. Completeness models for RDF. Federated query engines. (Tutor: Prof. Maria-Esther Vidal) (Team: Jedis)
I was part of the first team, and we investigated how validity can be ensured in a distributed environment.

Each task force had to prepare an 8-page research paper, a 10-minute presentation and a 1-minute funny video. At the end there was an award session for the best work. My team (Hufflepuff) was awarded for the best research paper and the best presentation.

Task Force: Validation of Linked Data in a Distributed Environment using Blockchain





The research goal of my team was to investigate how linked data can be validated in a distributed non-centralized environment using blockchains. 

The current state of the World Wide Web is exposed to several issues caused by the over-centralisation of data: too few organisations wield too much power through their control of often private data. The Facebook and Cambridge Analytica scandal (Rosenberg et al., 2018) is a recent example. Such over-centralisation means that users cannot control who accesses their data and how it is used.

Our task force proposed an approach for how personal data can be validated in a blockchain infrastructure.
The study addressed the following questions:
  • How does the concept of validity change in the context of a decentralized web?
  • What does a decentralized approach to data validation look like?
  • What benefits would accrue from a decentralized technology that supports validation in the context of LD?
A blockchain is a secure distributed environment where the data is shared between peers. The main idea is that each peer holds the complete data, but only authorized peers can access, view and write to it. Therefore, no centralized authority holds all the data, and only the minimal amount of information explicitly required for the particular use case is shared.
Consider the following use case:
A user wants to prove that the information provided in their CV is valid, while sharing the minimal possible amount of information. What could this workflow look like in a distributed environment without a central organization, and how can linked data help in this process?
The suggested workflow consists of several steps (a small code sketch of the triple-building and hashing steps follows the list):

  1. The user makes an HTTP request to upload a document (e.g. a CV) to the blockchain-based application.
  2. The system extracts the relevant sentences (e.g. "John Doe studied at Stanford University").
  3. Using natural language processing and named entity recognition techniques, semantic RDF triples are built (e.g. person:John_Doe ex:studied_at organization:Stanford_University).
  4. This information is enriched with other linked data sources using semantic inference. For example, we can get the research topics of the university, its authority ranking, the responsible person, contact details or any other available and relevant information.
  5. This information is then sent to the blockchain infrastructure, where
    1. the document itself and the extracted semantic information are saved in IPFS (a distributed file system used with blockchains), and
    2. hash values are derived from this information and used as indexes to find and piece together the stored information.
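
To make steps 3 and 5 a bit more tangible, here is a minimal sketch assuming rdflib for the triples and a plain SHA-256 digest standing in for the IPFS content address; the namespaces are illustrative placeholders, not the vocabularies used in the actual task-force prototype.

```python
# Sketch of steps 3 and 5: turn an extracted statement into RDF triples and
# derive a content hash that could serve as the lookup index in IPFS.
import hashlib
from rdflib import Graph, Namespace

# Illustrative namespaces (placeholders, not real vocabularies)
PERSON = Namespace("http://example.org/person/")
ORG = Namespace("http://example.org/organization/")
EX = Namespace("http://example.org/vocab/")

g = Graph()
# "John Doe studied at Stanford University" as an RDF triple
g.add((PERSON.John_Doe, EX.studied_at, ORG.Stanford_University))

# Serialize the (possibly enriched) graph and hash it; the hash plays the
# role of the content address under which the data is stored and found.
data = g.serialize(format="nt")
if isinstance(data, bytes):  # older rdflib versions return bytes
    data = data.decode("utf-8")
content_hash = hashlib.sha256(data.encode("utf-8")).hexdigest()
print(content_hash)
```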

Using smart contracts, the system informs the corresponding trusted authorities about the validation request. The authority accepts or rejects the semantic information and signs it with its private key.
When somebody wants to verify that the provided information is valid, the document is uploaded to the system, and the system retrieves the document together with the corresponding signatures and therefore the verifications.
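The accept-and-sign step and the later verification could look roughly like the following sketch. It uses an Ed25519 key pair from the Python cryptography package as a stand-in for the authority's key; in the actual task-force design the signing is tied to Open Blockchain smart contracts rather than this particular library.

```python
# Sketch of the authority's accept/sign step and a later verification.
# The Ed25519 key pair stands in for the authority's blockchain identity.
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.exceptions import InvalidSignature

authority_key = ed25519.Ed25519PrivateKey.generate()

# The authority accepts the extracted claim and signs it with its private key.
claim = b"person:John_Doe ex:studied_at organization:Stanford_University"
signature = authority_key.sign(claim)

# Anyone holding the authority's public key can later verify the claim.
public_key = authority_key.public_key()
try:
    public_key.verify(signature, claim)
    print("claim confirmed by the authority")
except InvalidSignature:
    print("signature invalid - claim not confirmed")
```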

With this approach, only the explicitly necessary information is shared. The validation process is simple and doesn’t require much effort from the trusted authority. The information is connected to the linked data cloud and can be enriched if required. The system could work globally without any organization in between.
This approach can be used for different use cases. Some of our suggestions:

  1. validation of biographical information (e.g. a CV) by the corresponding trusted authorities
  2. validation of government-provided data by the responsible organizations
  3. validation of personal data (e.g. for dating purposes) by a large number of potentially untrusted peers

In the first and the second use case we have trusted authorities, therefore a weak validation model is sufficient (if the authority has validated the information, it is considered correct). In contrast, in the third use case we have a large number of potentially untrusted peers, so a more complex, strong validation model is required: each peer can accept or reject the validation, and the final decision is taken by a majority vote over a particular threshold.
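As a rough illustration of the strong model (the threshold value below is an arbitrary example, not a number fixed in our paper):

```python
# Sketch of the "strong" validation model: a claim is accepted only if the
# share of approving peers reaches a chosen threshold.

def is_valid(votes, threshold=0.66):
    """votes: one accept (True) / reject (False) decision per peer."""
    if not votes:
        return False
    return sum(votes) / len(votes) >= threshold

peer_votes = [True, True, False, True, True]  # 4 of 5 peers accept
print(is_valid(peer_votes))  # True, since 0.8 >= 0.66
```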

As part of our research, our team implemented a simple proof of concept in which we combined a linked open data environment (in our case DBpedia Spotlight) with the Open Blockchain provided by the Open University.
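The entity-extraction half of such a proof of concept can be reproduced in a few lines against DBpedia Spotlight's public REST endpoint. This is only a sketch of that half (the confidence value is an arbitrary example); the interaction with Open Blockchain is not shown.

```python
# Sketch: annotate a sentence with DBpedia Spotlight's public REST API and
# print the linked DBpedia entities. Error handling is omitted for brevity.
import requests

response = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "John Doe studied at Stanford University", "confidence": 0.5},
    headers={"Accept": "application/json"},
)
for resource in response.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])
# e.g. Stanford University -> http://dbpedia.org/resource/Stanford_University
```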

If you are interested in this topic, you can read our scientific report attached below or ask me for more details.

Summary and Conclusion

Overall, I had a great experience during the summer school. I can recommend it to everybody: the last time I had such an intensive learning experience was during my master's studies. In a very short time you can learn a lot, together with and from other people, and not only hard skills but also very valuable soft skills.

I think this format is more effective than a normal conference because you have to participate actively instead of just listening to talks. The communication with colleagues from the community is also more intensive, because you don't only meet them at the conference but also spend more time together and have a common goal to reach. I had a great chance to network with top researchers in their particular areas in a very informal atmosphere, and hopefully there will be some interesting future projects for Springer Nature as well.

If you are interested in specific materials such as other research posters, tutorial materials or the outcome papers, just contact me.
