
D6.2

Appendix B: Social media questionnaire


This appendix contains the results of the social media questionnaire from December 2008.


To investigate to what extent arts students make use of social networking and knowledge sharing applications, 41 arts students were asked to fill in a questionnaire. The results are summarized in this appendix.

1.1.1. Participants

Date: December 2008
Participants: 41 (32 female, 9 male)
Average age: 21.8 (female: 21.3, male: 23.2)

1.1.2. Questionnaire

The questionnaire starts with a number of general questions about the frequency of use of some forms of online communication. The second part contains a list of social software, divided into categories such as social networks, bookmarking, media sharing, etc., where the participants had to indicate whether they know and use the software. The third part asks for which purposes the students use social sites, and the last section focuses on the use of social software for studying and learning.

1.1.3. Results

1.1.3.1. Frequency of use of online communication

[Bar chart: Online communication. Percentage of students reporting how often they use instant messaging, voice & chat tools, forums and mailing lists (reading), forums and mailing lists (contributing), updating their own blog, reading visitors' comments on their blog, reading other blogs, and writing comments on other blogs; answer categories: daily, several times per week, several times per month, several times per year, never.]

1.1.3.2. Use of social software

Social networks

[Bar chart: Social networks. Percentage of students answering "I contribute", "I use it", "I know it, but don't use it" or "I don't know it" for Hyves, Facebook, Myspace, LinkedIn, Twitter, hi5, Netlog, Orkut, Friendster, XING, Skyrock and Tagged.]
Social Bookmarking & Social Tagging

For three of the included social bookmarking and tagging sites (Diigo, CiteULike and Connotea), none of the participants had ever heard of them. For the others, the results are shown in the bar chart below.

[Bar chart: Social bookmarking and social tagging. Percentage of students answering "I contribute", "I use it", "I know it, but don't use it" or "I don't know it" for Faves, StumbleUpon, Delicious, Technorati and Digg.]

Media sharing

The media sharing site Ipernity was unknown to all of the participants. For the others, the results are shown in the bar chart below.

[Bar chart: Media sharing. Percentage of students answering "I contribute", "I use it", "I know it, but don't use it" or "I don't know it" for YouTube, Picasa, Flickr, Photobucket, WebShots, Imeem and Dailymotion.]

Social Cataloging and Rating

The sites weRead, MobyGames and Discogs were unknown to all participants and are therefore not included in the bar chart.

[Bar chart: Social cataloging and rating. Percentage of students answering "I contribute", "I use it", "I know it, but don't use it" or "I don't know it" for IMDb, Last.fm, Shelfari, Flixster, LibraryThing and Goodreads.]

Questions & Answers websites

[Bar chart: Question & Answer websites. Percentage of students answering "I contribute", "I use it", "I know it, but don't use it" or "I don't know it" for Yahoo! Answers, WikiAnswers and AnswerBag.]

1.1.3.3. Reasons to use social software

[Bar chart: Reasons for using social sites. Percentage of students rating each reason as "important reason", "nice possibility, not main reason" or "not a reason": keeping in touch, finding information (general), information about contacts, sharing interests, exchanging files, finding information (for study), making appointments, contacts of my contacts, meeting new people, advertising expertise, and material used by contacts.]

1.1.3.4. Use of social software for learning purposes

[Bar chart: Studying for courses or exams. Percentage of students answering "I use it" or "I do not use it" for Google, learning environments, Wikipedia, sites to find a specific kind of knowledge, other search engines, other websites and social sites.]

[Bar chart: Self-guided learning. Percentage of students answering "I use it" or "I do not use it" for Google, asking other people, Wikipedia, sites to find a specific kind of knowledge, social sites, other search engines, learning environments and other websites.]

D6.2

Appendix C: WP6 Publications


The following WP6 publications are contained in this appendix:
o Monachesi, P., Markus, T., Osenova, P., Posea, V., Simov, K., Trausan-Matu, S. (2009). Supporting knowledge discovery in an eLearning environment having social components. In Proceedings of the International Conference on Engineering Education, Instructional Technology, Assessment, and E-learning (CISSE-EIAE 09). Springer.
o Osenova, P., Laskova, L. and Simov, K. (submitted). Exploring Co-Reference Chains for Concept Annotation of Domain Texts. Submitted to LREC 2010.
o Monachesi, P. and Markus, T. (submitted). Socially driven ontology enrichment for eLearning. Submitted to LREC 2010.
o Posea, V. and Trausan-Matu, S. (2009). Bridging Ontologies and Folksonomies using DBpedia. In Proceedings of the 17th International Conference on Control Systems and Computer Science. Bucharest, Romania.


Supporting knowledge discovery in an eLearning environment having social components


Paola Monachesi and Thomas Markus
Utrecht University, The Netherlands

Vlad Posea and Stefan Trausan-Matu
Politehnica University of Bucharest, Romania

Petya Osenova and Kiril Simov
Bulgarian Academy of Sciences, Sofia, Bulgaria

Abstract One of the goals of the Language Technology for LifeLong Learning project is the creation of an appropriate methodology to support both formal and informal learning. Services are being developed that are based on the interaction between a formal representation of (domain) knowledge in the form of an ontology created by experts and a social component which complements it, that is tags and social networks. It is expected that this combination will improve learner interaction, knowledge discovery as well as knowledge co-construction.

I. INTRODUCTION

In a Lifelong Learning context, learners access and process information in an autonomous way. They might rely on formal learning, that is they might focus on textual material approved by content providers in the context of a course developed by an organization or institution. They might also want to rely on informal learning, that is on (non-)textual content available through the web which is uploaded and accepted by the community of learners and not necessarily by a content provider of an institution. One of the objectives of the Language Technologies for LifeLong Learning (LTfLL)1 project is to develop services that facilitate learners and tutors in accessing formal and informal knowledge sources in the context of a learning task. More specifically, the aim is to support learners interaction in order to facilitate knowledge discovery and knowledge co-creation. To this end, a Common Semantic Framework (CSF) is being developed. The CSF allows the stakeholders to identify, retrieve and exchange the learning material. More specifically, it supports formal learning by addressing course material which includes textbooks, articles, slides as well as informal learning which we identify with (non-)textual material emerging from social media applications. Within the CSF, communication is facilitated through the use of social networks and new communities of learners are established through the recommendations provided by the system. In order to provide recommendations, the users profile, his interests, his preferences, his network and obviously the learning task are taken into account. It is through the access to formal and informal material that new knowledge will originate. Knowledge repositories are employed to achieve the above goal. An appropriate way to retrieve formal content might be by means of an ontology which can support the learner in the learning path, facilitate (multilingual) retrieval and reuse of content as well as mediate access to various sources of
1

knowledge, as concluded in [1]. In our specific case we use an ontology in the domain of computing, which is employed to provide deep annotation of the learning materials that should facilitate their understanding and reuse. However, the ultimate goal of the LTfLL project is to complement the formal knowledge represented by ontologies developed by domain experts with the informal knowledge emerging from social tagging in order to provide more flexible and personalized ontologies that will include also the knowledge of communities of users. The enhanced ontology will be connected to the social network of the learners through the tags that they provide, improving thus the possibility of retrieving appropriate material and allowing learners to connect to other people who can have the function of learning mates and/or tutors. In this way, it will be possible to provide a more personalized learning experience able to fulfill the needs of different types of learners. The paper is organized as follows. In section II, we discuss the differences between formal and informal learning. Section III presents the design of the Common Semantic Framework, while section IV focuses on its implementation. In section V, we present an initial evaluation of the system while VI introduces some concluding remarks. II. FORMAL VS. INFORMAL LEARNING

http://www.ltfll-project.org/

As already pointed out above, in the LTfLL project, we are developing services that facilitate learners and tutors in accessing knowledge sources within the context of a certain learning task. The learning process is considered an attempt to make the implicit knowledge which is encoded in the learning materials, explicit. In the LTfLL project, we aim at supporting both types of learning - formal and informal one. Under formal learning we understand the process, in which the learners follow a specific curriculum. For each topic in the curriculum, there is a set of learning materials, provided by the tutor. In contrast to this, the curriculum in the informal learning process is not obligatory, and the role of the tutor either is absent or is not obvious. The learners exploit different sources to succeed in their learning goal, very often relying on the information, available on the web, and on the social knowledge sharing networks. From the perspective of formal learning (i.e. guided, monitored one), learners rely mainly on material which has been prepared by the tutor on the specific topic and is addressed by the curriculum. Learners are expected to access certain pieces of pre-stored information, which would lead

them to the required knowledge. In the informal learning process, learners have to locate knowledge sources on the web, then to select the relevant ones, and finally to investigate each of them. During this process, learners often need guidance and thus they might profit from finding appropriate people (peers) on the web. The two learning paradigms are complementary from the lifelong learners perspective. In the LTfLL project, we are developing two sets of services that would support both - the formal style of learning and the informal one. These services are integrated within the Common Semantic Framework. III.

COMMON SEMANTIC FRAMEWORK: DESIGN

One of the aims of the LTfLL project is to build an infrastructure for knowledge sharing which we call Common Semantic Framework (CSF). It allows for identification, retrieval, exchange and recommendation of relevant learning objects (LOs) and of peers. It is ontology driven allowing thus for a formalization of the knowledge arising from the various stages of the learning life-cycle. Domain ontologies offer useful support in a learning path since they provide a formalization of the knowledge of a domain approved by an expert. In addition, ontologies can support formal learning through the annotation of the learning objects [1]. More specifically, with respect to formal learning, we are developing a Formal Learning Support System (FLSS) that supports the individual work of the learners and tutors in manipulating the learning objects, including the knowledge which is implicitly encoded in them and in adding their own descriptions to these learning materials. In particular, an annotation service has been implemented which is meant as a way to make the implicit knowledge attested in learning objects, explicit. Thus, through the annotation service, we aim at relating the knowledge, encoded in the text, to the formally represented knowledge in a domain ontology. Despite the possible limitations of the ontologies, they provide a highly structured way of interlinking different knowledge sources. From the point of view of the learning process itself, the domain ontology can be viewed as a way to produce a concise representation of the most important concepts in a given domain. Consequently, the learner has to interact with them in order to assess his current level of familiarity with the domain or to use them to acquire new knowledge. While in formal learning, courses and learning objects play a relevant role, in informal learning, communities arising from social media applications are becoming more and more dominant, but are poorly integrated in current learning practices. Learners prefer to use the internet to search for an answer rather than asking a tutor or a colleague [3]. Most of the documentation for a given learning task is found on the internet and most learners do not check if the information is accurate or reliable [4]. Recently, the usage of the internet has shifted towards social networking applications. 1 billion movies are seen everyday on YouTube.com2 and 150 million users are
2

logging each day to Facebook.com.3 The communities that emerge through social media applications such as Delicious, Flickr or YouTube provide two important elements that can be integrated in the Common Semantic Framework, in order to support informal learning. They provide the knowledge of the masses in the form of tags that are being used to annotate resources. On the other hand, the structure of these communities which can be represented through a social network can be employed to recommend content and peers. The social networks automatically introduce a notion of trust in the search results in relation to the user' s social network. The knowledge produced by communities in the form of tags can be employed to enrich existing domain ontologies semiautomatically. More specifically, we have merged the dynamic knowledge provided by users/learners through tagging with the formal knowledge provided by the ontologies by adding tags/concepts (or instances of concepts) and relationships between concepts in the domain ontology. In the CSF, the connection between words/tags and concepts is established by means of language-specific lexicons, where each lexicon specifies one or more lexicalizations in one language for each concept. DBpedia [5] is used as a knowledge base to resolve the extracted lexicalization to unambiguous existing concepts. DBpedia doesnt provide much information about relations among concepts when compared to specially crafted domain ontologies, but compensates this shortcoming with the huge number of available concepts. There is an important side effect of this ontology enrichment process: if tags given by learners or emerging from social media applications are related to the concepts present in the ontology, we manage to include not only the expert view of the domain, but also the learners perspective. In fact, domain ontologies developed by knowledge engineers might be too static, incomplete or might provide a formalization that does not correspond to the representation of the domain knowledge available to the learner which might be more easily expressed by the tagging emerging from communities of peers via available social media applications. It is important to help the learner manage the content that is created daily around him in his learning communities. The knowledge produced inside the learners network is very valuable to him. According to the Communities of Practice (CoP) theory [6], a learner acquires knowledge, as he moves from the periphery to the center of a CoP. In order to facilitate the informal learning experience, we automatically monitor the learners activities and his peers on the social network applications. The data that the learner and his network create or annotate is indexed in a semantic repository. This data can then be used to offer learning support which exceeds that which is currently offered through keyword search or simple recommendations. In this way, the CSF supports self-organization and the emergence of collaborative knowledge and classification. In addition, we aim at connecting learners to other learners in an
3

http://youtube global.blogspot.com/2009/10/y000000000utube.html

http://www.facebook.com/press/info.php?statistics

appropriate way. To this end, the content the learner is searching and selecting can be used as a trigger to get him in touch with other users that employ similar content or annotations. Furthermore, the system monitors the changes that appear in his network with respect to content and to users over time and will provide recommendations for how he could update his network (by adding new peers or removing old ones). Managing the social network is especially important for novices that need to create their own CoP with people with a similar interest. Alternatively, if a learner is already part of a CoP, he needs to keep up with the changes or new information in his domain(s) of interest. Most of the learner' s attention will probably go to content produced by peers that are highly relevant and trustworthy. Therefore, it is important that the system maximally exploits these relations. To summarize: In the CSF we establish an obvious link between the network of users, tagging and the resources (cf. also [7]). We further extend this theoretical model by automatically linking and enriching existing domain ontologies in order to structure the heterogeneous information present in social networks, (deeply) annotated resources, and concepts. The knowledge rich ontologies integrated with tagging provide compelling advantages for a wide range of tasks such as recommendation, community building and discovery learning. IV. COMMON SEMANTIC FRAMEWORK: IMPLEMENTATION
Figure 1. The Common Semantic Framework implementation

The Common Semantic Framework implementation is based on a Service Oriented Architecture (SOA) that represents the backbone of the system. It includes five components: data aggregation, ontology enrichment, multimedia document annotation, search and visualization. The services that link these components have a strong emphasis on semantics by employing shared URI' s for all the various entities such as concepts, resources, users and annotations. Furthermore the information exchanged through the Web services carries semantic meta-data, as most of the data used in the CSF is stored in RDF and OWL. As can be seen in figure 1, the communication between the modules is implemented using web services and the core of the CSF is represented by the semantic repository. The repository adopts RDF/OWL semantic vocabularies like SIOC, FOAF, Dublin Core and SKOS to represent information extracted from the social networking applications and employed by all the modules. The other modules, that is ontology enrichment and search, use web services to extract information from the repository in order to provide the required information. The final results are converted to data that can be visualized through a widget inside a learning or social networking platform (e.g. Moodle or Elgg). Here, the various components are described in more detail. Data Aggregation The data aggregation module consists of a crawler and the services that link it to the semantic repository. The crawler uses APIs provided by an increasing number of social networking applications to get information about users, resources and tags.
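To make the aggregation step concrete, here is a minimal Python sketch (using rdflib) of how crawled bookmark records might be turned into RDF triples for the semantic repository. The record layout, the example URIs and the tagging property choices are illustrative assumptions rather than the project's actual crawler code.

# Minimal sketch: turning crawled (user, resource, tags) records into RDF
# triples for a semantic repository. Everything beyond the standard SIOC,
# FOAF and Dublin Core terms (record layout, example URIs, tag namespace)
# is an assumption for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF

SIOC = Namespace("http://rdfs.org/sioc/ns#")   # SIOC core ontology
TAGS = Namespace("http://example.org/tags/")   # hypothetical tag namespace

# Hypothetical records as a crawler might return them after calling a
# social bookmarking API (user name, bookmarked URL, list of tags).
EXAMPLE_RECORDS = [
    {"user": "alice", "resource": "http://example.org/xml-tutorial",
     "tags": ["xml", "markup", "tutorial"]},
    {"user": "bob", "resource": "http://example.org/xml-tutorial",
     "tags": ["xml", "webdesign"]},
]

def records_to_graph(records):
    g = Graph()
    g.bind("sioc", SIOC)
    g.bind("foaf", FOAF)
    g.bind("dcterms", DCTERMS)
    for rec in records:
        user = URIRef(f"http://example.org/user/{rec['user']}")
        resource = URIRef(rec["resource"])
        g.add((user, RDF.type, SIOC.UserAccount))
        g.add((user, FOAF.name, Literal(rec["user"])))
        g.add((resource, RDF.type, SIOC.Item))
        g.add((resource, SIOC.has_creator, user))   # user who posted the bookmark
        for tag in rec["tags"]:
            # Tags are stored both as plain literals and as shared tag URIs so
            # that identical tags from different users end up on the same node.
            g.add((resource, DCTERMS.subject, Literal(tag)))
            g.add((resource, DCTERMS.subject, TAGS[tag]))
    return g

if __name__ == "__main__":
    print(records_to_graph(EXAMPLE_RECORDS).serialize(format="turtle"))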

The crawler extracts information from a users social network such as indexing bookmarks that the user is posting on Delicious, videos posted on YouTube, slides posted on Slideshare together with the tags used to classify the resources and information about the social connections developed inside these web sites. The data is converted into semantic information using ontologies like SIOC, SCOT, FOAF. The data extracted by the crawler can be interpreted as a folksonomy, which is a hypergraph describing the information about users, resources and tags as specified in [7]. The storage of the folksonomy in a repository allows the system to provide various search and recommendation services. These services can provide references to users, resources, tags, concepts and annotations that the learner can use in his learning activities. The advantage of using a semantic repository and commonly used vocabularies for the purpose of data integration is that we obtain a highly integrated dataset containing semantically compatible data from different internet sources. These data can be used to enrich an existing domain ontology for providing improved search services that can exploit the structured knowledge formalized within an ontology as well as the knowledge coming from the participating communities. It should be noted that the repository also contains a collection of learning material collected from the web which has been semantically annotated. Ontology enrichment with social tags The purpose of this module is to benefit from the information extracted from social media applications for the enrichment of existing domain ontologies. To this end, we determine which concepts and relations are the best candidates to be added to the given domain ontologies. In the CSF we take, as starting point, the ontology on computing that was

developed in the "Language Technology for eLearning" project (cf. [1]). It contains 1002 domain concepts, 169 concepts from OntoWordNet and 105 concepts from DOLCE Ultralite. Candidates are generated by applying similarity measures or other sorting mechanisms to derive socially relevant tags. Implemented examples include coocurrence-based and cosinebased measures [9]. This analysis drives lookups and selective integration from DBpedia for the domain ontology enrichment. The various web services provide additional lexicon entries, socially relevant specific classes linked to more abstract classes already present in the ontology and additional relationships discovered between already existing domain concepts. The system provides an automatically enriched ontology which contains the vocabulary of the Community of Practice that the user is part of. More specifically, the resulting ontology integrates the socially relevant concepts, but with the structure of an expert view domain ontology. Extraction methods exclusively focused on deriving ontology-like structures from folksonomies cannot provide such a high quality of results due to unavailability of the implicit knowledge in folksonomies which has been made explicit in domain ontologies. Multimedia Document Annotation The aim of the multimedia document annotation is to annotate explicitly learning material with concepts from the domain ontology previously mentioned. The purpose of this annotation is to allow semantic search, based on the ontology. The usage of the domain ontology in the search provides abstraction from the concrete wording of. The annotation is done in the document. This type of annotation allows the user to be directed to a particular part of the document. Within the CSF both text annotation and image annotation are possible. The text annotation is implemented as a language pipe performing tokenization, POS tagging, lemmatization, semantic annotation and coreferential annotation. The first three steps rely on available state-of-the-art tools. The semantic annotation is performed by a concept annotation grammar constructed on the basis of an ontology-to-text relation as described in [13] and [14]. The coreference annotation module provides mechanisms to make the concept annotation more precise and with a better coverage of text (15% improvement). The image annotation is performed manually. It is complementary to the text annotation. The reason to include it is that very often the pictures and diagrams, etc, present in the learning documents contain important information for the learner. The user interface consists of a document browser, image editor and ontology browser. The user selects an image from the document browser. The image is opened within the image editor. Then the user can select a region of the image and assign one or more concepts to the region. The interpretation of this type of annotation is that the region depicts an instance of the corresponding concept. The document annotation is stored in the semantic repository together with links to the documents which will be used for document search by the ontology search engine. Search The search service takes the learners query as input and analyzes the information stored in the semantic repository in

order to provide relevant and trusted results. At the moment, we have implemented two types of search, that is ontology based search that exploits the structured information attested in the ontology as well as search based on tags and social networks that exploits the social aspects of the knowledge provided. Our plan is to integrate these two types of search into a recommendation system. It should exploit the strong features of both types of search, that is ontological structure and community based knowledge (i.e tags and social networks). Ontology based search The semantic search can be triggered by a list of keywords, provided by the user. The query string is analyzed on the basis of a lexicon, mapped to the domain ontology which is enriched with tags. The result of this analysis is a set of concepts from the ontology. The concepts, related to the keywords, are additionally processed on the basis of the information from the ontology by adding related concepts, that is query expansion. Query expansion is done via reasoning over the ontology. The most typical case is the addition of the sub-concepts to the ones extracted from the ontology on the basis of the keywords. The initial set of concepts is interpreted conjunctively and the addition from the query expansion is added disjunctively to it. The query, formed in this way, is evaluated over the repository of annotated documents. The result is a list of documents and in the case of the annotated ones it is a set of pointers to parts of the document which are considered as relevant to the query - sentences, paragraphs and pictures. The possibility of ontology browsing is also offered to the learner. More specifically, the enriched ontology can be exploited by the learner to get an overview of the domain specifically tailored to some search query, relevant user or specific document. This allows the learner to discover resources which do not match previously known keywords by supporting discovery of new concepts and associated resources through a visualization of the domain. Search based on social networks and tags In addition to ontology based search and browsing we have implemented an alternative type of search based on the structure of the social network and the tags provided to resources. More specifically, in order to return trusted results, the search service focuses on documents created or bookmarked by the learners friends and close contacts. The results found in this way also contain the name of the users who created or recommended them. Being in the learners network, he can now contact them and ask for more information or simply trust the document returned more as it has the guarantee of a peer recommendation. The resources found this way will most likely be in the learners zone of proximal development [10]. The search algorithms can also identify the most relevant peers for a given task. The learner can interact using chat or email with the peer to ask further questions and the peer might very well be helpful. He is already part of the learners social network and we suppose that this means there is already a connection established between them. The search services use the FolkRank algorithm [8]. This algorithm is used because it can retrieve any type of content

that has been previously tagged. A similar tag-based algorithm is used to search documents on the basis of the tags provided by users [11]. The algorithm computes results based on tag co-occurences, on tag-resource and on tag-user affinities. These affinities are computed based on the frequency of the user-resource and tag-user pairs. The tag co-occurence is the most used measure for tag similarity and it measures how often 2 tags are used together. After computing the co-occurence matrix and the affinities matrices, the algorithm clusters tags and computes an affinity between a user and a cluster of tags. Finally, using the affinity between user and the tag clusters the algorithm returns a number of resources of interest. Visualization The visualization module offers a graph based view of either a social network of users and resources centered around a given learner or a part of a domain ontology. The social network view can present the whole network as it exists in the repository or it can show only the part of the network returned by a search query. The search results are presented as a network of users and resources attached to them. The user can thus see which path links him to a specific user or resource on the given search topic. In this way, the user can identify the best means to contact a user for further questions and can also rapidly identify who are the most competent users on a specific topic. The unified relations represented graphically are from peers across different social networking applications because our system monitors and integrates all the social media applications that the learner is using. The ontology based graph allows a user to quickly gain insight into the important relations which exist between domain concepts. The ontology visualization is a hyperbolic hypergraph that contains domain concepts with their preferred lexicalisation and (taxonomic) relations to other domain concepts that are drawn as labeled edges. The hyperbolic geometry allows the user to focus on the important topics while still retaining a sense of the larger surrounding context. The user can discover, previously unknown, concepts which would not have shown up either in the social network or through a keyword search, by interactively browsing the ontology. Together the visualization services provide a way for the user to navigate through the knowledge sources, focus on specific details, add new information, take notes, etc. The visualization can be customized to be included in a number of widgets to show specific information like visualizing a part of the ontology, the social network, a definition, relevant peers, etc. These widgets can easily be recombined or disabled in order to provide a learning environment that best fits the user.
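The tag-based ranking outlined in the Search subsection can be sketched in strongly simplified form. The snippet below is not the FolkRank implementation referred to in [8]; it only illustrates, on invented data and with hypothetical function names, how tag co-occurrence and user-tag affinities can be combined to score resources for a query tag.

# Simplified illustration of tag-based resource ranking: tag co-occurrence
# plus user-tag and tag-resource affinities. A reduced sketch, not the
# FolkRank algorithm used by the system; data and names are hypothetical.
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical tag assignments: (user, resource, tag)
ASSIGNMENTS = [
    ("alice", "doc1", "xml"), ("alice", "doc1", "schema"),
    ("alice", "doc2", "html"), ("bob", "doc1", "xml"),
    ("bob", "doc3", "xml"), ("bob", "doc3", "dtd"),
    ("carol", "doc2", "html"), ("carol", "doc2", "css"),
]

def build_statistics(assignments):
    cooc = Counter()                     # how often two tags label the same post
    user_tag = Counter()                 # user-tag affinity = usage frequency
    tag_resource = defaultdict(Counter)  # tag -> resource frequencies
    by_post = defaultdict(set)           # tags grouped per (user, resource) post
    for user, resource, tag in assignments:
        user_tag[(user, tag)] += 1
        tag_resource[tag][resource] += 1
        by_post[(user, resource)].add(tag)
    for tags in by_post.values():
        for a, b in combinations(sorted(tags), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    return cooc, user_tag, tag_resource

def rank_resources(query_tag, user, assignments, top_n=5):
    cooc, user_tag, tag_resource = build_statistics(assignments)
    # Expand the query tag with co-occurring tags, weighted by co-occurrence
    # frequency and by the querying user's affinity to each tag.
    tag_weights = Counter({query_tag: 1.0})
    for (a, b), count in cooc.items():
        if a == query_tag:
            tag_weights[b] += count * (1 + user_tag[(user, b)])
    scores = Counter()
    for tag, weight in tag_weights.items():
        for resource, freq in tag_resource[tag].items():
            scores[resource] += weight * freq
    return scores.most_common(top_n)

print(rank_resources("xml", "alice", ASSIGNMENTS))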

V. EVALUATION

The Common Semantic Framework was evaluated on the basis of three use cases: 1. tutor support in the creation of a new course; 2. comparison between browsing of an ontology enriched with social tags and a cluster of related tags in order to solve a quiz;

3. recommendation of content and peers on the basis of a social network and tags. In the first use case, the scenario employed for the validation focused on providing support to tutors in the creation of courses in the IT domain. The idea was to evaluate the efficiency of the computer-assisting tool when creating a course on a certain topic. This evaluation focused on the performance of the various functionalities, reflected in users opinions. The outcome would give feedback to the developers on how to improve the supporting system. The evaluated functionalities were the following: relevance of the learning material corpus wrt a domain topic; relevance of the retrieved material; suitability of the ontology for structuring a lecture in the chosen domain; possible visualization of the interface; combination between ontology browsing and semantic search. There were two target groups: five teachers and three managers (vice-deans at the Faculty of Slavonic Languages, Sofia University). Our assumptions were as follows: the combination among semantic search, ontology browsing and a concept map based visualization would support the tutors in the processes of finding the relevant materials, structuring the course and representing the content in a better way. Our hypothesis was that, on the one hand, the tutors might differ in their preferences of using the functionalities. For some tutors, search would be more intuitive, while others might prefer structure (in our case - ontology). On the other hand, the managers would look at the facilities from a different perspective, such as how this supporting system would communicate with the well-known Moodle, for example. Two methodological formats were adopted: the think-aloud strategy for the tutors, and the interview for the managers. The thinkaloud format included seven questions, related to the specific functionalities (search, browsing, visualization) as well as the general usability and problematic issues of the system. The interview format included three questions, which were related to the usability and management issues related to the adoption of the system. The results confirmed our hypotheses. Additionally, four teachers reflected in their answers the fact that the functionalities are more efficient when operating in some interconnected mode. Regarding the semantic vs. text search three tutors gave preference to the semantic one, while two were more cautious and balanced between the two. Visualization was the most commented functionality, because the teachers had different ideas on how it should be controlled. The main worry of the managers was the requirement of a training course to be able to use the CSF. In the second use case, we have focused on the support provided by an ontology enhanced with tags in comparison with clusters of related tags extracted from Delicious (lacking ontological structure) in the context of a learning task. The underlying assumption being that conceptualization can guide learners in finding the relevant information to carry out a learning task (a quiz in our case). The hypothesis was that learners might differ with respect to the way they look for information depending on whether they are beginners or more advanced learners. While beginners might profit from the

informal way in which knowledge is expressed through tagging more advanced learners might profit from the way knowledge is structured in an ontology. The experiment was carried out with six beginners and six advanced learners, all with an academic background. Responses were elicited by means of a questionnaire that contained questions about which elements learners used to provide an answer to the quiz (i.e concepts, relations, documents, related tags, structure). The responses to the questionnaire by beginners showed that the enriched ontology can be a valid support to answer certain types of questions, because the relations between the concepts are given in an explicit way. However, beginners rely mainly on documents to find the relevant information both in the case of the enhanced ontology and in the case of the cluster of related tags. This attitude is probably influenced by the way search occurs in standard search engines. On the other hand, advanced learners indicated that they were able to find the answer to the quiz quickly by using the enriched ontology and they also used less documents to find the answers. They gave the structure of the ontology a higher rating than the beginners. Interestingly, advanced learners were more positive about the structure of the tag visualization than the beginners. This is probably due to their background knowledge that enabled them to interpret the graph better. We refer to [12] for additional details on the results of the experiment and their interpretation. In the third use case, we have focused on how we can support the activity of a group of learners (six) and one tutor that are using social networking applications for learning purposes. The tutor looked for a way to recommend resources for the learners in a time efficient way while the learners needed trust-worthy documentation on a given topic. The idea of the experiment was to see if the combined usage of social networking environments together with our tag based search could be valuable for their learning experience. The scenario was the following: the tutor and the learners provided only their public usernames of their social networking sites to our system and continued using the platforms as before. The crawling service gathered information about them, their network of friends and the resources present in these networks. The learners searched for documents of interest to them and the system returned documents written or recommended by their peers ordered by the relevance of the content. Among the results were some documents bookmarked or written by the tutor. The results contained the name of the document together with the name of the peer that recommended the document. The learners trusted the resources retrieved this way more than the ones retrieved on a similar Google search due to the fact that they knew and trusted the people behind those resources. For the tutor it was even easier, as his task consisted only of creating and bookmarking resources. As the learners were already connected with the tutor, his resources were part of their search results. That means they could get answers from their tutor without actually bothering him. The only inconvenience with this scenario for our little experiment group was that the tutor was not comfortable with all of his resources being accessible to the learners. These

privacy issues have to be taken care of in the next version of the platform.

VI. CONCLUSIONS
One of the goals of the LTfLL project is to develop services that facilitate learners and tutors in accessing formal and informal knowledge sources in the context of a learning task. To this end, a Common Semantic Framework is being developed that provides recommendations on the basis of the user profile, his interests, his preferences, his network and the learning task. An advantage of this system with respect to standard search (e.g. Google) is that the strong semantic component behind the search and recommendation system as well as the integration of social network analysis will improve learner interaction, knowledge discovery as well as knowledge co-construction.

REFERENCES
[1] P. Monachesi, L. Lemnitzer, K. Simov, Language Technology for eLearning, in Proceedings of EC-TEL 2006, in Innovative Approaches for Learning and Knowledge Sharing, LNCS 0302-9743, 2006, pp. 667-672 [2] L. Lemnitzer, K. Simov, P. Osenova, E. Mossel and P. Monachesi, Using a domain ontology and semantic search in an eLearning environment, in Proceedings of The Third International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering. (CISSE 2007). Springer-Verlag. Berlin Heidelberg, 2007 [3] V. Posea, S. Trausan-Matu, and V. Cristea, Online evaluation of collaborative learning platforms, in Proceedings of the 1st International Workshop on Collaborative Open Environments for Project-Centered Learning, COOPER-2007, CEUR Workshop Proceedings 309, 2008 [4] L. Graham , P. Takis Metaxas, Of course it' s true; I saw it on the Internet!: In critical thinking in the Internet era, Communications of the ACM, vol. 46, May 2003, p.70-75 [5] S. Auer, C. Bizer, J. Lehmann, G. Kobilarov, R. Cyganiak and Z. Ives, DBpedia: A nucleus for a web of open data, in Lecture Notes in Computer Science 4825, Aberer et al., Eds. Springer, 2007 [6] E. Wenger, Communities of Practice: Learning, Meaning, and Identity, Cambridge: Cambridge University Press, 1998 [7] P. Mika, Ontologies are us: A unified model of social networks and semantics. Journal of Web Semantics 5 (1), page 5-15, 2007 [8] A. Hotho, R. Jschke, C. Schmitz, and G. Stumme, FolkRank: A Ranking Algorithm for Folksonomies, in Proceedings FGIR 2006, 2006 [9] C. Cattuto, D. Benz, A. Hotho, and G. Stumme, "Semantic analysis of tag similarity measures in collaborative tagging systems," May 2008. [Online]. Available: http://arxiv.org/abs/0805.2045 [10] L.S. Vygotsky, Mind and society: The development of higher psychological processes, Cambridge, MA: Harvard University Press, 1978 [11] S. Niwa, T. Doi, S. Honiden, Web Page Recommender System based on Folksonomy Mining for ITNG 06 Submissions, In Proceedings of the Third International Conference on Information Technology: New Generations (ITNG'06), 2006, pp.388-393 [12] P. Monachesi, T. Markus, E. Mossel, Ontology Enrichment with Social Tags for eLearning, in Proceedings of EC-TEL 2009, Learning in the Synergies of Multiple Disciplines. LNCS 1611-3349, 2009, pp. 385-390 [13] P. Osenova, K. Simov, E. Mossel Language Resources for Semantic Document Annotation and Crosslingual Retrieval. In: Proc. of LREC 2008, ELRA. [14] K. Simov and P. Osenova Language Resources and Tools for OntologyBased Semantic Annotation. OntoLex 2008 Workshop at LREC 2008, pp. 9-13

Exploring Co-Reference Chains for Concept Annotation of Domain Texts Petya Osenova, Laska Laskova, Kiril Simov Linguistic Modelling Department, IPP, BAS Sofia, Bulgaria petya@bultreebank.org, laska@bultreebank.org, kivs@bultreebank.org 1. Introduction Annotation of domain texts with domain conceptual information is an ultimate task for applications as information retrieval, information extraction, etc. In our work, we rely on an ontology-to-text relation which provides mechanism for explicating of conceptual information within the text. The current ontology-to-text relation comprises a domain ontology, a lexicon and an annotation grammar see (Osenova, Simov, and Mossel 2008) and (Simov and Osenova 2008). One of the problems with the current implementation is the sparseness of the annotation (about 2 domain concepts per sentence). The idea is to enhance the domain semantic annotation through the discourse connectors. Our aim is to use co-reference chains as context pointers within concept annotated domain texts. For this purpose, we first manually annotated texts in IT domain with co-references, and then tested two automatic systems for co-reference annotation over the same texts. The main point here is that we are trying to see the relation between the co-reference mechanisms and the ontological concepts, while the most popular works in NLP focused on named entities, synonymy and anaphora. In this sense, our task is not trivial. The relation between concept annotation and co-references has been approached from various perspectives. For example, (Lech and de Smedt 2006) and (Nikolov et. al 2009), among others, exploit the semantic features from ontology in order to improve the coreference chaining; (Kawazoe et al. 2003) designed a software that helps experts in biomedical domain to create ontologies and annotate texts with co-references. In our task, we exploited these papers (together with the work on anaphora and co-reference annotation in general) in the annotation of the corpus. In our future work, we will apply their approaches for the implementation of a new version of our ontology-to-text relation. The work reported here is done within the context of LTfLL project Language Technology for Lifelong Learning. The paper is structured as follows: section 2 describes the co-reference mechanisms with respect to the concept annotated texts. Section 3 highlights the corpus and the manual annotation layer. Section 4 reports the experiments with two automatic systems BART and OpenNLP. Section 5 outlines the evaluation and results. Section 6 concludes the paper. 2. The role of the co-reference in concept annotation process The concept annotation is based on a domain ontology. As mentioned above, it relies on the connection between the ontology, the lexicon and the text. Semantic retrieval depends very much on the precision of the annotation. Our previous work showed that the concept annotation based mainly on the terms from the lexicon is rather sparse. This influences

the semantic retrieval coverage and precision. Additionally, the semantic retrieval loses from the unresolved ambiguity and the missing connections within the context: anaphora and lexical chains. Under lexical chain we mean concept that was named in the text in a more specific and in a more general way. For example, web page might be referred to within the document as just page. Ontologically, these would be connected to two different concepts, but contextually they are the same concept. Pure repetitions are not challenge until a more relaxed co-reference mechanism is adopted. For that reason, we have decided to explore the co-references for two purposes: disambiguation of the ambiguous concepts and providing more contexts for the concepts in the retrieval results. 3. Manual annotation of the corpus The whole manually annotated corpus is on XML and HTML topics. It comprises 158 769 tokens and 24 688 domain specific concepts, of which 4149 participate in a concept chain. The share of new concept elements is 31.33 % (1300). The co-reference annotation was done on the top of the concept annotation. However, for the experiment with the automatic systems, a document on HTML was chosen, which comprises 10 205 tokens. From all the tokens in this document 6350 meet the preliminary condition to become a markable candidate, that is, they are not function words, punctuation marks, interrogative pronouns or verb forms. However, only 1330 of them are concept bearers. There are altogether 92 concept chains covering 25 concepts. Since we were interested only in chains which included lexicalizations for the concepts from our IT ontology, not all existent in the text concept chains were marked. Thus, 273 expressions were co-indexed: 33.70% concept bearers (antecedents), 24.90% pronouns or one, 41.39% content words that receive a (new) concept as a result of them being an element of a chain. Our annotation scheme adheres to the following rules: The <Concept> elements, which are included in a chain, receive the same @index. The @class of the antecedent (concept bearer) is predefined. Anaphoric <Concept>s may or may not have a @class attribute (i.e. concept annotated on the basis of the lexicon), but all of them receive a @c-class (i.e. concept as bounded by the context in the chain). Its value is determined by the anaphoric relation with the antecedent. By default we use only one kind of relation equivalence, which corresponds to the relation IDENT in the MUC annotation schema. We considered NPs as possible markable candidates, including phrases with elliptical heads, relative pronouns, personal and possessive, reflexive and demonstrative pronouns, one. If the concept bearer does not have the attribute @class (i.e. the concept is not domain specific), then the candidates are ignored, and chains are not formed. The structure is as follows: <Concept @index=in#id(common_index_for_all_the_elements_in_the_chain) {@class= ontology_concept(as_defined_in_the_Lexicon) } {@c-class= ontology_concept(discourse_related)}> <tok1>lexical_term</tok1> . <tokn>lexical_term</tokn> </Concept>

The boundaries of a concept chain are fixed by the next appearance of an antecedent, which bears the same concept in discourse. Not all <Concept> with the same @class are co-indexed. Here come some concrete examples for the anaphora resolution and disambiguation cases: A) When there is an anaphoric relation between a pronominal expression and a <Concept> element, the anaphora is tagged as a <Concept>, and both of them receive @index attribute with equal value. The @c-class attribute for the pronominal expression indicates the shared concept. For example: <Concept @class=http://www.lt4el.eu/CSnCS#XML @index="in001"> <tok> XML</tok> </Concept> is used to aid the exchange of data. <Concept @c-class=http://www.lt4el.eu/CSnCS#XML @index="in001"> <tok> It</tok> </Concept> makes it possible to define data in a clear way. B) In case of disambiguation the annotation procedure is the same, except for that the anaphoric expression (in this case - lexical NP) has both @class and @c-class attributes the former indicates which concept the expression denotes according to its place in the lexicon and the latter its meaning according to the discourse entity it refers to. <Concept @class=http://www.lt4el.eu/CSnCS#HT MLPage @index="in007"> <tok> HTML</tok> <tok> file</tok> </Concept> can link to an external style sheet and also include a style element for additional style settings specific to this <Concept @class="http://www.lt4el.eu/CSnCS#Page " @c-class="http://www.lt4el.eu/CSnCS#HT MLPage" @index="in007" <tok> page </tok> </Concept> .
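A short script can make the chain bookkeeping explicit. The sketch below assumes plain attribute names (index, class, c-class, without the @ prefix used in the schema description) and an invented sample fragment; it groups <Concept> elements by chain index and copies the antecedent's concept to the other chain members.

# Sketch: reading <Concept> chain annotations as described above and
# propagating the antecedent's concept along the chain. The attribute
# spellings (index/class/c-class) and the sample XML are assumptions for
# illustration, not the project's exact schema.
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<text>
  <Concept index="in001" class="http://www.lt4el.eu/CSnCS#XML"><tok>XML</tok></Concept>
  is used to aid the exchange of data.
  <Concept index="in001" c-class="http://www.lt4el.eu/CSnCS#XML"><tok>It</tok></Concept>
  makes it possible to define data in a clear way.
</text>"""

def collect_chains(root):
    chains = defaultdict(list)
    for concept in root.iter("Concept"):
        chains[concept.get("index")].append(concept)
    return chains

def propagate_antecedent_concepts(root):
    """For every chain, copy the antecedent's class to members lacking c-class."""
    for index, members in collect_chains(root).items():
        antecedent_class = next((m.get("class") for m in members if m.get("class")), None)
        if antecedent_class is None:
            continue  # no domain-specific antecedent: chain is ignored
        for member in members:
            if member.get("c-class") is None:
                member.set("c-class", antecedent_class)

root = ET.fromstring(SAMPLE)
propagate_antecedent_concepts(root)
for concept in root.iter("Concept"):
    print(concept.get("index"), concept.get("c-class"), "".join(concept.itertext()).strip())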

4. The automatic annotation with OpenNLP and BART systems Our first attempt to solve the co-referential task is to exploit off-the-shelf systems as they are distributed by their developers. We made experiments with several such systems, but here we report only the results from the two most successful systems. OpenNLP is a well-known Java-based toolkit that performs all standard NLP steps (sentence splitting, tokenization, POS-tagging, etc.), including co-reference detection, that makes use of WordNet. Since this tool is not designed for concept chaining but for co-reference resolution, it would be incorrect to measure up the results against the golden standard annotation. Therefore, we will not use the common F-measure to evaluate the results but we will try to assess OpenNLP usability as a subsidiary instrument to improve the concept annotation. BART (Beautiful/Baltimore Anaphora Resolution Toolkit) is an open source modular toolkit developed as a result of the project Exploiting Lexical and Encyclopedic Resources For Entity Disambiguation 2007. It includes ideas from GuiTAR and other coreference systems. BART architecture allows for further exploration of different preprocessing and resolving methods. Both input and output are in XML format (MMAX2 format). BART can be used as a platform for experimentation or as a off-theshelf tool for anaphora resolution. On MUC-6 corpus BART has better performance in pronoun resolution than JavaRAP (Versley et. al. 2008). 5. Results and evaluation The number of co-reference chains, marked by OpenNLP, is 154. Compared to the manually tagged elements, OpenNLP markables are often maximal NPs, which is in agreement with the MUC annotation scheme requirements. Approximately one quarter of them (24.67%) are expressions (usually heads in an NP) related to a concept from the domain ontology. Only 1 of the chains could be used for sense disambiguation (web page my page); 50% have as their members pronouns, and the rest are lexical repetitions. Based on these results, we can draw the conclusion that OpenNLP might be used as a means to detect the context-dependent meaning of the pronouns, which denote domain specific concepts. That in turn would provide a more adequate picture of the text saliency for the different concepts in the analyzed document. The output from BART includes 373 co-reference chains and compared to the OpenNLP output, there are more cases of embedded markables, e.g. {2the {1browser} window}. Taking into account the results from the previous experiment with OpenNLP, we expected that the co-reference information provided by BART might support anaphora resolution type of concept chaining. This assumption was confirmed. However, most of the chains include repetitions of one or two expressions. For example, one of the chains contains 131 markables, 28 of them personal pronouns (it), 2 possessive (its) and the rest are abbreviation tokens (HTML) or chunks, including the abbreviation. Although the recall is better, the precision is not very good (in this example, only 2 of the pronouns were co-referential with HTML). Since we aim at higher precision, we decided to use the information provided by the annotation, performed with OpenNLP.
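Figures of the kind reported in this section (share of system chains anchored to a domain concept, share containing pronouns) reduce to simple bookkeeping once the system output is flattened to lists of markable strings. The sketch below uses invented chains and a toy lexicalization list purely to illustrate the computation.

# Sketch: computing the chain statistics discussed above from system output
# reduced to lists of markable strings. The chain data and the lexicalization
# list are invented for illustration.
PRONOUNS = {"it", "its", "this", "that", "one", "they", "them", "he", "him", "she", "her"}
DOMAIN_LEXICALIZATIONS = {"html", "xml", "web page", "browser", "style sheet"}

system_chains = [
    ["HTML", "it", "it", "the HTML file"],
    ["the browser window", "it"],
    ["John Smith", "he", "him"],
]

def chain_statistics(chains):
    def mentions_domain_concept(chain):
        return any(any(lex in m.lower() for lex in DOMAIN_LEXICALIZATIONS) for m in chain)
    def contains_pronoun(chain):
        return any(m.lower() in PRONOUNS for m in chain)
    total = len(chains)
    return {
        "chains": total,
        "with_domain_concept": sum(mentions_domain_concept(c) for c in chains) / total,
        "with_pronouns": sum(contains_pronoun(c) for c in chains) / total,
    }

print(chain_statistics(system_chains))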

6. Conclusions Both systems do not tend to take decision when there are ambiguities. In contrast to OpenNLP, BART connects named entities. However, the change of domain makes this facility an obstacle. Both systems connect only close synonyms of the same concept. The interference of more co-reference chains fails them. Also, they do not connect concept subconcept relations. BART connects all pronouns, which leads also to a lot of mistakes. Both systems can be used for anaphora resolution, but not for disambiguation. For that reason, our future work will aim at combining co-reference systems with word sense disambiguation ones. For the purposes of disambiguation and better concept salience in the texts, our future plans include also an extension of the corpus annotation (automatically) with concepts from the top part of the ontology (in our case the Dolce Ultralite). Thus, the non-domain lexemes would be covered, too. Then we will use this additional annotation to train the available systems for the task. References
(Kawazoe et al. 2003) Ai Kawazoe, Tony Mullen, Koichi Takeuchi. Open Ontology Forge: A Tool for Ontology Creation and Text Annotation Applied to the Biomedical Domain. In: Genome Informatics 14: 677-678 (2003).

(Lech and de Smedt 2006) Till Christopher Lech and Koenraad de Smedt. Enhancing Semantic Annotation through Coreference Chaining: An Ontology-based Approach. In: Siegfried Handschuh, Thierry Declerck, Marja-Riitta Koivunen (eds.), CEUR Workshop Proceedings, Vol. 185, 2006. (Nikolov et. al 2009) Andriy Nikolov, Victoria Uren, Enrico Motta and Anne de Roeck. Towards instance coreference resolution in a multi-ontology environment. Presented at: Workshop on matching and meaning, Edinburgh, UK, April 2009. (Osenova, Simov, and Mossel 2008) Petya Osenova, Kiril Simov, Eelco Mossel. 2008. Language Resources for Semantic Document Annotation and Crosslingual Retrieval. In: Proc. of LREC 2008, ELRA. (Simov and Osenova 2008) Kiril Simov and Petya Osenova 2008: Language Resources and Tools for Ontology-Based Semantic Annotation. OntoLex 2008 Workshop at LREC 2008, pp. 9-13. (Versley et. al. 2008) Versley, Y., Ponzetto, S.P., Poesio, M., Eidelman, V., Jern, A., Smith, J., Yang, X., Moschitti, A. BART: A Modular Toolkit for Coreference Resolution. ACL 2008 System demo. Available at: http://www.versley.de/

Socially driven ontology enrichment for eLearning


Paola Monachesi and Thomas Markus
Utrecht University
1 Introduction

Ontologies can play an important role within eLearning applications [1]. They can guide and support the learner in the learning process since they provide a formalization of the knowledge of a domain approved by an expert. In addition, they can facilitate (multilingual) retrieval and reuse of content as well as mediate access to various sources of knowledge. Ontologies, however, might be too static since they model the knowledge of the domain at a given point in time. We still lack reliable methods to deal automatically with the conceptual dynamics of evolving domains [3]. In addition, ontologies might be incomplete or might not correspond to the representation of the domain knowledge available to the learner. The vocabulary of the learner (especially beginners) might be different from that of domain experts and maybe more sensitive to evolving terminology or less specialized terminology. In the Language Technology for eLearning project1, we envisage a solution to these shortcomings by merging the dynamic knowledge provided by tagging that is available through social media applications such as Delicious or Flickr with the formal knowledge provided by domain ontologies.

Similarity measures are employed to identify tags which are related to the concepts of an existing ontology while a knowledge base such as DBpedia [2] is used in order to map the tags into the ontology. Thus, we can include not only the expert view of a given domain that might be shared by advanced learners but also the view of beginners who are probably using a less specialized terminology. In addition, we are able to enrich ontologies automatically, which is an important condition for eLearning applications to be scalable.

1 http://www.ltfll-project.org/

2 State of the art

There is growing attention for ontology lifecycle management which encompasses not only the creation of an ontology, but its extension and maintenance as well. Techniques include manual methods such as special wikis for ontology modification [4] as well as Natural Language Processing techniques that can be exploited for ontology learning [5]. Given the availability of social media data, there are emerging approaches that attempt to extract hierarchical structure from tags by relying on various algorithms, the implicit structure of tagging systems and background knowledge bases [6]. The work presented in this paper relies on similar techniques but it differs from previous approaches because it exploits existing domain ontologies by embedding the tags extracted from social media applications into their existing structure. It is thus possible to exploit the growing number of ontologies available as a result of the Semantic Web initiative and enhance them with the extended vocabulary arising from social data.

3 Enhancing ontologies with social tagging

Domain ontologies created by experts can benefit from the information extracted from social media applications for their enrichment. We take, as starting point, the LT4eL domain ontology on computing that was
It contains 1002 domain concepts, 169 concepts from OntoWordNet and 105 concepts from DOLCE Ultralite. The connection between tags and concepts is established by means of language-specific lexicons, where each lexicon specifies one or more lexicalizations for each concept.

Similarity measures can play a relevant role in the automatic ontology enrichment process. They can be employed to identify whether social tags that we have extracted from Delicious represent an additional lexicalization of an existing concept, (the lexicalization of) a new concept, or a more specific/general concept than an existing one. Co-occurrence can provide valuable input for extracting taxonomic relationships between tags, as attested by [7]. However, [8] points out that this measure should first be normalized, and proposes two different methods: symmetric (according to the Jaccard coefficient) and asymmetric. Another possibility is to split the notion of co-occurrence into user co-occurrence and resource co-occurrence. The former takes the individual users into account when calculating the co-occurrence scores [7]. In the case of resource co-occurrence, tags are said to co-occur when they are added to the same resource (by different users) [7]. Cosine similarity is also known to provide valuable input for discovering taxonomic relationships [7].

We have experimented with all the measures mentioned above, but unfortunately the results of our experiments did not confirm those of the cited papers when applied to the computing domain (detailed results will be provided in the full paper). More specifically, the similarity scores do not allow for the automatic and reliable discrimination between possible synonyms, lexicalizations of new concepts and taxonomic relations that we require in order to make the ontology enrichment process fully automatic. Human intervention was still found to be necessary to carry out the appropriate selection. Even though the application of the various similarity measures did not allow for a straightforward automated interpretation of the data in our domain, we have decided to use it as a first step in the ontology enrichment process.
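To make the measures discussed above concrete, the sketch below computes raw resource co-occurrence counts and the symmetric (Jaccard-normalised) similarity for a toy set of (user, resource, tag) triples. It is a minimal illustration of the idea, not the LT4eL implementation; the data layout, function names and toy data are our own assumptions.

from collections import defaultdict
from itertools import combinations

def resource_cooccurrence(posts):
    """Count how often two tags are attached to the same resource.

    `posts` is an iterable of (user, resource, tag) triples, as found
    in a Delicious-style folksonomy dump.
    """
    tags_per_resource = defaultdict(set)
    for _user, resource, tag in posts:
        tags_per_resource[resource].add(tag)

    counts = defaultdict(int)    # co-occurrence counts per tag pair
    tag_freq = defaultdict(int)  # number of resources each tag appears on
    for tags in tags_per_resource.values():
        for tag in tags:
            tag_freq[tag] += 1
        for a, b in combinations(sorted(tags), 2):
            counts[(a, b)] += 1
    return counts, tag_freq

def jaccard_similarity(pair, counts, tag_freq):
    """Symmetric normalisation of raw co-occurrence (Jaccard coefficient)."""
    a, b = sorted(pair)
    inter = counts.get((a, b), 0)
    union = tag_freq[a] + tag_freq[b] - inter
    return inter / union if union else 0.0

# Toy usage: how related is the tag "xslt" to the seed lexicalisation "xhtml"?
posts = [("u1", "r1", "xhtml"), ("u1", "r1", "xslt"),
         ("u2", "r2", "xhtml"), ("u2", "r2", "css"),
         ("u3", "r3", "xhtml"), ("u3", "r3", "xslt")]
counts, freq = resource_cooccurrence(posts)
print(jaccard_similarity(("xhtml", "xslt"), counts, freq))  # -> 0.666...

The asymmetric variant described in [8] would instead divide the same raw count by the frequency of only one of the two tags, which makes the measure directional.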

Given our eLearning application, our main goal is to include information that is relevant to a learner and his peers. We therefore assess the information implicitly contained in tag collections to obtain a sense of what is relevant and what is not in a given domain. It is this information that plays an important role for learners, especially beginners. Tagging systems provide us with a domain vocabulary which is validated as common knowledge by the community that has produced it. The similarity measure selects possible lexicalizations of concepts which are related to the existing ones in the ontology and which are, in addition, assumed to be socially relevant with respect to the input lexicalisation. More specifically, we have employed the resource co-occurrence measure in our system because of its efficiency and its wide use in the literature.

However, if we want to map the related terms identified by the similarity measure to the ones present in the ontology, we still face the problem of identifying the appropriate relationships. To this end, several heuristics are employed. They rely heavily on the use of a large background knowledge base such as DBpedia [2]. For example, we employ DBpedia to assess whether a related tag can be considered a new concept or a lexicalization of an existing one. If it is found to be a new concept, its additional lexicalizations and possible synonyms are identified. More specifically, we map related terms to DBpedia resources, the underlying assumption being that DBpedia resources can be assimilated to concepts for our purposes. Each resource in DBpedia is described by various properties, including a (multilingual) label, which we consider as a lexicalisation of the concept to be included in our lexicon. In cases where alternative page titles are attested within the label property, we can include all of them in our lexicon. By making use of the SKOS vocabulary [9], we can differentiate between a preferred lexicalisation (the head term) and additional lexicalisations (i.e. popular and alternative terms for the same concept). In the case of an ambiguous term, we can rely on DBpedia redirections and disambiguation pages to resolve the ambiguity.
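As an illustration of how the SKOS distinction between preferred and additional lexicalisations can be recorded, the sketch below emits a Turtle fragment for one concept, using skos:prefLabel for the head term and skos:altLabel for alternative titles such as those gathered from DBpedia labels and redirects. The concept URI, helper name and example labels are illustrative assumptions, not part of the LT4eL lexicon format.

def skos_lexicon_entry(concept_uri: str, pref_label: str, alt_labels: list[str],
                       lang: str = "en") -> str:
    """Render a concept's lexicalisations as a Turtle fragment using SKOS."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"<{concept_uri}> a skos:Concept ;",
        f'    skos:prefLabel "{pref_label}"@{lang}',
    ]
    for alt in alt_labels:
        lines[-1] += " ;"
        lines.append(f'    skos:altLabel "{alt}"@{lang}')
    lines[-1] += " ."
    return "\n".join(lines)

# Hypothetical example: a preferred label plus alternative titles for one concept.
print(skos_lexicon_entry(
    "http://example.org/ontology#XSLT",
    "XSLT",
    ["Extensible Stylesheet Language Transformations", "XSL Transformations"]))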

Some effort has been devoted to mapping other ontologies (e.g. OpenCyc) onto DBpedia in order to improve its usefulness and semantic interpretability. We exploit this information to discover new taxonomic relations. To this end, we rely on the rdf:type assertions present in DBpedia resources. More specifically, an rdf:type assertion between a DBpedia resource and a resource from some other ontology can be used to infer that the DBpedia concept is actually a sub-concept of the object of that statement. By looking up the lexicalisation of the super-concept, we can discover where the concept should be placed in the target ontology, assuming that the super-concept is already present. If the super-concept is not present, both the shared category and its sub-concept are added, with the appropriate taxonomic relations, to the original seed concept.

DBpedia resources are classified according to different classification schemata, one of which is the set of Wikipedia categories. Wikipedia has an actively used category system for grouping articles. These categories are themselves contained in other categories, resulting in a hierarchical structure. We employ this information to identify possible relations which hold between an existing concept in our ontology and the tag we are mapping. It can be the case that the two concepts we are considering are not directly related, but only indirectly through some shared category higher up in the hierarchy. We automatically calculate the closest shared categories for two concepts and return them. To summarize with an example: given the pre-existing domain ontology concept XHTML, the similarity measure generates the tag xslt, which is attested in DBpedia as a resource (i.e. a concept) and shares the Wikipedia category XML with the XHTML concept. Given that the category XML is already a concept present in the domain ontology, the new concept XSLT can be added as a subclass of it.

The methodology proposed allows for the enrichment of an existing ontology with the vocabulary of the Community of Practice that the user is part of. More specifically, the resulting ontology integrates the socially relevant concepts within the structure of an expert-view domain ontology. Extraction methods exclusively focused on deriving ontology-like structures from tag systems cannot provide results of such high quality, due to the unavailability of explicit structural information in folksonomies, which, on the contrary, has been made explicit in domain ontologies.
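The category-based placement heuristic can be sketched as a breadth-first search upwards through the Wikipedia category hierarchy: starting from the categories of the seed concept and of the candidate tag (obtainable, for instance, from the dcterms:subject and skos:broader properties in DBpedia), we walk upwards until a shared category is found. The hard-coded toy hierarchy below only mirrors the XHTML/xslt example from the text; it is an assumption for illustration, not actual DBpedia data, and the function names are ours.

from collections import deque

def ancestors_by_depth(start_categories, broader):
    """Map each category reachable upwards from `start_categories` to its depth."""
    depths = {}
    queue = deque((c, 0) for c in start_categories)
    while queue:
        cat, d = queue.popleft()
        if cat in depths:
            continue
        depths[cat] = d
        for parent in broader.get(cat, []):
            queue.append((parent, d + 1))
    return depths

def closest_shared_categories(cats_a, cats_b, broader):
    """Return the shared categories with the smallest combined distance."""
    da = ancestors_by_depth(cats_a, broader)
    db = ancestors_by_depth(cats_b, broader)
    shared = set(da) & set(db)
    if not shared:
        return []
    best = min(da[c] + db[c] for c in shared)
    return sorted(c for c in shared if da[c] + db[c] == best)

# Toy category hierarchy (child -> broader categories), loosely mirroring the example.
broader = {
    "Category:XHTML": ["Category:XML", "Category:Markup languages"],
    "Category:XSLT": ["Category:XML"],
    "Category:XML": ["Category:Markup languages"],
}
print(closest_shared_categories(["Category:XHTML"], ["Category:XSLT"], broader))
# ['Category:XML'] -> the new concept XSLT can be attached under the existing XML concept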

4 Evaluation

In order to evaluate our methodology, we have compared three different ontologies:

1. the LT4eL computing ontology with the related English lexicon (1200 classes);
2. the manually enriched ontology, which takes the LT4eL one as a basis (1336 classes and 1672 lexical entries); this is our gold standard;
3. the automatically enriched ontology, which takes the original LT4eL ontology as a basis (2016 classes and 2325 lexical entries).

A first analysis of the lexical differences between (1) and (2) shows a difference of 80 lexicalisations. The aim of our evaluation was to assess whether the automatic enrichment process would add lexicalisations (and related concepts) that overlap with the manually added lexicalisations, given a similar sub-domain. The automatically enriched ontology has been generated by considering each co-occurring tag in our Delicious data set as eligible for enrichment. The Delicious dataset we have crawled contains 598379 resources, 154476 users and 221796 tags.

Even though we considered every co-occurring tag as eligible for use in ontology enrichment, the lexical overlap between the manually enriched ontology and the automatic one is minimal. More specifically, 69 terms which have been added manually to the LT4eL ontology are multi-word units and are not attested in Delicious. They are representative of the expert view of the domain given their level of specificity and include terms such as: NMTOKEN attribute, XML element type declaration, XML attribute list declaration. The remaining 21 terms are attested in Delicious, but only 13 of them are generated by the similarity measures and are attested in DBpedia.

Regardless of the minimal lexical overlap between the manually and the automatically enriched ontology, it is not the case that the terms added automatically are inappropriate or misplaced in the ontology. A preliminary verification carried out by domain experts shows that the result is satisfactory both from the point of view of added classes as well as added relations. We can thus conclude that the methodology proposed allows for an appropriate enrichment process, but produces a complementary vocabulary to that of a domain expert.
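A minimal sketch of the overlap analysis described above: given the set of lexicalisations added by the expert and the set of tags attested in the crawl, we separate multi-word units (which are unlikely to appear verbatim as single Delicious tags) from single terms and check which of the latter are attested. The toy data and names are assumptions; the code simply mirrors the kind of comparison reported in the text.

def overlap_report(manual_additions, delicious_tags):
    """Split manually added lexicalisations by whether they are attestable/attested as tags."""
    delicious_tags = {t.lower() for t in delicious_tags}
    multi_word = [t for t in manual_additions if " " in t]
    single_word = [t for t in manual_additions if " " not in t]
    attested = [t for t in single_word if t.lower() in delicious_tags]
    return {
        "multi-word (unlikely as single tags)": len(multi_word),
        "single terms": len(single_word),
        "attested in Delicious": len(attested),
    }

# Toy data standing in for the gold-standard additions and the crawled tag set.
manual = ["NMTOKEN attribute", "XML element type declaration", "xslt", "css"]
tags = ["xslt", "xhtml", "css", "xml"]
print(overlap_report(manual, tags))
# {'multi-word (unlikely as single tags)': 2, 'single terms': 2, 'attested in Delicious': 2}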

5 Conclusion

We have developed an ontology enrichment pipeline that can automatically enrich a domain ontology using a combination of social tagging systems, similarity measures, the DBpedia knowledge base and several heuristics. A preliminary evaluation reveals that there is minimal overlap between the ontology produced by means of a manual enrichment process carried out by an expert and our automatic enrichment process based on tags extracted from Delicious. Both ontologies are correct from a formal point of view, but the latter includes the vocabulary of the community of users, while the former includes very specialized terminology provided by an expert. It is exactly this complementarity that we wanted to achieve by embedding tags into an existing ontology and that we want to exploit in eLearning applications.

References

[1] Monachesi, P., Simov, K., Mossel, E., Osenova, P., What ontologies can do for eLearning. Proceedings of the International Conference on Interactive Mobile and Computer Aided Learning, IMCL'08, 2008.
[2] Auer, S., Bizer, C., Lehmann, J., Kobilarov, G., Cyganiak, R., Ives, Z., DBpedia: A nucleus for a web of open data. In: Aberer et al. (eds.), Lecture Notes in Computer Science 4825, Springer, 2007.
[3] Hepp, M., Possible Ontologies: How Reality Constrains the Development of Relevant Ontologies. IEEE Internet Computing, 90-96, 2007.
[4] Ghidini, C. et al., MoKi: The Enterprise Modelling Wiki. Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, 835, 2009.
[5] Buitelaar, P., Cimiano, P., Magnini, B., Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, 2005.
[6] Specia, L. and Motta, E., Integrating Folksonomies with the Semantic Web. 4th European Semantic Web Conference (ESWC 2007), Innsbruck, June 3-7, 2007.
[7] Cattuto, C., Benz, D., Hotho, A., Stumme, G., Semantic grounding of tag relatedness in social bookmarking systems. Proceedings of ISWC 2008, LNCS, Karlsruhe, Germany, 2008.
[8] Sigurbjörnsson, B. and van Zwol, R., Flickr tag recommendation based on collective knowledge. In: Proc. 17th International Conference on World Wide Web, pages 327-336, 2008.
[9] Miles, A., Matthews, B., Wilson, M. and Brickley, D., SKOS Core: Simple knowledge organisation for the web. Proceedings of the International Conference on Dublin Core and Metadata Applications, 12-15, 2005.

Bridging Ontologies and Folksonomies using DBpedia


Vlad Posea*, Stefan Trăusan-Matu*
*Universitatea Politehnica Bucuresti, Facultatea de Automatică si Calculatoare, România (e-mail: {vlad.posea, trausan}@cs.pub.ro)

Abstract: The paper presents an experiment to link tags from a given folksonomy to existing high-level ontologies using the DBpedia knowledge base. The paper presents the theoretical background, the technical approach and the results of the experiment. It also suggests how the precision of the method can be increased.

1. INTRODUCTION

The usage of free tagging has become very popular lately, with most Web 2.0 applications offering this facility as a way for people to organize information. Starting with Del.icio.us in 2003, the possibility for the common user to add short words describing content on the internet, with the purpose of easier retrieval of information, became an instant success. All the main content creation and sharing applications (YouTube.com, Flickr.com, Slideshare.net), blogging communities and even operating systems started to offer this tagging feature, and users were very eager to use it. Users adopted tagging for its obvious simplicity and for the advantages it offers. However, the advantages offered by folksonomies are not as many as those offered by their older relatives, ontologies. Folksonomies allow only a one-level classification of the content, without any relations between the tags themselves. This does not allow performing real semantic searches on the tagged content.

This paper aims to increase the possibilities of semantic search on a folksonomy by linking the tags in a folksonomy with the concepts of an ontology. The paper presents a comparative study of ontologies and folksonomies with the purpose of identifying ways to link them. Afterwards we present the knowledge base that we are going to use in linking an ontology to a folksonomy and the actual experiment that we have performed. Finally we present the results and the conclusions that we have drawn from the experiment.

2. ONTOLOGIES VS. FOLKSONOMIES

According to the most cited definition in the domain, an ontology is "a specification of a conceptualization" (Gruber, 1993). Usually we consider an ontology to be a formal representation of a domain, consisting of concepts and their properties and relations. A folksonomy is, according to the initial definition, "tagging that works" (Vanderwal, 2007). A more recent definition considers a folksonomy to be a tripartite graph with hyperedges. The nodes of the graph are the users that tag, the resources being tagged and the tags themselves; the hyperedges are the triplets (user, resource, tag). This approach aims, among other objectives, to discover relations between the tags in the folksonomy.

The advantage of an ontology is the existence of these formal relations between the concepts, relations that allow inferences and semantic search. The relations that we can discover inside a folksonomy are based on frequent tag associations (tags that are used together, tags that are used by the same user, tags that describe the same resource). However, these relations do not carry semantic information. A way to find semantic relations between the tags of a folksonomy is to discover that those tags carry the same meaning as the concepts in an ontology. This could be done at tagging time if users were allowed to annotate with ontological concepts. As this is considered too difficult for users to understand, we intend to match the tags to ontological concepts after the moment of tagging.
There have been several approaches in this field. Tom Gruber proposed an ontology of folksonomies (Gruber, 2005) and his proposal was further developed by Echarte et al. (2007). Another approach was to extract ontologies from folksonomies; this was tried through various methods by Mika (2007) and Van Damme et al. (2008). Their methods involved discovering co-occurrences, clustering the folksonomies and defining similarity metrics to find related tags. We consider that there is a considerable amount of formalized knowledge on the internet, represented by ontologies like YAGO (Suchanek et al., 2007), Cyc, Umbel and Freebase (Bollacker et al., 2007). Most of the information in these ontologies is aggregated within DBpedia. We believe that we can discover links between tags and existing ontological concepts specified in these ontologies by using DBpedia. The structure of DBpedia is specified in the next section and our approach is described in section 4.

3. DBPEDIA AS A KNOWLEDGE SOURCE

DBpedia (Auer et al., 2006) is a huge knowledge base containing semantically rich information extracted from Wikipedia by means of automatic information extraction. According to its authors (http://dbpedia.org), the knowledge base describes more than 2.6 million things (persons, places, companies, films, concepts). The data in the knowledge base is linked to several other knowledge bases existing on the Internet: Freebase, YAGO, OpenCyc. Simply said, DBpedia contains most of the concepts described in Wikipedia, and these concepts are represented in a semantic format (RDF).

These characteristics allow complex semantic queries to be performed on DBpedia, queries whose results cannot be obtained by an agent from any other single source. For example, we could find out using DBpedia which persons have obtained an Oscar and are still alive. In order to obtain that information we only need to know the model of the domain and to formulate a SPARQL query. SPARQL (W3C, 2009) is the query language for RDF and has been a W3C recommendation since January 2008. The model of the domain could easily be made available to a software agent specialized in the domain, and the query is very simple:

SELECT ?person WHERE {
  ?person a <http://dbpedia.org/ontology/Person>.
  ?person <http://dbpedia.org/property/academyawards> ?award.
  ?award a <http://dbpedia.org/class/yago/AcademyAwards>.
  ?person a <http://dbpedia.org/class/yago/LivingPeople>.
}

The query gets information from DBpedia using multiple ontologies. It uses the default DBpedia ontology, which has around 170 classes and 940 properties. It also uses YAGO, which is a manually verified knowledge base extracted from Wikipedia with a declared 95% accuracy. Some other queries could also contain concepts from Umbel (Upper Mapping and Binding Exchange Layer), which is a lightweight ontology, and from OpenCyc. DBpedia could be used by a software agent to find information using the semantic metadata, to link data from multiple sources and to generate knowledge from the information extracted. DBpedia can be queried using the SPARQL endpoint provided by the authors. This endpoint will be used in our experiment.
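For readers who want to try such a query, the snippet below sends it to the public endpoint at http://dbpedia.org/sparql using only the Python standard library and prints the returned bindings. Endpoint availability, the exact property and class URIs, and the results naturally change over time, so this is a hedged illustration rather than a guaranteed-reproducible run.

import json
import urllib.parse
import urllib.request

QUERY = """
SELECT ?person WHERE {
  ?person a <http://dbpedia.org/ontology/Person>.
  ?person <http://dbpedia.org/property/academyawards> ?award.
  ?award a <http://dbpedia.org/class/yago/AcademyAwards>.
  ?person a <http://dbpedia.org/class/yago/LivingPeople>.
}
LIMIT 10
"""

def run_query(query, endpoint="http://dbpedia.org/sparql"):
    """POST a SPARQL query as a form parameter and return the JSON result bindings."""
    data = urllib.parse.urlencode({"query": query,
                                   "format": "application/sparql-results+json"})
    with urllib.request.urlopen(endpoint, data.encode("utf-8"), timeout=30) as resp:
        return json.load(resp)["results"]["bindings"]

if __name__ == "__main__":
    for row in run_query(QUERY):
        print(row["person"]["value"])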

4. DESCRIPTION OF THE EXPERIMENT

The experiment aimed to see whether we could identify the types of the tags in a tag set using the DBpedia knowledge base. The idea was to query DBpedia for a resource whose name is identical to the tag. The tags were processed in order to make them suitable for the DBpedia resource naming style: all tags were capitalized and the English plural 's' endings were eliminated. The SPARQL query searched for the resource name. After getting the resource name, we identify the type of the resource by using the rdf:type relation existing in DBpedia. More precisely, the relations that we extract and analyze are shown in Figure 1.

Figure 1: The relations between Tag, Resource and Type.

The possible results were the following:

- There was no resource with this name. This means we cannot find the type of the tag using the DBpedia ontology.

- There is a single resource with this name and it is found using the preliminary query. This means that we could identify a unique type for the given tag. Such an example is http://dbpedia.org/resource/Economy. This resource is clearly identifiable and the resource type can be easily obtained with the following SPARQL query:

SELECT ?type WHERE {
  <http://dbpedia.org/resource/Economy> a ?type
}

In this case the types of the given concept are identified from the OpenCyc ontology as ExistingObjectType and QAClarifyingCollectionType. Also, using the owl:sameAs property, we can identify the resource to be identical with the Economy resource existing in the Freebase knowledge base.

- There is no specific resource for the given tag, but we identify that there is a redirect to a given resource. This usually means that the tag is an alias for a resource name, and the DBpedia redirect property identifies the full resource name. We expect this situation to be quite common in our set of tags, as people tend to tag with abbreviations, slang or incomplete names. An example for this case is the dvd tag. This tag is capitalized to Dvd to match DBpedia's naming conventions. The search for http://dbpedia.org/resource/Dvd does not fetch a resource but a redirect to the resource http://dbpedia.org/resource/DVD. The concept obtained in this way can be analyzed like the one in the previous case. The SPARQL query for this case is:

SELECT ?resource ?type WHERE {
  <http://dbpedia.org/resource/Dvd> <http://dbpedia.org/property/redirect> ?resource.
  ?resource a ?type
}

- There is a resource identified, but the resource name is disambiguated into a number of concepts. This is probably the most common case, as a large number of tags are incompletely specified resources. We show here two different examples that both match this case. The first example is http://dbpedia.org/resource/Juno. Juno was a very successful film in 2008, winning an Oscar and three other nominations. However, the DBpedia resource disambiguates the term Juno to no less than 18 concepts. One of them is Juno_(film). Another is Juno Beach, one of the famous Normandy beaches from the Second World War. Another concept is the Roman goddess Juno. There are no specific means to identify which of these concepts the tag refers to using only the tag and not its context. The second example is the tag bush. The search for http://dbpedia.org/resource/Bush returns a concept that is disambiguated to no less than 13 concepts. The difference from the previous example is that there the concept to which the tag was referring was among the concepts returned, whereas here the tag bush refers to the president of the United States of America, while the returned concepts refer to a 1916 automobile, the band Bush, a Belgian beer and, among the rest, the Bush family. There is no way to link the tag directly to the concept we are looking for using only the rdf:type relation; a more thorough understanding of the domain would be needed. The SPARQL query for this case is:

SELECT ?resource ?type WHERE {
  <http://dbpedia.org/resource/Bush> <http://dbpedia.org/property/disambiguates> ?resource.
  ?resource a ?type
}

For the experiment we have used a set of 450 tags used by a randomly chosen YouTube user. The reason we have chosen the tags of a YouTube user for this experiment is that the videos on this sharing platform carry very little metadata except the tags provided by the author. It is important, then, to identify the meaning of the given tags and to link them to existing ontologies in order to be able to attain a higher-level understanding of the provided video content.
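A compact sketch of the per-tag decision procedure described above: normalise the tag to DBpedia's resource-naming style, then classify the lookup outcome into the four cases (no resource, uniquely typed resource, redirect, disambiguation). The classifier works on already-fetched values so that it stays runnable offline; in practice those values would come from the three SPARQL queries shown in this section. The function names and sample outcomes are our own assumptions.

def normalize_tag(tag: str) -> str:
    """Approximate the paper's preprocessing: capitalise and drop a trailing English plural 's'."""
    tag = tag.strip().capitalize()
    if tag.endswith("s") and len(tag) > 3:
        tag = tag[:-1]
    return tag

def classify_tag(direct_types, redirect_target, disambiguations):
    """Decide which of the four cases a tag falls into, given its query results."""
    if direct_types:
        return "unique resource with type(s): " + ", ".join(direct_types)
    if redirect_target:
        return f"alias; follow redirect to {redirect_target}"
    if disambiguations:
        return f"ambiguous; {len(disambiguations)} candidate concepts"
    return "no matching DBpedia resource"

# Hypothetical outcomes mirroring the examples in the text.
print(normalize_tag("dvds"))                                     # -> 'Dvd'
print(classify_tag(["opencyc:ExistingObjectType"], None, []))    # economy-like case
print(classify_tag([], "http://dbpedia.org/resource/DVD", []))   # dvd-like case
print(classify_tag([], None, ["Juno_(film)", "Juno_Beach"]))     # juno-like case
print(classify_tag([], None, []))                                # unknown tag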

5. RESULTS

From the 450 tags used we have identified 1960 different resources. For almost 50% of the tags we have not found any concept type using DBpedia.

For the remaining 242 concepts we have identified the 1960 concept types. Of these concepts, 78 have been identified as having a one-to-one matching with a given resource, no disambiguation being necessary. This gives a 32% chance of identifying the correct type once a DBpedia resource has been found for a specific tag, and a 15% chance of identifying the correct type of a specific tag in the folksonomy. There is also a very high number of resources that can be disambiguated rather easily, with only 2 or 3 alternatives. Around 20% of the identified resources have 2 different meanings or, said otherwise, can be linked to two different concepts in the DBpedia knowledge base. 8% of the resources we have identified have exactly three disambiguating resources, and 40% of the resources have more than 3 disambiguation possibilities.

The final part of the analysis of the results concerns the number of types identified for each resource. As DBpedia links data from multiple ontologies, each identified resource can have a large number of rdf:type properties. For the 1960 resources identified we have obtained a list of 2375 concept types, which means an average of about 1.2 types per resource. This suggests that once a resource has been correctly identified, it is very easy to link it to an ontology using a specific concept type.

6. CONCLUSIONS AND FUTURE DEVELOPMENTS

This paper describes an experiment meant to link folksonomies to ontologies. The results of the experiment show that, for the moment, this kind of connection can be made automatically only with a large error margin. The problems come mostly from the fact that most tags are not included in the knowledge base that we have used. This could be addressed by including more domain-oriented knowledge bases, by detecting typos made by the user when tagging and, probably, by using a dictionary of common abbreviations. This dictionary would maintain the abbreviations most commonly used in tagging. Some examples of such abbreviations that could not be identified are UCSC (University of California Santa Cruz) and CS (computer science).

The number of tags whose types were not identified is also large because some tags were matched to single resources for which there were no rdf:type connections in the DBpedia ontology. Such an example is given by http://dbpedia.org/resource/Webcast: this resource could be matched from a given tag, but there is no rdf:type connection between the resource in DBpedia and an external ontology. We estimate that for around 10% of the tags that we could not link to specific concepts through rdf:type links, the error lies within DBpedia.

Another source of errors, and the most relevant one, accounting for more than 60% of the found tags, is the impossibility of realizing the disambiguation. This happens because neither of the following two requirements is available:

- There is no context for a given tag. We do not consider (as we have not extracted them for the moment) the most common tag associations for the current folksonomy. If we considered the most common tag pairs, we could use them to identify the most likely disambiguation concept from the ones existing inside DBpedia.

- The tags are not always expressive. For example, there is content tagged with just one of the names of the authors or of the persons involved (first name or last name). In this case it is very difficult to guess automatically which person the content is about. This could be solved by means of machine learning if we had a repository of already annotated content.

The conclusion is that we can link tags to concepts, but in order for the accuracy of the process to be satisfactory it is necessary to have some additional data and tools.
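The abbreviation dictionary mentioned above could be as simple as a lookup table applied before the normalization step. The sketch below is a hypothetical illustration: the dictionary contents and the function name are ours, not part of the experiment.

# Hypothetical abbreviation dictionary; the entries below are only examples.
ABBREVIATIONS = {
    "ucsc": "University of California Santa Cruz",
    "cs": "Computer science",
}

def expand_tag(tag: str) -> str:
    """Expand a known abbreviation before the DBpedia lookup; otherwise return the tag unchanged."""
    return ABBREVIATIONS.get(tag.strip().lower(), tag)

print(expand_tag("UCSC"))     # -> 'University of California Santa Cruz'
print(expand_tag("webcast"))  # -> 'webcast' (no expansion known)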

REFERENCES

Auer, S., Bizer, C., Lehmann, J., Kobilarov, G., Cyganiak, R., Ives, Z.: DBpedia: A Nucleus for a Web of Open Data. In Aberer et al. (Eds.): The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007. Lecture Notes in Computer Science 4825, Springer, 2007, ISBN 978-3-540-76297-3.

Bollacker, K., Cook, R., Tufts, P.: Freebase: A Shared Database of Structured General Human Knowledge. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada. AAAI Press, 2007, ISBN 978-1-57735-323-2.

Echarte, F., Astrain, J., Córdoba, A., Villadangos, J.: Ontology of Folksonomy: A New Modelling Method. In Proceedings of the Semantic Authoring, Annotation and Knowledge Markup Workshop (SAAKM 2007), located at the 4th International Conference on Knowledge Capture (K-CAP 2007), Whistler, British Columbia, Canada, October 28-31, 2007. CEUR Workshop Proceedings 289.

Gruber, T.: A translation approach to portable ontology specifications. Knowledge Acquisition, v.5, n.2, pp. 199-220, June 1993.

Gruber, T.: Ontology of Folksonomy. Invited keynote at the First On-Line Conference on Metadata and Semantics Research (MTSR'05), 2005.

Mika, P.: Ontologies are us: A unified model of social networks and semantics. In International Semantic Web Conference, Lecture Notes in Computer Science, pages 522-536. Springer, 2005.

Suchanek, F., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, May 8-12, 2007, Banff, Alberta, Canada.

Van Damme, C., Coenen, T., Vandijck, E.: Deriving a lightweight corporate ontology from a corporate folksonomy. In Proceedings of the 11th International Conference on Business Information Systems (BIS 2008), Springer, pp. 207-216, 2008.

Vanderwal, T.: Tagging That Works. Web 2.0 Expo, San Francisco, California, 16 April 2007. Available at: http://www.slideshare.net/vanderwal/tagging-that-worksoreilly-web-20-expo/ (accessed 15 January 2009).

W3C: SPARQL Query Language for RDF. http://www.w3.org/TR/rdf-sparql-query/
