Datasets

Analysis of Structural Relationships for Hierarchical Cluster Labeling

Contact: Roman Kern, rkern[at]know-center.at

Short description:
Cluster label quality is crucial for browsing topic hierarchies obtained via document clustering. Intuitively, the hierarchical structure should influence the labeling accuracy. However, most labeling algorithms ignore such structural properties, so the impact of hierarchical structures on labeling accuracy is still unclear. In our work we integrate hierarchical information, i.e. sibling and parent-child relations, into the cluster labeling process. We adapt standard labeling approaches, namely Maximum Term Frequency, Jensen-Shannon Divergence, Chi Square Test, and Information Gain, to make use of those relationships and evaluate their impact on four datasets: the Open Directory Project, Wikipedia, TREC Ohsumed, and the CLEF IP European Patent dataset. We show that hierarchical relationships can be exploited to increase labeling accuracy, especially on high-level nodes.
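
As a rough illustration of how a sibling relation can enter a labeling score, the following minimal sketch ranks a cluster's terms by their contribution to the Jensen-Shannon divergence between the cluster's term distribution and that of its siblings. It is not the implementation used in the work described above. The function names (term_distribution, jsd_contribution, label_cluster), the whitespace tokenization, and the rule of keeping only terms that are more frequent in the cluster than in its siblings are all assumptions made for this example.

```python
import math
from collections import Counter


def term_distribution(docs):
    """Relative term frequencies over whitespace-tokenized documents."""
    counts = Counter(term for doc in docs for term in doc.lower().split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def jsd_contribution(p_t, q_t):
    """Contribution of a single term to the Jensen-Shannon divergence (base 2)."""
    m_t = 0.5 * (p_t + q_t)  # probability of the term under the mixture
    score = 0.0
    if p_t > 0:
        score += 0.5 * p_t * math.log2(p_t / m_t)
    if q_t > 0:
        score += 0.5 * q_t * math.log2(q_t / m_t)
    return score


def label_cluster(cluster_docs, sibling_docs, k=3):
    """Return up to k candidate label terms that separate a cluster from its siblings."""
    p = term_distribution(cluster_docs)
    q = term_distribution(sibling_docs)
    # Assumption: only terms over-represented in the cluster are label candidates.
    scores = {
        term: jsd_contribution(p_t, q.get(term, 0.0))
        for term, p_t in p.items()
        if p_t > q.get(term, 0.0)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data, invented for this example.
cluster = ["heart disease heart", "heart attack"]
siblings = ["lung disease", "lung cancer lung"]
print(label_cluster(cluster, siblings))
```

In this toy example the term "disease" is equally frequent in the cluster and in its siblings, so it is not considered as a label, whereas a purely frequency-based score that ignores the siblings would still rank it.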

You can find the crawled HTML content of the DMOZ hierarchy here. The root directory contains a file that maps each URL in the DMOZ articles to the sub-directory holding its crawled HTML content.

Download the data set (~10GB)

Annotated Blog Corpus: Facet Annotations on the TREC Blogs08 test collection

Contact: Elisabeth Lex, elex[at]know-center.at

Short description:
This dataset contains annotations for a subset of the TREC Blogs08 test collection [1]. In total, 83 blogs (12,870 blog posts) were manually annotated with the categories described below. Each category annotation can take one of three values: (i) true if the blog belongs to the category, (ii) false if it does not, and (iii) null if the category was not assessed for the blog.

If you use or build upon our work, please credit us by citing:

@InProceedings{Lex2010,
author = {Elisabeth Lex and Michael Granitzer and Markus Muhr and Andreas Juffinger},
title ={Stylometric Features for Emotion Level Classification in News Related Blogs},
booktitle ={Proceedings of the 9th ACM RIAO Conference},
year = {2010}
}

[1] A description of the Blogs08 test collection can be found at http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html. License details and information on how to get access to the TREC Blogs08 collection are provided in http://ir.dcs.gla.ac.uk/test_collections.

Description of the annotations:

# blog contains a short trailer to advertise a blog post
blog.type.teaser

# blog contains a personal diary
blog.type.diary

# blog deals with current events and news
blog.type.news

# blog contains e.g. product reviews
blog.type.review

# blog cannot be assigned to any of the above categories
blog.type.other

# fully objective blog content
blog.object.full

# medium objective blog content
blog.object.medium

# subjective blog content
blog.object.none

# blog is written by single male author
blog.authorship.single.male

# blog is written by single female author
blog.authorship.single.female

# blog is written by single author with unknown gender
blog.authorship.single.unknown

# company blog
blog.authorship.company

# blog that contains headlines or minor pieces of information
blog.authorship.ticker

# multi author blog
blog.authorship.multi

# blog contains detailed (indepth) information
blog.facet.indepth

# blog contains shallow information
blog.facet.shallow

# blog is fully emotional
blog.emotion.full

# blog is medium emotional
blog.emotion.medium

# blog is not emotional
blog.emotion.none

# blog contains factual and neutral information
blog.facet.factual

# blog contains opinions
blog.facet.opinionated

# blog is written by an expert
blog.facet.expert

# blog is written by a non expert
blog.facet.nonexpert

# blog is written in good style (complete and grammatically sound sentences, no emoticons)
blog.facet.wellformed

# blog is written in web style (emoticons, abbreviations, incomplete sentences)
blog.facet.webstyle

Download annotation-with-feedids [58.59 kB]

Annotated Tweets

Contact: Christopher Horn, chorn[at]know-center.at

Short description:
This dataset contains 4,897 annotated Tweets. The data was collected in March 2010 from Twitter’s public timeline. The text is anonymized, i.e. all Twitter usernames have been replaced with ‘@USER’. A description of how the categories were chosen can be found in my master’s thesis (see the citation below).

The format of the data file is as follows:

C1,C2,Text \n, where C1 is one of ‘News’ (N), ‘User’ (U) or ‘Company’ (C), C2 is either ‘Factual’ (F) or ‘Opinionated’ (O), and \n is the newline character. Each line of the file therefore holds exactly one annotated Tweet.
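
The following is a minimal sketch of how one line in this format might be parsed. It assumes that the file stores the single-letter codes (N, U, C and F, O) and that the Tweet text is not quoted, so the line is split on the first two commas only. The names AnnotatedTweet and parse_tweet_line are hypothetical and not part of the dataset.

```python
from typing import NamedTuple

# Assumption: the file uses the single-letter codes given in the format description.
SOURCE_LABELS = {"N": "News", "U": "User", "C": "Company"}
STYLE_LABELS = {"F": "Factual", "O": "Opinionated"}


class AnnotatedTweet(NamedTuple):
    source: str  # C1: News, User or Company
    style: str   # C2: Factual or Opinionated
    text: str    # anonymized Tweet text


def parse_tweet_line(line: str) -> AnnotatedTweet:
    """Parse one 'C1,C2,Text' line; raises KeyError on an unknown code."""
    # Split on the first two commas only, because the Tweet text may contain commas.
    c1, c2, text = line.rstrip("\n").split(",", 2)
    return AnnotatedTweet(
        SOURCE_LABELS[c1.strip()],
        STYLE_LABELS[c2.strip()],
        text.strip(),
    )


# Invented sample line, not taken from the dataset.
tweet = parse_tweet_line("N,F,@USER reports rain, wind and snow \n")
print(tweet.source, tweet.style, tweet.text)
```

If the file turns out to quote the text field, Python's csv module would be the more robust choice.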

If you use or build upon my work, please credit me by citing:

@MastersThesis{Horn2010,
author = {Christopher Horn},
title ={Analysis and Classification of Twitter messages},
school={Graz University of Technology, Austria},
year= {2010},
note={Online available at: http://www.know-center.tugraz.at/forschung/knowledge_relationship_discovery/dissertationen_diplomarbeiten}
}

Download annotated_tweets.csv [457 kB]

A demo of the classifier can be found here: http://twitterclassifier.knowminer.at. Enter a Twitter username and click the ‘start’ button. The application will then fetch the latest Tweets of that user and classify them. Please note that the Tweets have to be publicly available.

APOSDLE Dataset

Contact: Stefanie Lindstaedt, slind[at]know-center.at

Short description:
The Know-Center is publishing the first reference dataset from the EU research project APOSDLE in order to promote the reproducibility and generalizability of research results in the field of Technology Enhanced Learning (TEL).

As part of the dataTEL Theme Team, the Know-Center works with other European research institutions to standardize research on recommender systems in TEL. The standardization is intended to make the development and evaluation of recommender systems more transparent, and thereby to make it easier for the research community to exchange and reproduce research results.

The TELEurope portal (http://www.teleurope.eu/pg/groups/9405/datatel/) recently published a list of seven datasets from different TEL projects, each with a short description; the APOSDLE dataset is among them. More information on the dataset is available at http://www.teleurope.eu/pg/pages/view/50647

Because the Know-Center dealt intensively with various aspects of data protection over the course of the EU project APOSDLE, it is now able to publish a dataset from the project's evaluation phase. The dataset is available under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Austria license and comprises log data from a three-month evaluation phase of the APOSDLE system at ISN (http://www.innovation.at/).

Heart Disease Sub Hierarchy

Short description:
The Heart Disease sub-hierarchy of the OHSUMED collection for retrieval and classification, compiled into a directory structure (incl. pre-calculated term vectors using TF-IDF and stemming), can be downloaded here:

Demo Data OHSU Heart Disease [19.18 MB]

This test collection was used in the master's thesis of Michael Granitzer.
