the target information resources findable and understandable complement the human embodiments and subjectivity
• It depends on the genre of information needs and resources
add “Value-added information”
• Second order, third order, and N-th order information
• User tasks: Identification, Find, and Access
Methodology: Description (record) and Classification
be organized well to make it easier to find and understand
Group the common/similar items together
Describe items with the common attributes and structure
Explain the common properties with the same name
In general, “information organization” covers the following:
• Describe and extract the common structured information (metadata) as a record, and make it searchable.
Cataloguing and description
• Analyze the contents based on certain criteria, assign labels, and bring together information resources with common contents.
Subject analysis, classification, subject headings, and indexing
Rabbit, Gorilla Human Plumber, Performer Fruit Apple, Grape Vehicle Motorbike, Rocket Tool One piece, Pants Tent, Desk, Tuba, Pencil, Paper airplane
Note that we may need more documentation on the classification scheme if we need to classify more samples, or to classify them more precisely.
Information organization helps users (and the user community) to do the following tasks:
1. Identification
2. Find (subject search, content analysis)
3. Access (acquisition, referring to the location of the item)
keyword or category to browse through and discover what the content or subject matter is.
• In traditional information retrieval research, find tasks are divided into “subject search” and “known item search”
• Identification task
Distinguish one thing from another. The unit of identification (what counts as a "difference in things") varies depending on the area and use, such as clearly distinguishing different versions that have the same title
• Access task
Check the location of the item, get it, and/or access the resources on the network.
Note that all of these tasks are often carried out without the actual material, because we usually use surrogates (described metadata).
“Web”
• 【web】 (noun) A network of silken thread spun especially by the larvae of various insects (as a tent caterpillar) and usually serving as a nest or shelter.
https://commons.wikimedia.org/wiki/File:Spider_web_Belgium_Luc_Viatour.jpg
are the three main components of the Web.
• HTTP specifies the data transmission on the network and the type of the document format.
• URI specifies the address of web pages, and it enables the hyperlinks among them on the network.
Europe.
• Big science using high-speed accelerators
Material science, particle physics, etc.
• Large amounts of device information
• Massive needs for documenting and sharing information
Employees: about 2,500
Visiting scholars: about 15,000
of isolated-photon plus jet production in pp collisions at √s = 7 TeV with the ATLAS detector”. Nuclear Physics B, 875, 438-533 (2013)
The number of authors: over 5,800
and establishing the specifications)
• 1992 – 1993: Became popular gradually…
• 1993 – 1994: Gained popularity exponentially
Mosaic, Netscape, Yahoo!
• 1994 – 1995: Gained popularity in society at large
Windows 95, Amazon, …
and spread of the Web, its conflict
• The concept of “hypermedia” was coined and spread
Memex (Vannevar Bush) - 1945
Xanadu (Ted Nelson) - 1963?
WWW (Tim Berners-Lee) - 1989
• What the Web has lost
Integration of browsing and editing
Version control
Diverse and extensible hyperlinks
Copyright management and micropayment
Tim Berners-Lee: “Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web”. Harper Business, 2000, 256p.
Semantic Web. Scientific American, 2001, Vol.284, No.5, pp.35-43.
• From Web to “Semantic Web”
• Web markup to enable semantic description and machine understanding
a dentist who can stop by after work”
After work: weekdays 9:00-18:00
Stop by after work: Tsukuba Express Line (TX)
• Having a consultation after 18:00
• Stations along the TX line: Tsukuba, Kenkyu-gakuen, …, Minami-nagareyama, Kita-senju, …
• Within a 500-minute walk from the station
• (Personal assistant / agent)
Mon. “9:00-13:00 ; 15:00-19:00”
Closed days, medical hours
Holidays, public holidays, open all year round
• Understanding common sense
One week = Mon, Tue, Wed, Thu, Fri, Sat, Sun
Weekdays = Monday to Friday
• Information extraction from the Web markup
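The extraction step for medical hours can be sketched as follows. This is a minimal sketch, assuming the hours are always written in the "9:00-13:00 ; 15:00-19:00" form shown on the slide; the function names `parse_hours` and `is_open` are hypothetical and only serve the illustration.

```python
from datetime import time


def parse_hours(text):
    """Parse a string like '9:00-13:00 ; 15:00-19:00' into (start, end) time pairs."""
    ranges = []
    for part in text.split(";"):
        start, end = part.strip().split("-")
        start_h, start_m = map(int, start.split(":"))
        end_h, end_m = map(int, end.split(":"))
        ranges.append((time(start_h, start_m), time(end_h, end_m)))
    return ranges


def is_open(ranges, at):
    """Return True if the time 'at' falls inside any of the opening ranges."""
    return any(start <= at < end for start, end in ranges)


# Monday hours, copied from the slide.
monday = parse_hours("9:00-13:00 ; 15:00-19:00")

print(is_open(monday, time(18, 30)))  # a visit after work, after 18:00
print(is_open(monday, time(14, 0)))   # inside the midday break
```

A real agent would also need the "common sense" listed above (which days count as weekdays, how holidays are handled), which this sketch leaves out.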
the Web
Huge amounts of web spaces
Big data with various concepts and descriptions can be obtained
Diverse information providers
Multilingual and multicultural
Cannot assume strict use of controlled vocabularies and custom conventions
• Difficulty of a general-purpose model
It is difficult for computer applications to understand the meaning of things
data model
Directed graph with labels
Triple representation
• Features
Simple and highly expressive data model
Writing rules tend to be complicated
Processing operations take time
Resource (node) = URI (Uniform Resource Identifier)
• Inherits the decentralized features of the Web
Example: Harry Potter --Author--> J.K. Rowling
a “subject” resource, and build triples by its attributes and values
• { subject, predicate, object } ⇒ { this book, property, value }

Property   Value
Title      Weaving the Web
Author     Tim Berners-Lee
Publisher  Harper Business
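The mapping from a property/value record to triples can be sketched in a few lines. This is a minimal sketch using plain Python tuples rather than an RDF library; the helper name `record_to_triples` is hypothetical, and the string "this book" stands in for a real subject identifier.

```python
def record_to_triples(subject, record):
    """Turn a {property: value} record into (subject, predicate, object) triples."""
    return [(subject, prop, value) for prop, value in record.items()]


# The property/value record from the slide.
book = {
    "Title": "Weaving the Web",
    "Author": "Tim Berners-Lee",
    "Publisher": "Harper Business",
}

triples = record_to_triples("this book", book)
for triple in triples:
    print(triple)
```

Each row of the record becomes exactly one triple, so a record with three properties yields three triples that share the same subject.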
to other resources
Only a “resource” node can become the subject of a triple
• In this example, if the author is a “resource” node, another triple can be connected
Figure 1: This book --Author--> Tim Berners-Lee (literal); Birth: 1955
Figure 2: This book --Author--> Tim Berners-Lee (resource); Tim Berners-Lee --Name--> Tim Berners-Lee; Tim Berners-Lee --Birth--> 1955
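The difference between a literal and a resource node can be sketched as follows. This is a minimal sketch, not an RDF implementation: the `Resource` class, `add_triple`, and `objects` are hypothetical names, and the labels stand in for URIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A node that can be the subject of a triple (label stands in for a URI)."""
    label: str


def add_triple(graph, subject, predicate, obj):
    """Append a triple, rejecting subjects that are literals."""
    if not isinstance(subject, Resource):
        raise TypeError("only a resource node can be the subject of a triple")
    graph.append((subject, predicate, obj))


def objects(graph, subject, predicate):
    """Return all objects reachable from subject via predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


graph = []
book = Resource("this book")
author = Resource("author of this book")

add_triple(graph, book, "Author", author)              # object is a resource
add_triple(graph, author, "Name", "Tim Berners-Lee")   # object is a literal
add_triple(graph, author, "Birth", 1955)               # object is a literal

# Follow the link from the book to its author, then read the birth year.
(author_node,) = objects(graph, book, "Author")
print(objects(graph, author_node, "Birth"))
```

If the author were stored as the plain string "Tim Berners-Lee", `add_triple` would refuse to attach the birth year to it, which is the limitation the slide points out.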
data model
• Highly expandable by linking as a graph (network) (highly expressive)
• It does not matter who writes the data or where
• Uses only URI identifiers
Extend by combining RDF data described separately in other places
• Distribute RDF data further on the Web
Turn a web space composed of hypertext documents into a web space with linked data descriptions.
→ Linked Data framework, proposed by Tim Berners-Lee
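Combining RDF data described in separate places can be sketched as a set union of triples. This is a minimal sketch with plain Python sets; the two `example.org` URIs and the names `library_data`, `biography_data`, and `describe` are hypothetical, while the property URIs are the Dublin Core and FOAF terms.

```python
# Hypothetical identifiers, used only for this illustration.
BOOK = "http://example.org/book/weaving-the-web"
PERSON = "http://example.org/person/tim-berners-lee"

DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Data written by one party about the book.
library_data = {
    (BOOK, DC_TITLE, "Weaving the Web"),
    (BOOK, DC_CREATOR, PERSON),
}

# Data written elsewhere, by someone else, about the person.
biography_data = {
    (PERSON, FOAF_NAME, "Tim Berners-Lee"),
}

# Because both sides use the same URI for the person, a plain union links them.
merged = library_data | biography_data


def describe(graph, subject):
    """Collect the properties of one subject into a {predicate: object} dict."""
    return {p: o for s, p, o in graph if s == subject}


creator = describe(merged, BOOK)[DC_CREATOR]
print(describe(merged, creator)[FOAF_NAME])
```

No coordination between the two data sources is needed beyond agreeing on the URI, which is why it does not matter who writes the data or where.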
RDF data model, all the resources are identified by assigning a URI
• It is important to assign an appropriate URI to a resource
• Properties (predicates) are also identified by assigning a URI
There is no bare “title” property; the property is actually identified by the URI http://purl.org/dc/terms/title
Since URIs tend to be long, they are written with short names (prefixes) for convenience
http://purl.org/dc/terms/title → dc:title
• The URI http://purl.org/dc/terms/ is shortened with the prefix “dc:”
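The prefix mechanism is a simple string substitution, which can be sketched as follows. This is a minimal sketch; the function names `expand` and `shorten` are hypothetical, and the prefix table contains only the "dc:" mapping given on the slide.

```python
# Prefix table: short name -> namespace URI (mapping taken from the slide).
PREFIXES = {"dc": "http://purl.org/dc/terms/"}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'dc:title' into its full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {name!r}")
    return prefixes[prefix] + local


def shorten(uri, prefixes=PREFIXES):
    """Shorten a full URI to a prefixed name when a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no matching namespace: leave the URI unchanged


print(expand("dc:title"))
print(shorten("http://purl.org/dc/terms/title"))
```

The short form is only a notation; the identity of the property is always the expanded URI.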
following information: The title of a resource (URI) is “Home page of Masao Takaku”, its creator’s name is “Masao Takaku”.
https://masao.jpn.org --dc:title--> “Home page of Masao Takaku”
https://masao.jpn.org --dc:creator--> (creator)
(creator) --foaf:name--> “Masao Takaku”
(creator) --foaf:mbox--> mailto:[email protected]
an address that points to resources on the Web
If you type it in the browser address field, you will reach that resource
Since it has a separate address space for each web server, it can be used as a simple identifier
http://klis.tsukuba.ac.jp/school_affairs.html
Access scheme: http
Server address: klis.tsukuba.ac.jp
Location within the server: /school_affairs.html
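The three parts of the example address can be separated with Python's standard library. This is a minimal sketch using `urllib.parse.urlsplit`; the labels in the comments follow the slide's terminology.

```python
from urllib.parse import urlsplit

# The example URL from the slide.
parts = urlsplit("http://klis.tsukuba.ac.jp/school_affairs.html")

print(parts.scheme)  # access scheme
print(parts.netloc)  # server address
print(parts.path)    # location within the server
```

Because the server address is unique to each web server and the path is managed by that server, the combination works as an identifier without any central registry for individual pages.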
easy to create applications for each individual area with a simple data model
• Structuring information on individual resources
It is fine to start from wherever it is possible
Add links (properties) one by one
• Data models
Uses RDF data model = Triples
Data types are resources and literals
• Resources act as identifiers (URIs) with addresses on the web
• Transforms the current Web of documents into a “Web of Data”
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
4. Include links to other URIs, so that they can discover more things.
https://www.w3.org/DesignIssues/LinkedData.html
will not change URIs for 2 years, 20 years and 200 years later?
• What to leave out
Technical issues
• Filename (suffixes)
• Software (mechanisms)
• Drive names
Document management
• Author's name
• Topics
• Status
• Access permissions
https://www.w3.org/Provider/Style/URI.html
metadata description of various types of things for use by web search engines
• Proposed and maintained by major search engine companies: Google, Microsoft, Yahoo!, etc.
• Used in rich snippets on search results pages
https://schema.org/Hotel
https://schema.org/Book
etc.
• The number of datasets: 1,239
• LOD datasets around the world
Cross-domain, Geography, Government, Life sciences, Linguistics, Media, Publications, Social networking, User generated
(From Japan) NDL Web Authorities, Textbook LOD