Information Organization in the Web Age

Masao Takaku (高久雅生) [email protected]
Friday, 14 February 2020
TSUKUBA Short-term Study Program (TSSP) 2020

Me?
• Masao Takaku (高久雅生; たかくまさお)
• Research interests
  - Information retrieval, information seeking behaviour
  - Digital library
  - Linked Open Data (LOD)
• Contact:
  - Email: [email protected]
  - Twitter: @tmasao

My research area?
[Figure: Information System; Contents (document collections); User & Community (information needs); Organization]
My main research focus is to understand these elements and the relationships among them.

Contents
• Introduction
• What is Information Organization?
• What is the Web?
• Web & Information Organization
• Discussions

WHAT DOES “INFORMATION ORGANIZATION” REALLY MEAN? (What is “organization”?)

Let’s start with the conclusion...
• Information organization does the following:
  - Makes the target information resources findable and understandable
  - Complements human embodiment and subjectivity
    • It depends on the genre of information needs and resources
  - Adds “value-added information”
    • Second-order, third-order, and N-th order information
• User tasks: Identification, Find, and Access
• Methodology: Description (record) and Classification

What is information organization?
• Large amounts of information should be organized well to make it easier to find and understand
  - Group common/similar items together
  - Describe items with common attributes and structure
  - Explain common properties with the same name
In general, “information organization” covers the following:
• Describe and extract the common structured information (metadata) as a record, and make it searchable.
  - Cataloguing and description
• Analyze the contents based on certain criteria, assign labels, and bring together information resources with common contents.
  - Subject analysis, classification, subject headings, and indexing

Ex. Description and classification

Ex. Description and classification
Items: Tent, Rocket, Rabbit, Grape, Gorilla, Desk, Paper airplane, Pants, Performer, Motorbike, Plumber, Apple, Pencil, Tuba, One piece

Ex. Description and classification (Ordering)
Items: Tent, Rocket, Rabbit, Grape, Gorilla, Desk, Paper airplane, Pants, Motorbike, Apple, Pencil, Tuba, One piece, Plumber, Performer

Ex. Organize with attributes and values

Class     Item (value)
Animal    Rabbit, Gorilla
Human     Plumber, Performer
Fruit     Apple, Grape
Vehicle   Motorbike, Rocket
Tool      One piece, Pants, Tent, Desk, Tuba, Pencil, Paper airplane

Note that we may need more documentation on the classification scheme if we need to classify more samples, or classify them more precisely.
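The class/item table can be sketched as a small attribute-value structure. This is only an illustration of the idea: the helper names `class_of` and `items_in` are hypothetical, and the "Tool" row follows the slide text as extracted.

```python
# Classification scheme from the slide: class -> items (values).
CLASSIFICATION = {
    "Animal": ["Rabbit", "Gorilla"],
    "Human": ["Plumber", "Performer"],
    "Fruit": ["Apple", "Grape"],
    "Vehicle": ["Motorbike", "Rocket"],
    "Tool": ["One piece", "Pants", "Tent", "Desk", "Tuba",
             "Pencil", "Paper airplane"],
}


def items_in(class_name):
    """Return every item grouped under the given class."""
    return list(CLASSIFICATION.get(class_name, []))


def class_of(item):
    """Return the class an item was assigned to, or None if unclassified."""
    for class_name, items in CLASSIFICATION.items():
        if item in items:
            return class_name
    return None


if __name__ == "__main__":
    print(class_of("Rabbit"))
    print(items_in("Fruit"))
```

An unclassified item returns `None`, which is where the slide's note about documenting the scheme becomes relevant: a new sample needs a rule for which class it belongs to.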

Information organization in the context of information seeking
[Figure: Information System; Contents (document collections); User & Community (information needs); Organization]

Information organization in the context of information seeking (cont.)
• Information organization helps users (and the user community) to do the following tasks:
  1. Identification
  2. Find (subject search, content analysis)
  3. Access (acquisition, referring to the location of the item)

What are user tasks?
• Find task
  - Use any keyword or category to browse through and discover what the content or subject matter is.
    • In traditional information retrieval research, find tasks are divided into “subject search” and “known item search”
• Identification task
  - Distinguish one thing from another.
  - The unit of identification (“difference in things”) varies depending on the area and use, such as items that have the same title but must be clearly distinguished as different versions
• Access task
  - Check the location of the item, get it, and/or access the resources on the network.
Note that all of these tasks are often conducted without the actual material, because we usually use surrogates (described metadata).

In the context of description

In the context of description (cont.)
• Website: 田中宏和.com | 田中宏和宣言!! (Tanaka Hirokazu.com | Tanaka Hirokazu Declaration!!) http://www.tanakahirokazu.com/
• Book: 田中宏和. 田中宏和さん (Tanaka Hirokazu-san). リーダーズノート, 2010, 192p.

WORLD WIDE WEB (WWW): What is the Web?

World Wide Web
• WWW (World Wide Web), or just “Web”
• 【web】 (noun) A network of silken thread spun especially by the larvae of various insects (as a tent caterpillar) and usually serving as a nest or shelter.
Image: https://commons.wikimedia.org/wiki/File:Spider_web_Belgium_Luc_Viatour.jpg

Three elements of the Web
• HTTP, URI, and HTML are the three main components of the Web.
• HTTP specifies the data transmission on the network and the type of the document format.
• URI specifies the address of web pages, and it enables hyperlinks among them on the network.

Photo: Knight Foundation (2008) http://www.flickr.com/photos/knightfoundation/2467553359/

CERN
• International research institute for high-energy physics in Europe.
• Big science using high-speed accelerators
  - Materials science, particle physics, etc.
• Large amounts of device information
• Massive need for documenting and sharing information
  - Employees: about 2,500
  - Visiting scholars: about 15,000

Collaboration by many scientists around the world!
ATLAS Collaboration: “Dynamics of isolated-photon plus jet production in pp collisions at √s = 7 TeV with the ATLAS detector”. Nuclear Physics B, 875, 438-533 (2013)
The number of authors: over 5,800

Brief history of the Web
• 1989 – 1991: Proposed (design and establishment of the specifications)
• 1992 – 1993: Gradually became popular
• 1993 – 1994: Gained popularity exponentially
  - Mosaic, Netscape, Yahoo!
• 1994 – 1995: Gained popularity in society at large
  - Windows 95, Amazon, …

Very beginning of the Web
Screenshot of the original NeXT web browser in 1993
http://info.cern.ch/

[side story] From hypermedia and hypertext to the Web
The rise and spread of the Web, and its conflicts
• The concept of “hypermedia” was coined and spread
  - Memex (Vannevar Bush), 1945
  - Xanadu (Ted Nelson), 1963?
  - WWW (Tim Berners-Lee), 1989
• What the Web has lost
  - Integration of browsing and editing
  - Version control
  - Diverse and extensible hyperlinks
  - Copyright management and micropayment
Tim Berners-Lee: “Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web”. Harper Business, 2000, 256p.

Memex by Vannevar Bush (1945)

SEMANTIC WEB: The world of the Semantic Web

Semantic Web (1)
Tim Berners-Lee, James Hendler, Ora Lassila. The Semantic Web. Scientific American, 2001, Vol.284, No.5, pp.35-43.
• From the Web to the “Semantic Web”
• Web markup that enables semantic description and machine understanding

Semantic Web application (1)
• Example: “I want to find a dentist I can stop by after work”
  - After work: weekdays 9:00-18:00
  - Stop by after work: Tsukuba Express Line (TX)
    • Having a consultation after 18:00
    • Stations along the TX line: Tsukuba, Kenkyu-gakuen, …, Minami-nagareyama, Kita-senju, …
    • Within a 500-minute walk from the station
• (Personal assistant / agent)

Semantic Web application (2)
• Disambiguation
  - 月 = 月曜日 = Monday = Mon.
  - “9:00-13:00 ; 15:00-19:00”
  - Closed days, medical hours
  - Holidays, public holidays, open all year round
• Understanding common sense
  - One week = Mon, Tue, Wed, Thu, Fri, Sat, Sun
  - Weekdays = Monday to Friday
• Information extraction from Web markup
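The disambiguation step can be sketched in a few lines, assuming that day labels are mapped to one canonical form and that opening hours follow the “9:00-13:00 ; 15:00-19:00” pattern from the slide. The function names and the alias table are hypothetical; a real agent would need far richer vocabularies.

```python
# Map the different surface forms of a day name to one canonical label.
DAY_ALIASES = {
    "月": "Mon", "月曜日": "Mon", "Monday": "Mon", "Mon.": "Mon", "Mon": "Mon",
}

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]  # "Weekdays = Monday to Friday"


def normalize_day(label):
    """Return the canonical day label, or None when the label is unknown."""
    return DAY_ALIASES.get(label.strip())


def to_minutes(hhmm):
    """Convert 'H:MM' into minutes since midnight."""
    hours, minutes = hhmm.strip().split(":")
    return int(hours) * 60 + int(minutes)


def parse_hours(text):
    """Parse '9:00-13:00 ; 15:00-19:00' into (start, end) minute pairs."""
    ranges = []
    for part in text.split(";"):
        start, end = part.strip().split("-")
        ranges.append((to_minutes(start), to_minutes(end)))
    return ranges


def open_after(ranges, hhmm):
    """True if any opening period is still running after the given time."""
    limit = to_minutes(hhmm)
    return any(end > limit for _start, end in ranges)


if __name__ == "__main__":
    hours = parse_hours("9:00-13:00 ; 15:00-19:00")
    print(normalize_day("月曜日"), open_after(hours, "18:00"))
```

With the hours in machine-readable form, the “consultation after 18:00” condition becomes a simple comparison instead of a string match.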

Semantic Web components
• Identifier: URI
• Character set: Unicode
• Notation: XML
• Data exchange: RDF
• Vocabulary: RDFS
• Ontology: OWL
• Rule: RIF/SWRL
• Search: SPARQL
• Digital Signature
• Logic Reasoning
• Trust
• User Interface / Application

Issues of Semantic Web
• Decentralization + massive nature of the Web
  - Huge amounts of web space
  - Big data with various concepts and descriptions can be obtained
  - Diverse information providers
  - Multiple languages and cultures
  - Cannot assume strict use of controlled vocabularies and custom conventions
• Difficulty of a general-purpose model
  - It is difficult for computer applications to understand the meaning of things

RDF data model
• RDF (Resource Description Framework)
• Graph-based data model
  - Directed graph with labels
  - Triple representation
• Features
  - Simple and highly expressive data model
  - Writing rules tend to be complicated
  - Processing takes time
  - Resource (node) = URI (Uniform Resource Identifier)
    • Inherits the decentralized features of the Web
Example: Harry Potter → (Author) → J.K. Rowling

Description with triples (1)
• Consider the book itself as a “subject” resource, and build triples from its attributes and values
• { subject, predicate, object } ⇒ { this book, property, value }

Property    Value
Title       Weaving the Web
Author      Tim Berners-Lee
Publisher   Harper Business

Description with triples (2)
Graphical representation of triples:
This book → (Title) → Weaving the Web
This book → (Author) → Tim Berners-Lee
This book → (Publisher) → Harper Business

Description with triples (3)
Aggregate the triples that share the same resource into a single resource:
This book
  → (Title) → Weaving the Web
  → (Author) → Tim Berners-Lee
  → (Publisher) → Harper Business
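As a minimal sketch, the three triples about the book can be held as plain (subject, predicate, object) tuples and then aggregated by subject. The `aggregate` helper is a hypothetical name used only for this illustration.

```python
from collections import defaultdict

# Each statement is one (subject, predicate, object) triple.
TRIPLES = [
    ("This book", "Title", "Weaving the Web"),
    ("This book", "Author", "Tim Berners-Lee"),
    ("This book", "Publisher", "Harper Business"),
]


def aggregate(triples):
    """Merge triples that share a subject into one description per resource."""
    resources = defaultdict(dict)
    for subject, predicate, obj in triples:
        # A property may have several values, so values are kept in a list.
        resources[subject].setdefault(predicate, []).append(obj)
    return dict(resources)


if __name__ == "__main__":
    for subject, properties in aggregate(TRIPLES).items():
        print(subject, properties)
```

The three separate statements collapse into a single node with three outgoing edges, which matches the aggregated drawing of the graph.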

Description with triples (4)
• “Literal values” cannot be extended to other resources
  - Only a “resource” node can become the subject of a triple
• In this example, if the author is a “resource” node, another triple can be connected
Author as a literal: This book → (Author) → “Tim Berners-Lee”; a (Birth) → 1955 triple cannot be attached to the literal
Author as a resource: This book → (Author) → [Tim Berners-Lee]; [Tim Berners-Lee] → (Name) → “Tim Berners-Lee”; [Tim Berners-Lee] → (Birth) → 1955

Advantages of RDF data
• Formalized as a simple triple data model
• Highly extensible by linking as a graph (network) (highly expressive)
• It does not matter who writes the data or where
• Uses only URI identifiers
  - Extend by combining RDF data described separately in another place
• Distribute RDF data further on the Web
  - Turn a web space composed of hypertext documents into a web space with linked data descriptions.
  - → Linked Data framework, proposed by Tim Berners-Lee

The role of URI in RDF resources
• In the RDF data model, all resources are identified by assigning a URI
• It is important to assign an appropriate URI to a resource
• Properties (predicates) are also identified by assigning a URI
  - There is no bare “title” property; the property is actually identified by the URI http://purl.org/dc/terms/title
  - Since URIs tend to be long, they are written with short names (prefixes) for convenience
  - http://purl.org/dc/terms/title → dc:title
    • The URI http://purl.org/dc/terms/ is shortened to the prefix “dc:”
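The prefix mechanism amounts to a string substitution, sketched below. The `expand` and `shorten` helpers are hypothetical names; only the “dc:” mapping is taken from the slide.

```python
# Prefix table: short name -> namespace URI.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
}


def expand(short_name, prefixes=PREFIXES):
    """Turn a prefixed name such as 'dc:title' into its full URI."""
    prefix, separator, local_name = short_name.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {short_name!r}")
    return prefixes[prefix] + local_name


def shorten(uri, prefixes=PREFIXES):
    """Turn a full URI into a prefixed name when a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no matching namespace: leave the URI as it is


if __name__ == "__main__":
    print(expand("dc:title"))
    print(shorten("http://purl.org/dc/terms/creator"))
```

The short name is only a notation; what identifies the property is always the expanded URI.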

Example of RDF data
• The data representation for the following information: the title of a resource (URI) is “Home page of Masao Takaku”, and its creator’s name is “Masao Takaku”.
https://masao.jpn.org → (dc:title) → “Home page of Masao Takaku”
https://masao.jpn.org → (dc:creator) → [creator node]
[creator node] → (foaf:name) → “Masao Takaku”
[creator node] → (foaf:mbox) → mailto:[email protected]

Example of RDF data in RDF/Turtle format

    @prefix dc: <http://purl.org/dc/terms/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <https://masao.jpn.org/>
        dc:title "Home page of Masao Takaku" ;
        dc:creator [
            foaf:name "Masao Takaku" ;
            foaf:mbox <mailto:[email protected]>
        ] .
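To show how the nested description expands into individual triples, here is a rough sketch that writes the same information as N-Triples lines, where every URI is spelled out in full. The blank node label `_:creator` and the mailbox `mailto:someone@example.org` are placeholders assumed for this example, and telling URIs from literals by their scheme prefix is a simplification that a real RDF library would not rely on.

```python
DC = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# The [ ... ] block in Turtle stands for an anonymous (blank) node.
TRIPLES = [
    ("https://masao.jpn.org/", DC + "title", "Home page of Masao Takaku"),
    ("https://masao.jpn.org/", DC + "creator", "_:creator"),
    ("_:creator", FOAF + "name", "Masao Takaku"),
    ("_:creator", FOAF + "mbox", "mailto:someone@example.org"),  # placeholder
]


def term(value):
    """Format one term: blank node, URI, or quoted literal (simplified)."""
    if value.startswith("_:"):
        return value
    if value.startswith(("http://", "https://", "mailto:")):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize triples as one 'subject predicate object .' line each."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(TRIPLES))
```

Turtle's prefixes, semicolons, and brackets are only shorthand; the underlying data is this flat list of four triples.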

For reference: URI (Uniform Resource Identifier)
• Works as an address that points to resources on the Web
  - If you type it into the browser's address bar, you will reach that resource
  - Since each web server has its own separate address space, a URI can also be used as a simple identifier
Example: http://klis.tsukuba.ac.jp/school_affairs.html
  - Access scheme: http
  - Server address: klis.tsukuba.ac.jp
  - Location within the server: /school_affairs.html
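The three parts of the example URI can be separated with Python's standard `urllib.parse.urlsplit`. The dictionary keys below simply reuse the labels from the slide, and `describe_uri` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a URI into the three parts labelled on the slide."""
    parts = urlsplit(uri)
    return {
        "access_scheme": parts.scheme,
        "server_address": parts.netloc,
        "location_within_server": parts.path,
    }


if __name__ == "__main__":
    print(describe_uri("http://klis.tsukuba.ac.jp/school_affairs.html"))
```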

LINKED DATA

What is Linked Data?
• A proposal to make it easy to create applications for each individual area with a simple data model
• Structuring information on individual resources
  - It is fine to start from wherever it is possible
  - Add links (properties) one by one
• Data models
  - Uses the RDF data model = triples
  - Data types are resources and literals
    • Resources act as identifiers (URIs) with addresses on the Web
• Transforms the current Web of documents into a “Web of Data”

Linked Data Principles
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
4. Include links to other URIs, so that they can discover more things.
https://www.w3.org/DesignIssues/LinkedData.html
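Principles 2 and 3 can be sketched as an HTTP lookup that asks for an RDF representation. This assumes the server supports content negotiation and can return Turtle (`text/turtle`), which is not guaranteed for any particular site; the function names are hypothetical, and nothing is fetched unless `fetch_description` is called.

```python
from urllib.request import Request, urlopen


def build_lookup_request(uri, media_type="text/turtle"):
    """Prepare a GET request that asks for an RDF serialization of the URI."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


def fetch_description(uri, media_type="text/turtle", timeout=10):
    """Look up the URI and return the response body as text (network call)."""
    request = build_lookup_request(uri, media_type)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Building the request does not touch the network.
    request = build_lookup_request("http://example.org/thing")
    print(request.get_method(), request.full_url, request.get_header("Accept"))
```

The same URI serves as both the name of the thing and the address where its description, including links to other URIs, can be obtained.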

[Related topic] Cool URIs don't change
• Designing URIs which will not change
  - URIs for 2 years, 20 years, and 200 years later?
• What to leave out
  - Technical issues
    • Filename (suffixes)
    • Software (mechanisms)
    • Drive names
  - Document management
    • Author's name
    • Topics
    • Status
    • Access permissions
https://www.w3.org/Provider/Style/URI.html

Examples of Linked Open Data
• Web search engines
  - Entity search and rich snippets
• LOD dataset providers
  - DBpedia: http://ja.dbpedia.org/
  - NDL Web Authorities: https://id.ndl.go.jp/auth/ndla
  - CiNii Articles: http://ci.nii.ac.jp/
  - LC Linked Data: http://id.loc.gov/

Use of Linked Data: Entity search
https://www.google.co.jp/search?q=嘉納治五郎

Use of Linked Data: Rich snippets
https://www.google.co.jp/search?q=京王プラザホテル

Use of Linked Data: Rich snippets
https://www.google.com/search?q=ダイワロイネットホテルつくば

Metadata vocabulary for the Web: Schema.org
• Vocabulary for simple metadata description of various types of things, for use by web search engines
• Proposed and maintained by major search engine companies: Google, Microsoft, Yahoo!, etc.
• Used in rich snippets on search results pages
  - https://schema.org/Hotel
  - https://schema.org/Book
  - etc.

Example of Schema.org metadata embedded in web pages

    <div itemscope itemtype="http://schema.org/LodgingBusiness">
      <meta itemprop="name" content="Daiwa Roynet Hotel Tsukuba"/>
      <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <meta itemprop="addressCountry" content="JP"/>
        <meta itemprop="addressLocality" content="Tsukuba"/>
        <meta itemprop="addressRegion" content="Ibaraki Prefecture"/>
        <meta itemprop="streetAddress" content="1-5-7 AzumaTsukuba-shi"/>
        <meta itemprop="postalCode" content="305-0031"/>
      </div>
      <div itemscope itemprop="aggregateRating" itemtype="http://schema.org/AggregateRating">
        <div class="rating-score" itemprop="ratingValue">4.2</div>
        <span itemprop="reviewCount" content="1614">1,614</span>
        <meta itemprop="bestRating" content="5"/>
        <meta itemprop="worstRating" content="0"/>
      </div>
    </div>

Source: https://www.daiwaroynet.jp/tsukuba/
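How a program might read such markup can be sketched with Python's standard `html.parser`. This simplified collector (a hypothetical class, not a complete microdata parser) flattens nested items into one list and takes a property's value from its `content` attribute when present, otherwise from the element's text. The sample markup is a shortened version of the hotel example.

```python
from html.parser import HTMLParser

SAMPLE = """
<div itemscope itemtype="http://schema.org/LodgingBusiness">
  <meta itemprop="name" content="Daiwa Roynet Hotel Tsukuba"/>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <meta itemprop="addressLocality" content="Tsukuba"/>
  </div>
  <div itemscope itemprop="aggregateRating" itemtype="http://schema.org/AggregateRating">
    <div itemprop="ratingValue">4.2</div>
    <span itemprop="reviewCount" content="1614">1,614</span>
  </div>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collect (itemprop, value) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.properties = []
        self._pending = None  # property waiting for its text content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("itemprop")
        if name is None or "itemscope" in attributes:
            return  # not a property, or a nested item whose properties follow
        if "content" in attributes:
            self.properties.append((name, attributes["content"]))
        else:
            self._pending = name

    def handle_data(self, data):
        if self._pending and data.strip():
            self.properties.append((self._pending, data.strip()))
            self._pending = None


def extract_properties(html):
    collector = ItempropCollector()
    collector.feed(html)
    return collector.properties


if __name__ == "__main__":
    for name, value in extract_properties(SAMPLE):
        print(name, "=", value)
```

For `reviewCount` the machine-readable `content="1614"` wins over the displayed text “1,614”, which is the reason the markup carries both.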

Examples of Linked Data datasets: DBpedia
• Example: http://ja.dbpedia.org/page/つくば市
• Structured data is extracted and integrated from the free encyclopedia Wikipedia
  - http://mappings.dbpedia.org/index.php/Mapping_ja

https://ja.wikipedia.org/wiki/つくば市

http://ja.dbpedia.org/page/つくば市

Linked Open Data Cloud
http://lod-cloud.net/ (as of March 2019)
• The number of datasets: 1,239
• LOD datasets around the world
  - Cross-domain, Geography, Government, Life sciences, Linguistics, Media, Publications, Social networking, User generated
  - (From Japan) NDL Web Authorities, Textbook LOD

Summary (keywords)
• Information organization
  - Description and classification
  - User tasks: identify, find, access
• Web
  - HTTP, URI, and HTML
• Semantic Web
  - RDF data model, triples, URIs
• Linked Data
  - URI resources, Schema.org, datasets
