schemas. – RelaDonal data • Unstructured data – the other extreme. – Eg. Free text • Certain types of data are inbetween – Semi-‐structured – Schema variability across instances as well as Dme. – Eg. E-‐catalogs • XML supports a very flexible “schema” Lipyeow Lim -‐-‐ University of Hawaii at Manoa 2 ! Model – Brand = ViewSonic – Model = PJ551D – Cabinet Color = Black – Type = DLP ! Display – Panel = 0.55" DMD – Lens = Manual zoom/focus – Lamp =180W, 3,500 hours normal, up to 4,000 eco mode – Aspect Ratio = 4:3 (native), 16:9 • Model – Brand = TOSHIBA – Series = REGZA – Model = 52HL167 – Cabinet Color = Black • Display – Screen Size = 52“ – Recommended ResoluDon = 1920 x 1080 – Aspect RaDo =16:9 – … LCDTV Projector
– straighdorwardly usable over the Internet. – support a wide variety of applicaDons. – compaDble with SGML. – easy to write programs which process XML docs. – opDonal features in XML kept to the absolute minimum. – human-‐legible and reasonably clear. – easy to create. – Terseness in XML markup is of minimal importance. <item rdf:about="http://news.slashdot.org/story/ 09/11/17/2245241/Hackers-Broke-Into-Brazil-Grid-Last- Thursday?from=rss"> <title>Hackers Broke Into Brazil Grid Last Thursday</title> <link> http://rss.slashdot.org/~r/Slashdot/slashdot/~3/ JcTR_BoVsgI/Hackers-Broke-Into-Brazil-Grid-Last- Thursday</link> <description>An anonymous reader writes "A week ago, 60 Minutes had a story (we picked it up too) claiming that hackers had caused power outages in Brazil. While this assertion is now believed to be in error, hackers were inspired by the story actually to do what was claimed.…” </description> <dc:creator>kdawson</dc:creator> <dc:date>2009-11-17T23:41:00+00:00</dc:date> <dc:subject>security</dc:subject> <slash:department>wolf-no-really-this-time-i-mean-it</ slash:department> <slash:section>news</slash:section> <slash:comments>38</slash:comments> <slash:hit_parade>38,37,32,22,9,2,0</slash:hit_parade> <feedburner:origLink> http://news.slashdot.org/story/ 09/11/17/2245241/Hackers-Broke-Into-Brazil-Grid-Last- Thursday?from=rss</feedburner:origLink> </item> Lipyeow Lim -‐-‐ University of Hawaii at Manoa 3
XHTML – Webservice formats: SOAP, WSDL • File formats: – Microsog Office, Open Office, Apple’s iWork • Industrial – Insurance: ACORD – Clinical trials: cdisc – Financial: FIX, FpML – Mortgages: MISMO • Many applicaDons use XML as a data format for persistence or for data exchange Lipyeow Lim -‐-‐ University of Hawaii at Manoa 4
Hassanzadeh</author> <author>Anastasios Kementsietsidis</ author> <author>Lipyeow Lim</author> <author>Renée J. Miller</author> <author>Min Wang</author> <title> A framework for semantic link discovery over relational data.</title> <pages>1027-1036</pages> <year>2009</year> <booktitle>CIKM</booktitle> </inproceedings> </dblp> Lipyeow Lim -‐-‐ University of Hawaii at Manoa 5 dblp inproceedings author title pages year booktitle author author author author Oktie… A framework… @key conf… Anast… Lipyeow… Renee… Min… 1027-.. 2009 CIKM Tags or element names akributes Text values Parent-‐child relaDonship XML Document
• Simple API for XML (SAX) : programmers write callback funcDons for parsing events eg. when an opening “<author>” is encountered. • The XML tree is never materialized – Document Object Model (DOM) • The XML tree is materialized in memory • XML Query Languages – XPath : path navigaDon language – XQuery – XSLT : transformaDon language (ogen used in CSS) Lipyeow Lim -‐-‐ University of Hawaii at Manoa 6
directories. – RelaDve vs absolute • Examples: – /dblp/inproceedings/ author – //author – //inproceedings[year=2009 and bookDtle=CIKM]/Dtle • Results are sequences of nodes. • Think of a node as the XML fragment for the subtree rooted at that node. Lipyeow Lim -‐-‐ University of Hawaii at Manoa 7 dblp inproceedings author title pages year booktitle author author author author Oktie… A framework… @key conf… Anast… Lipyeow… Renee… Min… 1027-.. 2009 CIKM
locaDon steps separated by “/” of the form – Axisname::nodetest[predicate] • An axis defines a node-‐set relaDve to the current node: – self, parent, child, akribute – following, following-‐sibling – descendent, descendent-‐or-‐self – ancestor, ancestor-‐or-‐self – namespace – preceding, preceding-‐sibling • Examples – /child::dblp/child::inproceedings/akribute::author • /dblp/inproceedings/@key – /descendent-‐or-‐self::author • //author Lipyeow Lim -‐-‐ University of Hawaii at Manoa 8
of locaDon steps separated by “/” of the form – Axisname::nodetest[predicat e] • Predicates can be comparisons of atomic values or path expressions – //inproceedings[ year=“2009” and bookDtle=“CIKM”]/Dtle • A predicate is true if there exists some nodes that saDsfy the condiDons – // inproceedings[author=“Renee ”] Lipyeow Lim -‐-‐ University of Hawaii at Manoa 9 dblp inproceedings author title pages year booktitle author author author author Oktie… A framework… @key conf… Anast… Lipyeow… Renee… Min… 1027-.. 2009 CIKM
$auth in doc(dblp.xml)//author LET $Dtle=$auth/../Dtle WHERE $author/../year=2009 RETURN <author> <name>$auth/text()</name> <Dtle>$Dtle/text()</Dtle> <author> FOR $auth in doc(dblp.xml)//author[../ year=2009] RETURN <author> <name>$auth/text()</name> <Dtle>$auth/../Dtle/text()</Dtle> <author> Lipyeow Lim -‐-‐ University of Hawaii at Manoa 10 dblp inproceedings author title pages year booktitle author author author author Oktie… A framework… @key conf… Anast… Lipyeow… Renee… Min… 1027-.. 2009 CIKM
in DBMS ? • Inherent mismatch between relaDonal model and XML data model • Approach #1: BLOBs – Parse on demand • Approach #2: shredding – Decompose XML data to mulDple tables – Translate XML queries to SQL on those tables • Approach #3: NaDve XML store – Hybrid storage & query engine – Columns of type XML Lipyeow Lim -‐-‐ University of Hawaii at Manoa 11
Hawaii at Manoa 12 Hybrid Relational-XML data 15 20 30 price detail Zinfandel 3 Riesling 2 Burgundy 1 type id A B D B C D A B D B C D DB2/XML CREATE TABLE Product( id INTEGER, Specs XML ); INSERT INTO Product VALUES(1, XMLParse( DOCUMENT ’<?xml version=’1.0’> <ProductInfo> <Model> <Brand>Panasonic</Brand> <ModelID> TH-58PH10UK </ModelID> </Model> <Display> <ScreenSize>58in </ScreenSize> <AspectRatio>16:9 </AspectRatio> <Resolution>1366 x 768 </Resolution> … </ProductInfo>’) ); SELECT id FROM Product AS P WHERE XMLExists(‘$t/ProductInfo/ Model/Brand/Panasonic’ PASSING BY REF P.Specs AS "t")
document • XMLexists – checks if an XPath expression matches anything • XMLTable – converts XML into one table • XMLQuery – executes XML query Lipyeow Lim -‐-‐ University of Hawaii at Manoa 13 SELECT X.* FROM emp, XMLTABLE ('$d/dept/employee' passing doc as "d" COLUMNS empID INTEGER PATH '@id', firstname VARCHAR(20) PATH 'name/first', lastname VARCHAR(25) PATH 'name/last') AS X SELECT XMLQUERY( ‘$doc//item[productName=“iPod”]' PASSING PO.Porder as “doc”) AS "Result" FROM PurchaseOrders PO;
Namespace, Tag names • Path IDs for paths • XML tree parDDoned into regions & packed into pages. • Regions index track the pages associated with the XML structure Lipyeow Lim -‐-‐ University of Hawaii at Manoa 14
with specific XPaths. CREATE INDEX idx1 ON dept(deptdoc) GENERATE KEY USING XMLPATTERN ‘/dept/ employee/name’ AS SQL VARCHAR(35) • Index matching requires both the path and the type to match. – Queries involving /dept/employee/name and explicitly uses varchar or string for the type associated with the element can exploit the valued index Lipyeow Lim -‐-‐ University of Hawaii at Manoa 15
indexes we want to map the value associated with an XML pakern to nodes in the XML data tree • Key part of index entry is the “value” • Instead of a rid, an index entry stores the (region ID, node ID) Lipyeow Lim -‐-‐ University of Hawaii at Manoa 16 Non-leaf Pages Pages (Sorted by search key) Leaf