CS298 Proposal

Efficient Replication of XML Documents with BLOB data

Preethi Vishwanath (prithari@yahoo.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Chris Pollett, Dr. Robert Chun, Mr. Tuong Truong (pollett@cs.sjsu.edu) (rchun@cs.sjsu.edu) (tctruong@us.ibm.com)

Abstract:

With the rapid increase in streaming media content, there has been renewed interest in the delivery mechanism (e.g. Mobile Ad Hoc Network Security, BitTorrent) of these in a distributed environment. There is a scope for improvement of efficiency for certain mechanisms of streaming media data delivery. For example, in MANETs (Mobile Ad Hoc Networks) where nodes are connected to each other through wireless links with each node acting as a router forwarding data to its peers would cause substantial amount of wasted bandwidth and data replication at intermediate nodes. Our project is to create a mechanism to efficiently replicate XML documents and Binary Large Objects (BLOBs) that these contain across a distributed database network. For better explanation, let us consider a scenario wherein we have XML documents consisting of BLOBs represented by playlists. A user accessing the streaming media content (usually represented as an XML file) may be interested only in some part of the music content (individual BLOBs), but is forced to download all of them, since they are part of the same playlist (XML document). As part of CS297, we have successfully created a distributed (Java RMI based) infrastructure where multiple machines across a network interact with each other to agree on a location for a specific BLOB. We will seek to now use the infrastructure to arrive at a solution for the real-world problem of (a) identifying the minimal subset of blobs within the set of XML documents (playlists) that need be replicated such that all users in the network have access to all their requested BLOBs, and (b) the ideal location for storing this composite set of BLOBs (playlist). We believe that collectively, the algorithms proposed here that will be implemented on the infrastructure we developed as part of CS297, have real world relevance, for e.g. these can be applied in playlist formation, and exchange of BLOBS like music/video files in emerging devices such as cell phones and MP3 players.

CS297 Results

  • Added XML parsing functionality to Postgres
  • Implementation of Byzantine algorithm on a distributed system.
  • Extending Byzantine Algorithm for multiple BLOBs.
  • Collection of information on MANET, Bit Torrent, Google Videos etc.

Proposed Schedule

Week 1-2: 1/25-2/061/25-2/06
Week 3-4: Calender_Date_2 Implement a program to perform Byzantine agreement such that all the voters would agree on the machine, should the BLOB be replaced such that, voters who are located at machines far away from the original machine can easily access the same. This would be performed on a token ring model.
Week 5-6: 2/13-2/20 Implement the above arrangement on an IP address model
Week 7-8: 2/20-3/06 Implement a program which checks frequency of requests for BLOBs. Mechanisms such as setting timer / counters can be used to check when each BLOB has been accessed
Week 9: 3/06-3/21Statistics for token ring network and IP address. Creation of testing framework and perform testing on the implementation performed.
Week 10-12: 3/21-4/10Start writing report
Week 13: 4/10-4/17Submit draft to committee
Week 14-15: 4/17-5/2Prepare for presentation
Week 16: 5/2 ? 5/4Final Presentation

Key Deliverables:

  • Software
    • A program which examines all the voters and their distance from the BLOB that they need to access. Based on the frequency of access and the distance of the voters from the machines on which the BLOBs are located, all the voters perform Byzantine to come on agreement on which machine the BLOB should be replicated to.
    • The above deliverables were performed considering a token ring model. This needs to be modified to Internet Protocol Model.
    • Implement a program which checks frequency of requests for BLOBs. In case of a distributed network scenario, there could be various users situated across various machines requesting for certain BLOBs. Few of them could be more commonly in demand in comparison to others. The attempt of this deliverable is to come up with an algorithm which would check how frequently each BLOB is accessed by various machines across the network and determine which the most commonly accessed ones are.
    • Obtain statistics for the above performed work. Deliver a testing framework for testing above deliverable work performed. A rough model of the framework would be a group, consisting of multiple members collaborating together to make lets say a movie; wherein data is exchanged via cell phones.
  • Report
    • Detailed description of the architecture of software
    • Detailed description of results obtained during the course of the semester

Innovations and Challenges

  • Deciding an algorithm to check which frequency is good enough for a BLOB to be considered important enough requires certain amount of analyses and research.
  • Designing an algorithm to implement Byzantine across an IP address model is challenging. One of the approache would be to take each quad of the IP address and perform Byzantine on the same. Each following quad would depend on the address which has been obtained in the previous operations.
  • Since this is a distributed system, designing a mechanism to achieve synchronization between voters is not an easy task.
  • Creating a testing framework across a distributed network would be a challenging task.

References:

[Hall1999] Principles of Distributed Database Systems (2nd edition). M. Tamer Ozsu and Patrick Valduriez. Prentice Hall. 1999

[Serge2003] Dynamic XML Documents with distribution and replication. Serge Abiteboul, Angela Bonifati, Gregory Cobena, Ioana Manolescu, Tova Milo. Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM Press. 2003. Pages 527-538

[Alonso2000] Database replication techniques: a three parameter classification. M Wiesmann, F Pedone, A Schiper, B Kemme, G Alonso. Proceedings of the 19th IEEE Symposium on Reliable Distributed Systems. IEEE Computer Society. 2000. Page 206

[Hara2004] Dynamic data replication schemes for mobile ad-hoc network based on aperiodic updates. T. Hara and S.K. Madria, Proc. Int'l Conf. on Database Systems for Advanced Applications (DASFAA 2004), pp.869-881, 2004.

[Sailhan2003] Cooperative caching in ad hoc networks.F. Sailhan and V. Issarny. Proc. Int'l Conf. on Mobile Data Management (MDM'03), pp.13-28, 2003.

[Wang2002] Efficient and guaranteed service coverage in partitionable mobile ad-hoc networks. K. Wang and B. Li. Proc. IEEE Infocom'02, Vol.2, pp.1089-1098, 2002.