Archive for November, 2014

Big data application scenarios

  • reporting
  • analysis (user churn, system monitoring, security and attack prevention)
  • decision-making

Spark goals: one stack to rule them all

  • batch (MapReduce, iterative computation)
  • interactive
  • streaming

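Spark Core: the underlying computation model, which uses a single unified model to support batch, interactive, and streaming workloads.

As hardware has advanced, making effective use of memory has become the key.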

http://www.infoq.com/cn/articles/spark-core-rdd

Spark Streaming actually computes in mini-batches, with a latency of about half a second at best. It is currently the best solution for a lambda architecture.
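The mini-batch idea can be illustrated with a small sketch in plain Python. This is not the Spark API: `word_count`, `micro_batches`, and `run_streaming` are hypothetical names made up for this illustration, and the 0.5 second interval is chosen only to echo the half-second figure above. The point it tries to show is that the same batch function can serve both a whole dataset and a stream that has been cut into small time windows.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, word); hypothetical record shape


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """An ordinary batch job: count how often each word occurs."""
    counts: Dict[str, int] = defaultdict(int)
    for word in records:
        counts[word] += 1
    return dict(counts)


def micro_batches(events: Iterable[Event], interval: float) -> List[Tuple[float, List[str]]]:
    """Group events into consecutive windows of `interval` seconds.

    Returns (window start time, records) pairs ordered by time.
    """
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, word in events:
        buckets[int(timestamp // interval)].append(word)
    return [(index * interval, buckets[index]) for index in sorted(buckets)]


def run_streaming(
    events: Iterable[Event],
    interval: float,
    job: Callable[[Iterable[str]], Dict[str, int]],
) -> List[Tuple[float, Dict[str, int]]]:
    """Apply the same batch job to every mini-batch in turn."""
    return [(start, job(records)) for start, records in micro_batches(events, interval)]


events = [(0.1, "a"), (0.3, "b"), (0.6, "a"), (0.9, "a"), (1.2, "b")]

# Streaming view: one result per half-second window.
results = run_streaming(events, 0.5, word_count)

# Batch view: the same function applied to the whole dataset at once.
total = word_count(word for _, word in events)
```

In this sketch the results cannot be available sooner than the end of each window, which is why the batch interval sets a floor on latency in a mini-batch design.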

 

Theory series

  • Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
  • On Designing and Deploying Internet-Scale Services
  • Paxos Made Simple
  • How to Build a Highly Available System Using Consensus
  • Consensus on Transaction Commit
  • Time, Clocks, and the Ordering of Events in a Distributed System
  • Eventually Consistent Transactions
  • SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

Google series

  • Web Search for a Planet: The Google Cluster Architecture
  • The Google File System
  • MapReduce: Simplified Data Processing on Large Clusters
  • Bigtable: A Distributed Storage System for Structured Data
  • Megastore: Providing Scalable, Highly Available Storage for Interactive Services
  • GFS: Evolution on fast-forward
  • Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications
  • Dremel: Interactive Analysis of Web-Scale Datasets
  • F1 – The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business
  • Spanner: Google’s Globally-Distributed Database
  • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
  • Chubby: The Chubby Lock Service for Loosely-Coupled Distributed Systems
  • Availability in Globally Distributed Storage Systems

Jeff Dean series

  • Designs, Lessons and Advice from Building Large Distributed Systems
  • Challenges in Building Large-Scale Information Retrieval Systems
  • Experiences with MapReduce, an Abstraction for Large-Scale Computation
  • Taming Service Variability, Building Worldwide Systems, and Scaling Deep Learning
  • Large-Scale Data and Computation: Challenges and Opportunities
  • Achieving Rapid Response Times in Large Online Services

Other systems:

  • The Hadoop Distributed File System
  • Dynamo: Amazon’s Highly Available Key-value Store
  • Cassandra – A Decentralized Structured Storage System
  • PNUTS: Yahoo!’s Hosted Data Serving Platform
  • ZooKeeper: Wait-free coordination for Internet-scale systems
  • Finding a needle in Haystack: Facebook’s photo storage
  • Flat Datacenter Storage
  • Hive – A Petabyte Scale Data Warehouse Using Hadoop
  • A Study of Linux File System Evolution