Teacher A-Ming's Blog (ㄚ銘老師的部落格)

Data collection
1. flume
  1. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
  2. http://flume.apache.org/
2. logstash
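
The Flume description above mentions "streaming data flows" and "tunable reliability mechanisms". A Flume agent is usually described as a source feeding a channel that a sink drains. The following is a minimal Python sketch of that idea, not Flume's actual API: `MemoryChannel`, `run_source`, `run_sink`, and the `deliver` callback are hypothetical names chosen for illustration, and the rollback-on-failure behaviour is a simplified assumption about how a sink might avoid losing events.

```python
from collections import deque


class MemoryChannel:
    """Hypothetical bounded in-memory buffer between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        # Refuse new events when full so the source can apply back-pressure.
        if len(self._events) >= self.capacity:
            raise OverflowError("channel full")
        self._events.append(event)

    def take_batch(self, max_events):
        # Remove and return up to max_events events in arrival order.
        batch = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch

    def rollback(self, batch):
        # Put an undelivered batch back at the front, preserving order.
        self._events.extendleft(reversed(batch))


def run_source(lines, channel):
    """Wrap each log line in an event and push it into the channel."""
    for line in lines:
        channel.put({"body": line.rstrip("\n")})


def run_sink(channel, deliver, batch_size=2):
    """Drain the channel in batches; stop and roll back if delivery fails."""
    delivered = 0
    while True:
        batch = channel.take_batch(batch_size)
        if not batch:
            return delivered
        try:
            deliver(batch)
        except IOError:
            channel.rollback(batch)
            return delivered
        delivered += len(batch)


channel = MemoryChannel(capacity=10)
run_source(["GET /a\n", "GET /b\n", "GET /c\n"], channel)
collected = []
delivered_count = run_sink(channel, collected.extend)
```

The design point is that the channel decouples the two sides: the source only needs to know whether the buffer accepted an event, and the sink only removes events for good once delivery has succeeded.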

Distributed database synchronization systems
1. https://github.com/alibaba/otter
  1. canal: MySQL data synchronization https://github.com/alibaba/canal
2. sqoop
  1. Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  2. http://sqoop.apache.org/

Automated data synchronization flows
1. http://nifi.apache.org/

Data synchronization tools
1. MySQL replication protocol, Go implementation: https://github.com/siddontang/go-mysql
2. MySQL replication protocol, Python implementation: https://github.com/noplay/python-mysql-replication

DataX
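1. DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SqlServer, Postgre, HDFS, Hive, ADS, HBase, TableStore(OTS), MaxCompute(ODPS), and DRDS.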
2. https://github.com/alibaba/DataX

ETL

KETTLE
1. https://community.hds.com/docs/DOC-1009855

Offline job scheduling

Hadoop job scheduling
1. http://oozie.apache.org/
2. https://azkaban.github.io/

zeus (originally Alibaba's Zeus)
1. https://github.com/ctripcorp/dataworks-zeus

Open-source job scheduling by an individual developer
1. https://github.com/xuxueli/xxl-job

control-m
1. https://baike.baidu.com/item/control-m/176677?fr=aladdin

Data platform job scheduling and practice
1. https://www.jianshu.com/p/bddffdfea00b
2. https://www.jianshu.com/p/428ae367a38b

autosys

etl-automation

tws (ibm)

TASKCTL
1. http://www.taskctl.com/Service/Document

JobCtrl
1. Massive-scale job scheduling and monitoring platform: Primeton JobCtrl
2. http://www.primeton.com/

EDB

USE

SMC

JMC

Moia

Compute engines & frameworks

spark
1. http://spark.apache.org/

tez

hadoop-mapreduce
1. http://hadoop.apache.org/

bigflow
1. https://github.com/baidu/bigflow

storm

flink
1. Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
2. http://flink.apache.org/

hive
1. The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
2. http://hive.apache.org/
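
The Hive description above says that "structure can be projected onto data already in storage". The following is a minimal Python sketch of that schema-on-read idea, not Hive itself: the raw text stays untouched and a schema is applied only when the data is read. `RAW_FILE`, `SCHEMA`, and `project` are hypothetical names, and mapping unparseable values to `None` (standing in for SQL NULL) is an assumption made for this illustration.

```python
import csv
import io

# Raw delimited text as it might already sit in storage; nothing is
# validated when the file is written.
RAW_FILE = "1,alice,30\n2,bob,notanumber\n3,carol,41\n"

# The schema lives apart from the data and is applied at read time.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def project(raw_text, schema):
    """Apply column names and types to raw rows while reading them."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                # Assumption: a value that does not fit the type reads as NULL.
                row[name] = None
        rows.append(row)
    return rows


rows = project(RAW_FILE, SCHEMA)

# Roughly what "SELECT AVG(age)" would compute, skipping NULL ages.
ages = [r["age"] for r in rows if r["age"] is not None]
average_age = sum(ages) / len(ages)
```

Because the schema is separate from the stored bytes, a different schema can be projected onto the same file later without rewriting the data.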

impala

Big data storage

OLTP(on-line transaction processing)

OLAP(on-line analytical processing)
1. PALO
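  1. Baidu Data Warehouse Palo is a petabyte-scale MPP data warehouse service offered on Baidu Cloud. It provides high-performance analytics and report queries over large datasets at relatively low cost.
  2. Palo is an OLAP-oriented database product rather than an OLTP-oriented one. Products with a similar positioning include commercial data warehouse systems such as Greenplum, Vertica, and Exadata, and cloud services such as Amazon RedShift and Google BigQuery; those products are a useful reference for understanding Palo.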
2. https://cloud.baidu.com/doc/PALO/System.html#.E7.B3.BB.E7.BB.9F.E6.9E.B6.E6.9E.84
3. Cloud-native MySQL database for unlimited scalability and performance
  1. http://radondb.io/
4. tidb: an open-source distributed NewSQL relational database developed in China (billed as fully MySQL compatible)
  1. https://pingcap.com/index.html
5. kudu: open-source distributed NoSQL OLAP database
  1. A new addition to the open source Apache Hadoop ecosystem, Apache Kudu completes Hadoop’s storage layer to enable fast analytics on fast data.
  2. http://kudu.apache.org/
  3. Reference documents:
    1. Xiaomi's Kudu real-time analytics system, and a comparison of Kudu, HBase, and Parquet https://baijia.baidu.com/s?old_id=581124
6. kylin
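  1. Apache Kylin™ is an open-source distributed analytics engine that provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop/Spark to support extremely large datasets. It was originally developed by eBay Inc. and contributed to the open-source community. It can query huge Hive tables with sub-second latency.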
  2. http://kylin.apache.org/
7. greenplum
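  1. Greenplum DB bills itself as the world's first open-source massively parallel data warehouse. It was originally based on PostgreSQL and has since added many database innovations. Greenplum offers powerful, fast analytics at PB-scale data volumes, aimed in particular at big data analysis, and supports very high-performance analytic queries over big data.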
  2. https://greenplum.org/
  3. http://www.greenplum.net.cn/
  4. Reference material
    1. "Greenplum Resource Isolation Guide" https://yq.aliyun.com/articles/57763
    2. "Three Diagrams for Understanding the Right Way to Use Greenplum in the Enterprise" https://yq.aliyun.com/articles/57736
8. Vertica
  1. https://www.vertica.com/
9. Exadata
10. Amazon RedShift
11. Google BigQuery

parquet: columnar storage and data processing framework in the Hadoop ecosystem
1. https://parquet.apache.org/
2. Applicable scenarios:
3. Use cases:
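
To illustrate what columnar storage means for Parquet, here is a toy Python sketch that pivots row records into per-column arrays. It does not read or write the real Parquet file format. `ROWS`, `to_columns`, and `run_length_encode` are hypothetical names, and the run-length step is a simplified stand-in for the kind of encoding that a column of repeated values makes possible.

```python
# Row-oriented records, as an application would usually produce them.
ROWS = [
    {"user": "a", "country": "TW", "amount": 10},
    {"user": "b", "country": "TW", "amount": 25},
    {"user": "c", "country": "JP", "amount": 7},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(ROWS)

# An aggregate over one field touches only that column's values.
total_amount = sum(columns["amount"])

# Similar values sit next to each other, so they compress well.
country_runs = run_length_encode(columns["country"])
```

The two benefits shown, reading only the needed column and compressing adjacent similar values, are the usual reasons analytic queries favour a columnar layout.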

Elasticsearch is a distributed, RESTful search and data analytics engine capable of addressing a growing number of use cases
1. https://www.elastic.co/
  1. Related ecosystem
    1. logstash
    2. beats
    3. kibana

hbase: distributed column-oriented storage
1. Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
2. https://hbase.apache.org/
3. Chinese-language reference: http://abloz.com/hbase/book.html
4. Related technologies
  1. openTSDB: a time series database built on HBase
    1. The Scalable Time Series Database. Store and serve massive amounts of time series data without losing granularity.
    2. http://opentsdb.net/
  2. kylin

prestodb: open-source distributed SQL engine for interactive analytic queries
1. Distributed SQL Query Engine for Big Data
2. https://prestodb.io/
3. http://prestodb-china.com/
4. https://github.com/CHINA-JD/presto/