[Repost] Index of Big Data Technologies @ Teacher A-Ming's Blog (ㄚ銘老師的部落格)

Source: http://whatua.com/category/bigdata/

github: https://github.com/whomm/bigdata-tech-index
Products in China and Abroad
1. Data analysis and computing platform products in China
  1. Sensors Data (神策)
    1. https://www.sensorsdata.cn
      1. https://www.sensorsdata.cn/blog/technical_implementation_of_sensors_analytics/
      2. https://www.sensorsdata.cn/manual/
  2. growingio
    1. https://www.growingio.com/
  3. Haizhi (海致):
    1. https://www.bdp.cn/home.html
  4. Alibaba Cloud Quick BI
    1. https://data.aliyun.com/product/bi
  5. finebi
    1. http://www.finebi.com/
      1. finereport
      2. http://www.finereport.com/
2. Data analysis platforms outside China
  1. Tableau (data analysis):
    1. https://www.tableau.com/
  2. http://www.pentaho.com/
    1. ETL
      1. KETTLE
        
        Pentaho Data Integration ( ETL ) a.k.a Kettle
        
        https://github.com/pentaho/pentaho-kettle
        
        https://wiki.pentaho.com/display/COM/Community+Wiki+Home
  3. http://www.spagobi.org/
  4. https://www.bmc.com/
    1. CONTROL-M
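      1. Control-M is a digital business automation solution that simplifies and automates a wide range of batch application workloads. It helps meet SLAs and speeds up application deployment across infrastructure, data, and applications.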
      2. http://www.bmcsoftware.cn/it-solutions/control-m.html
      3. http://www.doc88.com/p-1863463402569.html
  5. https://www.teradata.com.cn
Data Visualization
1. Superset:
  1. https://superset.incubator.apache.org/
2. Reporting tool: https://git.oschina.net/max256/morpho
3. Related technologies
  1. Front-end technologies
    1. echarts
      1. http://echarts.baidu.com/
    2. antv
      1. https://antv.alipay.com/zh-cn/index.html
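4. CBoard: an open-source BI dashboard platform that supports interactive multidimensional report design and data analysis
  1. https://github.com/yzhang921/CBoard
5. DataV: Alibaba Cloud's data visualization product
Data Synchronization
1. Data transport
  1. Kafka: a distributed streaming platform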
    1. http://kafka.apache.org/
  2. ActiveMQ
  3. RabbitMQ
2. Data collection
  1. flume
    1. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
    2. http://flume.apache.org/
  2. logstash
3. Distributed database synchronization systems
  1. https://github.com/alibaba/otter
    1. canal (MySQL data synchronization): https://github.com/alibaba/canal
  2. sqoop
    1. Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
    2. http://sqoop.apache.org/
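4. Automated data synchronization flows
  1. http://nifi.apache.org/
5. Data synchronization tools
  1. MySQL replication protocol, Go implementation: https://github.com/siddontang/go-mysql
  2. MySQL replication protocol, Python implementation: https://github.com/noplay/python-mysql-replication
6. DataX
  1. DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SqlServer, Postgre, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.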
  2. https://github.com/alibaba/DataX
ETL
1. KETTLE
  1. https://community.hds.com/docs/DOC-1009855
Offline Job Scheduling
1. Hadoop job scheduling
  1. http://oozie.apache.org/
  2. https://azkaban.github.io/
2. Zeus, originally from Alibaba
  1. https://github.com/ctripcorp/dataworks-zeus
3. Individually developed open-source job scheduler
  1. https://github.com/xuxueli/xxl-job
4. control-m
  1. https://baike.baidu.com/item/control-m/176677?fr=aladdin
5. Data platform job scheduling in practice
  1. https://www.jianshu.com/p/bddffdfea00b
  2. https://www.jianshu.com/p/428ae367a38b
6. autosys
7. etl-automation
8. tws (ibm)
9. TASKCTL
  1. http://www.taskctl.com/Service/Document
10. JobCtrl
  1. Large-scale job scheduling and monitoring platform: Primeton JobCtrl
  2. http://www.primeton.com/
11. EDB
12. USE
13. SMC
14. JMC
15. Moia
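
The schedulers listed in this section differ widely in features, but they share one basic idea: a job runs only after the jobs it depends on have finished. The snippet below is a minimal sketch of that idea using Python's standard `graphlib` module. The job names and the `run_order` helper are hypothetical, chosen for illustration only, and this is not how any of the listed products is implemented.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return one valid execution order for {job: set_of_prerequisite_jobs}."""
    # static_order() yields every job after all of its prerequisites.
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical nightly pipeline: each job lists the jobs it must wait for.
jobs = {
    "extract_logs": set(),
    "load_to_warehouse": {"extract_logs"},
    "build_report": {"load_to_warehouse"},
}

order = run_order(jobs)
print(order)
```

A real scheduler adds timing, retries, alerting, and distributed execution on top of this ordering step. If the dependencies contain a cycle, `TopologicalSorter` raises `graphlib.CycleError`, which is a useful early check before any job is launched.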
Compute Engines & Frameworks
1. spark
  1. http://spark.apache.org/
2. tez
3. hadoop-mapreduce
  1. http://hadoop.apache.org/
4. bigflow
  1. https://github.com/baidu/bigflow
5. storm
6. flink
  1. Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
  2. http://flink.apache.org/
7. hive
  1. The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
  2. http://hive.apache.org/
8. impala
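Big Data Storage
1. OLTP (On-Line Transaction Processing)
2. OLAP (On-Line Analytical Processing)
  1. PALO
    1. Baidu Data Warehouse Palo is a PB-scale MPP data warehouse service offered on Baidu Cloud. It provides high-performance analytics and report queries over large datasets at relatively low cost.
    2. Palo is not an OLTP-oriented database product but an OLAP-oriented one. Products with a similar positioning include commercial data warehouse systems such as Greenplum, Vertica, and Exadata, and cloud services such as Amazon RedShift and Google BigQuery. These products are useful reference points for understanding Palo.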
  2. https://cloud.baidu.com/doc/PALO/System.html#.E7.B3.BB.E7.BB.9F.E6.9E.B6.E6.9E.84
  3. RadonDB: cloud-native MySQL database for unlimited scalability and performance
    1. http://radondb.io/
  4. TiDB: an open-source distributed NewSQL relational database developed in China (highly MySQL compatible)
    1. https://pingcap.com/index.html
  5. Kudu: an open-source distributed NoSQL OLAP database
    1. A new addition to the open source Apache Hadoop ecosystem, Apache Kudu completes Hadoop’s storage layer to enable fast analytics on fast data.
    2. http://kudu.apache.org/
    3. References:
      1. Xiaomi's Kudu real-time analytics system, plus a comparison of Kudu, HBase, and Parquet: https://baijia.baidu.com/s?old_id=581124
  6. kylin
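    1. Apache Kylin™ is an open-source distributed analytics engine that provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop/Spark to support extremely large datasets. It was originally developed by eBay Inc. and contributed to the open-source community. It can query huge Hive tables with sub-second latency.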
    2. http://kylin.apache.org/
  7. greenplum
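    1. Greenplum DB bills itself as the world's first open-source massively parallel data warehouse. It was originally based on PostgreSQL and has since added many database innovations. Greenplum provides powerful, fast analytics at PB scale, particularly for big data analytics, and supports very high-performance analytical queries over big data.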
    2. https://greenplum.org/
    3. http://www.greenplum.net.cn/
    4. References
      1. "Greenplum Resource Isolation Guide": https://yq.aliyun.com/articles/57763
      2. "Three Diagrams to Understand the Right Way to Use Greenplum in the Enterprise": https://yq.aliyun.com/articles/57736
  8. Vertica
    1. https://www.vertica.com/
  9. Exadata
  10. Amazon RedShift
  11. Google BigQuery
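3. Parquet: columnar storage for the Hadoop ecosystem and its data processing frameworks
  1. https://parquet.apache.org/
  2. Applicable scenarios:
  3. Example use cases:
4. Elasticsearch is a distributed, RESTful search and data analytics engine that addresses a growing range of use cases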
  1. https://www.elastic.co/
    1. Related ecosystem
      1. logstash
      2. beats
      3. kibana
5. HBase: distributed column-oriented storage
  1. Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
  2. https://hbase.apache.org/
  3. Chinese-language reference: http://abloz.com/hbase/book.html
  4. Technology extensions
    1. OpenTSDB: a time series database built on HBase
      1. The Scalable Time Series Database. Store and serve massive amounts of time series data without losing granularity.
      2. http://opentsdb.net/
    2. kylin
6. PrestoDB: an open-source distributed SQL engine for interactive analytic queries
  1. Distributed SQL Query Engine for Big Data
  2. https://prestodb.io/
  3. http://prestodb-china.com/
  4. https://github.com/CHINA-JD/presto/
7. Distributed file storage
  1. https://github.com/chrislusf/seaweedfs
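
Several entries in this section, Parquet in particular, are described as columnar storage. The sketch below illustrates why that layout suits analytical (OLAP-style) queries: once records are regrouped by column, an aggregate over one field reads only that field's values. The sample records and the `to_columns` helper are hypothetical, and this plain in-memory regrouping is only an illustration of the layout, not the Parquet file format itself.

```python
def to_columns(rows):
    """Regroup a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


# Hypothetical row-oriented records, as an OLTP-style system might store them.
rows = [
    {"user": "a", "city": "Taipei", "amount": 10},
    {"user": "b", "city": "Beijing", "amount": 25},
    {"user": "c", "city": "Taipei", "amount": 7},
]

columns = to_columns(rows)

# An analytical query such as "total amount" touches a single column.
total_amount = sum(columns["amount"])
print(total_amount)
```

Real columnar formats add per-column encoding, compression, and metadata for skipping data. The benefit shown here, reading only the columns a query needs, is the part that carries over.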