博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
面向流数据的应用构建指南
阅读量:6790 次
发布时间:2019-06-26

本文共 3600 字,大约阅读时间需要 12 分钟。

hot3.png

面向流数据应用构建指南

本文的目的是探讨使用Storm、Spark或其他软件的流数据处理软件构建流应用的关键点。

我使用Storm和Flume构建流应用的关键点:

1. 流应用的SLA(服务等级)要求

流数据的应用的数据传递要求:

  • At-most once (最多一次,数据丢失是可以接受的。)
  • At-least once (至少一次,数据丢失是不可接受的。)
  • Exactly once (精确一次,幂等计算)

这些要求严重影响到元组的行为(in Storm), 或者下采样是可以接受的 (sensor data processing)。

2. 最小数据重放

In Guaranteed Processing SLA use cases, data replay must be minimized by application logic to avoid situations like replay loops and heavy back pressure. This situation is seen in poorly designed topologies where incorrect exception handling leads to a single datapoint being infinitely replayed.

3. Minimize processing latencies

Latencies for processing individual events/tuples adds up to the cumulative processing latencies therefore streaming application must be engineered for low latency using appropriate technologies for local in-memory caching and micro-batching where possible to minimize network latencies and amortize the cost over several calls.

4. Tradeoffs between Throughput and Latency

Performance tradeoffs usually are between Throughput Vs. Latency and can be tuned using Micro-batching (Storm Trident or Spark Streaming). Micro-batching approach may use time-based or size-based batches both of which have some caveats.

Size-based micro-batching may add latencies since the buffer must fill up before processing applied or transfers are executed and is therefore subject to event velocity. This model can't be used if there are hard latencies limits for the application however, if a sustained minimum event rate (velocity) is guaranteed then micro-batching can be applied while preserving acceptable latencies.

Time-based micro-batching is subject to stability and performance issues if spikes (volume and velocity) are not accounted. Time-based micro-batching can satisfy hard-latency constraints of a use case.

A hybrid model of micro-batching can also be deployed which uses both size based and time based batching and has hard limits on both to guarantee low latencies and high throughput while providing stability.

5. Aggregation Bottlenecks

Use cases where Streaming Aggregations are performed must

  • account for event volume at the pivot point of aggregation (aggregation key) to avoid bottlenecks and pipeline back pressures
  • account for the distribution of event volume across aggregation keys

6. Polling and Event Sourcing

Polling and Event Sourcing are two prominent design patterns for updating configuration and logic in a Streaming pipeline. Logic may include but is not limited to: Dynamic Filtering, Caching for data enrichment and Machine Learning Models.

In Polling based design these updates are polled for at a pre-determined frequency in an out-of-band fashion (separate update thread).

In Event Sourcing based design these updates are delivered to the processing component as Event and the processing component has a separate code branch (if statement) to handle these kinds of events and execute logic update.

Using Event Souring approach allows for a lock-free design whereas Polling based design requires locking (at some point) and concurrent data structures to be used to guarantee consistency.

7. Error Handling

Error Handling must be given thorough design thought as incorrect error handling will lead to application downtime and performance issues. Errors that are recoverable may allow for replays like network unavailability or failover, however unrecoverable errors must be written to a separate stream without blocking the primary execution pipeline. NiFi handles this very seamlessly using Success and Failure streams.

转载于:https://my.oschina.net/u/2306127/blog/859041

你可能感兴趣的文章
iOS 编写高质量Objective-C代码(八)
查看>>
CVE-2011-2461原理分析及案例
查看>>
Duo Security 研究人员对PayPal双重验证的绕过
查看>>
如何优雅地取消Retrofit请求?
查看>>
什么是Java内存模型
查看>>
动手实操 | 如何用 Python 实现人脸识别,证明这个杨幂是那个杨幂?
查看>>
Flutter完整开发实战详解(一、Dart语言和Flutter基础) | 掘金技术征文
查看>>
聊聊flink DataStream的window coGroup操作
查看>>
[译文]Domain Driven Design Reference(六)—— 提炼战略设计
查看>>
自己实现一个滑动窗口
查看>>
Android小知识-了解下WebView
查看>>
js实现数据结构及算法之列表(List)
查看>>
疯狂Kotlin讲义连载之 命令行编译、运行Kotlin
查看>>
Java开发者不会这些永远都只能是三流程序员,细数一下你是不是?
查看>>
Pipenv:新一代Python项目环境与依赖管理工具
查看>>
Java知识点归纳-J2EE 、 Web 部分
查看>>
Android开发之最流行的网络框架Retrofit - jsonp及特殊响应数据格式处理
查看>>
前端路由跳转基本原理
查看>>
Web安全概述
查看>>
init.rc中添加服务以在开机时启动脚本
查看>>