A Guide to Building Streaming Data Applications
The purpose of this article is to discuss the key considerations in building streaming applications with Storm, Spark, or other stream-processing software.
Here are the key points I have learned from building streaming applications with Storm and Flume:
1. SLA (Service Level) Requirements of the Streaming Application
A streaming application has one of the following data delivery requirements:
- At-most once (data loss is acceptable)
- At-least once (data loss is not acceptable)
- Exactly once (idempotent computation)
These requirements strongly influence tuple acking behavior (in Storm) and whether down-sampling is acceptable (e.g., in sensor data processing).
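As an illustration of how the delivery requirement shapes code, the sketch below shows a Storm bolt implementing at-least-once delivery: every emit is anchored to the input tuple, and the input is explicitly acked or failed. This is a minimal sketch assuming the Storm 2.x API; the bolt name and the processing step are placeholders.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class EnrichBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String enriched = input.getString(0).toUpperCase(); // placeholder processing
            // Anchoring the emit to 'input' ties downstream failures back to the
            // spout's tuple tree, which is what provides at-least-once replay.
            collector.emit(input, new Values(enriched));
            collector.ack(input);
        } catch (Exception e) {
            // fail() triggers a replay from the spout (at-least-once semantics).
            collector.fail(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("enriched"));
    }
}
```

For at-most-once semantics, the same bolt would simply emit without anchoring and never fail the tuple; exactly-once additionally requires idempotent downstream writes.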
2. Minimize Data Replay
In use cases with a guaranteed-processing SLA, data replay must be minimized by application logic to avoid situations like replay loops and heavy back pressure. Such situations are seen in poorly designed topologies, where incorrect exception handling leads to a single data point being replayed indefinitely.
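One way to prevent such replay loops is to bound the number of replays per message and dead-letter anything that exceeds the bound. Below is a minimal, framework-agnostic sketch, assuming message IDs are stable across replays; the class name, the MAX_RETRIES value, and the dead-letter handling are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BoundedReplayTracker {
    private static final int MAX_RETRIES = 3;  // illustrative bound
    private final Map<Object, Integer> retryCounts = new ConcurrentHashMap<>();

    /** Called when a message fails. Returns true if it should be replayed. */
    public boolean shouldReplay(Object msgId) {
        int attempts = retryCounts.merge(msgId, 1, Integer::sum);
        if (attempts > MAX_RETRIES) {
            retryCounts.remove(msgId);
            return false; // route to a dead-letter queue instead of replaying
        }
        return true;
    }

    /** Called when a message succeeds, so the counter does not leak. */
    public void onAck(Object msgId) {
        retryCounts.remove(msgId);
    }
}
```

In Storm, shouldReplay() would be consulted in the spout's fail(msgId) callback before re-emitting the message.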
3. Minimize Processing Latencies
The latency of processing each individual event/tuple adds up to the cumulative processing latency. A streaming application must therefore be engineered for low latency, using appropriate technologies such as local in-memory caching, and micro-batching where possible, to minimize network latencies and amortize their cost over several calls.
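As one concrete example, the sketch below caches enrichment lookups locally using Guava's LoadingCache so that most tuples avoid a network round trip; lookupFromRemoteStore() is a hypothetical remote call, and the size and expiry settings are illustrative.

```java
import java.util.concurrent.TimeUnit;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class EnrichmentCache {
    private final LoadingCache<String, String> cache = CacheBuilder.newBuilder()
            .maximumSize(100_000)                       // bound memory usage
            .expireAfterWrite(5, TimeUnit.MINUTES)      // tolerate slightly stale data
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String key) {
                    return lookupFromRemoteStore(key);  // only on a cache miss
                }
            });

    public String enrich(String key) {
        return cache.getUnchecked(key);
    }

    private String lookupFromRemoteStore(String key) {
        // placeholder for a real HBase/Redis/HTTP lookup
        return "value-for-" + key;
    }
}
```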
4. Tradeoffs between Throughput and Latency
Performance tradeoffs are usually between throughput and latency, and can be tuned using micro-batching (Storm Trident or Spark Streaming). A micro-batching approach may use time-based or size-based batches, both of which have caveats.
Size-based micro-batching may add latency, since the buffer must fill before processing is applied or transfers are executed, and is therefore subject to event velocity. This model cannot be used if the application has hard latency limits; however, if a sustained minimum event rate (velocity) is guaranteed, then micro-batching can be applied while preserving acceptable latencies.
Time-based micro-batching is subject to stability and performance issues if spikes (in volume and velocity) are not accounted for. Unlike the size-based model, time-based micro-batching can satisfy the hard latency constraints of a use case.
A hybrid micro-batching model can also be deployed, which uses both size-based and time-based batching with hard limits on both, guaranteeing low latency and high throughput while providing stability.
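A minimal sketch of such a hybrid micro-batcher follows; the class, parameter names, and thresholds are illustrative. The batch is flushed on whichever limit, size or time, is hit first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class HybridMicroBatcher<T> {
    private final int maxBatchSize;       // size limit: bounds per-call overhead
    private final long maxLingerMs;       // time limit: bounds worst-case latency
    private final Consumer<List<T>> flushFn;
    private List<T> buffer = new ArrayList<>();
    private long batchStartMs = System.currentTimeMillis();

    public HybridMicroBatcher(int maxBatchSize, long maxLingerMs, Consumer<List<T>> flushFn) {
        this.maxBatchSize = maxBatchSize;
        this.maxLingerMs = maxLingerMs;
        this.flushFn = flushFn;
    }

    public synchronized void add(T event) {
        if (buffer.isEmpty()) {
            batchStartMs = System.currentTimeMillis();
        }
        buffer.add(event);
        maybeFlush();
    }

    /** Call periodically (e.g., from a timer) so time-based flushes fire even when traffic stops. */
    public synchronized void maybeFlush() {
        boolean sizeHit = buffer.size() >= maxBatchSize;
        boolean timeHit = !buffer.isEmpty()
                && System.currentTimeMillis() - batchStartMs >= maxLingerMs;
        if (sizeHit || timeHit) {
            List<T> batch = buffer;
            buffer = new ArrayList<>();
            flushFn.accept(batch);
        }
    }
}
```

In Storm, the periodic maybeFlush() call maps naturally onto tick tuples; in a standalone service, a scheduled executor can drive it.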
5. Aggregation Bottlenecks
Use cases where streaming aggregations are performed must
- account for the event volume at the pivot point of the aggregation (the aggregation key) to avoid bottlenecks and pipeline back pressure
- account for the distribution of event volume across aggregation keys, since a skewed ("hot") key concentrates load on a single task (see the sketch after this list)
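When the distribution is heavily skewed toward one hot key, a common mitigation (not specific to any one framework) is two-phase aggregation with salted keys: phase one pre-aggregates on key#salt so the hot key is spread across tasks, and phase two strips the salt and merges the partials. A minimal sketch, with illustrative names:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class SaltedAggregation {
    static final int SALT_BUCKETS = 8;  // illustrative fan-out for the hot key

    // Phase 1: the salt spreads a hot key across SALT_BUCKETS partial aggregators.
    static String saltedKey(String key) {
        return key + "#" + ThreadLocalRandom.current().nextInt(SALT_BUCKETS);
    }

    // Phase 2: strip the salt and merge partial counts back onto the original key.
    static Map<String, Long> merge(Map<String, Long> partials) {
        Map<String, Long> totals = new HashMap<>();
        partials.forEach((salted, count) -> {
            String key = salted.substring(0, salted.lastIndexOf('#'));
            totals.merge(key, count, Long::sum);
        });
        return totals;
    }
}
```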
6. Polling and Event Sourcing
Polling and Event Sourcing are two prominent design patterns for updating configuration and logic in a streaming pipeline. Such logic may include, but is not limited to: dynamic filtering, caching for data enrichment, and machine learning models.
In a polling-based design, these updates are polled for at a predetermined frequency in an out-of-band fashion (a separate update thread).
In an event-sourcing-based design, these updates are delivered to the processing component as events, and the processing component has a separate code branch (an if statement) that handles these kinds of events and executes the logic update.
Using the event-sourcing approach allows for a lock-free design, whereas a polling-based design requires locking (at some point) and concurrent data structures to guarantee consistency.
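The sketch below illustrates the lock-free property of the event-sourcing approach in a Storm bolt: updates arrive as tuples on a dedicated stream (the "config_updates" stream ID is an assumption) and are applied on the same single execute() thread that processes data tuples, so no locks or concurrent structures are needed.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class FilteringBolt extends BaseRichBolt {
    private OutputCollector collector;
    private String filterPattern = ".*";  // the dynamically updated logic

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        if ("config_updates".equals(input.getSourceStreamId())) {
            // The separate code branch: apply the update in place. No locks are
            // required because execute() is invoked by a single thread.
            filterPattern = input.getString(0);
        } else if (input.getString(0).matches(filterPattern)) {
            collector.emit(input, new Values(input.getString(0)));
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
    }
}
```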
7. Error Handling
Error handling must be given thorough design thought, as incorrect error handling will lead to application downtime and performance issues. Recoverable errors, such as transient network unavailability or a failover, may allow for replays; unrecoverable errors, however, must be written to a separate stream without blocking the primary execution pipeline. NiFi handles this very seamlessly with its success and failure relationships.
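In Storm, the same pattern can be expressed with a dedicated failure stream, as in the sketch below; the stream name and the processing step are illustrative. The key point is that the failed tuple is acked, not replayed, so the primary pipeline keeps moving.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SafeProcessingBolt extends BaseRichBolt {
    static final String ERROR_STREAM = "errors";  // illustrative stream name
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            collector.emit(input, new Values(process(input.getString(0))));
        } catch (Exception unrecoverable) {
            // Divert to the failure stream; the subsequent ack means the tuple
            // is NOT replayed and the primary pipeline is not blocked.
            collector.emit(ERROR_STREAM, input,
                    new Values(input.getString(0), unrecoverable.toString()));
        }
        collector.ack(input);
    }

    private String process(String event) { return event.trim(); } // placeholder

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("result"));                        // default stream
        declarer.declareStream(ERROR_STREAM, new Fields("event", "error"));
    }
}
```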