(millisecond, nanosecond) as timestamp. • But current Fluentd timestamp cannot hold sub-second. • If we support nanosecond resolution, it will be able to cover all requirements because it is minimum resolution generally. (but there will be a little more overhead than millisecond/microsecond)
(same with current Timestamp) • @nsec: nanosecond integer • In most cases, it behaves just like current Timestamp • It is serialized as MessagePack Ext type • Fluent::Engine.now and built-in parsers returns EventTime
works fine just as it is. • EventTime behaves like current Timestamp in most cases. • If a plugin doesn’t want to loose sub-second, it may need additional code to handle EventTime. • I have checked many many plugins.
(forward plugins), external data format (MessagePack) have to serialize EventTime as a different type from old timestamp, but this may breaks old nodes. • Introduced time_as_integer option to output forward plugin to force timestamp to be serialized as Integer (same with the current timestamp).
forward plugin receives PackedForward, it is deserialized and serialized for converting EventTime to Integer. • By keeping source nodes old or keeping time_as_integer true on them, we can set false to time_as_integer on relay node.
sub- second part, it is difficult to cache results of Time.strptime (Time.strptime is heavy task). • Introduced strptime gem, which can precompile a format string.
for customers in near real-time • e.g. number of records/bytes over event collector number of running jobs number of allocated CPU cores for a particular job • Reduce support cost by making customers to understand how they are using our computing resources • Make TD staffs known what/how our customers are doing data processing
know how they are using our computing resources • Collector for various metrics sent from workers • Storages to store metrics (InfluxDB/TD) • API server to handle requests from dashboard and query backend storage • Dashboard Application
Almost system administrator doesn't know about how many logs they are generating. • The number of logs are affected by many events, like release of new services, new versions of apps, and so on.
*OqVY%# send metrics How to collect metrics • Workers send metrics to monitoring Server. • Monitoring Server filters and aggregates (if needed) metrics, then store them to InfluxDB and TD.
data) query cache How to show metrics • Dashboard asks API Server for metrics based on its configuration. • API Server queries InfluxDB or TD(Presto) based on the time window. • Dashboard renders graph for query results.
data to InfluxDB and TD(presto) • InfluxDB hold only recent data(e.g. 30 days) • TD(presto) hold all data • Why I chose InfluxDB? • It is fast enough. • We can write time-series data in a hash and query it on the fly flexibly, so it makes trial and error easy. 5% 1SFTUP *OqVY%# query query store store
from dashboard and queries InfluxDB or TD(presto) • Only when old data is needed, API Server queries TD(presto), otherwise InfluxDB • It can cache query results in redis. "1* 4FSWFS 3FEJT query cache query (only old data)
happening InfluxDB degradation. • Because then API server uses TD(Presto) automatically. • To add new metrics, you just need to add configuration to monitoring server and dashboard application. • Of course, workers need to be able to send the metrics.