the project at Facebook in 2014. He serves as a member of the Osquery Technical Steering Committee, and works with a variety of clients to integrate osquery into their operations as the Principal Engineer at Dactiv LLC. Zach believes that we can make security accessible to everyone through open-source tools. Zach Wasserman Principal Engineer Dactiv LLC Page 2 @osqueryatscale @thezachw
created by Mike Arpaia at Facebook.
• October 2014 - Facebook open-sources osquery.
• 2014-2019 - Facebook maintains osquery as an open-source project.
• June 2019 - Osquery project is handed to the Linux Foundation for a community support model.
• October 2019 - Osquery 4.0.2 becomes the first stable release of osquery as a project of the Linux Foundation.
• Current - Community is working to establish a regular release cycle, define a roadmap for the future, raise funds for the project, and help osquery grow.
to discuss the status of the project and develop plans.
  ◦ Next meeting: February 4, 10 AM PST
  ◦ Join #officehours in osquery Slack for announcements.
• Donate to the osquery project.
  ◦ Funds are not yet earmarked, but raising funds for the project will allow us to improve testing infrastructure and hire devs to work on osquery features and maintenance.
• Volunteer to test releases.
  ◦ Organizations that can deploy beta versions of the agent can help ensure the quality of stable releases.
state of the systems we manage.
• Resource utilization has a real impact on the bottom line of our business.
  ◦ Production: Resource utilization is multiplied over each production server. More performance impact = higher cost.
  ◦ Workstations: We need to ensure that security workloads do not interfere with employees doing their jobs.
• Osquery is built for performance, but it is easy to schedule queries that will have a significant performance impact on the system.
Watchdog
• Develop monitoring for resource consumption of queries.
  ◦ The osquery_schedule table
• Deploy new queries in a controlled manner.
  ◦ Host grouping and query sharding
• Investigate performance.
  ◦ Profiling and SQLite explain query plan
two processes:
  ◦ Parent process - The “watchdog”
  ◦ Child process - The “worker”
• Potentially resource-intensive operations are performed in the worker process.
  ◦ Run queries, output logs, etc.
• The watchdog process checks the utilization stats for the worker on an interval.
  ◦ Resource utilization limits exceeded -> Watchdog kills/respawns worker
runs on the worker?
1. Worker writes to RocksDB the name of the query being run.
2. Query executes.
3. Worker removes the record of the running query.
the worker during query execution.
1. Worker writes to RocksDB the name of the query being run.
2. Query begins executing.
3. Watchdog kills worker.
4. Worker respawns, reads RocksDB, and sees that the previous worker was in the middle of execution.
5. Worker logs the failure during query execution and “blacklists” the query.
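The marker-and-respawn steps above can be sketched as a small simulation. This is a simplified model, not osquery's implementation: a plain dict stands in for RocksDB, an exception stands in for the watchdog killing the process, and the key name `executing_query` and the class names are hypothetical.

```python
# Simplified model of the worker's "query in progress" marker.
# A plain dict stands in for RocksDB; all names here are hypothetical.

EXECUTING_KEY = "executing_query"  # hypothetical key name


class WatchdogKill(Exception):
    """Stands in for the watchdog killing the worker process."""


class Worker:
    def __init__(self, store, blacklist):
        self.store = store          # survives worker restarts
        self.blacklist = blacklist  # names of queries to skip
        # On startup, a leftover marker means the previous worker
        # died while that query was executing.
        leftover = self.store.pop(EXECUTING_KEY, None)
        if leftover is not None:
            print(f"previous worker died while running {leftover!r}")
            self.blacklist.add(leftover)

    def run(self, name, query_fn):
        if name in self.blacklist:
            return None                    # blacklisted queries are skipped
        self.store[EXECUTING_KEY] = name   # 1. record the running query
        result = query_fn()                # 2. execute (may be killed here)
        del self.store[EXECUTING_KEY]      # 3. clear the marker
        return result


def expensive_query():
    raise WatchdogKill()


store, blacklist = {}, set()
worker = Worker(store, blacklist)
try:
    worker.run("heavy_query", expensive_query)
except WatchdogKill:
    pass  # the process is gone; the marker stays in the store

worker = Worker(store, blacklist)  # respawned worker finds the marker
```

Because the marker is written before execution and cleared only afterward, a worker that dies mid-query always leaves evidence for its replacement.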
fails.
• Blacklisted queries are removed from the schedule for 24 hours.
• This prevents crash-looping and unnecessary use of resources when queries will be killed.
• Observe query blacklist status using the blacklisted column of the osquery_schedule table.
  ◦ More on this later.
in osquery by default.
  ◦ Default settings: “normal”
    ▪ CPU - Over 10% for up to 12 seconds
    ▪ Memory - Up to 200MB
  ◦ --watchdog_level=1: “restrictive”
    ▪ CPU - Over 5% for up to 6 seconds
    ▪ Memory - Up to 100MB
  ◦ --watchdog_level=-1: “off”
    ▪ Performance limits are disabled
to specific needs.
• --watchdog_utilization_limit
  ◦ Threshold percentage of CPU
  ◦ Time allowed over the threshold is defined by the intervals from --watchdog_level
    ▪ --watchdog_level=0 - 12 seconds above limit
    ▪ --watchdog_level=1 - 6 seconds above limit
• --watchdog_memory_limit
  ◦ Maximum memory in MB
• Tradeoff: Visibility <-> Performance safety
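To make the interaction between the CPU threshold, the time allowed above it, and the memory cap concrete, here is a toy model of the watchdog's decision. It is a sketch under stated assumptions rather than osquery's actual logic: the class names, the 3-second check interval, and the rule that any sample at or below the CPU limit resets the timer are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    cpu_percent: float      # role of --watchdog_utilization_limit
    memory_mb: float        # role of --watchdog_memory_limit
    seconds_allowed: float  # time tolerated above the CPU limit


class WatchdogModel:
    """Toy model: decide whether the worker should be killed."""

    def __init__(self, limits, check_interval=3.0):
        self.limits = limits
        self.check_interval = check_interval  # assumed seconds between checks
        self.seconds_over = 0.0

    def check(self, cpu_percent, memory_mb):
        """Return True if the worker should be killed after this sample."""
        if memory_mb > self.limits.memory_mb:
            return True  # the memory cap is enforced immediately
        if cpu_percent > self.limits.cpu_percent:
            self.seconds_over += self.check_interval
        else:
            self.seconds_over = 0.0  # assumed: a quiet sample resets the timer
        return self.seconds_over > self.limits.seconds_allowed


# The "normal" level: over 10% CPU for up to 12 seconds, up to 200MB.
normal = Limits(cpu_percent=10, memory_mb=200, seconds_allowed=12)
watchdog = WatchdogModel(normal)
for second in range(0, 15, 3):
    killed = watchdog.check(cpu_percent=50, memory_mb=100)
    print(f"t={second}s kill={killed}")
```

Raising the limits lets heavier queries finish (more visibility), while lowering them bounds the cost of a bad query (more performance safety).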
the watchdog to limit utilization by osquery extensions.
  ◦ Extensions typically run as child processes spawned by osqueryd (with the --extensions_autoload flag).
  ◦ Use --enable_extensions_watchdog to turn on this feature.
osquery performance.
• The osquery_schedule table exposes performance information for all scheduled queries.
• Performance information is collected by looking at the difference in CPU time and memory of the worker process during execution.
= 1;
• Returns information about all of the queries that are currently blacklisted.
• Depending on your requirements, this may be worth alerting on!
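A minimal sketch of alerting on blacklisted queries, assuming rows from the osquery_schedule table have already been collected as JSON. The sample rows and query names are hypothetical, and the sketch assumes column values arrive as strings, so check this against your own log pipeline.

```python
import json

# Hypothetical sample of osquery_schedule rows collected as JSON.
# Values are assumed to arrive as strings.
SAMPLE = """[
  {"name": "pack_a_processes", "blacklisted": "0", "executions": "120"},
  {"name": "pack_a_file_hashes", "blacklisted": "1", "executions": "3"}
]"""


def blacklisted_queries(rows):
    """Return the names of scheduled queries currently marked blacklisted."""
    return [row["name"] for row in rows if row.get("blacklisted") == "1"]


rows = json.loads(SAMPLE)
alerts = blacklisted_queries(rows)
for name in alerts:
    print(f"ALERT: scheduled query {name!r} is blacklisted")
```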
create osquery performance dashboards.
• Useful charts:
  ◦ Blacklisted queries
  ◦ Memory usage of queries
  ◦ System + user time usage of queries
• Visualizing middle and top percentiles can help find outliers.
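To illustrate why comparing middle and top percentiles surfaces outliers, the sketch below computes nearest-rank percentiles over per-host memory usage for one scheduled query. The host names, the values, and the "three times the median" outlier rule are hypothetical choices for illustration only.

```python
import math

# Hypothetical memory usage (MB) of one scheduled query, per host.
memory_mb = {
    "web-01": 10, "web-02": 12, "web-03": 11, "web-04": 13, "web-05": 12,
    "web-06": 11, "web-07": 10, "web-08": 14, "web-09": 12, "web-10": 180,
}


def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


p50 = percentile(memory_mb.values(), 50)
p95 = percentile(memory_mb.values(), 95)

# Illustrative rule: flag hosts using more than 3x the median.
outliers = sorted(host for host, mb in memory_mb.items() if mb > 3 * p50)

print(f"p50={p50}MB p95={p95}MB outliers={outliers}")
```

A large gap between the median and the top percentile is the signal to look for: most hosts are fine, but a few are paying far more for the same query.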
of new queries:
1. Segment hosts and deploy queries to groups of hosts.
2. Use the shard option of scheduled queries to slow-roll queries within a host group.
Use these strategies together for the best control of rollouts.
for performance issues.
  ◦ Start with lower risk hosts.
• Different techniques can be used depending on the deployment/configuration strategy of osquery.
• With tools like Chef/Puppet/Ansible:
  ◦ Use the tooling to deploy different pack files to each group of hosts.
• With plain osquery:
  ◦ Use the discovery query feature of query packs to gate pack execution based on the results of dynamic queries.
• With Fleet:
  ◦ Use labels to segment hosts and target packs to labels.
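As an illustration of the discovery query approach, the sketch below builds a pack as a Python dict and models the gating rule: the pack runs only when every discovery query returns at least one row. The pack contents and the `pack_is_active` helper are hypothetical; the helper models the behavior and is not part of osquery.

```python
import json

# Hypothetical pack: only hosts running nginx should execute its queries.
pack = {
    "discovery": [
        "SELECT pid FROM processes WHERE name = 'nginx';",
    ],
    "queries": {
        "nginx_listening_ports": {
            "query": "SELECT pid, port, address FROM listening_ports;",
            "interval": 3600,
        },
    },
}


def pack_is_active(pack, run_query):
    """Model of the gate: every discovery query must return rows.

    run_query is a caller-supplied function mapping SQL to a list of rows.
    """
    return all(len(run_query(sql)) > 0 for sql in pack.get("discovery", []))


# Serialize the pack to JSON so it can be written out as a pack file.
print(json.dumps(pack, indent=2))
```

Since the gate is evaluated on each host, the same pack file can be shipped everywhere while only matching hosts pay the cost of its queries.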
a scheduled query to enable the query on a subset of hosts that receive the pack.
  ◦ Shard is a percentage of hosts on which to enable the query.
    ▪ 0 - No hosts
    ▪ 100 - All hosts
• Increase the shard value as confidence in the query performance increases.
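One way to picture percentage-based sharding is to hash a stable host identifier into a bucket from 0 to 99 and enable the query when the bucket falls below the shard value. The hashing scheme here is an assumption chosen for illustration; osquery's own shard calculation may differ, and `host_in_shard` is a hypothetical helper.

```python
import hashlib


def host_in_shard(host_identifier, shard):
    """Return True if this host falls within the shard percentage (0-100)."""
    if shard <= 0:
        return False
    if shard >= 100:
        return True
    digest = hashlib.sha256(host_identifier.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket, 0..99
    return bucket < shard


# Hypothetical scheduled query entry that uses the shard option.
scheduled_query = {
    "query": "SELECT * FROM processes;",
    "interval": 600,
    "shard": 10,  # start with roughly 10% of the hosts that receive the pack
}

hosts = [f"host-{i}" for i in range(1000)]
enabled = [h for h in hosts if host_in_shard(h, scheduled_query["shard"])]
print(f"{len(enabled)} of {len(hosts)} hosts would run the query")
```

With a scheme like this, each host's bucket is fixed, so raising the shard value only adds hosts and never removes ones that were already running the query.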
techniques to begin sending the new query to hosts.
• Ensure that you have visibility (alerting, dashboards, etc.) into the performance of osquery on those systems.
• Deploy to more hosts as confidence increases.
• Good monitoring dashboards really pay off at this stage.
use to investigate the performance of queries.
• Profiling
  ◦ Use the profile script to preview the performance of query packs on the local machine.
• Query planning
  ◦ Use SQLite explain query plan to begin debugging performance problems.
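For a feel of what explain query plan reports, the sketch below runs `EXPLAIN QUERY PLAN` through Python's built-in sqlite3 module. The two ordinary tables are hypothetical stand-ins that borrow osquery table names; osquery's real tables are SQLite virtual tables, so plans there will look different, but the habit of checking which tables are scanned and which are searched carries over.

```python
import sqlite3

# Ordinary SQLite tables standing in for osquery's virtual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processes (pid INTEGER PRIMARY KEY, name TEXT, path TEXT);
    CREATE TABLE listening_ports (pid INTEGER, port INTEGER);
""")


def query_plan(conn, sql):
    """Return the human-readable detail text for each step of the plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the detail


plan = query_plan(
    conn,
    "SELECT p.name, l.port "
    "FROM listening_ports l JOIN processes p ON p.pid = l.pid",
)
for step in plan:
    print(step)
```

Steps that scan a whole table for every row of another table are a common starting point when a JOIN turns out to be expensive.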
• Monitor performance and blacklists with the osquery_schedule table.
• Be strategic in rollout of new queries.
• Learn to use the tooling to evaluate performance.