tools for Ruby • At that time, I tried and failed to perform image analysis and simple data analysis with Ruby • Ruby couldn’t compete with Python and Julia for the tasks I wanted to do
data processing systems • We can analyze small data with Ruby’s data tools, without great effort to keep using Ruby • There are developers and users of Ruby’s data tools around the world
• But the necessary tool ecosystems are growing steadily through the efforts of several developers and non-developers who share similar visions • I think we have a bright future
and users who use open source data tools with Ruby Hosting Discourse: • discourse.ruby-data.org • It was used for the discussions during GSoC 2018 Holding workshops: • Several workshops at past RubyKaigis • The next one will be held on the 2nd day of RubyKaigi 2019
for integrating a lot of data tools for Ruby • RubyData provides two Docker images • rubydata/minimal-notebook • rubydata/datascience-notebook • These notebooks can be used on Binder
of … • Jupyter Notebook and JupyterLab • SciPy stack (scipy, numpy, pandas, matplotlib, IPython, etc.) • IRuby and data tools for Ruby including pycall.rb
try to utilize Ruby in the data science field • Developing data tools for Ruby • Using Ruby as a frontend language for data analysis • Integrating Rails applications with some data processing tools
Ruby community 2. Acting rather than blaming 3. Continuous, iterative progress rather than a short, big project 4. The current lack of knowledge doesn't matter 5. Ignore criticism from outsiders 6. Fun!
every month at Speee Lounge, Tokyo • Not only Red Data Tools people, but also Ruby Numo people, SciRuby people, and others have attended • Like Asakusa.rb, in this meetup we concentrate on the development of data tools for Ruby • There are two Apache Arrow committers
Arrow committers • One is @kou, Kouhei Sutou, the founder of the Red Data Tools project • The other is @shiro615, who started contributing to Apache Arrow as his first OSS activity at a Red Data Tools meetup and got the commit bit in Nov 2018
the contemporary computer architecture • Single-threaded algorithms are not friendly to multi-core CPUs and GPGPUs • Data layout is not optimized for the CPU cache • Tools are fragmented across programming language ecosystems • Data in memory can’t be shared among tools
for columnar data (i.e. data frames) • Bring together database and data science communities to collaborate on shared computational technologies • Defragment data access among different tools
Without Arrow: format • 80% computation wasted on serialization & deserialization • Similar functionality implemented in multiple projects With Arrow: • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality (Source: https://arrow.apache.org/)
writing widely used storage formats • Interacting with databases and other data sources Exchanging data: • Zero-copy IPC • Efficient RPC and client-server communications Computation with data: • Efficient in-memory and out-of-core data frame analysis • JIT compilation of vectorized expression evaluation with LLVM
for memory-efficient collections of primitive data types • Objects of Red Arrow classes can be passed to Python without copying by using red-arrow-pycall • You can read and write the Parquet file format
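As a rough illustration of what a memory-efficient collection of primitive values means, the sketch below stores 32-bit integers in one contiguous, fixed-width buffer instead of as separate Ruby objects. It uses only the Ruby standard library. `PackedInt32Array` is a hypothetical class invented for this example; it is not the Red Arrow API and says nothing about how Red Arrow is implemented.

```ruby
# Illustration only: PackedInt32Array is a hypothetical class,
# not part of Red Arrow or Apache Arrow.
class PackedInt32Array
  BYTES_PER_VALUE = 4

  def initialize(values)
    # "l*" packs every Integer as a signed 32-bit value (native byte order)
    # into a single binary String, i.e. one contiguous buffer.
    @buffer = values.pack("l*").freeze
  end

  # Number of stored values.
  def size
    @buffer.bytesize / BYTES_PER_VALUE
  end

  # Size of the underlying buffer in bytes.
  def bytesize
    @buffer.bytesize
  end

  # Random access: slice the 4 bytes of one value and decode them.
  def [](index)
    return nil if index.negative? || index >= size

    @buffer.byteslice(index * BYTES_PER_VALUE, BYTES_PER_VALUE).unpack1("l")
  end

  # Decode the whole buffer back into a plain Ruby Array.
  def to_a
    @buffer.unpack("l*")
  end
end

packed = PackedInt32Array.new([1, -2, 2_000_000_000])
```

Because every value has the same width, the position of any element can be computed from its index, and the buffer can be handed to another tool as a plain block of bytes.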
integrate ActiveRecord and Apache Arrow • Arrow::RecordBatch was employed as the internal data representation of AR::Result • A RecordBatch represents a chunk of columnar table data • mysql2 was modified to generate an instance of Arrow::RecordBatch directly from a query result • The memory consumption and computation time of AR’s pluck method were compared between the original and Apache Arrow versions
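To make the difference between a row-oriented and a column-oriented query result concrete, here is a minimal pure-Ruby sketch. `RowResult` and `ColumnarResult` are hypothetical classes written for this illustration only; they are not ActiveRecord, mysql2, or Red Arrow code, and the experiment described above may work quite differently internally.

```ruby
# Illustration only: RowResult and ColumnarResult are hypothetical classes,
# not part of ActiveRecord, mysql2, or Red Arrow.

# Row-oriented result: one Ruby Array per record.
class RowResult
  def initialize(column_names, rows)
    @column_names = column_names
    @rows = rows
  end

  # Collecting one column requires visiting every row.
  def pluck(name)
    index = @column_names.index(name)
    raise KeyError, "unknown column: #{name}" if index.nil?

    @rows.map { |row| row[index] }
  end
end

# Column-oriented result: one Array per column, loosely in the spirit of
# a record batch holding columnar table data.
class ColumnarResult
  def initialize(columns)
    @columns = columns # { "column name" => [values, ...] }
  end

  # Build the columnar layout once from row data.
  def self.from_rows(column_names, rows)
    columns = column_names.each_with_index.map do |name, i|
      [name, rows.map { |row| row[i] }]
    end
    new(columns.to_h)
  end

  # The requested column already exists as a single collection.
  def pluck(name)
    @columns.fetch(name)
  end
end

names = ["id", "title"]
rows = [[1, "a"], [2, "b"], [3, "c"]]

row_result = RowResult.new(names, rows)
columnar_result = ColumnarResult.from_rows(names, rows)
```

Both objects answer `pluck("title")` with the same values; the columnar version simply returns data that is already grouped by column.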
improves the memory consumption of the pluck method without loss of computational speed • My experimental implementation can be applied only by changing the connection adapter name: “mysql2” → “arrow-mysql2” • The technologies for systems that utilize massive data are also applicable to Web applications • The activity for Ruby’s data tools can have good effects on Rails applications • I’ll explain the details of this experiment, and the additional research I’m now performing to improve the pluck method, at RubyKaigi 2019
Apache Arrow is important for the future of Ruby in the data science field • Apache Arrow is also important for Rails applications • I will talk about the mechanisms by which Apache Arrow improves the pluck method at RubyKaigi 2019
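The adapter switch mentioned above could look roughly like the following. This is only a sketch: the adapter names "mysql2" and "arrow-mysql2" come from the slide, while the host, username, and database values are placeholders made up for this example.

```ruby
# Hypothetical connection settings; host, username, and database are
# placeholders, not values from the original experiment.
original_config = {
  adapter: "mysql2",
  host: "localhost",
  username: "app",
  database: "app_development"
}

# Only the adapter name changes; every other setting stays the same.
arrow_config = original_config.merge(adapter: "arrow-mysql2")

# In a Rails application these settings would normally live in
# config/database.yml; outside Rails, a hash like this is what
# ActiveRecord::Base.establish_connection accepts.
```

Keeping the change to a single setting means application code that calls pluck does not need to be modified.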