usable for data science
2. For the future: what efforts we should make to keep Ruby available for data science
3. Request for you: shall we develop our tools and community?
8 / 55
costs of data exchange via a JSON API:
Development and maintenance of API endpoints
JSON serialization and deserialization for exchanging data
Letting data processing systems refer to the same database as the main application
All of this increases the development cost of the main application
14 / 55
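The exchange path the slide criticizes can be sketched with nothing but Ruby's standard library: every request pays for serialization on one side and parsing on the other, on top of maintaining the endpoint itself. The record and field names below are illustrative, not from the talk.

```ruby
require "json"

# The main application serializes a record for the JSON API response...
record  = { "id" => 1, "name" => "sample", "values" => [1.0, 2.0, 3.0] }
payload = JSON.generate(record)

# ...and the data processing system parses it back into its own objects.
# The data is copied twice just to cross the process boundary.
received = JSON.parse(payload)

raise "round-trip mismatch" unless received == record
```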
use a Python interpreter together with a Ruby interpreter in the same process.
PyCall provides low-cost ways of exchanging data:
Direct conversion to Python data types
Sharing the same memory pointers
Using the Apache Arrow data structure via the red-arrow-pycall library
15 / 55
results and collect them in a pandas DataFrame
Visualize the results with seaborn, a Python visualization library built on matplotlib
Perform all of the above in one Ruby script
18 / 55
system has its own internal memory format
Serializing and deserializing data for exchange wastes a lot of CPU time
Similar functions are implemented in multiple systems
42 / 55
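The CPU cost of format conversion can be made concrete with a rough stdlib-only measurement: copying data through JSON costs far more than handing over a reference to the same in-memory structure, which is what an Arrow-style shared format aims for. The data size here is arbitrary.

```ruby
require "json"
require "benchmark"

data = { "values" => (1..100_000).to_a }

# Exchange by serialization: the whole structure is encoded and decoded.
copy_time = Benchmark.realtime do
  JSON.parse(JSON.generate(data))
end

# Exchange by a shared memory format: conceptually just passing a pointer.
share_time = Benchmark.realtime do
  _shared = data
end

raise "copy should cost more" unless copy_time > share_time
```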
became a member of the PMC (Project Management Committee) of Apache Arrow yesterday
This means there is at least one person who develops Ruby support for Apache Arrow as a core developer
So you will be able to use Apache Arrow's new features ASAP
50 / 55
is usable in data science
You can use Python tools from Ruby with PyCall, as demonstrated in this talk
Red Data Tools enables us to use Apache Arrow, which ensures that Ruby will stay connected to multiple data processing systems in the future
But there are lots of things that should be done for the future
52 / 55