Kotlin ❤️ Data Science?

There is a mismatch between software engineering and data science. My talk addresses this mismatch and asks whether Kotlin can help bring the two worlds closer together.

Preslav Rachev

January 28, 2019

Transcript

  1. Kotlin ❤ Data Science?*

    Preslav Rachev @ KI labs // 28.01.2019
    * Data science and data engineering
    @preslavrachev / (https://preslav.me), 2019
  2. Who am I?

    - A software engineer, working at KI labs.
    - Passionate about Kotlin and data.
    - A genuinely curious individual who loves writing. ✏
    - Also, an inventor of funny faces.
    - ✏ https://preslav.me
  3. The IT Reality of 2019

    - "AI" has become a favourite topic among business managers and software engineers when discussing company innovation strategies.
    - Tech media is only making it worse.
    - Data is everywhere, but getting useful knowledge out of it is very different from what management and engineering imagine.
  4. AI, ML, DS?!?

    - AI is what brings the VC money in.
    - ML (a.k.a. sophisticated brute force) is what gets the job done.
    - ML models are narrowly limited to a given domain.
    - DS is the craft of finding which ML model works for a particular case, and which doesn't.
  5. Data Science Definition

    Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science. (Source: Wikipedia)
  6. Data Science Workflow

    1. Form a hypothesis
    2. Load data from various sources
    3. Clean, transform and unify
    4. Extract features
    5. Use the features to run a model
    6. Visualize and report findings
    7. Support or refute the hypothesis
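The workflow steps above can be sketched in plain Kotlin. This is only an illustration and is not taken from the talk: the `Measurement` class, the inlined sample rows and the hypothesis (more hours go with a higher score) are made up, and a Pearson correlation stands in for "the model".

```kotlin
import kotlin.math.sqrt

// Hypothetical record type; the field names are assumptions for this sketch.
data class Measurement(val hours: Double, val score: Double)

// Step 2: load raw rows (inlined here instead of reading a real source).
val rawRows = listOf("1.0,52", "2.0,58", "", "3.0,n/a", "4.0,71", "5.0,75")

// Step 3: clean and unify, dropping rows that do not parse.
fun clean(rows: List<String>): List<Measurement> = rows.mapNotNull { row ->
    val parts = row.split(",")
    val hours = parts.getOrNull(0)?.trim()?.toDoubleOrNull()
    val score = parts.getOrNull(1)?.trim()?.toDoubleOrNull()
    if (hours != null && score != null) Measurement(hours, score) else null
}

// Steps 4-5: take two feature columns and run a (very simple) model on them.
fun pearson(xs: List<Double>, ys: List<Double>): Double {
    val mx = xs.average()
    val my = ys.average()
    val cov = xs.zip(ys).sumOf { (x, y) -> (x - mx) * (y - my) }
    val sx = sqrt(xs.sumOf { (it - mx) * (it - mx) })
    val sy = sqrt(ys.sumOf { (it - my) * (it - my) })
    return cov / (sx * sy)
}

fun main() {
    val data = clean(rawRows)
    val r = pearson(data.map { it.hours }, data.map { it.score })
    // Steps 6-7: report, then support or refute the hypothesis.
    println("rows kept: ${data.size}, correlation: $r")
    println(if (r > 0.5) "hypothesis supported" else "hypothesis refuted")
}
```

The 0.5 threshold is arbitrary; a real analysis would use a proper significance test and a plotting library for step 6.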
  7. Motivation

    There is a mismatch between software engineering and data science practices:
    - Software engineering works best when building well-defined systems.
    - Requirements rarely change entirely, but evolve over time.
    - Data science deals with supporting and refuting hypotheses.
    - Lots of uncertainty, which requires seamless exploration and visualisation.
  8. Motivation

    Due to this mismatch, systems often end up becoming complex tech-stack mash-ups, where each side treats the other as some sort of black box.
    - Difficult to maintain, and requires lots of different skills and practices.

    The question arises: could there be a single tech stack that allows software engineers and data scientists to work in peace, but also to contribute directly to a single codebase?
  9. Kotlin == the missing link?

    - A multi-paradigm programming language with a fluent syntax
    - Strong community and enterprise backing
    - Access to the entire universe of JVM knowledge and libraries

    But there are a few important pieces, which are not quite there yet:
    - Fully integrated scripting capabilities
    - Playground environments (e.g. notebooks)
    - Data wrangling and visualization libraries that take advantage of the above
  10. Kotlin Top 10 Features

    Kotlin is a multi-paradigm programming language, equally easy to learn by both Java and Python programmers. My personal Top 10:
    1. Static typing
    2. Immutability and Null-safety
    3. Higher-order functions
    4. Chain-able sequences
    5. Data classes
    6. Extension methods
    7. Sealed classes
    8. Coroutines
    9. Default and named arguments
    10. Multi-platform support
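Default and named arguments (item 9) do not get a slide of their own in this transcript. A minimal sketch of the feature follows; the `describe` function and its parameters are invented for this example.

```kotlin
import java.util.Locale

// Hypothetical helper that summarises a list of numbers.
// `label` and `digits` have default values, so callers may omit them.
fun describe(values: List<Double>, label: String = "values", digits: Int = 2): String {
    // Locale.ROOT keeps the decimal separator a dot on every machine.
    val mean = "%.${digits}f".format(Locale.ROOT, values.average())
    return "$label: n=${values.size}, mean=$mean"
}

fun main() {
    val xs = listOf(1.0, 2.0, 4.0)
    println(describe(xs))                                     // both defaults used
    println(describe(xs, digits = 1))                         // named argument skips `label`
    println(describe(digits = 0, values = xs, label = "xs"))  // any order when named
}
```

Python programmers will recognise this as the equivalent of keyword arguments, which is part of why such call sites read familiarly to both audiences.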
  11. Static Typing

        List<Integer> nums = new ArrayList<>(Arrays.asList(1, 2, 3)); // Java
        val nums = listOf(1, 2, 3) // Kotlin

    Immutability and Null-Safety

        // Every variable in Kotlin must be assigned a value,
        // unless explicitly declared with `lateinit`
        val x = 100 // cannot be reassigned, ever!
        var y = 200 // this one can
        lateinit var z: String // a value will be provided later (lateinit needs an explicit, non-primitive type)
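The slide names null-safety without showing it. As a hedged sketch (the `Reading` class and `labelLength` function are invented for illustration), nullable types, the safe-call operator `?.` and the Elvis operator `?:` look like this:

```kotlin
// Hypothetical type: `label` may be absent, so its type is the nullable String?.
data class Reading(val value: Double, val label: String?)

// `?.` yields null instead of throwing when `label` is null;
// `?:` supplies a fallback for that null.
fun labelLength(reading: Reading): Int = reading.label?.length ?: 0

fun main() {
    val named = Reading(1.5, "temperature")
    val unnamed = Reading(2.5, null)
    println(labelLength(named))
    println(labelLength(unnamed))
    // val broken: String = unnamed.label  // would not compile: String? is not String
}
```

The compiler, not a runtime check, rejects the commented-out line, which is the main difference from handling missing values in Java or Python.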
  12. Higher-Order Functions and DSL support

    A higher-order function is a function that takes functions as parameters, or returns a function.

        route("/portal") {
            route("articles") { … }
            route("admin") {
                intercept(ApplicationCallPipeline.Features) { … } // verify admin privileges
                route("article/{id}") { … } // manage article with {id}
                route("profile/{id}") { … } // manage profile with {id}
            }
        }
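The routing example relies on a web framework. A dependency-free sketch of the same two ideas follows: `applyTwice` takes a function as a parameter, and `report` takes a lambda with receiver, which is the mechanism Kotlin DSLs are built on. Both functions and the `ReportBuilder` class are made up for this illustration.

```kotlin
// Higher-order function: takes another function as a parameter.
fun applyTwice(x: Int, f: (Int) -> Int): Int = f(f(x))

// A tiny builder used as the receiver of the DSL block (hypothetical).
class ReportBuilder {
    private val lines = mutableListOf<String>()
    fun section(title: String) { lines += "# $title" }
    fun line(text: String) { lines += text }
    fun build(): String = lines.joinToString("\n")
}

// `ReportBuilder.() -> Unit` is a lambda with receiver: inside the block,
// `section` and `line` are called on an implicit ReportBuilder instance.
fun report(block: ReportBuilder.() -> Unit): String =
    ReportBuilder().apply(block).build()

fun main() {
    println(applyTwice(3) { it * 2 })
    println(report {
        section("Findings")
        line("Hypothesis supported")
    })
}
```

The trailing-lambda syntax is what lets `report { … }` read like a language construct, just as `route("/portal") { … }` does in the slide.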
  13. Data classes and Chain-able Sequences

        data class Person(val name: String, val age: Int)

        val people = listOf(
            Person("Chris Martin", 31),
            Person("Will Champion", 32),
            Person("Jonny Buckland", 33),
            Person("Guy Berryman", 34),
            Person("Mhris Cartin", 30))

        println(people
            .asSequence() // convert to sequence
            .filter { it.age > 30 } // lazy eval (intermediate op)
            .map { it.name.split(" ").map { it[0] }.joinToString("") } // lazy eval (intermediate op)
            .map { it.uppercase() } // lazy eval (intermediate op)
            .toList()) // terminal operation

    Tip: Combine these with coroutines to construct declarative data pipelines.
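To see what "lazy eval" means in practice, here is a small sketch (the function names and counters are assumptions of this example) that counts how many elements the mapping step actually touches. A chain on a list maps every element before looking for the first match, whereas a sequence stops as soon as the terminal operation is satisfied.

```kotlin
// Counts how often the mapping step runs for an eager (list) chain.
fun eagerCalls(input: List<Int>): Int {
    var calls = 0
    input.map { calls++; it * it }.first { it > 10 }
    return calls
}

// Same chain on a sequence: elements flow through one at a time.
fun lazyCalls(input: List<Int>): Int {
    var calls = 0
    input.asSequence().map { calls++; it * it }.first { it > 10 }
    return calls
}

fun main() {
    val numbers = (1..1000).toList()
    println("eager: ${eagerCalls(numbers)} calls, lazy: ${lazyCalls(numbers)} calls")
}
```

For small collections the difference is negligible; it starts to matter when the source is large or each step is expensive.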
  14. Sealed Classes

        sealed class ArithmeticOperation
        class Add(var a: Int, var b: Int): ArithmeticOperation()
        class Subtract(var a: Int, var b: Int): ArithmeticOperation()
        class Multiply(var a: Int, var b: Int): ArithmeticOperation()
        class Divide(var a: Int, var b: Int): ArithmeticOperation()

        fun execute(op: ArithmeticOperation) = when (op) {
            is Add -> op.a + op.b
            is Subtract -> op.a - op.b
            is Multiply -> op.a * op.b
            is Divide -> op.a / op.b
        }
  15. Extension Methods

        fun String.underscore(): String {
            return this.replace(" ", "_")
        }
        print("hello world".underscore()) // "hello_world"

    Infix support

        infix fun Number.toPowerOf(exponent: Number): Double {
            return Math.pow(this.toDouble(), exponent.toDouble())
        }
        3 toPowerOf 2 // 9.0
        9 toPowerOf 0.5 // 3.0
  16. The Ecosystem

    No programming language will do the job without an abundant library ecosystem to pick and choose from.
    - The Kotlin Standard Library will be your first choice, but by far not the only one.
    - Kotlin is standing on the shoulders of giants (e.g. the JVM).
    - The future prospects of integrating low-level libraries with Kotlin Native are even more promising.
  17. What the JVM has to offer...

    | Library             | Functionality                                      |
    |---------------------|----------------------------------------------------|
    | Apache Hadoop       | Batch processing                                   |
    | Apache Spark        | Data streaming                                     |
    | ND4J                | Scientific computing (similar to NumPy)            |
    | Apache Commons Math | Math and computing utils                           |
    | Weka                | ML/NLP (similar to SciPy)                          |
    | Tablesaw            | Visualization (similar to Matplotlib and Plot.ly)  |
    | TensorFlow for Java | Deep ML                                            |
    | Deeplearning4j      | Deep ML                                            |

    And many, many more...
  18. ...besides, a young ecosystem of libs targeting Kotlin's unique features:

    | Library           | Functionality                                            |
    |-------------------|----------------------------------------------------------|
    | Krangl            | Data wrangling (similar to Pandas)                       |
    | Kravis            | Visualisation (similar to Matplotlib and Plot.ly)        |
    | Koma              | Scientific computing (similar to SciPy)                  |
    | kotlin-statistics | Scientific computing and statistics                      |
    | komputation       | Neural network for the JVM written in Kotlin and CUDA C  |

    Still, no real Pandas yet.
  19. Many of the above libraries use standard formats for communicating input data (e.g. CSV) or results (e.g. trained ML models, reports, aggregated data sets, visualisations, etc.).

    - At the very least, this means that one can create an environment in which data scientists keep using their favourite tools and communicate their findings to the software engineers using those standards.
    - Kotlin can become a common ground of code understanding.

    OK, but can we go one step further from there?
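As an illustration of exchanging results through a standard format, the sketch below parses a small CSV report into typed Kotlin objects. The column names, the `ModelScore` class and the inlined text are assumptions; a real export would be read from a file and would probably need a proper CSV library to handle quoting.

```kotlin
// Hypothetical row type for a report exported by a data science tool.
data class ModelScore(val model: String, val accuracy: Double)

// Parses simple comma-separated text with a header row.
// Assumes exactly two columns, no quoted fields and no embedded commas.
fun parseScores(csv: String): List<ModelScore> =
    csv.lineSequence()
        .drop(1)                         // skip the header row
        .filter { it.isNotBlank() }      // ignore empty lines
        .map { it.split(",") }
        .map { (model, accuracy) -> ModelScore(model.trim(), accuracy.trim().toDouble()) }
        .toList()

fun main() {
    val csv = """
        model,accuracy
        logistic_regression,0.81
        random_forest,0.88
    """.trimIndent()
    val best = parseScores(csv).maxByOrNull { it.accuracy }
    println(best)
}
```

Once the rows are ordinary data classes, the engineering side can work with them using the same sequence operations shown earlier in the deck.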
  20. The ultimate data scientist peace requires three more things:

    1. Better Kotlin scripting support
    2. A solid REPL (Read-Eval-Print Loop) console
    3. Tools that encourage experimentation and interactive programming
  21. Scripting Support

    A large portion of the data team's work involves the use and deployment of executable scripts. This is one field where Python excels.

    Kotlin Script is unfinished, slow and painful to work with.

    KEEP-75
  22. KScript

    KScript is an open-source project that tries to improve the performance of Kotlin scripts and reduce the friction when working with third-party libs:

        #!/usr/bin/env kscript
        @file:DependsOn("de.mpicbg.scicomp:kutils:0.4")

        import de.mpicbg.scicomp.bioinfo.openFasta

        if (args.size != 1) {
            System.err.println("Usage: CountRecords <fasta>")
            kotlin.system.exitProcess(-1)
        }

        val records = openFasta(java.io.File(args[0]))
        println(records.count())
  23. REPL

    Kotlin has a REPL (Read-Eval-Print Loop), but it is a tough beast. IntelliJ extends the Kotlin REPL and makes it a bit nicer to work with. Check out KShell as an alternative.
  24. Interactive Programming

    Also known as notebooks or playgrounds, tools like Jupyter allow for a unique mix of narrative and code.
    - Let programmers play around with data and libs in a visual, REPL-like environment
    - Great for sharing and explaining difficult concepts

    Kotlin Jupyter
    Kotlin Playground
  25. What did we learn?

    - Kotlin is a great language with a mature library ecosystem.
    - It lacks some of the tooling that data scientists need.
    - The community and JetBrains are working hard to fill the gaps.
    - We wouldn't have come this far if it weren't for these folks: @ligee, @thomasnield9727, @holgerbrandl and many more around the #datascience community on Slack.
  26. Links

    - The Connection Between Data Science, Machine Learning and Artificial Intelligence
    - Awesome Kotlin - a curated list of libraries and resources