
Scala for Data Analysis

VCNC
December 03, 2014

Korea Spark User Group

Overview
- Scala overview
- Why Scala?
- A taste of Scala basics
- Digging a little deeper

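Summary
- Scala is a good language for data analysis (and for other uses too)
- Concise expression, good performance, functional programming
- REPL and scripting support
- You can implement the concepts you want in an elegant way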

Transcript

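  1. Before we start

    1. Scala is a general-purpose programming language that is not confined to any particular field. This deck focuses on data analysis and introduces part of Scala for people who are new to it. If you want to learn more about Scala, the Korean user group "라 스칼라 코딩단" (La Scala Coding Dan) is recommended.
    2. The data analysis covered here is mainly distributed processing and analysis of large-scale data, rather than advanced analysis with tools such as R or Matlab.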
  2. Word count in MapReduce (Java)

        import java.io.IOException;
        import java.util.Iterator;
        import java.util.StringTokenizer;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;

        public class WordCount {

            // Mapper: emit (word, 1) for every token in the input line
            public static class Map extends MapReduceBase
                    implements Mapper<LongWritable, Text, Text, IntWritable> {
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();

                public void map(LongWritable key, Text value,
                                OutputCollector<Text, IntWritable> output,
                                Reporter reporter) throws IOException {
                    String line = value.toString();
                    StringTokenizer tokenizer = new StringTokenizer(line);
                    while (tokenizer.hasMoreTokens()) {
                        word.set(tokenizer.nextToken());
                        output.collect(word, one);
                    }
                }
            }

            // Reducer (also used as combiner): sum the counts for each word
            public static class Reduce extends MapReduceBase
                    implements Reducer<Text, IntWritable, Text, IntWritable> {
                public void reduce(Text key, Iterator<IntWritable> values,
                                   OutputCollector<Text, IntWritable> output,
                                   Reporter reporter) throws IOException {
                    int sum = 0;
                    while (values.hasNext()) {
                        sum += values.next().get();
                    }
                    output.collect(key, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                JobConf conf = new JobConf(WordCount.class);
                conf.setJobName("wordcount");

                conf.setOutputKeyClass(Text.class);
                conf.setOutputValueClass(IntWritable.class);

                conf.setMapperClass(Map.class);
                conf.setCombinerClass(Reduce.class);
                conf.setReducerClass(Reduce.class);

                conf.setInputFormat(TextInputFormat.class);
                conf.setOutputFormat(TextOutputFormat.class);

                FileInputFormat.setInputPaths(conf, new Path(args[0]));
                FileOutputFormat.setOutputPath(conf, new Path(args[1]));

                JobClient.runJob(conf);
            }
        }
  3. Word count in MapReduce (Java) vs. word count in Spark (Scala)

    The slide repeats the MapReduce (Java) word count from slide 2 and sets the Spark (Scala) version beside it:

        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")
  4. Scalable Language!

    • A language for building larger programs through concise expression and powerful features
    • Many of Scala's characteristics are well suited to data analysis
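  5. Scala

    • Very concise syntax (like Python)
    • Supports both OOP and functional programming styles
    • Runs on the JVM, compatible with Java
    • Good performance (== Java)
    • Statically typed (!= Python, == Java)
    • REPL (shell), scripting
    * Scala has many other good features; this list focuses on the ones relevant to data analysis.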
  6. Concise syntax (compared with Java)

        // Person.java
        public class Person {
            private String name;
            private String work;

            public void setName(String name) { this.name = name; }
            public String getName() { return name; }
            public void setWork(String work) { this.work = work; }
            public String getWork() { return work; }
        }

        // Job.java
        public class Job {
            public static void main(String[] args) {
                Person kevin = new Person();
                kevin.setName("Kevin");
                kevin.setWork("Between");
            }
        }

        // job.scala
        class Person(val name: String, val work: String)
        val kevin = new Person("Kevin", "Between")

    Slide annotations: "Running out of room.." and "GOOD".
  7. OOP & Functional Programming

    • Misconception: OOP and functional programming are opposites? (X)
    • Scala is pure OOP

        class Person(val name: String, val work: String)
        val kevin = new Person("Kevin", "Between")

    • Scala supports functional programming

        val list = List(1, 2, 3)
        def aMultiplyFunction(x: Int) = { x * 2 }
        val result = list.map(aMultiplyFunction)

    Functions are 1st-class citizens! A function can be treated as data, passed as an argument, and so on.
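  8. Runs on the JVM, compatible with Java

    • Compiling Scala code produces .class files, just as Java does
    • Runs on the JVM with nearly the same runtime performance as Java
    • Java classes can be imported and used
    • Java files and Scala files can be mixed and compiled together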
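  9. A statically typed language

    • Static typing vs. dynamic typing?
        • Each has clear strengths and weaknesses
        • Strengths of statically typed languages: type checking at compile time, good performance
        • Strengths of dynamically typed languages: quick to write, clean code
    • Scala is statically typed
        • Compile-time type checking, type safety, good performance
        • Relatively clean type declarations: the compiler infers types (type inference) and fills them in
        • Mechanisms such as implicit conversion help keep code simple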
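
The type inference point can be illustrated with a minimal sketch. The object name `TypeInferenceSketch` and the values in it are made up for illustration and are not from the deck.

```scala
object TypeInferenceSketch {
  // The compiler infers Int, String and List[Int]; no annotations are needed.
  val count = 50
  val date = "2014-12-03"
  val numbers = List(1, 2, 3)

  // The result type (List[Int]) is inferred from the method body.
  def doubled(xs: List[Int]) = xs.map(_ * 2)

  // An explicit annotation is still allowed and is checked at compile time.
  val total: Int = doubled(numbers).sum

  // val broken: Int = date   // would not compile: a String is not an Int

  def main(args: Array[String]): Unit =
    println(s"total = $total")
}
```

The code reads almost like a dynamically typed language, yet the commented-out line would be rejected before the program ever runs.
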
  10. Why Scala?

    • Concise syntax and powerful expressions
    • Functional programming
    • Compatible with Java (= compatible with Hadoop!)
    • REPL, scripting
    • Apache Spark
    • Collection library, pattern matching, and other great tools
  11. Concise syntax, powerful expressiveness

    • (Obviously) concise syntax is a good thing.
    • if-else branches, try-catch, and the like are all expressions

        // if statement is an expression!
        // (`a` and `doSomeDangerousOperation` are assumed to be defined elsewhere)
        println(if (a == "A") "It's A!" else "It's not A")

        // try catch is an expression!
        val value = try {
          doSomeDangerousOperation
        } catch {
          case _: Exception => "some value"  // the slide uses `case _`, which also catches fatal errors
        }

        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")
  12. Concise syntax, powerful expressiveness

    • Consistent operators

        // Java
        "A".equals("B")
        // Scala
        "A" == "B"

    • Sensible class equality

        case class Person(name: String, work: String)
        val kevin = Person("Kevin", "Between")
        val anotherKevin = Person("Kevin", "Between")
        kevin == anotherKevin // true

    Creating a case class instance does not require new.
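  13. Functional Programming

    • Think of a function in the mathematical sense, not in the sense used in conventional programs!
    • y = sin(x): no side effects. Whatever the circumstances, feeding in x produces the matching y
    • tan(x) = sin(x) / cos(x): functions can be treated like data, passed as parameters, and combined
    • y = sin(x): once x is fixed, y does not change. No "variables"!
    • Make your variables immutable!
    * The examples and some of the explanation are borrowed from the book Programming Scala.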
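
The three points above (no side effects, functions as data, immutable values) can be sketched in a few lines. The names `PureFunctionSketch` and `divide` are hypothetical helpers for this illustration, not something the deck defines.

```scala
object PureFunctionSketch {
  // Pure functions: the same x always yields the same result, with no side effects.
  val sin: Double => Double = x => math.sin(x)
  val cos: Double => Double = x => math.cos(x)

  // Functions are values, so they can be passed in and combined into a new function.
  def divide(f: Double => Double, g: Double => Double): Double => Double =
    x => f(x) / g(x)

  // tan(x) = sin(x) / cos(x), built by composition.
  val tan: Double => Double = divide(sin, cos)

  // Immutable bindings: once x is fixed, y never changes.
  val x = math.Pi / 4
  val y = tan(x)
  // y = 2.0   // would not compile: reassignment to val
}
```

Because `tan` depends only on its argument, it can be called from any thread or partition without coordination, which is the property the next slide relies on.
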
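  14. Why are these properties of FP good?

    • Fewer bugs (no falling into unexpected behavior caused by variables)
    • A function, once written, can be trusted (no side effect!)
    • Immutable variables simplify problems (well suited to data sharing and parallelism)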
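  15. Compatibility with Java

    • Runs on the JVM -> good performance when processing large amounts of data!
    • Java libraries can be used as they are
        • Java code from the Hadoop ecosystem can be used as is!
        • Existing code can be converted and reused with little effort
    • Can be compiled together with Java code
        • src/java/…, src/scala/…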
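
As a small illustration of using a Java class directly from Scala, the sketch below reuses `java.util.StringTokenizer`, the same class the Java word count relies on. The object and method names (`JavaInteropSketch`, `tokens`, `wordCount`) are invented for this example.

```scala
import java.util.StringTokenizer

object JavaInteropSketch {
  // Drive the Java tokenizer from Scala and collect its tokens into a Scala List.
  def tokens(line: String): List[String] = {
    val tokenizer = new StringTokenizer(line)
    // countTokens() is evaluated once; nextToken() is then called that many times.
    List.fill(tokenizer.countTokens())(tokenizer.nextToken())
  }

  // Count words with the Scala collection library on top of the Java tokens.
  def wordCount(line: String): Map[String, Int] =
    tokens(line).groupBy(identity).map { case (word, hits) => (word, hits.size) }

  def main(args: Array[String]): Unit =
    wordCount("to be or not to be").foreach(println)
}
```

No wrapper or binding layer is involved: the Java class is imported and instantiated like any Scala class.
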
  16. REPL

    • Read–Eval–Print Loop (aka shell)
    • You can learn and try out a new language quickly!
    • When inspecting data, being able to work step by step is a big help

    If an error occurs, you see it immediately. Working with data becomes interactive!
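  17. Apache Spark

    • An in-memory, high-performance distributed data processing system (10 to 100 times faster than earlier systems)
    • Written in Scala, with an interface similar to Scala's collection library
    • Provides the Spark shell, which adds features to the Scala shell
    • A variety of related projects exist for general-purpose use
        • SQL, machine learning, graph analysis, and so on
    • Still being developed rapidly and attracting a lot of attention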
  18. Beyond that..

    • Collection library
    • Pattern matching
    • Elegant tools such as implicit
    • Covered in more detail in the second half
  19. Data structures

    • Collections such as List, Map, and Set
        • List(1, 2, 3), Map(1 -> "a", 2 -> "b"), Set(1, 2)
    • Tuple
        • val sparkTechTalk = ("2014-12-03", 50)
        • sparkTechTalk._1
        • case (key, value) => println(key)
    • Option
        • Use it instead of null when a value may be absent! (more convenient, safer programming)
        • a = 1, a = null (conventional); a = Some(1), a = None (with Option)
        • a.nonEmpty, a.getOrElse(0)
    • Range
        • for (i <- 0 to 10) println(i)
        • (0 to 10).foreach(println)
        • (0 until 10), (0 to 10), (0 to -10 by -1)
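
The fragments on this slide can be assembled into a small runnable sketch. The tuple and map literals come from the slide; the object name `DataStructureSketch` and the remaining value names are illustrative assumptions.

```scala
object DataStructureSketch {
  // Tuple: positional access and destructuring
  val sparkTechTalk = ("2014-12-03", 50)
  val first = sparkTechTalk._1
  val (date, count) = sparkTechTalk

  // Option instead of null: a lookup that may find nothing
  val names = Map(1 -> "a", 2 -> "b")
  val found: Option[String] = names.get(1)    // a value is present
  val missing: Option[String] = names.get(3)  // no value, but no null either
  val fallback = missing.getOrElse("unknown")

  // Range: `to` includes the end, `until` excludes it, `by` sets the step
  val inclusive = (0 to 3).toList
  val exclusive = (0 until 3).toList
  val countdown = (0 to -3 by -1).toList
}
```

`Map.get` returning an `Option` forces the caller to decide what happens when the key is absent, instead of discovering a null later.
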
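  20. Working with collections

    • (n), head, tail, last, contains, distinct, drop, …
    • Functional combinators
        • map: applies a function to each element, transforming it into another form
        • filter: applies a true/false predicate to each element and keeps only the items for which it is true
        • foreach: similar to map, but only iterates without transforming into another form
        • foldLeft (foldRight, reduce): combines the elements into one, starting from the leftmost element
        • Usage examples follow in the second half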
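
A minimal sketch of the accessors and combinators listed above, applied to one small list. The object name `CombinatorSketch` and the value names are made up for illustration.

```scala
object CombinatorSketch {
  val list = List(1, 2, 3, 4)

  // Basic accessors
  val second = list(1)                 // element at index 1
  val headAndLast = (list.head, list.last)
  val rest = list.tail                 // everything after the first element
  val unique = List(1, 1, 2).distinct  // duplicates removed

  // Functional combinators
  val doubled = list.map(_ * 2)        // transform every element
  val small = list.filter(_ < 3)       // keep elements where the predicate is true
  val sum = list.foldLeft(0)(_ + _)    // combine from the left, starting at 0
  val product = list.reduce(_ * _)     // combine without an explicit start value

  def main(args: Array[String]): Unit =
    list.foreach(println)              // iterate for the side effect only
}
```

`foldLeft` takes an explicit starting value and therefore also works on an empty list, while `reduce` throws on one.
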
  21. Function Literal

    Finding the values below 3 in a sequence:

        val list = List(1, 2, 3, 4)

        list.filter((x: Int) => x < 3)
        val testNumber1 = (x: Int) => x < 3   // function as a 1st-class object!
        list.filter(testNumber1)

        list.filter((x) => x < 3)             // target typing
        list.filter(x => x < 3)
        list.filter(_ < 3)                    // placeholder

        def testNumber2(x: Int) = x < 3       // function
        list.filter(x => testNumber2(x))
        list.filter(testNumber2(_))
        list.filter(testNumber2 _)
        list.filter(testNumber2)

    Much of the syntax can be abbreviated!
  22. Pattern Matching & Case Class

        val input1 = "three"
        case class Chart(date: String, count: Int)
        val input2 = Chart("2014-12-02", 50)
        val input3 = ("spark-techtalk", 100)

        def matchTest(x: Any): Any = {
          x match {
            case 1 => "one"
            case "two" => 2
            case (key, value) => s"key: $key, value: $value"
            case Chart(date, count) => s"date: $date, count: $count"
            case _ => "others"
          }
        }

        matchTest(input1)   // res0: Any = others
        matchTest(input2)   // res1: Any = date: 2014-12-02, count: 50
        matchTest(input3)   // res2: Any = key: spark-techtalk, value: 100

    • Similar to Java's switch ~ case, but a far more powerful tool
    • Values of different types can be matched in the same expression
    • Even more convenient when combined with case or case class
    • case class: convenient for structuring data
  23. Example: computing simple metrics from a log

        // load log file (`path` is a directory prefix assumed to be defined elsewhere)
        val logFile = new java.io.File(path + "example_log.txt")
        val log = scala.io.Source.fromFile(logFile).getLines().toList

        // parse each CSV line into a LogEntry
        case class LogEntry(dateTime: String, action: String, id: String)
        val logEntries = log.map(csv => csv.split(","))
                            .map(arr => LogEntry(arr(0), arr(1), arr(2)))

        // get sign up
        val logEntriesToday = logEntries.filter(_.dateTime.contains("2014-12-04"))
        val signUp = logEntriesToday.filter(_.action == "SIGN_UP").size

        // active user
        val userIds = logEntriesToday.map(_.id)
        val activeUser = userIds.distinct.size
  24. Bonus: Spark Version

        // load log file (`sc` is the SparkContext)
        val log = sc.textFile("file:///example_log.txt")

        // parse each CSV line into a LogEntry
        case class LogEntry(dateTime: String, action: String, id: String)
        val logEntries = log.map(csv => csv.split(","))
                            .map(arr => LogEntry(arr(0), arr(1), arr(2)))

        // get sign up
        val logEntriesToday = logEntries.filter(_.dateTime.contains("2014-12-04"))
        val signUp = logEntriesToday.filter(_.action == "SIGN_UP").count()

        // active user
        val userIds = logEntriesToday.map(_.id)
        val activeUser = userIds.distinct().count()

    Almost completely identical to the Scala collection API!
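  25. Implicit Conversion

    • For when you want to extend functionality conveniently
    • Define a function that converts to the expected type, and it is applied automatically

        import scala.language.implicitConversions

        implicit def stringToInt(number: String): Int = {
          number match {
            case "one" => 1
            case "two" => 2   // any other input throws a MatchError
          }
        }

        def printNumber(n: Int) = println(n)
        printNumber("one")

    Normally this would be a compile error. Because an implicit conversion is declared, the String => Int conversion happens automatically.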
  26. Implicit Conversion in practice

        object DateParser {
          // stub from the slide: ignores its input and returns the current time
          def parse(dateString: String) = new java.util.Date
        }
        DateParser.parse("2014-12-03")   // java style

        class DateConverter(val s: String) {
          def toDateTime = DateParser.parse(s)
        }
        implicit def string2DateConverter(s: String) = new DateConverter(s)
        "2014-12-03".toDateTime          // better solution using implicit conversion

    You can write code that is more intuitive and consistent!
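  27. Implicit Parameter

    • For when you want to simplify a parameter that is passed over and over

        // before: the same argument is passed on every call
        val date = "2014-12-03"
        calculateSignUp(date)
        calculateActiveUser(date)
        calculateActionCount(date)

        // after: declare the parameter implicit and put one implicit value in scope
        def calculateSignUp(implicit date: String) = ...
        implicit val date: String = "2014-12-03"
        calculateSignUp
        calculateActiveUser
        calculateActionCount(date)   // passing it explicitly still works

    • However, overusing implicit makes code hard to follow!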
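
Since the slide elides the method bodies, the following is only a sketch of how the implicit parameter resolves. The in-memory `log` and the method bodies are hypothetical stand-ins, not the deck's implementation.

```scala
object ImplicitParameterSketch {
  // Hypothetical (date, action) pairs standing in for real log data.
  val log = List(
    ("2014-12-03", "SIGN_UP"),
    ("2014-12-03", "LOGIN"),
    ("2014-12-04", "SIGN_UP")
  )

  // `date` is filled in from the implicit scope when the caller omits it.
  def calculateSignUp(implicit date: String): Int =
    log.count { case (d, action) => d == date && action == "SIGN_UP" }

  def calculateActionCount(implicit date: String): Int =
    log.count { case (d, _) => d == date }

  // The single implicit String in scope; both calls below pick it up.
  implicit val date: String = "2014-12-03"

  val signUps = calculateSignUp               // uses the implicit date
  val actions = calculateActionCount          // uses the implicit date
  val nextDay = calculateSignUp("2014-12-04") // an explicit argument overrides it
}
```

An implicit of a common type such as `String` is easy to pick up by accident, which is one concrete form of the overuse warning above; a small dedicated wrapper type is a safer choice in real code.
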
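  28. Summary

    • Scala is a good language for data analysis (and for other uses too)
    • Concise expression, good performance, functional programming
    • REPL and scripting support
    • You can implement the concepts you want in an elegant way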
  29. Useful references

    • Learn Scala in 5 minutes: http://learnxinyminutes.com/docs/scala/
    • Coursera Scala course: https://www.coursera.org/course/progfun
    • Learning Scala (blog): http://joelabrahamsson.com/learning-scala/
    • Scala School (Twitter): http://twitter.github.io/scala_school/ko/
    • Programming in Scala (Korean edition): written by Martin Odersky, the creator of Scala; available in bookstores nationwide