• In the US: offices in DC, NYC, and Richmond, Virginia
• Digital, Big Data, and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
• Not versioned, not unit tested… → not ready for production
• Spark, but a lot of Spark SQL (data processing)
• Machine Learning in Python (scikit-learn)
→ Objective: industrialization of the code
… by the Scala compiler

    val cleanedDF = tableSchema
      .filter(_.cleaning.isDefined)
      .foldLeft(df) { case (df, field) =>
        val udf: UserDefinedFunction = field.cleaning.get // get the cleaning UDF
        df.withColumn(field.name + "_cleaned", udf.apply(df(field.name)))
          .drop(field.name)
          .withColumnRenamed(field.name + "_cleaned", field.name)
      }
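For context, a minimal sketch of what tableSchema could look like: each field descriptor carries an optional cleaning UDF. TableField and the cleanText rule are illustrative assumptions, not the talk's actual code (Spark 1.x imports, matching the SQLContext used elsewhere in the talk):

    import java.text.Normalizer
    import org.apache.spark.sql.UserDefinedFunction
    import org.apache.spark.sql.functions.udf

    case class TableField(name: String, cleaning: Option[UserDefinedFunction])

    // Example rule: lower-case, strip accents, drop special characters
    val cleanText: UserDefinedFunction = udf((s: String) =>
      Normalizer.normalize(s, Normalizer.Form.NFD)
        .toLowerCase
        .replaceAll("[^a-z ]", "")
        .trim)

    val tableSchema = Seq(
      TableField("ID", None), // no cleaning for the identifier
      TableField("name", Some(cleanText)),
      TableField("surname", Some(cleanText))
    )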
fields" in { val res = resDF.select("ID", "name", "surname").collect() val expected = Array( Row("000010", "jose", "lester"), Row("000011", "jose", "lester ea"), Row("000012", "jose", "lester") ) res should contain theSameElementsAs expected } "The cleaning process" should "parse dates" in { ... Comparison of Row objects 000010;Jose;Lester;10/10/1970 000011;Jose =-+;Lester éà;10/10/1970 000012;Jose;Lester;invalid date
• Don'ts:
  • Parallel execution of tests → multiple contexts
  • Set up / tear down the SparkContext for each test → slow tests
• Do's:
  • Use a shared SparkContext:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object SparkTestContext {
      val conf = new SparkConf()
        .setAppName("deduplication-tests")
        .setMaster("local[*]")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)
    }
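A sketch of how a test suite might reuse that shared context (the suite and its DataFrame are illustrative, not the talk's code):

    import org.scalatest.{FlatSpec, Matchers}

    class SharedContextSpec extends FlatSpec with Matchers {
      // Every suite references the same object, so the JVM hosts a
      // single SparkContext for the whole test run
      val sqlContext = SparkTestContext.sqlContext

      "The shared context" should "be available to every suite" in {
        val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "value")
        df.count() shouldBe 2
      }
    }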
• (Gradient-Boosted Trees also give good results)
• Training on the potential duplicates labeled by hand
• Predictions on the potential duplicates not labeled by hand
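The classifier itself was trained in Python; as a hedged sketch toward the stated industrialization objective, here is how the same train-then-predict step could look with Spark MLlib's Gradient-Boosted Trees (Spark 1.x API; labeledPairs, unlabeledPairs, and the iteration count are assumptions, not the talk's code):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.GradientBoostedTrees
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    import org.apache.spark.rdd.RDD

    // labeledPairs: feature vectors for hand-labeled potential duplicates
    // (label 1.0 = duplicate, 0.0 = not a duplicate, an assumed convention)
    def trainAndPredict(labeledPairs: RDD[LabeledPoint],
                        unlabeledPairs: RDD[Vector]): RDD[Double] = {
      val strategy = BoostingStrategy.defaultParams("Classification")
      strategy.numIterations = 50 // illustrative value
      val model = GradientBoostedTrees.train(labeledPairs, strategy)
      // score the pairs that were not labeled by hand
      unlabeledPairs.map(v => model.predict(v))
    }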