tetsuya0617
September 20, 2019

How to Transform Research Oriented Code into Machine Learning APIs with Python

This is my talk at PyCon Taiwan 2019 🇹🇼 (https://tw.pycon.org/2019/en-us/events/schedule/)

Transcript

  1. How to Transform Research Oriented Code into Machine Learning APIs with Python

    Tetsuya (Jesse) Hirata, @JesseTetsuya
    Software Engineer at Classi, which is an EdTech company. I mostly work in both data science and engineering.
  2. Background and Purpose

    - Recently, Python engineers have more opportunities to work with data scientists and researchers than before.
    - Understanding the process of developing ML APIs can help AI/ML projects run more smoothly.
  3. Steps to transform Research Oriented Code into ML APIs

    Research Oriented Code → ML APIs
    1. Understand
    2. Modularize
    3. Refactor
    4. Check
  4. Steps to transform Research Oriented Code into ML APIs

    1. Understand
    - What is Research Oriented Code?
    - What are ML APIs?
    - How should engineers handle research oriented code?
  5. Definition

    Research oriented code in AI/ML projects is the code written mainly by data scientists or researchers to figure out the most efficient and suitable machine learning model.
  6. Machine Learning APIs are composed of three elements

    1. Preparation code for accessing data
    2. Pre-processing code
    3. Machine learning (ML) code

    Production code (Engineers)
    Research oriented code (Data Scientists/Researchers)

    Research oriented code is developed through an iterative process and integrated into production code.
  7. Data Pre-Processing code

    - Visually trace the code from the top to the bottom
    - Easily and quickly write it
  8. ML code (a part of the whole code)

    Easily handle input data and trace output data with a data frame.
  9. Refactor both kinds of code in a Pythonic way

    This code builds the model in a much faster and simpler way.
  10. Three Differences between Research Oriented Code and Production Code

    Scopes
      Research oriented code: preparation code, preprocessing code, ML code
      Production code: preparation code, preprocessing code, ML code
    Characteristics of coding style
      Research oriented code: easily handled, visually traceable, high calculation speed
      Production code: high readability, testable and modular
    Objectives of coding style
      Research oriented code: finding the most efficient and suitable machine learning model
      Production code: making the code work on the server correctly and reliably
  11. Three Differences between Research Oriented Code and Production Code

    Scopes
      Research oriented code: preprocessing code, machine learning code
      Production code: preparation code, preprocessing code, ML code
    Characteristics of coding style
      Research oriented code: easily handled, visually traceable, high calculation speed
      Production code: high readability, testable and modular
    Objectives of coding style
      Research oriented code: finding the most efficient and suitable machine learning model
      Production code: making the code work on the server correctly and reliably

    Related steps: Modularize, Refactor, Check
  12. Steps to Transform Research Oriented Code into ML APIs

    2. Modularize
    1. Categorize research oriented code into preparation code, preprocessing code, and ML code
    2. Break them out into functions and make them testable
    3. Clarify input and output of the code, and define URI
  13. 2.1. Categorize research oriented code into preparation code, preprocessing code, ML code

    This is a page of research oriented code written in a Jupyter notebook. The code is procedural, and parts of it are not classified into any category. The research oriented code appears to be tightly coupled.
  14. 2.1. Categorize research oriented code into preparation code, preprocessing code, ML code

    - Find the code that loads input data or accesses a database → preparation code
    - Find the code that makes, replaces, filters, or deletes input data → preprocessing code
    - Find the code that executes calculations or trains on data → ML code
  15. 2.2. Break them out into functions and make them testable

    Preparation code
      Module name: preparation.py
      Functions: access BigQuery, execute the query, and load input data; rename columns
    Preprocessing code
      Module name: preprocessing.py
      Functions: replace categorical data with discrete numbers; filter input data
    ML code
      Module name: prediction.py
      Functions: calculate ICC parameters, logistic regression, and item response theory (IRT)

    The research oriented code became loosely coupled.
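As a minimal sketch of this split, the three modules can be pictured as three small functions that pass plain Python lists to one another. Every name and value below (`load_answers`, `encode_answers`, `correct_rate`, the sample rows) is hypothetical, and the per-item correct rate is only a simple stand-in for the ICC/IRT calculation mentioned on the slide.

```python
# Hypothetical stand-ins for preparation.py, preprocessing.py, and prediction.py.

def load_answers():
    """Preparation: return raw rows as [student_id, item_id, answer]."""
    # The talk loads these rows from BigQuery; here they are hard-coded.
    return [
        ["s1", "q1", "correct"],
        ["s1", "q2", "wrong"],
        ["s2", "q1", "correct"],
        ["s2", "q2", None],
    ]


def encode_answers(rows):
    """Preprocessing: drop unanswered rows and replace categories with numbers."""
    mapping = {"correct": 1, "wrong": 0}
    return [[s, item, mapping[a]] for s, item, a in rows if a in mapping]


def correct_rate(rows):
    """ML code stand-in: proportion of correct answers per item (not IRT)."""
    totals = {}
    for _, item, score in rows:
        count, correct = totals.get(item, (0, 0))
        totals[item] = (count + 1, correct + score)
    return {item: correct / count for item, (count, correct) in totals.items()}


def get_probs():
    """Chain the three steps, as an endpoint handler might."""
    return correct_rate(encode_answers(load_answers()))
```

Because each function takes and returns plain data, each one can be tested on its own without a notebook or a database connection.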
  16. 2.3. Clarify input and output of the whole code and define URI

    app.py

        @app.route("/v1/probabilities", methods=['GET'])  ← noun
        def probabilities():                              ← the same endpoint name
            return get_probs(), 200                       ← verb (+ noun), e.g. calc_results() or get_probs()

    INPUT: results of student answers
    OUTPUT: probabilities of answering questions correctly
    *item means a question
  17. Steps to transform Research Oriented Code into ML APIs

    3. Refactor
    1. Prepare for refactoring
    2. Simplify I/O in preparation code
    3. Pandas → Python in preprocessing code
  18. 3.1. Prepare for refactoring

        .
        ├── ml_api
        │   ├── api
        │   │   ├── app.py
        │   │   ├── config
        │   │   ├── prediction.py
        │   │   ├── preparation.py
        │   │   └── preprocessing.py
        │   ├── requirements.txt
        │   ├── run.py
        │   └── tests
        │       ├── test_app.py
        │       ├── test_prediction.py
        │       ├── test_preparation.py
        │       └── test_preprocessing.py
        └── setup.py

    Narrow down the requirements of each piece of code by writing test code, and note those requirements in comments or docstrings (reStructuredText style / NumPy style / Google style) before refactoring. Alternatively, ask the data scientists to write the comments in advance.

    ex) Google style

        def func(arg1, arg2):
            """Summary line.

            Extended description of function.

            Args:
                arg1 (int): Description of arg1
                arg2 (str): Description of arg2

            Returns:
                bool: Description of return value
            """
            return True
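One way to picture "narrowing down requirements with tests" is a small function whose docstring states the contract, plus one test per requirement recovered from the research code. This is only a sketch: `filter_answered` and the row layout are hypothetical, and the tests are plain functions named `test_*` so that a runner such as pytest could collect them.

```python
def filter_answered(rows):
    """Keep only rows whose answer is present.

    Args:
        rows (list[list]): Rows of [student_id, item_id, answer].

    Returns:
        list[list]: Rows whose answer is not None, in the original order.
    """
    return [row for row in rows if row[2] is not None]


# Each test records one requirement recovered from the research code.
def test_drops_rows_without_answer():
    assert filter_answered([["s1", "q1", None]]) == []


def test_keeps_order_of_answered_rows():
    rows = [["s1", "q2", 0], ["s1", "q1", 1]]
    assert filter_answered(rows) == rows


def test_zero_is_a_valid_answer():
    # 0 means "wrong", not "missing", so it must survive the filter.
    assert filter_answered([["s1", "q1", 0]]) == [["s1", "q1", 0]]
```

Once the requirements are pinned down this way, the body of the function can be rewritten freely as long as the tests keep passing.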
  19. 3.2. Simplify I/O in preparation code

    CASE STUDY: Refactoring the code that accesses BigQuery and GCS by using the Google Cloud client libraries with Python
  20. 3.2. Simplify I/O in preparation code

    ex) BigQuery with Python

    Code A (OUTPUT: two-dimensional arrays)

        from google.cloud import bigquery

        client = bigquery.Client()
        query = "SELECT * FROM `data set name`"
        query_job = client.query(query)
        results = [list(row.values()) for row in query_job.result()]

    Code B (OUTPUT: two-dimensional arrays + filter values + drop null)

        from google.cloud import bigquery

        client = bigquery.Client()
        query = "SELECT column_1, column_2, column_3 FROM `data set name` WHERE column_1 IS NOT NULL"
        query_job = client.query(query)
        results = [list(row.values()) for row in query_job.result()]

    → Preprocess the data with the query as much as possible.
    → It is faster and lower-cost than preprocessing the data with Python.
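The difference between fetching everything and filtering in the query can be tried locally. The sketch below uses the standard-library `sqlite3` module purely as a stand-in for BigQuery; the `answers` table and its rows are made up for illustration, and nothing here measures BigQuery cost or speed.

```python
import sqlite3

# sqlite3 is used here only as a local stand-in for BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (column_1 TEXT, column_2 INTEGER, column_3 INTEGER)"
)
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?)",
    [("a", 1, 10), (None, 2, 20), ("c", 3, 30)],
)

# Fetch everything, then drop NULL rows in Python.
all_rows = [
    list(row) for row in conn.execute("SELECT * FROM answers ORDER BY column_2")
]
filtered_in_python = [row for row in all_rows if row[0] is not None]

# Let the query select the columns and drop NULL rows.
filtered_in_query = [
    list(row)
    for row in conn.execute(
        "SELECT column_1, column_2, column_3 FROM answers "
        "WHERE column_1 IS NOT NULL ORDER BY column_2"
    )
]
```

Both approaches end with the same two-dimensional list, but the second one transfers fewer rows and leaves less filtering code to maintain in Python.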
  21. 3.2. Simplify I/O in preparation code

    ex) Google Cloud Storage with Python

    Make a bytes object and upload it from memory to GCS with Python.

        import csv
        import gzip
        import io

        from google.cloud import storage

        storage_client = storage.Client()
        bucket = storage_client.get_bucket('bucket name')

        # Write the rows as CSV text in memory.
        with io.StringIO() as csv_obj:
            writer = csv.writer(csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n")
            writer.writerows(two_dimensional_arrays)
            result = csv_obj.getvalue()

        # Gzip the CSV text in memory and upload the buffer.
        with io.BytesIO() as gzip_obj:
            with gzip.GzipFile(fileobj=gzip_obj, mode="wb") as gzip_file:
                bytes_f = result.encode()
                gzip_file.write(bytes_f)
            blob = bucket.blob('storage_path')
            blob.upload_from_file(gzip_obj, rewind=True, content_type='application/gzip')
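The in-memory part of this upload can be isolated and checked without touching GCS. The following is a sketch under that assumption: `rows_to_csv_gzip` and `csv_gzip_to_rows` are hypothetical helper names, and `gzip.compress` is used in place of the slide's `gzip.GzipFile` buffer to keep the example short.

```python
import csv
import gzip
import io


def rows_to_csv_gzip(rows):
    """Serialize rows to CSV text and gzip the result entirely in memory."""
    with io.StringIO() as csv_obj:
        writer = csv.writer(
            csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n"
        )
        writer.writerows(rows)
        csv_text = csv_obj.getvalue()
    return gzip.compress(csv_text.encode())


def csv_gzip_to_rows(payload):
    """Inverse of rows_to_csv_gzip, useful for checking an upload payload."""
    csv_text = gzip.decompress(payload).decode()
    return list(csv.reader(io.StringIO(csv_text)))


payload = rows_to_csv_gzip([["s1", "q1", 1], ["s2", "q2", 0]])
restored = csv_gzip_to_rows(payload)
```

Note that CSV carries no type information, so the numbers come back as strings after the round trip.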
  22. 3.2. Simplify I/O more by using wrapper

    With gcp-accessor (wrapper library) (https://pypi.org/project/gcp-accessor/):

        import gcp_accessor

        bq = gcp_accessor.BigQueryAccessor()
        query = "SELECT * FROM `data set name`"
        bq.execute_query(query)

        gcs = gcp_accessor.GoogleCloudStorageAccessor()
        gcs.upload_csv_gzip('bucket name', 'full path on gcs', 'input data')

    This replaces the google-cloud-bigquery and google-cloud-storage code shown on slides 20 and 21.
  23. 3.3. Pandas → Python in preprocessing code

    All data in the API is processed using the same data type. This prioritizes readability and maintainability over processing speed.
  24. 3.3. Pandas → Python in preprocessing code

    One day, I wondered why I was struggling so much to refactor the preprocessing code in research oriented code that I had written only the week before.
  25. 3.3. Pandas → Python in preprocessing code

    Code styles by preprocessing function

    Filter
      Pandas: dataframe.where() / dataframe.query(), dataframe.groupby(), dataframe[["", "", ""]], dataframe.loc[], dataframe.iloc[]
      Python: if - else + for + .append(), or [[v1, v2, v3] for value in values]
    Replace
      Pandas: dataframe.fillna(); dic = {"key1": value1, "key2": value, …}; dataframe['column1'].replace(dic, inplace=True)
      Python: dic = {"key1": value1, "key2": value, …}; [[dic.get(v, v) for v in value] for value in values]
    De-duplicate / Be unique
      Pandas: duplicated() / drop_duplicates(); dataframe['column1'].unique() (output: array([v1, v2, v3]))
      Python: set(list); list({v1, v2, v2, …}); list({value[0] for value in values})
    Delete / Drop
      Pandas: dataframe.dropna(); dataframe.drop(); dataframe.drop(index=index list)
      Python: if - else + for + .append(), or [[v1, v2, v3] for value in values]
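The plain-Python column of this comparison can be sketched on a small two-dimensional list. The sample rows and variable names below are made up for illustration; only the standard library is used.

```python
values = [
    ["s1", "math", "A"],
    ["s2", "math", None],
    ["s3", "art", "B"],
    ["s1", "math", "A"],
]

# Filter: keep rows for one subject and select two of the columns.
math_rows = [[v[0], v[2]] for v in values if v[1] == "math"]

# Replace: map categories to numbers, leaving unknown values untouched.
dic = {"A": 1, "B": 2}
replaced = [[dic.get(v, v) for v in value] for value in values]

# De-duplicate: unique first-column values (sorted, because sets are unordered).
unique_students = sorted({value[0] for value in values})

# Drop: remove rows that contain a missing value.
without_missing = [value for value in values if None not in value]
```

Every result is still a list of lists (or a flat list), so the same data type flows through the whole API.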
  26. Steps to transform Research Oriented Code into ML APIs

    4. Check
    1. Write decorators to check parameters
    2. Set up production-like environments
  27. 4.1. Write decorators to check parameters

    Image of decorators in APIs
    - Flow: Client, Request, URIs, preparation, preprocessing, calculation
    - Decorators: access token check, request parameter check, error handling
  29. 4.1. Write decorators to check parameters

    ex) Request parameter check with JSON Schema

    make_name_grade.json

        {
          "$schema": "http://json-schema.org/draft-04/schema#",
          "type": "object",
          "properties": {
            "student_name": { "type": "string" },
            "student_grade": { "type": "string", "maximum": 120, "minimum": 1 }
          },
          "required": ["student_name", "student_grade"]
        }

    request curl command

        curl http://localhost:5000/ -X POST -H "Content-Type: application/json" -d '{"student_name": "test_name", "student_grade": "fourth-grade"}'
  30. 4.1. Write decorators to check parameters

    ex) Request parameter check with JSON Schema

    json_validate.py

        from functools import wraps

        from flask import current_app, jsonify, request
        from jsonschema import ValidationError, validate
        from werkzeug.exceptions import BadRequest


        def validate_json(f):
            @wraps(f)
            def wrapper(*args, **kw):
                try:
                    request.json  # raises BadRequest when the body is not valid JSON
                except BadRequest:
                    msg = "This is an invalid json"
                    return jsonify({"error": msg}), 400
                return f(*args, **kw)
            return wrapper


        def validate_schema(schema_name):
            def decorator(f):
                @wraps(f)
                def wrapper(*args, **kw):
                    try:
                        validate(request.json, current_app.config[schema_name])
                    except ValidationError as e:
                        return jsonify({"error": e.message}), 400
                    return f(*args, **kw)
                return wrapper
            return decorator

    app.py

        import json

        from flask import Flask, request

        from json_validate import validate_json, validate_schema

        app = Flask(__name__)
        # app.config['make_name_grade'] is assumed to hold the loaded schema.


        @app.route('/', methods=['POST'])
        @validate_json
        @validate_schema('make_name_grade')
        def index():
            if request.method == 'POST':
                data = json.loads(request.data)
                print(data["student_name"])
                print(data["student_grade"])
                return "Hi! " + data["student_name"]
            else:
                return "Hi!"

    The code in json_validate.py is cited from: https://stackoverflow.com/questions/24238743/flask-decorator-to-verify-json-and-json-schema
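The decorator pattern itself does not depend on Flask or jsonschema. As a framework-free sketch, the hypothetical `require_fields` decorator below checks a plain dict payload with hand-rolled rules and returns a `(body, status)` tuple in the same spirit; it is an illustration of the pattern, not the validation used in the talk.

```python
from functools import wraps


def require_fields(**expected_types):
    """Reject payloads that miss a field or carry a value of the wrong type."""
    def decorator(f):
        @wraps(f)
        def wrapper(payload, *args, **kw):
            for name, expected in expected_types.items():
                if name not in payload:
                    return {"error": f"'{name}' is required"}, 400
                if not isinstance(payload[name], expected):
                    return {"error": f"'{name}' must be {expected.__name__}"}, 400
            # Only valid payloads reach the wrapped handler.
            return f(payload, *args, **kw)
        return wrapper
    return decorator


@require_fields(student_name=str, student_grade=str)
def greet(payload):
    return {"message": "Hi! " + payload["student_name"]}, 200
```

Keeping the checks in a decorator means the handler body only deals with payloads that have already passed validation.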
  31. 4.2. Set up production-like environments with Flask

    - Automate Continuous Integration
    - Visualize data (Load Test)
    - Deploy on GCP

    Flask-AppBuilder, Dash
  32. Resources

    Python tools that I mentioned in this talk
    - Locust: https://www.youtube.com/watch?v=XQ4hrbgVysk (PyCon Korea 2015)
    - Refactoring: https://www.youtube.com/watch?v=D_6ybDcU5gc (PyCon US 2016)
    - Pytest: https://www.youtube.com/watch?v=G-MAMrJ-CSA (PyCon US 2019)
    - Flask workshop: https://www.youtube.com/watch?v=DIcpEg77gdE (PyCon US 2015)
    - Dash: https://www.youtube.com/watch?v=WLbQYFZc-YY (PyCon JP 2019)

    Python packages that I mentioned in this talk
    - google-cloud-bigquery: https://pypi.org/project/google-cloud-bigquery/
    - google-cloud-storage: https://pypi.org/project/google-cloud-storage/
    - gcp-accessor: https://pypi.org/project/gcp-accessor/0.0.1/
    - Flask-AppBuilder: https://flask-appbuilder.readthedocs.io/en/latest/
  33. Summary

    Research Oriented Code → ML APIs

    1. Understand
    - What is Research Oriented Code?
    - What are ML APIs?
    - How should engineers handle research oriented code?
    2. Modularize
    - Categorize research oriented code into preparation code, preprocessing code, ML code
    - Break them out into functions and make them testable
    - Clarify input and output of the code, and define URI
    3. Refactor
    - Prepare for refactoring
    - Simplify I/O in preparation code
    - Pandas → Python in preprocessing code
    4. Check
    - Write decorators to check parameters
    - Set up production-like environments
  34. Tetsuya (Jesse) Hirata, @JesseTetsuya

    Software Engineer at Classi, which is an EdTech company. I mostly work in both data science and engineering.
    If you are interested in how I am refactoring, in the EdTech domain, or in what our team is doing, feel free to talk to me later!