Upgrade to Pro — share decks privately, control downloads, hide ads and more …

OML4Py features: Using third-party Python packa...

OML4Py features: Using third-party Python packages from Python, SQL and REST

On this Office Hours for Oracle Machine Learning on Autonomous Database, Sherry LaMonica, Principal Member of Technical Staff at Oracle Machine Learning team explained how to use third-party Python packages from Python, SQL and REST via OML4Py.

So you want to deploy your Python scripts to production in the database environment? With Oracle Machine Learning for Python, you can invoke user-defined functions on database-spawned and controlled Python engines using embedded Python execution. An added benefit is the ability to easily leverage data-parallel and task-parallel invocation aided by the database environment. Join this session to learn about the options for working with third-party packages with OML4Py, both on premises and in Autonomous Database, from the Python, SQL and REST APIs.

The Oracle Machine Learning product family supports data scientists, analysts, developers, and IT to achieve data science project goals faster while taking full advantage of the Oracle platform.

Marcos Arancibia

October 05, 2021
Tweet

More Decks by Marcos Arancibia

Other Decks in Technology

Transcript

  1. OML feature highlight: Using third-party Python packages from Python, SQL,

    and REST OML Office Hours Sherry LaMonica Principal Member of Technical Staff, Oracle Machine Learning Supported by Marcos Arancibia and Mark Hornick Product Management, Oracle Machine Learning Move the Algorithms; Not the Data! Copyright © 2021, Oracle and/or its affiliates. This Session will be Recorded
  2. • Third-party packages • Introduction to Embedded Python Execution •

    Benefits and motivation • Embedded Python Execution interfaces • Demo • Q&A Goals for today Copyright © 2021, Oracle and/or its affiliates 2
  3. • A good source for state-of-the-art or new algorithms or

    functionality • Supports a wide range of functionality – Data transformation, statistical analysis, visualization, machine learning, etc. – Medical imaging, geospatial analysis, reproducible research • If a package is widely used it will quickly evolve into a robust offering through collaborative efforts • Not typically implemented for scalability and performance – Often focused on specialized topics and functionality rather than memory management or parallelism – Some packages quickly consume available memory Third-party packages 4 Copyright © 2021, Oracle and/or its affiliates.
  4. • How big is the data? – Will data fit

    into Python memory? – Will the package work on my data volumes? • Is working directly in Python more performant? – For small data, working in open-source Python may provide adequate performance • Can the data be partitioned for processing in parallel? – Partition by group or row (data-parallel) – Run n times in parallel (task-parallel), e.g., simulations • Is there equivalent in-database functionality? – In-database machine learning are scalable and parallelized Should I use a third-party Python package on database data? 5 Copyright © 2021, Oracle and/or its affiliates.
  5. There is no magic! • Performance gains are the result

    of running in data-parallel or task-parallel Python engines in the database environment • Using embedded Python does not make a Python package automatically scale – No automatic modification to third-party package code – No automation of scalable or parallel behavior for Python packages • Database environment may have more memory, CPU, or faster CPUs to increase scalability from a typical client machine Performance characteristics for open-source Python packages with Embedded Python Execution Copyright © 2021, Oracle and/or its affiliates 6
  6. Oracle Autonomous Database User-installed packages currently not available OML4Py Dependencies

    cx_Oracle 8.1.0 matplotlib 3.1.2 numpy 1.18.1 pandas 0.25.3 scikit-learn 0.23.1 scipy 1.6.1 cycler 0.10.0 joblib 0.14.0 kiwisolver 1.1.0 pyparsing 2.4.0 python-dateutil 2.8.1 pytz 2019.3 six 1.13.0 threadpoolctl 2.1.0 Oracle Database User-installed and configured packages supported Which third-party packages can be used with OML4Py? Copyright © 2021, Oracle and/or its affiliates 7
  7. Using pip • Python’s internal package manager • Installs Python

    packages from PyPI repository along with their dependencies • Often used with configuration file requirements.txt to install specified packages and versions $ pip install pandas $ pip install -r requirements.txt Package Management Copyright © 2021, Oracle and/or its affiliates. 8
  8. Using virtualenv • Tool for creating isolated Python environments •

    Use different versions of a package for distinct projects • Organize groups of packages into different isolated environments $ pip install virtualenv Package Management Copyright © 2021, Oracle and/or its affiliates. 9 Python 3.5 Python 3.9 Pandas 0.23.1 Pandas 0.25.3 virtualenv
  9. • Develop user-defined Python functions and manage these in the

    database for production deployment – Invoked using REST endpoints on Oracle Autonomous Database – Invoked using a SQL interface on Oracle Database • Python objects can be stored in the database using the OML4Py Datastore – Eliminates the need to manage objects using flat files • Datastores work with embedded Python execution for passing objects to user-defined functions – Pass non-scalar objects (e.g., a model) as arguments to user-defined functions in SQL, REST, and Python embedded execution Interfaces – Enables sharing data science work products from data scientists with application developers for deployment • Run user-defined Python functions in a data-parallel and task-parallel manner – Supports “embarrassingly parallel” solutions – Enables scoring third-party Python models at scale • User-defined function results can contain both DataFrames and images • Supports SQL, Python, and REST Interfaces Embedded Python Execution: Introduction Copyright © 2021, Oracle and/or its affiliates 11
  10. • Supplements OML4Py functionality through use of third-party packages in

    user-defined functions • Facilitates application use of Python script results – Develop/test Python scripts interactively with Python interface – Invoke Python scripts directly from SQL or REST for production applications o Python scripts stored in Oracle Database o Schedule Python scripts to run automatically via SQL • Potential for improved performance and throughput – Data-parallel and task-level parallelism – Leverage compute and memory resources of the database environment – Avoids the overhead of stacked SQL • Supports structured and image results • Retrieve results in various forms depending on application requirements: table, XML, PNG, JSON Embedded Python Execution: Benefits Copyright © 2021, Oracle and/or its affiliates 12
  11. OML4Py Interfaces for Embedded Python Execution Autonomous Database • Python

    Interface (via OML Notebooks) • REST Interface Oracle Database • Python Interface • SQL Interface Copyright © 2021 Oracle and/or its affiliates. OML4Py REST Interface Oracle Autonomous Database Oracle Database SQL Interface
  12. Python and SQL Interfaces Embedded Python Execution functions Copyright ©

    2021, Oracle and/or its affiliates 14 Python Interface Function SQL Interface Function Purpose oml.do_eval pyqEval Invoke stand-alone Python script oml.table_apply pyqTableEval Invoke Python script with full table input oml.row_apply pyqRowEval Invoke Python script one row at a time, or multiple rows in “chunks” oml.group_apply pyqGroupEval Invoke Python script on data indexed by grouping column oml.index_apply N/A (use pyqGroupEval) Invoke Python script N times
  13. REST Interface Embedded Python Execution functions Copyright © 2021, Oracle

    and/or its affiliates 15 REST Endpoint Purpose /api/py-scripts/v1/do-eval/{scriptName} Invoke stand-alone Python script /api/py-scripts/v1/table-apply/{scriptName} Invoke Python script with full table input /api/py-scripts/v1/row-apply/{scriptName} Invoke Python script one row at a time, or multiple rows in “chunks” /api/py-scripts/v1/group-apply/{scriptName} Invoke Python script on data indexed by grouping column /api/py-scripts/v1/index-apply/{scriptName} Invoke Python script N times
  14. Dependent on skill set and application requirements Python Interface •

    Develop user defined-functions • Create reproducible, shareable functions SQL and REST Interfaces • Invoke user-defined Python functions for deployment in applications – enables deployment to other platforms – increases project scalability • Schedule invocation using DBMS_SCHEDULER (SQL only) Which OML4Py Interface should I use for Embedded Python Execution? Copyright © 2021, Oracle and/or its affiliates 16
  15. • Table of Python scripts created by OML4Py user that

    can be saved and managed for use within Embedded Python Execution – Leverages security provided by Oracle Database – Python scripts are created securely • Creating Scripts – On premises, the PYQADMIN role is required for users who create and drop Python scripts – In ADB, OML_Developer role is required for users to create and drop Python scripts • Sharing Scripts – Scripts can be global or private (oml.script.create(..., is_global=True) – Functions oml.grant and oml.revoke can be used to grant or revoke the read privilege for a private Python script to other users – Only the user who created the Python script can modify and drop a private script – Every user has read access to a global script OML4Py Script Repository Copyright © 2021, Oracle and/or its affiliates 18
  16. Python, SQL, and REST interfaces Functions for managing scripts Copyright

    © 2021, Oracle and/or its affiliates 19 Python Interface SQL Interface REST Endpoint Purpose oml.script.create pyqScriptCreate N/A Registers a single user- defined Python function in the script repository oml.script.dir USER_PYQ_SCRIPTS view /api/py-scripts/v1/scripts Lists the user-defined Python functions present in the script repository oml.script.drop pyqScriptDrop N/A Drops a user-defined Python function from the script repository oml.script.load N/A Scripts are automatically loaded in the REST API when the function is called Loads a user-defined Python function from the script repository into a Python session
  17. • Repository in user schema allows storage and management of

    Python and OML4Py objects • Used for passing arguments to Python functions with embedded Python execution, especially non-scalar values for REST and SQL invocation, such as native Python ML models • Serialize and unserialize Python and OML4Py objects in a database table • Metadata stored in OML4Py's SYS schema, PYQSYS, and objects stored in user schema • Share objects with other data analysts and data scientists – Functions oml.grant and oml.revoke can be used to grant or revoke the read privilege to datastore instances – Objects can be shared as read-only with specific users or all users OML4Py Datastore Copyright © 2021, Oracle and/or its affiliates 20
  18. Python and SQL interfaces Functions for managing datastore objects Copyright

    © 2021, Oracle and/or its affiliates 21 Python Interface SQL Interface Purpose oml.ds.save N/A Saves Python objects to a named datastore in the user’s database schema oml.ds.dir USER_PYQ_DATASTORES view Lists the datastores or datastores contents available to the current user oml.ds.describe ALL_PYQ_DATASTORE_CONTENTS view Shows the objects in the specified datastore oml.ds.delete N/A Deletes one or more datastores or Python objects from a datastore oml.ds.load N/A Loads Python objects from a datastore into the user’s session
  19. Autonomous Database • Extends parallelism by enabling different service levels

    to manage the load on the system • Service levels control the degree of parallelism for jobs • Parallelism for service levels is LOW(2), MEDIUM(4), HIGH(8) – parallel=True corresponds to service level – parallel=x is limited by service level • For SQL, auto-scaling adds compute resources on demand - up to 3X for CPU and memory Blog: Machine Learning Performance on Autonomous Database https://blogs.oracle.com/machinelearning/post/machine-learning-performance-on-autonomous-database Parallelism - Embedded Python Execution Copyright © 2021, Oracle and/or its affiliates 23
  20. Oracle Database Parallelism - Embedded Python Execution Python interface Copyright

    © 2021, Oracle and/or its affiliates 24 OML4Py embedded Python function “parallel” argument • Maps to DB parallel query hint • Supported by oml.group_apply, oml.row_apply, oml.index_apply • Values – positive integer >= 2 for a specific degree of parallelism – FALSE or 1 for no parallelism – TRUE takes on the ‘data’ argument's default parallelism – NULL for the database default for the operation; serial execution by default Table Parallelism Embedded Python “parallel” argument
  21. Oracle Database Parallelism - Embedded Python Execution Python SQL interface

    Copyright © 2021, Oracle and/or its affiliates 25 Table Parallelism Parallel Query Hint Parallel query execution is prioritized by following DOP settings in order: • hint • SELECT /*+ parallel(4) */ COUNT(*) FROM IRIS; • table • ALTER TABLE IRIS PARALLEL 4;
  22. spawns Python interface for Embedded Python Execution Example of parallel

    partitioned data flow using third party package # user-defined function using sklearn def score_mod(dat): import pandas as pd import oml obj_dict = oml.ds.load(name="ds_regr",to_globals=False) regr = obj_dict["regr"] pred = regr.predict(dat[['SEPAL_LENGTH]]) return pd.concat([dat[['SPECIES','PETAL_WIDTH']], pd.DataFrame(pred, columns=['Pred_PETAL_WIDTH'])], axis=1) # invoke function in parallel on IRIS table pred = oml.row_apply(IRIS, score_mod, rows=10, parallel=True, func_value=pd.DataFrame([('a', 1, 1)], columns=['SPECIES', 'PETAL_LENGTH','PRED_PETAL_LENGTH'])) OML4Py Python Engine OML4Py Python Engine OML4Py OML Notebooks Copyright © 2021 Oracle and/or its affiliates. REST Interface Oracle Autonomous Database User tables
  23. spawns SQL interface for Embedded Python Execution Example of parallel

    partitioned data flow using third party package OML4Py Python Engine OML4Py Python Engine OML4Py OML Notebooks Copyright © 2021 Oracle and/or its affiliates. REST Interface Oracle Autonomous Database User tables SELECT * FROM table(pyqRowEval( IRIS, # table or view '{"oml_connect":1, # arguments "oml_input_type":"pandas.DataFrame", '{"SPECIES":"varchar2(10)", # return format "PETAL_LENGTH":"number", "Pred_PETAL_LENGTH":"number"}', 10, # rows per invocation 'score_mod')); # function name
  24. REST Interface for Embedded Python Execution Copyright © 2021, Oracle

    and/or its affiliates 29 <autonomous-database-url>/oml/tenants/<tenant_name>/databases/<pdb_name>/api/py-scripts/v1/<operation>/<script_name>/ py_scripts for executing user-defined functions (Python “scripts”) Cloud service URL Customer tenant name Name of pluggable database within ADB Name of script in repository do-eval table-apply group-apply index-apply row-apply Example of synchronous invocation from cURL $ curl -X POST --header "Authorization: Bearer ${token}" --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"input":"select * from IRIS", "parameters":"{\"oml_input_type\":\"pandas.DataFrame\"}", "rows":10, "parallelFlag":true, "service":“MEDIUM"}’ “$<autonomous-database-url/oml/tenants/MYTENANT/databases/MYPDB/api/py-scripts/v1/row-apply/score_mod“ Asynchronous invocation also available