
Full-stack Data Science: How to be a One-Man Data Team

Slides for the talk I gave at a CognitionX meetup.

Based on my three years of experience at startups, moving from web development to data engineering to data science. A ton of tips and tricks for doing data science fast.

Greg Goltsov

July 26, 2016

Transcript

  1. Full-stack data science: how to be a one-man data team. Greg Goltsov, Data Hacker. gregory.goltsov.info / @gregoltsov
  2. Greg Goltsov, Data Hacker: 3+ years in startups; Pythonista; built backends for 1 mil+ users; delivered to Fortune 10; engineering → science. gregory.goltsov.info / @gregoltsov
  3. My journey: invest in tools that last; data is simple; explore liberally; start fast, iterate faster; analysis is a DAG; don't guard, empower instead; what next?
  4. CS + Physics → games dev → data analyst/engineer/viz/* → Data Hacker → Data Scientist (University → Touch Surgery → Appear Here)
  5. Postgres JSONB: "Hey, there's MongoDB in my Postgres!"

     CREATE TABLE events (
         name       varchar(200),
         visitor_id varchar(200),
         properties jsonb,
         browser    jsonb
     );

     http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  6. Postgres JSONB

     INSERT INTO events VALUES (
         'pageview', '1',
         '{ "page": "/account" }',
         '{ "name": "Chrome", "os": "Mac",
            "resolution": { "x": 1440, "y": 900 } }'
     );

     INSERT INTO events VALUES (
         'purchase', '5',
         '{ "amount": 10 }',
         '{ "name": "Firefox", "os": "Windows",
            "resolution": { "x": 1024, "y": 768 } }'
     );

     http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  7. Postgres JSONB

     SELECT browser->>'name' AS browser, count(browser)
     FROM events
     GROUP BY browser->>'name';

      browser | count
     ---------+-------
      Firefox |     3
      Chrome  |     2

     http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
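     A sketch of pulling the same aggregation into Python with psycopg2 (the
     connection string is a placeholder, not from the talk). Note that ->>
     returns text while -> returns jsonb, so nested keys chain as
     browser->'resolution'->>'x':

         import psycopg2

         conn = psycopg2.connect('postgresql://u:p@localhost:5432/db')
         cur = conn.cursor()
         cur.execute("""
             SELECT browser->>'name' AS browser, count(browser)
             FROM events
             GROUP BY browser->>'name';
         """)
         for name, count in cur.fetchall():
             print(name, count)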
  8. Postgres WITH (common table expressions)

     WITH new_users AS (...),
          unverified_users_ids AS (...)
     SELECT COUNT(new_users.id)
     FROM new_users
     WHERE new_users.id NOT IN (SELECT id FROM unverified_users_ids);
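     The CTE bodies are elided on the slide; a hypothetical filled-in version,
     assuming a users table with id, created_at and verified columns (none of
     which are from the talk), might look like:

         import psycopg2

         conn = psycopg2.connect('postgresql://u:p@localhost:5432/db')
         cur = conn.cursor()
         cur.execute("""
             WITH new_users AS (
                 SELECT id FROM users
                 WHERE created_at > now() - interval '30 days'
             ),
             unverified_users_ids AS (
                 SELECT id FROM users WHERE NOT verified
             )
             SELECT COUNT(new_users.id)
             FROM new_users
             WHERE new_users.id NOT IN (SELECT id FROM unverified_users_ids);
         """)
         print(cur.fetchone()[0])  # count of new users that are verified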
  9. pandas: What vs How (http://worrydream.com/LadderOfAbstraction)

     # plain python
     col_C = []
     for i, row in enumerate(col_A):
         c = row + col_B[i]
         col_C.append(c)

     # pandas
     df['C'] = df['A'] + df['B']
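     A self-contained version of the same comparison, with toy data that is
     assumed rather than taken from the talk:

         import pandas as pd

         col_A, col_B = [1, 2, 3], [10, 20, 30]

         # plain python: spells out HOW to combine the columns
         col_C = []
         for i, row in enumerate(col_A):
             col_C.append(row + col_B[i])

         # pandas: declares WHAT you want; element-wise addition is implied
         df = pd.DataFrame({'A': col_A, 'B': col_B})
         df['C'] = df['A'] + df['B']
         print(df)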
 10. Like to clean data. Slice & dice data fluently.

     http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says
 11. cookiecutter-data-science: `rails new` for data science. Notebooks are
     for exploration; sane structure is for collaboration.

     https://drivendata.github.io/cookiecutter-data-science
  12. !"" Makefile <- Makefile with commands like `make data` or

    `make train` !"" data # !"" external <- Data from third party sources. # !"" interim <- Intermediate data that has been transformed. # !"" processed <- The final, canonical data sets for modeling. # $"" raw <- The original, immutable data dump. !"" docs <- A default Sphinx project; see sphinx-doc.org for details !"" models <- Trained and serialized models, model predictions !"" notebooks <- Jupyter notebooks !"" references <- Data dictionaries, manuals, and all other explanatory materials. !"" reports <- Generated analysis as HTML, PDF, LaTeX, etc. !"" requirements.txt <- The requirements file for reproducing the env !"" src <- Source code for use in this project. # !"" data <- Scripts to download or generate data # # $"" make_dataset.py # !"" features <- Scripts to turn raw data into features for modeling # # $"" build_features.py # !"" models <- Scripts to train models and then use trained models to make # # # predictions # # !"" predict_model.py # # $"" train_model.py # $"" visualization <- Scripts to create exploratory and results oriented visualizations # $"" visualize.py
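     One way to scaffold this layout is cookiecutter's Python API (the
     cookiecutter CLI works just as well); the template URL is the project
     linked on the previous slide:

         from cookiecutter.main import cookiecutter

         # prompts for project name etc., then generates the tree above
         cookiecutter('https://github.com/drivendata/cookiecutter-data-science')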
 13. dataset.readthedocs.io: just write SQL

     import dataset
     import pandas
     from stuf import stuf

     # connect; return rows as objects with attribute access
     db = dataset.connect('postgresql://u:p@localhost:5432/db',
                          row_type=stuf)

     # materialise the result so it can be iterated more than once
     rows = list(db.query('SELECT country, COUNT(*) AS c '
                          'FROM "user" GROUP BY country'))

     # print all the rows
     for row in rows:
         print(row['country'], row['c'])

     # get data into pandas, that's where the fun begins!
     rows_df = pandas.DataFrame.from_records(rows)
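     dataset writes as easily as it reads; a sketch of persisting the
     aggregate back (the country_counts table name is illustrative, and
     dataset creates the table and its columns on first insert):

         table = db['country_counts']
         for record in rows_df.to_dict('records'):
             table.insert(record)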
 14. sklearn-pandas

     from sklearn_pandas import DataFrameMapper
     import sklearn.preprocessing, sklearn.pipeline
     from sklearn import ensemble, feature_selection, grid_search

     mapper = DataFrameMapper([
         (['age'], [sklearn.preprocessing.Imputer(),
                    sklearn.preprocessing.StandardScaler()]),
         ...])

     pipeline = sklearn.pipeline.Pipeline([
         ('featurise', mapper),
         ('feature_selection', feature_selection.SelectKBest(k=100)),
         ('random_forest', ensemble.RandomForestClassifier())])

     cv_params = dict(
         feature_selection__k=[100, 200],
         random_forest__n_estimators=[50, 100, 200])

     cv = grid_search.GridSearchCV(pipeline, param_grid=cv_params)
     best_model = cv.best_estimator_  # available once cv has been fitted
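     best_estimator_ only exists after the grid search has been fitted; a
     minimal usage sketch, assuming a feature DataFrame X_df and a label
     vector y (both placeholders, not from the talk):

         cv.fit(X_df, y)                  # runs the whole pipeline per parameter combo
         best_model = cv.best_estimator_
         predictions = best_model.predict(X_df)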
 15. "The goal is to turn data into information, and information into
     insight." – Carly Fiorina, former HP CEO