Full-stack Data Science: How to be a One-Man Data Team
Slides for the talk I gave at a CognitionX meetup.
Based on my 3 years of experience working at startups, going from web developer to data engineering, to data science. A ton of tips and tricks on fast ways of doing data science.
`make train` !"" data # !"" external <- Data from third party sources. # !"" interim <- Intermediate data that has been transformed. # !"" processed <- The final, canonical data sets for modeling. # $"" raw <- The original, immutable data dump. !"" docs <- A default Sphinx project; see sphinx-doc.org for details !"" models <- Trained and serialized models, model predictions !"" notebooks <- Jupyter notebooks !"" references <- Data dictionaries, manuals, and all other explanatory materials. !"" reports <- Generated analysis as HTML, PDF, LaTeX, etc. !"" requirements.txt <- The requirements file for reproducing the env !"" src <- Source code for use in this project. # !"" data <- Scripts to download or generate data # # $"" make_dataset.py # !"" features <- Scripts to turn raw data into features for modeling # # $"" build_features.py # !"" models <- Scripts to train models and then use trained models to make # # # predictions # # !"" predict_model.py # # $"" train_model.py # $"" visualization <- Scripts to create exploratory and results oriented visualizations # $"" visualize.py
with attributes db = dataset.connect('postgresql://u:p@localhost:5432/db', row_type=stuf) rows = db.query('SELECT country, COUNT(*) c FROM user GROUP BY country') # print all the rows for row in result: print(row['country'], row['c']) # get data into pandas, that's where the fun begins! rows_df = pandas.DataFrame.from_records(rows)