Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Making smarter queries with advanced ORM resources

Making smarter queries with advanced ORM resources

A hands-on demonstration of the power of Django's ORM applied to open data from Brazil.

More Decks by Labcodes Software Studio

Other Decks in Programming

Transcript

  1. Hi :) Mariana Bedran Lesche Full-stack developer at Labcodes Software

    Studio @maribedran twitter.com gmail.com github.com https://github.com/labcodes/ dados_brasil_io
  2. Motivation Exploring the advantages and disadvantages of the advanced ORM

    resources comparing to the more usual ones in terms of performance and readability. After learning a bit of SQL I could write queries complex enough that I didn’t know how to implement using Django. Could I put the two of them together?
  3. When to use this? Ran out of database optimization strategies?

    ✔ Index tables ✔ Paginate requests ✔ Use select_related and prefetch_related on querysets ✔ Use values, values_list, only and defer on querysets ✔ Setup database cache ✔ Use assertNumQueries method on tests
  4. Methodology - Choose a big dataset - Come up with

    some complex questions - Try to answer them with the things I had only seen on the Django documentation and never tried to apply
  5. What went wrong - Procrastination - Had to validate the

    data while importing it - My machine cannot handle loading too much data to memory at once - Importing the data took forever - Putting the technical knowledge together with the domain was much harder then I had thought
  6. Open data of public interest from Brazil Brazil has a

    law stating that all public data that’s not secret must be publicly available to anyone that has interest in it: the Information Access Bill (Lei de Acesso à Informação).
  7. Real life is not that simple Though the data is

    there, the access is not guaranteed for non technical people who would benefit from the information they provide. - Not every government agency complies with the law or take a long time to make the data available - The data is spread among multiple sources - Many datasets are in non open formats - Most of the people who have interest on the data don’t have the technical skills to process it
  8. The Brasil IO project A project created by Álvaro Justen

    (@turicas) to gather public data, clean it and make it available for those who want to research it. - Datasets: https://brasil.io/datasets - GitHub: https://github.com/turicas/brasil.io - Support the project: https://apoia.se/brasilio
  9. Company partners from Brazil - Company CNPJ (unique identifier of

    the National Legal Entities Registration) - Partner’s name - Partner’s category - Legal Entity - Natural Person - Foreign Partner - Partnership category - Partner’s CNPJ (if it’s a legal entity)
  10. Chamber of Deputies spents Deputies on National Congress can spend

    money on products and services related to the parliamentary activity and receive a refund for that. - Deputy’s unique identifier - Deputy’s party - Deputy’s name - Date of the expense - Reference month - Reference year - Amount spent - Description - Company CNPJ (if it’s company) - Some other fields...
  11. # Deputy class DeputadoSerializer(serializers.ModelSerializer): class Meta: model = Deputado fields

    = [ 'id', 'nome', # name 'partido', # party 'uf', # state 'gastos' # expenses ] depth = 2
  12. class DeputadoListView(generics.ListAPIView): serializer_class = DeputadoSerializer def get_queryset(self): queryset = Deputado.objects.all().select_related(

    'partido', 'uf' # party, state ).prefetch_related( 'gastos' # expenses ) # Will show all related expenses
  13. class DeputadoListView(generics.ListAPIView): serializer_class = DeputadoSerializer def get_queryset(self): queryset = Deputado.objects.all().select_related(

    'partido', 'uf' ) # Set default month/year filters today = date.today() filters = self.request.query_params.dict() filters.setdefault('gastos_mes', today.month) filters.setdefault('gastos_ano', today.year) return queryset.prefetch_gastos(**{ # prefetch expenses field.replace('gastos_', ''): value for field, value in filters.items() if field.startswith('gastos_') # expenses filters
  14. from django.db.models import Prefetch class DeputadoQuerySet(models.QuerySet): # Deputy QuerySet def

    prefetch_gastos(self, **kwargs): # prefetch expenses # Expenses QuerySet gastos_qs = GastoCotaParlamentar.objects.select_related( 'empresa' # company ).filter(**kwargs) # Never do that on a real project! prefetch = Prefetch('gastos', queryset=gastos_qs) return self.prefetch_related(prefetch)
  15. Filters • month 3 gastos_mes=3 • year 2018 gastos_ano=2018 •

    net value greather than 100 gastos_valor_liquido__gt=100
  16. from django.db.models import Q, Sum # Avg, Max, Min, Variance

    class DeputadoQuerySet(models.QuerySet): # annotate deputy’s expenses by month/year def annotate_gasto_no_mes_por_deputado(self, mes, ano): annotation = { f'gastos_{ano}_{mes:02}': Sum( # ‘gastos_2018_01’ 'gastos__valor_liquido', # net value filter=Q( gastos__mes=mes, # month gastos__ano=ano # year ) ) } return self.annotate(**annotation)
  17. SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", SUM("politicos_gastocotaparlamentar"."valor_liquido") FILTER (

    WHERE ( "politicos_gastocotaparlamentar"."ano" = 2018 AND "politicos_gastocotaparlamentar"."mes" = 1 ) ) AS "gastos_2018_01" FROM "politicos_deputado" LEFT OUTER JOIN "politicos_gastocotaparlamentar" ON ( "politicos_deputado"."id" = "politicos_gastocotaparlamentar"."deputado_id" ) GROUP BY "politicos_deputado"."id" ORDER BY "politicos_deputado"."id" ASC LIMIT 1;
  18. from django.db.models import FilteredRelation, Q, Sum class DeputadoQuerySet(models.QuerySet): def annotate_gasto_no_mes_por_deputado2(self

    , mes, ano): return self.annotate( gastos_filtrados=FilteredRelation( # filtered expenses 'gastos', condition=Q( gastos__mes=mes, # month gastos__ano=ano, # year ) ) ).annotate(**{ f'gastos_{ano}_{mes:02}': Sum( 'gastos_filtrados__valor_liquido' ) # filtered expense’s net value })
  19. SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", SUM(gastos_filtrados."valor_liquido") AS "gastos_2018_01"

    FROM "politicos_deputado" LEFT OUTER JOIN "politicos_gastocotaparlamentar" gastos_filtrados ON ( "politicos_deputado"."id" = gastos_filtrados."deputado_id" AND ( ( gastos_filtrados."ano" = 2018 AND gastos_filtrados."mes" = 1 ) ) ) GROUP BY "politicos_deputado"."id" ORDER BY "politicos_deputado"."id" ASC LIMIT 1;
  20. from django.db.models import Avg, Sum, Q class DeputadoQuerySet(models.QuerySet): ... #

    annotate months expenses per deputy def annotate_gasto_mensal_por_deputado(self): meses = range(1, 13) # months anos = range(2009, 2019) # years annotations = { f'gastos_{ano}_{mes:02}': Sum( # ‘gastos_2018_01’ 'gastos__valor_liquido', # expenses net value filter=Q(gastos__mes=mes, gastos__ano=ano) ) for ano in anos for mes in meses } return self.annotate(**annotations)
  21. from django.db.models import Avg class DeputadoQuerySet(models.QuerySet): ... def get_media_mensal(self): #

    get months average meses = range(1, 13) # month anos = range(2009, 2019) # year aggregations = { f'media_{ano}_{mes:02}': Avg( # ‘media_2018_01’ f'gastos_{ano}_{mes:02}' # gastos_2018_01 ) for ano in anos for mes in meses } # annotate month expenses per deputy return self.annotate_gasto_mensal_por_deputado() \ .aggregate(**aggregations)
  22. from django.db.models import Exists, OuterRef class DeputadoQuerySet(models.QuerySet): # annotate companys

    def annotate_empresas(self): empresas_qs = Empresa.objects.filter( # Companys # partnerships - natural person - name sociedades__socio_pessoa_fisica__nome=OuterRef('nome'), uf=OuterRef('uf') ) return self.annotate( empresas=Exists(empresas_qs) )
  23. SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", EXISTS( SELECT U0."cnpj",

    U0."nome", U0."uf_id" FROM "empresas_empresa" U0 INNER JOIN "empresas_sociedade" U1 ON (U0."cnpj" = U1."empresa_id") INNER JOIN "empresas_pessoafisica" U2 ON (U1."socio_pessoa_fisica_id" = U2."id") WHERE ( U2."nome" = ("politicos_deputado"."nome") AND U0."uf_id" = ("politicos_deputado"."uf_id") ) ) AS "empresas" FROM "politicos_deputado" ;