A story about implementing an advanced recommendation function using the OpenAI Embedding API (OpenAI Embedding API を活用して、高度なレコメンド機能を実装した話)

sugoikondo 近藤 豊峰

February 15, 2024

Transcript

  1. A story about implementing an advanced recommendation function using the OpenAI Embedding API

    By Atsumine Kondo - @sugoikondo
  2. Atsumine Kondo (@sugoikondo)

    • Backend / Frontend / Infra
    • Scala, Kotlin, Python, etc.
    • Vue.js, Nuxt.js, Next.js, etc.
    • MoneyForward.inc, Group Management Solution dept, Product development div
  3. What you will learn and be able to do after this talk

    • You will understand what a text vector is.
    • With that understanding, you can build solutions such as recommendation, outlier detection, classification, and search.
  4. We've implemented an account recommendation function in our application

    • We released an AI-based consolidated account recommendation function in Aug. 2023.
    • It uses OpenAI's Embedding API.
    • Patent applied for.
    ref: https://corp.moneyforward.com/news/release/service/20230804-mf-press-1/
  5. We've implemented an account recommendation function in our application

    • For each account of an individual company, the function suggests the top three consolidated accounts of the parent company that are semantically closest to it.
    ref: https://corp.moneyforward.com/news/release/service/20230804-mf-press-1/
  6. We have had a lot of media coverage

    • Because the topic drew a lot of attention, many media outlets covered the feature.
    • https://cloud.watch.impress.co.jp/docs/news/1522209.html
    • https://it.impress.co.jp/articles/-/25192
    • https://officenomikata.jp/news/15534/
    • In total, about 8 articles...
  7. What is an account / consolidated account?

    • Accounts are the names or headings used when recording transactions of assets, etc.
    • e.g. You paid 50,000 yen of rent via direct debit:

        Debit                    | Credit
        Rent expenses    50,000  | Ordinary deposit    50,000

    • Here, "Rent expenses" and "Ordinary deposit" are each an account.
  8. What is an account / consolidated account?

    • In consolidated accounting, there is a process of collecting the accounts and balances of the companies in the group and aggregating them into a single account of the parent company (a consolidated account).
    • Diagram: accounts held by the subsidiaries Company A and Company B, such as "Ordinary Deposit" and "Bank A", are mapped to "Cash & Deposit" at the Parent Company.
  9. How to achieve better account recommendations?

    • Many patterns cannot be handled by simple fuzzy search, edit distance, etc.
    • Ex: "XX Bank", "Ordinary deposit", and "Cash and deposits" are related, yet share almost no characters.
    • If there is an overseas subsidiary, account names may also arrive in languages other than Japanese, such as "XX Bank".
    • We therefore need to take closeness in meaning into account and recommend the consolidated account that is closest to each individual company's account.
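
To make the limitation of edit distance concrete, here is a minimal sketch of the classic Levenshtein distance. The function name and the sample account names are illustrative assumptions, not taken from the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


# Hypothetical account names, chosen only to illustrate the problem.
related = levenshtein("Bank A", "Cash and deposits")  # same meaning, different text
unrelated = levenshtein("Bank A", "Bank fees")        # similar text, different meaning
```

In this toy case the semantically related pair ends up farther apart than the unrelated one, which is the kind of pattern a purely character-based measure cannot absorb.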
  10. But before we talk about the Embedding API, let's first learn about text vectors / embedding representations.
  11. About Embedding / Word2Vec

    • A technology or method for converting text into numbers / vectors.
    • e.g. 'Cash and deposits' → [[-0.03455162], [-0.01306203], [0.01672893], …, [-0.00129271], [0.00694819], [-0.01055199]]
    • The result is sometimes called a vector, a distributed representation, or an embedding.
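
As a rough sketch of how text might be turned into vectors with the OpenAI Python client: the helper name `embed_texts` and the model name are assumptions made for illustration (the talk does not say which model was used). The client is passed in so the helper can be exercised without network access.

```python
from typing import Any, List


def embed_texts(client: Any, texts: List[str],
                model: str = "text-embedding-3-small") -> List[List[float]]:
    """Return one embedding vector per input text, in input order.

    `client` is expected to be an `openai.OpenAI` instance; the default
    model name is an assumption, not something stated in the talk.
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [list(item.embedding) for item in ordered]


# Usage (requires the `openai` package and an OPENAI_API_KEY):
#   from openai import OpenAI
#   vectors = embed_texts(OpenAI(), ["Cash and deposits", "Bank A"])
```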
  12. About Embedding / Word2Vec

    • ex: when "Cash" and "Liabilities" become [0.4, 0.8] and [-0.3, 0.9], respectively (the Japanese caption on the slide gives [0.6, 0.8] and [-0.3, 0.4]).
    • Because each text becomes an array of floating-point numbers, it can be treated as a coordinate or a vector.
    • Figure: "Cash" and "Liabilities" plotted as points on a 2D plane.
  13. What you can do once text is vectorized

    1. You can measure the similarity between vectors.
    2. You can perform numerical operations on vectors, such as addition and subtraction.
  14. What you can do once text is vectorized

    1. You can measure the similarity between vectors.
    2. You can perform numerical operations on vectors, such as addition and subtraction.
    ← This time we mainly talk about one of these two.
  15. 1. You can measure the similarity between vectors

    • By calculating the angle between two vectors, you can compute how similar their directions are.
    • Cosine similarity is generally used.
    • A positive value means positive correlation; a negative value means negative correlation.
    • e.g. cos('Cash', 'Bank A') = 0.85, cos('Cash', 'Rent expenses') = 0.05
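
Cosine similarity is the dot product of two vectors divided by the product of their lengths, cos θ = (a · b) / (‖a‖ ‖b‖). A minimal sketch follows; the function name is illustrative, and the sample values are the 2D toy vectors from the English caption of the earlier "Cash" / "Liabilities" slide, not real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2D vectors (illustrative values only).
cash = [0.4, 0.8]
liabilities = [-0.3, 0.9]
similarity = cosine_similarity(cash, liabilities)
```

Because only the direction matters, scaling either vector by a positive constant leaves the result unchanged.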
  16. 1. You can measure the similarity between vectors

    Since the similarity between texts can be computed, the following solutions become possible:
    • Recommendation (items with high similarity)
    • Outlier detection (items with low similarity)
    • Classification (grouping items whose similarity is close)
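
The first two use cases can be sketched on top of a similarity function: rank candidates by similarity for recommendation, and flag a query as an outlier when even its best match scores low. The function names, the toy 2D vectors, and the 0.3 threshold are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(query: Sequence[float],
              candidates: Dict[str, Sequence[float]],
              top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k candidate names with the highest similarity."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def is_outlier(query: Sequence[float],
               candidates: Dict[str, Sequence[float]],
               threshold: float = 0.3) -> bool:
    """Flag the query when even its best match is below the threshold."""
    best = recommend(query, candidates, top_k=1)
    return not best or best[0][1] < threshold


# Toy vectors standing in for embeddings of consolidated accounts.
consolidated = {
    "Cash and deposits": [0.9, 0.1],
    "Rent expenses": [0.1, 0.9],
    "Liabilities": [-0.8, 0.2],
    "Accounts receivable": [0.7, 0.4],
}
bank_a = [0.95, 0.2]  # toy vector for a subsidiary account "Bank A"
top_three = recommend(bank_a, consolidated, top_k=3)
```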
  18. 2. You can perform numerical operations on vectors, such as addition and subtraction

    • Vectors are simply multidimensional arrays, so they can be added or subtracted (combined) as long as their numbers of dimensions match.
    • By combining vectors, you can produce a single vector that still carries the meaning of the original vectors.
    • Figure: IT + Orange + Fintech → MoneyForward
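
One simple way to combine vectors is an element-wise mean, which points in the same direction as the plain sum. The sketch below uses made-up 3D vectors; the function name and the values are assumptions for illustration.

```python
from typing import List, Sequence


def combine(*vectors: Sequence[float]) -> List[float]:
    """Element-wise mean of vectors that share the same dimension."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same number of dimensions")
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Toy vectors: each concept sits on its own axis.
it = [1.0, 0.0, 0.0]
orange = [0.0, 1.0, 0.0]
fintech = [0.0, 0.0, 1.0]
combined = combine(it, orange, fintech)
```

The combined vector has a non-zero component along every input, which is what "carrying the meaning of the original vectors" amounts to in this toy space.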
  19. 2. You can perform numerical operations on vectors, such as addition and subtraction

    • So, by combining IT + Orange + Fintech, something close to the result pictured on the slide can be implemented.
    Ref: https://www.google.com/
  20. 2. You can perform numerical operations on vectors, such as addition and subtraction

    • Of course, we can also subtract: MoneyForward - Fintech - Orange.
    • Something similar to the result pictured on the slide can also be implemented (in the previous example, the result would hit "IT").
    Ref: https://www.google.com/
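
The subtraction example can be sketched the same way: subtract component vectors, then look up the nearest remaining concept. The vectors are contrived so the arithmetic is exact; real embeddings only behave like this approximately. All names and values are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def subtract(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise a - b for vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return [x - y for x, y in zip(a, b)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: Sequence[float], vocabulary: Dict[str, Sequence[float]]) -> str:
    """Name of the vocabulary entry most similar to the query."""
    return max(vocabulary, key=lambda name: cosine(query, vocabulary[name]))


# Toy space: each concept sits on its own axis.
vocabulary = {
    "IT": [1.0, 0.0, 0.0],
    "Orange": [0.0, 1.0, 0.0],
    "Fintech": [0.0, 0.0, 1.0],
}
money_forward = [1.0, 1.0, 1.0]  # IT + Orange + Fintech in this toy space

remainder = subtract(subtract(money_forward, vocabulary["Fintech"]), vocabulary["Orange"])
closest = nearest(remainder, vocabulary)
```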
  21. Example 1: Recommendation between accounts

    • We use Redis as a cache store for vectors.
    • Thanks to this mechanism, costs are kept low; we have not yet paid OpenAI a single yen.
    • Because we can use the Embedding API, we no longer need expensive machines packed with GPUs or large amounts of CPU / memory.
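
The talk does not describe how the cache is keyed or serialized, so the following is only one possible shape: look up the vector by a hash of the text, and call the embedding function only on a miss. The class name, key prefix, and JSON encoding are assumptions. The store can be any object with `get(key)` and `set(key, value)`, which a `redis.Redis` client provides.

```python
import hashlib
import json
from typing import Any, Callable, List


class EmbeddingCache:
    """Cache embeddings in any store exposing get(key) and set(key, value)."""

    def __init__(self, store: Any, embed: Callable[[str], List[float]],
                 prefix: str = "emb:") -> None:
        self.store = store
        self.embed = embed      # called only on a cache miss
        self.prefix = prefix    # hypothetical key namespace

    def _key(self, text: str) -> str:
        return self.prefix + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_vector(self, text: str) -> List[float]:
        key = self._key(text)
        cached = self.store.get(key)
        if cached is not None:
            # Redis returns bytes; json.loads accepts both bytes and str.
            return json.loads(cached)
        vector = self.embed(text)
        self.store.set(key, json.dumps(vector))
        return vector


# Usage with Redis (requires the `redis` package and a running server):
#   import redis
#   cache = EmbeddingCache(redis.Redis(host="localhost", port=6379), my_embed)
```

Account names repeat heavily across companies, so a cache like this means each distinct name needs to be sent to the API only once.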
  22. Example 2: Free word search for receipts

    • Vectorize the text content of each receipt in advance.
    • Extract the text with OCR, by passing it to ChatGPT, etc.
    • Then store the vectors in a Vector DB or similar.
    • Flow: User → Upload (Receipt.pdf, jpg, etc.) → Vectorization ([[-0.03455162], [-0.01306203], …, [0.00694819], [-0.01055199]]) → Vector DB
  23. Example 2: Free word search for receipts

    • Vectorize the search words entered by the user, and pick the closest entries from the values in the DB.
    • Flow: User → "A receipt of 10,000 yen on December 1." → Vectorization ([[-0.03455162], [-0.01306203], …, [0.00694819], [-0.01055199]]) → Search in Vector DB → Receipt_Dec_1.pdf
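
The store-then-search flow can be sketched with a tiny in-memory index. Here `toy_embed` is a bag-of-words stand-in for the Embedding API and the class is a stand-in for a real vector DB; both, along with the sample receipts, are assumptions made so the example runs offline.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical vocabulary for the toy embedding below.
VOCAB = ["receipt", "december", "november", "10,000", "5,000", "yen", "taxi", "hotel"]


def toy_embed(text: str) -> List[float]:
    """Stand-in for an embedding call: counts of known words in the text."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorIndex:
    """Minimal stand-in for a vector DB: stores vectors, scans all on search."""

    def __init__(self, embed: Callable[[str], List[float]]) -> None:
        self.embed = embed
        self.vectors: Dict[str, List[float]] = {}

    def add(self, doc_id: str, text: str) -> None:
        # Vectorize once at upload time and keep the result.
        self.vectors[doc_id] = self.embed(text)

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        # Vectorize the query, then rank stored documents by similarity.
        query_vector = self.embed(query)
        scored = [(doc_id, cosine(query_vector, vec))
                  for doc_id, vec in self.vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


index = InMemoryVectorIndex(toy_embed)
index.add("Receipt_Dec_1.pdf", "Receipt December 1 taxi 10,000 yen")
index.add("Receipt_Nov_20.pdf", "Receipt November 20 hotel 5,000 yen")
results = index.search("A receipt of 10,000 yen on December 1")
```

A real vector DB would replace the linear scan with an approximate nearest-neighbor index, but the vectorize-store-query flow stays the same.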
  24. Example 3: Find receipts that are similar to a certain receipt

    • If Text to File can be implemented, then of course File to File can be implemented too.
    • Flow: "I need a receipt similar to this one." → Text extraction ("Receipt, Dec. 1, …") → Vectorization ([[-0.03455162], [-0.01306203], …, [0.00694819], [-0.01055199]]) → Search in Vector DB
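
File-to-file search differs from text-to-file only in where the query vector comes from: it is the vector of a receipt itself rather than of typed search words. A minimal sketch, assuming the vectors are already stored; the function name, file names, and values are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(doc_id: str,
                 stored: Dict[str, Sequence[float]],
                 top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank the other stored documents by similarity to the given one."""
    if doc_id not in stored:
        raise KeyError(doc_id)
    query = stored[doc_id]
    scored = [(other, cosine(query, vec))
              for other, vec in stored.items()
              if other != doc_id]  # never return the query document itself
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for embeddings of extracted receipt text.
stored = {
    "taxi_dec_1.pdf": [0.9, 0.1, 0.0],
    "taxi_dec_8.pdf": [0.8, 0.2, 0.1],
    "hotel_nov_20.pdf": [0.1, 0.9, 0.2],
}
similar = find_similar("taxi_dec_1.pdf", stored, top_k=1)
```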
  25. What becomes possible once you can generate text vectors

    • Anything that can be converted to text can be used to implement recommendations, etc.
    • On top of that, thanks to ChatGPT, the barrier to converting images and other data into text has been lowered.
    • (Well, the API is not available yet, though.)
  26. Summary

    • By using OpenAI's Embedding API, even a team without an ML engineer was able to implement an AI solution easily.
    • The Embedding API can be used to implement a variety of solutions, such as recommendation, outlier detection, and text classification.
  27. We are looking for people who can grow with us.

    Scan this QR to visit our recruitment site