01:00
Day 1 Registration
01:00-01:30 @ NYCU
Day 1 Registration
01:30
Opening Remarks
01:30-02:15 @ NYCU
Opening remarks: "sciwork2023: a place to code for results" by Yung-Yu Chen (yyc)
02:00
Use Hypothesis, whether you like writing tests or not
Cheuk Ting Ho
02:15-02:45 @ NYCU
I bet you like writing tests. But besides the example-based tests we normally write, have you heard of property-based testing? With Hypothesis, instead of working out which data to test with, you let the library generate the test data for you, including NumPy and pandas objects.
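To make the idea of a property concrete, here is a minimal hand-rolled sketch using only the Python standard library. It is not Hypothesis's API: the functions run_length_encode, run_length_decode, and check_round_trip are hypothetical names invented for this illustration, and the random input generator is written by hand. Hypothesis automates that generation step and also shrinks failing inputs to small counterexamples.

```python
import random


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


def run_length_decode(pairs):
    """Inverse of run_length_encode."""
    decoded = []
    for value, count in pairs:
        decoded.extend([value] * count)
    return decoded


def check_round_trip(trials=200, seed=0):
    """Check the property decode(encode(xs)) == xs on random inputs.

    Returns the number of inputs checked and raises AssertionError,
    carrying the offending input, on the first counterexample.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        # Hand-written generator: short lists over a small alphabet,
        # so that repeated runs are likely to occur.
        xs = [rng.randint(0, 3) for _ in range(rng.randint(0, 20))]
        assert run_length_decode(run_length_encode(xs)) == xs, xs
    return trials
```

The test states a rule that must hold for every input (decoding an encoded list gives back the original list) rather than listing a few chosen examples.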
02:30
Build a useful data science product in your organization
zonghan
02:55-03:25 @ NYCU
Data science products within private organizations usually start with high expectations and end up with low usage. This talk covers the common pitfalls of these products and shares some perspectives on building useful data science projects.
03:00
Day 1 Lunch
03:25-05:00 @ NYCU
Lunch
05:00
Mastering Feature Engineering: Mining the Hidden Salary Formula with CakeResume
游騰林 TENG-LIN YU
05:00-05:30 @ NYCU
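Feature engineering is one of the most important, and most artistic, parts of building a data model. It helps the model capture the relationship between the explanatory variables and the target variable. The art lies in the fact that how you do it depends heavily on the researcher's understanding of the domain and of the project's requirements; there is no one-size-fits-all method. In this talk, I use job listings from CakeResume as an example to share the salary prediction model I built, and how a series of feature engineering steps gradually raised the model's explanatory power (R^2) from 0.06 to 0.55. To be clear, the focus is not on the model's fit itself, but on how to repeatedly analyze and diagnose the model's problems and apply feature engineering purposefully in response to them, so that the model's performance meets business needs. I hope this gives everyone a deeper appreciation of the mindset and techniques of feature engineering.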
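For readers unfamiliar with the R^2 figure quoted in this abstract, here is a minimal sketch of how the coefficient of determination is commonly computed. The function name r_squared and the salary numbers are hypothetical and used only for illustration; the speaker's actual model and data are not shown here.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical salaries, in arbitrary units.
salaries = [40, 50, 60, 70]
baseline = [55, 55, 55, 55]  # always predict the mean salary

print(r_squared(salaries, baseline))  # a mean-only model scores 0.0
print(r_squared(salaries, salaries))  # perfect predictions score 1.0
```

On this scale, a value near 0 means the model does little better than predicting the average salary, which is why moving from 0.06 to 0.55 is a substantial change.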
05:30
Data Contracts: Empowering Data Quality Enforcement
Shuhsi Lin
05:40-06:10 @ NYCU
In modern data engineering, ensuring data quality and fostering effective collaboration across teams are paramount. The introduction of dbt model contracts marks a significant advance in this area. These contracts provide a structured framework for defining and enforcing expectations about the output of data models. This talk examines the significance of dbt model contracts and their role in improving data reliability, streamlining debugging, and optimizing resource utilization. We will explore the advantages of using model contracts, from fostering a collaborative data culture to improving change management. The journey is not without challenges, however: limited platform support, potential complexity, and the need for effective communication are among the hurdles to overcome. By covering both the potential and the pitfalls, this talk aims to give data practitioners the insight to use dbt model contracts effectively, ultimately improving data quality, team efficiency, and decision-making across the organization.
06:00
Data Lakehouse Architecture Evolution and Future
Mars Su
06:20-06:50 @ NYCU
In today's data-driven world, we constantly face problems of large-volume data storage, analytics, and machine-learning applications. In the past, we used databases, data lakes, or data warehouses to store different kinds of data, including structured, semi-structured, and unstructured data. Although many storage systems and tools address these problems and scenarios, they still have limitations. To overcome them, one concept has been increasingly discussed in recent years: the lakehouse, which combines the advantages of the data lake and the data warehouse into a powerful architecture for implementing the modern data stack. Several mature services and tools implement this concept, including Delta Lake from Databricks, Apache Iceberg, and Apache Hudi. In this session, I will briefly describe and compare the concepts, benefits, and drawbacks of databases, data lakes, data warehouses, and lakehouses, and introduce some representative services. Finally, I will give a lakehouse demo so that attendees can understand it more concretely.
06:30
Day 1 Afternoon Break
06:50-07:10 @ NYCU
Break
07:00
Analyzing Financial News with LLMs to Optimize Quantitative Trading
Jo Chen
07:10-07:40 @ NYCU
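This talk explores how to use large language models (LLMs) for sentiment analysis of financial news in order to optimize quantitative trading strategies. It first briefly introduces the basic concepts of quantitative trading and how LLMs can help with quantitative trading and analysis. It then presents several fine-tuned LLMs for the financial domain that are open source or described in the literature, and discusses how well these models perform on financial news sentiment analysis, along with my own hands-on testing experience. The talk offers attendees an approach to quantitative trading that differs from the usual angles, and takes a closer look at how advanced natural language processing techniques can be combined to improve trading performance.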
07:30
Let's Design a Large Language Model (LLM) Conversation Management Platform
Simon Liu (Liu Yu-Wei)
07:50-08:20 @ NYCU
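In today's digital era, large language models have become one of the key technologies for human-computer interaction, and a conversation management platform is an indispensable core component for putting this technology to use. In this talk, I will walk through how to build a complete LLM conversation management system. We will discuss how to build a reliable conversation system while ensuring data security, and cover key topics such as building and effectively managing an embedding database. The goal of this talk is to inspire and guide data scientists, engineers, and researchers who are working to productize large language models, so that they can deliver smarter results that improve the business experience, and to help them understand more deeply how to build an LLM conversation management platform.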
08:30
PyLiteracy: A Linguistics-Based Chinese Grammar Checker
陳畯田 Jonathan Chen
08:30-09:00 @ NYCU
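Whether or not one is a native speaker, misuse of near-synonyms and wrongly written characters is common in Traditional Chinese. This problem also indirectly means that large language models (LLMs), whose training data comes largely from the internet, cannot play a reliable role in Chinese grammar checking. From a linguistic point of view, however, training a model only on pairs of correct and incorrect sentences is not the most effective approach. These errors are in fact directly related to parts of speech and sentence structure. If the rules for correct parts of speech and sentence structures are analyzed, simplified, and written as code to form a model, such a linguistics-rule-based model can master language use in a way similar to how human children do, and can perform efficient Chinese grammar checking with only a small amount of corpus data.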