publications
publications in reverse chronological order.
2024
- EMNLP Findings 2024. LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization. Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Alham Fikri Aji, and 2 more authors. 2024.
Pretrained language models (PLMs) have shown remarkable generalization toward multiple tasks and languages. Nonetheless, the generalization of PLMs towards unseen languages is poor, resulting in significantly worse language performance, or even nonsensical responses comparable to a random baseline. This limitation has been a longstanding problem of PLMs, raising concerns about diversity and equal access to language modeling technology. In this work, we address this limitation by introducing LinguAlchemy, a regularization technique that incorporates various aspects of languages, covering typological, geographical, and phylogenetic features, constraining the resulting representations of PLMs to better characterize the corresponding linguistic constraints. LinguAlchemy significantly improves the accuracy of mBERT and XLM-R on unseen languages by 18% and 2%, respectively, compared to fully fine-tuned models, displaying a high degree of unseen language generalization. We further introduce AlchemyScale and AlchemyTune, extensions of LinguAlchemy that adjust the linguistic regularization weights automatically, alleviating the need for hyperparameter search. LinguAlchemy enables better cross-lingual generalization to unseen languages, which is vital for better inclusivity and accessibility of PLMs.
@misc{adilazuarda2024lingualchemy,
  title         = {LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization},
  author        = {Adilazuarda, Muhammad Farid and Cahyawijaya, Samuel and Aji, Alham Fikri and Winata, Genta Indra and Purwarianti, Ayu},
  year          = {2024},
  eprint        = {2401.06034},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL},
  url           = {https://arxiv.org/pdf/2401.06034v2},
}
- EMNLP Main 2024. Towards Measuring and Modeling "Culture" in LLMs: A Survey. Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, and 6 more authors. 2024.
We present a survey of 55 recent papers that aim to study cultural representation and inclusion in large language models. We observe that none of the studies define "culture," which is a complex, multifaceted concept; instead, they probe the models on some specially designed datasets which represent certain aspects of "culture." We call these aspects the proxies of cultures, and organize them across three dimensions of demographic, semantic and linguistic-cultural interaction proxies. We also categorize the probing methods employed. Our analysis indicates that only certain aspects of "culture," such as values and objectives, have been studied, leaving several other interesting and important facets, especially the multitude of semantic domains (Thompson et al., 2020) and aboutness (Hershcovich et al., 2022), unexplored. Two other crucial gaps are the lack of robustness and situatedness of the current methods. Based on these observations, we provide several recommendations for a holistic and practically useful research agenda for furthering cultural inclusion in LLMs and LLM-based applications.
@misc{adilazuarda2024measuring,
  title         = {Towards Measuring and Modeling "Culture" in LLMs: A Survey},
  author        = {Adilazuarda, Muhammad Farid and Mukherjee, Sagnik and Lavania, Pradhyumna and Singh, Siddhant and Dwivedi, Ashutosh and Aji, Alham Fikri and O'Neill, Jacki and Modi, Ashutosh and Choudhury, Monojit},
  year          = {2024},
  eprint        = {2403.15412},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CY},
  url           = {https://arxiv.org/pdf/2403.15412},
}
- TrustNLP @NAACL 2024. Beyond Turing: A Comparative Analysis of Approaches for Detecting Machine-Generated Text. Muhammad Farid Adilazuarda. 2024.
Significant progress has been made on text generation by pre-trained language models (PLMs), yet distinguishing between human and machine-generated text poses an escalating challenge. This paper offers an in-depth evaluation of three distinct methods used to address this task: traditional shallow learning, Language Model (LM) fine-tuning, and Multilingual Model fine-tuning. These approaches are rigorously tested on a wide range of machine-generated texts, providing a benchmark of their competence in distinguishing between human-authored and machine-authored linguistic constructs. The results reveal considerable differences in performance across methods, thus emphasizing the continued need for advancement in this crucial area of NLP. This study offers valuable insights and paves the way for future research aimed at creating robust and highly discriminative models.
@misc{adilazuarda2024turing,
  title         = {Beyond Turing: A Comparative Analysis of Approaches for Detecting Machine-Generated Text},
  author        = {Adilazuarda, Muhammad Farid},
  year          = {2024},
  eprint        = {2311.12373},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL},
  url           = {https://arxiv.org/pdf/2311.12373},
}
- EMNLP Main 2024. SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages. Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, and 58 more authors. 2024.
- EMNLP Main 2024. Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting. Sagnik Mukherjee*, Muhammad Farid Adilazuarda*, Sunayana Sitaram, and 3 more authors. 2024.
2023
- Tiny Papers @ICLR 2023. The Obscure Limitation of Modular Multilingual Language Models. Muhammad Farid Adilazuarda, Samuel Cahyawijaya, and Ayu Purwarianti. 2023.
We expose the limitation of modular multilingual language models (MLMs) in multilingual inference scenarios with unknown languages. Existing evaluations of modular MLMs exclude the involvement of language identification (LID) modules, which obscures the performance of modular MLMs in real-world multilingual scenarios. In this work, we showcase the effect of adding LID to the multilingual evaluation of modular MLMs and provide discussions for closing the performance gap caused by the pipelined approach of LID and modular MLMs.
@misc{adilazuarda2023obscure,
  title         = {The Obscure Limitation of Modular Multilingual Language Models},
  author        = {Adilazuarda, Muhammad Farid and Cahyawijaya, Samuel and Purwarianti, Ayu},
  year          = {2023},
  eprint        = {2311.12375},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL},
  url           = {https://arxiv.org/pdf/2311.12375},
}
- Findings @ACL 2023. NusaCrowd: Open Source Initiative for Indonesian NLP Resources. Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, and 45 more authors. In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023.
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
@inproceedings{cahyawijaya-etal-2023-nusacrowd,
  title     = {{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources},
  author    = {Cahyawijaya, Samuel and Lovenia, Holy and Aji, Alham Fikri and Winata, Genta and Wilie, Bryan and Koto, Fajri and Mahendra, Rahmad and Wibisono, Christian and Romadhony, Ade and Vincentio, Karissa and Santoso, Jennifer and Moeljadi, David and Wirawan, Cahya and Hudi, Frederikus and Wicaksono, Muhammad Satrio and Parmonangan, Ivan and Alfina, Ika and Putra, Ilham Firdausi and Rahmadani, Samsul and Oenang, Yulianti and Septiandri, Ali and Jaya, James and Dhole, Kaustubh and Suryani, Arie and Putri, Rifki Afina and Su, Dan and Stevens, Keith and Nityasya, Made Nindyatama and Adilazuarda, Muhammad and Hadiwijaya, Ryan and Diandaru, Ryandito and Yu, Tiezheng and Ghifari, Vito and Dai, Wenliang and Xu, Yan and Damapuspita, Dyah and Wibowo, Haryo and Tho, Cuk and Karo Karo, Ichwanul and Fatyanosa, Tirana and Ji, Ziwei and Neubig, Graham and Baldwin, Timothy and Ruder, Sebastian and Fung, Pascale and Sujaini, Herry and Sakti, Sakriani and Purwarianti, Ayu},
  editor    = {Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
  month     = jul,
  year      = {2023},
  address   = {Toronto, Canada},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.findings-acl.868v2.pdf},
  doi       = {10.18653/v1/2023.findings-acl.868},
  pages     = {13745--13818},
}
2022
- SUMEval @AACL 2022. IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages. Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, and 2 more authors. In AACL'22 Workshop on Scaling Up Multilingual Evaluation, Nov 2022.
Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate and improve the code-mixing robustness. Our analysis shows that the pre-training corpus bias affects the model’s ability to better handle Indonesian-English code-mixing when compared to other local languages, despite having higher language diversity.
@inproceedings{adilazuarda-etal-2022-indorobusta,
  title     = {{I}ndo{R}obusta: Towards Robustness Against Diverse Code-Mixed {I}ndonesian Local Languages},
  author    = {Adilazuarda, Muhammad Farid and Cahyawijaya, Samuel and Winata, Genta Indra and Fung, Pascale and Purwarianti, Ayu},
  editor    = {Ahuja, Kabir and Anastasopoulos, Antonios and Patra, Barun and Neubig, Graham and Choudhury, Monojit and Dandapat, Sandipan and Sitaram, Sunayana and Chaudhary, Vishrav},
  booktitle = {AACL'22 Workshop on Scaling Up Multilingual Evaluation},
  month     = nov,
  year      = {2022},
  address   = {Online},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2022.sumeval-1.5/},
  pages     = {25--34},
}