President Donald Trump’s political speeches recently served as a testing ground for the capabilities and limitations of large language models. By analyzing the metaphors embedded in four major speeches, researchers not only gained insight into Trump’s rhetorical strategies but also exposed key weaknesses in artificial intelligence systems like ChatGPT when it comes to understanding figurative language in political contexts. Their findings are published in Frontiers in Psychology.
Large language models, or LLMs, are computer programs trained to understand and generate human language. They work by analyzing vast amounts of text—such as books, websites, and conversations—and learning statistical patterns in how words and sentences are used. LLMs like ChatGPT can write essays, summarize documents, answer questions, and even hold conversations that feel natural.
However, they do not truly understand language the way humans do. Instead, they rely on pattern recognition to predict what words are likely to come next in a sentence. This can lead to convincing results in many situations, but it also means the models can misinterpret meaning, especially when language is abstract or emotionally charged.
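To make the "predict the next word" idea concrete, here is a small sketch that scores a few candidate continuations with GPT-2, a small openly available model accessed through the Hugging Face transformers library. It is used here purely for illustration and is not the system examined in the study.

```python
# Toy illustration of next-word prediction with GPT-2 (a small open model used
# here only as a stand-in; not the system examined in the study).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "We will bring law and"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # scores for the next token
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p:.3f}")  # the five most likely next words
```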
To test how well a large language model can detect metaphors in political speech, the researchers selected four of Donald Trump's speeches delivered between mid-2024 and early 2025: his speech accepting the Republican nomination shortly after he survived an assassination attempt, his post-election victory remarks, his inaugural address, and his address to Congress. Totaling more than 28,000 words, these texts were chosen because they are dense with emotionally charged, ideologically driven language that often uses metaphor to frame political issues in ways that resonate with supporters.
The researchers used a method called critical metaphor analysis to examine the speeches. This method focuses on how metaphors influence political thinking and shape public attitudes. They then adapted it for use with ChatGPT-4, prompting the model to work through a step-by-step process: understand the context of the speech, identify potential metaphors, categorize them by theme, and explain their likely emotional or ideological impact.
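A minimal sketch of what such a step-by-step prompting workflow could look like is below, assuming the standard OpenAI Python client; the model name, system message, and four prompts are illustrative paraphrases of the steps just described, not the authors' exact materials.

```python
# Minimal sketch of a sequential, step-by-step prompting workflow (assumptions:
# the standard OpenAI Python client and illustrative prompt wording; these are
# not the authors' exact prompts).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STEPS = [
    "Summarize the political context of the speech excerpt above.",
    "Identify every potential metaphorical expression in the excerpt.",
    "Group the identified metaphors by theme (source domain).",
    "Explain the likely emotional or ideological impact of each group.",
]

def analyze_excerpt(excerpt: str) -> list[str]:
    """Run the four analysis steps in sequence, feeding each answer back as context."""
    messages = [
        {"role": "system", "content": "You are assisting with critical metaphor analysis."},
        {"role": "user", "content": excerpt},
    ]
    answers = []
    for step in STEPS:
        messages.append({"role": "user", "content": step})
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        answer = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```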
The large language model was able to detect metaphors with moderate success. Out of 138 sampled sentences, it correctly identified 119 metaphorical expressions, giving it an accuracy rate of around 86 percent. But a closer look revealed several recurring problems in the model’s reasoning. These issues provide insight into the limitations of artificial intelligence when it tries to interpret complex human communication.
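The reported rate is simply the ratio of correct identifications to sampled sentences:

```python
# The figures reported in the study; the rate is the simple ratio.
correct, sampled = 119, 138
print(f"{correct / sampled:.1%}")  # 86.2%, i.e. "around 86 percent"
```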
One of the most common mistakes was confusing metaphors with other forms of expression, such as similes. For example, the model misinterpreted the phrase “Washington D.C., which is a horrible killing field” as metaphorical when it is more accurately described as a literal, emotionally charged comparison. The model also tended to overanalyze simple expressions.
In one case, it flagged the phrase “a series of bold promises” as metaphorical, interpreting it as a spatial metaphor when no such figurative meaning was intended. The model also struggled to correctly classify names and technical terms. For instance, it treated “Iron Dome,” the name of Israel’s missile defense system, as a metaphor instead of a proper noun.
These missteps show that while LLMs can detect surface-level patterns, they often lack the ability to understand meaning in context. Unlike humans, they do not draw on lived experience, cultural knowledge, or emotional nuance to make sense of language. This becomes especially apparent when analyzing political rhetoric, where metaphor is often used to tap into shared feelings, histories, and identities.
The study also tested the model’s ability to categorize metaphors based on shared themes or “source domains.” These categories include concepts like Force, Movement and Direction, Health and Illness, and the Human Body. For example, Trump frequently used phrases like “We rise together,” “Unlock America’s glorious destiny,” and “Bring law and order back,” which were successfully classified as Movement or Force metaphors. These metaphors help convey ideas of progress, strength, and control—key themes in campaign messaging.
However, the model performed poorly in less common or more abstract categories, such as Cooking and Food or Plants. In the Plants category, it failed to detect any relevant metaphors at all. In Cooking and Food, it produced several false positives, flagging as metaphorical expressions that human reviewers judged to be literal. These results suggest that LLMs are more reliable when working with familiar, frequently used metaphor types and less reliable in areas that require nuanced understanding or cultural context.
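As a rough illustration of the categorization step discussed above, a prompt can constrain the model to a fixed list of source-domain labels. The sketch below again assumes the OpenAI client, and both the label list and the wording are illustrative rather than the study's full scheme.

```python
# Rough sketch of labeling an expression with a metaphor source domain from a
# fixed list (assumption: OpenAI client; labels and wording are illustrative,
# not the study's full scheme).
from openai import OpenAI

client = OpenAI()

DOMAINS = ["Force", "Movement and Direction", "Health and Illness",
           "Human Body", "Cooking and Food", "Plants", "None (literal)"]

def classify_source_domain(expression: str) -> str:
    prompt = (
        "Assign the expression below to exactly one metaphor source domain from "
        f"this list, or 'None (literal)' if it is not metaphorical: {DOMAINS}.\n"
        f"Expression: {expression}\n"
        "Answer with the label only."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip()

# Example call: classify_source_domain("We rise together")
```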
To verify their findings, the researchers compared the AI-generated results with those produced by traditional metaphor analysis tools, such as Wmatrix and MIPVU. The results were strongly correlated overall, but some differences stood out. ChatGPT was faster and easier to use, but its accuracy varied widely across metaphor categories. In contrast, the traditional methods were slower but more consistent in identifying metaphors across all categories.
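One way to picture that comparison is to tally how many metaphors each method finds per category and correlate the tallies. The sketch below uses made-up placeholder counts, not the study's data.

```python
# Sketch of comparing per-category metaphor counts from the LLM against a
# traditional tool such as Wmatrix; the counts are made-up placeholders, not
# the study's data.
import numpy as np

categories     = ["Force", "Movement and Direction", "Health and Illness", "Human Body", "Plants"]
llm_counts     = [42, 58, 17, 12, 0]   # hypothetical LLM tallies
wmatrix_counts = [40, 55, 20, 14, 6]   # hypothetical Wmatrix tallies

r = np.corrcoef(llm_counts, wmatrix_counts)[0, 1]
for name, a, b in zip(categories, llm_counts, wmatrix_counts):
    print(f"{name}: LLM={a}, Wmatrix={b}")
print(f"Pearson r across categories: {r:.2f}")
```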
Another issue the study uncovered is that LLM performance depends heavily on how prompts are written. Even small changes in how a question is asked can affect what the model produces. This lack of stability makes it harder to reproduce results and undermines confidence in the model’s reliability when dealing with sensitive material like political speech.
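A simple way to probe that instability is to run the same excerpt under two near-identical prompt wordings and measure how much the flagged expressions overlap; the sketch below assumes the OpenAI client and illustrative prompts.

```python
# Prompt-stability probe: compare what gets flagged under two near-identical
# prompt wordings (assumption: OpenAI client; prompts are illustrative).
from openai import OpenAI

client = OpenAI()

PROMPT_A = "Identify every metaphorical expression in the excerpt below, one per line."
PROMPT_B = "List all the metaphors you can find in the following excerpt, one per line."

def flagged(prompt: str, excerpt: str) -> set[str]:
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"{prompt}\n\n{excerpt}"}],
    )
    lines = reply.choices[0].message.content.splitlines()
    return {line.strip("-• ").lower() for line in lines if line.strip()}

def prompt_overlap(excerpt: str) -> float:
    a, b = flagged(PROMPT_A, excerpt), flagged(PROMPT_B, excerpt)
    return len(a & b) / max(len(a | b), 1)  # Jaccard overlap; 1.0 means identical flags
```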
The researchers also noted broader structural problems in how LLMs are trained. These models rely on enormous datasets scraped from the internet, much of which is uncurated and not annotated for meaning. As a result, LLMs may lack exposure to metaphorical language in specific cultural, historical, or political contexts. They may also pick up and reproduce existing biases related to gender, race, or ideology—especially when processing emotionally or politically loaded texts.
The researchers conclude that while large language models show promise in analyzing metaphor, they are far from replacing human expertise. Their tendency to misinterpret, overreach, or miss subtleties makes them best suited for assisting researchers rather than conducting fully automated analysis. In particular, political metaphors—which often rely on shared cultural symbols, deep emotional resonance, and implicit ideological framing—remain difficult for these systems to understand.
The study, “Large language models prompt engineering as a method for embodied cognitive linguistic representation: a case study of political metaphors in Trump’s discourse,” was authored by Haohan Meng, Xiaoyu Li, and Jinhua Sun.