I used LLaVA and GPT-4o to identify an image’s primary keywords, returning an ordered list of descriptors.
“This script utilizes LLaVA (Large Language and Vision Assistant) to identify images based on keyword descriptors, iterating 35 times by default. It generates two outputs: a list of primary keywords and a corresponding list indicating the frequency of each keyword’s identification. The aggregate_keywords()
function further enhances this process by consolidating similar terms to minimize redundancy. Increasing the number of iterations can yield a more refined set of accurate visual descriptors by providing additional opportunities for identification.”
https://github.com/jaymasl/image-recognition-python
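The quoted description boils down to two steps: tally how often each keyword appears across the iterations, then merge near-duplicate terms. Here is a minimal sketch of that idea; this `aggregate_keywords()` is a hypothetical reimplementation (case-folding plus naive plural stripping), not the actual function from the repo, and the iteration data is invented for illustration.

```python
from collections import Counter

def aggregate_keywords(keywords):
    """Consolidate near-duplicate keywords.

    Hypothetical sketch: lowercases each keyword and folds a
    trailing plural "s" into an already-seen singular form.
    """
    merged = Counter()
    for kw, count in Counter(keywords).items():
        norm = kw.lower()
        if norm.endswith("s") and norm[:-1] in merged:
            norm = norm[:-1]  # e.g. "lobsters" merges into "lobster"
        merged[norm] += count
    return merged

# Simulated output of 5 model iterations, each a keyword list.
iterations = [
    ["Lobster", "car", "humor"],
    ["lobster", "Car", "speech bubble"],
    ["lobsters", "car"],
    ["lobster", "car", "humor"],
    ["lobster", "cars"],
]

counts = aggregate_keywords([kw for run in iterations for kw in run])
print(counts.most_common())
```

More iterations give each real descriptor more chances to recur, so stable keywords like "lobster" and "car" pull ahead of one-off noise in the frequency list.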
Original prompt: “(worried lobster driving a car with a speech bubble that says “i’m late for work!”), (text that says “jaykrown” in the bottom right)”

Here is the console output of the script (shown as an image in the original post):

Using 35 iterations, it returned “lobster” and “car” almost every time, and it also successfully extracted the text “late for work” from the image. Interestingly, it also returned “humor,” which could be due to the comic-style speech bubble format.