Data extraction

Chain-of-density summarization

GenAI capability: Text summarization
Prompting strategy: Chain-of-thought prompting
Requirements: LLMs with file upload and large context window (> 100,000 tokens)
Academic study: Adams et al. (2023)

Preparation: Upload a paper (PDF).

Prompt

You will generate increasingly concise, entity-dense summaries of the
above article. The summaries should be written for an academic audience.
Repeat the following 2 steps 5 times.
- Step 1. Identify 1-3 informative entities (“;” delimited) from the
  article which are missing from the previously generated summary.
- Step 2. Write a new, denser summary of identical length which covers
  every entity and detail from the previous summary plus the missing entities.
A missing entity is:
    - Relevant: to the main story.
    - Specific: descriptive yet concise (5 words or fewer).
    - Novel: not in the previous summary.
    - Faithful: present in the article.
    - Anywhere: located anywhere in the article.
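
The entity rules in the prompt above can be partly checked after the model responds. The following is a minimal sketch, not part of the method in Adams et al. (2023): the helper names parse_entities and check_round are hypothetical, and literal substring matching is only a crude proxy for the Novel and Faithful criteria. The Relevant criterion is not checked at all, since it needs human judgment.

```python
def parse_entities(step1_output: str) -> list[str]:
    """Split a ';'-delimited Step 1 answer into individual entity strings."""
    return [part.strip() for part in step1_output.split(";") if part.strip()]


def check_round(entities: list[str], previous_summary: str, article: str) -> list[str]:
    """Return rule violations for one round; an empty list means none were found."""
    problems = []
    if not 1 <= len(entities) <= 3:
        problems.append(f"expected 1-3 entities, got {len(entities)}")
    prev, art = previous_summary.lower(), article.lower()
    for entity in entities:
        if len(entity.split()) > 5:  # Specific: 5 words or fewer
            problems.append(f"not specific: {entity}")
        if entity.lower() in prev:  # Novel: must not appear in the previous summary
            problems.append(f"not novel: {entity}")
        if entity.lower() not in art:  # Faithful: must appear in the article
            problems.append(f"not faithful: {entity}")
    return problems


# Illustrative inputs only; these strings are made up for the example.
article = "The study evaluates GPT-4 on 100 CNN articles using human preference ratings."
previous = "The study evaluates a model."
print(check_round(parse_entities("GPT-4; human preference ratings"), previous, article))
```

A non-empty result flags a round for manual review. Because paraphrased entities fail the substring test, a flagged round is not necessarily wrong.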

Python pseudocode for structured data extraction from tables

GenAI capability: Data extraction
Prompting strategy: Zero-shot prompting
Requirements: LLMs with file upload and large context window (> 100,000 tokens)
Academic study: Wagner et al. (2026)

Preparation: Upload a paper (PDF).

Prompt

1. Define utility functions:
    - md_to_df(markdown_text): Converts Markdown table text to a
      pandas DataFrame.
    - extract_table_from_image(url): Extracts table data from an
      image at the given URL and returns as Markdown text.
2. Define the MarkdownDataFrame data structure:
    - Use pandas.DataFrame as the base structure.
    - Apply a BeforeValidator that converts Markdown text to a
      DataFrame (md_to_df function).
    - Apply a PlainSerializer to convert a DataFrame to Markdown
      text (using the DataFrame.to_markdown() method).
    - Define a JSON schema for validation.
3. Define the Table class with two attributes: caption and dataframe:
    - caption: String to store the table’s caption.
    - dataframe: Stores the table data as a MarkdownDataFrame,
      which is essentially a pandas DataFrame that can serialize
      to/from Markdown.
4. Main process to extract and represent a table from an image:
    - Call extract_table_from_image(url) to extract the Markdown
      representation of the table from the image.
    - Create an instance of the Table class, setting caption as needed
      and dataframe as the Markdown representation converted to a DataFrame.
    - Use the Table instance to manipulate or access the table’s data
      and caption.
    - To serialize the Table instance’s dataframe back to Markdown, use
      the PlainSerializer functionality implicitly via the class’s structure.
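
The round trip described in steps 1 to 4 can be sketched without third-party dependencies. This is a simplified stand-in under stated assumptions, not the pandas and validator-based structure the prompt describes: a list of row dictionaries replaces the DataFrame, and plain functions (md_to_rows and rows_to_md, both hypothetical names) replace the BeforeValidator and PlainSerializer hooks. The extract_table_from_image(url) step is omitted because it requires a vision-capable model, and escaped pipes inside cells are not handled.

```python
from dataclasses import dataclass


def md_to_rows(markdown_text: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited Markdown table into a list of row dictionaries."""
    lines = [ln.strip() for ln in markdown_text.strip().splitlines() if ln.strip()]
    # Drop the outer pipes, then split each line into trimmed cells.
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    header, body = cells[0], cells[2:]  # cells[1] is the |---|---| separator row
    return [dict(zip(header, row)) for row in body]


def rows_to_md(rows: list[dict[str, str]]) -> str:
    """Serialize row dictionaries back to Markdown table text."""
    if not rows:
        return ""
    header = list(rows[0])
    out = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    out += ["| " + " | ".join(row[h] for h in header) + " |" for row in rows]
    return "\n".join(out)


@dataclass
class Table:
    """A table with a caption and its data, convertible to and from Markdown."""

    caption: str
    rows: list[dict[str, str]]

    @classmethod
    def from_markdown(cls, caption: str, markdown_text: str) -> "Table":
        return cls(caption=caption, rows=md_to_rows(markdown_text))

    def to_markdown(self) -> str:
        return rows_to_md(self.rows)


# Illustrative table text; in the prompt's workflow this would come from the
# image-extraction step.
md = "| study | n |\n| --- | --- |\n| A | 10 |\n| B | 25 |"
table = Table.from_markdown("Sample sizes", md)
print(table.rows)
print(table.to_markdown())
```

All cell values stay strings in this sketch. Converting columns to numeric types, which a DataFrame would make convenient, is left to the caller.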

References

Adams, G., Fabbri, A., Ladhak, F., Lehman, E., & Elhadad, N. (2023). From sparse to dense: GPT-4 summarization with chain of density prompting. Proceedings of the 4th New Frontiers in Summarization Workshop, 68–74. https://doi.org/10.18653/v1/2023.newsum-1.7
Wagner, G., Prester, J., Mousavi, R., Lukyanenko, R., & Paré, G. (2026). Generative artificial intelligence for literature reviews. To Be Accepted at Journal of Information Technology. https://doi.org/TODO