Changes for page The AISOP recipe
Last modified by Paul Libbrecht on 2025/06/15 23:32
Change comment: There is no comment for this version
Summary
Page properties (2 modified, 0 added, 0 removed)
Details
Page properties
Author
@@ -1,1 +1,1 @@
-XWiki.andisk1
+XWiki.AISOPAdmin
Content
@@ -57,97 +57,6 @@
 * The least amount of content to be extracted is the complete set of slides and their comments. We recommend to use past students' e-portfolios too. We had rather good experience with about 1000 fragments for a course.
 * If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]].
 
-
-==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====
-
-Text extraction from PDFs is sometimes faulty. Additionally, many PDFs contain images. To capture this text, Tesseract can be used. A brief explanation of how to use it is provided here.
-
-**Prerequisites**
-
-~1. Tesseract must be installed
-
-(% class="box" %)
-(((
-##tesseract ~-~-version##
-)))
-
-2. Poppler must be installed
-
-(% class="box" %)
-(((
-##brew install poppler##
-)))
-
-**Code**
-
-(% class="box" %)
-(((
-##for pdf in *.pdf; do
-# Extract the base name of the PDF without the extension
-basename="${pdf%.pdf}"
-\\# Convert PDF to PNGs
-pdftoppm -png "$pdf" "$basename"
-\\# Create a text file with the same name as the PDF.
-for png in "$basename"-*.png; do
- tesseract "$png" stdout -l deu ~-~-oem 1 | tr '~\~\n' ' ' | sed 's/  */ /g' >> "$basename.txt"
- echo -e "~\~\n~\~\n~-~--~\~\n~\~\n" >> "$basename.txt"
-done
-done
-rm *.png##
-)))
-
-**Explanation**
-
-~1. ##for pdf in *.pdf; do ...; done##
-
-• Loops through all PDF files in the directory.
-
-2. ##basename="${pdf%.pdf}"##
-
-• Extracts the filename of the PDF without the .pdf extension.
-
-3. ##pdftoppm -png "$pdf" "$basename"##
-
-• Converts the PDF into PNG images, named in the format BASENAME-1.png, BASENAME-2.png, etc.
-
-4. ##for png in "$basename"-*.png; do ...; done##
-
-• Processes only the PNG files generated from the current PDF.
-
-5. ##tesseract "$png" stdout -l deu ~-~-oem 1##
-
-• Performs OCR on the PNG file.
-
-6. ##tr '\n' ' '##
-
-• Replaces line breaks with spaces.
-
-7. ##sed 's/  */ /g'##
-
-• Reduces multiple spaces to a single space.
-
-8. ##>> "$basename.txt"##
-
-• Appends the recognized text to a text file with the same name as the PDF.
-
-9. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##
-
-• Adds a separator line (~-~--) after each page.
-
-**Result**
-
-• A separate text file is created for each PDF, e.g.:
-
-• file1.txt for file1.pdf
-
-• file2.txt for file2.pdf
-
-• The OCR results of all pages from the respective PDF are written into this text file.
-
-• Each page is separated by a separator line (~-~--).
-
-• Temporary PNG files are deleted at the end.
-
 === 1.3: Annotate Text Fragments ===
 
 It is time to endow the fragments with topics so that we can recognize students' paragraphs' topics. In AISOP, we have used the (commercial) [[prodigy>>https://prodi.gy/]] for this task in two steps which, both, iterate through all fragments to give them topics.