<
From version < 9.1 >
edited by AISOP Admin
on 2025/01/14 21:23
To version < 10.1 >
edited by Andreas Isking
on 2025/02/28 18:15
>
Change comment: There is no comment for this version

Summary

Details

Page properties
Author
... ... @@ -1,1 +1,1 @@
1 -XWiki.AISOPAdmin
1 +XWiki.andisk
Content
... ... @@ -57,6 +57,97 @@
57 57  * At a minimum, extract the complete set of slides and their comments. We also recommend using past students' e-portfolios. We have had good results with about 1000 fragments per course.
58 58  * If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]].
59 59  
60 +
61 +==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====
62 +
63 +Text extraction from PDFs is sometimes faulty, and many PDFs contain text only inside images. To capture that text, each PDF page can be rendered to a PNG and read with the OCR tool Tesseract. A brief explanation of how to do this is provided here.
64 +
65 +**Prerequisites**
66 +
67 +~1. Tesseract must be installed
68 +
69 +(% class="box" %)
70 +(((
71 +##tesseract ~-~-version##
72 +)))
73 +
74 +2. Poppler must be installed (it provides ##pdftoppm##). For example, with Homebrew:
75 +
76 +(% class="box" %)
77 +(((
78 +##brew install poppler##
79 +)))
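
Both prerequisites can also be checked from a script before starting a long OCR run. The following is a minimal sketch; the function name ##check_tools## is a hypothetical helper, not part of AISOP, Tesseract, or Poppler.

```shell
#!/bin/sh
# Hypothetical helper: report every command-line tool from the argument list
# that cannot be found on the PATH. Returns 0 only if all tools are present.
check_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      status=1
    fi
  done
  return "$status"
}

# The OCR loop needs tesseract and pdftoppm (the latter ships with Poppler).
check_tools tesseract pdftoppm || echo "Install the missing tools before running the OCR loop." >&2
```

This only confirms that the commands exist. It does not check that the German language data required by ##-l deu## is installed.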
80 +
81 +**Code**
82 +
83 +(% class="box" %)
84 +(((
85 +##for pdf in *.pdf; do
86 +# Extract the base name of the PDF without the extension
87 +basename="${pdf%.pdf}"
88 +\\# Convert PDF to PNGs
89 +pdftoppm -png "$pdf" "$basename"
90 +\\# Create a text file with the same name as the PDF.
91 +for png in "$basename"-*.png; do
92 + tesseract "$png" stdout -l deu ~-~-oem 1 | tr '\n' ' ' | sed 's/  */ /g' >> "$basename.txt"
93 + echo -e "\n\n~-~--\n\n" >> "$basename.txt"
94 +done
95 +done
96 +rm *.png##
97 +)))
98 +
99 +**Explanation**
100 +
101 +~1. ##for pdf in *.pdf; do ...; done##
102 +
103 +• Loops through all PDF files in the directory.
104 +
105 +2. ##basename="${pdf%.pdf}"##
106 +
107 +• Extracts the filename of the PDF without the .pdf extension.
108 +
109 +3. ##pdftoppm -png "$pdf" "$basename"##
110 +
111 +• Converts the PDF into PNG images, named in the format BASENAME-1.png, BASENAME-2.png, etc.
112 +
113 +4. ##for png in "$basename"-*.png; do ...; done##
114 +
115 +• Processes only the PNG files generated from the current PDF.
116 +
117 +5. ##tesseract "$png" stdout -l deu ~-~-oem 1##
118 +
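119 +• Performs OCR on the PNG file and writes the recognized text to standard output. ##-l deu## selects the German language data, and ##~-~-oem 1## selects the LSTM neural-network engine.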
120 +
121 +6. ##tr '\n' ' '##
122 +
123 +• Replaces line breaks with spaces.
124 +
125 +7. ##sed 's/  */ /g'##
126 +
127 +• Reduces multiple spaces to a single space.
128 +
129 +8. ##>> "$basename.txt"##
130 +
131 +• Appends the recognized text to a text file with the same name as the PDF.
132 +
133 +9. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##
134 +
135 +• Adds a separator line (~-~--) after each page.
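
The two post-processing steps (6 and 7) can be tried in isolation, without Tesseract, to see what they do to multi-line text. This is a small sketch; ##normalize_whitespace## is a hypothetical name chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper wrapping steps 6 and 7 from the explanation above:
# join all lines into one, then collapse runs of spaces to a single space.
normalize_whitespace() {
  tr '\n' ' ' | sed 's/  */ /g'
}

# Sample input standing in for OCR output of one page.
printf 'Slide  title\nsecond   line\n' | normalize_whitespace
```

The sample prints ##Slide title second line## followed by one trailing space, because the final line break is also turned into a space.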
136 +
137 +**Result**
138 +
139 +• A separate text file is created for each PDF, e.g.:
140 +
141 +• file1.txt for file1.pdf
142 +
143 +• file2.txt for file2.pdf
144 +
145 +• The OCR results of all pages from the respective PDF are written into this text file.
146 +
147 +• Each page is separated by a separator line (~-~--).
148 +
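149 +• Temporary PNG files are deleted at the end. Note that ##rm *.png## removes every PNG file in the current directory, not only the generated ones, so run the script in a directory that contains no other PNG files.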
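
A more defensive variant of the loop is sketched below. It uses the same ##pdftoppm## and ##tesseract## calls as above, but starts each text file empty so that re-runs do not append twice, skips the loop when no PDF is present, and deletes only the PNG files it generated. The function name ##ocr_pdf## is an assumption made for illustration.

```shell
#!/bin/sh
# Sketch: OCR a single PDF into a text file with the same base name.
ocr_pdf() {
  pdf=$1
  base=${pdf%.pdf}
  out=$base.txt
  : > "$out"                      # truncate, so a re-run does not duplicate text
  pdftoppm -png "$pdf" "$base"    # renders one PNG per page, named base-<page>.png
  for png in "$base"-*.png; do
    [ -e "$png" ] || continue     # pattern did not match: no pages were rendered
    tesseract "$png" stdout -l deu --oem 1 | tr '\n' ' ' | sed 's/  */ /g' >> "$out"
    printf '\n\n---\n\n' >> "$out"   # page separator
    rm -f "$png"                  # remove only the PNG generated for this page
  done
}

for pdf in *.pdf; do
  [ -e "$pdf" ] || continue       # no PDF files in the current directory
  ocr_pdf "$pdf"
done
```

Because each PNG is removed right after it has been processed, other PNG files in the directory are left untouched.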
150 +
60 60  === 1.3: Annotate Text Fragments ===
61 61  
62 62  The next step is to assign topics to the fragments so that the topics of students' paragraphs can be recognized. In AISOP, we have used the (commercial) tool [[prodigy>>https://prodi.gy/]] for this task in two steps, both of which iterate through all fragments to assign topics to them.
