Extract raw text from attached file

It would be extremely helpful to have this simple automation action:

  • Extract raw text from attached file

What file types did you have in mind?

(I don’t think you would want to extract raw text from an attached .jpg file :wink: )

Problem

Up to now, I have had to insert external content and code snippets into an entity's rich text field. This is a big problem because:

  • ProseMirror enforces a strict document schema. It automatically validates, escapes, and restructures input, which can alter formatting, remove unsupported elements, and escape special characters. This makes it difficult to preserve the original structure or exact content of external text.
  • Also, a rich text field cannot hold very large content (it rejects it).

Proposal

We need:

  1. A replacement for the rich text field that better holds exact text and code. (Something similar to the (currently invisible) JSON field, though just a generic text field would be great to have, as well as access to file content.)
  2. Automations to fetch the text from inside attached files.

This serves two purposes:

  • Exact content/code (no distortion by ProseMirror)
  • Large content (e.g. long scripts or long transcripts) can be attached as a file instead of being crammed into a rich text field, which either loads very slowly or simply rejects the amount of text

Suggested file formats to support for text extraction:

  1. Markdown (.md)
    – Structured knowledge blocks
    – Semantic entity conversion

  2. Plain Text (.txt)
    – Logs or raw transcripts
    – Unstructured input for NLP

  3. Microsoft Word (.docx)
    – Meeting notes
    – Headings and tables

  4. PDF (text-based)
    – Reports or whitepapers
    – Legal or archival records

  5. CSV (.csv)
    – Tabular task or metric data
    – Rows mapped to entities

  6. JSON (.json)
    – Config or schema files
    – Structured database population

  7. YAML (.yaml/.yml)
    – Workflow or deployment configs
    – Nested structure definitions

  8. HTML/XHTML (.html/.xhtml)
    – Static documentation pages
    – Article bodies or metadata

  9. Code Files (.js, .py, .sh, etc.)
    – Logic or automation scripts
    – Snippets linked to functions

  10. Rich Text Format (.rtf)
    – Formatted notes
    – Legacy content migration
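
To make the request concrete, here is a minimal sketch of what "extract raw text" could mean for several of the formats above, using only the Python standard library. It is an illustration, not how Fibery works or would implement it: `extract_raw_text` and its helpers are hypothetical names, UTF-8 encoding is assumed, and PDF and RTF are left out because they need third-party parsers.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from pathlib import PurePath

# Formats whose bytes already are the raw text (assumed UTF-8 encoded).
PLAIN_TEXT_EXTENSIONS = {
    ".md", ".txt", ".csv", ".json", ".yaml", ".yml", ".js", ".py", ".sh",
}

# XML namespace used by WordprocessingML inside .docx archives.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def _docx_to_text(data: bytes) -> str:
    """Joins the text runs of each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        runs = (node.text or "" for node in paragraph.iter(W_NS + "t"))
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def extract_raw_text(filename: str, data: bytes) -> str:
    """Returns the text of a file, choosing an extractor by file extension."""
    ext = PurePath(filename).suffix.lower()
    if ext in PLAIN_TEXT_EXTENSIONS:
        # Returned verbatim: no escaping, no restructuring.
        return data.decode("utf-8")
    if ext in (".html", ".xhtml"):
        collector = _TextCollector()
        collector.feed(data.decode("utf-8"))
        collector.close()
        return "\n".join(collector.parts)
    if ext == ".docx":
        return _docx_to_text(data)
    # PDF, RTF and anything else would need a dedicated parser.
    raise ValueError(f"no extractor for {ext!r}")
```

For the plain-text and code formats the result is the exact file content, which is the point of the request. For .docx and HTML this sketch keeps only the text, so headings and tables lose their structure.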


Perhaps you could automate sending your uploaded (attached to entity) files to an external service that would do the needed extraction/conversion, then store the result into another field/entity.
File API
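
A rough sketch of that approach: download the attached file, pass the bytes to whatever does the extraction, and store the result. Everything specific here is an assumption to check against the File API documentation, in particular the download URL pattern and the `Token` authorization header. The names `build_download_request`, `process_attachment`, `extract` and `store` are hypothetical; `extract` stands in for the external service and `store` for whatever writes the text to another field or entity.

```python
import urllib.request
from typing import Callable


def build_download_request(account: str, file_secret: str, token: str) -> urllib.request.Request:
    """Builds (but does not send) the request for one attached file."""
    # ASSUMPTION: URL pattern and auth header; verify against the File API docs.
    url = f"https://{account}.fibery.io/api/files/{file_secret}"
    return urllib.request.Request(url, headers={"Authorization": f"Token {token}"})


def fetch_bytes(request: urllib.request.Request) -> bytes:
    """Sends the request and returns the response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def process_attachment(
    request: urllib.request.Request,
    extract: Callable[[bytes], str],
    store: Callable[[str], None],
    fetch: Callable[[urllib.request.Request], bytes] = fetch_bytes,
) -> str:
    """Downloads a file, extracts its text, and hands the text to `store`."""
    text = extract(fetch(request))
    store(text)
    return text
```

Keeping `extract`, `store` and `fetch` as parameters means the extraction service can be swapped per use case, and the flow can be tried out without touching a real workspace.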

I understand that, but given that Fibery is such a workflow-centric platform, I’m pointing to a gap in what many users would reasonably expect out of the box.

The ability to extract and work with the raw text of attached files—especially for content like transcripts, scripts, and code—is a foundational capability for many knowledge and automation workflows.

Even as a custom integration I would not like it: for something so basic, it adds significant friction to an internal process.

A counter-argument might be that “document conversion” is too broad and deep a function to be a core feature (with many differing needs), and thus it would make sense to use an integration (of some sort) to outsource this to a dedicated service, so it can be better customized for each use case.

But it would be great if implementing such an integration was simpler!


I don’t mean to sound churlish, but although it may feel like ‘something so basic’ to you, I don’t imagine it is a part of many people’s knowledge workflows, let alone ‘a foundational capability’.

I’d even go as far as to say that Fibery is a ‘data-centric’ platform, rather than a ‘workflow-centric’ platform.

Anyway, I’m willing to be proven wrong - let’s see what the votes say.

I’d say Fibery is an ‘insight-centric’ platform, or it’s moving towards being that, given the developments in the last year.
If Fibery becomes the AI co-operating system of an organization, then it’s based on insight-generating workflows. @mdubakov ?
