# Uploading an Existing Dataset

If you have already collected feedback outside BoundaryAI — survey exports, support-ticket dumps, interview transcripts, or document libraries — you can upload it directly into a Feedback Group. The platform parses your file, infers what each column or section means, runs the same analytical pipeline as native surveys, and lands the result in your dashboards within minutes.

This page covers the upload flow end-to-end: supported formats, column mapping, document ingest, and the post-import processing pipeline.

For other ways to bring feedback in, see [Creating a Survey](/boundaryai-docs/basics/surveys/editor.md), [Connecting a platform](/boundaryai-docs/basics/connect-to-your-existing-systems.md), and [Web scraping](/boundaryai-docs/basics/social-listening.md). For the container that holds uploaded data, see [Feedback Groups](/boundaryai-docs/basics/feedback-groups.md).

***

### When to upload

Use the upload flow when:

* You have **historical data** in spreadsheets you want to analyse alongside new feedback.
* You have **interview transcripts** (text or JSON) and want them clustered into themes.
* You have **a stack of documents** — PDFs, Word files — that contain qualitative feedback you would otherwise have to read manually.
* You exported responses from another survey tool and want to keep tracking them in BoundaryAI.

Uploads land inside a Feedback Group as a regular survey, so they participate fully in cross-source analytics, Super-Themes, flags, and reports.

***

### Supported file types

BoundaryAI accepts two broad classes of files:

#### Structured files (column-mapped)

| Format                 | Extension       | Notes                                                            |
| ---------------------- | --------------- | ---------------------------------------------------------------- |
| Microsoft Excel        | `.xlsx`, `.xls` | Multi-sheet workbooks supported; you pick which sheet to import. |
| Comma-separated values | `.csv`          | UTF-8 recommended; other encodings auto-detected.                |

Structured files go through the **mapping flow**: BoundaryAI shows you a preview of your columns and suggests a question type for each, which you can review and adjust before importing.

#### Document files (auto-extracted)

| Format                     | Extension | Notes                                                                              |
| -------------------------- | --------- | ---------------------------------------------------------------------------------- |
| PDF                        | `.pdf`    | Text-based PDFs; scanned/image-only PDFs are not OCR'd.                            |
| Word documents             | `.docx`   | Tables and lists preserved.                                                        |
| JSON interview transcripts | `.json`   | Speaker turns auto-detected; supports common call-centre and meeting-tool exports. |

Document files skip column mapping and go through an AI conversion step that extracts question-and-answer pairs (or theme/evidence pairs for free-form transcripts) before landing in the same analysis pipeline.

#### Limits

* **File size**: up to **500 MB** per file. Large files are uploaded directly to cloud storage in chunks, so a slow network does not force a restart.
* **Columns**: up to **1,000 columns** per spreadsheet. Beyond that, the mapping table becomes unwieldy in the browser.
* **Rows**: no hard cap; very large files just take longer to process.
* **Multi-document upload**: drop multiple `.pdf` or `.docx` files at once and BoundaryAI merges them into a single survey with a *Source* column tracking which file each row came from.
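
If you want to catch oversized files before you start an upload, the limits above can be checked locally. The following is a minimal sketch, not part of BoundaryAI: the function names are hypothetical, it assumes "500 MB" means 500 × 1024 × 1024 bytes, and it only reads the header row of a UTF-8 CSV.

```python
import csv
import os

# Limits taken from the list above; the exact byte interpretation of
# "500 MB" is an assumption made for this sketch.
MAX_FILE_BYTES = 500 * 1024 * 1024
MAX_COLUMNS = 1000


def check_limits(size_bytes: int, column_count: int) -> list:
    """Return a list of human-readable problems (empty if within limits)."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"File is {size_bytes} bytes; the limit is {MAX_FILE_BYTES}.")
    if column_count > MAX_COLUMNS:
        problems.append(f"File has {column_count} columns; the limit is {MAX_COLUMNS}.")
    return problems


def preflight_csv(path: str) -> list:
    """Check a CSV on disk by reading only its header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return check_limits(os.path.getsize(path), len(header))
```

Calling `preflight_csv("export.csv")` on a file of your own returns an empty list when both limits are respected.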

***

### The upload flow

The flow has up to five steps, depending on what you upload. Single-sheet files skip step 2; documents skip steps 2 and 3.

#### Step 1 — Upload your file

1. Open the Feedback Group you want the data to land in.
2. Click **Upload Data**.
3. Drag-and-drop your file into the upload area, or click to pick from your device.

Files start uploading immediately. Large files use a resumable upload session, so the progress bar reflects the bytes actually transferred and an interrupted upload can resume rather than start over.

#### Step 2 — Pick a sheet *(multi-sheet Excel only)*

If your workbook contains more than one sheet, BoundaryAI shows a sheet picker. Select the sheet that holds your feedback data and continue. Single-sheet workbooks skip this step automatically.

#### Step 3 — Review the column mapping

BoundaryAI scans the first rows of your file and **auto-suggests a field type for every column**, using both the column header and a sample of the values. Each column gets a dropdown so you can confirm or override the suggestion.

See Column mapping in detail below for the full list of field types and how the AI inference works.

#### Step 4 — Confirm survey details

Fill in:

* **Survey name** — required. Used to identify this dataset inside the Feedback Group; pick something descriptive (e.g. *"Q1 2026 — Customer Satisfaction Export"*).
* **Survey language** — required. Sets the source language for sentiment, theming, and translation. Affects analysis quality, so set it accurately.
* **Selected file** — read-only confirmation of which file is about to be imported.

#### Step 5 — Configure flags *(optional)*

If you want **flags** applied to long-form text columns at import time, set them up here. You can:

* Pick from the **predefined flag library** — eight ready-made categories such as *Urgency / Critical*, *Safety & Wellbeing*, *Ethical & Legal*, *Customer Churn*, *Feature Request*, *Positive Feedback*, each with three sub-flags.
* Add **custom flags** with your own name, description, and detection criteria.
* **Skip** flag setup entirely and add flags later from the survey's analysis page.

For more on what flags are and how they work, see Flags.

***

### Column mapping in detail

This is where the platform learns what your spreadsheet actually contains.

#### Available field types

| Field type          | Use it for                                                                                                                              |
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| **Single Choice**   | Columns where every cell is one of a small set of values (Yes/No, satisfaction labels).                                                 |
| **Multiple Choice** | Columns where each cell may contain several values, often comma- or pipe-separated.                                                     |
| **Short Answer**    | Brief free-text — names, one-line comments, identifiers.                                                                                |
| **Long Answer**     | Detailed open-ended responses (the columns the AI will analyse for themes and sentiment).                                               |
| **Linear Scale**    | Numeric ratings on a custom range (1–5, 1–7, or 1–10 when the column is not an NPS question).                                           |
| **NPS**             | The standard 0–10 likelihood-to-recommend column.                                                                                       |
| **Metadata**        | Context columns that should not be analysed but should be available for segmentation — IDs, timestamps, demographics, channel, country. |
| **Ignored**         | Columns that shouldn't be imported at all.                                                                                              |

The richest analytical signal comes from **Long Answer** columns; metadata columns power segmentation in the analysis view.

#### How auto-mapping works

The auto-mapping is driven by AI plus a layer of deterministic overrides:

* **AI inference** reads the column header and a sample of the values for context, then proposes a type.
* **Deterministic overrides** kick in for high-confidence patterns the AI sometimes misses — for example, columns whose values are exclusively *Yes / No / True / False* are forced to **Single Choice**, and headers like *"Reason for…"*, *"Tell us about…"*, or *"Comments"* with sentence-length samples are forced to **Long Answer**.
* **Multilingual support** — the override layer recognises common patterns in English, French, and Spanish, so a column called *"Pourquoi recommanderiez-vous…"* is auto-detected as a long-answer free-text column without needing manual review.
* **Conservative default** — columns the AI cannot confidently classify default to **Metadata**, which is non-destructive: they will not be analysed, but their values are still imported and remain available for segmentation.
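
To make the override idea concrete, here is an illustrative sketch of how such a rule layer could look. It is not BoundaryAI's implementation: the function name, the English-only header prefixes, and the five-word "sentence-length" threshold are all assumptions chosen for the example.

```python
# Hypothetical rule tables for this sketch (English only).
BOOLEAN_VALUES = {"yes", "no", "true", "false"}
LONG_ANSWER_PREFIXES = ("reason for", "tell us about", "comments")
MIN_AVG_WORDS = 5  # assumed cut-off for "sentence-length" samples


def override_type(header: str, samples: list):
    """Return a forced field type, or None to defer to AI inference."""
    values = [s.strip() for s in samples if s and s.strip()]
    if not values:
        return None
    # Rule 1: exclusively Yes / No / True / False -> Single Choice.
    if all(v.lower() in BOOLEAN_VALUES for v in values):
        return "Single Choice"
    # Rule 2: open-ended header plus sentence-length samples -> Long Answer.
    avg_words = sum(len(v.split()) for v in values) / len(values)
    if header.strip().lower().startswith(LONG_ANSWER_PREFIXES) and avg_words >= MIN_AVG_WORDS:
        return "Long Answer"
    return None
```

Returning `None` when no rule matches mirrors the layering described above: the deterministic rules only step in for high-confidence patterns and otherwise leave the decision to the AI suggestion.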

#### Required fields

Before you can finalise the mapping, BoundaryAI checks that:

* The survey has a **name**.
* At least **one column is mapped to a question type** (i.e. not all columns are Metadata or Ignored).
* The file contains **at least one row of data**.

If any of these fail, the validation dialog tells you exactly which columns to revisit. You can fix the mapping in place and continue without re-uploading the file.
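
The three checks above can be expressed as a small validation routine. This is a sketch under assumptions: the function name and the dictionary shape (column header mapped to field type) are hypothetical, and the set of question types is taken from the field-type table on this page.

```python
# Field types that count as "question types" (everything except
# Metadata and Ignored in the table above).
QUESTION_TYPES = {
    "Single Choice", "Multiple Choice", "Short Answer",
    "Long Answer", "Linear Scale", "NPS",
}


def validate_mapping(survey_name: str, mapping: dict, row_count: int) -> list:
    """Return a list of blocking errors; an empty list means the import can proceed."""
    errors = []
    if not survey_name.strip():
        errors.append("Survey name is required.")
    if not any(field_type in QUESTION_TYPES for field_type in mapping.values()):
        errors.append("Map at least one column to a question type.")
    if row_count < 1:
        errors.append("The file contains no rows of data.")
    return errors
```

For example, `validate_mapping("Q1 export", {"ID": "Metadata", "Notes": "Ignored"}, 120)` reports only the missing question column, which matches the "fix in place and continue" behaviour described above.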

#### Tips for clean mapping

* Use **clear, human-readable column headers** before uploading. *"Q1\_LongAns"* gives the AI nothing to work with; *"Why did you choose this option?"* is unambiguous.
* **Split combined fields** into separate columns. A single *"Major, Semester"* column should be two columns — *Major* and *Semester* — so each can be mapped independently and segmented separately.
* **Use standardised date formats** (`YYYY-MM-DD` or your local long format) — Excel serial-number dates are detected and converted, but consistent formatting reduces edge cases.
* **Mark internal columns as Ignored** rather than deleting them — that way you keep one master file and the upload still imports cleanly.
* **Keep identifiers consistent across uploads** if you intend to update or join data later.
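
Several of these tips can be applied with a short script before you upload. The sketch below uses only the Python standard library and works on rows represented as dictionaries (for example, as produced by `csv.DictReader`). The helper names are hypothetical, and the Excel conversion assumes the default Windows 1900 date system, where serial numbers count days from 1899-12-30.

```python
from datetime import date, timedelta

# Day zero of Excel's default (1900) date system for modern dates.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_iso(serial: int) -> str:
    """Convert an Excel serial-number date to YYYY-MM-DD."""
    return (EXCEL_EPOCH + timedelta(days=serial)).isoformat()


def rename_headers(row: dict, renames: dict) -> dict:
    """Replace cryptic headers with human-readable ones."""
    return {renames.get(key, key): value for key, value in row.items()}


def split_combined(row: dict, column: str, new_names: list, sep: str = ",") -> dict:
    """Split one combined column into several; extra parts are dropped."""
    parts = [part.strip() for part in row[column].split(sep)]
    parts += [""] * (len(new_names) - len(parts))  # pad missing parts
    cleaned = {key: value for key, value in row.items() if key != column}
    cleaned.update(zip(new_names, parts))
    return cleaned
```

Applied to a row such as `{"Major, Semester": "Biology, 3"}`, `split_combined(row, "Major, Semester", ["Major", "Semester"])` yields separate *Major* and *Semester* values that can then be mapped independently.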

***

### Document and transcript ingest

When you upload a `.pdf`, `.docx`, or `.json` interview file, BoundaryAI bypasses column mapping and runs an AI conversion step instead.

#### Documents (PDF, DOCX)

The system extracts text from the document, identifies question-and-answer pairs (or natural sectioning where there are no explicit questions), and converts the result into a structured survey with two columns: a theme and the verbatim text behind it. The output is then run through the same theme-and-sentiment pipeline as a normal upload.

PDFs must be **text-based**. Image-only or scanned PDFs are not currently OCR'd; convert them to searchable PDFs first.

#### Multi-document upload

Drop multiple `.pdf` or `.docx` files at the same time and BoundaryAI processes them in parallel, then merges the results into a single survey. A *Source* column on every imported row tracks which document the text came from, so you can segment the analysis by document.

This is useful for things like:

* A library of customer interview write-ups (one Word doc per interview).
* A folder of feedback emails or letters exported as PDFs.
* A set of focus-group transcripts.
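
Conceptually, the merged survey looks like the per-document rows stacked together with a *Source* value added to each. The sketch below only illustrates that output shape; the function name and the `theme`/`text` keys are assumptions, and the real extraction and merge happen inside BoundaryAI.

```python
def merge_documents(extracted: dict) -> list:
    """Stack rows from several documents, tagging each with its source file.

    `extracted` maps a file name to the list of rows pulled from that file.
    """
    merged = []
    for filename, rows in extracted.items():
        for row in rows:
            merged.append({**row, "Source": filename})
    return merged


# Example: two interview write-ups become one three-row dataset.
example = merge_documents({
    "interview-01.docx": [{"theme": "Onboarding", "text": "Setup took too long."}],
    "interview-02.docx": [
        {"theme": "Pricing", "text": "The tiers are confusing."},
        {"theme": "Support", "text": "Replies were fast."},
    ],
})
```

Because every row carries its own *Source* value, filtering or grouping by document afterwards is a simple comparison on that column.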

#### JSON interview transcripts

For transcripts produced by call-centre tooling or meeting platforms, BoundaryAI accepts JSON exports. The parser detects the speaker/content schema heuristically and falls back to AI parsing for vendor-specific formats. Speaker labels are preserved so the analysis can distinguish what the participant said from what the interviewer said.

Practical limits: up to roughly 10,000 turns per file (sized to the upload timeout), and consecutive turns from the same speaker are merged so the AI sees full statements rather than fragmented utterances.
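
The merging of consecutive same-speaker turns can be sketched as follows. The `speaker` and `text` keys are an assumed schema for illustration only; real exports vary by vendor, which is why the parser described above detects the schema heuristically.

```python
from itertools import groupby


def merge_turns(turns: list) -> list:
    """Merge adjacent turns that share a speaker into a single statement."""
    merged = []
    # groupby only groups *adjacent* items, which is exactly what is needed:
    # a speaker who talks again later starts a new merged turn.
    for speaker, group in groupby(turns, key=lambda turn: turn["speaker"]):
        text = " ".join(turn["text"].strip() for turn in group)
        merged.append({"speaker": speaker, "text": text})
    return merged


transcript = [
    {"speaker": "Interviewer", "text": "How was onboarding?"},
    {"speaker": "Participant", "text": "It was fine,"},
    {"speaker": "Participant", "text": "but the import step confused me."},
    {"speaker": "Interviewer", "text": "In what way?"},
]
statements = merge_turns(transcript)
```

Here the two participant fragments collapse into one statement, so the four raw turns become three merged ones while the speaker order is preserved.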

***

### What happens after import

Importing is the start, not the finish. Once the file is in, BoundaryAI runs an asynchronous pipeline that:

1. **Cleans the text** — removes stray control characters, normalises encoding, and strips obvious noise.
2. **Detects the language** of each response (which can differ from the survey's source language for mixed datasets).
3. **Scores sentiment** for every open-ended answer.
4. **Detects themes** by clustering similar answers and labelling each cluster.
5. **Suggests flags** based on the flag library you selected (if any).
6. **Enriches metadata** — source provenance, timestamps where inferable.

A progress toast in the corner of the screen tracks the pipeline through each stage. You can keep working while it runs; when processing completes, the survey appears in the Feedback Group's source list and the *Open analysis* link becomes active.

If the pipeline fails part-way (rare — usually a malformed cell or a network blip), the toast reports the error and you can re-upload the file once the underlying issue is fixed.
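
As an illustration of what the text-cleaning stage (step 1 above) might involve, here is a minimal sketch. It is an assumption about the kind of work done, not BoundaryAI's code: it applies Unicode NFC normalisation, drops control characters, and collapses runs of whitespace.

```python
import unicodedata


def clean_text(raw: str) -> str:
    """Normalise Unicode, remove control characters, and tidy whitespace."""
    normalised = unicodedata.normalize("NFC", raw)
    # Category "Cc" covers control characters; tabs and newlines are kept
    # here so they can be treated as ordinary whitespace below.
    kept = "".join(
        ch for ch in normalised
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse any run of whitespace (including newlines) to a single space.
    return " ".join(kept.split())
```

Running your own pre-clean like this is optional, but it can help you spot cells with stray control characters before they reach the import.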

***

### Validation, errors, and recovery

#### Errors that block import

These stop the import outright; you must fix them before continuing:

* **No question column** — all columns mapped to Metadata or Ignored. Map at least one column to a question type.
* **Empty file** — no rows of data after the header.
* **Bad encoding** — the file is not in a text encoding the parser recognises. Re-export as UTF-8.
* **Corrupted or password-protected file** — re-export without the password.
* **File over 500 MB** — split into smaller files or remove unused columns.
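
If you hit the **Bad encoding** error and cannot re-export from the original tool, you can convert the file yourself. The sketch below assumes you know the source encoding (Windows-1252 is used as an example default); the function names are hypothetical.

```python
def reencode_bytes(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from a known encoding and return them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def reencode_file(src_path: str, dst_path: str, source_encoding: str = "cp1252") -> None:
    """Write a UTF-8 copy of a text file, leaving the original untouched."""
    with open(src_path, "rb") as src:
        converted = reencode_bytes(src.read(), source_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(converted)
```

Decoding raises `UnicodeDecodeError` if the bytes are not valid in the encoding you named, which is a useful signal that the guess was wrong. Note that this reads the whole file into memory, so it suits modest files better than ones near the size limit.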

#### Warnings (non-blocking)

These let you continue but flag potential issues:

* **Sparse rows** — many empty cells. The import proceeds; affected rows simply contribute less signal.
* **All-empty column** — defaults to Metadata; you can change it before continuing.

#### Recovery

If the validation dialog catches an issue, you can **fix the mapping in place and retry without re-uploading** — the file stays parsed in memory between attempts, so corrections are quick.

***

### Best practices

* **Clean column names before uploading.** *"How likely are you to recommend us?"* gives the AI a much stronger signal than *"Q4\_NPS"*. The five minutes you spend tidying headers saves you ten minutes of mapping review.
* **Keep one column = one variable.** Compound columns like *"Major, Semester"* should always be split into separate columns.
* **Standardise dates and identifiers.** A consistent ISO date and a stable user/email column keep cross-upload joins clean.
* **Mark unused columns as Ignored, don't delete them.** Keeping the master file intact makes future re-uploads predictable.
* **Set the survey language correctly.** Mismatched language degrades sentiment and theme quality more than people expect.
* **Use multi-document upload for qualitative libraries.** Don't manually concatenate ten interview docs into one — drop them all in and let BoundaryAI keep the source attribution.
* **Configure flags during upload when you already know what you are looking for.** Defining them up front means the analysis surfaces those signals on the first pass instead of needing a re-run.
* **Match the upload to the right Feedback Group.** Uploading into *Individual Surveys* is fine for a one-off, but a recurring dataset belongs in a custom group so it can be tracked over time.

