Converting Alpaca to ChatML Conversation Format

-- Convert Alpaca format to Conversation format
WITH 
source_view AS (
  SELECT * FROM train  -- Change 'train' to your desired view name here
)
SELECT 
  [
    struct_pack(
      "from" := 'user',
      "value" := CASE 
                   WHEN input IS NOT NULL AND input != '' 
                   -- chr(10) is a newline; '\n' in a plain string literal stays a literal backslash-n
                   THEN instruction || chr(10) || chr(10) || input
                   ELSE instruction
                 END
    ),
    struct_pack(
      "from" := 'assistant',
      "value" := output
    )
  ] AS conversation
FROM source_view
WHERE instruction IS NOT NULL 
  AND output IS NOT NULL;

Why?

Differences between Alpaca and ChatML Conversation Format:

  1. Alpaca Format:

    • The Alpaca format usually has three columns: instruction, input, and output.
    • The input column is optional context for the instruction and may be empty.

  2. ChatML Conversation Format:

    • The ChatML Conversation format stores each example as a JSON list of messages.
    • Each message has a from field, which is one of system, user, or assistant.
    • The value field contains the message content.
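
The mapping between the two formats can also be sketched outside SQL. The Python below is a minimal illustration that mirrors the query's CASE logic for a single row; the function name alpaca_to_conversation and the sample row are hypothetical, made up for this example, and not part of the original query or dataset.

```python
import json


def alpaca_to_conversation(row):
    """Mirror the query's CASE logic for one Alpaca row (a plain dict)."""
    instruction = row["instruction"]
    extra_input = row.get("input")
    # Append the input after a blank line only when it is present and non-empty.
    if extra_input is not None and extra_input != "":
        user_value = instruction + "\n\n" + extra_input
    else:
        user_value = instruction
    return [
        {"from": "user", "value": user_value},
        {"from": "assistant", "value": row["output"]},
    ]


# Hypothetical sample row, invented for illustration only.
sample = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}

conversation = alpaca_to_conversation(sample)
print(json.dumps(conversation, indent=2))
```

Each Alpaca row becomes a two-message list: one user message that merges instruction and input, and one assistant message carrying output. Rows with an empty input yield a user message containing the instruction alone.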

Example

yahma/alpaca-cleaned

You can run this query via the sql_console on the Hugging Face Hub.

Alpaca to ChatML

Final Dataset