LLM 기반 문장 평가 프롬프트 설계하기

카테고리 없음

LLM 기반 문장 평가 프롬프트 설계하기

O3O2 2025. 11. 27. 23:36

외국어 학습 앱에서 AI가 사용자의 문장을 정확하고 일관되게 평가하려면 어떻게 프롬프트를 설계해야 할까?

들어가며

외국어 학습 앱을 개발할 때 가장 까다로운 부분 중 하나는 사용자가 작성한 문장을 어떻게 평가할 것인가다. 전통적인 방식은 정답과의 정확한 일치 여부만 판단했지만, LLM(Large Language Model)을 활용하면 문법, 자연스러움, 문맥 적합성 등 다양한 측면에서 인간 튜터처럼 평가할 수 있다.

이 글에서는 GPT API를 활용해 외국어 학습자의 문장을 평가하는 프롬프트를 설계하는 방법을 다룰 것.

1. 평가 기준(Rubric) 정의하기

프롬프트를 작성하기 전에, 무엇을 평가할 것인지 명확히 정의해야 한다. 외국어 문장 평가에서 일반적으로 사용되는 기준은 다음과 같다.

핵심 평가 항목

항목 설명 가중치 예시

문법 정확성 (Grammar)	시제, 주어-동사 일치, 관사, 전치사 등 문법 규칙 준수	30%
어휘 적절성 (Vocabulary)	문맥에 맞는 단어 선택, 콜로케이션	25%
자연스러움 (Naturalness)	원어민이 실제로 사용할 법한 표현인지	25%
과제 완수도 (Task Completion)	주어진 질문/상황에 적절히 응답했는지	20%

레벨별 차등 기준

학습자의 수준에 따라 평가 기준을 다르게 적용해야 함.

초급(A1-A2): 기본 문법과 의미 전달에 집중
중급(B1-B2): 자연스러움과 다양한 표현 사용 평가
고급(C1-C2): 뉘앙스, 관용표현, 문체까지 평가

2. 기본 프롬프트 구조

효과적인 평가 프롬프트는 다음 요소를 포함할 필요가 있음.

2.1 역할 정의 (Role Definition)

You are an experienced language tutor specializing in [TARGET_LANGUAGE] 
for Korean speakers. Your role is to evaluate learner sentences with 
encouraging but accurate feedback.

역할을 명확히 지정하면 LLM이 해당 페르소나에 맞는 평가 톤과 기준을 유지한다!

2.2 평가 컨텍스트 제공

## Context
- Learner Level: Intermediate (B1)
- Learning Goal: Preparing for business English presentation
- Target Vocabulary: ["quarterly report", "revenue", "growth"]
- Prompt Given to Learner: "Describe your company's performance last quarter."

컨텍스트가 없으면 LLM은 문장만 보고 판단하게 되어, 학습자의 의도나 수준에 맞지 않는 피드백을 줄 수 있음....

2.3 평가 루브릭 명시

## Evaluation Criteria
Evaluate the following sentence based on these criteria:

1. **Grammar (0-25 points)**
   - 25: No errors
   - 20: Minor errors that don't impede understanding
   - 15: Some errors but meaning is clear
   - 10: Frequent errors affecting clarity
   - 5: Major errors making sentence difficult to understand

2. **Vocabulary (0-25 points)**
   - 25: Precise, contextually appropriate word choices
   - 20: Good vocabulary with minor issues
   - 15: Adequate vocabulary, some awkward choices
   - 10: Limited vocabulary affecting expression
   - 5: Inappropriate or incorrect word usage

3. **Naturalness (0-25 points)**
   - 25: Sounds completely natural to native speakers
   - 20: Mostly natural with slight awkwardness
   - 15: Understandable but clearly non-native
   - 10: Unnatural phrasing throughout
   - 5: Very unnatural, translated feel

4. **Task Completion (0-25 points)**
   - 25: Fully addresses the prompt with relevant details
   - 20: Addresses prompt well
   - 15: Partially addresses prompt
   - 10: Minimally addresses prompt
   - 5: Does not address the prompt

점수 기준을 구체적으로 명시하면 평가의 일관성이 크게 향상된다고 함.

3. Few-Shot 예시 활용하기

LLM은 예시를 통해 평가 기준을 더 정확히 이해함. Few-shot prompting을 활용해 좋은 평가와 나쁜 평가의 예시 제공 필요.

좋은 예시 (Good Example)

## Example 1
**Learner Sentence:** "Our company achieved 15% growth in revenue last quarter."
**Evaluation:**
{
  "grammar_score": 25,
  "grammar_feedback": "Perfect grammar usage with correct tense and structure.",
  "vocabulary_score": 23,
  "vocabulary_feedback": "Excellent use of 'achieved' and 'revenue'. Consider 'year-over-year' for more precision.",
  "naturalness_score": 24,
  "naturalness_feedback": "Very natural business English expression.",
  "task_score": 25,
  "task_feedback": "Directly addresses the prompt with specific data.",
  "total_score": 97,
  "corrected_sentence": null,
  "overall_feedback": "Excellent work! Your sentence is professional and clear."
}

경계 사례 (Edge Case Example)

## Example 2
**Learner Sentence:** "Last quarter our company revenue is grow 15%."
**Evaluation:**
{
  "grammar_score": 12,
  "grammar_feedback": "Tense error: 'is grow' should be 'grew'. Word order could be improved.",
  "vocabulary_score": 20,
  "vocabulary_feedback": "Good use of key terms. 'Revenue grew' is more natural than 'company revenue is grow'.",
  "naturalness_score": 10,
  "naturalness_feedback": "The sentence structure feels translated from Korean.",
  "task_score": 22,
  "task_feedback": "Successfully conveys the main information about growth.",
  "total_score": 64,
  "corrected_sentence": "Our company's revenue grew by 15% last quarter.",
  "overall_feedback": "Good effort! Focus on past tense for completed actions."
}

4. 출력 형식 지정하기

평가 결과를 앱에서 파싱하려면 일관된 JSON 형식으로 응답을 받아야 한다.

출력 스키마 정의

## Output Format
Respond ONLY with a valid JSON object. Do not include any text outside the JSON.

{
  "grammar_score": <integer 0-25>,
  "grammar_feedback": "<specific feedback in Korean>",
  "vocabulary_score": <integer 0-25>,
  "vocabulary_feedback": "<specific feedback in Korean>",
  "naturalness_score": <integer 0-25>,
  "naturalness_feedback": "<specific feedback in Korean>",
  "task_score": <integer 0-25>,
  "task_feedback": "<specific feedback in Korean>",
  "total_score": <integer 0-100>,
  "corrected_sentence": "<corrected version or null if perfect>",
  "key_improvements": ["<improvement 1>", "<improvement 2>"],
  "positive_points": ["<strength 1>", "<strength 2>"],
  "overall_feedback": "<encouraging summary in Korean, 2-3 sentences>"
}

JSON 파싱 안정성 높이기

실제 서비스에서는 LLM이 가끔 잘못된 형식으로 응답할 수 있음. 이를 대비한 전략:

import json
import re

def parse_llm_response(response_text: str) -> dict:
    """LLM 응답에서 JSON을 안전하게 추출"""
    
    # 1. 마크다운 코드 블록 제거
    cleaned = re.sub(r'```json\s*', '', response_text)
    cleaned = re.sub(r'```\s*', '', cleaned)
    
    # 2. 앞뒤 공백 및 설명 텍스트 제거
    cleaned = cleaned.strip()
    
    # 3. JSON 객체 추출 시도
    json_match = re.search(r'\{[\s\S]*\}', cleaned)
    if json_match:
        try:
            return json.loads(json_match.group())
        except json.JSONDecodeError:
            pass
    
    # 4. 파싱 실패 시 기본값 반환
    return {
        "error": "Failed to parse response",
        "raw_response": response_text
    }

5. 프롬프트 최적화 기법

5.1 Chain-of-Thought (CoT) 적용

복잡한 문장의 경우, LLM이 단계적으로 분석하도록 유도할 것.

Before providing scores, analyze the sentence step by step:
1. First, identify any grammatical errors
2. Then, check vocabulary appropriateness
3. Next, assess how natural the expression sounds
4. Finally, evaluate task completion

After your analysis, provide the JSON evaluation.

5.2 Negative Prompting

원하지 않는 행동을 명시적으로 금지할 것.

## Important Guidelines
- Do NOT be overly lenient. Maintain consistent standards.
- Do NOT give perfect scores unless the sentence is truly flawless.
- Do NOT provide feedback in English unless the learner level is Advanced.
- Do NOT repeat the same feedback phrases across different evaluations.

5.3 온도(Temperature) 설정

평가의 일관성을 위해 낮은 temperature를 권장.

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=messages,
    temperature=0.3,  # 낮은 값으로 일관성 확보
    max_tokens=500
)

6. 완성된 프롬프트 템플릿

# System Prompt
You are an experienced {target_language} tutor for Korean speakers. 
Evaluate learner sentences accurately but encouragingly.

# User Prompt Template
## Learner Context
- Level: {learner_level}
- Goal: {learning_goal}
- Target Words: {target_vocabulary}
- Original Prompt: "{prompt_given}"

## Learner's Response
"{learner_sentence}"

## Evaluation Criteria
[... 위에서 정의한 루브릭 ...]

## Examples
[... Few-shot 예시 ...]

## Output Format
[... JSON 스키마 ...]

Evaluate the learner's response now.

7. 실전 적용 시 고려사항

비용 최적화

간단한 문장은 GPT-3.5-turbo로 평가
복잡하거나 고급 레벨은 GPT-4 사용
프롬프트 캐싱 활용 (동일 루브릭 재사용)

레이턴시 관리

스트리밍 응답 비활성화 (JSON 파싱 시)
비동기 처리로 사용자 경험 개선
타임아웃 및 재시도 로직 구현

품질 모니터링

평가 결과 로깅 및 분석
사용자 피드백 수집 (평가가 도움이 되었는지)
주기적인 프롬프트 개선

LLM 기반 문장 평가는 단순한 API 호출이 아닌, 교육학적 고려와 프롬프트 엔지니어링의 결합입니다. 명확한 루브릭 정의, 적절한 예시 제공, 구조화된 출력 형식이 성공의 열쇠다.

현재글LLM 기반 문장 평가 프롬프트 설계하기

No.0302

techno-0302 님의 블로그 입니다.

Today :
Yesterday :

No.0302