Python Programming/Strings

Overview

Strings in Python at a glance:

str1 = "Hello"                # A new string using double quotes
str2 = 'Hello'                # Single quotes do the same
str3 = "Hello\tworld\n"       # One with a tab and a newline
str4 = str1 + " world"        # Concatenation
str5 = str1 + str(4)          # Concatenation with a number
str6 = str1[2]                # 3rd character
str6a = str1[-1]              # Last character
#str1[0] = "M"                # No way; strings are immutable
for char in str1: print(char) # For each character
str7 = str1[1:]               # Without the 1st character
str8 = str1[:-1]              # Without the last character
str9 = str1[1:4]              # Substring: 2nd to 4th character
str10 = str1 * 3              # Repetition
str11 = str1.lower()          # Lowercase
str12 = str1.upper()          # Uppercase
str13 = str1.rstrip()         # Strip right (trailing) whitespace
str14 = str1.replace('l','h') # Replacement
list15 = str1.split('l')      # Splitting
if str1 == str2: print("Equ") # Equality test
if "el" in str1: print("In")  # Substring test
length = len(str1)            # Length
pos1 = str1.find('llo')       # Index of substring or -1
pos2 = str1.rfind('l')        # Index of substring, from the right
count = str1.count('l')       # Number of occurrences of a substring

print(str1, str2, str3, str4, str5, str6, str7, str8, str9, str10)
print(str11, str12, str13, str14, list15)
print(length, pos1, pos2, count)

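See also the chapter on regular expressions for advanced pattern matching on strings in Python.

String operations

Equality

Two strings are equal if they have *exactly* the same contents: they are the same length and the characters at each position match. Many other languages compare strings by identity instead; that is, two strings are considered equal only if they occupy the same space in memory. Python's == operator compares contents, while the is operator tests the identity of strings, and of any two objects in general.

Examples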

>>> a = 'hello'; b = 'hello' # Assign 'hello' to a and b.
>>> a == b                   # check for equality
True
>>> a == 'hello'             #
True
>>> a == "hello"             # (choice of delimiter is unimportant)
True
>>> a == 'hello '            # (extra space)
False
>>> a == 'Hello'             # (wrong case)
False

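Numerical

Two numerical-style operations are available on strings: addition and multiplication. String addition is just another name for concatenation, which simply joins strings end to end. String multiplication is repeated addition, that is, repeated concatenation. So: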

>>> c = 'a'
>>> c + 'b'
'ab'
>>> c * 5
'aaaaa'

Containment

The in operator returns True if the first operand is contained in the second. This also works on substrings:

>>> x = 'hello'
>>> y = 'ell'
>>> x in y
False
>>> y in x
True

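Note that print(x in y) would display the same value; print itself returns None.

Indexing and slicing

Much like arrays in other languages, the individual characters in a string can be accessed by an integer representing their position in the string. The first character in string s is s[0], and the nth character is at s[n-1].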

>>> s = "Xanadu"
>>> s[1]
'a'

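Unlike arrays in many other languages, Python strings can also be indexed backwards using negative numbers. The last character has index -1, the second to last has index -2, and so on.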

>>> s[-4]
'n'

We can also use "slices" to access a substring of s. s[a:b] gives us the string that starts at s[a] and ends at s[b-1].

>>> s[1:4]
'ana'

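Neither individual characters nor slices can be assigned to, because strings are immutable.

>>> print(s)
>>> s[0] = 'J'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> s[1:3] = "up"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> print(s)

Output (assuming the errors were suppressed):

Xanadu
Xanadu

Another feature of slices is that if the start or end is left empty, it defaults to the first or last index, depending on context: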

>>> s[2:]
'nadu'
>>> s[:3]
'Xan'
>>> s[:]
'Xanadu'

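You can also use negative numbers in slices:

>>> print(s[-2:])
du

To understand slices, it is easiest not to count the elements themselves. It is a bit like counting the spaces between your fingers rather than the fingers. The sequence is indexed like this: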

Element:     1     2     3     4
Index:    0     1     2     3     4
         -4    -3    -2    -1

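So, when we ask for the [1:3] slice, we start at index 1, end at index 3, and take everything in between, which here is elements 2 and 3. If you are used to indexing in C or Java, this can be a bit disconcerting until you get used to it.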
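
The "boundaries between elements" view can be checked with a few lines of code. This is a small sketch using the same string as the earlier examples; the variable names middle and tail are chosen here for illustration.

```python
s = "Xanadu"

# A slice s[a:b] covers the characters between boundary a and boundary b,
# so for 0 <= a <= b <= len(s) it contains b - a characters.
middle = s[1:3]
print(middle, len(middle))   # an 2

# Cutting at any boundary k splits the string into two parts that
# concatenate back to the original.
for k in range(len(s) + 1):
    assert s[:k] + s[k:] == s

# Negative boundaries count from the right end.
tail = s[-3:-1]
print(tail)                  # ad
```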

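String constants

String constants can be found in the standard string module. For example, string.digits equals '0123456789'.

Links

Python "string" module documentation -- python.org

String methods

There are a number of methods, or built-in string functions: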
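
A few of the constants in the string module are shown here, along with one possible use. The helper is_hex is a hypothetical function written for this example, not part of the module.

```python
import string

print(string.digits)            # 0123456789
print(string.ascii_lowercase)   # abcdefghijklmnopqrstuvwxyz
print(string.ascii_uppercase)   # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(string.hexdigits)         # 0123456789abcdefABCDEF

# The constants are ordinary strings, so the usual operations apply.
def is_hex(text):
    """Return True if text is non-empty and made only of hex digits."""
    return len(text) > 0 and all(ch in string.hexdigits for ch in text)

print(is_hex('ff00'), is_hex('xyz'))   # True False
```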

capitalize
center
count
decode
encode
endswith
expandtabs
find
index
isalnum
isalpha
isdigit
islower
isspace
istitle
isupper
join
ljust
lower
lstrip
replace
rfind
rindex
rjust
rstrip
split
splitlines
startswith
strip
swapcase
title
translate
upper
zfill

Only some of these are covered below. In Python 3, decode is a method of bytes rather than of str.

is*

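isalnum(), isalpha(), isdigit(), islower(), isupper(), isspace(), and istitle() fit into this category.

The string being tested must have a length of at least 1, or the is* methods return False. In other words, a string with len(string) == 0 is considered "empty", or False.

isalnum returns True if the string is entirely composed of alphanumeric characters (i.e. no punctuation).
isalpha and isdigit work similarly for alphabetic characters only or numeric characters only, respectively.
isspace returns True if the string is composed entirely of whitespace.
islower, isupper, and istitle return True if the string is in lowercase, uppercase, or titlecase respectively. Uncased characters, such as digits, are "allowed", but there must be at least one cased character in the string for True to be returned. Titlecase means that the first cased character of each word is uppercase and any immediately following cased characters are lowercase. Curiously, 'Y2K'.istitle() returns True. That is because uppercase characters may only follow uncased characters, and lowercase characters may only follow uppercase or lowercase characters. Hint: whitespace is uncased.

Examples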

>>> '2YK'.istitle()
False
>>> 'Y2K'.istitle()
True
>>> '2Y K'.istitle()
True
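
The rules above, including the empty-string case, can be tabulated with a short script. This is only an illustrative sketch; the sample strings and the results dictionary are chosen here for the example.

```python
samples = ['abc123', 'abc', '123', '   ', '', 'Hello World', 'hello!']

# Map each sample to the results of five is* methods.
results = {}
for text in samples:
    results[text] = (text.isalnum(), text.isalpha(), text.isdigit(),
                     text.isspace(), text.istitle())
    print(repr(text).ljust(15), results[text])

# Digits are uncased, so a string of digits is neither lower nor upper.
print('123'.islower(), '123'.isupper())   # False False
```

Note that the empty string yields False for every method, as described above.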

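title, upper, lower, swapcase, capitalize

These return the string converted to title case, upper case, lower case, inverted case, or capitalized form, respectively.

The title method capitalizes the first letter of each word in the string (and makes the remaining letters lowercase). Words are identified as substrings of alphabetic characters separated by non-alphabetic characters, such as digits or whitespace. This can lead to some unexpected behavior. For example, the string "x1x" is converted to "X1X" instead of "X1x".

The swapcase method makes all uppercase letters lowercase and vice versa.

The capitalize method is like title except that it treats the whole string as a single word (i.e. it makes the first character uppercase and the rest lowercase).

Example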

s = 'Hello, wOrLD'
print(s)             # 'Hello, wOrLD'
print(s.title())     # 'Hello, World'
print(s.swapcase())  # 'hELLO, WoRld'
print(s.upper())     # 'HELLO, WORLD'
print(s.lower())     # 'hello, world'
print(s.capitalize())# 'Hello, world'

Keywords: to lower case, to upper case, lower case, upper case, lowercase, uppercase.

count

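Returns the number of occurrences of the specified substring in the string. For example:

>>> s = 'Hello, world'
>>> s.count('o') # number of 'o's in 'Hello, world' (2)
2

Hint: .count() is case-sensitive, so this example only counts lowercase 'o's. With an all-uppercase string, for example, the count is 0: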

>>> s = 'HELLO, WORLD'
>>> s.count('o') # print the number of lowercase 'o's in 'HELLO, WORLD' (0)
0
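
A few further properties of count are sketched here: a case-insensitive count obtained by normalizing the case first, the fact that occurrences are counted without overlap, and the optional start position. The variable names are chosen for this example only.

```python
s = 'Hello, World'

print(s.count('o'))                    # 2
print(s.count('w'))                    # 0, because the 'W' is uppercase

# Case-insensitive counting: convert to one case before counting.
case_insensitive = s.lower().count('w')
print(case_insensitive)                # 1

# Occurrences are counted without overlap.
non_overlapping = 'aaaa'.count('aa')
print(non_overlapping)                 # 2

# An optional start index restricts the search to s[5:].
after_comma = s.count('o', 5)
print(after_comma)                     # 1
```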

strip, rstrip, lstrip

Return a copy of the string with leading (lstrip) or trailing (rstrip) whitespace removed. strip removes both.

>>> s = '\t Hello, world\n\t '
>>> print(s)
         Hello, world

>>> print(s.strip())
Hello, world
>>> print(s.lstrip())
Hello, world
        # ends here
>>> print(s.rstrip())
         Hello, world

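Note the leading and trailing tabs and newlines.

The strip methods can also be used to remove other kinds of characters.

import string
s = 'www.wikibooks.org'
print(s)
print(s.strip('w'))                     # Removes all w's from outside
print(s.strip(string.ascii_lowercase))  # Removes all lowercase letters from outside
print(s.strip(string.printable))        # Removes all printable characters

Output:

www.wikibooks.org
.wikibooks.org
.wikibooks.

Note that string.ascii_lowercase and string.printable require the import string statement. In Python 2 the first of these was also available as string.lowercase. The last call prints an empty line, since every character in s is printable.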
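
One point that is easy to miss: the argument to strip is treated as a set of characters, not as a prefix or suffix. The following sketch shows the consequence, together with one possible way to remove a fixed prefix; the variable names are illustrative.

```python
s = 'www.wikibooks.org'

# Every leading and trailing character found in the set {w, ., o, r, g}
# is removed, including the first 'w' of 'wikibooks'.
stripped = s.strip('w.org')
print(stripped)          # ikibooks

# To drop a fixed prefix, test for it and slice it off instead.
prefix = 'www.'
without_prefix = s[len(prefix):] if s.startswith(prefix) else s
print(without_prefix)    # wikibooks.org
```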

ljust, rjust, center

Left-justify, right-justify, or center the string within a field of the given width, padding the rest with spaces.

>>> s = 'foo'
>>> s
'foo'
>>> s.ljust(7)
'foo    '
>>> s.rjust(7)
'    foo'
>>> s.center(7)
'  foo  '
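
These methods also accept an optional fill character as a second argument. The sketch below uses it to line up a small table; the data in rows is made up for the example. zfill, from the method list above, pads numbers with leading zeros.

```python
rows = [('apple', 3), ('kiwi', 12), ('banana', 7)]

# Left-justify the name with dots, right-justify the quantity with spaces.
lines = [name.ljust(10, '.') + str(qty).rjust(4) for name, qty in rows]
print('\n'.join(lines))

title = ' fruit '.center(15, '=')
print(title)                     # ==== fruit ====

padded = '42'.zfill(5)
print(padded)                    # 00042

# A string longer than the field width is returned unchanged.
unchanged = 'banana'.ljust(3)
print(unchanged)                 # banana
```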

join

Joins the given sequence together, using the string as the separator:

>>> seq = ['1', '2', '3', '4', '5']
>>> ' '.join(seq)
'1 2 3 4 5'
>>> '+'.join(seq)
'1+2+3+4+5'

map may be helpful here (it converts the numbers in seq into strings):

>>> seq = [1,2,3,4,5]
>>> ' '.join(map(str, seq))
'1 2 3 4 5'

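Now seq can contain arbitrary objects, not just strings.

find, index, rfind, rindex

The find and index methods return the index of the first occurrence of the given subsequence. If it is not found, find returns -1 but index raises a ValueError. rfind and rindex are the same as find and index except that they search through the string from right to left (i.e. they find the last occurrence).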
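
What happens without the conversion, and an alternative to map, can be sketched as follows. The sample list is made up for illustration.

```python
seq = [1, 2.5, None, 'x']

# join only accepts strings; other items raise a TypeError.
error = None
try:
    ', '.join(seq)
except TypeError as exc:
    error = str(exc)
print(error)

# A generator expression converts each item, much like map(str, seq).
joined = ', '.join(str(item) for item in seq)
print(joined)            # 1, 2.5, None, x

# A string is itself a sequence of one-character strings.
letters = '-'.join('abc')
print(letters)           # a-b-c
```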

>>> s = 'Hello, world'
>>> s.find('l')
2
>>> s[s.index('l'):]
'llo, world'
>>> s.rfind('l')
10
>>> s[:s.rindex('l')]
'Hello, wor'
>>> s[s.index('l'):s.rindex('l')]
'llo, wor'

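Because Python strings accept negative subscripts, index is probably the better choice in situations like the ones shown: when the substring is missing, find returns -1, which would silently be used as a valid subscript and yield an unintended value.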
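
The pitfall can be reproduced with a substring that does not occur. This is a small sketch; the variable names are chosen for the example.

```python
s = 'Hello, world'

# find signals "not found" with -1 ...
pos = s.find('z')
print(pos)               # -1

# ... which is also a valid subscript, so the slice quietly succeeds.
wrong = s[pos:]
print(wrong)             # d

# index raises instead, so the mistake cannot go unnoticed.
missing = False
try:
    s.index('z')
except ValueError:
    missing = True
print(missing)           # True
```

When find is used, comparing its result with -1 before slicing avoids the problem; when only presence matters, the in operator is enough.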

replace

replace works just as it sounds. It returns a copy of the string with all occurrences of the first parameter replaced with the second parameter.

>>> 'Hello, world'.replace('o', 'X')
'HellX, wXrld'

Or, using variable assignment:

string = 'Hello, world'
newString = string.replace('o', 'X')
print(string)
print(newString)

Output:

Hello, world
HellX, wXrld

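Notice that the original variable (string) remains unchanged after the call to replace.

expandtabs

Replaces tabs with the appropriate number of spaces (the default tab size is 8 columns; this can be changed by passing the tab size as an argument).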
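
replace also takes an optional third argument that limits the number of replacements. A short sketch, with variable names chosen for illustration:

```python
s = 'Hello, world'

# Replace only the first occurrence.
first_only = s.replace('o', 'X', 1)
print(first_only)        # HellX, world

# Replacing with the empty string deletes the substring.
removed = s.replace('l', '')
print(removed)           # Heo, word

# Each call returns a new string, so calls can be chained.
chained = s.replace('Hello', 'Goodbye').replace('world', 'moon')
print(chained)           # Goodbye, moon
```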

s = 'abcdefg\tabc\ta'
print(s)
print(len(s))
t = s.expandtabs()
print(t)
print(len(t))

Output:

abcdefg abc     a
13
abcdefg abc     a
17

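Notice how, although the two strings look the same, the second string (t) has a different length because each tab is represented by spaces rather than a tab character.

To use a tab size of 4 instead of 8: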

v = s.expandtabs(4)
print(v)
print(len(v))

Output:

abcdefg abc a
13

Please note that a tab is not always counted as eight spaces. Rather, a tab "pushes" the count to the next multiple of eight. For example:

s = '\t\t'
print(s.expandtabs().replace(' ', '*'))
print(len(s.expandtabs()))

Output:

****************
16

s = 'abc\tabc\tabc'
print(s.expandtabs().replace(' ', '*'))
print(len(s.expandtabs()))

Output:

abc*****abc*****abc
19
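
The "push to the next multiple" rule can be written out by hand. The function expand_tabs below is a hypothetical re-implementation for illustration only; it assumes single-line text and a positive tab size, and is compared against the built-in method.

```python
def expand_tabs(text, tabsize=8):
    """Expand tabs in single-line text to the next multiple of tabsize."""
    out = []
    column = 0
    for ch in text:
        if ch == '\t':
            # Number of spaces needed to reach the next tab stop.
            pad = tabsize - (column % tabsize)
            out.append(' ' * pad)
            column += pad
        else:
            out.append(ch)
            column += 1
    return ''.join(out)

for sample in ['abcdefg\tabc\ta', '\t\t', 'abc\tabc\tabc']:
    print(expand_tabs(sample) == sample.expandtabs(),
          expand_tabs(sample, 4) == sample.expandtabs(4))
```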

split, splitlines

The split method returns a list of the words in the string. It can take a separator argument to use instead of whitespace.

>>> s = 'Hello, world'
>>> s.split()
['Hello,', 'world']
>>> s.split('l')
['He', '', 'o, wor', 'd']

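Note that in neither case is the separator included in the split strings, but empty strings are allowed.

The splitlines method breaks a multiline string into many single-line strings. It is analogous to split('\n') (but accepts '\r' and '\r\n' as delimiters as well), except that if the string ends in a newline character, splitlines ignores that final character (see example).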

>>> s = """
... One line
... Two lines
... Red lines
... Blue lines
... Green lines
... """
>>> s.split('\n')
['', 'One line', 'Two lines', 'Red lines', 'Blue lines', 'Green lines', '']
>>> s.splitlines()
['', 'One line', 'Two lines', 'Red lines', 'Blue lines', 'Green lines']

The split method also accepts multi-character string literals:

txt = 'May the force be with you'
spl = txt.split('the')
print(spl)
# ['May ', ' force be with you']
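
Two details of split are worth a quick sketch: calling it with no argument is not the same as passing a single space, and an optional second argument limits the number of splits. The sample strings are made up for the example.

```python
text = 'a  b   c'

# With no argument, runs of whitespace count as one separator.
words = text.split()
print(words)             # ['a', 'b', 'c']

# With an explicit ' ', every single space separates, leaving empty strings.
pieces = text.split(' ')
print(pieces)            # ['a', '', 'b', '', '', 'c']

# The second argument is the maximum number of splits to perform.
key, value = 'key=value=more'.split('=', 1)
print(key, value)        # key value=more

# split followed by join normalizes the spacing.
normalized = ' '.join(text.split())
print(normalized)        # a b c
```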

Unicode

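In Python 3.x, all strings (the type str) contain Unicode by default.

In Python 2.x, there is a dedicated unicode type in addition to the str type: u = u"Hello"; type(u) is unicode.

The topic name for Unicode in the built-in help is UNICODE.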

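Examples for Python 3.x:

v = "Hello Günther"
- Uses a Unicode character directly in the source code; the source file must be saved in UTF-8 encoding.
v = "Hello G\xfcnther"
- Uses \xfc to specify a Unicode code point with 8 bits.
v = "Hello G\u00fcnther"
- Uses \u00fc to specify a Unicode code point with 16 bits.
v = "Hello G\U000000fcnther"
- Uses \U000000fc to specify a Unicode code point with 32 bits, with the U capitalized.
v = "Hello G\N{LATIN SMALL LETTER U WITH DIAERESIS}nther"
- Uses \N followed by the Unicode character name to specify a Unicode code point.
v = "Hello G\N{latin small letter u with diaeresis}nther"
- The character name can be in lowercase.
n = unicodedata.name(chr(252))
- Gets the Unicode name of a given Unicode character, here ü. Requires import unicodedata.
v = "Hello G" + chr(252) + "nther"
- chr() takes a Unicode code point and returns a string containing that one Unicode character.
c = ord("ü")
- Yields the code point number.
b = "Hello Günther".encode("UTF-8")
- Creates a byte sequence (bytes) from a Unicode string.
b = "Hello Günther".encode("UTF-8"); u = b.decode("UTF-8")
- Decodes bytes into a Unicode string via the decode() method.
v = b"Hello " + "G\u00fcnther"
- Raises a TypeError, since bytes and str cannot be concatenated; the exact message varies between versions.
v = b"Hello".decode("ASCII") + "G\u00fcnther"
- Now it works.
f = open("File.txt", encoding="UTF-8"); lines = f.readlines(); f.close()
- Opens a file for reading with a specific encoding and reads from it. If no encoding is specified, the encoding given by locale.getpreferredencoding() is used.
f = open("File.txt", "w", encoding="UTF-8"); f.write("Hello G\u00fcnther"); f.close()
- Writes to a file in the specified encoding.
f = open("File.txt", encoding="UTF-8-sig"); lines = f.readlines(); f.close()
- The -sig encoding means that any leading byte order mark (BOM) is automatically stripped.
f = tokenize.open("File.txt"); lines = f.readlines(); f.close()
- Automatically detects the encoding from an encoding marker present in the file, such as a BOM, and strips a BOM if there is one. Requires import tokenize.
f = open("File.txt", "w", encoding="UTF-8-sig"); f.write("Hello G\u00fcnther"); f.close()
- Writes to a file in UTF-8 encoding, writing a BOM at the beginning.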
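
Several of the Python 3 one-liners above can be combined into one runnable sketch. It uses escape sequences rather than a literal ü so that it does not depend on the encoding of the source file; the variable names are chosen for the example.

```python
import unicodedata

text = "Hello G\u00fcnther"

# Encoding turns the str into bytes; ü needs two bytes in UTF-8.
data = text.encode("UTF-8")
print(len(text), len(data))        # 13 14

# Decoding with the same encoding restores the original string.
round_trip = data.decode("UTF-8")
print(round_trip == text)          # True

# Code point number and character name of ü.
code_point = ord("\u00fc")
name = unicodedata.name(chr(code_point))
print(code_point, name)            # 252 LATIN SMALL LETTER U WITH DIAERESIS

# ASCII cannot represent ü, so encoding fails unless told how to cope.
ascii_failed = False
try:
    text.encode("ASCII")
except UnicodeEncodeError:
    ascii_failed = True
replaced = text.encode("ASCII", errors="replace")
print(ascii_failed, replaced)      # True b'Hello G?nther'
```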

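Examples for Python 2.x:

v = u"Hello G\u00fcnther"
- Uses \u00fc to specify a Unicode code point with 16 bits.
v = u"Hello G\U000000fcnther"
- Uses \U000000fc to specify a Unicode code point with 32 bits, with the U capitalized.
v = u"Hello G\N{LATIN SMALL LETTER U WITH DIAERESIS}nther"
- Uses \N followed by the Unicode character name to specify a Unicode code point.
v = u"Hello G\N{latin small letter u with diaeresis}nther"
- The character name can be in lowercase.
unicodedata.name(unichr(252))
- Gets the Unicode name of a given Unicode character, here ü.
v = "Hello G" + unichr(252) + "nther"
- unichr() takes a Unicode code point and returns a Unicode string containing that one character.
c = ord(u"ü")
- Yields the code point number.
b = u"Hello Günther".encode("UTF-8")
- Creates a byte sequence (str) from a Unicode string. type(b) is str.
b = u"Hello Günther".encode("UTF-8"); u = b.decode("UTF-8")
- Decodes bytes (type str) into a Unicode string via the decode() method.
v = "Hello" + u"Hello G\u00fcnther"
- Concatenates a str (bytes) and a Unicode string without an error, as long as the str contains only ASCII.
f = codecs.open("File.txt", encoding="UTF-8"); lines = f.readlines(); f.close()
- Opens a file for reading with a specific encoding and reads from it. If no encoding is specified, codecs.open returns an ordinary file object and no decoding is done.
f = codecs.open("File.txt", "w", encoding="UTF-8"); f.write(u"Hello G\u00fcnther"); f.close()
- Writes to a file in the specified encoding.
- Unlike the Python 3 variant, if told to write a newline via \n, it does not write the operating-system-specific newline but a literal \n; this makes a difference on Windows, for example.
- To ensure operation like that of text mode, one can write os.linesep.
f = codecs.open("File.txt", encoding="UTF-8-sig"); lines = f.readlines(); f.close()
- The -sig encoding means that any leading byte order mark (BOM) is automatically stripped.

Links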

Unicode HOWTO for Python 3, docs.python.org
Unicode HOWTO for Python 2, docs.python.org
Processing Text Files in Python 3, curiousefficiency.org
PEP 263 – Defining Python Source Code Encodings, python.org
unicodedata — Unicode Database in Python Library Reference, docs.python.org
Get a list of all the encodings Python can encode to, stackoverflow.com

External links

"String Methods" chapter -- python.org
Python "string" module documentation -- python.org